waLBerla · Merge requests · !565

Missing synchronization for time measurements with CUDA

Open · Richard Angersbach requested to merge zy69guqi/walberla:master into master · Sep 06, 2022
Overview 5 · Commits 67 · Pipelines 0 · Changes 50

Currently, waLBerla's timer does not provide accurate timings for CUDA kernels. It only measures the time for launching the CUDA kernel but does not wait until the kernel execution is finished. This MR employs explicit device synchronization for time measurements of CUDA kernels.

Changes in this MR:

  1. Added CUDA API wrapper functions, similar to the MPI wrappers: a call such as cudaMemcpy now compiles even without WALBERLA_BUILD_WITH_CUDA=ON, but the wrapper function cudaMemcpy gives a runtime error unless it is guarded appropriately with WALBERLA_DEVICE_SECTION, e.g.
WALBERLA_DEVICE_SECTION() {
    cudaMalloc(...);
    cudaMemcpy(...);
}

as an alternative to the previous

#if defined(WALBERLA_BUILD_WITH_CUDA)
    cudaMalloc(...);
    cudaMemcpy(...);
#endif
  2. Synchronize timers with cudaDeviceSynchronize when using CUDA. A specialized DeviceSynchronizePolicy was introduced for this purpose. Note that this policy is only meant for measuring individual kernels and can degrade the performance of overlapping kernels.
  3. (Sweep)Timeloops had functions that were only implemented for WcPolicy timers. The timing policy is now a template argument, and class names such as DeviceSynchronizeSweepTimeloop and DeviceSynchronizeTimeloop are exported via "using" directives.

CUDA applications can now use timing pools as in the following:

DeviceSynchronizeTimingPool timeloopTiming;
DeviceSynchronizeSweepTimeloop timeloop(blocks, nrOfSteps);

and still make use of the single-stepping scheme:

for (int t = 0; t < nrOfSteps; ++t) {
    // perform a single simulation step
    timeloop.singleStep(timeloopTiming);

    // sub-cycles or other routines that do not belong in the timeloop
    // ...
}
  4. Integrated the new CUDA wrappers in the src/cuda module, e.g. for cudaMemcpy:
inline cudaError_t cudaMemcpy(void* /*dst*/, const void* /*src*/, size_t /*count*/, cudaMemcpyKind /*kind*/) { WALBERLA_DEVICE_FUNCTION_ERROR }
  5. There are currently no NVTX wrappers implemented. The cuda/NVTX.h header is now appropriately guarded with a WALBERLA_BUILD_WITH_CUDA preprocessor check.
  6. Slightly extended the existing MPI wrapper functions.
  7. Adapted existing apps (review required from maintainers of apps). Edit: this was reverted, as it should be done in a separate MR.
  8. The CMake infrastructure is slightly adapted: the cuda module is now always built, so in most cases a construct like the following is no longer necessary
if (WALBERLA_BUILD_WITH_CUDA)
    waLBerla_add_executable( NAME xyz FILE abc.cpp
            DEPENDS ... cuda )
else ()
    waLBerla_add_executable( NAME xyz FILE abc.cpp
            DEPENDS ...)
endif (WALBERLA_BUILD_WITH_CUDA)

If your app requires the cuda module, you can simply use

waLBerla_add_executable( NAME xyz FILE abc.cpp
            DEPENDS ... cuda )
Edited May 04, 2023 by Richard Angersbach
Source branch: master