
Missing synchronization for time measurements with CUDA

Richard Angersbach requested to merge zy69guqi/walberla:master into master

Currently, waLBerla's timer does not provide accurate timings for CUDA kernels. It only measures the time for launching the CUDA kernel but does not wait until the kernel execution is finished. This MR employs explicit device synchronization for time measurements of CUDA kernels.

Changes in this MR:

  1. Added CUDA API wrapper functions similar to the MPI wrappers: code calling, e.g., cudaMemcpy still compiles without WALBERLA_BUILD_WITH_CUDA=ON, but the wrapper function cudaMemcpy will now give a runtime error if the call is not guarded appropriately with WALBERLA_DEVICE_SECTION, which serves as an alternative to the previous compile-time guards.
  2. Synchronize timers with cudaDeviceSynchronize when using CUDA. A specialized DeviceSynchronizePolicy was introduced for this purpose. Note that this policy is only meant for measuring individual kernels and can degrade the performance of overlapping kernels.
  3. (Sweep)Timeloops had functions that were only implemented for WcPolicy timers. The timing policy is now a template argument, and class names such as DeviceSynchronizeSweepTimeloop and DeviceSynchronizeTimeloop are exported via "using" directives.

CUDA applications can now use timing pools as in the following:

```cpp
DeviceSynchronizeTimingPool timeloopTiming;
DeviceSynchronizeSweepTimeloop timeloop(blocks, nrOfSteps);
```

and still make use of the single-stepping scheme:

```cpp
for (int t = 0; t < nrOfSteps; ++t) {
    // perform a single simulation step
    timeloop.singleStep(timeloopTiming);

    // sub-cycles or other routines that do not belong in the timeloop
    // ...
}
```
  4. Integrated the new CUDA wrappers in the src/cuda module, e.g. for cudaMemcpy:

```cpp
inline cudaError_t cudaMemcpy(void* /*dst*/, const void* /*src*/, size_t /*count*/, cudaMemcpyKind /*kind*/) { WALBERLA_DEVICE_FUNCTION_ERROR }
```
  5. There are currently no NVTX wrappers implemented. The cuda/NVTX.h header is now appropriately guarded with WALBERLA_BUILD_WITH_CUDA.
  6. Slightly extended the existing MPI wrapper functions.
  7. Adapted existing apps (review required from maintainers of apps). Edit: this was reverted, as it should be done in a separate MR.
  8. The CMake infrastructure is slightly adapted. The cuda module is always built, so in most cases a case distinction like the following is no longer necessary:

```cmake
if (WALBERLA_BUILD_WITH_CUDA)
    waLBerla_add_executable( NAME xyz FILE abc.cpp
            DEPENDS ... cuda )
else ()
    waLBerla_add_executable( NAME xyz FILE abc.cpp
            DEPENDS ...)
endif ()
```

If your app requires the cuda module, you can simply use:

```cmake
waLBerla_add_executable( NAME xyz FILE abc.cpp
        DEPENDS ... gpu )
```
