Currently, waLBerla's timers do not provide accurate timings for CUDA kernels: they only measure the time for launching a kernel and do not wait until its execution has finished. This MR employs explicit device synchronization for time measurements of CUDA kernels.
Changes in this MR:
- Added CUDA API wrapper functions similar to the MPI wrappers, i.e. when calling `cudaMemcpy` without `WALBERLA_BUILD_WITH_CUDA=ON`, the code will still compile, but the wrapper function `cudaMemcpy` will now give a runtime error if it is not guarded appropriately with `WALBERLA_DEVICE_SECTION`, e.g.
```cpp
WALBERLA_DEVICE_SECTION() {
   cudaMalloc(...);
   cudaMemcpy(...);
}
```
as an alternative to the previous
```cpp
#if defined(WALBERLA_BUILD_WITH_CUDA)
cudaMalloc(...);
cudaMemcpy(...);
#endif
```
- Synchronize timers with `cudaDeviceSynchronize` when using CUDA. A specialized `DeviceSynchronizePolicy` was introduced for this purpose. Note that this policy is only meant for measuring individual kernels and can degrade the performance of overlapping kernels.
- (Sweep)Timeloops had functions that were only implemented for `WcPolicy` timers. The timing policy was therefore added as a template argument, and the `DeviceSynchronizeSweepTimeloop`, `DeviceSynchronizeTimeloop`, ... class names were exported via `using` directives. CUDA applications can now use timing pools as in the following:
```cpp
DeviceSynchronizeTimingPool timeloopTiming;
DeviceSynchronizeSweepTimeloop timeloop(blocks, nrOfSteps);
```
and still make use of the single-stepping scheme:
```cpp
for (int t = 0; t < nrOfSteps; ++t) {
   // perform a single simulation step
   timeloop.singleStep(timeloopTiming);
   // sub-cycles or other routines that do not belong in the timeloop
   // ...
}
```
- Integrated the new CUDA wrappers in the `src/cuda` module, e.g. for `cudaMemcpy`:
```cpp
inline cudaError_t cudaMemcpy(void* /*dst*/, const void* /*src*/, size_t /*count*/, cudaMemcpyKind /*kind*/) { WALBERLA_DEVICE_FUNCTION_ERROR }
```
- There are currently no NVTX wrappers implemented. The `cuda/NVTX.h` header is now appropriately guarded with a `WALBERLA_BUILD_WITH_CUDA` preprocessor guard.
- Slightly extended the existing MPI wrapper functions.
- Adapted existing apps (review required from maintainers of apps). Edit: this was reverted as it should be done in a separate MR.
- Slightly adapted the CMake infrastructure. The `cuda` module is always built, so in most cases constructs like the following are no longer necessary:
```cmake
if (WALBERLA_BUILD_WITH_CUDA)
   waLBerla_add_executable( NAME xyz FILE abc.cpp
                            DEPENDS ... cuda )
else ()
   waLBerla_add_executable( NAME xyz FILE abc.cpp
                            DEPENDS ... )
endif (WALBERLA_BUILD_WITH_CUDA)
```
If your app requires the `cuda` module, you can simply use
```cmake
waLBerla_add_executable( NAME xyz FILE abc.cpp
                         DEPENDS ... gpu )
```