Currently, waLBerla's timer does not provide accurate timings for CUDA kernels. Because kernel launches are asynchronous, the timer measures only the time needed to launch the kernel and does not wait for the kernel to finish executing. This MR adds explicit device synchronization to the time measurement of CUDA kernels.
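
The effect described above can be reproduced with a small host-only sketch. This is not waLBerla code and it does not use CUDA: `launchFakeKernel`, `timeLaunchOnlyMs`, and `timeWithSyncMs` are hypothetical names, and `std::async` merely stands in for an asynchronous kernel launch. In real CUDA code the blocking wait would be a call such as `cudaDeviceSynchronize()` placed before the timer is stopped.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for an asynchronous kernel launch: the call returns
// right away while the "kernel" (here just a sleep) runs on another thread.
std::future<void> launchFakeKernel(std::chrono::milliseconds workTime) {
    return std::async(std::launch::async, [workTime] {
        std::this_thread::sleep_for(workTime);
    });
}

// Stops the clock right after the launch returns, so the result covers the
// launch overhead only. This mirrors the inaccurate measurement.
double timeLaunchOnlyMs(std::chrono::milliseconds workTime) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    std::future<void> pending = launchFakeKernel(workTime);
    const auto stop = Clock::now();
    pending.wait();  // finish the work, but outside the timed region
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Waits for the work to finish before stopping the clock. The wait plays the
// role that cudaDeviceSynchronize() would play for a real CUDA kernel.
double timeWithSyncMs(std::chrono::milliseconds workTime) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    std::future<void> pending = launchFakeKernel(workTime);
    pending.wait();  // explicit synchronization inside the timed region
    const auto stop = Clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling both functions with the same simulated workload, for example `std::chrono::milliseconds(200)`, should yield a launch-only time far below the workload duration and a synchronized time that covers it, which is the discrepancy this MR addresses for CUDA kernels.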