Investigate usage of CUDA graphs to reduce kernel call overhead
Marco had the idea to use this in the CUDA backend for Petalisp. But maybe it's also useful in waLBerla timeloops.
https://developer.nvidia.com/blog/cuda-graphs/
CUDA graphs reduce kernel launch overhead when the same group of kernels is launched repeatedly. This may not be very relevant for us, since the bottleneck should be communication rather than managing the CUDA kernels, and our kernels are not exactly "short-running".
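
For reference, a minimal sketch of what this could look like using stream capture. It is not taken from waLBerla or Petalisp: the kernels `scale` and `add_one`, the helper `run_timeloop_with_graph`, and the problem size are hypothetical stand-ins for one time step. It assumes CUDA 11.4 or newer, since it uses `cudaGraphInstantiateWithFlags`.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-ins for the kernels of one time step.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Captures one time step (two kernels) into a graph, replays it `steps`
// times, and returns the first element of the result.
float run_timeloop_with_graph(int steps) {
    const int n = 1 << 20;  // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_x(n, 1.0f);
    float* d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // Record the kernel sequence once; nothing executes during capture.
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    scale<<<blocks, threads, 0, stream>>>(d_x, 0.5f, n);
    add_one<<<blocks, threads, 0, stream>>>(d_x, n);
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiate once, then launch the whole group with a single call per step.
    cudaGraphExec_t graph_exec;
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&graph_exec, graph, 0));
    for (int s = 0; s < steps; ++s) {
        CUDA_CHECK(cudaGraphLaunch(graph_exec, stream));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(graph_exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(d_x));
    return first;
}
```

To judge whether this is worth pursuing, one could time this loop against the same loop with plain kernel launches. Any benefit should shrink as the per-kernel runtime grows, which fits the concern above.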