Questions on waLBerla Conjugate Gradient Solver

I am looking at running a Conjugate Gradient (CG) solver on a single machine with 8 NVIDIA GPUs using waLBerla.

According to the documentation, waLBerla provides a ready-to-use CG solver, about which I have a few questions:

The source code seems to support OpenMP only. Is there a CUDA version? If so, how can I run it on a multi-GPU system?
If there is no native CUDA version, can I access the CG interface through pystencils and generate a CUDA version from it?
Implementing a CUDA version of the CG kernels directly in waLBerla (without using pystencils) should be relatively straightforward, as they are standard sweeps. The more challenging parts are the dot product and the norm, since both are reductions. Would you have any pointers on implementing these reduction operations in waLBerla on CUDA?
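
To make the last question concrete, here is a minimal sketch of the kind of reduction I have in mind. It is plain CUDA and uses no waLBerla API at all: the names dotPartial, dotOnDevice and normOnDevice are hypothetical, the data lives in flat arrays rather than waLBerla fields, and error checking is omitted. Each block reduces its share of the products in shared memory, and the per-block partial sums are then added on the host. On the 8-GPU machine I assume each process would additionally combine its local result with the others (for example with MPI_Allreduce), which is not shown here.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256;  // power of two, required by the tree reduction below

// Stage 1: every block reduces its share of a[i] * b[i] to one partial sum.
__global__ void dotPartial(const double* a, const double* b, double* partial, size_t n)
{
    __shared__ double cache[BLOCK_SIZE];

    // Grid-stride loop, so a fixed grid size handles any n.
    double sum = 0.0;
    for (size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x; i < n;
         i += static_cast<size_t>(blockDim.x) * gridDim.x)
        sum += a[i] * b[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: copies the data to the device, launches the kernel,
// and performs stage 2 (summing the per-block results) on the host.
double dotOnDevice(const std::vector<double>& a, const std::vector<double>& b)
{
    const size_t n = std::min(a.size(), b.size());
    if (n == 0) return 0.0;

    const int blocks = static_cast<int>(
        std::min<size_t>((n + BLOCK_SIZE - 1) / BLOCK_SIZE, 1024));

    double *dA = nullptr, *dB = nullptr, *dPartial = nullptr;
    cudaMalloc(&dA, n * sizeof(double));
    cudaMalloc(&dB, n * sizeof(double));
    cudaMalloc(&dPartial, blocks * sizeof(double));
    cudaMemcpy(dA, a.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    dotPartial<<<blocks, BLOCK_SIZE>>>(dA, dB, dPartial, n);

    // The blocking copy also waits for the kernel to finish.
    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dPartial);

    double result = 0.0;
    for (double p : partial) result += p;
    return result;
}

// The Euclidean norm is the square root of the vector's dot product with itself.
double normOnDevice(const std::vector<double>& v)
{
    return std::sqrt(dotOnDevice(v, v));
}

int main()
{
    const std::vector<double> a{1.0, 2.0, 3.0};
    const std::vector<double> b{4.0, 5.0, 6.0};
    std::printf("dot  = %f\n", dotOnDevice(a, b));
    std::printf("norm = %f\n", normOnDevice(a));
    return 0;
}
```

What I do not know is how this would map onto waLBerla's block and field structure, in particular how to skip ghost layers in the sum and where the cross-GPU reduction should happen.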

Thank you very much. Max