CUDA indexing: clip to maximum CUDA block size

- the previous method did not work with kernels generated for waLBerla,
  where the block size is changed at runtime
- a device query does not always work, since the system that compiles
  the code may have no GPU, or a different GPU than the one the kernel
  runs on
-> the maximum block size is now passed as a parameter and is only
   optionally determined by a device query
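A minimal sketch of the clipping described above, in Python. The function `clip_block_size` and its parameters are hypothetical names chosen for illustration, not the project's actual API. The caller passes the maximum number of threads per block explicitly; only if it is omitted is the optional `query_device` callable used (in practice this might wrap something like pycuda's `Device.get_attribute(device_attribute.MAX_THREADS_PER_BLOCK)`, which is an assumption here). Halving the largest dimension is likewise just one plausible clipping strategy.

```python
from math import prod
from typing import Callable, Optional, Sequence, Tuple


def clip_block_size(block_size: Sequence[int],
                    max_threads_per_block: Optional[int] = None,
                    query_device: Optional[Callable[[], int]] = None) -> Tuple[int, ...]:
    """Shrink block_size until its total thread count fits the limit.

    Hypothetical helper: the limit comes from max_threads_per_block if
    given; otherwise the optional query_device callable is asked for it.
    """
    if max_threads_per_block is None:
        if query_device is None:
            raise ValueError("pass max_threads_per_block or a device query")
        max_threads_per_block = query_device()
    if max_threads_per_block < 1:
        raise ValueError("max_threads_per_block must be at least 1")

    dims = list(block_size)
    # Repeatedly halve the largest dimension; the product strictly
    # decreases while any dimension is > 1, so the loop terminates.
    while prod(dims) > max_threads_per_block:
        i = dims.index(max(dims))
        dims[i] = max(1, dims[i] // 2)
    return tuple(dims)
```

For example, `clip_block_size((1024, 4, 1), 1024)` yields `(256, 4, 1)`, with no GPU needed on the compiling machine, while `clip_block_size((32, 32, 32), query_device=lambda: 1024)` falls back to the (here stubbed) device query and yields `(8, 8, 16)`.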