CUDA indexing: clip to maximum CUDA block size

- the previous method did not work for kernels generated for waLBerla,
  where the block size is changed at runtime
- a device query does not always work, since the compile system may
  have no GPU, or a different GPU than the one the kernel runs on
-> the maximum block size is now passed as a parameter and is only
   optionally determined by a device query
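
A minimal sketch of the idea, not the code from this commit: the function
name, the parameter names, the default limits and the halving strategy are
all assumptions made for illustration. The limits are plain parameters, so
the clipping works on a machine without a GPU. A device query is only used
when the caller supplies one.

```python
from math import prod

# Assumed fallback limits (typical values for current CUDA GPUs);
# they are used only when no device query is supplied.
DEFAULT_MAX_BLOCK_DIM = (1024, 1024, 64)
DEFAULT_MAX_THREADS = 1024


def clip_block_size(block_size, max_block_dim=DEFAULT_MAX_BLOCK_DIM,
                    max_threads=DEFAULT_MAX_THREADS, device_query=None):
    """Return block_size clipped to the given limits (hypothetical helper).

    device_query is an optional callable returning
    (max_block_dim, max_threads); if given, it overrides the parameters.
    """
    if device_query is not None:
        max_block_dim, max_threads = device_query()
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")

    # Clip every dimension to its per-dimension limit.
    clipped = [min(b, m) for b, m in zip(block_size, max_block_dim)]

    # Halve the largest dimension until the total thread count fits.
    while prod(clipped) > max_threads:
        largest = clipped.index(max(clipped))
        clipped[largest] = max(1, clipped[largest] // 2)
    return tuple(clipped)


if __name__ == "__main__":
    # No GPU needed: the limits come from the parameters.
    print(clip_block_size((256, 8, 1)))
    # Optional device query, here faked by a lambda.
    print(clip_block_size((1024, 1, 1),
                          device_query=lambda: ((512, 512, 64), 512)))
```

Because the function depends only on its arguments, a framework that
changes the block size at runtime could call it again with the new size.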