Optimize traversal of sparse fields with "SIMD bitmasks"
Currently, an LBM main sweep visits each cell and performs the necessary calculation steps.
The calculation steps and the associated memory accesses could be skipped (without breaking SIMD execution) if all cells processed by a SIMD group are inactive during the sweep. At the moment, one can use the 8-bit-per-cell flag field to conditionally skip the current cell. This does not work under SIMD execution, since one would have to check all the flag field values individually (or, at best, skip only after a SIMD all-false vote).
Suggestion: introduce a bitmask field indicating whether a part of the sweep can be skipped. I would use one bit to indicate "none of the cells that would be accessed by a SIMD group is active". The bitmask can be pre-computed beforehand. Example: a 32x64 field on a GPU. The field is processed by CUDA warps of 32 threads, which access cache lines of 32 coalesced 32-bit words (32x64-bit can also be handled nowadays).
- 00000000001000001000000000000000 00000000000000000000000000000000 -> 2x uint32_t per row; a nonzero word marks a cache line that requires processing.
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000010000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000001000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
- 00000000000000000000000000000000 00000000000000000000000000000000
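A minimal host-side sketch of the pre-computation step, assuming a row-major flag field, 32 cells per SIMD group / cache line, and a hypothetical `ACTIVE` flag value:

```cpp
#include <cstdint>
#include <vector>

// Assumed flag value marking an active cell (hypothetical; the real flag
// field may encode several states in its 8 bits).
constexpr std::uint8_t ACTIVE = 1;

// Pack the 8-bit-per-cell flag field into a bitmask with one bit per cell.
// A SIMD group covering 32 consecutive cells can then test a whole cache
// line with a single "word != 0" check instead of 32 flag loads.
std::vector<std::uint32_t> buildBitmask(const std::vector<std::uint8_t>& flags)
{
    std::vector<std::uint32_t> mask((flags.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < flags.size(); ++i)
        if (flags[i] & ACTIVE)
            mask[i / 32] |= 1u << (i % 32); // bit i%32 marks cell i of its group
    return mask;
}
```

The mask only needs to be rebuilt when the set of active cells changes, not every sweep.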
Then, a SIMD group / CUDA block can traverse the field with a reduced number of memory accesses. To find the next cache line to be processed, a block can use intrinsics like __clz (count leading zeros, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html) to determine where to start or which cache line to process next.
If one is willing to also count neighbors of fluid cells as active, one could remove completely inactive cache lines from the allocated memory altogether. One then has to use population-count intrinsics (__popc in CUDA) to compute the memory addresses of neighbors. A separate list data structure is not needed.
This suggestion is similar to @godenschwager's list LBM. However, it does not necessarily require another data structure and is probably better suited for GPU execution (no pointers and the like). Also, the bitmasks are probably small enough to reside in fast memory such as __shared__ memory. The same concept is applicable to SIMD on CPUs as well, since modern CPUs also provide bit-counting intrinsics.