Single-precision codegen leads to kernels with mixed types
I recently experimented with single-precision code generation to benchmark performance improvements over the default double-precision kernels. waLBerla is compiled with `WALBERLA_DOUBLE_ACCURACY=ON` and linked against a library that uses the double- and single-precision kernels via type traits and SFINAE from a templated class `template <typename FloatType> class LBWalberla` that is specialized for `float` and `double`.
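For reference, here is a minimal sketch of that dispatch pattern; the `generated::collide` kernels are made up for illustration, and only `LBWalberla` and its `float`/`double` instantiations come from my actual setup:

```cpp
#include <type_traits>
#include <vector>

// Hypothetical generated kernels; in practice these live in the
// generated double- and single-precision headers.
namespace generated {
inline void collide(std::vector<double> &pdfs, double omega) { /* ... */ }
inline void collide(std::vector<float> &pdfs, float omega) { /* ... */ }
} // namespace generated

// SFINAE guard: only float and double instantiations are allowed.
template <typename FloatType,
          typename = std::enable_if_t<std::is_same_v<FloatType, float> ||
                                      std::is_same_v<FloatType, double>>>
class LBWalberla {
public:
  void collide(std::vector<FloatType> &pdfs, FloatType omega) {
    // overload resolution picks the kernel that matches FloatType
    generated::collide(pdfs, omega);
  }
};

template class LBWalberla<float>;  // single-precision build
template class LBWalberla<double>; // double-precision build
```

This only works cleanly if the two generated headers declare kernels with distinct signatures, which is exactly what the mixed-type kernels described below break.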
For the double- and single-precision headers to coexist in the same translation unit, their signatures need to differ. Unfortunately, a few kernels have mixed types, which makes SFINAE trickier, generates compiler warnings, and makes benchmarking against double precision unreliable (the single-precision kernels are not really single-precision). There are also a few kernels that use `real_t`, which ties them to a specific build of waLBerla. Finally, there are a couple of places in the kernels where quality-of-life improvements can be made.
Some of the issues have their roots in pystencils or lbmpy rather than waLBerla itself. I am gathering my feedback here because it's not always clear to me where the issues originate, but I wouldn't mind opening tickets on the pycodegen projects once we've identified where the changes belong. A few of the issues probably cannot be addressed for technical reasons. Others might be low-hanging fruit; perhaps I could contribute a patch with your guidance.
Here are my observations with single-precision code generation:
1. Regarding the dynamic UBB framework (see code sample at the end of this post):
   1. The `Dynamic_UBB` class uses `real_t` instead of a placeholder for `float`/`double`; right now I use text substitution to replace `real_t` by the desired type
   2. The `boundary_Dynamic_UBB()` kernel uses `const double weights[]` instead of a placeholder for `float`/`double`
   3. The velocities in `IndexInfo` are currently hardcoded as `double`, and the `boundary_Dynamic_UBB()` kernel is currently accessing vectors with hardcoded `double` alignment (e.g. `*((float *)(&_data_indexVector[40 * ctr_0 + 24]))` instead of `*((double *)(&_data_indexVector[40 * ctr_0 + 20]))`)
   4. `IndexVectors::operator==()` should have signature `bool operator==(IndexVectors const &other) const`, just like `IndexInfo` does; otherwise two `IndexVectors` objects passed as const references cannot be compared
   5. `IndexVectors::pointerCpu()` should return `cpuVectors_[t].data()` to avoid dereferencing a possibly non-existent element
2. Regarding the macroscopic values accessor framework (`python/lbmpy_walberla/templates/LatticeModel.tmpl.h`):
   1. The includes are missing `#include "core/math/Matrix3.h"`
   2. The struct methods use `real_t` instead of a placeholder for `float`/`double`; since I'm working on a modified version of this template, I simply replaced `real_t` with `{{real_t}}`
   3. Several generators like `density_velocity_setter_macroscopic_values` output code that mixes `float` types with `double` literals, e.g. `const float u_0 = -0.5*force_field.get(x,y,z,0)/rho_in + u[0];` (should use `0.5f`)
3. Regarding generated MRT collision kernels:
   1. There is a 50/50 mix of `double` and `float` in the kernels
   2. I use function `rr_getter()` from `lbmpy_tests/test_fluctuating_lb.py` to declare relaxation rates as e.g. `sp.Symbol("omega_bulk")`, which by default becomes `double`; is there a way to use a `float`-typed symbol? (see the sketch after this list)
   3. If the kernel is changed from `double` to `float`, should `philox_double2()` be replaced by `philox_float4()`?
4. Compiler warning pragmas like `#pragma GCC diagnostic ignored "-Wconversion"` are present, but they don't suppress `-Wfloat-conversion` in projects that explicitly add `-Wfloat-conversion` to the compiler flags; adding an extra `#pragma GCC diagnostic ignored "-Wfloat-conversion"` would fix the issue
   - Update: no longer needed; all mixed-type issues have been resolved by pystencils 0.4.3 and !498 (closed).
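On point 3.2: a possible approach, if I understand the pystencils API correctly, is to declare the relaxation rates with `pystencils.TypedSymbol` instead of `sp.Symbol`, and to request the single-precision RNG explicitly. A minimal, untested sketch, assuming `TypedSymbol` accepts a numpy dtype and that `PhiloxFourFloats` is the single-precision counterpart of `PhiloxTwoDoubles`:

```python
import numpy as np
from pystencils import TypedSymbol
from pystencils.rng import PhiloxFourFloats

# sp.Symbol("omega_bulk") defaults to double in the generated code;
# a TypedSymbol carries an explicit dtype instead.
omega_bulk = TypedSymbol("omega_bulk", np.float32)

# Four floats per Philox call instead of two doubles, which should map
# to philox_float4() instead of philox_double2() in the generated kernel.
rng = PhiloxFourFloats(dim=3)
```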
Point 3.1 cannot be fixed by text substitution. Points 1.2, 1.3, 1.4 and 2.3 can be fixed by text substitution, but that's brittle.
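To illustrate why the `{{real_t}}` placeholder from point 2.2 is more robust than post-hoc substitution: the type is resolved at generation time by the template engine (jinja2, which the `.tmpl.h` templates already go through), so there is no fragile find-and-replace on the generated header. A toy example with a made-up template string:

```python
import jinja2

# The placeholder is resolved once, when the header is generated:
template = jinja2.Template("inline {{real_t}} density(const {{real_t}} *pdfs);")
print(template.render(real_t="float"))   # inline float density(const float *pdfs);
print(template.render(real_t="double"))  # inline double density(const double *pdfs);
```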
Generated Dynamic_UBB code:

```cpp
/*
 * built with:
 * pystencils v0.3.4+4.g4fecf0c (backports the subexpression bugfix)
 * lbmpy v0.3.4+6.g2faceda (backports pressure tensor merge request)
 * walberla b17ca5caf00db7d19f86c5f85c6f67fec6c16aff
 * codegen script: https://github.com/espressomd/espresso/blob/0b0337e03/maintainer/walberla_kernels/generate_lb_kernels.py
 */
class Dynamic_UBB {
public:
  struct IndexInfo {
    int32_t x, y, z;
    int32_t dir;
    double vel_0, vel_1, vel_2; // <--- should be float
    IndexInfo(int32_t x_, int32_t y_, int32_t z_, int32_t dir_)
        : x(x_), y(y_), z(z_), dir(dir_), vel_0(), vel_1(), vel_2() {}
    bool operator==(const IndexInfo &o) const { /* ... */ }
  };

  class IndexVectors {
  public:
    using CpuIndexVector = std::vector<IndexInfo>;
    enum Type { ALL = 0, INNER = 1, OUTER = 2, NUM_TYPES = 3 };
    IndexVectors() = default;
    bool operator==(IndexVectors &other) { // <--- this should be const and take a const ref
      return other.cpuVectors_ == cpuVectors_;
    }
    CpuIndexVector &indexVector(Type t) { return cpuVectors_[t]; }
    IndexInfo *pointerCpu(Type t) {
      return &(cpuVectors_[t][0]); // <--- undefined behavior if container size is zero
    }
    void syncGPU() {}

  private:
    std::vector<CpuIndexVector> cpuVectors_{NUM_TYPES};
  };
  /* ... */
};
```
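For concreteness, here is a sketch of the fixes proposed in points 1.4 and 1.5, with the class reduced to the relevant members and `IndexInfo` stubbed out:

```cpp
#include <cstdint>
#include <vector>

struct IndexInfo {
  std::int32_t x, y, z, dir;
  bool operator==(IndexInfo const &o) const {
    return x == o.x && y == o.y && z == o.z && dir == o.dir;
  }
};

class IndexVectors {
public:
  using CpuIndexVector = std::vector<IndexInfo>;
  enum Type { ALL = 0, INNER = 1, OUTER = 2, NUM_TYPES = 3 };

  // const-qualified and taking a const reference, so two IndexVectors
  // passed as const& can be compared
  bool operator==(IndexVectors const &other) const {
    return other.cpuVectors_ == cpuVectors_;
  }

  // std::vector::data() is well-defined even when the vector is empty,
  // unlike &v[0]
  IndexInfo *pointerCpu(Type t) { return cpuVectors_[t].data(); }

private:
  std::vector<CpuIndexVector> cpuVectors_{NUM_TYPES};
};
```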
Update from Nov 24: I've now experimented with kernels generated by lbmpy/pystencils 0.4.3 with !498 (closed) merged into de6b0007, with and without AVX2. Here are additional observations:
1. `CodeGeneration` cannot be default-constructed (the root cause is illustrated after this list). Reproducible with commit de6b0007:

   ```python
   import sys
   print(sys.argv)
   from pystencils_walberla import CodeGeneration
   ctx = CodeGeneration()
   ```

   ```
   ['']
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/work/jgrad/walberla/python/pystencils_walberla/cmake_integration.py", line 22, in __init__
       expected_files, cmake_vars = parse_json_args()
     File "/work/jgrad/walberla/python/pystencils_walberla/cmake_integration.py", line 67, in parse_json_args
       if value.lower() in ("on", "1", "yes", "true"):
   AttributeError: 'bool' object has no attribute 'lower'
   ```

   A possible fix:

   ```diff
   diff --git a/python/pystencils_walberla/cmake_integration.py b/python/pystencils_walberla/cmake_integration.py
   index 3e5a4eba..14f25a17 100644
   --- a/python/pystencils_walberla/cmake_integration.py
   +++ b/python/pystencils_walberla/cmake_integration.py
   @@ -64,9 +64,9 @@ def parse_json_args():
        expected_files = parsed['EXPECTED_FILES']
        cmake_vars = {}
        for key, value in parsed['CMAKE_VARS'].items():
   -        if value.lower() in ("on", "1", "yes", "true"):
   +        if str(value).lower() in ("on", "1", "yes", "true"):
                value = True
   -        elif value.lower() in ("off", "0", "no", "false"):
   +        elif str(value).lower() in ("off", "0", "no", "false"):
                value = False
            cmake_vars[key] = value
        return expected_files, cmake_vars
   ```
2. Single-precision kernels use Philox to generate `double` values. Simply use `pystencils.rng.PhiloxFourFloats`.
3. Single-precision AVX2 kernels have mixed SIMD types; I tried manually patching the code (espressomd/espresso@82c007e0), but the kernels segfault, so my patch must be incorrect. Simply use `pystencils.rng.PhiloxFourFloats`.
   1. Philox is used to generate `__m256d` values but `__m256` values are needed, aka `__vector(4) double` instead of `__vector(8) float`; related to #6 (closed)
   2. `__m256` values are sometimes confused with `float` values, e.g. `_mm256_set_ps(xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87)` is passed to functions that take a `__m256`, but the variable is defined as `const __m256 xi_87 = _mm256_sqrt_ps(/* ... */);` (see the sketch at the end of this post)
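Regarding point 1 of this update: the root cause is that JSON booleans arrive in Python as `bool` objects, which have no `.lower()` method; only values that CMake happened to pass as strings survive the check. A short illustration:

```python
import json

# "ON" stays a string, but true becomes a Python bool with no .lower()
parsed = json.loads('{"A": "ON", "B": true}')
print(type(parsed["A"]), type(parsed["B"]))  # <class 'str'> <class 'bool'>
```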
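And to make the last sub-point concrete, a minimal sketch of the SIMD type mismatch (the variable name `xi_87` is taken from the generated kernel; the rest is made up): `_mm256_set_ps()` expects eight scalar `float` arguments, so broadcasting an existing `__m256` through it cannot type-check; the value is already a vector and can be passed to vector intrinsics directly.

```cpp
#include <immintrin.h>

void sketch(const float *in, float *out) {
  const __m256 xi_87 = _mm256_sqrt_ps(_mm256_loadu_ps(in));

  // Invalid pattern from the generated kernel: _mm256_set_ps() takes
  // eight scalar floats, not __m256 values, so this does not compile:
  //   _mm256_set_ps(xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87)

  // xi_87 is already a __m256; pass it to vector intrinsics as-is:
  _mm256_storeu_ps(out, _mm256_add_ps(xi_87, xi_87));
}
```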