Single-precision codegen leads to kernels with mixed types
I recently experimented with single-precision code generation to benchmark performance improvements over the default double-precision kernels. waLBerla is compiled with `WALBERLA_DOUBLE_ACCURACY=ON` and linked against a library that uses the double- and single-precision kernels via type traits and SFINAE from a templated class `template <typename FloatType> class LBWalberla` that is specialized for `float` and `double`.
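For reference, here is a minimal sketch of that dispatch pattern; the `generated::collide` kernels are made up for illustration, and only `LBWalberla` and its `float`/`double` instantiations come from my actual setup:

```cpp
#include <type_traits>
#include <vector>

// Hypothetical generated kernels; in practice these live in the
// generated double- and single-precision headers.
namespace generated {
inline void collide(std::vector<double> &pdfs, double omega) { /* ... */ }
inline void collide(std::vector<float> &pdfs, float omega) { /* ... */ }
} // namespace generated

// SFINAE guard: only float and double instantiations are allowed.
template <typename FloatType,
          typename = std::enable_if_t<std::is_same_v<FloatType, float> ||
                                      std::is_same_v<FloatType, double>>>
class LBWalberla {
public:
  void collide(std::vector<FloatType> &pdfs, FloatType omega) {
    // overload resolution picks the kernel that matches FloatType
    generated::collide(pdfs, omega);
  }
};

template class LBWalberla<float>;  // single-precision build
template class LBWalberla<double>; // double-precision build
```

This only works cleanly if the two generated headers declare kernels with distinct signatures, which is exactly what the mixed-type kernels described below break.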
For the double- and single-precision headers to coexist in the same translation unit, their signatures need to differ. Unfortunately, a few kernels have mixed types, which makes SFINAE trickier, generates compiler warnings, and makes benchmarking against double precision unreliable (the single-precision kernels are not really single-precision). There are also a few kernels that use `real_t`, which ties them to a specific build of waLBerla. Finally, there are a couple of places in the kernels where quality-of-life improvements can be made.
Some of the issues have their roots in pystencils or lbmpy rather than waLBerla itself. I am gathering my feedback here because it's not always clear to me where the issues originate, but I wouldn't mind opening tickets on the pycodegen projects once we've identified where the changes belong. A few of the issues probably cannot be addressed for technical reasons. Others might be low-hanging fruit; perhaps I could contribute a patch with your guidance.
Here are my observations with single-precision code generation:
1. Regarding the dynamic UBB framework (see code sample at the end of this post):
   1. The `Dynamic_UBB` class uses `real_t` instead of a placeholder for `float`/`double`; right now I use text substitution to replace `real_t` by the desired type
   2. The `boundary_Dynamic_UBB()` kernel uses `const double weights[]` instead of a placeholder for `float`/`double`
   3. The velocities in `IndexInfo` are currently hardcoded as `double`, and the `boundary_Dynamic_UBB()` kernel is currently accessing vectors with hardcoded `double` alignment (e.g. `*((float *)(&_data_indexVector[40 * ctr_0 + 24]))` instead of `*((double *)(&_data_indexVector[40 * ctr_0 + 20]))`)
   4. `IndexVectors::operator==()` should have signature `bool operator==(IndexVectors const &other) const`, just like `IndexInfo` does; otherwise two `IndexVectors` objects passed as const references cannot be compared
   5. `IndexVectors::pointerCpu()` should return `cpuVectors_[t].data()` to avoid dereferencing a possibly non-existent element
2. Regarding the macroscopic values accessor framework (`python/lbmpy_walberla/templates/LatticeModel.tmpl.h`):
   1. The includes are missing `#include "core/math/Matrix3.h"`
   2. The struct methods use `real_t` instead of a placeholder for `float`/`double`; since I'm working on a modified version of this template, I simply replaced `real_t` with `{{real_t}}`
   3. Several generators like `density_velocity_setter_macroscopic_values` output code that mixes `float` types with `double` literals, e.g. `const float u_0 = -0.5*force_field.get(x,y,z,0)/rho_in + u[0];` (should use `0.5f`)
3. Regarding generated MRT collision kernels:
   1. There is a 50/50 mix of `double` and `float` in the kernels
   2. I use function `rr_getter()` from `lbmpy_tests/test_fluctuating_lb.py` to declare relaxation rates as e.g. `sp.Symbol("omega_bulk")`, which by default becomes `double`; is there a way to use a `float`-typed symbol? (see the sketch after this list)
   3. If the kernel is changed from `double` to `float`, should `philox_double2()` be replaced by `philox_float4()`?
4. Compiler warning pragmas like `#pragma GCC diagnostic ignored "-Wconversion"` are present, but they don't suppress `-Wfloat-conversion` in projects that explicitly add `-Wfloat-conversion` to the compiler flags; adding an extra `#pragma GCC diagnostic ignored "-Wfloat-conversion"` would fix the issue
   - Update: no longer needed; all mixed-type issues have been resolved by pystencils 0.4.3 and !498 (closed).
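On point 3.2: a possible approach, if I understand the pystencils API correctly, is to declare the relaxation rates with `pystencils.TypedSymbol` instead of `sp.Symbol`, and to request the single-precision RNG explicitly. A minimal, untested sketch, assuming `TypedSymbol` accepts a numpy dtype and that `PhiloxFourFloats` is the single-precision counterpart of `PhiloxTwoDoubles`:

```python
import numpy as np
from pystencils import TypedSymbol
from pystencils.rng import PhiloxFourFloats

# sp.Symbol("omega_bulk") defaults to double in the generated code;
# a TypedSymbol carries an explicit dtype instead.
omega_bulk = TypedSymbol("omega_bulk", np.float32)

# Four floats per Philox call instead of two doubles, which should map
# to philox_float4() instead of philox_double2() in the generated kernel.
rng = PhiloxFourFloats(dim=3)
```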
Point 3.1 cannot be fixed by text substitution. Points 1.2, 1.3, 1.4 and 2.3 can be fixed by text substitution, but that's brittle.
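To illustrate why the `{{real_t}}` placeholder from point 2.2 is more robust than post-hoc substitution: the type is resolved at generation time by the template engine (jinja2, which the `.tmpl.h` templates already go through), so there is no fragile find-and-replace on the generated header. A toy example with a made-up template string:

```python
import jinja2

# The placeholder is resolved once, when the header is generated:
template = jinja2.Template("inline {{real_t}} density(const {{real_t}} *pdfs);")
print(template.render(real_t="float"))   # inline float density(const float *pdfs);
print(template.render(real_t="double"))  # inline double density(const double *pdfs);
```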
Generated Dynamic_UBB code:

```cpp
/*
 * built with:
 * pystencils v0.3.4+4.g4fecf0c (backports the subexpression bugfix)
 * lbmpy v0.3.4+6.g2faceda (backports pressure tensor merge request)
 * walberla b17ca5caf00db7d19f86c5f85c6f67fec6c16aff
 * codegen script: https://github.com/espressomd/espresso/blob/0b0337e03/maintainer/walberla_kernels/generate_lb_kernels.py
 */
class Dynamic_UBB {
public:
  struct IndexInfo {
    int32_t x, y, z;
    int32_t dir;
    double vel_0, vel_1, vel_2; // <--- should be float
    IndexInfo(int32_t x_, int32_t y_, int32_t z_, int32_t dir_)
        : x(x_), y(y_), z(z_), dir(dir_), vel_0(), vel_1(), vel_2() {}
    bool operator==(const IndexInfo &o) const { /* ... */ }
  };

  class IndexVectors {
  public:
    using CpuIndexVector = std::vector<IndexInfo>;
    enum Type { ALL = 0, INNER = 1, OUTER = 2, NUM_TYPES = 3 };
    IndexVectors() = default;
    bool operator==(IndexVectors &other) { // <--- this should be const and take a const ref
      return other.cpuVectors_ == cpuVectors_;
    }
    CpuIndexVector &indexVector(Type t) { return cpuVectors_[t]; }
    IndexInfo *pointerCpu(Type t) {
      return &(cpuVectors_[t][0]); // <--- undefined behavior if container size is zero
    }
    void syncGPU() {}

  private:
    std::vector<CpuIndexVector> cpuVectors_{NUM_TYPES};
  };
  /* ... */
};
```
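For concreteness, here is a sketch of the fixes proposed in points 1.4 and 1.5, with the class reduced to the relevant members and `IndexInfo` stubbed out:

```cpp
#include <cstdint>
#include <vector>

struct IndexInfo {
  std::int32_t x, y, z, dir;
  bool operator==(IndexInfo const &o) const {
    return x == o.x && y == o.y && z == o.z && dir == o.dir;
  }
};

class IndexVectors {
public:
  using CpuIndexVector = std::vector<IndexInfo>;
  enum Type { ALL = 0, INNER = 1, OUTER = 2, NUM_TYPES = 3 };

  // const-qualified and taking a const reference, so two IndexVectors
  // passed as const& can be compared
  bool operator==(IndexVectors const &other) const {
    return other.cpuVectors_ == cpuVectors_;
  }

  // std::vector::data() is well-defined even when the vector is empty,
  // unlike &v[0]
  IndexInfo *pointerCpu(Type t) { return cpuVectors_[t].data(); }

private:
  std::vector<CpuIndexVector> cpuVectors_{NUM_TYPES};
};
```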
Update from Nov 24: I've now experimented with kernels generated by lbmpy/pystencils 0.4.3 with !498 (closed) merged into de6b0007, with and without AVX2. Here are additional observations:
1. `CodeGeneration` cannot be default-constructed (the root cause is illustrated after this list). Reproducible with commit de6b0007:

   ```python
   import sys
   print(sys.argv)
   from pystencils_walberla import CodeGeneration
   ctx = CodeGeneration()
   ```

   ```
   ['']
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/work/jgrad/walberla/python/pystencils_walberla/cmake_integration.py", line 22, in __init__
       expected_files, cmake_vars = parse_json_args()
     File "/work/jgrad/walberla/python/pystencils_walberla/cmake_integration.py", line 67, in parse_json_args
       if value.lower() in ("on", "1", "yes", "true"):
   AttributeError: 'bool' object has no attribute 'lower'
   ```

   A possible fix:

   ```diff
   diff --git a/python/pystencils_walberla/cmake_integration.py b/python/pystencils_walberla/cmake_integration.py
   index 3e5a4eba..14f25a17 100644
   --- a/python/pystencils_walberla/cmake_integration.py
   +++ b/python/pystencils_walberla/cmake_integration.py
   @@ -64,9 +64,9 @@ def parse_json_args():
        expected_files = parsed['EXPECTED_FILES']
        cmake_vars = {}
        for key, value in parsed['CMAKE_VARS'].items():
   -        if value.lower() in ("on", "1", "yes", "true"):
   +        if str(value).lower() in ("on", "1", "yes", "true"):
                value = True
   -        elif value.lower() in ("off", "0", "no", "false"):
   +        elif str(value).lower() in ("off", "0", "no", "false"):
                value = False
            cmake_vars[key] = value
        return expected_files, cmake_vars
   ```
2. Single-precision kernels use Philox to generate `double` values. Simply use `pystencils.rng.PhiloxFourFloats`.
3. Single-precision AVX2 kernels have mixed SIMD types; I tried manually patching the code (espressomd/espresso@82c007e0), but the kernels segfault, so my patch must be incorrect. Simply use `pystencils.rng.PhiloxFourFloats`.
   1. Philox is used to generate `__m256d` values but `__m256` values are needed, aka `__vector(4) double` instead of `__vector(8) float`; related to #6 (closed)
   2. `__m256` values are sometimes confused with `float` values, e.g. `_mm256_set_ps(xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87)` is passed to functions that take a `__m256`, but the variable is defined as `const __m256 xi_87 = _mm256_sqrt_ps(/* ... */);` (see the sketch at the end of this post)
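Regarding point 1 of this update: the root cause is that JSON booleans arrive in Python as `bool` objects, which have no `.lower()` method; only values that CMake happened to pass as strings survive the check. A short illustration:

```python
import json

# "ON" stays a string, but true becomes a Python bool with no .lower()
parsed = json.loads('{"A": "ON", "B": true}')
print(type(parsed["A"]), type(parsed["B"]))  # <class 'str'> <class 'bool'>
```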
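And to make the last sub-point concrete, a minimal sketch of the SIMD type mismatch (the variable name `xi_87` is taken from the generated kernel; the rest is made up): `_mm256_set_ps()` expects eight scalar `float` arguments, so broadcasting an existing `__m256` through it cannot type-check; the value is already a vector and can be passed to vector intrinsics directly.

```cpp
#include <immintrin.h>

void sketch(const float *in, float *out) {
  const __m256 xi_87 = _mm256_sqrt_ps(_mm256_loadu_ps(in));

  // Invalid pattern from the generated kernel: _mm256_set_ps() takes
  // eight scalar floats, not __m256 values, so this does not compile:
  //   _mm256_set_ps(xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87, xi_87)

  // xi_87 is already a __m256; pass it to vector intrinsics as-is:
  _mm256_storeu_ps(out, _mm256_add_ps(xi_87, xi_87));
}
```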