Memory alignment error in AVX kernels with Clang

While developing with AVX kernels, we stumbled upon a flaw in our codegen script that involves the following lbmpy parameters:

cpu_vectorize_info = {
    "instruction_set": "avx",
    "assume_inner_stride_one": True,
    "assume_aligned": True,
    "assume_sufficient_line_padding": False}

We assumed the walberla::field::AllocateAligned<unsigned char, alignment> class would choose the correct memory alignment for __m256d types, i.e. 32 bytes. This is the case for GCC 9.3.0 and Intel Parallel Studio XE 19.0.2.187, but not for Clang 10.0.0, where the chosen value is 16 bytes. This is an issue, e.g., for _mm256_load_pd(double const *a), which mandates 32-byte alignment for the pointee. In binaries built by Clang, this call raises a SIGSEGV when the address held by the pointer to the x-stride is not an integer multiple of 32 bytes.

In a Debug build with ASAN and UBSAN, the following report is generated:

UndefinedBehaviorSanitizer: undefined-behavior /usr/include/boost/mpi/collectives/all_reduce.hpp:36:5
in /usr/lib/llvm-9/lib/clang/9.0.1/include/avxintrin.h:3072:10: runtime error: load of misaligned address
0x155549b2c230 for type '__m256d' (vector of 4 'double' values), which requires 32 byte alignment
0x155549b2c230: note: pointer points here
 3e e9 a3 3f  94 3e e9 93 3e e9 a3 3f  94 3e e9 93 3e e9 a3 3f  94 3e e9 93 3e e9 a3 3f  94 3e e9 93
              ^

Investigation with GDB reveals the following:

┌──default_codegen/InitialPDFsSetterAVX.cpp────────────────────────────────────────────────────────┐
│   79    double * RESTRICT _data_pdfs_27_10 = _stride_pdfs_1*ctr_1 + _data_pdfs_27;               │
│   80    double * RESTRICT _data_pdfs_28_10 = _stride_pdfs_1*ctr_1 + _data_pdfs_28;               │
│   81    {                                                                                        │
│   82       for (int64_t ctr_0 = 0; ctr_0 < (int64_t)((_size_pdfs_0) / (4)) * (4); ctr_0 += 4)    │
│   83       {                                                                                     │
│  >84          const __m256d u_0 = _mm256_load_pd(& _data_velocity_20_10[ctr_0]);                 │
│   85          const __m256d u_1 = _mm256_load_pd(& _data_velocity_21_10[ctr_0]);                 │
│   86          _mm256_store_pd(&_data_pdfs_20_10[ctr_0],_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(│
│   87          _mm256_store_pd(&_data_pdfs_21_10[ctr_0],_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(│
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
In: walberla::pystencils::internal_initialpdfssetteravx_initialpdfssetteravx::initialpdfssetteravx*
Thread 1 "ExampleAppCodeg" received signal SIGSEGV, Segmentation fault.
0x00000000006821fb in walberla::pystencils::internal_initialpdfssetteravx_initialpdfssetteravx::init
ialpdfssetteravx_initialpdfssetteravx (
    _data_pdfs=0x15554ce99f80, _data_velocity=0x15554d558f80, _size_pdfs_0=300, _size_pdfs_1=80,
    _stride_pdfs_1=302, _stride_pdfs_2=74292, _stride_velocity_1=302, _stride_velocity_2=74292,
    rho_0=1) at default_codegen/InitialPDFsSetterAVX.cpp:84
84	            const __m256d u_0 = _mm256_load_pd(& _data_velocity_20_10[ctr_0]);
(gdb) tui e
(gdb) print & _data_velocity_20_10[ctr_0]
$1 = (double *) 0x15554d5598f0
(gdb) python print(0x15554d5598f0 / 32)
733003558087.5

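That particular address is a multiple of 16 but not of 32 (hence the fractional quotient above), i.e. it is only 16-byte aligned. In other runs we also observed addresses that were only 8-byte aligned.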
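The alignment arithmetic from the GDB session can be reproduced with a small compile-time check. This is only a sketch: the helper misalignment() and the constant names are made up for illustration, the numeric values are copied from the backtrace above, and a double is assumed to be 8 bytes.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes by which `address` lies past the
// previous multiple of `alignment` (0 means aligned).
// `alignment` must be a power of two.
constexpr std::uint64_t misalignment(std::uint64_t address, std::uint64_t alignment) {
    return address & (alignment - 1);
}

// Values copied from the GDB session above.
constexpr std::uint64_t kDataVelocity    = 0x15554d558f80ULL; // _data_velocity
constexpr std::uint64_t kFaultingAddress = 0x15554d5598f0ULL; // &_data_velocity_20_10[ctr_0]
constexpr std::uint64_t kStrideDoubles   = 302;               // _stride_velocity_1
constexpr std::uint64_t kBytesPerDouble  = 8;                 // assumption: 8-byte double

// The faulting address is 16-byte aligned, but not 32-byte aligned.
static_assert(misalignment(kFaultingAddress, 16) == 0, "16-byte aligned");
static_assert(misalignment(kFaultingAddress, 32) == 16, "16 bytes past a 32-byte boundary");

// In this particular run, the pointer passed to the kernel is 32-byte aligned,
// and the faulting address lies exactly one row stride (302 doubles) behind it.
static_assert(misalignment(kDataVelocity, 32) == 0, "base pointer is 32-byte aligned");
static_assert(kDataVelocity + kStrideDoubles * kBytesPerDouble == kFaultingAddress,
              "faulting address = base + one row stride");
static_assert(misalignment(kStrideDoubles * kBytesPerDouble, 32) == 16,
              "row stride is not a multiple of 32 bytes");
```

All checks run at compile time. They only restate what the numbers in the backtrace imply: a row stride of 302 doubles shifts each consecutive row by 16 bytes relative to a 32-byte boundary. Whether this, rather than the allocator, explains the fault is not established here.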

We were expecting src/field/Field.impl.h:340-347 to select the walberla::field::AllocateAligned<unsigned char, 32> allocator. It seems like the alignment is too small by a factor of 2 for the Clang compiler (related MR: !458 (merged)).

The conditional evaluates sizeof(T) < alignment && alignment % sizeof(T) == 0 with T = const float [13] (sizeof(T) = 52) instead of a hypothetical T_underlying = const float (sizeof(T_underlying) = 4), which surprised me a bit. I also did not fully understand the behavior of this conditional in GDB. For example with Clang and GCC, I have found myself in a situation where:

(gdb) print sizeof(T) < alignment
$3 = false

yet the true branch was taken when stepping through the code. I double-checked C++ operator precedence to make sure there were no missing parentheses and it looked OK, so I'm not exactly sure what is happening in GDB. If I understand the code correctly, it should actually take the false branch and instantiate the standard allocator, which has 8-byte alignment.
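To make the expected outcome of the conditional concrete, here is a minimal sketch that evaluates the same expression at compile time. It is not the code from Field.impl.h: the function names are hypothetical, alignment = 32 is assumed, and the second variant (stripping the array extents before taking sizeof) only illustrates the hypothetical T_underlying mentioned above.

```cpp
#include <cstddef>
#include <type_traits>

// Mirrors the quoted condition: sizeof(T) < alignment && alignment % sizeof(T) == 0.
// Hypothetical name; not the waLBerla implementation.
template <typename T, std::size_t alignment>
constexpr bool selects_aligned_allocator() {
    return sizeof(T) < alignment && alignment % sizeof(T) == 0;
}

// Hypothetical variant that applies the same condition to the underlying
// element type, i.e. with all array extents removed.
template <typename T, std::size_t alignment>
constexpr bool selects_aligned_allocator_underlying() {
    using U = typename std::remove_all_extents<T>::type;
    return sizeof(U) < alignment && alignment % sizeof(U) == 0;
}

using Cell = const float[13]; // the T reported in the issue

static_assert(sizeof(Cell) == 13 * sizeof(float), "array size is 13 floats");

// With T = const float[13] (52 bytes for a 4-byte float), 52 < 32 is false,
// so the condition selects the false branch.
static_assert(!selects_aligned_allocator<Cell, 32>(), "false branch for the array type");

// With the underlying float (4 bytes): 4 < 32 and 32 % 4 == 0, so the
// condition would select the aligned allocator.
static_assert(selects_aligned_allocator_underlying<Cell, 32>(), "true branch for float");
```

Under these assumptions the expression is false for the array type, which matches the value printed by GDB and the expectation that the false branch should be taken.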

A minimal working example can be found in jngrad/example_app, branch mwe-avx. The codegen script automatically generates both AVX and non-AVX kernels and CMake builds the stand-alone binary without AVX. Then one has to manually compile and link the AVX version of the binary with -DWALBERLA_BUILD_WITH_AVX (details in the Readme file). This work was carried out using lbmpy/pystencils 0.4.4 and waLBerla de6b0007 with backported fixes from ff26f885.