We would like to generate code for kernels that contain spatially dependent expressions, i.e., the expression tree contains loop counter symbols. Unfortunately, vectorizing such kernels does not currently work.
The goal of this MR is to change that. It is far, far from finished, but I would appreciate some feedback early on. In the current state, I was able to generate one of our kernels with vectorization (AVX) enabled, and the code runs on my AVX512 machine (note, however, that the generated code contains AVX512 instructions). This MR includes a very basic test so that you can play around with the new feature if you want.
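The test itself is not reproduced here; conceptually, what the vectorizer has to emit for a spatially dependent expression looks like the following NumPy sketch (plain Python standing in for generated C code, all names hypothetical):

```python
import numpy as np

def kernel_scalar(dst, src, h):
    # Spatially dependent expression: the loop counter i appears
    # in the expression tree itself.
    for i in range(dst.shape[0]):
        dst[i] = src[i] * (i * h)

def kernel_vectorized(dst, src, h, lanes=4):
    # What vectorization conceptually produces for AVX f64 (4 lanes):
    # the loop counter becomes a small vector [i, i+1, i+2, i+3].
    n = dst.shape[0]
    base = np.arange(lanes, dtype=np.float64)
    for i in range(0, n - n % lanes, lanes):
        counter = i + base
        dst[i:i + lanes] = src[i:i + lanes] * (counter * h)
    for i in range(n - n % lanes, n):  # scalar remainder loop
        dst[i] = src[i] * (i * h)

src = np.linspace(1.0, 2.0, 10)
a, b = np.empty(10), np.empty(10)
kernel_scalar(a, src, 0.5)
kernel_vectorized(b, src, 0.5)
assert np.allclose(a, b)
```

The key point is the `counter` vector: it must have the same number of lanes as the floating-point data, which is exactly where the issues below come from.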
Issues with the current implementation are:
- The vectorized loop counter uses a different vector width than the rest of the code. For AVX, for example, the f64 vector width is 4 and the loop is vectorized accordingly; the vectorized loop counter, however, has 8 lanes.
- Vector expressions can not be cast to different types.
- Integer expressions can not be properly vectorized. The loop counter is `int64`, but many places assume that it is `int32`. This is relevant here since conversions from `int64` are not available before AVX512 (they require rounding).
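To make the first and third issues concrete, here is a small illustration (NumPy standing in for what the hardware does; attributing the 8-lane counter to a 32-bit derivation is my assumption about the cause):

```python
import numpy as np

# Issue 1: lane mismatch. On AVX (256-bit registers), f64 data
# vectors have 256 / 64 = 4 lanes; an 8-lane counter is what one
# gets if its width is derived from a 32-bit integer type instead.
AVX_BITS = 256
assert AVX_BITS // 64 == 4   # f64 lanes; the loop is unrolled by 4
assert AVX_BITS // 32 == 8   # lanes of the current loop counter

# Issue 3: int64 -> f64 conversion requires rounding, because
# float64 has only a 53-bit mantissa while int64 has 63 value bits.
assert np.float64(np.int64(2**53)) == 2**53                    # still exact
assert np.int64(np.float64(np.int64(2**53 + 1))) != 2**53 + 1  # rounded

# int32 always converts to f64 exactly, which is why assuming
# int32 sidesteps the problem entirely:
assert np.float64(np.int32(2**31 - 1)) == 2**31 - 1
```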
The proposed change is to rework the instruction sets slightly. Instead of "dumb" dictionaries, they get promoted to proper classes. Every time an instruction is queried, the data type (base type and number of lanes) must be specified. This way, we can properly handle integers as well as vectors narrower than the maximum bit width supported by the instruction set; the latter is necessary, e.g., to work with i32x4 vectors on AVX. Furthermore, type conversion is implemented in the instruction sets.
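A minimal sketch of what I have in mind for the interface (all class and method names are hypothetical, not the actual implementation in this MR):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorType:
    base: str   # e.g. "double", "int32"
    lanes: int  # e.g. 4 for f64 on AVX

class X86InstructionSet:
    """Instruction set as a class instead of a 'dumb' dict: every
    query specifies the data type, so integers and narrower-than-
    maximum vectors (e.g. i32x4 on AVX) can be handled correctly."""

    _SUFFIX = {"double": "pd", "float": "ps", "int32": "epi32"}
    _BITS = {"double": 64, "float": 32, "int32": 32}

    def _prefix(self, vtype):
        width = vtype.lanes * self._BITS[vtype.base]
        # narrower vectors fall back to SSE-width registers
        return {128: "_mm", 256: "_mm256", 512: "_mm512"}[width]

    def instruction(self, op, vtype):
        return f"{self._prefix(vtype)}_{op}_{self._SUFFIX[vtype.base]}"

    def cast(self, src, dst):
        # type conversion lives in the instruction set as well
        return (f"{self._prefix(dst)}_cvt"
                f"{self._SUFFIX[src.base]}_{self._SUFFIX[dst.base]}")

isa = X86InstructionSet()
print(isa.instruction("add", VectorType("double", 4)))  # _mm256_add_pd
print(isa.instruction("add", VectorType("int32", 4)))   # _mm_add_epi32
print(isa.cast(VectorType("int32", 4), VectorType("double", 4)))
```

Note how `i32x4` maps to an SSE-width register (`_mm_add_epi32`) even though the instruction set is AVX, and how the cast query resolves to the real intrinsic `_mm256_cvtepi32_pd`.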
This obviously changes the interface and has strong implications for how vectorization is handled throughout the code base. Therefore, I would appreciate any feedback on whether these changes are in line with the goals of the project and the thoughts behind #46. Any comments are welcome.