Vectorization: Infrastructure and Base Implementation for x86
This MR introduces the first wave of vectorization infrastructure into the new backend.
Alongside this, several changes and additions are made to the AST, the symbol table, typification, constant folding, code printing, as well as the Target API.
AST
- Introduce the ast.vector module for SIMD-related AST nodes
- Move PsVectorMemAcc to ast.vector, rename it to PsVecMemAcc, and allow its stride to be an expression
- Introduce PsVecBroadcast for scalar-to-vector broadcasts
Code Printing
- Split up the CAstPrinter into a generic BasePrinter and a C-specific subclass CAstPrinter
- Introduce the IRAstPrinter subclass of BasePrinter to print the entire IR as pseudocode (including untyped constructs, the vector IR, and all non-C constructs)
- Update PsAstNode.__str__ to call the IRAstPrinter
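
The printer split can be pictured with a small toy model. The sketch below is not the backend's code: the node classes (Sym, Add, Broadcast), the class names CPrinter and IRPrinter, and the pseudocode syntax are all hypothetical stand-ins, chosen only to show how a shared base class can carry the traversal while subclasses decide which constructs they accept.

```python
from dataclasses import dataclass


# Toy node types; illustrative stand-ins, not the backend's AST classes.
@dataclass
class Sym:
    name: str


@dataclass
class Add:
    lhs: object
    rhs: object


@dataclass
class Broadcast:
    lanes: int
    operand: object


class BasePrinter:
    """Shared traversal logic; subclasses decide what else they can print."""

    def visit(self, node) -> str:
        if isinstance(node, Sym):
            return node.name
        if isinstance(node, Add):
            return f"{self.visit(node.lhs)} + {self.visit(node.rhs)}"
        return self.visit_other(node)

    def visit_other(self, node) -> str:
        raise NotImplementedError(f"cannot print {type(node).__name__}")


class CPrinter(BasePrinter):
    """Emits C only, so it rejects constructs that exist only in the IR."""

    def visit_other(self, node) -> str:
        raise ValueError(f"{type(node).__name__} has no C equivalent")


class IRPrinter(BasePrinter):
    """Emits pseudocode for every node, including vector IR."""

    def visit_other(self, node) -> str:
        if isinstance(node, Broadcast):
            return f"vec_broadcast<{node.lanes}>({self.visit(node.operand)})"
        return f"<{type(node).__name__}>"


expr = Add(Sym("x"), Broadcast(4, Sym("c")))
print(IRPrinter().visit(expr))  # x + vec_broadcast<4>(c)
```

In this toy model, the C printer fails loudly on the broadcast node, while the IR printer always produces some textual form, which is the property a debugging-oriented `__str__` needs.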
Symbol Table
- Extend duplicate_symbol to allow changing the duplicate's data type
- Add get_new_symbol to always receive a new symbol, even if the given name is already occupied
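
To illustrate the difference between looking up a symbol and requesting a fresh one, here is a minimal toy symbol table. Only the method names duplicate_symbol and get_new_symbol come from this MR; the class ToySymbolTable, the signatures, the string-valued data types, and the `__N` renaming scheme are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Symbol:
    name: str
    dtype: Optional[str] = None


class ToySymbolTable:
    """Hypothetical stand-in for a symbol table; not the real context class."""

    def __init__(self):
        self._symbols = {}

    def get_symbol(self, name, dtype=None):
        """Return the existing symbol of that name, or create it."""
        if name not in self._symbols:
            self._symbols[name] = Symbol(name, dtype)
        return self._symbols[name]

    def get_new_symbol(self, name, dtype=None):
        """Always create a fresh symbol, renaming it if the name is taken."""
        candidate, counter = name, 0
        while candidate in self._symbols:
            candidate = f"{name}__{counter}"  # assumed renaming scheme
            counter += 1
        self._symbols[candidate] = Symbol(candidate, dtype)
        return self._symbols[candidate]

    def duplicate_symbol(self, symb, new_dtype=None):
        """Copy a symbol under a fresh name, optionally with another type."""
        dtype = symb.dtype if new_dtype is None else new_dtype
        return self.get_new_symbol(symb.name, dtype)


table = ToySymbolTable()
x = table.get_symbol("x", "float64")
x_vec = table.duplicate_symbol(x, new_dtype="float64x4")
print(x_vec.name, x_vec.dtype)
```

Changing the data type on duplication is what a vectorizer needs when it derives a vector-typed counterpart of a scalar symbol.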
Typification
- Fix handling of vectorial boolean and integer types
- Add support for PsVecBroadcast and vector memory accesses
Constant Folding
- Update EliminateConstants to correctly process vector constants and vector types
- Update EliminateConstants to fold PsCasts and PsVecBroadcasts of constants
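
The folding of casts and broadcasts can be sketched on a toy expression tree. Everything in the block below (the node classes, the fold function, and the use of Python's int and float as cast targets) is a simplified assumption and does not reflect how EliminateConstants is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class VecConst:
    values: tuple


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Cast:
    target: type  # int or float in this toy model
    operand: object


@dataclass(frozen=True)
class Broadcast:
    lanes: int
    operand: object


def fold(node):
    """Fold casts and broadcasts whose operand folds to a constant."""
    if isinstance(node, Cast):
        operand = fold(node.operand)
        if isinstance(operand, Const):
            return Const(node.target(operand.value))
        return Cast(node.target, operand)
    if isinstance(node, Broadcast):
        operand = fold(node.operand)
        if isinstance(operand, Const):
            # A broadcast constant becomes a vector constant.
            return VecConst((operand.value,) * node.lanes)
        return Broadcast(node.lanes, operand)
    return node  # symbols and constants are left untouched


print(fold(Broadcast(4, Cast(float, Const(3)))))
print(fold(Broadcast(4, Sym("x"))))
```

The second call shows that a broadcast of a non-constant operand is left in place, since there is nothing to fold.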
AST Vectorization
Introduce the AstVectorizer transformer, which takes a scalar IR subtree and transforms it into a SIMD version of itself, along a given iteration axis.
At this point, the AstVectorizer is capable of translating constants, symbols, arithmetic and math functions, type casts, and memory accesses with either lane-invariant or affine indices.
Vectorization and masking of conditionals and loops is future work.
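
The distinction between lane-invariant and affine indices can be made concrete with a toy model. The sketch below assumes indices of the form coeff * counter + offset with integer coefficients; the classes and functions are hypothetical and far simpler than the AstVectorizer. A lane-invariant index (coeff == 0) yields a scalar load followed by a broadcast, while an affine index yields a strided vector access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineIndex:
    """Index of the form coeff * counter + offset."""
    coeff: int
    offset: int


@dataclass(frozen=True)
class Load:
    array: str
    index: AffineIndex


@dataclass(frozen=True)
class VecLoad:
    array: str
    index: AffineIndex
    stride: int
    lanes: int


@dataclass(frozen=True)
class Broadcast:
    lanes: int
    operand: object


def vectorize_load(load, lanes):
    """Vectorize a scalar load along the loop counter."""
    if load.index.coeff == 0:
        # Lane-invariant: every lane reads the same element.
        return Broadcast(lanes, load)
    # Affine: consecutive lanes are `coeff` elements apart.
    return VecLoad(load.array, load.index, load.index.coeff, lanes)


def eval_scalar(load, arrays, i):
    return arrays[load.array][load.index.coeff * i + load.index.offset]


def eval_vector(node, arrays, i):
    """Evaluate a vectorized load for the lanes starting at counter i."""
    if isinstance(node, Broadcast):
        return [eval_scalar(node.operand, arrays, i)] * node.lanes
    base = node.index.coeff * i + node.index.offset
    return [arrays[node.array][base + lane * node.stride]
            for lane in range(node.lanes)]


arrays = {"a": list(range(100))}
strided = Load("a", AffineIndex(coeff=2, offset=1))
vec = vectorize_load(strided, lanes=4)
print(eval_vector(vec, arrays, 3))
print([eval_scalar(strided, arrays, i) for i in range(3, 7)])
```

Both print statements produce the same list, which is the correctness criterion for vectorization: lane k of the vector result equals the scalar result at counter value i + k.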
Loop Vectorization
Introduce the LoopVectorizer, which internally uses the AstVectorizer to transform single scalar loops into SIMD versions of themselves, with optional handling of trailing iterations.
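
The handling of trailing iterations can be sketched in plain Python. The helper names below are hypothetical, and the "vector" iteration is emulated with list slices; the point is only the split of the iteration space into a main loop that advances by the lane count and a scalar remainder loop.

```python
def split_iteration_space(start, stop, lanes):
    """Return (vector_range, trailing_range) for a unit-step loop."""
    trip_count = max(stop - start, 0)
    vec_stop = start + (trip_count // lanes) * lanes
    return range(start, vec_stop, lanes), range(vec_stop, stop)


def axpy_vectorized(a, x, y, lanes=4):
    """Compute a * x[i] + y[i] with an emulated SIMD main loop."""
    out = list(y)
    vec_range, tail_range = split_iteration_space(0, len(x), lanes)
    for i in vec_range:
        # One iteration handles `lanes` consecutive elements at once.
        out[i:i + lanes] = [a * xv + yv
                            for xv, yv in zip(x[i:i + lanes], y[i:i + lanes])]
    for i in tail_range:
        # Trailing iterations that do not fill a whole vector stay scalar.
        out[i] = a * x[i] + y[i]
    return out


x = list(range(10))
y = [1] * 10
print(axpy_vectorized(2, x, y))
```

With ten elements and four lanes, the main loop covers indices 0 to 7 in two steps and the remainder loop covers indices 8 and 9. If the trip count is known to be a multiple of the lane count, the remainder loop is empty and could be omitted, which is presumably why the trailing-iteration handling is optional.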
Intrinsic Selection
- Rename MaterializeVectorIntrinsics to SelectIntrinsics
- Refactor intrinsic selection API in GenericVectorCPU to directly receive AST nodes
- Implement intrinsic selection pass for constants, symbols, unary and binary operations, and memory accesses
- Refactor x86 vector platform: Adapt to new API, fix some errors
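
As a rough picture of what intrinsic selection does, the toy pass below maps a small vector expression tree to AVX intrinsics for 4 x double. The node classes and the select function are hypothetical and unrelated to the GenericVectorCPU API; only the intrinsic names (`_mm256_set1_pd`, `_mm256_loadu_pd`, `_mm256_add_pd`, and so on) are real x86 intrinsics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Broadcast:
    operand: object


@dataclass(frozen=True)
class VecLoad:
    pointer: str  # C expression yielding the address, kept as text here


@dataclass(frozen=True)
class BinOp:
    op: str
    lhs: object
    rhs: object


# AVX intrinsics for vectors of four doubles.
AVX_F64_BINOPS = {
    "+": "_mm256_add_pd",
    "-": "_mm256_sub_pd",
    "*": "_mm256_mul_pd",
    "/": "_mm256_div_pd",
}


def select(node) -> str:
    """Translate a toy vector expression into a C intrinsic call string."""
    if isinstance(node, Sym):
        return node.name
    if isinstance(node, Broadcast):
        return f"_mm256_set1_pd({select(node.operand)})"
    if isinstance(node, VecLoad):
        return f"_mm256_loadu_pd({node.pointer})"
    if isinstance(node, BinOp):
        func = AVX_F64_BINOPS[node.op]
        return f"{func}({select(node.lhs)}, {select(node.rhs)})"
    raise NotImplementedError(f"no intrinsic for {type(node).__name__}")


expr = BinOp("+",
             BinOp("*", Broadcast(Sym("a")), VecLoad("&x[i]")),
             VecLoad("&y[i]"))
print(select(expr))
```

Receiving whole AST nodes, as the refactored API does, lets the platform inspect operand types and node kinds directly instead of working from a pre-digested description of the operation.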
Target API
- Move Target from .enums to .target; deprecate .enums
- Add AVX512_FP16 target
- Add automatic detection of available vector architectures on the current machine to Target
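
This MR description does not say how the detection is implemented. One common approach, shown here purely as an assumption, is to match CPU feature flags against the flag each target requires. The enum ToyTarget, its members, and both helper functions are hypothetical; the flag names follow the spelling Linux uses in /proc/cpuinfo.

```python
from enum import Enum


class ToyTarget(Enum):
    """Hypothetical targets, each tagged with the CPU flag it requires."""
    X86_SSE = "sse2"
    X86_AVX = "avx"
    X86_AVX512 = "avx512f"
    X86_AVX512_FP16 = "avx512_fp16"


def available_targets(cpu_flags):
    """Return all toy targets whose required flag is present."""
    flags = set(cpu_flags)
    return [target for target in ToyTarget if target.value in flags]


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Best-effort flag lookup; returns [] where the file is unavailable."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []


print(available_targets(["fpu", "sse2", "avx", "avx2"]))
print(available_targets(read_linux_cpu_flags()))
```

The second call depends on the machine it runs on, and on non-Linux systems it simply reports no targets; a portable implementation would need a different source for the feature flags.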
Future Work
This MR provides basic infrastructure for vectorized code generation and a test suite for the basic functionality.
At this point, kernel vectorization is not yet part of the create_kernel pipeline.
Only a limited set of intrinsics is implemented for x86 so far (gather/scatter and type casts, among others, are still missing),
and platforms for other hardware (ARM, RISC-V, PPC, ...) are missing altogether.
Masked vectorization will also follow in the future.