Vectorization: Infrastructure and Base Implementation for x86
This MR introduces the first wave of vectorization infrastructure into the new backend.
Alongside this, several changes and additions are made to the AST, the symbol table, typification, constant folding, code printing, as well as the Target API.
AST
- Introduce the ast.vector module for SIMD-related AST nodes
- Move PsVectorMemAcc to ast.vector, rename it to PsVecMemAcc, and allow its stride to be an expression
- Introduce PsVecBroadcast for scalar-to-vector broadcasts
Code Printing
- Split up the CAstPrinter into a generic BasePrinter and a C-specific subclass CAstPrinter
- Introduce IRAstPrinter, a subclass of BasePrinter, to print the entire IR as pseudocode (including untyped nodes, the vector IR, and all non-C constructs)
- Update PsAstNode.__str__ to call the IRAstPrinter
Symbol Table
- Extend duplicate_symbol to allow changing the duplicate's data type
- Add get_new_symbol to always receive a new symbol, even if the given name is already occupied
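The difference between the two symbol-table operations can be illustrated with a toy table. This is a minimal sketch and not the actual implementation: the classes ToySymbol and ToySymbolTable, the counter-based renaming scheme, and the use of plain strings as data types are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToySymbol:
    name: str
    dtype: Optional[str] = None


class ToySymbolTable:
    """Hypothetical stand-in for a symbol table; not the real API."""

    def __init__(self) -> None:
        self._symbols: Dict[str, ToySymbol] = {}

    def _free_name(self, name: str) -> str:
        # Keep the requested name if it is free; otherwise append a counter.
        if name not in self._symbols:
            return name
        i = 0
        while f"{name}__{i}" in self._symbols:
            i += 1
        return f"{name}__{i}"

    def get_new_symbol(self, name: str, dtype: Optional[str] = None) -> ToySymbol:
        # Always creates a fresh symbol, even if `name` is already occupied.
        sym = ToySymbol(self._free_name(name), dtype)
        self._symbols[sym.name] = sym
        return sym

    def duplicate_symbol(
        self, symb: ToySymbol, new_dtype: Optional[str] = None
    ) -> ToySymbol:
        # The duplicate keeps the original's data type unless a new one is given.
        dtype = symb.dtype if new_dtype is None else new_dtype
        return self.get_new_symbol(symb.name, dtype)


table = ToySymbolTable()
x = table.get_new_symbol("x", "float64")
x_again = table.get_new_symbol("x", "float64")  # name taken, so a fresh name is chosen
x_vec = table.duplicate_symbol(x, "float64x4")  # duplicate with a vector data type
```

A vectorizer benefits from both operations: it needs a vector-typed counterpart for each scalar symbol without risking a name clash with symbols that already exist in the kernel.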
Typification
- Fix handling of vectorial boolean and integer types
- Add support for PsVecBroadcast and vector memory accesses
Constant Folding
- Update EliminateConstants to correctly process vector constants and vector types
- Update EliminateConstants to fold PsCasts and PsVecBroadcasts of constants
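As a rough illustration of what folding a broadcast of a constant means, the sketch below uses a toy expression tree. The node classes Const, VecConst, Symbol, and Broadcast and the function fold are hypothetical and do not mirror the real AST classes or the EliminateConstants interface.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class VecConst:
    values: Tuple[float, ...]


@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Broadcast:
    lanes: int
    operand: "Node"


Node = Union[Const, VecConst, Symbol, Broadcast]


def fold(node: Node) -> Node:
    """Fold broadcasts of scalar constants into vector constants."""
    if isinstance(node, Broadcast):
        operand = fold(node.operand)
        if isinstance(operand, Const):
            # Broadcasting a known scalar yields a known vector.
            return VecConst((operand.value,) * node.lanes)
        # Anything else (e.g. a symbol) must stay a runtime broadcast.
        return Broadcast(node.lanes, operand)
    return node


folded = fold(Broadcast(4, Const(2.0)))
kept = fold(Broadcast(4, Symbol("a")))
```

Here folded is a four-lane vector constant, while kept remains a broadcast because the value of the symbol is unknown at compile time.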
AST Vectorization
Introduce the AstVectorizer transformer, which takes a scalar IR subtree and transforms it into a SIMD version of itself,
along a given iteration axis.
At this point, the AstVectorizer is capable of translating constants, symbols, arithmetic and math functions, type casts, and memory accesses with either lane-invariant or affine indices.
Vectorization and masking of conditionals and loops is future work.
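The two supported kinds of memory access index can be sketched in plain Python, with lists standing in for SIMD registers. The functions scalar_access and vector_access and the (coeff, offset) encoding of an index are assumptions for illustration only. An index of the form coeff * ctr + offset is affine in the loop counter ctr, and coeff == 0 is the lane-invariant case.

```python
from typing import List, Sequence


def scalar_access(buf: Sequence[float], coeff: int, offset: int, ctr: int) -> float:
    # Scalar reference: the element at index coeff * ctr + offset.
    return buf[coeff * ctr + offset]


def vector_access(
    buf: Sequence[float], coeff: int, offset: int, ctr: int, lanes: int
) -> List[float]:
    # Lane j of the vector corresponds to the scalar iteration ctr + j.
    base = coeff * ctr + offset
    if coeff == 0:
        # Lane-invariant index: one scalar load, broadcast to all lanes.
        return [buf[base]] * lanes
    # Affine index: strided access starting at `base` with stride `coeff`.
    return [buf[base + lane * coeff] for lane in range(lanes)]


data = [float(v) for v in range(32)]
strided = vector_access(data, coeff=2, offset=1, ctr=3, lanes=4)
invariant = vector_access(data, coeff=0, offset=5, ctr=3, lanes=4)
```

In both cases the vector result matches what the scalar code would have produced in iterations ctr through ctr + lanes - 1, which is the property a vectorizing transformation has to preserve.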
Loop Vectorization
Introduce the LoopVectorizer, which internally uses the AstVectorizer to transform single scalar loops into SIMD versions of themselves, with optional handling of trailing iterations.
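Handling trailing iterations amounts to splitting the iteration space into a part whose trip count is a multiple of the vector width and a scalar remainder. The following is a minimal sketch of that idea in plain Python. The helper names split_loop and scale_and_shift are hypothetical, and list slices stand in for vector operations.

```python
from typing import List, Tuple


def split_loop(start: int, stop: int, lanes: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Split [start, stop) into a vectorizable range and a scalar tail."""
    trip_count = max(stop - start, 0)
    vec_stop = start + (trip_count // lanes) * lanes
    return (start, vec_stop), (vec_stop, stop)


def scale_and_shift(src: List[float], lanes: int = 4) -> List[float]:
    dst = [0.0] * len(src)
    (v0, v1), (t0, t1) = split_loop(0, len(src), lanes)

    # Main loop: processes `lanes` elements per iteration.
    for i in range(v0, v1, lanes):
        block = src[i:i + lanes]
        dst[i:i + lanes] = [2.0 * x + 1.0 for x in block]

    # Trailing iterations: the original scalar loop body.
    for i in range(t0, t1):
        dst[i] = 2.0 * src[i] + 1.0

    return dst


result = scale_and_shift([float(i) for i in range(10)])
```

With ten elements and four lanes, the main loop covers indices 0 to 7 and the scalar tail covers indices 8 and 9. If the trip count is known to be a multiple of the vector width, the tail can be omitted, which is why its handling is optional.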
Intrinsic Selection
- Rename MaterializeVectorIntrinsics to SelectIntrinsics
- Refactor intrinsic selection API in GenericVectorCPU to directly receive AST nodes
- Implement intrinsic selection pass for constants, symbols, unary and binary operations, and memory accesses
- Refactor x86 vector platform: Adapt to new API, fix some errors
Target API
- Move Target from .enums to .target; deprecate .enums
- Add AVX512_FP16 target
- Add automatic detection of available vector architectures on the current machine to Target
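This description does not say how the detection is implemented. One plausible approach is to map CPU feature flags to vector architectures, as sketched below. The function available_vector_archs, the architecture labels, and the flag-to-architecture table are assumptions for illustration. The flag names follow the spelling commonly seen in /proc/cpuinfo on Linux, and the caller is expected to obtain the flags by whatever means the platform offers.

```python
from typing import Iterable, List, Tuple

# Hypothetical table, ordered from narrowest to widest architecture.
_FLAG_TO_ARCH: List[Tuple[str, str]] = [
    ("sse2", "SSE"),
    ("avx", "AVX"),
    ("avx512f", "AVX512"),
    ("avx512_fp16", "AVX512_FP16"),
]


def available_vector_archs(cpu_flags: Iterable[str]) -> List[str]:
    """Return the vector architectures implied by the given CPU flags."""
    flags = {flag.lower() for flag in cpu_flags}
    return [arch for flag, arch in _FLAG_TO_ARCH if flag in flags]


desktop = available_vector_archs(["fpu", "sse2", "avx", "avx2"])
server = available_vector_archs(["sse2", "avx", "avx512f", "avx512_fp16"])
```

Because the table is ordered from narrowest to widest, the last entry of the returned list is the widest architecture found, which a code generator could use as its default.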
Future Work
This MR provides basic infrastructure for vectorized code generation and a test suite for the basic functionality.
At this point, kernel vectorization is not yet part of the create_kernel pipeline.
Only a limited set of intrinsics has been implemented for x86 so far (gather/scatter, type casts, and others are still missing),
and platforms for other hardware (ARM, RISC-V, PPC, ...) are missing altogether.
Masked vectorization will also follow in the future.