Vectorization: Infrastructure and Base Implementation for x86 (!424) · Merge requests · pycodegen / pystencils

This MR introduces the first wave of vectorization infrastructure into the new backend. Alongside this, several changes and additions are made to the AST, the symbol table, typification, constant folding, code printing, as well as the Target API.

AST

Introduce the ast.vector module for SIMD-related AST Nodes
Move PsVectorMemAcc to ast.vector, rename it to PsVecMemAcc, and allow its stride to be an expression
Introduce PsVecBroadcast for scalar-to-vector broadcasts

Code Printing

Split up the CAstPrinter into generic BasePrinter and C-specific subclass CAstPrinter
Introduce IRAstPrinter, a subclass of BasePrinter that prints the entire IR as pseudocode (including untyped nodes, the vector IR, and all non-C constructs)
Update PsAstNode.__str__ to call the IRAstPrinter
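
The printer split can be pictured with a toy model. This is only a sketch: the tuple-based nodes, the class names ToyBasePrinter, ToyCPrinter and ToyIRPrinter, and the pseudocode syntax are all made up for illustration and do not reflect the actual pystencils classes. It also assumes that the C printer rejects vector IR nodes, which this description does not state.

```python
class ToyBasePrinter:
    """Hypothetical generic printer: handles nodes shared by all output dialects."""

    def __call__(self, node):
        kind = node[0]
        if kind == "sym":
            return node[1]
        if kind == "add":
            return f"{self(node[1])} + {self(node[2])}"
        if kind == "broadcast":
            # Dialect-specific: delegated to the subclass.
            return self.visit_broadcast(node)
        raise ValueError(f"unknown node kind: {kind}")

    def visit_broadcast(self, node):
        raise NotImplementedError


class ToyCPrinter(ToyBasePrinter):
    """Hypothetical C printer: assumed to refuse IR-only constructs."""

    def visit_broadcast(self, node):
        raise ValueError("vector IR nodes cannot be printed as C")


class ToyIRPrinter(ToyBasePrinter):
    """Hypothetical IR printer: prints every node as pseudocode."""

    def visit_broadcast(self, node):
        lanes, operand = node[1], node[2]
        return f"vec_broadcast<{lanes}>({self(operand)})"
```

With these toy classes, ToyIRPrinter()(("add", ("sym", "x"), ("broadcast", 4, ("sym", "c")))) yields "x + vec_broadcast<4>(c)", while ToyCPrinter raises on the same tree.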

Symbol Table

Extend duplicate_symbol to allow changing the duplicate's data type
Add get_new_symbol to always receive a new symbol, even if the given name is already occupied
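
The intended behaviour of the two symbol table changes might look roughly like the following. This is a self-contained sketch, not the pystencils implementation: ToySymbolTable, the string-valued data types, and the "name__N" renaming scheme are assumptions made for illustration; only the method names get_new_symbol and duplicate_symbol come from this MR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str
    dtype: str


class ToySymbolTable:
    """Hypothetical stand-in for a symbol table."""

    def __init__(self):
        self._symbols = {}

    def get_symbol(self, name, dtype):
        # Return the existing symbol if the name is taken, else create it.
        if name not in self._symbols:
            self._symbols[name] = Symbol(name, dtype)
        return self._symbols[name]

    def get_new_symbol(self, name, dtype):
        # Always create a fresh symbol; append a counter if the name is occupied.
        # The "__N" suffix is an assumed naming scheme.
        candidate, counter = name, 0
        while candidate in self._symbols:
            candidate = f"{name}__{counter}"
            counter += 1
        self._symbols[candidate] = Symbol(candidate, dtype)
        return self._symbols[candidate]

    def duplicate_symbol(self, symbol, new_dtype=None):
        # Duplicate under a fresh name, optionally changing the data type.
        dtype = symbol.dtype if new_dtype is None else new_dtype
        return self.get_new_symbol(symbol.name, dtype)
```

Changing the data type of a duplicate is what a vectorizer needs when it replaces a scalar symbol by its vector counterpart, for example duplicating a "float64" symbol as "float64x4".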

Typification

Fix handling of vectorial boolean and integer types
Add support for PsVecBroadcast and vector memory accesses

Constant Folding

Update EliminateConstants to correctly process vector constants and vector types
Update EliminateConstants to fold PsCasts and PsVecBroadcasts of constants
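
Folding casts and broadcasts of constants can be illustrated on a toy expression tree. The node classes below (Const, VecConst, Cast, Broadcast, Sym) and the fold function are hypothetical and much simpler than EliminateConstants; they only show the idea that a cast or broadcast whose operand is constant can itself be replaced by a constant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class VecConst:
    values: tuple


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Cast:
    target: type  # Python type used as the cast target, e.g. int or float
    operand: object


@dataclass(frozen=True)
class Broadcast:
    lanes: int
    operand: object


def fold(node):
    """Fold casts and broadcasts whose operand folds to a scalar constant."""
    if isinstance(node, Cast):
        inner = fold(node.operand)
        if isinstance(inner, Const):
            return Const(node.target(inner.value))
        return Cast(node.target, inner)
    if isinstance(node, Broadcast):
        inner = fold(node.operand)
        if isinstance(inner, Const):
            return VecConst((inner.value,) * node.lanes)
        return Broadcast(node.lanes, inner)
    # Constants and symbols are left untouched.
    return node
```

In this model, fold(Broadcast(4, Cast(float, Const(2)))) becomes a single four-lane vector constant, whereas a broadcast of a symbol is left as it is.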

AST Vectorization

Introduce the AstVectorizer transformer, which takes a scalar IR subtree and transforms it into a SIMD version of itself along a given iteration axis. At this point, the AstVectorizer can translate constants, symbols, arithmetic and math functions, type casts, and memory accesses with either lane-invariant or affine indices. Vectorization and masking of conditionals and loops are future work.
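
The distinction between lane-invariant and affine indices can be made concrete with a small calculation. The sketch below assumes an index of the form coeff * i + offset, where i is the counter of the vectorized axis; the function names are made up for illustration and are not part of the AstVectorizer. It shows that the lanes of an affine access are separated by a constant stride equal to coeff, and that coeff = 0 gives the same address in every lane.

```python
def scalar_index(coeff, offset, i):
    """Index accessed by the scalar code at counter value i."""
    return coeff * i + offset


def vector_lane_indices(coeff, offset, i, lanes):
    """Indices accessed by one SIMD iteration that covers counters i .. i + lanes - 1."""
    base = scalar_index(coeff, offset, i)
    stride = coeff  # distance between the indices of neighbouring lanes
    return [base + lane * stride for lane in range(lanes)]


def classify(coeff):
    """Classify an index by its dependence on the vectorized counter."""
    return "lane-invariant" if coeff == 0 else "affine"
```

For coeff = 2, offset = 1, i = 4 and four lanes, the lane indices are [9, 11, 13, 15], which matches the scalar indices for i = 4, 5, 6, 7. A lane-invariant index could instead be served by one scalar load followed by a broadcast.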

Loop Vectorization

Introduce the LoopVectorizer, which internally uses the AstVectorizer to transform single scalar loops into SIMD versions of themselves, with optional handling of trailing iterations.
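
The handling of trailing iterations follows the usual pattern of a SIMD main loop followed by a scalar remainder loop. The following sketch emulates that pattern in plain Python; the kernel body (2 * x + 1) and all function names are invented for illustration, and the code does not use or mirror the LoopVectorizer API.

```python
def scalar_kernel(src, dst, i):
    """One scalar iteration of the original loop body."""
    dst[i] = 2.0 * src[i] + 1.0


def vector_kernel(src, dst, i, lanes):
    """One emulated SIMD iteration that processes `lanes` consecutive elements."""
    dst[i:i + lanes] = [2.0 * v + 1.0 for v in src[i:i + lanes]]


def run_vectorized(src, lanes):
    n = len(src)
    dst = [0.0] * n
    # Largest multiple of `lanes` that does not exceed n.
    simd_end = n - n % lanes
    # Main loop: the counter advances by the vector width.
    for i in range(0, simd_end, lanes):
        vector_kernel(src, dst, i, lanes)
    # Trailing iterations: the remaining elements are processed by the scalar body.
    for i in range(simd_end, n):
        scalar_kernel(src, dst, i)
    return dst
```

With ten elements and four lanes, the main loop runs twice (elements 0 to 7) and the remainder loop handles elements 8 and 9.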

Intrinsic Selection

Rename MaterializeVectorIntrinsics to SelectIntrinsics
Refactor intrinsic selection API in GenericVectorCPU to directly receive AST nodes
Implement intrinsic selection pass for constants, symbols, unary and binary operations, and memory accesses
Refactor x86 vector platform: Adapt to new API, fix some errors
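
As a rough picture of what intrinsic selection does for binary operations, the sketch below maps an operation, a scalar type, and a lane count to the name of the matching x86 intrinsic. The function select_binary_intrinsic and its lookup tables are hypothetical and unrelated to the GenericVectorCPU API; only the intrinsic naming convention (for example _mm256_add_pd) is taken from the Intel intrinsics themselves.

```python
# Register width in bits -> intrinsic name prefix
PREFIXES = {128: "_mm", 256: "_mm256", 512: "_mm512"}
# Scalar type -> (bits per lane, intrinsic name suffix)
TYPES = {"float32": (32, "ps"), "float64": (64, "pd")}
# Binary operations covered by this sketch
OPS = {"add", "sub", "mul", "div"}


def select_binary_intrinsic(op, dtype, lanes):
    """Return the x86 intrinsic name for a packed floating-point binary operation."""
    if op not in OPS:
        raise ValueError(f"unsupported operation: {op}")
    if dtype not in TYPES:
        raise ValueError(f"unsupported scalar type: {dtype}")
    bits, suffix = TYPES[dtype]
    width = bits * lanes
    if width not in PREFIXES:
        raise ValueError(f"no x86 vector register with {width} bits")
    return f"{PREFIXES[width]}_{op}_{suffix}"
```

For instance, select_binary_intrinsic("add", "float64", 4) gives "_mm256_add_pd". A real selection pass additionally has to check that the chosen target actually provides the instruction set extension in question.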

Target API

Move Target from .enums to .target; deprecate .enums
Add AVX512_FP16 target
Add automatic detection of available vector architectures on the current machine to Target

Future Work

This MR provides basic infrastructure for vectorized code generation, together with a test suite for the basic functionality. At this point, kernel vectorization is not yet part of the create_kernel pipeline. Only a limited set of intrinsics has been implemented for x86 so far (gather/scatter, type casts, and others are still missing), and platforms for other hardware (ARM, RISC-V, PPC, ...) are missing altogether. Masked vectorization will also follow in the future.
