Vectorization: Infrastructure and Base Implementation for x86

Frederik Hennig requested to merge fhennig/ast-vectorization into v2.0-dev

This MR introduces the first wave of vectorization infrastructure into the new backend. It also changes and extends the AST, the symbol table, typification, constant folding, code printing, and the Target API.

AST

  • Introduce the ast.vector module for SIMD-related AST Nodes
  • Move PsVectorMemAcc to ast.vector, rename it to PsVecMemAcc, and allow its stride to be an expression
  • Introduce PsVecBroadcast for scalar-to-vector broadcasts

Code Printing

  • Split up the CAstPrinter into generic BasePrinter and C-specific subclass CAstPrinter
  • Introduce IRAstPrinter, a subclass of BasePrinter that prints the entire IR as pseudocode (including untyped nodes, the vector IR, and all non-C constructs)
  • Update PsAstNode.__str__ to call the IRAstPrinter

Symbol Table

  • Extend duplicate_symbol to allow changing the duplicate's data type
  • Add get_new_symbol to always receive a new symbol, even if the given name is already occupied
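The difference between looking up a symbol and always receiving a fresh one can be illustrated with a small stand-alone sketch. Everything below (ToySymbolTable, the Symbol dataclass, the "__<n>" suffix scheme, data types as plain strings) is a hypothetical stand-in for illustration only and is not the symbol table implemented in this MR.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Symbol:
    name: str
    dtype: str  # data types are plain strings in this toy model


@dataclass
class ToySymbolTable:
    """Hypothetical stand-in for a symbol table; not the pystencils class."""

    symbols: dict = field(default_factory=dict)

    def get_symbol(self, name, dtype):
        # Returns the existing symbol if the name is already occupied.
        return self.symbols.setdefault(name, Symbol(name, dtype))

    def get_new_symbol(self, name, dtype):
        # Always returns a fresh symbol: try the plain name first, then
        # append a counter suffix (assumed naming scheme) until it is free.
        candidate = name
        for i in count():
            if candidate not in self.symbols:
                break
            candidate = f"{name}__{i}"
        symbol = Symbol(candidate, dtype)
        self.symbols[candidate] = symbol
        return symbol

    def duplicate_symbol(self, symbol, new_dtype=None):
        # The duplicate keeps the original's type unless a new one is given.
        return self.get_new_symbol(symbol.name, new_dtype or symbol.dtype)
```

In this sketch, a vectorizer could call duplicate_symbol(x, "float64x4") to obtain a vector-typed counterpart of a scalar symbol x without clobbering the original.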

Typification

  • Fix handling of vectorial boolean and integer types
  • Add support for PsVecBroadcast and vector memory accesses

Constant Folding

  • Update EliminateConstants to correctly process vector constants and vector types
  • Update EliminateConstants to fold PsCasts and PsVecBroadcasts of constants
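As a rough illustration of folding a broadcast of a constant, the following sketch rewrites a broadcast node with a constant operand into a vector constant and leaves all other broadcasts untouched. The node classes (Const, VecConst, Sym, Broadcast) and the fold function are assumptions made for this example; how EliminateConstants and the IR actually represent vector constants is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class VecConst:
    values: tuple


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Broadcast:
    lanes: int
    operand: object


def fold(node):
    """Fold broadcasts of constants into vector constants (toy version)."""
    if isinstance(node, Broadcast):
        operand = fold(node.operand)
        if isinstance(operand, Const):
            # A broadcast of a constant is itself a constant: replicate it per lane.
            return VecConst((operand.value,) * node.lanes)
        # Non-constant operands must still be broadcast at run time.
        return Broadcast(node.lanes, operand)
    return node
```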

AST Vectorization

Introduce the AstVectorizer transformer, which takes a scalar IR subtree and transforms it into a SIMD version of itself along a given iteration axis. At this point, the AstVectorizer can translate constants, symbols, arithmetic and math functions, type casts, and memory accesses with either lane-invariant or affine indices. Vectorization and masking of conditionals and loops are left for future work.
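To make the distinction between lane-invariant and affine indices concrete, here is a minimal sketch that computes the per-lane stride of an index expression with respect to the vectorized counter. A stride of 0 means the index is lane-invariant, a nonzero stride means it is affine, and None means the toy analysis cannot classify it. The tuple-based expression encoding and the stride function are hypothetical; they are not the AstVectorizer's actual analysis, and unlike PsVecMemAcc (whose stride may be an expression) this toy only handles integer strides.

```python
def stride(expr, counter):
    """Return the integer stride of `expr` w.r.t. `counter`, or None.

    Expressions are nested tuples (assumed encoding):
    ("const", 3), ("sym", "i"), ("add", a, b), ("mul", a, b).
    """
    kind = expr[0]
    if kind == "const":
        return 0
    if kind == "sym":
        return 1 if expr[1] == counter else 0
    if kind == "add":
        a, b = stride(expr[1], counter), stride(expr[2], counter)
        return None if a is None or b is None else a + b
    if kind == "mul":
        lhs, rhs = expr[1], expr[2]
        a, b = stride(lhs, counter), stride(rhs, counter)
        if a is None or b is None:
            return None
        if a == 0 and b == 0:
            return 0  # product of lane-invariant terms stays lane-invariant
        if lhs[0] == "const":
            return lhs[1] * b
        if rhs[0] == "const":
            return rhs[1] * a
        # Symbolic factors and counter*counter are rejected to keep the toy small.
        return None
    raise ValueError(f"unknown expression kind: {kind}")
```

For example, the index 2*i + 5 has stride 2 along i, an index that only depends on j has stride 0 along i, and i*i is rejected.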

Loop Vectorization

Introduce the LoopVectorizer, which internally uses the AstVectorizer to transform single scalar loops into SIMD versions of themselves, with optional handling of trailing iterations.
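The handling of trailing iterations can be pictured as splitting the iteration space into a main part that advances by the number of vector lanes and a scalar remainder. The sketch below models this in plain Python on an a*x + y kernel; split_loop and saxpy are illustrative names chosen for this example, and the code says nothing about how the LoopVectorizer emits its loops.

```python
def split_loop(start, stop, lanes):
    """Split [start, stop) into a vector range (step `lanes`) and a scalar tail."""
    trip_count = max(stop - start, 0)
    vec_stop = start + (trip_count // lanes) * lanes
    return range(start, vec_stop, lanes), range(vec_stop, stop)


def saxpy(a, x, y, lanes=4):
    """Compute a*x + y, processing `lanes` elements per main-loop iteration."""
    out = list(y)
    vec_iters, tail_iters = split_loop(0, len(x), lanes)
    for i in vec_iters:
        # One "SIMD" step: all lanes i .. i+lanes-1 are updated together.
        out[i:i + lanes] = [a * xv + yv for xv, yv in zip(x[i:i + lanes], y[i:i + lanes])]
    for i in tail_iters:
        # Trailing iterations that do not fill a whole vector run as scalar code.
        out[i] = a * x[i] + y[i]
    return out
```

With 10 iterations and 4 lanes, the main loop covers indices 0 to 7 in two steps and the scalar tail handles indices 8 and 9.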

Intrinsic Selection

  • Rename MaterializeVectorIntrinsics to SelectIntrinsics
  • Refactor intrinsic selection API in GenericVectorCPU to directly receive AST nodes
  • Implement intrinsic selection pass for constants, symbols, unary and binary operations, and memory accesses
  • Refactor x86 vector platform: Adapt to new API, fix some errors
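For readers unfamiliar with intrinsic selection, the following sketch shows the basic idea for binary floating-point arithmetic on x86: the operation, scalar type, and lane count together determine the name of the intrinsic. The select_intrinsic function and its lookup tables are assumptions made for illustration; they do not reproduce the GenericVectorCPU API or the x86 platform code in this MR. The generated names (such as _mm256_add_pd) follow the naming scheme of the Intel intrinsics.

```python
PREFIX = {128: "_mm", 256: "_mm256", 512: "_mm512"}  # by vector width in bits
SUFFIX = {"float32": "ps", "float64": "pd"}          # packed single / packed double
SCALAR_BITS = {"float32": 32, "float64": 64}
OPS = {"+": "add", "-": "sub", "*": "mul", "/": "div"}


def select_intrinsic(op, scalar_type, lanes):
    """Map a binary float operation on `lanes` elements to an x86 intrinsic name."""
    if scalar_type not in SCALAR_BITS or op not in OPS:
        raise ValueError(f"unsupported operation {op!r} on {scalar_type!r}")
    width = SCALAR_BITS[scalar_type] * lanes
    if width not in PREFIX:
        raise ValueError(f"no x86 vector register of width {width} bits")
    return f"{PREFIX[width]}_{OPS[op]}_{SUFFIX[scalar_type]}"
```

A real selection pass additionally has to check that the chosen target supports the required register width, which this sketch omits.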

Target API

  • Move Target from .enums to .target; deprecate .enums
  • Add AVX512_FP16 target
  • Add automatic detection of available vector architectures on the current machine to Target

Future Work

This MR provides basic infrastructure for vectorized code generation and a test suite for the basic functionality. At this point, kernel vectorization is not yet part of the create_kernel pipeline. So far, only a limited set of intrinsics is implemented for x86 (gather/scatter and type casts, among others, are still missing), and platforms for other hardware (ARM, RISC-V, PPC, ...) are missing altogether. Masked vectorization will also follow in the future.