Vector scatter/gather support (!241) · Merge requests · pycodegen / pystencils

Some modern processors support scatter/gather operations in hardware, which makes it possible to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with a non-ideal memory layout:

15% for fzyx without assume_inner_stride_one and with split
20% for fzyx without assume_inner_stride_one
30% for zyxf
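
To illustrate why these layouts need strided access, here is a minimal sketch in plain Python. It assumes that a layout string lists the axes from slowest- to fastest-varying, so that `fzyx` keeps `x` contiguous while `zyxf` interleaves the 19 PDFs of D3Q19. The helper `element_strides` and the block extents are hypothetical and only serve the example; they are not part of the pystencils API.

```python
def element_strides(layout, extents):
    """Return the stride (in elements) of every axis for a layout string.

    Assumption: the layout lists axes from slowest- to fastest-varying,
    so the last letter gets stride 1.
    """
    strides = {}
    stride = 1
    for axis in reversed(layout):
        strides[axis] = stride
        stride *= extents[axis]
    return strides


# Hypothetical block size; f = 19 corresponds to the D3Q19 stencil.
extents = {"x": 8, "y": 4, "z": 4, "f": 19}

for layout in ("fzyx", "zyxf"):
    x_stride = element_strides(layout, extents)["x"]
    print(f"{layout}: stride along x = {x_stride} element(s)")
```

With `zyxf`, consecutive `x` cells are 19 elements apart, so a loop vectorized over `x` cannot use contiguous loads and stores. With `fzyx` the stride is 1 at run time, but without `assume_inner_stride_one` the generated code presumably cannot rely on that and has to handle an arbitrary stride as well.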

AVX2 supports only gather, not scatter, so this requires AVX512. Gather is reportedly quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports scatter/gather, so future ARM processors that implement SVE will benefit too, and they will probably have better hardware support for higher throughput.
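
For readers unfamiliar with these operations, the following scalar emulation in plain Python sketches what a strided gather and scatter do per vector lane. It is only a model of the semantics, not the code pystencils generates; the function names and the example buffer are made up for illustration.

```python
def gather(memory, base, stride, width):
    """Load `width` lanes from memory[base], memory[base + stride], ..."""
    return [memory[base + lane * stride] for lane in range(width)]


def scatter(memory, base, stride, values):
    """Store each lane of `values` to memory[base + lane * stride]."""
    for lane, value in enumerate(values):
        memory[base + lane * stride] = value


# Hypothetical buffer: 4 lanes, elements 3 apart, starting at index 1.
memory = list(range(16))
lanes = gather(memory, base=1, stride=3, width=4)   # picks indices 1, 4, 7, 10
scatter(memory, base=1, stride=3, values=[2 * v for v in lanes])
print(lanes)
print(memory)
```

A kernel that reads and writes strided fields needs both halves: hardware that offers only gather can vectorize the loads but not the stores.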

Fixes #34 (closed)

Edited Apr 28, 2021 by Michael Kuron
