Support ARM64 Streaming SVE
I learned the other day that the Scalable Matrix Extension (SME) introduced with ARMv9.2 also includes a variant of SVE called Streaming SVE. Unlike regular SVE, which executes on the vector unit, SME and Streaming SVE execute on something more like a matrix coprocessor that has higher latency but greater throughput. pystencils doesn't care about the matrix aspect, but it can still use Streaming SVE: for our purposes it's just another vector ISA dialect, so I couldn't resist implementing it.
A processor may support SVE, SME, or both, though I am not sure whether any with SME are shipping yet. Nevertheless, Linux, Clang, and QEMU have supported SME for a year or two, so we can already test it. The new Apple M4 appears to support SME but not SVE, so it might benefit from this PR once it ships in a Mac, though that will require a follow-up PR for CPU detection and compiler flags.
Streaming SVE is enabled via a function attribute (__attribute__((arm_locally_streaming))). Functions marked with __attribute__((arm_streaming_compatible)) can be called from both streaming and non-streaming code. In streaming mode, SME's new matrix instructions (which we don't need) become available and some SVE instructions become unavailable: mostly exotic ones like histogramming or Neon interoperability, but sadly (though not unexpectedly) also the scatter/gather instructions. The changes required in pystencils were thus quite minimal.
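To make the attribute mechanism concrete, here is a minimal sketch of a kernel marked as locally streaming. It is not the code pystencils generates: the names PS_LOCALLY_STREAMING and axpy_kernel are hypothetical, the loop is plain C rather than SVE intrinsics, and the attribute spelling is copied from the description above and may differ between compiler versions. The guard on __ARM_FEATURE_SME is an assumption about how one might keep the same source compiling on targets without SME.

```c
#include <stddef.h>

/* Hypothetical helper macro: expands to the attribute named in the text when
 * the compiler reports SME support, and to nothing otherwise, so the same
 * source still builds on targets without SME. */
#if defined(__ARM_FEATURE_SME)
#define PS_LOCALLY_STREAMING __attribute__((arm_locally_streaming))
#else
#define PS_LOCALLY_STREAMING
#endif

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * With the attribute active, the function enters streaming mode on entry and
 * leaves it on return, so callers do not need to be streaming themselves.
 * The access pattern is contiguous on purpose: scatter/gather is not
 * available in streaming mode. */
PS_LOCALLY_STREAMING
void axpy_kernel(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

An ordinary, non-streaming caller invokes axpy_kernel(n, a, x, y) like any other function; the mode switch is confined to the kernel itself, which is why the attribute alone is enough on the calling side.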