WIP: ARM NEON vectorization
With Apple's new laptops having ARM processors, I thought it might be time to add ARM NEON vectorization to pystencils. I don't currently have hardware to test on, but a bunch of test cases from both pystencils and lbmpy at least compile successfully. A Raspberry Pi 4 might actually be a useful and cheap device to add to CI for this purpose.
This may also become useful once ARM HPC clusters are actually deployed, though these might end up using SVE instead of NEON. I have added a few if-branches for that case, but additional work is needed because SVE's vector width is determined at runtime.
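
To illustrate why the runtime vector width matters, here is a minimal sketch in plain Python. It is not pystencils code: `lanes` and `split_loop` are hypothetical helper names made up for this example. With NEON, the register width is always 128 bits, so the number of elements per vector (and therefore the loop stride and the length of the scalar remainder loop) can be baked in at code-generation time. With SVE, the same quantities depend on a width that is only known once the code runs on the target machine.

```python
NEON_BITS = 128  # NEON registers are always 128 bits wide


def lanes(register_bits, element_bits):
    """Number of elements that fit into one vector register."""
    return register_bits // element_bits


def split_loop(n, lanes_per_vector):
    """Split n iterations into a vectorized part and a scalar tail.

    Returns (vector_end, tail_len): iterations [0, vector_end) are
    processed lanes_per_vector at a time, the remaining tail_len
    iterations are processed one by one.
    """
    vector_end = n - n % lanes_per_vector
    return vector_end, n - vector_end


# NEON with doubles: stride is known when the kernel is generated.
neon_lanes = lanes(NEON_BITS, 64)
neon_split = split_loop(1003, neon_lanes)

# SVE: the register width below (512 bits) is only an assumed example
# value; real code would have to query it at runtime.
sve_lanes = lanes(512, 64)
sve_split = split_loop(1003, sve_lanes)
```

In this sketch the NEON stride is a constant, while the SVE stride would have to become a runtime variable in the generated kernel, which is the part that still needs work.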