ARM has a cache-line zeroing instruction (DC ZVA) that avoids loading data from RAM when it is going to be overwritten anyway. It is kind of a light version of a non-temporal store. I saw this near the end of https://www.youtube.com/watch?v=BP7XD7JHgrI in the context of SVE, but it might be relevant on Neon too. Just wanted to keep a note of this here.
Integrating this into pystencils is probably not completely straightforward, as you first need to check how much would be zeroed (typically 64 bytes on current chips, but not guaranteed to match the cache line size), zero that block, and then write the corresponding amount of data. I am not sure whether the block size is guaranteed to be a multiple of the vector width.
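For reference, the check-zero-write sequence could look roughly like the following C sketch. It is not pystencils code. The function names (`dczid_block_bytes`, `zva_block_bytes`, `copy_with_zva`) are made up for illustration. It assumes GCC/Clang inline assembly and that user-space access to DC ZVA is permitted (the DZP bit in DCZID_EL0 says whether it is). On non-AArch64 targets it falls back to a plain `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a raw DCZID_EL0 value into the DC ZVA block size in bytes.
 * Bits [3:0] (BS) hold log2 of the block size in 4-byte words;
 * bit 4 (DZP) set means DC ZVA is prohibited, reported here as 0. */
size_t dczid_block_bytes(uint64_t dczid)
{
    if (dczid & (1u << 4))
        return 0;
    return (size_t)4 << (dczid & 0xf);
}

/* Block size on the running CPU, or 0 if DC ZVA cannot be used. */
size_t zva_block_bytes(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    return dczid_block_bytes(dczid);
#else
    return 0; /* not AArch64: no DC ZVA */
#endif
}

/* Hypothetical helper: write n bytes from src to dst. Every block that
 * lies completely inside [dst, dst + n) is zeroed with DC ZVA first, so
 * the old contents do not have to be fetched from RAM before the stores. */
void copy_with_zva(void *dst, const void *src, size_t n)
{
#if defined(__aarch64__)
    size_t bs = zva_block_bytes();
    if (bs != 0) {
        assert((bs & (bs - 1)) == 0); /* block size is a power of two */
        uintptr_t p   = ((uintptr_t)dst + bs - 1) & ~(uintptr_t)(bs - 1);
        uintptr_t end = ((uintptr_t)dst + n) & ~(uintptr_t)(bs - 1);
        for (; p < end; p += bs)
            __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
    }
#endif
    memcpy(dst, src, n); /* partial blocks at the edges are only copied */
}
```

In a generated kernel one would more likely issue the DC ZVA for a block right before the stores to that block instead of zeroing the whole range up front; the sketch only shows the alignment handling and the size query.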
There is not a whole lot of information for ARM, but the exact same thing has existed on IBM's PowerPC architecture (!228, merged) for decades. There, a cache line is 128 bytes (can be queried from the kernel via
sysconf(_SC_LEVEL1_DCACHE_LINESIZE)) and can be zeroed with the