pystencils merge requests
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests
2021-03-16T20:29:52+01:00

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/222
ARM Neon: Fix makeVec and add Philox
2021-03-16T20:29:52+01:00
Michael Kuron mkuron@icp.uni-stuttgart.de
I did a quick find-and-replace translation of the Philox from SSE to Neon and noticed that `makeVec` was broken. Tests pass now.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/223
Fix benchmark generation
2021-02-26T09:07:23+01:00
Markus Holzer
Fixes the generation of small benchmarking code with kerncraft and likwid for vectorized and aligned kernels.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/224
Draft: Develop
2023-09-14T11:03:48+02:00
Markus Holzer
This MR adds two features to pystencils. First, the base pointer specification is exposed to the user, which allows producing kernels with less register usage. Second, the summands inside the summation printer are now printed recursively, which allows for more parallelism inside a single core.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/225
WIP: ARM cache line zeroing
2021-04-01T22:59:29+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
ARM has a cache line zero instruction that prevents data that will be overwritten anyway from being loaded from RAM. Kind of a light version of a non-temporal store. Saw this near the end of https://www.youtube.com/watch?v=BP7XD7JHgrI in the context of SVE, but might be relevant on Neon too. Just wanted to keep a note of this here.
Integrating this into pystencils is probably not completely straightforward as you first need to check how much would be zeroed (64 bytes on all current chips, not guaranteed to match the cache line size), zero it, and then write the corresponding amount of data. Not sure if there are guarantees as to whether it's a multiple of the vector width.
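As a rough illustration of the sequence described above (check the zeroed size, zero the block, then write the corresponding amount of data), here is a small Python model that operates on a `bytearray`. It is not pystencils code: the function name, the default sizes, and the slice assignment that stands in for the zeroing instruction are all assumptions made for this sketch.

```python
def store_with_block_zeroing(dest, data, zero_bytes=64, vector_bytes=16):
    """Toy model: zero one block, then fill it with vector-sized stores.

    zero_bytes   -- hypothetical size of the block the zeroing instruction clears
    vector_bytes -- hypothetical SIMD vector width in bytes
    Returns the number of zeroing operations that were issued.
    """
    # The scheme only works if a zeroed block holds a whole number of vectors.
    if zero_bytes % vector_bytes != 0:
        raise ValueError("zeroed block is not a whole number of vectors")
    if len(dest) != len(data) or len(dest) % zero_bytes != 0:
        raise ValueError("buffers must match and be a multiple of the zeroed block")

    zero_ops = 0
    for block in range(0, len(dest), zero_bytes):
        # Stand-in for the hardware zeroing instruction.
        dest[block:block + zero_bytes] = bytes(zero_bytes)
        zero_ops += 1
        # Every byte of the zeroed block must be written afterwards.
        for offset in range(block, block + zero_bytes, vector_bytes):
            dest[offset:offset + vector_bytes] = data[offset:offset + vector_bytes]
    return zero_ops
```

The explicit divisibility check mirrors the open question in the note: if the zeroed size were not a multiple of the vector width, a block could not be filled with whole vector stores.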
There is not a whole lot of information for ARM, but the exact same thing has existed on IBM's PowerPC architecture (!228) for decades. There, a cache line has 128 bytes (can be queried from the kernel via `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)`) and can be zeroed with the `__dcbz` intrinsic.

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/226
Fix field size
2021-03-03T16:41:21+01:00
Markus Holzer
Fixes #32 and #31 and #3 and #7
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/227
Fix installation from git
2021-03-15T10:42:15+01:00
Markus Holzer
If `pystencils` is installed from git and `cython` is not installed you get the error: `gcc: error: pystencils/boundaries/createindexlistcython.c: No such file or directory`. This is because the generated C file is not shipped and thus it is not available.
On PyPI the C file is shipped, so it works.
Fixes #14
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/228
Vectorization improvements
2021-03-29T22:31:22+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
After we cleaned up vectorization support as part of our ARM Neon experiments a few weeks ago (!188, !220, !222), I did the same thing with AltiVec/VSX intrinsics for POWER processors. Adding a new SIMD instruction set to pystencils really is just a matter of some quick find-and-replace now. I had test access to a POWER8 machine today, ran in both little-endian and big-endian mode, and all tests passed. So pystencils now actually supports _all_ SIMD instruction sets out there (ignoring MIPS and SPARC processors, which are essentially dead).
This pull request also contains some minor unrelated changes:
- switches the AES RNG to aligned stores
- adds a missing `pytest.importorskip`
- fixes the `vec_any`/`vec_all` operations (which used to only work on 256 bit doubles)
- removes the `q_registers` argument from `get_vector_instruction_set` because there is no point in using half-width vectors
- fixes the AES-NI RNG on Ice Lake/Tiger Lake processors
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/229
Add type conversion for SP types
2021-04-03T06:01:47+02:00
Markus Holzer
If Assignments are already typed for double precision but the kernel is created for single precision, the assignments should be adapted.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230
Improve non-temporal stores
2021-04-29T08:23:17+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets a cache line to zero without reading it from memory, thus saving memory bandwidth. This pull request makes use of it.
The other important part of non-temporal stores is that they don't pollute the cache. Some architectures like PowerPC can emulate that with a special store instruction that flushes a cache line from cache. It's not available on ARM, but this pull request adds support anyway.
Furthermore, this pull request adds fence instructions for non-temporal stores on x86. This ensures that everything has actually arrived in memory before the kernel returns, thus ensuring that subsequent memory accesses will not get stale data. This was not an issue in practice as the overhead of exiting and calling a kernel would have usually taken enough time for the data to arrive in memory, but it wasn't guaranteed.
Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1, the non-temporal stores are actually slightly slower (so I did not gain anything by implementing this). This is actually not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.
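The check mentioned above, whether a vector is the first one in a cache line, can be pictured with a small Python model. This is only a sketch under assumptions: the function name and the default sizes are hypothetical, and it says nothing about how the generated C code actually performs the test.

```python
def stores_that_start_a_cache_line(start_address, n_vectors,
                                   vector_bytes=16, cache_line_bytes=64):
    """Return the indices of the vector stores whose address begins a cache line.

    In this model, only these stores would be preceded by the zeroing
    instruction; all other stores land in a line that was already zeroed
    (or, for an unaligned start, in a line that must not be zeroed because
    it still holds data the loop does not overwrite).
    """
    return [
        index
        for index in range(n_vectors)
        if (start_address + index * vector_bytes) % cache_line_bytes == 0
    ]


# Aligned start: every fourth 16-byte store opens a new 64-byte line.
print(stores_that_start_a_cache_line(0, 8))
# Start 32 bytes into a line: the first two stores trigger no zeroing.
print(stores_that_start_a_cache_line(32, 8))
```

The per-store cost is a single modulo test on the address, which is consistent with the observation that this check is not where the time goes.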
Fixes #25. Supersedes !225.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/231
fix error introduced in !230
2021-04-14T13:50:47+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
Just looked at the code again and noticed that there is no _stream_ in this instruction set anymore, so this line would have raised an error.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/232
SVE vectorization
2021-04-22T20:19:32+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
To continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2. Compiler support is still a bit wonky -- GCC 10 generates very bloated assembly and produces incorrect code for non-native vector sizes, while Clang 11 misses some obvious optimizations.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/233
Vectorization: improve test coverage
2021-04-21T19:55:08+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
Some things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double.
`maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway, so I removed it.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/234
Sizeless vectorization
2021-05-21T10:11:44+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
Surprisingly easy follow-up to !232 to support sizeless ARM SVE and RISC-V V. It uses some ugly hacks to sneak C functions like `svcntb()` into places that expect Python integers. Python duck-typing and SymPy made it possible. Not sure whether this should be merged as-is, but making it nicer would require re-writing `CBackend`. At least I couldn't think of a better way to obtain the innermost loop counter and loop stop.
Jan Hönig

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/235
maskStore improvements
2021-04-28T15:24:48+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
Little follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previously it couldn't distinguish `store` from `storeMask` on 128-bit vector instruction sets)
- implement a fallback for instruction sets that don't support it natively (turns out this is really easy using a load-blend-store combination)
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/236
Adapt an ifdef for AMD Epyc 7003
2021-04-28T15:23:33+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
The new Zen 3 series has vector AES but no AVX512.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/237
Fix Sympy pipeline
2021-04-26T16:46:20+02:00
Markus Holzer
Fix #35
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/238
Versioneer
2021-04-27T11:02:20+02:00
Markus Holzer
This MR enables Versioneer to have a consistent way for pystencils to get a version string.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/239
Sympy 1.9 support
2021-04-26T18:24:04+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/240
Incorporate header files and compiler flags into object cache hash
2021-04-29T09:29:30+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
Ensure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/241
Vector scatter/gather support
2023-08-18T20:43:16+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de
Some modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with nonideal memory layout:
- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf
AVX2 only supported gather, so this requires AVX512. The internet says it was quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports it, so all future ARM processors will benefit too, and they will probably have better hardware support for higher throughput.
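To make the stride idea concrete, here is a minimal Python model of what a gather and a scatter do to a flat buffer. The function names, the list-based memory, and the example layout are assumptions for illustration only; the real operations are single hardware instructions acting on vector registers.

```python
def gather(memory, base, stride, width):
    """Load `width` elements that are `stride` apart, starting at `base`."""
    return [memory[base + lane * stride] for lane in range(width)]


def scatter(memory, base, stride, values):
    """Store each value to its own strided slot, starting at `base`."""
    for lane, value in enumerate(values):
        memory[base + lane * stride] = value


# Hypothetical layout: 4 cells with 3 values each, the per-cell values stored
# next to each other, so the same value of consecutive cells is 3 slots apart.
memory = list(range(12))
second_value_of_each_cell = gather(memory, base=1, stride=3, width=4)
scatter(memory, base=1, stride=3, values=[10 * v for v in second_value_of_each_cell])
```

With a stride of 1 the same loop degenerates to an ordinary contiguous vector load or store, which is the case the vectorizer could already handle.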
Fixes #34
Markus Holzer