pystencils merge requests

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests 2021-03-16T20:29:52+01:00
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/222 ARM Neon: Fix makeVec and add Philox 2021-03-16T20:29:52+01:00 Michael Kuron mkuron@icp.uni-stuttgart.de

ARM Neon: Fix makeVec and add Philox

I did a quick find-and-replace translation of the Philox from SSE to Neon and noticed that `makeVec` was broken. Tests pass now.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/223 Fix benchmark generation 2021-02-26T09:07:23+01:00 Markus Holzer

Fix benchmark generation

Fixes the generation of small benchmarking code with kerncraft and likwid for vectorized and aligned kernels.
Bug Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/224 Draft: Develop 2023-09-14T11:03:48+02:00 Markus Holzer

Draft: Develop

This MR adds two features to pystencils. First, the base pointer specification is exposed to the user, which allows producing kernels with lower register usage. Second, the summands inside the summation printer are now printed recursively, which allows for more parallelism inside a single core.
feature Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/225 WIP: ARM cache line zeroing 2021-04-01T22:59:29+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

WIP: ARM cache line zeroing

ARM has a cache line zero instruction that prevents data that will be overwritten anyway from being loaded from RAM. Kind of a light version of a non-temporal store. Saw this near the end of https://www.youtube.com/watch?v=BP7XD7JHgrI in the context of SVE, but it might be relevant on Neon too. Just wanted to keep a note of this here.

Integrating this into pystencils is probably not completely straightforward, as you first need to check how much would be zeroed (64 bytes on all current chips, not guaranteed to match the cache line size), zero it, and then write the corresponding amount of data. Not sure if there are guarantees as to whether it's a multiple of the vector width.

There is not a whole lot of information for ARM, but the exact same thing has existed on IBM's PowerPC architecture (!228) for decades. There, a cache line has 128 bytes (can be queried from the kernel via `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)`) and can be zeroed with the `__dcbz` intrinsic.
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/226 Fix field size 2021-03-03T16:41:21+01:00 Markus Holzer

Fix field size

Fixes #32 and #31 and #3 and #7
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/227 Fix installation from git 2021-03-15T10:42:15+01:00 Markus Holzer

Fix installation from git

If `pystencils` is installed from git and `cython` is not installed, you get the error: `gcc: error: pystencils/boundaries/createindexlistcython.c: No such file or directory`. This is because the generated C file is not shipped in the git repository and is therefore not available. On PyPI the C file is shipped, so the installation works there. Fixes #14
Bug Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/228 Vectorization improvements 2021-03-29T22:31:22+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Vectorization improvements

After we cleaned up vectorization support as part of our ARM Neon experiments a few weeks ago (!188, !220, !222), I did the same thing with AltiVec/VSX intrinsics for POWER processors. Adding a new SIMD instruction set to pystencils really is just a matter of some quick find-and-replace now. I had test access to a POWER8 machine today, ran in both little-endian and big-endian mode, and all tests passed. So pystencils now actually supports _all_ SIMD instruction sets out there (ignoring MIPS and SPARC processors, which are essentially dead).

This pull request also contains some minor unrelated changes:

- switches the AES RNG to aligned stores
- adds a missing `pytest.importorskip`
- fixes the `vec_any`/`vec_all` operations (which used to only work on 256 bit doubles)
- removes the `q_registers` argument from `get_vector_instruction_set` because there is no point in using half-width vectors
- fixes the AES-NI RNG on Ice Lake/Tiger Lake processors

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/229 Add type conversion for SP types 2021-04-03T06:01:47+02:00 Markus Holzer

Add type conversion for SP types

If assignments are already typed for double precision but the kernel is created for single precision, the assignments should be adapted.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230 Improve non-temporal stores 2021-04-29T08:23:17+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Improve non-temporal stores

ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets a cache line to zero without reading it from memory, thus saving memory bandwidth. This pull request makes use of it.

The other important part of non-temporal stores is that they don't pollute the cache. Some architectures like PowerPC can emulate that with a special store instruction that flushes a cache line from cache. It's not available on ARM, but this pull request adds support anyway.

Furthermore, this pull request adds fence instructions for non-temporal stores on x86. This ensures that everything has actually arrived in memory before the kernel returns, thus ensuring that subsequent memory accesses will not get stale data. This was not an issue in practice, as the overhead of exiting and calling a kernel would usually have taken enough time for the data to arrive in memory, but it wasn't guaranteed.

Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1, the non-temporal stores are actually slightly slower (so I did not gain anything by implementing this). This is not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.

Fixes #25. Supersedes !225.
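The check for whether a vector store is the first one in a cache line comes down to simple address arithmetic. The sketch below only illustrates that arithmetic and is not the code pystencils generates: the function name `plan_stores` and its parameters are hypothetical, and the 64-byte zeroing block is an assumption (the real size has to be queried from the hardware).

```python
def plan_stores(base_address, num_vectors, vector_bytes=16, zero_block_bytes=64):
    """Return (address, zero_first) pairs for a run of contiguous vector stores.

    zero_first is True when the store starts a zeroing block AND the whole
    block will be overwritten by the following stores. Without the second
    condition, zeroing would destroy data that is never rewritten.
    All names and default sizes here are illustrative assumptions.
    """
    end_address = base_address + num_vectors * vector_bytes
    plan = []
    for i in range(num_vectors):
        address = base_address + i * vector_bytes
        starts_block = address % zero_block_bytes == 0
        block_fully_written = address + zero_block_bytes <= end_address
        plan.append((address, starts_block and block_fully_written))
    return plan


if __name__ == "__main__":
    # Six 16-byte vectors starting at address 32: only the block at 64 is
    # both block-aligned and completely covered by the stores.
    for address, zero_first in plan_stores(32, 6):
        print(address, "zero block first" if zero_first else "plain store")
```

In generated code the same test would run once per vector store, which is the overhead the description above refers to.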
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/231 fix error introduced in !230 2021-04-14T13:50:47+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

fix error introduced in !230

Just looked at the code again and noticed that there is no _stream_ in this instruction set anymore, so this line would have raised an error.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/232 SVE vectorization 2021-04-22T20:19:32+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

SVE vectorization

To continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2. Compiler support is still a bit wonky -- GCC 10 generates very bloated assembly and produces incorrect code for non-native vector sizes, while Clang 11 misses some obvious optimizations.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/233 Vectorization: improve test coverage 2021-04-21T19:55:08+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Vectorization: improve test coverage

Some things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double. `maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway, so I removed it.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/234 Sizeless vectorization 2021-05-21T10:11:44+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Sizeless vectorization

Surprisingly easy follow-up to !232 to support sizeless ARM SVE and RISC-V V. It uses some ugly hacks to sneak C functions like `svcntb()` into places that expect Python integers. Python duck-typing and SymPy made it possible. Not sure whether this should be merged as-is, but making it nicer would require re-writing `CBackend`. At least I couldn't think of a better way to obtain the innermost loop counter and loop stop.
Jan Hönig Jan Hönig https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/235 maskStore improvements 2021-04-28T15:24:48+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

maskStore improvements

Little follow-up to !233 after I thought about it again.

- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previously it couldn't distinguish `store` from `storeMask` on 128-bit vector instruction sets)
- implement a fallback for instruction sets that don't support it natively (turns out this is really easy using a load-blend-store combination)

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/236 Adapt an ifdef for AMD Epyc 7003 2021-04-28T15:23:33+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de
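The load-blend-store fallback mentioned in the maskStore improvements (!235) can be illustrated with a scalar Python model. This is a sketch of the idea only, not the intrinsics pystencils emits; the function name `mask_store_fallback` and the list-based "memory" are hypothetical stand-ins.

```python
def mask_store_fallback(memory, offset, values, mask):
    """Emulate a masked vector store with load, blend and a full store.

    memory: list modelling a contiguous array
    offset: index of the first lane in memory
    values: new lane values (one per vector lane)
    mask:   booleans, True where the new value should be written
    """
    width = len(values)
    # Load: read the full vector that currently sits at the target location.
    loaded = memory[offset:offset + width]
    # Blend: take the new value where the mask is set, else keep the old one.
    blended = [new if keep else old
               for old, new, keep in zip(loaded, values, mask)]
    # Store: write the whole vector back with an ordinary, unmasked store.
    memory[offset:offset + width] = blended


if __name__ == "__main__":
    data = [0.0] * 6
    mask_store_fallback(data, 1, [1.0, 2.0, 3.0, 4.0], [True, False, True, False])
    print(data)
```

Note that, unlike a native masked store, this variant reads and rewrites the masked-out lanes, so the full vector must lie inside valid memory.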

Adapt an ifdef for AMD Epyc 7003

The new Zen 3 series has vector AES but no AVX512.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/237 Fix Sympy pipeline 2021-04-26T16:46:20+02:00 Markus Holzer

Fix Sympy pipeline

Fix #35
Bug Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/238 Versioneer 2021-04-27T11:02:20+02:00 Markus Holzer

Versioneer

This MR enables Versioneer so that pystencils has a consistent way to obtain its version string.
feature Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/239 Sympy 1.9 support 2021-04-26T18:24:04+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Sympy 1.9 support

- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions

fixes #35, fixes !237

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/240 Incorporate header files and compiler flags into object cache hash 2021-04-29T09:29:30+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Incorporate header files and compiler flags into object cache hash

Ensure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/241 Vector scatter/gather support 2023-08-18T20:43:16+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de
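The idea behind the object cache hash change (!240) is that the cache key has to depend on everything that influences the compiled object, not just the generated source. A minimal sketch of such a key, assuming SHA-256 and a hypothetical helper `object_cache_key` (this is not the actual pystencils implementation):

```python
import hashlib


def object_cache_key(source_code, compiler_flags, header_contents):
    """Build a cache key from source, compiler flags and header contents.

    source_code:     generated C code as a string
    compiler_flags:  list of flag strings, order preserved
    header_contents: dict mapping header name -> file contents as bytes
    All names are illustrative; a zero byte separates the fields so that
    adjacent values cannot run together.
    """
    digest = hashlib.sha256()
    digest.update(source_code.encode())
    for flag in compiler_flags:
        digest.update(b"\0" + flag.encode())
    # Sort header names so the key does not depend on dict ordering.
    for name in sorted(header_contents):
        digest.update(b"\0" + name.encode() + b"\0" + header_contents[name])
    return digest.hexdigest()


if __name__ == "__main__":
    headers = {"custom_intrinsics.h": b"#pragma once\n"}
    print(object_cache_key("int f();", ["-O3"], headers))
```

Changing a flag or editing a header yields a different key, so a stale object file is never picked up from the cache.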

Vector scatter/gather support

Some modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with nonideal memory layout:

- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf

AVX2 only supports gather (not scatter), so this requires AVX512. The internet says gather was quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports it, so all future ARM processors will benefit too, and they will probably have better hardware support for higher throughput.

Fixes #34
Markus Holzer Markus Holzer
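What gather and scatter do for a strided access can be modelled in a few lines of scalar Python. The helpers `gather` and `scatter` below are hypothetical and only describe the semantics of one vector-wide access; in hardware all lanes are handled by a single instruction.

```python
def gather(memory, base, stride, width):
    """Load `width` lanes from memory[base], memory[base + stride], ..."""
    return [memory[base + lane * stride] for lane in range(width)]


def scatter(memory, base, stride, values):
    """Store each lane of `values` to memory[base + lane * stride]."""
    for lane, value in enumerate(values):
        memory[base + lane * stride] = value


if __name__ == "__main__":
    field = list(range(12))
    lanes = gather(field, 1, 3, 4)           # elements 1, 4, 7, 10
    scatter(field, 1, 3, [2 * x for x in lanes])
    print(lanes, field)
```

With a stride of 1 both helpers reduce to an ordinary contiguous vector load and store, which is the case the vectorizer could already handle.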