pystencils merge requests
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests
2021-04-29T08:57:00+02:00

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/242
Disallow OpenMP + blocking + cacheline-zero
2021-04-29T08:57:00+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

The loop over the blocks is OpenMP-collapsed, so blocks might be worked on simultaneously. If the innermost block size does not align with a cache line and non-temporal stores are enabled on architectures that only do cacheline-zeroing (!230), threads would then erase each other's data. So we disallow the problematic combination.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230
Improve non-temporal stores
2021-04-29T08:23:17+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets a cacheline to zero without reading it from memory, thus saving memory bandwidth. This pull request makes use of it.
The other important part of non-temporal stores is that they don't pollute the cache. Some architectures like PowerPC can emulate that with a special store instruction that flushes a cache line from cache. It's not available on ARM, but this pull request adds support for it anyway.
Furthermore, this pull request adds fence instructions for non-temporal stores on x86. This ensures that everything has actually arrived in memory before the kernel returns, so that subsequent memory accesses will not get stale data. This was not an issue in practice, as the overhead of exiting and calling a kernel would usually have taken enough time for the data to arrive in memory, but it wasn't guaranteed.
Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1, the non-temporal stores are actually slightly slower (so I did not gain anything by implementing this). This is actually not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.
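To make the cacheline-zeroing scheme and the hazard from !242 concrete, here is a minimal Python simulation. It is only a sketch under stated assumptions: the 64-byte line size, the 16-byte vector width, and all function names are hypothetical and not taken from pystencils. It models a store that zeroes the whole cache line whenever a vector is the first one in that line, and shows that two blocks sharing a line can clobber each other.

```python
CACHE_LINE = 64  # bytes; assumed line size for this sketch
VECTOR = 16      # bytes; assumed vector width (e.g. a 128-bit register)


def zeroing_store(memory, address, payload):
    """Store payload at address; if the store starts a cache line,
    zero the entire line first (a stand-in for a zero-then-write scheme)."""
    if address % CACHE_LINE == 0:
        memory[address:address + CACHE_LINE] = bytes(CACHE_LINE)
    memory[address:address + len(payload)] = payload


def write_block(memory, start, size, value):
    """Fill [start, start + size) one vector at a time with zeroing stores."""
    for address in range(start, start + size, VECTOR):
        zeroing_store(memory, address, bytes([value]) * VECTOR)


def block_size_is_safe(block_bytes):
    """Blocks can be written concurrently only if none shares a cache line."""
    return block_bytes % CACHE_LINE == 0


def run(block_bytes):
    """Write the second block first, then the first block, which is one
    interleaving that two threads working on adjacent blocks could produce."""
    total = 2 * block_bytes
    padded = -(-total // CACHE_LINE) * CACHE_LINE  # round up to whole lines
    memory = bytearray(padded)
    write_block(memory, block_bytes, block_bytes, 2)  # "thread" B
    write_block(memory, 0, block_bytes, 1)            # "thread" A
    return memory


if __name__ == "__main__":
    for size in (128, 96):
        result = run(size)
        lost = result[size:2 * size].count(0)
        print(size, "safe" if block_size_is_safe(size) else "unsafe",
              "bytes lost in second block:", lost)
```

With a 96-byte block, the first block's last line extends into the second block, so zeroing it wipes data the other "thread" already wrote. A check like `block_size_is_safe` is the kind of condition that lets a code generator reject the combination up front.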
Fixes #25. Supersedes !225.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/235
maskStore improvements
2021-04-28T15:24:48+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

Little follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previously it couldn't distinguish `store` from `maskStore` on 128-bit vector instruction sets)
- implement a fallback for instruction sets that don't support it natively (turns out this is really easy using a load-blend-store combination)

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/236
Adapt an ifdef for AMD Epyc 7003
2021-04-28T15:23:33+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

The new Zen 3 series has vector AES but no AVX512.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/238
Versioneer
2021-04-27T11:02:20+02:00
Markus Holzer

This MR enables Versioneer to have a consistent way for pystencils to get a version string.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/239
Sympy 1.9 support
2021-04-26T18:24:04+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/232
SVE vectorization
2021-04-22T20:19:32+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

To continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2. Compiler support is still a bit wonky -- GCC 10 generates very bloated assembly and produces incorrect code for non-native vector sizes, while Clang 11 misses some obvious optimizations.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/233
Vectorization: improve test coverage
2021-04-21T19:55:08+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

Some things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double.
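For readers unfamiliar with `maskStore`: it writes only the vector lanes selected by a mask and leaves the other memory locations untouched. The load-blend-store fallback mentioned for !235 can be sketched in plain Python as follows. This is an illustration of the idea only; the function names are hypothetical, and lists stand in for vector registers and memory.

```python
def blend(old, new, mask):
    """Per lane, pick the new value where the mask is set, else the old one."""
    return [n if m else o for o, n, m in zip(old, new, mask)]


def mask_store_fallback(memory, offset, vector, mask):
    """Emulate a masked store with three unmasked steps."""
    width = len(vector)
    old = memory[offset:offset + width]       # load the current contents
    merged = blend(old, vector, mask)         # blend in the selected lanes
    memory[offset:offset + width] = merged    # store the full vector back


if __name__ == "__main__":
    data = [0.0] * 6
    mask_store_fallback(data, 1, [1.0, 2.0, 3.0, 4.0],
                        [True, False, True, False])
    print(data)
```

Unlike a native masked store, this variant reads and rewrites the masked-off lanes too, so the whole vector-wide memory range has to be valid and must not be modified concurrently by another thread.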
`maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway, so I removed it.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/231
fix error introduced in !230
2021-04-14T13:50:47+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

Just looked at the code again and noticed that there is no _stream_ in this instruction set anymore, so this line would have raised an error.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/228
Vectorization improvements
2021-03-29T22:31:22+02:00
Michael Kuron mkuron@icp.uni-stuttgart.de

After we cleaned up vectorization support as part of our ARM Neon experiments a few weeks ago (!188, !220, !222), I did the same thing with AltiVec/VSX intrinsics for POWER processors. Adding a new SIMD instruction set to pystencils really is just a matter of some quick find-and-replace now. I had test access to a POWER8 machine today, ran in both little-endian and big-endian mode, and all tests passed. So pystencils now actually supports _all_ SIMD instruction sets out there (ignoring MIPS and SPARC processors, which are essentially dead).
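To illustrate why a new instruction set can amount to find-and-replace, here is a small table-driven sketch. The dictionary layout, the operation keys, and the `emit` helper are assumptions made for this example and are not pystencils' actual tables. The intrinsic names themselves are the standard double-precision ones from SSE2, Neon (AArch64), and VSX.

```python
# Hypothetical templates: one entry per abstract operation and instruction set.
# {0}, {1} are placeholders for the operand expressions.
INSTRUCTION_SETS = {
    "sse": {
        "+": "_mm_add_pd({0}, {1})",
        "*": "_mm_mul_pd({0}, {1})",
        "loadU": "_mm_loadu_pd({0})",
        "storeU": "_mm_storeu_pd({0}, {1})",
    },
    "neon": {
        "+": "vaddq_f64({0}, {1})",
        "*": "vmulq_f64({0}, {1})",
        "loadU": "vld1q_f64({0})",
        "storeU": "vst1q_f64({0}, {1})",
    },
    "vsx": {
        "+": "vec_add({0}, {1})",
        "*": "vec_mul({0}, {1})",
        "loadU": "vec_xl(0, {0})",
        # VSX takes the value first, then offset and pointer.
        "storeU": "vec_xst({1}, 0, {0})",
    },
}


def emit(instruction_set, operation, *operands):
    """Render one abstract operation as a C intrinsic call string."""
    return INSTRUCTION_SETS[instruction_set][operation].format(*operands)


if __name__ == "__main__":
    for name in INSTRUCTION_SETS:
        total = emit(name, "+", emit(name, "loadU", "&a[i]"), "b")
        print(emit(name, "storeU", "&c[i]", total) + ";")
```

Because the code generator only ever refers to the abstract keys, supporting another architecture in a scheme like this means adding one more dictionary with the same keys.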
This pull request also contains some minor unrelated changes:
- switches the AES RNG to aligned stores
- adds a missing `pytest.importorskip`
- fixes the `vec_any`/`vec_all` operations (which used to only work on 256 bit doubles)
- removes the `q_registers` argument from `get_vector_instruction_set` because there is no point in using half-width vectors
- fixes the AES-NI RNG on Ice Lake/Tiger Lake processors

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/204
int64_t prevents warnings when comparing loop variable to arguments which are int64_t by default
2021-03-18T16:58:59+01:00
Dominik Thoennes dominik.thoennes@fau.de

The default type for integers as arguments is `int64_t`.
This change prevents warnings when comparing arguments with loop variables, which are currently `int`.

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/222
ARM Neon: Fix makeVec and add Philox
2021-03-16T20:29:52+01:00
Michael Kuron mkuron@icp.uni-stuttgart.de

I did a quick find-and-replace translation of the Philox from SSE to Neon and noticed that `makeVec` was broken. Tests pass now.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/220
Improve ARM64 support
2021-03-16T20:29:52+01:00
Michael Kuron mkuron@icp.uni-stuttgart.de

All tests except for the vectorized random number generators now pass on Apple's new ARM64 computers.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/188
Neon intrinsics
2021-03-16T20:29:52+01:00
Markus Holzer

This MR implements Neon intrinsics to enable vectorization for the ARM architecture.
This may also become useful once ARM HPC clusters actually get deployed, though these might end up using SVE instead of NEON. For that case, additional work is needed because SVE's vector width is determined at runtime.

Michael Kuron mkuron@icp.uni-stuttgart.de

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/227
Fix installation from git
2021-03-15T10:42:15+01:00
Markus Holzer

If `pystencils` is installed from git and `cython` is not installed, you get the error: `gcc: error: pystencils/boundaries/createindexlistcython.c: No such file or directory`. This is because the generated C file is not shipped in the repository and is therefore not available.
On PyPI the C file is shipped, so installation works there.
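One way a setup script can handle both cases is to pick the extension source based on what is available. The sketch below is not the actual fix from this MR: the helper name, the `.pyx` file name, and the choice to skip the extension when neither source is usable are all assumptions for illustration.

```python
import os


def index_list_sources(source_dir, cython_available, exists=os.path.exists):
    """Choose the source file for the boundary index list extension.

    Returns a list suitable for an Extension's sources, or an empty list
    if the extension cannot be built (hypothetical policy for this sketch).
    """
    pyx_file = os.path.join(source_dir, "createindexlistcython.pyx")  # assumed name
    c_file = os.path.join(source_dir, "createindexlistcython.c")
    if cython_available:
        # Git checkout with Cython installed: regenerate the C file.
        return [pyx_file]
    if exists(c_file):
        # Source distribution that ships the pre-generated C file.
        return [c_file]
    # Git checkout without Cython: nothing to compile.
    return []


if __name__ == "__main__":
    try:
        import Cython  # noqa: F401
        have_cython = True
    except ImportError:
        have_cython = False
    print(index_list_sources("pystencils/boundaries", have_cython))
```

The `exists` parameter is only there so the three cases can be exercised without touching the file system.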
Fixes #14

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/226
Fix field size
2021-03-03T16:41:21+01:00
Markus Holzer

Fixes #32 and #31 and #3 and #7

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/223
Fix benchmark generation
2021-02-26T09:07:23+01:00
Markus Holzer

Fixes the generation of small benchmarking code with kerncraft and likwid for vectorized and aligned kernels.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/221
Ship c-file with pypi
2021-02-22T22:59:51+01:00
Markus Holzer

Ship the generated C-File for boundary creation with pypi

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/219
Make RNG vectorization more robust
2021-02-22T16:32:21+01:00
Michael Kuron mkuron@icp.uni-stuttgart.de

Follow-up to https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/216. Needed for https://i10git.cs.fau.de/walberla/walberla/-/merge_requests/414.

Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/216
some fixes for lbmpy vectorization
2021-02-21T12:58:19+01:00
Michael Kuron mkuron@icp.uni-stuttgart.de

Follow-up to !212 and new feature for https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/65.

Markus Holzer