pystencils merge requests
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests
2021-05-26T13:44:59+02:00

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/247
OpenCL RNG
2021-05-26T13:44:59+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Unfortunately, most OpenCL implementations don't support C++.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/246
ship C-file
2021-05-11T09:31:17+02:00
Markus Holzer
Shipping the generated C files to PyPI is a good idea since it is less error-prone: a new Cython version might handle the provided pyx file in a way we did not intend.
The best practice is described in more detail here:
http://blog.behnel.de/posts/ship-generated-c-code-or-not.html
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/245
Cherry-pick non-sizeless SVE things from !234
2021-05-13T15:42:59+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/244
Add CI job for non-x86 vectorization
2021-05-06T09:43:35+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Depends on https://i10git.cs.fau.de/pycodegen/pycodegen/-/merge_requests/12. A job for ARM SVE vectorization will be added in a future pull request, as it does not appear to run completely stably yet.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/243
atomically write cache status file
2021-05-03T14:19:21+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Try to fix https://i10git.cs.fau.de/pycodegen/pycodegen/-/jobs/568263, introduced in !240
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/242
Disallow OpenMP + blocking + cacheline-zero
2021-04-29T08:57:00+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
The loop over the blocks is OpenMP-collapsed, so blocks might be worked on simultaneously.
If the innermost block size does not align with a cache line and non-temporal stores are enabled on architectures that only do cacheline-zeroing (!230), threads would then erase each other's data. So we disallow the problematic combination.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/241
Vector scatter/gather support
2023-08-18T20:43:16+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Some modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with nonideal memory layout:
- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf
AVX2 only supports gather (not scatter), so this requires AVX512. The internet says gather was quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports it, so all future ARM processors will benefit too, and they will probably have better hardware support for higher throughput.
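To make the idea concrete, here is a rough model of what gather and scatter do for a strided loop. It is only an illustration in plain Python, not the code pystencils generates: lists stand in for memory and vector registers, and the names `gather`, `scatter`, and `scale_strided` are made up for this sketch.

```python
# Illustrative model only: a Python list plays the role of memory, and a short
# list plays the role of one vector register. All names here are hypothetical.

def gather(memory, base, stride, width):
    """Load `width` elements that lie `stride` apart, starting at `base`."""
    return [memory[base + lane * stride] for lane in range(width)]


def scatter(memory, base, stride, vector):
    """Store each lane of `vector` back to memory, `stride` elements apart."""
    for lane, value in enumerate(vector):
        memory[base + lane * stride] = value


def scale_strided(memory, base, stride, count, factor, width=8):
    """Multiply `count` strided elements by `factor`, `width` lanes at a time."""
    full = count - count % width  # number of elements covered by whole vectors
    for start in range(0, full, width):
        offset = base + start * stride
        vec = gather(memory, offset, stride, width)   # one strided vector load
        vec = [factor * v for v in vec]               # element-wise vector op
        scatter(memory, offset, stride, vec)          # one strided vector store
    # Scalar remainder loop for the elements that do not fill a whole vector.
    for i in range(full, count):
        memory[base + i * stride] *= factor
```

With a stride of 1 the gather and scatter degenerate into ordinary contiguous loads and stores, which is why layouts with unit inner stride do not need them.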
Fixes #34
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/240
Incorporate header files and compiler flags into object cache hash
2021-04-29T09:29:30+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Ensure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/239
Sympy 1.9 support
2021-04-26T18:24:04+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/238
Versioneer
2021-04-27T11:02:20+02:00
Markus Holzer
This MR enables Versioneer so that pystencils has a consistent way to get a version string.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/236
Adapt an ifdef for AMD Epyc 7003
2021-04-28T15:23:33+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
The new Zen 3 series has vector AES but no AVX512.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/235
maskStore improvements
2021-04-28T15:24:48+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Little follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previously it couldn't distinguish `store` from `maskStore` on 128-bit vector instruction sets)
- implement a fallback for instruction sets that don't support it natively (turns out this is really easy using a load-blend-store combination)
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/234
Sizeless vectorization
2021-05-21T10:11:44+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Surprisingly easy follow-up to !232 to support sizeless ARM SVE and RISC-V V. It uses some ugly hacks to sneak C functions like `svcntb()` into places that expect Python integers. Python duck-typing and SymPy made it possible. Not sure whether this should be merged as-is, but making it nicer would require re-writing `CBackend`. At least I couldn't think of a better way to obtain the innermost loop counter and loop stop.
Jan Hönig

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/233
Vectorization: improve test coverage
2021-04-21T19:55:08+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Some things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double. `maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway, so I removed it.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/232
SVE vectorization
2021-04-22T20:19:32+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
To continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2. Compiler support is still a bit wonky: GCC 10 generates very bloated assembly and produces incorrect code for non-native vector sizes, while Clang 11 misses some obvious optimizations.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/231
fix error introduced in !230
2021-04-14T13:50:47+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
Just looked at the code again and noticed that there is no _stream_ in this instruction set anymore, so this line would have raised an error.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230
Improve non-temporal stores
2021-04-29T08:23:17+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets a cacheline to zero without reading it from memory, thus saving memory bandwidth. This pull request makes use of it.
The other important part of non-temporal stores is that they don't pollute the cache. Some architectures like PowerPC can emulate that with a special store instruction that flushes a cache line from cache. It's not available on ARM, but this pull request adds support anyway.
Furthermore, this pull request adds fence instructions for non-temporal stores on x86. This ensures that everything has actually arrived in memory before the kernel returns, thus ensuring that subsequent memory accesses will not get stale data. This was not an issue in practice as the overhead of exiting and calling a kernel would have usually taken enough time for the data to arrive in memory, but it wasn't guaranteed.
Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1, the non-temporal stores are actually slightly slower (so I did not gain anything by implementing this). This is actually not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.
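A comparison like that can be as simple as timing both variants and keeping the winner. The following is a minimal sketch using only the standard library. `pick_faster` and the dictionary keys in the usage note are hypothetical names, and the callables stand in for two compiled versions of the same kernel; nothing here is part of the pystencils API.

```python
import timeit


def pick_faster(candidates, repeat=5, number=10):
    """Time each zero-argument callable and return (best_name, timings).

    `candidates` maps a label to a callable; each callable is assumed to run
    one kernel variant on already-allocated data.
    """
    timings = {}
    for name, kernel in candidates.items():
        # Take the minimum over several repeats to reduce noise from other
        # processes; each repeat runs the kernel `number` times.
        timings[name] = min(timeit.repeat(kernel, repeat=repeat, number=number))
    best = min(timings, key=timings.get)
    return best, timings
```

Usage would look like `pick_faster({"regular": run_regular, "nontemporal": run_nontemporal})`, where both callables are placeholders for however the two kernel variants are invoked on the target machine.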
Fixes #25. Supersedes !225.
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/228
Vectorization improvements
2021-03-29T22:31:22+02:00
Michael Kuron (mkuron@icp.uni-stuttgart.de)
After we cleaned up vectorization support as part of our ARM Neon experiments a few weeks ago (!188, !220, !222), I did the same thing with AltiVec/VSX intrinsics for POWER processors. Adding a new SIMD instruction set to pystencils really is just a matter of some quick find-and-replace now. I had test access to a POWER8 machine today, ran in both little-endian and big-endian mode, and all tests passed. So pystencils now actually supports _all_ SIMD instruction sets out there (ignoring MIPS and SPARC processors, which are essentially dead).
This pull request also contains some minor unrelated changes:
- switches the AES RNG to aligned stores
- adds a missing `pytest.importorskip`
- fixes the `vec_any`/`vec_all` operations (which used to only work on 256 bit doubles)
- removes the `q_registers` argument from `get_vector_instruction_set` because there is no point in using half-width vectors
- fixes the AES-NI RNG on Ice Lake/Tiger Lake processors
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/227
Fix installation from git
2021-03-15T10:42:15+01:00
Markus Holzer
If `pystencils` is installed from git and `cython` is not installed, you get the error: `gcc: error: pystencils/boundaries/createindexlistcython.c: No such file or directory`. This is because the generated C file is not shipped and thus is not available.
On PyPI the C file is shipped, and thus it works.
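One common way to handle both cases in a build script is to prefer the `.pyx` source when Cython is importable and fall back to the shipped `.c` file otherwise, failing with a clear message if neither route works. The sketch below only illustrates that pattern and is not the actual pystencils `setup.py`; the function name `pick_extension_source` and its parameters are assumptions made for this example.

```python
import importlib.util
import os


def pick_extension_source(stem, cython_available=None, file_exists=os.path.exists):
    """Return the source file to build for the extension module at `stem`.

    `stem` is the path without suffix. If Cython is importable, the .pyx file
    is used; otherwise the pre-generated .c file must already be present.
    """
    if cython_available is None:
        # Detect Cython without importing it.
        cython_available = importlib.util.find_spec("Cython") is not None
    if cython_available:
        return stem + ".pyx"
    c_file = stem + ".c"
    if not file_exists(c_file):
        raise RuntimeError(
            f"{c_file} is missing and Cython is not installed; "
            "install Cython or use a release that ships the generated C file"
        )
    return c_file
```

The point of the explicit check is that the user sees an actionable message instead of a bare compiler error about a missing file.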
Fixes #14
Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/226
Fix field size
2021-03-03T16:41:21+01:00
Markus Holzer
Fixes #32 and #31 and #3 and #7
Markus Holzer