pystencils merge requests
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests
2021-05-21T10:11:44+02:00

Sizeless vectorization
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/234
2021-05-21T10:11:44+02:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Surprisingly easy follow-up to !232 to support sizeless ARM SVE and RISC-V V. It uses some ugly hacks to sneak C functions like `svcntb()` into places that expect Python integers. Python duck-typing and SymPy made it possible. Not sure whether this should be merged as-is, but making it nicer would require re-writing `CBackend`. At least I couldn't think of a better way to obtain the innermost loop counter and loop stop.

Assignee: Jan Hönig

Vectorization: improve test coverage
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/233
2021-04-21T19:55:08+02:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Some things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double. `maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway, so I removed it.

Assignee: Markus Holzer

SVE vectorization
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/232
2021-04-22T20:19:32+02:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

To continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2.
Compiler support is still a bit wonky -- GCC 10 generates very bloated assembly and produces incorrect code for non-native vector sizes, while Clang 11 misses some obvious optimizations.

Assignee: Markus Holzer

fix error introduced in !230
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/231
2021-04-14T13:50:47+02:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Just looked at the code again and noticed that there is no _stream_ in this instruction set anymore, so this line would have raised an error.

Assignee: Markus Holzer

Improve non-temporal stores
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230
2021-04-29T08:23:17+02:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets a cacheline to zero without reading it from memory, thus saving memory bandwidth. This pull request makes use of it.
The other important part of non-temporal stores is that they don't pollute the cache. Some architectures like PowerPC can emulate that with a special store instruction that flushes a cache line from cache. It's not available on ARM, but this pull request adds support anyway.
Furthermore, this pull request adds fence instructions for non-temporal stores on x86. This ensures that everything has actually arrived in memory before the kernel returns, thus ensuring that subsequent memory accesses will not get stale data. This was not an issue in practice as the overhead of exiting and calling a kernel would have usually taken enough time for the data to arrive in memory, but it wasn't guaranteed.
Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1, the non-temporal stores are actually slightly slower (so I did not gain anything by implementing this). This is actually not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.
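
The advice to compare both variants and keep the faster one can be sketched in a few lines of Python. This is only an illustration and not part of the merge request: `pick_faster` and the two stand-in kernels are hypothetical names, and in practice the callables would wrap the same kernel compiled once with regular stores and once with non-temporal stores.

```python
import timeit


def pick_faster(candidates, repeat=5, number=20):
    """Time each zero-argument callable and return (best_name, timings).

    `timings` maps each name to its best (minimum) wall time in seconds
    for `number` consecutive calls.
    """
    timings = {
        name: min(timeit.repeat(kernel, repeat=repeat, number=number))
        for name, kernel in candidates.items()
    }
    best_name = min(timings, key=timings.get)
    return best_name, timings


# Hypothetical stand-ins (assumption): replace these with the real
# compiled kernels, one per store variant.
def kernel_regular_stores():
    sum(range(1_000))


def kernel_nontemporal_stores():
    sum(range(1_000))


best, timings = pick_faster({
    "regular": kernel_regular_stores,
    "non-temporal": kernel_nontemporal_stores,
})
print(f"use {best} stores:", timings)
```

Taking the minimum over several repeats reduces the influence of unrelated system noise, which matters when the two variants differ only slightly, as reported above for the Apple M1.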
Fixes #25. Supersedes !225.

Assignee: Markus Holzer

Vectorization improvements
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/228
2021-03-29T22:31:22+02:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

After we cleaned up vectorization support as part of our ARM Neon experiments a few weeks ago (!188, !220, !222), I did the same thing with AltiVec/VSX intrinsics for POWER processors. Adding a new SIMD instruction set to pystencils really is just a matter of some quick find-and-replace now. I had test access to a POWER8 machine today, ran in both little-endian and big-endian mode, and all tests passed. So pystencils now actually supports _all_ SIMD instruction sets out there (ignoring MIPS and SPARC processors, which are essentially dead).
This pull request also contains some minor unrelated changes:
- switches the AES RNG to aligned stores
- adds a missing `pytest.importorskip`
- fixes the `vec_any`/`vec_all` operations (which used to only work on 256 bit doubles)
- removes the `q_registers` argument from `get_vector_instruction_set` because there is no point in using half-width vectors
- fixes the AES-NI RNG on Ice Lake/Tiger Lake processors

Assignee: Markus Holzer

Fix installation from git
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/227
2021-03-15T10:42:15+01:00, Markus Holzer

If `pystencils` is installed from git and `cython` is not installed, you get the error: `gcc: error: pystencils/boundaries/createindexlistcython.c: No such file or directory`. This is because the generated C file is not shipped and thus it is not available.
On PyPI, the C file is shipped, so installation works there.
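
A common way to handle this situation in a `setup.py` is to build from the Cython source when Cython is available and fall back to the pre-generated C file otherwise. The sketch below illustrates that pattern only; it is not taken from the merge request, `pick_extension_source` is a hypothetical helper, and the `.pyx` file name is an assumption derived from the `.c` path in the error message.

```python
import importlib.util
import os


def pick_extension_source(stem, have_cython, c_file_exists):
    """Choose which file to build the extension module from.

    stem          -- source path without its extension
    have_cython   -- True if Cython can be imported
    c_file_exists -- True if the pre-generated C file is present
    """
    if have_cython:
        return stem + ".pyx"  # Cython regenerates the C file itself
    if c_file_exists:
        return stem + ".c"    # e.g. a source distribution that ships it
    raise RuntimeError(
        f"{stem}.c is missing and Cython is not installed; "
        "install Cython to build from a git checkout"
    )


# Path taken from the error message above; the .pyx sibling is assumed.
stem = "pystencils/boundaries/createindexlistcython"
have_cython = importlib.util.find_spec("Cython") is not None
c_file_exists = os.path.exists(stem + ".c")

try:
    print("building from", pick_extension_source(stem, have_cython, c_file_exists))
except RuntimeError as err:
    print("cannot build:", err)
```

Failing early with an explicit message is friendlier than letting the compiler report a missing file, which is the symptom described above.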
Fixes #14

Assignee: Markus Holzer

Fix field size
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/226
2021-03-03T16:41:21+01:00, Markus Holzer

Fixes #32 and #31 and #3 and #7

Assignee: Markus Holzer

Fix benchmark generation
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/223
2021-02-26T09:07:23+01:00, Markus Holzer

Fixes the generation of small benchmarking code with kerncraft and likwid for vectorized and aligned kernels.

Assignee: Markus Holzer

ARM Neon: Fix makeVec and add Philox
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/222
2021-03-16T20:29:52+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

I did a quick find-and-replace translation of the Philox from SSE to Neon and noticed that `makeVec` was broken.
Tests pass now.

Assignee: Markus Holzer

Ship c-file with pypi
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/221
2021-02-22T22:59:51+01:00, Markus Holzer

Ship the generated C-File for boundary creation with pypi

Assignee: Markus Holzer

Improve ARM64 support
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/220
2021-03-16T20:29:52+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

All tests except for the vectorized random number generators now pass on Apple's new ARM64 computers.

Assignee: Markus Holzer

Make RNG vectorization more robust
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/219
2021-02-22T16:32:21+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Follow-up to https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/216.
Needed for https://i10git.cs.fau.de/walberla/walberla/-/merge_requests/414.

Assignee: Markus Holzer

Update setup.py
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/218
2021-02-20T12:44:25+01:00, Markus Holzer

Add kerncraft templates
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/217
2021-02-20T12:23:43+01:00, Markus Holzer

The kerncraft coupling works only if these templates are available.

Assignee: Markus Holzer

some fixes for lbmpy vectorization
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/216
2021-02-21T12:58:19+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Follow-up to !212 and new feature for https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/65.

Assignee: Markus Holzer

fix aligned_alloc on windows
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/215
2021-02-19T15:15:44+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Another attempt at https://i10git.cs.fau.de/pycodegen/pystencils/-/issues/24.
Follow-up to https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/214.

Assignee: Markus Holzer

default to C++17 on macOS and Windows
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/214
2021-02-19T13:24:17+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Fixes https://i10git.cs.fau.de/pycodegen/pystencils/-/issues/24 and https://i10git.cs.fau.de/pycodegen/lbmpy/-/jobs/537199

Assignee: Markus Holzer

Fix clear cache on startup
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/213
2021-02-18T14:05:01+01:00, Stephan Seitz

Due to weird circular dependencies `clear_cache` will not be
available at this point. Inlining this function will make
the config key `cache.clear_cache_on_start: true` work again.

vectorization: improve treatment of unary minus
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/212
2021-02-19T16:41:53+01:00, Michael Kuron <mkuron@icp.uni-stuttgart.de>

Changes in !48 caused test failures in https://i10git.cs.fau.de/pycodegen/lbmpy/-/jobs/534348.
Also add a test for vectorized unary minus and `sp.Abs`, suppress a warning with older versions of randomgen and make double-precision vector RNG accurate on Clang in fast-math mode.
The errors in test_resting_fluid and test_point_force are fixed by https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/63. The error in test_phi_staggered_equivalence_on_random is fixed by https://i10git.cs.fau.de/pycodegen/pygrandchem/-/merge_requests/4.

Assignee: Markus Holzer