pystencils merge requestshttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests2021-06-08T08:33:34+02:00https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/251Use int64 for indexing2021-06-08T08:33:34+02:00Markus HolzerUse int64 for indexingFor indexed kernels, int32 is too small for large domain sizes. Thus the coordinates are cast to int64 in this MR to allow huge domain sizes.
As an example of the adaption the generated code for a Neumann boundary is shown. Before:
```...For indexed kernels, int32 is too small for large domain sizes. Thus the coordinates are cast to int64 in this MR to allow huge domain sizes.
As an example of the adaption the generated code for a Neumann boundary is shown. Before:
```cpp
FUNC_PREFIX void kernel(double * RESTRICT _data_C, uint8_t * RESTRICT const _data_indexField, int64_t const _size_indexField_0, int64_t const _stride_indexField_0)
{
#pragma omp parallel
{
#pragma omp for schedule(static)
for (int64_t ctr_0 = 0; ctr_0 < _size_indexField_0; ctr_0 += 1)
{
const int32_t x = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0]));
const int32_t y = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 4]));
const int64_t cx [] = { 0, 0, 0, -1, 1, -1, 1, -1, 1 };
const int64_t cy [] = { 0, 1, -1, 0, 0, 1, 1, -1, -1 };
const int invdir [] = { 0, 2, 1, 4, 3, 8, 7, 6, 5 };
const int32_t dir = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 8]));
_data_C[x + 11*y] = _data_C[x + 11*y + cx[dir] + 11*cy[dir]];
}
}
}
```
After:
```cpp
FUNC_PREFIX void kernel(double * RESTRICT _data_C, uint8_t * RESTRICT const _data_indexField, int64_t const _size_indexField_0, int64_t const _stride_indexField_0)
{
#pragma omp parallel
{
#pragma omp for schedule(static)
for (int64_t ctr_0 = 0; ctr_0 < _size_indexField_0; ctr_0 += 1)
{
const int64_t x = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0]));
const int64_t y = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 4]));
const int64_t cx [] = { 0, 0, 0, -1, 1, -1, 1, -1, 1 };
const int64_t cy [] = { 0, 1, -1, 0, 0, 1, 1, -1, -1 };
const int64_t invdir [] = { 0, 2, 1, 4, 3, 8, 7, 6, 5 };
const int64_t dir = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 8]));
_data_C[x + 11*y] = _data_C[x + 11*y + cx[dir] + 11*cy[dir]];
}
}
}
```Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/250Switch index type from int32 to int642021-06-07T13:38:27+02:00Markus HolzerSwitch index type from int32 to int64For large domain sizes, int32 is not sufficient. Thus it is planned for waLBerla to change `cell_index_t` from `int` to `int64`. To make it consistent with pystencils and to prevent conversion warnings the index type for pystencils is al...For large domain sizes, int32 is not sufficient. Thus it is planned for waLBerla to change `cell_index_t` from `int` to `int64`. To make it consistent with pystencils and to prevent conversion warnings the index type for pystencils is also adapted to int64
Fixes https://i10git.cs.fau.de/pycodegen/lbmpy/-/issues/18Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/249Undo some changes from !248 that are no longer needed2021-05-27T19:41:41+02:00Michael Kuronmkuron@icp.uni-stuttgart.deUndo some changes from !248 that are no longer neededIt turns out these were only needed before I moved the vectorization of the `RNGBase` objects to the right place. The vectorized C printer does actually print scalar code when it is passed scalar variables and field accesses.It turns out these were only needed before I moved the vectorization of the `RNGBase` objects to the right place. The vectorized C printer does actually print scalar code when it is passed scalar variables and field accesses.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/248Fix RNG vectorization for LB2021-05-27T19:41:42+02:00Michael Kuronmkuron@icp.uni-stuttgart.deFix RNG vectorization for LBReported by @RudolfWeeber. Not sure why it wasn't caught by the pycodegen nightly job. Added new tests to lbmpy in https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/82.
Forgotten type check in `_print_Function`:
```
File "pyst...Reported by @RudolfWeeber. Not sure why it wasn't caught by the pycodegen nightly job. Added new tests to lbmpy in https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/82.
Forgotten type check in `_print_Function`:
```
File "pystencils/pystencils/backends/cbackend.py", line 673, in _print_Function
(isinstance(arg, TypedSymbol) and arg.dtype.is_int())
AttributeError: 'VectorType' object has no attribute 'is_int'
```
Generation of invalid tail loops when `assume_sufficient_line_padding=False`:
```c++
_data_pdfs_tmp_20_30_10[ctr_0] = _mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(xi_88,_mm256_set_pd(0.142857142857143,0.142857142857143,0.142857142857143,0.142857142857143)),_mm256_mul_pd(xi_89,_mm256_set_pd(0.2,0.2,0.2,0.2))),_mm256_mul_pd(xi_91,_mm256_set_pd(-1.0,-1.0,-1.0,-1.0))),_mm256_mul_pd(xi_92,_mm256_set_pd(0.0857142857142857,0.0857142857142857,0.0857142857142857,0.0857142857142857))),_mm256_set_pd(xi_108*-0.5,xi_108*-0.5,xi_108*-0.5,xi_108*-0.5)),_mm256_set_pd(xi_112*0.0238095238095238,xi_112*0.0238095238095238,xi_112*0.0238095238095238,xi_112*0.0238095238095238)),_mm256_set_pd(xi_95*0.1,xi_95*0.1,xi_95*0.1,xi_95*0.1)),_mm256_set_pd(xi_98*0.0428571428571429,xi_98*0.0428571428571429,xi_98*0.0428571428571429,xi_98*0.0428571428571429)),_mm256_set_pd(_data_pdfs_20_30_10[ctr_0],_data_pdfs_20_30_10[ctr_0],_data_pdfs_20_30_10[ctr_0],_data_pdfs_20_30_10[ctr_0])),_mm256_set_pd(forceTerm_0,forceTerm_0,forceTerm_0,forceTerm_0));
```
Generation of partially-vectorized code with `assume_inner_stride_is_one=False` in conjunction with the random number generator:
```c++
const double xi_22 = -xi_64 + xi_66;
const float64x2_t xi_26 = vaddq_f64(random_1_1,vdupq_n_f64(-0.5));
```Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/247OpenCL RNG2021-05-26T13:44:59+02:00Michael Kuronmkuron@icp.uni-stuttgart.deOpenCL RNGUnfortunately most OpenCL implementations don't support C++Unfortunately most OpenCL implementations don't support C++Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/246ship C-file2021-05-11T09:31:17+02:00Markus Holzership C-fileShipping the generated C-files to pypi is a good idea since it is less error-prone. A New Cython version might deal with the provided pyx file in a way we did not intend.
In more detail the best practice can be found here:
http://blog.b...Shipping the generated C-files to pypi is a good idea since it is less error-prone. A New Cython version might deal with the provided pyx file in a way we did not intend.
In more detail the best practice can be found here:
http://blog.behnel.de/posts/ship-generated-c-code-or-not.htmlMarkus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/245Cherry-pick non-sizeless SVE things from !2342021-05-13T15:42:59+02:00Michael Kuronmkuron@icp.uni-stuttgart.deCherry-pick non-sizeless SVE things from !234Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/244Add CI job for non-x86 vectorization2021-05-06T09:43:35+02:00Michael Kuronmkuron@icp.uni-stuttgart.deAdd CI job for non-x86 vectorizationDepends on https://i10git.cs.fau.de/pycodegen/pycodegen/-/merge_requests/12. A job for ARM SVE vectorization will be added in a future pull request as it appears to not be running completely stable yet.Depends on https://i10git.cs.fau.de/pycodegen/pycodegen/-/merge_requests/12. A job for ARM SVE vectorization will be added in a future pull request as it appears to not be running completely stable yet.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/243atomically write cache status file2021-05-03T14:19:21+02:00Michael Kuronmkuron@icp.uni-stuttgart.deatomically write cache status fileTry to fix https://i10git.cs.fau.de/pycodegen/pycodegen/-/jobs/568263, introduced in !240Try to fix https://i10git.cs.fau.de/pycodegen/pycodegen/-/jobs/568263, introduced in !240Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/242Disallow OpenMP + blocking + cacheline-zero2021-04-29T08:57:00+02:00Michael Kuronmkuron@icp.uni-stuttgart.deDisallow OpenMP + blocking + cacheline-zeroThe loop over the blocks is OpenMP-collapsed, so blocks might be worked on simultaneously. If the innermost block size does not align with a cache line and non-temporal stores are enabled on architectures that only do cacheline-zeroing (...The loop over the blocks is OpenMP-collapsed, so blocks might be worked on simultaneously. If the innermost block size does not align with a cache line and non-temporal stores are enabled on architectures that only do cacheline-zeroing (!230), threads would then erase each others' data. So we disallow the problematic combination.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/241Vector scatter/gather support2023-08-18T20:43:16+02:00Michael Kuronmkuron@icp.uni-stuttgart.deVector scatter/gather supportSome modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X...Some modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with nonideal memory layout:
- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf
AVX2 only supported gather, so this requires AVX512. The internet says it was quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports it, so all future ARM processors will benefit too, and they will probably have better hardware support for higher throughput.
Fixes #34Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/240Incorporate header files and compiler flags into object cache hash2021-04-29T09:29:30+02:00Michael Kuronmkuron@icp.uni-stuttgart.deIncorporate header files and compiler flags into object cache hashEnsure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.Ensure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/239Sympy 1.9 support2021-04-26T18:24:04+02:00Michael Kuronmkuron@icp.uni-stuttgart.deSympy 1.9 support- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/238Versioneer2021-04-27T11:02:20+02:00Markus HolzerVersioneerThis MR enables Versioneer to have a consistent way for pystencils to get a version stringThis MR enables Versioneer to have a consistent way for pystencils to get a version stringMarkus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/237Fix Sympy pipeline2021-04-26T16:46:20+02:00Markus HolzerFix Sympy pipelineFix #35Fix #35Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/236Adapt an ifdef for AMD Epyc 70032021-04-28T15:23:33+02:00Michael Kuronmkuron@icp.uni-stuttgart.deAdapt an ifdef for AMD Epyc 7003The new Zen 3 series has vector AES but no AVX512.The new Zen 3 series has vector AES but no AVX512.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/235maskStore improvements2021-04-28T15:24:48+02:00Michael Kuronmkuron@icp.uni-stuttgart.demaskStore improvementsLittle follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previou...Little follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previously it couldn't distinguish `store` from `storeMask` on 128-bit vector instruction sets)
- implement a fallback for instruction sets that don't support it natively (turns out this is really easy using a load-blend-store combination)Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/234Sizeless vectorization2021-05-21T10:11:44+02:00Michael Kuronmkuron@icp.uni-stuttgart.deSizeless vectorizationSurprisingly easy follow-up to !232 to support sizeless ARM SVE and RISC-V V. It uses some ugly hacks to sneak C functions like `svcntb()` into places that expect Python integers. Python duck-typing and SymPy made it possible. Not sure w...Surprisingly easy follow-up to !232 to support sizeless ARM SVE and RISC-V V. It uses some ugly hacks to sneak C functions like `svcntb()` into places that expect Python integers. Python duck-typing and SymPy made it possible. Not sure whether this should be merged as-is, but making it nicer would require re-writing `CBackend`. At least I couldn't think of a better way to obtain the innermost loop counter and loop stop.Jan HönigJan Hönighttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/233Vectorization: improve test coverage2021-04-21T19:55:08+02:00Michael Kuronmkuron@icp.uni-stuttgart.deVectorization: improve test coverageSome things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double. `maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway...Some things were not tested with all available vectorizations. `maskStore` was previously untested and only worked with AVX512 float/double and AVX double. `maskLoad` was clearly broken, unused, and is a pretty useless instruction anyway, so I removed it.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/232SVE vectorization2021-04-22T20:19:32+02:00Michael Kuronmkuron@icp.uni-stuttgart.deSVE vectorizationTo continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2. Compiler support is still a bit wonky -- GCC 10 generates very bloated assembly and produces incorrect code for non-native vector size...To continue my vectorization spree, here is the version with ARM SVE instructions. Tested in QEMU 5.2. Compiler support is still a bit wonky -- GCC 10 generates very bloated assembly and produces incorrect code for non-native vector sizes, while Clang 11 misses some obvious optimizations.Markus HolzerMarkus Holzer