pystencils merge requestshttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests2021-07-05T11:55:34+02:00https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/256Add citations to ReadMe2021-07-05T11:55:34+02:00Markus HolzerAdd citations to ReadMepystencils citations have been added to the ReadMe. Furthermore, the authors list is reorderd and the GitLab CI is adapted to have a pre test stage.pystencils citations have been added to the ReadMe. Furthermore, the authors list is reorderd and the GitLab CI is adapted to have a pre test stage.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/255Remove gmpy workaround2021-06-26T15:43:54+02:00Markus HolzerRemove gmpy workaroundThe gmpy workaroung is not needed anymore because this problem is fixed in smpy with https://github.com/sympy/sympy/pull/13530The gmpy workaroung is not needed anymore because this problem is fixed in smpy with https://github.com/sympy/sympy/pull/13530Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/254FVM: Choose better stencil for derivative in flux for D3Q272021-06-15T07:26:29+02:00Michael Kuronmkuron@icp.uni-stuttgart.deFVM: Choose better stencil for derivative in flux for D3Q27As reported by @Tischler, the FVM discretization does not use the correct stencils for fluxes with derivatives in D3Q27. The result is not wrong, but uses more neighbors than necessary.
This merge request adds a test case and improves t...As reported by @Tischler, the FVM discretization does not use the correct stencils for fluxes with derivatives in D3Q27. The result is not wrong, but uses more neighbors than necessary.
This merge request adds a test case and improves the stencil-choosing heuristic. It now produces the same two-point finite differences that a human would choose by optimizing the free weights later in the process.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/253Use closest normal for boundary index list with single_link2021-06-18T12:07:29+02:00Markus HolzerUse closest normal for boundary index list with single_linkFor creating the index list just the first stencil entry was taken which is a neighbour of the investigated cell if `single_link=True`. With this MR the discrete normal is calculated and the neighbouring cell in the normal direction is t...For creating the index list just the first stencil entry was taken which is a neighbour of the investigated cell if `single_link=True`. With this MR the discrete normal is calculated and the neighbouring cell in the normal direction is taken to build up the index array.
Furthermore, the computational cost of the python versions for `create_boundary_index_list` is reduced drastically because the iteration space is now restricted to the boundary cells and not the entire domain anymore.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/252Fixes for buffers in loops with step size > 12021-06-08T08:54:17+02:00Frederik HennigFixes for buffers in loops with step size > 1This MR introduces some additions and fixes for generating CPU loops with step sizes > 1:
- The CPU `create_kernel` function now exposes a flag to disable the double field write check
- Rewrote `get_base_buffer_index` to use pure integ...This MR introduces some additions and fixes for generating CPU loops with step sizes > 1:
- The CPU `create_kernel` function now exposes a flag to disable the double field write check
- Rewrote `get_base_buffer_index` to use pure integer arithmetic, and corrected the computation of the buffer base index
to correctly incorporate loop step sizes. Added test case to check correctness.
- Added rudimentary `evalf` functionality to integer division sympy function `int_div` (its absence lead to an infinite recursion during code generation).
- Added correct printing of integer-typed expressions in `CustomSympyPrinter._typed_number`.https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/251Use int64 for indexing2021-06-08T08:33:34+02:00Markus HolzerUse int64 for indexingFor indexed kernels, int32 is too small for large domain sizes. Thus the coordinates are cast to int64 in this MR to allow huge domain sizes.
As an example of the adaption the generated code for a Neumann boundary is shown. Before:
```...For indexed kernels, int32 is too small for large domain sizes. Thus the coordinates are cast to int64 in this MR to allow huge domain sizes.
As an example of the adaption the generated code for a Neumann boundary is shown. Before:
```cpp
FUNC_PREFIX void kernel(double * RESTRICT _data_C, uint8_t * RESTRICT const _data_indexField, int64_t const _size_indexField_0, int64_t const _stride_indexField_0)
{
#pragma omp parallel
{
#pragma omp for schedule(static)
for (int64_t ctr_0 = 0; ctr_0 < _size_indexField_0; ctr_0 += 1)
{
const int32_t x = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0]));
const int32_t y = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 4]));
const int64_t cx [] = { 0, 0, 0, -1, 1, -1, 1, -1, 1 };
const int64_t cy [] = { 0, 1, -1, 0, 0, 1, 1, -1, -1 };
const int invdir [] = { 0, 2, 1, 4, 3, 8, 7, 6, 5 };
const int32_t dir = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 8]));
_data_C[x + 11*y] = _data_C[x + 11*y + cx[dir] + 11*cy[dir]];
}
}
}
```
After:
```cpp
FUNC_PREFIX void kernel(double * RESTRICT _data_C, uint8_t * RESTRICT const _data_indexField, int64_t const _size_indexField_0, int64_t const _stride_indexField_0)
{
#pragma omp parallel
{
#pragma omp for schedule(static)
for (int64_t ctr_0 = 0; ctr_0 < _size_indexField_0; ctr_0 += 1)
{
const int64_t x = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0]));
const int64_t y = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 4]));
const int64_t cx [] = { 0, 0, 0, -1, 1, -1, 1, -1, 1 };
const int64_t cy [] = { 0, 1, -1, 0, 0, 1, 1, -1, -1 };
const int64_t invdir [] = { 0, 2, 1, 4, 3, 8, 7, 6, 5 };
const int64_t dir = *((int32_t *)(& _data_indexField[12*_stride_indexField_0*ctr_0 + 8]));
_data_C[x + 11*y] = _data_C[x + 11*y + cx[dir] + 11*cy[dir]];
}
}
}
```Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/249Undo some changes from !248 that are no longer needed2021-05-27T19:41:41+02:00Michael Kuronmkuron@icp.uni-stuttgart.deUndo some changes from !248 that are no longer neededIt turns out these were only needed before I moved the vectorization of the `RNGBase` objects to the right place. The vectorized C printer does actually print scalar code when it is passed scalar variables and field accesses.It turns out these were only needed before I moved the vectorization of the `RNGBase` objects to the right place. The vectorized C printer does actually print scalar code when it is passed scalar variables and field accesses.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/248Fix RNG vectorization for LB2021-05-27T19:41:42+02:00Michael Kuronmkuron@icp.uni-stuttgart.deFix RNG vectorization for LBReported by @RudolfWeeber. Not sure why it wasn't caught by the pycodegen nightly job. Added new tests to lbmpy in https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/82.
Forgotten type check in `_print_Function`:
```
File "pyst...Reported by @RudolfWeeber. Not sure why it wasn't caught by the pycodegen nightly job. Added new tests to lbmpy in https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/82.
Forgotten type check in `_print_Function`:
```
File "pystencils/pystencils/backends/cbackend.py", line 673, in _print_Function
(isinstance(arg, TypedSymbol) and arg.dtype.is_int())
AttributeError: 'VectorType' object has no attribute 'is_int'
```
Generation of invalid tail loops when `assume_sufficient_line_padding=False`:
```c++
_data_pdfs_tmp_20_30_10[ctr_0] = _mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(xi_88,_mm256_set_pd(0.142857142857143,0.142857142857143,0.142857142857143,0.142857142857143)),_mm256_mul_pd(xi_89,_mm256_set_pd(0.2,0.2,0.2,0.2))),_mm256_mul_pd(xi_91,_mm256_set_pd(-1.0,-1.0,-1.0,-1.0))),_mm256_mul_pd(xi_92,_mm256_set_pd(0.0857142857142857,0.0857142857142857,0.0857142857142857,0.0857142857142857))),_mm256_set_pd(xi_108*-0.5,xi_108*-0.5,xi_108*-0.5,xi_108*-0.5)),_mm256_set_pd(xi_112*0.0238095238095238,xi_112*0.0238095238095238,xi_112*0.0238095238095238,xi_112*0.0238095238095238)),_mm256_set_pd(xi_95*0.1,xi_95*0.1,xi_95*0.1,xi_95*0.1)),_mm256_set_pd(xi_98*0.0428571428571429,xi_98*0.0428571428571429,xi_98*0.0428571428571429,xi_98*0.0428571428571429)),_mm256_set_pd(_data_pdfs_20_30_10[ctr_0],_data_pdfs_20_30_10[ctr_0],_data_pdfs_20_30_10[ctr_0],_data_pdfs_20_30_10[ctr_0])),_mm256_set_pd(forceTerm_0,forceTerm_0,forceTerm_0,forceTerm_0));
```
Generation of partially-vectorized code with `assume_inner_stride_is_one=False` in conjunction with the random number generator:
```c++
const double xi_22 = -xi_64 + xi_66;
const float64x2_t xi_26 = vaddq_f64(random_1_1,vdupq_n_f64(-0.5));
```Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/247OpenCL RNG2021-05-26T13:44:59+02:00Michael Kuronmkuron@icp.uni-stuttgart.deOpenCL RNGUnfortunately most OpenCL implementations don't support C++Unfortunately most OpenCL implementations don't support C++Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/246ship C-file2021-05-11T09:31:17+02:00Markus Holzership C-fileShipping the generated C-files to pypi is a good idea since it is less error-prone. A New Cython version might deal with the provided pyx file in a way we did not intend.
In more detail the best practice can be found here:
http://blog.b...Shipping the generated C-files to pypi is a good idea since it is less error-prone. A New Cython version might deal with the provided pyx file in a way we did not intend.
In more detail the best practice can be found here:
http://blog.behnel.de/posts/ship-generated-c-code-or-not.htmlMarkus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/245Cherry-pick non-sizeless SVE things from !2342021-05-13T15:42:59+02:00Michael Kuronmkuron@icp.uni-stuttgart.deCherry-pick non-sizeless SVE things from !234Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/244Add CI job for non-x86 vectorization2021-05-06T09:43:35+02:00Michael Kuronmkuron@icp.uni-stuttgart.deAdd CI job for non-x86 vectorizationDepends on https://i10git.cs.fau.de/pycodegen/pycodegen/-/merge_requests/12. A job for ARM SVE vectorization will be added in a future pull request as it appears to not be running completely stable yet.Depends on https://i10git.cs.fau.de/pycodegen/pycodegen/-/merge_requests/12. A job for ARM SVE vectorization will be added in a future pull request as it appears to not be running completely stable yet.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/243atomically write cache status file2021-05-03T14:19:21+02:00Michael Kuronmkuron@icp.uni-stuttgart.deatomically write cache status fileTry to fix https://i10git.cs.fau.de/pycodegen/pycodegen/-/jobs/568263, introduced in !240Try to fix https://i10git.cs.fau.de/pycodegen/pycodegen/-/jobs/568263, introduced in !240Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/242Disallow OpenMP + blocking + cacheline-zero2021-04-29T08:57:00+02:00Michael Kuronmkuron@icp.uni-stuttgart.deDisallow OpenMP + blocking + cacheline-zeroThe loop over the blocks is OpenMP-collapsed, so blocks might be worked on simultaneously. If the innermost block size does not align with a cache line and non-temporal stores are enabled on architectures that only do cacheline-zeroing (...The loop over the blocks is OpenMP-collapsed, so blocks might be worked on simultaneously. If the innermost block size does not align with a cache line and non-temporal stores are enabled on architectures that only do cacheline-zeroing (!230), threads would then erase each others' data. So we disallow the problematic combination.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/241Vector scatter/gather support2023-08-18T20:43:16+02:00Michael Kuronmkuron@icp.uni-stuttgart.deVector scatter/gather supportSome modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X...Some modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with nonideal memory layout:
- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf
AVX2 only supported gather, so this requires AVX512. The internet says it was quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports it, so all future ARM processors will benefit too, and they will probably have better hardware support for higher throughput.
Fixes #34Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/240Incorporate header files and compiler flags into object cache hash2021-04-29T09:29:30+02:00Michael Kuronmkuron@icp.uni-stuttgart.deIncorporate header files and compiler flags into object cache hashEnsure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.Ensure that code is recompiled when one of our custom headers is changed or the compiler flags are modified.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/239Sympy 1.9 support2021-04-26T18:24:04+02:00Michael Kuronmkuron@icp.uni-stuttgart.deSympy 1.9 support- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237- deepcopy support was broken due to https://github.com/sympy/sympy/pull/21260
- clean up some constructors
- fix detection of sympy development versions
fixes #35, fixes !237Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/238Versioneer2021-04-27T11:02:20+02:00Markus HolzerVersioneerThis MR enables Versioneer to have a consistent way for pystencils to get a version stringThis MR enables Versioneer to have a consistent way for pystencils to get a version stringMarkus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/236Adapt an ifdef for AMD Epyc 70032021-04-28T15:23:33+02:00Michael Kuronmkuron@icp.uni-stuttgart.deAdapt an ifdef for AMD Epyc 7003The new Zen 3 series has vector AES but no AVX512.The new Zen 3 series has vector AES but no AVX512.Markus HolzerMarkus Holzerhttps://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/235maskStore improvements2021-04-28T15:24:48+02:00Michael Kuronmkuron@icp.uni-stuttgart.demaskStore improvementsLittle follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previou...Little follow-up to !233 after I thought about it again.
- fix the aligned version (it was using `maskStore` in some instruction sets and `maskStoreA` in others)
- make sure the test case is incommensurate with the vector width (previously it couldn't distinguish `store` from `storeMask` on 128-bit vector instruction sets)
- implement a fallback for instruction sets that don't support it natively (turns out this is really easy using a load-blend-store combination)Markus HolzerMarkus Holzer