pystencils merge requests
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests
2024-03-28T13:47:06+01:00

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/369
Fix kernel function parameters
2024-03-28T13:47:06+01:00
Author: Daniel Bauer

This MR implements equality and hashing for `PsSymbol` such that the parameters of `KernelFunction`s are unique.
Also improves some error messages.

Assignee: Daniel Bauer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/366
Increase supported python version
2024-01-16T11:56:08+01:00
Author: Markus Holzer

Support for Python 3.12

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/358
Draft: [FIX] Index fields exclusively containing coordinates are dropped by code generator
2024-01-31T11:48:23+01:00
Author: Frederik Hennig

Index fields that exclusively contain coordinate data (members `x`, `y` and `z`) and that are not explicitly accessed in the kernel assignments are dropped by `pystencils.cpu.create_indexed_kernel` in `cpu/kernelcreation.py`, prev. line 119.
Then in line 128 the list of index fields is empty, and the code generator finds no field containing the coordinate information.
Code generation then aborts.
Is there a reason why index fields are first filtered this way?

Assignee: Frederik Hennig

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/353
Draft: Generalise usage of Structs for nested array access
2023-09-28T09:47:11+02:00
Author: Markus Holzer

In this MR, structs are introduced in a more general form than they are used in the index kernel. The structs here can hold data and pointers to fields. This makes it possible to iterate over a struct and extract field pointers in each loop iteration. The extracted fields are then updated in the normal loop nest.
The idea can be illustrated in a small example:
```python
import numpy as np
import pystencils as ps
from pystencils.typing import BasicType, FieldPointerSymbol, PointerType
from pystencils.struct import Struct
dtype = BasicType(np.float64)
f = ps.fields(f'f(1): double[3d]')
g = ps.fields(f'g(1): double[3d]')
struct_src = Struct("src")
struct_src.add_member(PointerType(dtype, const=False, restrict=False, double_pointer=True))
struct_dst = Struct("dst")
struct_dst.add_member(PointerType(dtype, const=False, restrict=False, double_pointer=True))
update_rule = [ps.Assignment(FieldPointerSymbol("f", dtype, const=True), struct_src[0]),
               ps.Assignment(FieldPointerSymbol("g", dtype, const=False), struct_dst[0]),
               ps.Assignment(g.center, f.center)]
ast = ps.create_kernel(update_rule)
```
This produces the following C-Code:
```c++
FUNC_PREFIX void kernel(double ** _data_dst, double ** _data_src, int64_t const _size_dst, int64_t const _size_f_0, int64_t const _size_f_1, int64_t const _size_f_2, int64_t const _stride_f_0, int64_t const _stride_f_1, int64_t const _stride_f_2, int64_t const _stride_g_0, int64_t const _stride_g_1, int64_t const _stride_g_2)
{
   for (int64_t ctr_0 = 0; ctr_0 < _size_dst; ctr_0 += 1)
   {
      double * RESTRICT _data_f = _data_src[ctr_0];
      double * RESTRICT _data_g = _data_dst[ctr_0];
      for (int64_t ctr_1 = 0; ctr_1 < _size_f_0; ctr_1 += 1)
      {
         for (int64_t ctr_2 = 0; ctr_2 < _size_f_1; ctr_2 += 1)
         {
            for (int64_t ctr_3 = 0; ctr_3 < _size_f_2; ctr_3 += 1)
            {
               _data_g[_stride_g_0*ctr_1 + _stride_g_1*ctr_2 + _stride_g_2*ctr_3] = _data_f[_stride_f_0*ctr_1 + _stride_f_1*ctr_2 + _stride_f_2*ctr_3];
            }
         }
      }
   }
}
```
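To check the semantics of the kernel above without compiling it, the loop nest can be mimicked in plain Python. This is only a sketch under stated assumptions: `copy_all` is a hypothetical helper and not part of pystencils, lists of flat lists stand in for the `double **` arguments, and the sizes and row-major strides are made-up example values.

```python
def copy_all(data_src, data_dst, size, strides_f, strides_g):
    """Mimic the generated kernel: for every struct element, copy field f into field g."""
    for ctr_0 in range(len(data_dst)):      # outer loop over the struct elements
        data_f = data_src[ctr_0]            # field pointer extracted for reading
        data_g = data_dst[ctr_0]            # field pointer extracted for writing
        for ctr_1 in range(size[0]):        # the usual 3D loop nest
            for ctr_2 in range(size[1]):
                for ctr_3 in range(size[2]):
                    src = strides_f[0] * ctr_1 + strides_f[1] * ctr_2 + strides_f[2] * ctr_3
                    dst = strides_g[0] * ctr_1 + strides_g[1] * ctr_2 + strides_g[2] * ctr_3
                    data_g[dst] = data_f[src]


# Example values (assumptions): three subarrays of shape 2 x 3 x 4, stored row-major.
size = (2, 3, 4)
strides = (12, 4, 1)
cells = size[0] * size[1] * size[2]
sources = [[float(100 * k + i) for i in range(cells)] for k in range(3)]
targets = [[0.0] * cells for _ in range(3)]

copy_all(sources, targets, size, strides, strides)
```

After the call, every list in `targets` holds the same values as the corresponding list in `sources`, which is what a single kernel launch does for all subarrays at once.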
Thus the struct is used as a container for an arbitrary number of subarrays that are all updated at once. Since the struct only holds a single pointer per element in the above example, we can represent it as a double pointer **

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/349
[BugFix] Fix indexing with ghostlayers
2023-09-07T11:10:33+02:00
Author: Markus Holzer

The block indexing has a bug when it is created with an iteration slice and ghost layers. With !341, the block indexing supports slices more naturally by limiting the iteration space to the sliced size. Thus the counter index is multiplied by the step size. This was also done for the offset of the ghost layers, which is wrong.
This MR fixes the problem.

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/342
Draft: Do not reorder accesses in `move_constants_before_loop`
2023-08-18T10:39:05+02:00
Author: Daniel Bauer

Prior to this MR, `move_constants_before_loop` tries to move constants as far to the top as possible.
This might reorder read/write accesses to fields.
For example:
```python
import pystencils as ps
from pystencils import CreateKernelConfig
from pystencils.astnodes import Block, KernelFunction, LoopOverCoordinate, SympyAssignment
from pystencils.field import Field, FieldType
from sympy.abc import x, y
field = Field.create_generic("field", 1, field_type=FieldType.CUSTOM)
counter = LoopOverCoordinate.get_loop_counter_symbol(0)
load = SympyAssignment(x, field.absolute_access((counter,), (0,)))
store = SympyAssignment(field.absolute_access((counter+1,), (0,)), 2*x)
body = ps.typing.transformations.add_types(Block([load, store]), CreateKernelConfig())
loop = LoopOverCoordinate(body, 0, 0, 42)
block = Block([loop])
ps.transformations.resolve_field_accesses(block)
new_loops = ps.transformations.cut_loop(loop, [41])
ps.transformations.move_constants_before_loop(new_loops.args[1])
kernel = KernelFunction(
block,
ps.Target.CPU,
ps.Backend.C,
ps.cpu.cpujit.make_python_function,
None,
)
code = ps.get_code_str(kernel)
print(code)
```
prints
```c
FUNC_PREFIX void kernel(double * RESTRICT _data_field, int64_t const _stride_field_0)
{
   const double x = _data_field[41*_stride_field_0];
   _data_field[42*_stride_field_0] = x*2.0;
   {
      for (int64_t ctr_0 = 0; ctr_0 < 41; ctr_0 += 1)
      {
         const double x = _data_field[_stride_field_0*ctr_0];
         _data_field[_stride_field_0*(ctr_0 + 1)] = x*2.0;
      }
      {
      }
   }
}
```
Note that the last (cut) loop iteration is moved before the primary loop, leading to a wrong load from index 41.
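The effect of the reordering can be reproduced without pystencils. The following sketch replays the same loads and stores on a plain Python list, once in program order and once with the cut iteration hoisted as in the printed kernel. The function names and the initial values are assumptions chosen for illustration.

```python
def reference_loop(data):
    """All 42 iterations in program order: load data[i], store 2*x to data[i + 1]."""
    out = list(data)
    for i in range(42):
        x = out[i]
        out[i + 1] = 2.0 * x
    return out


def hoisted_cut_iteration(data):
    """Same statements, but the cut iteration (i = 41) runs before the main loop."""
    out = list(data)
    x = out[41]                 # reads index 41 before iteration 40 has written it
    out[42] = 2.0 * x
    for i in range(41):
        x = out[i]
        out[i + 1] = 2.0 * x
    return out


initial = [1.0] * 43            # example input (assumption)
expected = reference_loop(initial)
actual = hoisted_cut_iteration(initial)
print(expected[42], actual[42])
```

The two results agree on indices 0 to 41 and differ only at index 42, because each iteration depends on the store of the previous one.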
This MR changes `move_constants_before_loop` such that assignments cannot be moved before their last modification.
Essentially, it replaces `symbols_defined` by `symbols_modified` [here](https://i10git.cs.fau.de/terraneo/pystencils/-/commit/be78ab165339d593869b5c77ef00a590a63ba130#99785d4b53b75ce54c83c3e499248de2a07fb2cd_598_597).
This new property is implemented for all AST nodes.
Note the implementation of `CustomCCodeNode`. I did not want to introduce breaking changes to the API.
Additionally, declarations are now inserted where the caller requests, instead of pushing them all the way to the top (https://i10git.cs.fau.de/terraneo/pystencils/-/commit/5c65d06216d050c22e28ba0b9487544342fc0926).
Lastly, a test for the new behavior is included.

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/327
[Fix] Update for Docker Images
2023-06-04T16:14:23+02:00
Author: Markus Holzer

Due to an update of the Docker images, minor changes are required for the CI.

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/316
Draft: feat: implement `__cuda_array_interface__`
2023-09-14T10:43:31+02:00
Author: Stephan Seitz

https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
This is supported by:
- pycuda
- numba
- cupy
- torch
- nvcv https://github.com/CvCuda/CV-CUDA
- maybe by tensorflow in future: https://github.com/tensorflow/tensorflow/issues/29039
Also allows execution with cupy (https://docs.cupy.dev/en/stable/index.html) instead of pycuda.
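For orientation, this is roughly what an object has to provide so that the libraries listed above can consume it. The sketch follows version 3 of the interface as documented by numba. `DeviceBuffer` is a hypothetical class, not part of pystencils, and the pointer value is made up, so the object must not be handed to a real GPU library.

```python
class DeviceBuffer:
    """Hypothetical wrapper around a raw device pointer (illustration only)."""

    def __init__(self, ptr, shape, typestr="<f8", read_only=False):
        self.ptr = ptr                  # device address as a Python int
        self.shape = tuple(shape)
        self.typestr = typestr          # NumPy-style type string, "<f8" is float64
        self.read_only = read_only

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self.shape,
            "typestr": self.typestr,
            "data": (self.ptr, self.read_only),
            "strides": None,            # None means C-contiguous
            "version": 3,
        }


buf = DeviceBuffer(ptr=0x7F0000000000, shape=(64, 64))   # made-up address
iface = buf.__cuda_array_interface__
```

A consumer only reads this dictionary. It does not need to know which library allocated the memory, which is what would make execution with either pycuda or cupy possible.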
TODO:
- [ ] check that pointers are in the correct CUDA context and, if not, import them into the current one
- [x] make execution with pycuda aware of `__cuda_array_interface__`
- [ ] what/how to test

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/305
Fix #62
2022-10-21T09:24:20+02:00
Author: Markus Holzer

Fixes problems around #62

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/302
Regression !300
2022-10-10T13:37:53+02:00
Author: Markus Holzer

In !300 all written field sizes are added to the SympyAssignment as unknown parameters. This solves the problem that all field sizes need to be passed as arguments when using NT stores with non-x86 architectures. However, it introduces two problems.
1. In all other cases these parameters are not used. Thus waLBerla fails in some cases when compiled with -Wall. Other than that it is not nice either to pass unused parameters.
2. For the GPU code generation problems arose with the usage of `get_parameters` in waLBerla:
https://i10git.cs.fau.de/pycodegen/pystencils/-/blob/master/pystencils/astnodes.py#L244
Overall it seems that the easiest way to fix the problem is to only pass the additional size arguments when needed and in no other cases.

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/283
Draft: Remove too many zeros
2023-03-27T10:40:59+02:00
Author: Markus Holzer

Remove unnecessary zeros from numbers: 1.80000000 --> 1.8

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/275
WIP: Revamp the type system
2022-05-11T14:33:30+02:00
Author: Markus Holzer

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/270
Fixed kernel_decorator with config parameter
2021-11-03T22:23:36+01:00
Author: Jan Hönig

The current kernel decorator does not work properly with the introduced `CreateKernelConfig`.
This MR fixes that.

Assignee: Jan Hönig

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/250
Switch index type from int32 to int64
2021-06-07T13:38:27+02:00
Author: Markus Holzer

For large domain sizes, int32 is not sufficient. Thus it is planned for waLBerla to change `cell_index_t` from `int` to `int64`. To make it consistent with pystencils and to prevent conversion warnings, the index type for pystencils is also adapted to int64.
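A quick back-of-the-envelope check shows where int32 stops being sufficient. The sketch below computes the largest linear cell index of a domain and compares it with the int32 limit. The domain size of 2048 cells per direction and the helper name `last_linear_index` are assumptions chosen for illustration.

```python
import ctypes

INT32_MAX = 2**31 - 1


def last_linear_index(nx, ny, nz):
    """Largest linear index of a domain with nx * ny * nz cells."""
    return nx * ny * nz - 1


idx = last_linear_index(2048, 2048, 2048)       # example domain (assumption)
fits_in_int32 = idx <= INT32_MAX

# ctypes integer types do no overflow checking, so this shows the wrapped value
# that a 32-bit index would hold.
wrapped = ctypes.c_int32(idx).value
as_int64 = ctypes.c_int64(idx).value
print(idx, fits_in_int32, wrapped, as_int64)
```

For this domain the index needs 33 bits, so a 32-bit counter wraps around while a 64-bit counter holds the value unchanged.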
Fixes https://i10git.cs.fau.de/pycodegen/lbmpy/-/issues/18

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/237
Fix Sympy pipeline
2021-04-26T16:46:20+02:00
Author: Markus Holzer

Fix #35

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/229
Add type conversion for SP types
2021-04-03T06:01:47+02:00
Author: Markus Holzer

If Assignments are already typed for double-precision but the kernel is created for single-precision the assignments should be adapted.

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/225
WIP: ARM cache line zeroing
2021-04-01T22:59:29+02:00
Author: Michael Kuron (mkuron@icp.uni-stuttgart.de)

ARM has a cache line zero instruction that prevents data that will be overwritten anyway from being loaded from RAM. Kind of a light version of a non-temporal store. Saw this near the end of https://www.youtube.com/watch?v=BP7XD7JHgrI in the context of SVE, but might be relevant on Neon too. Just wanted to keep a note of this here.
Integrating this into pystencils is probably not completely straightforward as you first need to check how much would be zeroed (64 bytes on all current chips, not guaranteed to match the cache line size), zero it, and then write the corresponding amount of data. Not sure if there are guarantees as to whether it's a multiple of the vector width.
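The bookkeeping this implies can be sketched independently of any instruction set. Only blocks that lie completely inside the written range may be zeroed, since zeroing a partially covered block would destroy neighboring data. The helper below splits a byte range accordingly. `split_for_zeroing` is a hypothetical name, and the 64-byte block size is just the value quoted above, not something queried from hardware.

```python
def split_for_zeroing(start, length, block=64):
    """Split the byte range [start, start + length) into head, whole blocks and tail.

    Only the whole blocks are candidates for a zeroing instruction. Head and tail
    have to be written with ordinary stores.
    """
    end = start + length
    first = -(-start // block) * block      # round start up to a block boundary
    last = (end // block) * block           # round end down to a block boundary
    if first >= last:                       # no whole block fits into the range
        return (start, end), [], (end, end)
    blocks = [(b, b + block) for b in range(first, last, block)]
    return (start, first), blocks, (last, end)


head, blocks, tail = split_for_zeroing(start=40, length=200, block=64)
print(head, blocks, tail)
```

A generated kernel would then zero and fill each whole block with full-width stores, which also shows why the relation between block size and vector width matters.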
There is not a whole lot of information for ARM, but the exact same thing has existed on IBM's PowerPC architecture (!228) for decades. There, a cache line has 128 bytes (can be queried from the kernel via `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)`) and can be zeroed with the `__dcbz` intrinsic.

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/224
Draft: Develop
2023-09-14T11:03:48+02:00
Author: Markus Holzer

This MR adds two features to pystencils. First, the base pointer specification is exposed to the user, which allows producing kernels with less register usage. Second, the summands inside the summation printer are now printed recursively, which allows for more parallelism inside a single core.

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/210
WIP: Assembly
2021-03-26T20:17:13+01:00
Author: Markus Holzer

Adds the functionality to directly show the assembly output of the generated code.
Further, the base pointer specification is exposed to the user, which is helpful to minimize register spilling in some cases.

Assignee: Markus Holzer

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/187
WIP: ARM NEON vectorization
2020-11-18T14:59:55+01:00
Author: Michael Kuron (mkuron@icp.uni-stuttgart.de)

With Apple's new laptops having ARM processors, I thought it might be time to add ARM NEON vectorization to pystencils. I don't currently have hardware to test on, but a bunch of test cases from both pystencils and lbmpy at least compile successfully. A Raspberry Pi 4 might actually be a useful and cheap device to add to CI for this purpose.
This may also become useful once ARM HPC clusters actually get deployed, though these might end up using SVE instead of NEON -- while I have added a few `if`s for that case, additional work is needed because SVE's vector width is determined at runtime.

Assignee: Markus Holzer