pystencils merge requests

Switch index type from int32 to int64

2021-06-07T13:38:27+02:00

For large domain sizes, int32 is not sufficient. Thus it is planned for waLBerla to change `cell_index_t` from `int` to `int64`. To make it consistent with pystencils and to prevent conversion warnings the index type for pystencils is also adapted to int64 Fixes https://i10git.cs.fau.de/pycodegen/lbmpy/-/issues/18

Fix Sympy pipeline

2021-04-26T16:46:20+02:00

Fix Sympy pipeline

Add type conversion for SP types

2021-04-03T06:01:47+02:00

If Assignments are already typed for double-precision but the kernel is created for single-precision the assignments should be adapted.

WIP: ARM cache line zeroing

2021-04-01T22:59:29+02:00

ARM has a cache line zero instruction that prevents data that will be overwritten anyway from being loaded from RAM. Kind of a light version of a non-temporal store. Saw this near the end of https://www.youtube.com/watch?v=BP7XD7JHgrI in the context of SVE, but might be relevant on Neon too. Just wanted to keep a note of this here. Integrating this into pystencils is probably not completely straight-forward as you first need to check how much would be zeroed (64 bytes on all current chips, not guaranteed to match the cache line size), zero it, and then write the corresponding amount of data. Not sure if there are guarantees as to whether it's a multiple of the vector width. There is not a whole lot of information for ARM, but the exact same thing has existed on IBM‘s PowerPC architecture (!228) for decades. There, a cache line has 128 bytes (can be queried from the kernel via `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)`) and can be zeroed with the `__dcbz` intrinsic.

WIP: Assembly

2021-03-26T20:17:13+01:00

Adds the functionality to directly show the assembly output of the generated code. Further, the base pointer specification is revealed to the user which is helpful to minimize register spilling in some cases.

WIP: ARM NEON vectorization

2020-11-18T14:59:55+01:00

With Apple's new laptops having ARM processors, I thought it might be time to add ARM NEON vectorization to pystencils. I don't currently have hardware to test on, but a bunch of test cases from both pystencils and lbmpy at least compile successfully. A Raspberry Pi 4 might actually be a useful and cheap device to add to CI for this purpose. This may also become useful once ARM HPC clusters actually get deployed, though these might end up using SVE instead of NEON -- while I have added a few `if`s for that case, additional work is needed because SVE's vector width is determined at runtime.

Updated Kerncraft Coupling

2020-11-06T15:45:24+01:00

Updated Kerncraft Coupling

WIP: Cuda autotune

2020-10-07T13:04:35+02:00

This PR introduces ~~two~~ one change~~s~~: - ~~rotate (32,1,1) depending on field strides to fastest dimension. So (1,1,32) for c-layout and (32,1,1) for fortran layout. So pystencils will be fast also for c-layout (this will always be performed)~~ - auto-tune the block dimensions to whatevers is fastest for a specific kernel on localhost. On first kernel call different layouts are tried and the kernel will be called henceforth with the fastest configuration (disk_cached). This could be intersting for OpenCL where we don't know which launch config is the fastest (on OpenCL the runtime can alternatively give a hint on that). One drawback: the test calls are only correct if input and output fields do not overlap (so no in-place kernels).

Use dark mode for code preview if user prefers `prefers-color-scheme: dark`

2020-04-23T07:59:41+02:00

Use dark mode for code preview if user prefers `prefers-color-scheme: dark`

Fix import: sympy.numbers -> sympy.core.numbers

2020-03-24T00:57:30+01:00

Apparently `sympy` no longer exports `sympy.numbers` directly.

Add TypedMatrixSymbol (for usage of `MatrixSymbol` in kernels)

2020-02-21T15:16:29+01:00

I don't know whether this is a good idea but SymPy supports assigning MatrixSymbols. Like ```python >>>A = MatrixSymbols('A', 3, 3) >>>B = MatrixSymbols('B', 3, 3) In [12]: pystencils.Assignment(A, B) Out[12]: A := B ``` With this hack I can generate code like this: ```cpp #define FUNC_PREFIX static 2 FUNC_PREFIX void kernel(float * RESTRICT _data_y, int64_t const _size_y_0, int64_t const _size_y_1, int64_t const _size_y_2, int64_t const _stride_y_0, int64_t const _stride_y_1, int64_t const _stride_y_2, std::function< Vector3 < double >(int, int, int) > my_fun) 3 { 4 for (int ctr_0 = 0; ctr_0 < _size_y_0; ctr_0 += 1) 5 { 6 float * RESTRICT _data_y_00 = _data_y + _stride_y_0*ctr_0; 7 for (int ctr_1 = 0; ctr_1 < _size_y_1; ctr_1 += 1) 8 { 9 float * RESTRICT _data_y_00_10 = _stride_y_1*ctr_1 + _data_y_00; 10 for (int ctr_2 = 0; ctr_2 < _size_y_2; ctr_2 += 1) 11 { 12 const Vector3 A = my_fun(ctr_0, ctr_1, ctr_2); 13 _data_y_00_10[_stride_y_2*ctr_2] = A[0] + A[1] + A[2]; 14 } 15 } 16 } 17 } 1 #define FUNC_PREFIX static 2 template 3 FUNC_PREFIX void kernel(float * RESTRICT _data_y, int64_t const _size_y_0, int64_t const _size_y_1, int64_t const _size_y_2, int64_t const _stride_y_0, int64_t const _stride_y_1, int64_t const _stride_y_2, Functor_T my_fun) 4 { 5 for (int ctr_0 = 0; ctr_0 < _size_y_0; ctr_0 += 1) 6 { 7 float * RESTRICT _data_y_00 = _data_y + _stride_y_0*ctr_0; 8 for (int ctr_1 = 0; ctr_1 < _size_y_1; ctr_1 += 1) 9 { 10 float * RESTRICT _data_y_00_10 = _stride_y_1*ctr_1 + _data_y_00; 11 for (int ctr_2 = 0; ctr_2 < _size_y_2; ctr_2 += 1) 12 { 13 const Vector3 A = my_fun(ctr_0, ctr_1, ctr_2); 14 _data_y_00_10[_stride_y_2*ctr_2] = A[0] + A[1] + A[2]; 15 } 16 } 17 } 18 } ``` from ```python x, y = pystencils.fields('x, y: float32[3d]') from pystencils.data_types import TypedMatrixSymbol A = TypedMatrixSymbol('A', 3, 1, create_type('double'), 'Vector3') my_fun_call = DynamicFunction(TypedSymbol('my_fun', 'std::function< Vector3 < double >(int, int, int) >'), A.dtype, *pystencils.x_vector(3)) assignments = pystencils.AssignmentCollection({ A: my_fun_call, y.center: A[0] + A[1] + A[2] }) ast = pystencils.create_kernel(assignments) pystencils.show_code(ast, custom_backend=FrameworkIntegrationPrinter()) ```

WIP: Graph datahandling

2020-01-28T14:24:04+01:00

This is the draft for a data handling that (optionally) forwards all calls to SerialDatahandling. All calls and data transfers get recorded for the creation of an execution graph. Needs to be changed after the breaking changes in datahandling. Needs a tiny change in lbmpy: Instead of using `TimeLoop(...)` for time loop creation a custom function is used.

WIP: Add InterpolatorAccess.getnewargs

2020-01-28T14:23:15+01:00

it was missing and instead TypedSymbol.__getnewargs__ was used

Test pystencils_autodiff in integration test

2020-01-08T13:49:44+01:00

Test pystencils_autodiff in integration test

Throw error when trying to sympify `pystencils.Field` (e.g. using it in an...

2020-01-03T13:24:14+01:00

Throw error when trying to sympify `pystencils.Field` (e.g. using it in an Assignment without indexing) This is a typical error when using pystencils: you forget the index and use a field directly in an Assignment. Edit: apparently, this error is only triggered on recent versions of Sympy that can sympify using `__sympy__` (not on CI).

Test pystencils_autodiff in integration test

2019-12-17T18:56:19+01:00

Test pystencils_autodiff in integration test

Add CI minimal CI test for old sympy

2019-12-17T18:51:26+01:00

The minimal test cannot catch everything but its something.

Opencl datahandling

2019-12-05T12:14:33+01:00

Closes #15 OpenCL kernels are now integrated in the normal `create_kernel` workflow. Also there exists a created a `opencljit.init_globally` function that just creates some CL queue/contex if you do not want to give it as a parameter to every kernel. SerialDatahandling is extended to work with alternative GPU array libraries to PyCuda. There is now some overlapping code with the `_custom_transfer_functions` but I suppose they are for certain quantities that have a separate transfer function as oppose to using a whole different backend. @kuron can you have a look on it? I think the solution is not as elegant as I thought it would be. pycuda.gpuarray.GPUArrays are not wrapped. So if you use `dh.gpuarrays['foo']` you get either a pycuda array or a opencl array. I thought this step would be to drastic for one PR. Using OpenCL should still be a lot easier now.

Fix Opencl and LLVM GPU tests

2019-12-05T11:02:49+01:00

Fix tests for LLVM GPU and OpenCL - !96 made it impossible to print functions without names (only important for LLVM GPU test) - !87 made it impossible to run OpenCL kernels on CUDA OpenCL `int(...)`. is not a valid cast for it - SymPy moved `sympy.boolalg` to `sympy.logic.boolalg`

WIP: Complex number support

2019-10-11T17:57:49+02:00

Depends on !43 Pystencils should eventually support complex numbers. Even if complex fields can be considered harmful for CPU vectorization. The concept is nice since SymPy and Python support complex numbers and there should be no performance disadvantage for normal CPU and GPU code. Many applications in physics and signal processing rely on complex numbers. Complex output fields can be passed directly to libraries like `cufft`. Problem: In C++, one cannot mix calculation with `std::complex` with `std::complex`. So user has to specify `data_type='float32'` when single precision complex floats are desired. TODO: * GPU support with the header pycuda provides * only use `complex_helper.h` when needed * remove commits from !34 (probably the code will be changed) * rebase -i

pystencils merge requests

Switch index type from int32 to int64

Fix Sympy pipeline

Add type conversion for SP types

WIP: ARM cache line zeroing

WIP: Assembly

WIP: ARM NEON vectorization

Updated Kerncraft Coupling

WIP: Cuda autotune

Use dark mode for code preview if user prefers `prefers-color-scheme: dark`

Fix import: sympy.numbers -> sympy.core.numbers

Add TypedMatrixSymbol (for usage of `MatrixSymbol` in kernels)

WIP: Graph datahandling

WIP: Add InterpolatorAccess.__getnewargs__

Test pystencils_autodiff in integration test

Throw error when trying to sympify `pystencils.Field` (e.g. using it in an...

Test pystencils_autodiff in integration test

Add CI minimal CI test for old sympy

Opencl datahandling

Fix Opencl and LLVM GPU tests

WIP: Complex number support

WIP: Add InterpolatorAccess.getnewargs