pystencils merge requests

pystencils merge requests https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests 2023-09-14T20:54:35+02:00 https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/350 Reveal base pointer spec 2023-09-14T20:54:35+02:00 Markus Holzer

Reveal base pointer spec

This MR reveals the base pointer specification to the user in the config. feature Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/348 Fix integration pipeline 2023-09-07T11:09:55+02:00 Daniel Bauer

Fix integration pipeline

This MR fixes several issues with the integration pipeline.

1. !349.
2. Fixes an oversight introduced in https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344. The issue was with expressions like:

   ```
   a: scalar float = CastFunc(b: vector float, float)
   ```

   The mentioned MR used `get_type_of_expression` to determine whether an expression will be vectorized. That works in most cases due to type collation but fails for `CastFuncs`, which are always vectorized. This MR replaces `get_type_of_expression` with a small helper function that checks whether an expression is entirely scalar or requires additional vector casts.

Bug Daniel Bauer Daniel Bauer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/341 Refactor gpu indexing 2023-09-04T16:53:09+02:00 Markus Holzer
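The kind of check described in "Fix integration pipeline" can be sketched on a toy expression tree. The `Node` class and `is_entirely_scalar` below are hypothetical stand-ins for illustration only; they are not the helper added by the MR, and they use neither pystencils nor SymPy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical toy expression node; not a pystencils or SymPy class."""
    name: str
    is_vector: bool = False   # e.g. a vectorized field access
    is_cast: bool = False     # stands in for CastFunc, which is always vectorized
    args: Tuple["Node", ...] = ()


def is_entirely_scalar(expr: Node) -> bool:
    """Return True only if no part of the expression is vectorized."""
    if expr.is_vector or expr.is_cast:
        return False
    return all(is_entirely_scalar(arg) for arg in expr.args)


# A cast counts as vectorized even when its argument is scalar.
scalar_product = Node("mul", args=(Node("c"), Node("d")))
cast_of_scalar = Node("CastFunc", is_cast=True, args=(Node("c"),))
print(is_entirely_scalar(scalar_product), is_entirely_scalar(cast_of_scalar))
```

Looking at the structure of the expression rather than at its collated type is what lets the cast case be treated separately.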

Refactor gpu indexing

To map an iteration space to GPU threads, indexing classes are used. These indexing classes receive a field and an iteration slice to determine the iteration space. This MR refactors the indexing classes to directly receive an iteration space. With this, the indexing classes are more general and no longer depend on pystencils Fields.

Further improvements/fixes:

- Line indexing now works with iteration slices. This did not work at all before.
- Both indexing schemes calculate a correct block and grid size for iteration slices. For example, if only every second element is touched (due to a given iteration slice), the number of threads is halved. This removes the modulo calculation that was needed before.
- Both indexing schemes now support up to 4 dimensions.

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344 Vectorize all scalar symbols in vector expressions 2023-09-04T14:59:27+02:00 Daniel Bauer
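As a rough illustration of the thread-count argument in "Refactor gpu indexing": if the launch size is derived from the number of elements a slice actually touches, a stride of 2 halves the thread count and no modulo test is needed inside the kernel. The function names and the `max_block_size` parameter below are assumptions made for this sketch, not the interface of the pystencils indexing classes.

```python
import math


def touched_elements(iteration_slice):
    """Number of elements visited per dimension; slices need explicit bounds."""
    return tuple(len(range(s.start, s.stop, s.step or 1)) for s in iteration_slice)


def launch_config(iteration_slice, max_block_size):
    """Return (block, grid) sizes covering exactly the touched elements."""
    counts = touched_elements(iteration_slice)
    if len(counts) != len(max_block_size):
        raise ValueError("need one maximum block size per dimension")
    # Never use a block extent of zero, even for an empty dimension.
    block = tuple(max(1, min(c, b)) for c, b in zip(counts, max_block_size))
    grid = tuple(math.ceil(c / b) for c, b in zip(counts, block))
    return block, grid


dense = (slice(0, 64, 1), slice(0, 16, 1))
strided = (slice(0, 64, 2), slice(0, 16, 1))
print(launch_config(dense, (16, 8)))
print(launch_config(strided, (16, 8)))
```

In such a scheme, thread `t` of the strided dimension would map to element `start + t * step`, which is why every launched thread does useful work.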

Vectorize all scalar symbols in vector expressions

Pystencils fails to vectorize very simple kernels:

```python
import numpy as np
import pystencils as ps
from pystencils.astnodes import SympyAssignment, TypedSymbol

f = ps.fields("f: [1D]")
x = TypedSymbol("x", np.float64)
kernel = ps.create_kernel(
    [SympyAssignment(x, 2.0), SympyAssignment(f[0], x)],
    cpu_vectorize_info={"assume_inner_stride_one": True},
)
ps.show_code(kernel)
```

This example throws an exception in `show_code`, complaining that the printer cannot vectorize type casts. The problem is that `x = 2.0` is moved out of the loop (since it is constant). What remains in the loop is `f[i] = x`. While the left-hand side of this expression is vectorized, the right-hand side is left scalar, leading to the exception.

The issue comes from the `insert_vector_casts` function. It traverses each expression from the leaves to the root, leaving scalars scalar [^1] and collating mixed expressions to vectors. However, it handles the rhs of assignments separately from the lhs, leading to the above issue. Moreover, expressions like `a (vec) + (b (scalar) * c (scalar))` are converted to `a (vec) + CastToVec(b (scalar) * c (scalar))`, which leads to the same exception. The correct way is to directly cast `b` and `c` to vectors, not their product. Therefore, `insert_vector_casts` must know beforehand whether an expression appears inside a vectorized expression.

This MR fixes that for SympyAssignments. To that end, it first checks whether either side contains a vectorized expression, and if so, casts all symbols to vectors.
Since I am not really sure how to handle the cases for `VectorMemoryAccess` ([line 370/386](https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344/diffs#2d8e7266b6ec295a8ceac4159fcd9cea9ede6ca8_370_386)) and `ast.Conditional` ([line 374/390](https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344/diffs#2d8e7266b6ec295a8ceac4159fcd9cea9ede6ca8_374_390)), I left those untouched.

[^1]: The exception is that CastFunctions are always replaced by vector casts. I do not know whether this is intentional.

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/347 Distinguish between SymPy and pystencils Assignement better 2023-08-29T10:29:02+02:00 Markus Holzer
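The strategy from "Vectorize all scalar symbols in vector expressions" (first decide whether an assignment is vectorized at all, then cast the scalar leaves rather than intermediate results) can be sketched on nested tuples. This is a toy model under stated assumptions: expressions are `(operator, *operands)` tuples with symbol names as strings, and `CastToVec` is only a label. It is not the implementation of `insert_vector_casts`.

```python
def contains_vector(expr, vector_symbols):
    """True if any leaf of the expression is a vector symbol."""
    if isinstance(expr, str):
        return expr in vector_symbols
    _, *args = expr
    return any(contains_vector(arg, vector_symbols) for arg in args)


def cast_scalar_leaves(expr, vector_symbols):
    """Wrap every scalar leaf in a cast; operators are left untouched."""
    if isinstance(expr, str):
        return expr if expr in vector_symbols else ("CastToVec", expr)
    op, *args = expr
    return (op, *(cast_scalar_leaves(arg, vector_symbols) for arg in args))


def vectorize_assignment(lhs, rhs, vector_symbols):
    """Cast both sides if either side is vectorized, else change nothing."""
    if contains_vector(lhs, vector_symbols) or contains_vector(rhs, vector_symbols):
        return (cast_scalar_leaves(lhs, vector_symbols),
                cast_scalar_leaves(rhs, vector_symbols))
    return lhs, rhs


# a (vec) + (b * c): b and c are cast individually, not their product.
print(vectorize_assignment("f", ("+", "a", ("*", "b", "c")), {"f", "a"}))
```

Deciding per assignment rather than per subexpression is what covers the `f[i] = x` case, where the right-hand side alone contains no vector at all.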

Distinguish between SymPy and pystencils Assignement better

This MR clarifies the usage of `SympyAssignment` in contrast to the SymPy assignment. In the backend of pystencils, only `SympyAssignment`s are now used, which inherit from the pystencils base `Node` class. Fixes #61. Additionally, `AddAugmentedAssignment` is introduced for convenience. feature Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/346 Extension to field read extraction 2023-08-24T09:47:32+02:00 Markus Holzer

Extension to field read extraction

With `add_subexpressions_for_field_reads` it is possible to extract field reads from the kernel and put them in individual assignments. For mixed-precision kernels, however, it is useful if the lhs of this new assignment is of a given type. This isolates casts and prevents calculations in the data type of the stored values. feature Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/345 AVX512VL and AVX10 support 2023-08-23T20:01:05+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

AVX512VL and AVX10 support

AVX512VL is the 256-bit version of all the AVX512F instructions. It is primarily useful on those processors that only have one AVX512 vector unit and drastically reduce their clock frequency when executing 512-bit instructions. For purposes of pystencils, this mostly means scatter/gather support (up to 30% improvements as per !241) and no reduced clock frequencies ([up](https://cdrdv2-public.intel.com/336065/336065_Intel%20Xeon%20Processor%20Scalable%20Family%20Public%20Specification%20Update_rev17.pdf) [to](https://cdrdv2-public.intel.com/338848/338848_2nd%20Gen%20Intel®%20Xeon®%20Scalable%20Processors%20Specification%20Update_Rev027US.pdf) 45% improvements on Xeon Bronze 31xx/32xx, Silver 41xx/42xx, Gold 51xx/52xx).

I suppose we never bothered implementing it because it offers no advantage on Xeon Gold 61xx/62xx and Platinum with their two AVX512 units, nor on newer x3xx and x4xx (or the Ice Lake/Tiger Lake/Rocket Lake desktop/laptop processors), which [don't clock down](https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html) anymore. Many of Intel's future processors, however, will be using AVX10-256 instead of AVX512, which is, in a sense, halfway between AVX2 and AVX512.

AVX10.1-128 and AVX10.1-256 are essentially a rebranded AVX512VL that can be enabled without AVX512F. This just needs a few changed ifdefs and awareness in the CPU detection. The /proc/cpuinfo flag is just a guess, but a very likely one.

AVX10.1-512 is the same as AVX512F and enabling one always enables the other. I've adapted the ifdefs nonetheless just in case.

All information is based on https://www.phoronix.com/news/GCC-Lands-Initial-AVX10.1.
Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/241 Vector scatter/gather support 2023-08-18T20:43:16+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de
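For the CPU detection mentioned in "AVX512VL and AVX10 support", a minimal sketch of reading feature flags from `/proc/cpuinfo`-style text might look as follows. `avx512vl` is the flag name Linux reports for AVX512VL; the AVX10 flag name is deliberately left as a caller-supplied parameter, since the MR itself calls it a guess. Both function names are made up for this example and do not come from pystencils.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags of the first 'flags' line as a set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supports_256bit_avx512_subset(flags, avx10_flag=None):
    """True if AVX512VL, or a caller-supplied AVX10 flag, is present."""
    if avx10_flag is not None and avx10_flag in flags:
        return True
    return "avx512vl" in flags


sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f avx512vl\n"
print(supports_256bit_avx512_subset(parse_cpu_flags(sample)))
```

On a Linux machine the same functions could be fed the contents of `/proc/cpuinfo`; other platforms would need a different source for the flags.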

Vector scatter/gather support

Some modern processors support scatter/gather operations in hardware to be able to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with nonideal memory layout:

- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf

AVX2 only supported gather, so this requires AVX512. The internet says it was quite slow on AVX2 processors anyway. Even on AVX512 the latency is quite high and the throughput is quite low, but it's still better than not vectorizing. SVE also supports it, so all future ARM processors will benefit too, and they will probably have better hardware support for higher throughput.

Fixes #34

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/343 Do not reorder accesses in `move_constants_before_loop` (quickly) 2023-08-18T12:15:30+02:00 Daniel Bauer

Do not reorder accesses in `move_constants_before_loop` (quickly)

Reimplementation of https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/342.

While playing around with the old MR, I realized that the changes proposed there have a significant impact on the execution time of `move_constants_before_loop` (for some kernels). Before the MR, we would not descend into blocks, loops or conditionals to check whether dependencies are modified in their body. The MR changed that for the sake of correctness. However, the implementation was quite inefficient. Note that for each assignment we must find a block to move the assignment to. Essentially, the old MR would move up the AST, at each level determining a set of "critical symbols" by *descending* the tree from the current element again. This means that the AST was traversed a lot, and set objects were created and updated a lot.

This MR changes this behavior. Now, the AST is only traversed once, from the current assignment up to the block we can move the assignment to. If we encounter blocks, loops, etc. on the way, we still descend into the block. However, we do this only once. Moreover, the new implementation does not create a huge set of critical symbols but instead exits early once it finds a dependency.

Overall, my not-sophisticated-at-all tests suggest that the new implementation is even slightly faster than the version from master. The new implementation also does not change the `ast.Node` interface, which I like quite a lot.

https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/340 Fix symbol counters 2023-07-25T11:39:39+02:00 Markus Holzer
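The early-exit idea from the `move_constants_before_loop` MR can be illustrated on a toy AST. Instead of collecting every symbol defined in a loop body into a set, the check below stops at the first definition that matches a dependency. `Assign`, `Loop`, `defines_any` and `can_hoist` are hypothetical names for this sketch, and the real pystencils AST and traversal differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union


@dataclass
class Assign:
    lhs: str
    deps: FrozenSet[str]


@dataclass
class Loop:
    counter: str
    body: List[Union["Assign", "Loop"]]


def defines_any(node, symbols):
    """True as soon as any definition of one of `symbols` is found."""
    if isinstance(node, Assign):
        return node.lhs in symbols
    if node.counter in symbols:
        return True
    # any() over a generator stops descending at the first hit.
    return any(defines_any(child, symbols) for child in node.body)


def can_hoist(assignment, loop):
    """An assignment may move before the loop if nothing in it defines its deps."""
    return not defines_any(loop, assignment.deps)


inner = Loop("j", [Assign("t", frozenset({"j"}))])
loop = Loop("i", [Assign("c", frozenset({"a"})), Assign("e", frozenset({"t"})), inner])
print(can_hoist(loop.body[0], loop), can_hoist(loop.body[1], loop))
```

The second assignment cannot be hoisted because its dependency `t` is written inside the nested loop, which is the case that requires descending into blocks at all.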

Fix symbol counters

When simplifications were applied to an `AssignmentCollection` created from assignments of another, previously simplified `AssignmentCollection`, the counter for symbol creation was not respected. Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/338 JSON Serializer for pystencils config 2023-07-17T14:16:40+02:00 Helen Schottenhamml
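The failure mode behind "Fix symbol counters" is a generator of fresh symbol names that restarts at zero even though earlier simplifications already consumed some indices. A minimal sketch of a generator that respects existing names is shown below; the `xi` prefix and the function name are assumptions chosen for illustration, and this is not the pystencils code.

```python
import re
from itertools import count


def fresh_symbol_names(existing_names, prefix="xi"):
    """Yield names prefix_N, starting after the highest index already in use."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    used = [int(m.group(1)) for name in existing_names if (m := pattern.match(name))]
    start = max(used) + 1 if used else 0
    return (f"{prefix}_{i}" for i in count(start))


# Names left over from an earlier simplification are skipped, not reused.
names = fresh_symbol_names({"xi_0", "xi_3", "a"})
print(next(names), next(names))
```

Starting after the maximum rather than filling gaps keeps the generator simple and still rules out collisions.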

JSON Serializer for pystencils config

This MR adds a custom JSON serializer to allow pystencils configs to be used as parameters in databases. This is useful in parameter studies when using the more modern way of setting up simulations, i.e., using pystencils' `CreateKernelConfig`. It can be extended in the future for other custom classes if needed. Helen Schottenhamml Helen Schottenhamml https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/336 Remove pystencils.GPU_DEVICE 2023-07-13T09:58:30+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de
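A custom serializer of the kind described in "JSON Serializer for pystencils config" is typically a `json.JSONEncoder` subclass whose `default` method handles the types that plain JSON cannot represent. The sketch below uses a made-up `ToyConfig` dataclass and `ToyTarget` enum in place of `CreateKernelConfig`; the encoder shipped with pystencils may be structured differently.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from enum import Enum


class ToyTarget(Enum):
    """Hypothetical stand-in for a code generation target."""
    CPU = "cpu"
    GPU = "gpu"


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a kernel configuration object."""
    target: ToyTarget = ToyTarget.CPU
    data_type: type = float
    cpu_openmp: bool = False


class ConfigEncoder(json.JSONEncoder):
    def default(self, o):
        if is_dataclass(o) and not isinstance(o, type):
            return asdict(o)            # nested values are encoded recursively
        if isinstance(o, Enum):
            return o.name
        if isinstance(o, type):
            return o.__name__
        return super().default(o)       # raises TypeError for unknown types


print(json.dumps(ToyConfig(target=ToyTarget.GPU), cls=ConfigEncoder, sort_keys=True))
```

Support for further custom classes can be added as extra branches in `default`, which matches the extension path the MR mentions.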

Remove pystencils.GPU_DEVICE

- `SerialDataHandling` now performs the device selection upon construction. It can also be constructed with an explicit device number to deviate from the default selection.
- For `ParallelDataHandling`, the assignment of devices to MPI ranks _should_ be handled by waLBerla by calling `cudaSetDevice()`. It has [`selectDeviceBasedOnMpiRank`](https://i10git.cs.fau.de/walberla/walberla/-/blob/master/src/gpu/DeviceSelectMPI.cpp) for this purpose. I am not sure it actually calls it -- I think it should be called from [`MPIManager::initializeMPI`](https://i10git.cs.fau.de/walberla/walberla/-/blob/master/src/core/mpi/MPIManager.cpp). Right now everything probably just ends up on the first GPU.
- The kernel wrapper now determines the correct device by inspecting the fields.
- `gpu_indexing_params` needs an explicit device number; I don't think any kind of default is reasonable.
- Some tests now iterate over all devices instead of using a default device. This is actually the right thing to do because it tests whether the device selection works correctly.

lbmpy's test_gpu_block_size_limiting.py::test_gpu_block_size_limiting fails since !335, but that is due to an error in the test, which https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/146 fixes.

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/337 Add adjacent direcitons to stencil module 2023-07-12T18:13:30+02:00 Markus Holzer
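Two of the points in "Remove pystencils.GPU_DEVICE" can be sketched without any GPU library. The first function shows one common rank-to-device policy (round-robin), which is an assumption here and not necessarily what waLBerla's `selectDeviceBasedOnMpiRank` does. The second shows how a kernel wrapper could infer the device from its fields, modelled as a plain mapping from field name to device number. Both function names are hypothetical.

```python
def device_for_rank(rank, num_devices):
    """Round-robin assignment of MPI ranks to devices (assumed policy)."""
    if num_devices < 1:
        raise ValueError("at least one device is required")
    return rank % num_devices


def device_of_fields(field_devices):
    """Return the single device all fields live on, or fail loudly."""
    devices = set(field_devices.values())
    if len(devices) != 1:
        raise ValueError(f"fields must share exactly one device, got {sorted(devices)}")
    return devices.pop()


print(device_for_rank(5, 4))
print(device_of_fields({"src": 1, "dst": 1}))
```

Failing on mixed devices instead of silently picking one mirrors the MR's stance that implicit defaults are not reasonable here.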

Add adjacent direcitons to stencil module

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/339 Remove windows CI 2023-07-12T15:37:13+02:00 Markus Holzer

Remove windows CI

Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/335 Fix indexing for AMD GPUs 2023-07-08T12:43:56+02:00 Markus Holzer

Fix indexing for AMD GPUs

Due to https://github.com/cupy/cupy/issues/7676 `BlockIndexing` did not work correctly on AMD GPUs. This MR fixes it. Bug Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/333 Make AMD GPU support compatible with both hipcc and hiprtc 2023-06-30T21:53:24+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Make AMD GPU support compatible with both hipcc and hiprtc

Please give this a test on your AMD machine, @holzer. I think it should now work everywhere with both backend=nvcc and backend=nvrtc. Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/334 Re-enable test_loop_cutting.py::test_staggered_iteration 2023-06-30T08:55:41+02:00 Michael Kuron mkuron@icp.uni-stuttgart.de

Re-enable test_loop_cutting.py::test_staggered_iteration

It passes on current master, so don't xfail it. Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/332 Add experimental half precison support 2023-06-28T20:35:25+02:00 Markus Holzer

Add experimental half precison support

With this MR, experimental half-precision support is added. feature Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/331 Implement Pinned GPU memory 2023-06-24T08:23:36+02:00 Markus Holzer

Implement Pinned GPU memory

CPU arrays with an equivalent GPU array should be pinned. Further, this MR fixes non-aligned strides between CPU and GPU arrays. Bug Markus Holzer Markus Holzer https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/330 Replace PyCuda with CuPy 2023-06-23T08:31:06+02:00 Markus Holzer

Replace PyCuda with CuPy

Replaces [PyCuda](https://documen.tician.de/pycuda/) with [CuPy](https://cupy.dev/).

Advantages of [CuPy](https://cupy.dev/):

- AMD support
- probably better maintained due to NVIDIA support
- SciPy compatible

Fixes #70
Fixes #69

feature refactor Markus Holzer Markus Holzer