pystencils merge requests (as of 2023-09-19)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests

[FIX] Alignment detection (!351, Markus Holzer, 2023-09-19)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/351

For the SIMD vectorization it needs to be determined whether a memory address points to an aligned address or not. So far, this detection only worked for pointers depending on the inner loop counter.

Reveal base pointer spec (!350, Markus Holzer, 2023-09-14)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/350

This MR reveals the base pointer specification to the user in the config.

[BugFix] Fix indexing with ghost layers (!349, Markus Holzer, 2023-09-07)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/349

The block indexing has a bug when created with an iteration slice and ghost layers. With !341, the block indexing supports slices more naturally by limiting the iteration space to the sliced size. Thus the counter index is multiplied by the step size. This was also done for the offset of the ghost layers, which is wrong.
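Schematically, the slice step must scale only the counter, not the ghost-layer offset. The following is a minimal arithmetic sketch with assumed names (`ctr`, `gl_offset`, `step`); the actual code operates on pystencils' indexing expressions:

```python
def buggy_index(ctr, gl_offset, step):
    # wrong: the ghost-layer offset is scaled by the slice step as well
    return step * (ctr + gl_offset)

def fixed_index(ctr, gl_offset, step):
    # correct: only the counter is scaled; the offset just shifts the result
    return step * ctr + gl_offset

# with one ghost layer and a step of 2, counter 0 must map to cell 1, not 2
print(buggy_index(0, 1, 2), fixed_index(0, 1, 2))  # → 2 1
```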
This MR fixes the problem.

Fix integration pipeline (!348, Daniel Bauer, 2023-09-07)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/348

This MR fixes several issues with the integration pipeline.
1. !349.
2. Fixes an oversight introduced in https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344.
The issue was with expressions like:
```
a: scalar float = CastFunc(b: vector float, float)
```
The mentioned MR used `get_type_of_expression` to determine whether an expression will be vectorized.
That works in most cases due to type collation but fails for `CastFuncs`, which are always vectorized.
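One way to make this check robust is to ask whether an expression is entirely scalar, treating casts as always vectorized. A toy sketch with made-up classes (not pystencils' real AST):

```python
class Expr:
    """Tiny stand-in for an expression tree node."""
    def __init__(self, op, *args, is_cast=False):
        self.op, self.args, self.is_cast = op, args, is_cast

def is_entirely_scalar(expr):
    """True if no sub-expression will be vectorized.

    Unlike a type query on the root, this also treats casts as
    vectorized, since CastFuncs are always turned into vector casts."""
    if isinstance(expr, (int, float, str)):   # symbols/constants are scalar
        return True
    if expr.is_cast:                          # a CastFunc => vectorized
        return False
    return all(is_entirely_scalar(a) for a in expr.args)

scalar_product = Expr("*", "b", "c")
cast = Expr("cast", "b", is_cast=True)
print(is_entirely_scalar(scalar_product), is_entirely_scalar(cast))  # → True False
```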
This MR replaces `get_type_of_expression` by a small helper function that checks whether an expression is entirely scalar or requires additional vector casts.

Distinguish between SymPy and pystencils Assignment better (!347, Markus Holzer, 2023-08-29)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/347

This MR clarifies the usage of pystencils' SympyAssignment in contrast to the SymPy assignment.
In the backend of pystencils, only SympyAssignments are used now; they inherit from pystencils' base Node class.
fixes #61
Additionally, AddAugmentedAssignment is introduced for convenience.

Extension to field read extraction (!346, Markus Holzer, 2023-08-24)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/346

With `add_subexpressions_for_field_reads` it is possible to extract field reads from the kernel and put them in individual assignments. For mixed-precision kernels, however, it is useful if the lhs of this new assignment is of a given type. This isolates casts and prevents calculations in the data type of the stored values.

AVX512VL and AVX10 support (!345, Michael Kuron, 2023-08-23)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/345

AVX512VL is the 256-bit version of all the AVX512F instructions. It is primarily useful on those processors that only have one AVX512 vector unit and drastically reduce their clock frequency when executing 512-bit instructions.
For purposes of pystencils, this mostly means scatter/gather support (up to 30% improvements as per !241) and no reduced clock frequencies ([up](https://cdrdv2-public.intel.com/336065/336065_Intel%20Xeon%20Processor%20Scalable%20Family%20Public%20Specification%20Update_rev17.pdf) [to](https://cdrdv2-public.intel.com/338848/338848_2nd%20Gen%20Intel®%20Xeon®%20Scalable%20Processors%20Specification%20Update_Rev027US.pdf) 45% improvements on Xeon Bronze 31xx/32xx, Silver 41xx/42xx, Gold 51xx/52xx). I suppose we never bothered implementing it because it offers no advantage on Xeon Gold 61xx/62xx and Platinum with their two AVX512 units, nor on the newer x3xx and x4xx (or the Ice Lake/Tiger Lake/Rocket Lake desktop/laptop processors), which [don't clock down](https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html) anymore.
Many of Intel's future processors, however, will be using AVX10-256 instead of AVX512, which is, in a sense, halfway between AVX2 and AVX512. AVX10.1-128 and AVX10.1-256 are essentially a rebranded AVX512VL that can be enabled without AVX512F. This just needs a few changed ifdefs and awareness in the CPU detection. The /proc/cpuinfo flag is just a guess, but a very likely one.
AVX10.1-512 is the same as AVX512F and enabling one always enables the other. I've adapted the ifdefs nonetheless just in case.
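The CPU-detection side could look roughly like this; `avx512vl` is the established /proc/cpuinfo flag, while `avx10_1_256` is the guessed flag name mentioned above, so treat it as an assumption:

```python
def cpu_flags(cpuinfo_text):
    """Parse the flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def supports_256bit_evex(flags):
    # AVX512VL, or AVX10.1 at 256-bit width (flag name is a guess)
    return "avx512vl" in flags or "avx10_1_256" in flags

sample = "flags\t\t: fpu sse avx2 avx512f avx512vl"
print(supports_256bit_evex(cpu_flags(sample)))  # → True
```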
All information is based on https://www.phoronix.com/news/GCC-Lands-Initial-AVX10.1.

Vectorize all scalar symbols in vector expressions (!344, Daniel Bauer, 2023-09-04)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344

Pystencils fails to vectorize very simple kernels:
```python
import numpy as np
import pystencils as ps
from pystencils.astnodes import SympyAssignment, TypedSymbol
f = ps.fields("f: [1D]")
x = TypedSymbol("x", np.float64)
kernel = ps.create_kernel(
    [SympyAssignment(x, 2.0), SympyAssignment(f[0], x)],
    cpu_vectorize_info={"assume_inner_stride_one": True},
)
ps.show_code(kernel)
```
This example throws an exception in `show_code`, complaining that the printer cannot vectorize type casts.
The problem is that `x = 2.0` is moved out of the loop (since it is constant).
What remains in the loop is `f[i] = x`.
While the left-hand side of this expression is vectorized, the right-hand side is left scalar, leading to the exception.
The issue comes from the `insert_vector_casts` function.
It traverses each expression from the leaves to the root, leaving scalars scalar [^1] and collating mixed expressions to vectors.
However, it handles the rhs of assignments separately from the lhs, leading to the above issue.
Moreover, expressions like `a (vec) + (b (scalar) * c (scalar))` are converted to `a (vec) + CastToVec(b (scalar) * c (scalar))`, which leads to the same exception.
The correct way is to directly cast `b` and `c` to vectors, not their product.
Therefore, `insert_vector_casts` must know beforehand whether an expression appears inside a vectorized expression.
This MR fixes that for SympyAssignments.
To that end, it first checks whether either side contains a vectorized expression, and if so, casts all symbols to vectors.
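The idea can be illustrated on a toy expression tree (hypothetical names; the real transformation operates on pystencils AST nodes, with vectors represented by types rather than string suffixes):

```python
def contains_vector(expr):
    """Check whether any leaf of a nested-tuple expression is a vector."""
    if isinstance(expr, tuple):
        return any(contains_vector(a) for a in expr)
    return expr.endswith("(vec)")

def cast_scalars_to_vectors(expr):
    """Cast every scalar leaf individually, instead of wrapping
    whole scalar sub-expressions in a single cast."""
    if isinstance(expr, tuple):
        return tuple(cast_scalars_to_vectors(a) for a in expr)
    return expr if expr.endswith("(vec)") else f"vec_cast({expr})"

lhs, rhs = "f[i](vec)", ("x",)  # f[i] = x, with x left scalar
if contains_vector(lhs) or contains_vector(rhs):
    rhs = cast_scalars_to_vectors(rhs)
print(rhs)  # → ('vec_cast(x)',)
```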
Since I am not really sure how to handle the cases for `VectorMemoryAccess` ([line 370/386](https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344/diffs#2d8e7266b6ec295a8ceac4159fcd9cea9ede6ca8_370_386)) and `ast.Conditional` ([line 374/390](https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/344/diffs#2d8e7266b6ec295a8ceac4159fcd9cea9ede6ca8_374_390)), I left those untouched.
[^1]: The exception is that CastFunctions are always replaced by vector casts. I do not know whether this is intentional.

Do not reorder accesses in `move_constants_before_loop` (quickly) (!343, Daniel Bauer, 2023-08-18)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/343

Reimplementation of https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/342.
While playing around with the old MR, I realized that the changes proposed there have a significant impact on the execution time of `move_constants_before_loop` (for some kernels).
Before the MR, we would not descend into blocks, loops or conditionals to check whether dependencies are modified in their body.
The MR changed that for the sake of correctness.
However, the implementation was quite inefficient.
Note that for each assignment we must find a block to move the assignment to.
Essentially, the old MR would move up the AST, at each level determining a set of "critical symbols" by *descending* the tree from the current element again.
This means that the AST was traversed a lot, and set objects were created and updated a lot.
This MR changes this behavior.
Now, the AST is only traversed once, from the current assignment up to the block we can move the assignment to.
If we encounter blocks, loops, etc. on the way, we still descend into them.
However, we do this only once.
Moreover, the new implementation does not create a huge set of critical symbols but instead exits early once it finds a dependency.
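Schematically, the single upward traversal with early exit might look like this (toy node classes, not the real `ast.Node` interface):

```python
class Node:
    def __init__(self, symbols_modified=(), children=()):
        self.symbols_modified = set(symbols_modified)
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def modifies_dependency(node, dependencies):
    """Early exit: True as soon as any descendant modifies a dependency."""
    if node.symbols_modified & dependencies:
        return True
    return any(modifies_dependency(c, dependencies) for c in node.children)

def find_target_block(assignment, dependencies):
    """Walk up from the assignment; stop at the first ancestor level
    where a sibling subtree modifies one of our dependencies."""
    node = assignment
    while node.parent is not None:
        for sibling in node.parent.children:
            if sibling is not node and modifies_dependency(sibling, dependencies):
                return node.parent
        node = node.parent
    return node

write_x = Node(symbols_modified={"x"})
assign = Node()
inner = Node(children=[assign])
outer = Node(children=[write_x, inner])
root = Node(children=[outer])
print(find_target_block(assign, {"x"}) is outer)  # → True
```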
Overall, my not-sophisticated-at-all tests suggest that the new implementation is even slightly faster than the version from master.
The new implementation also does not change the `ast.Node` interface, which I like quite a lot.

Draft: Do not reorder accesses in `move_constants_before_loop` (!342, Daniel Bauer, 2023-08-18)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/342

Prior to this MR, `move_constants_before_loop` tries to move constants as far to the top as possible.
This might reorder read/write accesses to fields.
For example:
```python
import pystencils as ps
from pystencils import CreateKernelConfig
from pystencils.astnodes import Block, KernelFunction, LoopOverCoordinate, SympyAssignment
from pystencils.field import Field, FieldType
from sympy.abc import x, y
field = Field.create_generic("field", 1, field_type=FieldType.CUSTOM)
counter = LoopOverCoordinate.get_loop_counter_symbol(0)
load = SympyAssignment(x, field.absolute_access((counter,), (0,)))
store = SympyAssignment(field.absolute_access((counter+1,), (0,)), 2*x)
body = ps.typing.transformations.add_types(Block([load, store]), CreateKernelConfig())
loop = LoopOverCoordinate(body, 0, 0, 42)
block = Block([loop])
ps.transformations.resolve_field_accesses(block)
new_loops = ps.transformations.cut_loop(loop, [41])
ps.transformations.move_constants_before_loop(new_loops.args[1])
kernel = KernelFunction(
    block,
    ps.Target.CPU,
    ps.Backend.C,
    ps.cpu.cpujit.make_python_function,
    None,
)
code = ps.get_code_str(kernel)
print(code)
```
prints
```c
FUNC_PREFIX void kernel(double * RESTRICT _data_field, int64_t const _stride_field_0)
{
   const double x = _data_field[41*_stride_field_0];
   _data_field[42*_stride_field_0] = x*2.0;
   {
      for (int64_t ctr_0 = 0; ctr_0 < 41; ctr_0 += 1)
      {
         const double x = _data_field[_stride_field_0*ctr_0];
         _data_field[_stride_field_0*(ctr_0 + 1)] = x*2.0;
      }
      {
      }
   }
}
```
Note that the last (cut) loop iteration is moved before the primary loop, leading to a wrong load from index 41.
This MR changes `move_constants_before_loop` such that assignments cannot be moved before their last modification.
Essentially, it replaces `symbols_defined` by `symbols_modified` [here](https://i10git.cs.fau.de/terraneo/pystencils/-/commit/be78ab165339d593869b5c77ef00a590a63ba130#99785d4b53b75ce54c83c3e499248de2a07fb2cd_598_597).
This new property is implemented for all AST nodes.
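The distinction can be sketched with a toy node class (hypothetical; the real property lives on pystencils' AST classes): `symbols_defined` only covers symbols a node freshly declares, while `symbols_modified` must also include writes to already-resolved field pointers:

```python
class Assignment:
    def __init__(self, lhs, declares):
        self.lhs, self._declares = lhs, declares

    @property
    def symbols_defined(self):
        # only fresh declarations, e.g. `const double x = ...`
        return {self.lhs} if self._declares else set()

    @property
    def symbols_modified(self):
        # every write counts, including stores through resolved field pointers
        return {self.lhs}

load = Assignment("x", declares=True)
store = Assignment("_data_field", declares=False)
print(store.symbols_defined, store.symbols_modified)  # → set() {'_data_field'}
```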
Note the implementation of `CustomCCodeNode`. I did not want to introduce breaking changes to the API.
Additionally, declarations are now inserted where the caller requests, instead of pushing them all the way to the top (https://i10git.cs.fau.de/terraneo/pystencils/-/commit/5c65d06216d050c22e28ba0b9487544342fc0926).
Lastly, a test for the new behavior is included.

Refactor gpu indexing (!341, Markus Holzer, 2023-09-04)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/341

To map an iteration space to GPU threads, indexing classes are used. These indexing classes receive a field and an iteration slice to determine the iteration space. This MR refactors the indexing classes to directly receive an iteration space. With this, the indexing classes are more general and not dependent on pystencils Fields.
Further improvements/fixes:
- Line indexing now works with iteration slices. This did not work at all before.
- Both indexing schemes calculate a correct block and grid size for iteration slices. This means that if, for example, only every second element is touched (due to a given iteration slice), the number of threads will be halved. This removes the modulo calculation that was needed before.
- Both indexing schemes now support up to 4 dimensions.

Fix symbol counters (!340, Markus Holzer, 2023-07-25)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/340

When simplifications are applied to an AssignmentCollection that is created with assignments coming from another, previously simplified AssignmentCollection, the counter for symbol creation was not respected.

Remove windows CI (!339, Markus Holzer, 2023-07-12)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/339

JSON Serializer for pystencils config (!338, Helen Schottenhamml, 2023-07-17)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/338
This MR adds a custom JSON serializer to allow pystencils configs to be used as parameters in databases. This is useful in parameter studies when using the more modern way of setting up simulations, i.e., using pystencils' CreateKernelConfig.
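A serializer of this kind might look roughly like the following sketch, using a plain dataclass as a stand-in for pystencils' actual CreateKernelConfig:

```python
import dataclasses
import json

@dataclasses.dataclass
class KernelConfig:          # stand-in for pystencils' CreateKernelConfig
    target: str = "cpu"
    data_type: str = "float64"

class ConfigEncoder(json.JSONEncoder):
    """JSON encoder that understands dataclass-based configs."""
    def default(self, obj):
        if dataclasses.is_dataclass(obj):
            return dataclasses.asdict(obj)
        return super().default(obj)   # fall back for unknown types

print(json.dumps({"config": KernelConfig()}, cls=ConfigEncoder))
# → {"config": {"target": "cpu", "data_type": "float64"}}
```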
It can be extended in the future for other custom classes if needed.

Add adjacent directions to stencil module (!337, Markus Holzer, 2023-07-12)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/337

Remove pystencils.GPU_DEVICE (!336, Michael Kuron, 2023-07-13)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/336

- `SerialDataHandling` now performs the device selection upon construction. It can also be constructed with an explicit device number to deviate from the default selection.
- For `ParallelDataHandling`, the assignment of devices to MPI ranks _should_ be handled by Walberla by calling `cudaSetDevice()`. It has [`selectDeviceBasedOnMpiRank`](https://i10git.cs.fau.de/walberla/walberla/-/blob/master/src/gpu/DeviceSelectMPI.cpp) for this purpose. I am not sure it actually calls it -- I think it should be called from [`MPIManager::initializeMPI`](https://i10git.cs.fau.de/walberla/walberla/-/blob/master/src/core/mpi/MPIManager.cpp). Right now everything probably just ends up on the first GPU.
- The kernel wrapper now determines the correct device by inspecting the fields.
- `gpu_indexing_params` needs an explicit device number, I don't think any kind of default is reasonable.
- Some tests now iterate over all devices instead of using a default device. This is actually the right thing to do because it tests whether the device selection works correctly.
lbmpy's test_gpu_block_size_limiting.py::test_gpu_block_size_limiting fails since !335, but that is due to an error in the test, which https://i10git.cs.fau.de/pycodegen/lbmpy/-/merge_requests/146 fixes.

Fix indexing for AMD GPUs (!335, Markus Holzer, 2023-07-08)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/335

Due to https://github.com/cupy/cupy/issues/7676, `BlockIndexing` did not work correctly on AMD GPUs. This MR fixes it.

Re-enable test_loop_cutting.py::test_staggered_iteration (!334, Michael Kuron, 2023-06-30)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/334

It passes on current master, so don't xfail it.

Make AMD GPU support compatible with both hipcc and hiprtc (!333, Michael Kuron, 2023-06-30)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/333

Please give this a test on your AMD machine, @holzer. I think it should now work everywhere with both backend=nvcc and backend=nvrtc.

Add experimental half precision support (!332, Markus Holzer, 2023-06-28)
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/332

With this MR, experimental half-precision support is added.