pystencils merge requests

Conftest: ignore two more files if waLBerla is not available

2019-08-21T18:43:43+02:00

- Conftest: ignore two more files if waLBerla is not available (need when executing
- Skip collection of `pystencils.autodiff` always (not only if `'CI' in os.environ`)

Implement sp.Sum, sp.Product

2019-08-21T18:45:35+02:00

`Sum` and `Product` have an indexing variable which is an `Atom` but not a free symbol. So the logic that determines the undefined symbols in a `SympyAssignment` should not use `atoms(sp.Symbol)` but `free_symbols`. `sp.Indexed` from the `ResolvedFieldAccess`es forms an edge case, so we could also use `atoms(sp.Symbol).intersection(...free_symbols)`. I hope I extracted all the code necessary to implement this feature from my fork.
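The difference between the two symbol queries can be checked directly in SymPy. This is a minimal sketch with illustrative symbol names (`i`, `n`, `x` are not taken from the MR); it only shows SymPy's behavior, not the pystencils change itself.

```python
import sympy as sp

i, n, x = sp.symbols("i n x")

# i is the bound summation index; n and x are free.
expr = sp.Sum(x**i, (i, 0, n))

all_symbols = expr.atoms(sp.Symbol)   # includes the bound index i
free_symbols = expr.free_symbols      # excludes the bound index i

print(sorted(s.name for s in all_symbols))
print(sorted(s.name for s in free_symbols))
```

Only the symbols in `free_symbols` need to be defined outside the expression, which is why it is the better basis for finding undefined symbols.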

Fix get_type_of_expression for constants like sympy.pi

2019-08-22T08:31:17+02:00

Problem: some constant expressions are neither `Float`, `Integer`, nor `Rational`, and don't have arguments.
```python
>>> from sympy import *
>>> isinstance(pi, Integer)
False
>>> isinstance(pi, Float)
False
>>> isinstance(pi, Rational)
F...
```
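The properties that make such constants an edge case can be inspected as follows. This is a sketch of SymPy's behavior only; the variable names are illustrative and it does not reproduce the fix in `get_type_of_expression`.

```python
import sympy as sp

constant = sp.pi

# pi is none of the concrete number classes ...
is_concrete_number = isinstance(constant, (sp.Integer, sp.Float, sp.Rational))

# ... and, being an atom, it has no arguments to recurse into.
has_args = bool(constant.args)

# It is a NumberSymbol, and is_number still reports it as numeric.
is_number_symbol = isinstance(constant, sp.NumberSymbol)
is_numeric = constant.is_number
```

A type-inference routine that only handles the three number classes and otherwise recurses into `args` therefore has nothing to work with for `pi`.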

Basic support for OpenCL (experimental)

2019-08-22T08:37:37+02:00

Basic support for OpenCL
Problem: OpenCL cannot import `stdint.h`. Temporary fix: define custom `opencl_stdint.h` (~~defines currently only `int64_t`~~)
TODO:
- ~~implement `opencl_stdint.h`~~
- implement shared_mem, textures, built-in functions
- ~~avoid CUDA intrinsics (`fast_div`)~~

AES-NI Random Number Generator

2019-09-02T10:21:21+02:00

I was looking at how to vectorize the Philox RNG yesterday. Before I knew it, I had implemented a working RNG using AES-NI instructions :nerd: ... Not entirely what I had intended to do, but it might still be useful to someone and should be similarly fast as a vectorized Philox. There is one place that could be optimized because I fall back to scalar instructions: I failed to reimplement `_mm_cvtepu64_pd` (the solution from https://stackoverflow.com/a/41148578 produces incorrect results in the least-significant half of the mantissa). Perhaps someone else can try to fix that. I did not integrate this with the `vector_instruction_set` parameter of the code generation. Perhaps you can do that, @bauer. It needs support for SSE2 and AES instructions (which look like SSE2 instructions, but their availability is determined by a separate CPUID flag). It will also make use of `_mm_cvtepu32_ps` and `_mm_cvtepu64_pd` from AVX512 if available (these are 128-bit instructions that actually look like SSE2 instructions).

Add PyPI badge

2019-09-02T13:40:44+02:00

Badge with current PyPI version and link to the PyPI page.

Fix typo in "pre-push"

2019-09-02T13:41:09+02:00

Fix typo in "pre-push"

Add pyhtml to tests and artifacts

2019-09-02T13:42:18+02:00

I think this should suffice to produce the artifacts. But `pytest-html` and `ansi2html` need to be added to the docker images.

Actually increment counter inside random_symbol

2019-09-04T15:06:57+02:00

@rudolfweeber is currently looking at the statistical mechanics of the fluctuating LB and found a velocity bias. It turned out that this is due to all generated random numbers using the same key. Instead, it should be incremented when generating multiple random numbers in the same kernel. So in the generated code,
```c++
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 0, 2, Dummy_38, Dummy_39, Dummy_40, Dummy_41);
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 0, 2, Dummy_34, Dummy_35, Dummy_36, Dummy_37);
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 0, 2, Dummy_30, Dummy_31, Dummy_32, Dummy_33);
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 0, 2, Dummy_26, Dummy_27, Dummy_28, Dummy_29);
```
becomes
```c++
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 3, 2, Dummy_38, Dummy_39, Dummy_40, Dummy_41);
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 2, 2, Dummy_34, Dummy_35, Dummy_36, Dummy_37);
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 1, 2, Dummy_30, Dummy_31, Dummy_32, Dummy_33);
philox_float4(time_step, ctr_0, ctr_1, ctr_2, 0, 2, Dummy_26, Dummy_27, Dummy_28, Dummy_29);
```
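Why identical arguments cause a bias can be illustrated with a toy counter-based generator. The sketch below is not Philox: it uses SHA-256 as a stand-in, and the function name `toy_counter_rng` and its parameters are hypothetical. It only demonstrates that a counter-based RNG is a pure function of its inputs.

```python
import hashlib
import struct


def toy_counter_rng(time_step, cell, call_index, seed):
    """Toy stand-in for a counter-based RNG: hash the inputs to a float in [0, 1)."""
    packed = struct.pack("<4Q", time_step, cell, call_index, seed)
    digest = hashlib.sha256(packed).digest()
    value = struct.unpack("<Q", digest[:8])[0]
    # Keep the top 53 bits so the result is exactly representable and < 1.
    return (value >> 11) / 2.0**53


# Same arguments for every call: all "random" numbers in the kernel coincide.
same = [toy_counter_rng(7, 42, 0, 2) for _ in range(4)]

# Incrementing one argument per call yields distinct numbers.
distinct = [toy_counter_rng(7, 42, call, 2) for call in range(4)]
```

With identical arguments, every call inside one kernel returns the same values, so quantities that should be independent become perfectly correlated.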

Close pystencils' config file after writing

2019-09-17T09:06:15+02:00

I got a warning that this file remains unclosed.
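A common way to avoid this kind of warning is to write the file inside a `with` block. The following is a minimal sketch; the function name `write_config` and the configuration content are hypothetical and not taken from pystencils.

```python
import json
import os
import tempfile


def write_config(path, config):
    """Write config as JSON; the with-block closes the file even on errors."""
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return f.closed


config_path = os.path.join(tempfile.mkdtemp(), "config.json")
closed = write_config(config_path, {"cpu": {"compiler": "g++"}})

# Read the file back to confirm the content was flushed to disk.
with open(config_path) as f:
    round_trip = json.load(f)
```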

CI: Add minimal-sympy-master

2019-09-17T09:06:54+02:00

Test for #11. This could warn us when SymPy introduces breaking changes. The test with SymPy master is allowed to fail.

AES-NI vectorization improvements

2019-09-17T09:08:05+02:00

!30 didn't implement an SSE-vectorized `_mm_cvtepu64_pd` equivalent because the [stackoverflow](https://stackoverflow.com/a/41148578) solution didn't work. That turned out to be due to a bad optimization in GCC 5+ in fast-math mode. None of the other compilers (Clang, Intel, MSVC) have that issue, so we just disable fast-math for that function. Also, we now use fused multiply-add if available.

Address #13: Use sympy.codegen.rewriting.optimize

2019-09-23T10:55:13+02:00

It's really comfortable to write optimizations in terms of `sympy.codegen.rewriting.ReplaceOptim`:
```python
from sympy.codegen.rewriting import ReplaceOptim

# Evaluates all constant terms
evaluate_constant_terms = ReplaceOptim(
    lambda e: hasattr(e, 'is_constant') and e.is_constant,
    lambda p: p.evalf()
)
```
This PR adds a parameter `sympy_optimizations` to the `create_*_kernel` functions that applies the list of optimizations to the assignments before creating the AST. `sympy.codegen.rewriting` already ships some optimizations, some of them similar to the optimizations of pystencils. For example, `create_expand_pow_optimization(limit)` is really similar to the logic in `CustomSympyPrinter._print_Pow`. See #13.
Problem: old versions of SymPy (e.g. from the Ubuntu CI) don't have `sympy.codegen.rewriting`. The optimizations are skipped in that case. `test_and_coverage` applies all optimizations. We could also try to implement an FMA optimization (fused multiply-add) with that and `sympy.Wild`.
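How such an optimization is applied can be sketched with SymPy alone. The optimization `evaluate_number_symbols` below is a hypothetical example, not one of the optimizations added by this PR; it replaces named constants such as `pi` by floating-point values and is run through `sympy.codegen.rewriting.optimize`.

```python
import sympy as sp
from sympy.codegen.rewriting import ReplaceOptim, optimize

x = sp.Symbol("x")

# Hypothetical optimization: replace named constants such as pi by floats.
evaluate_number_symbols = ReplaceOptim(
    lambda e: isinstance(e, sp.NumberSymbol),
    lambda e: e.evalf(),
)

expr = x * sp.pi + sp.E
optimized = optimize(expr, [evaluate_number_symbols])
print(optimized)
```

`optimize` applies each entry of the list in turn, so a list of optimizations can be handed over as a whole, much like the `sympy_optimizations` parameter described above.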

Sort headers/global definitions to enable reproducible code generation

2019-09-23T11:03:53+02:00

`headers` and `global_declarations` are generated by methods that return sets. So even with the same inputs, the same source code is not guaranteed, because sets do not guarantee a specific iteration order. I was surprised that my generated code often could not be reused from the cache. The problem was that the included headers appeared in random order.
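The idea of the fix can be sketched in a few lines. The function `render_includes` and the header names are hypothetical; the point is only that sorting the set before iterating makes the output independent of the set's internal order, which for strings can change between interpreter runs because of hash randomization.

```python
def render_includes(headers):
    """Emit one #include line per header in a deterministic (sorted) order."""
    return "\n".join(f"#include {h}" for h in sorted(headers))


headers = {'"stdint.h"', "<cmath>", '"philox_rand.h"'}
source = render_includes(headers)
print(source)
```

Because the generated text is now identical for identical inputs, a cache keyed on a hash of the source can find the previously compiled result.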

Compile CUDA using the LLVM backend

2019-09-23T12:49:30+02:00

We can compile CUDA to PTX using the LLVM backend :wink: `llc` produces PTX files without complaining.

Use get_type_of_expression in typing_from_sympy_inspection to infer types

2019-09-23T16:16:50+02:00

DANGER ZONE: this changes something in the core behavior of pystencils. Be careful before merging! In summary, when `typing_from_sympy_inspection` reaches the point where it would just use `default_type`, we try to use `get_type_of_expression` to infer the actual type. We use information about variables previously defined in the current scope. Another approach would be to just type all the intermediate variables with `auto`.
```python
import sympy
import pystencils
# Import path as of the pystencils version this MR targets (assumption)
from pystencils.data_types import cast_func, create_type

a, b, c, d, e, f = sympy.symbols('a b c d e f')
x = pystencils.fields('x: float32[3d]')
assignments = pystencils.AssignmentCollection({
    a: cast_func(10, create_type('float64')),
    b: cast_func(10, create_type('uint16')),
    e: 11,
    c: b,
    f: c + b,
    d: c + b + x.center + e,
    x.center: c + b + x.center
})
```
Before:
```cpp
FUNC_PREFIX void kernel(float * RESTRICT _data_x, int64_t const _size_x_0, int64_t const _size_x_1, int64_t const _size_x_2, int64_t const _stride_x_0, int64_t const _stride_x_1, int64_t const _stride_x_2)
{
   const double a = 10.0;
   const double b = 10;
   const double e = 11.0;
   const double c = b;
   const double f = b + c;
   for (int ctr_0 = 0; ctr_0 < _size_x_0; ctr_0 += 1)
   {
      float * RESTRICT _data_x_00 = _data_x + _stride_x_0*ctr_0;
      for (int ctr_1 = 0; ctr_1 < _size_x_1; ctr_1 += 1)
      {
         float * RESTRICT _data_x_00_10 = _stride_x_1*ctr_1 + _data_x_00;
         for (int ctr_2 = 0; ctr_2 < _size_x_2; ctr_2 += 1)
         {
            const double d = b + c + e + _data_x_00_10[_stride_x_2*ctr_2];
            _data_x_00_10[_stride_x_2*ctr_2] = b + c + _data_x_00_10[_stride_x_2*ctr_2];
         }
      }
   }
}
```
After:
```cpp
FUNC_PREFIX void kernel(float * RESTRICT _data_x, int64_t const _size_x_0, int64_t const _size_x_1, int64_t const _size_x_2, int64_t const _stride_x_0, int64_t const _stride_x_1, int64_t const _stride_x_2)
{
   const double a = 10.0;
   const uint16_t b = 10;
   const int64_t e = 11.0;
   const uint16_t c = b;
   const uint16_t f = b + c;
   for (int ctr_0 = 0; ctr_0 < _size_x_0; ctr_0 += 1)
   {
      float * RESTRICT _data_x_00 = _data_x + _stride_x_0*ctr_0;
      for (int ctr_1 = 0; ctr_1 < _size_x_1; ctr_1 += 1)
      {
         float * RESTRICT _data_x_00_10 = _stride_x_1*ctr_1 + _data_x_00;
         for (int ctr_2 = 0; ctr_2 < _size_x_2; ctr_2 += 1)
         {
            const float d = b + c + e + _data_x_00_10[_stride_x_2*ctr_2];
            _data_x_00_10[_stride_x_2*ctr_2] = b + c + _data_x_00_10[_stride_x_2*ctr_2];
         }
      }
   }
}
```

Extra asserts sympy issue

2019-09-25T15:38:17+02:00

Add extra assertions to be super sure.

Interpolation 24.0.9

2019-09-25T15:41:24+02:00

This is another rebased PR for integrating interpolated accesses. Interpolation accesses work like `absolute_access`, except they can be safely applied on all fields (i.e. with boundary checks). More info here: !20. This PR contains some dead code that uses https://github.com/theHamsta/CubicInterpolationCUDA . I have not included it as a submodule in pystencils in this PR. This PR breaks the hash of these two tests:
```
[gw11] [ 14%] FAILED lbmpy_tests/test_code_hashequivalence.py::test_hash_equivalence_llvm
lbmpy_tests/test_conserved_quantity_relaxation_invariance.py::test_srt
[gw8] [ 15%] FAILED lbmpy_tests/test_code_hashequivalence.py::test_hash_equivalence
```

Add AssignmentCollection.{free_fields,bound_fields}

2019-09-25T15:41:44+02:00

Add AssignmentCollection.{free_fields,bound_fields}

Document backends.json

2019-09-26T12:49:19+02:00

Document backends.json