Commit 1e02cdc7 authored by Martin Bauer's avatar Martin Bauer
Browse files

Separated modules into subfolders with own setup.py

This restructuring allows for easier separation of modules into
separate repositories later. Also, now pip install with repo url can be
used.

The setup.py files have also been updated to correctly reference each
other. Module versions are not extracted from git state
parent b91e6990
%% Cell type:code id: tags:
``` python
from pystencils.session import *
import timeit
%load_ext Cython
```
%% Cell type:markdown id: tags:
# Demo: Benchmark numpy, cython, pystencils
In this benchmark we compare different ways of implementing a simple stencil kernel in Python.
The benchmark kernel computes the average of the four neighbors in 2D and stores in a second array. To prevent out-of-bounds accesses, we skip the cells at the border and compute values only in the range `[1:-1, 1:-1]`
%% Cell type:markdown id: tags:
## Implementations
The first implementation is a pure Python implementation:
%% Cell type:code id: tags:
``` python
def avg_pure_python(input_arr, output_arr):
for x in range(1, input_arr.shape[0] - 1):
for y in range(1, input_arr.shape[1] - 1):
output_arr[x, y] = (input_arr[x + 1, y] + input_arr[x - 1, y] +
input_arr[x, y + 1] + input_arr[x, y - 1]) / 4
```
%% Cell type:markdown id: tags:
Obviously, this will be a rather slow version, since the loops are written directly in Python.
Next, we use *numpy* functions to delegate the looping to numpy. The first version uses the `roll` function to shift the array by one element in each direction. This version has to allocate a new array for each accessed neighbor.
%% Cell type:code id: tags:
``` python
def avg_numpy_roll(input_arr, output_arr):
neighbors = [np.roll(input_arr, axis=a, shift=s) for a in (0, 1) for s in (-1, 1)]
np.divide(sum(neighbors), 4, out=output_arr)
```
%% Cell type:markdown id: tags:
Using views, we can get rid of the additional copies:
%% Cell type:code id: tags:
``` python
def avg_numpy_slice(input_arr, output_arr):
output_arr[1:-1, 1:-1] = input_arr[2:, 1:-1] + input_arr[:-2, 1:-1] + \
input_arr[1:-1, 2:] + input_arr[1:-1, :-2]
```
%% Cell type:markdown id: tags:
To further optimize the kernel we switch to Cython, to get a compiled C version.
%% Cell type:code id: tags:
``` python
%%cython
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def avg_cython(object[double, ndim=2] input_arr, object[double, ndim=2] output_arr):
cdef int xs, ys, x, y
xs, ys = input_arr.shape
for x in range(1, xs - 1):
for y in range(1, ys - 1):
output_arr[x, y] = (input_arr[x + 1, y] + input_arr[x - 1, y] +
input_arr[x, y + 1] + input_arr[x, y - 1]) / 4
```
%% Cell type:markdown id: tags:
And finally we also create a *pystencils* version of the same stencil code:
%% Cell type:code id: tags:
``` python
src, dst = ps.fields("src, dst: [2D]")
update = ps.Assignment(dst[0,0],
(src[1, 0] + src[-1, 0] + src[0, 1] + src[0, -1]) / 4)
kernel = ps.create_kernel(update).compile()
def avg_pystencils(input_arr, output_arr):
kernel(src=input_arr, dst=output_arr)
```
%% Cell type:code id: tags:
``` python
all_implementations = {
'pure Python': avg_pure_python,
'numpy roll': avg_numpy_roll,
'numpy slice': avg_numpy_slice,
'Cython': None,
'pystencils': avg_pystencils,
}
if 'avg_cython' in globals():
all_implementations['Cython'] = avg_cython
else:
del all_implementations['Cython']
```
%% Cell type:markdown id: tags:
## Benchmark functions
We implement a short function to get in- and output arrays of a given shape and to measure the runtime.
%% Cell type:code id: tags:
``` python
def get_arrays(shape):
in_arr = np.random.rand(*shape)
out_arr = np.empty_like(in_arr)
return in_arr, out_arr
def do_benchmark(func, shape):
in_arr, out_arr = get_arrays(shape)
timer = timeit.Timer('f(a, b)', globals={'f': func, 'a': in_arr, 'b': out_arr})
calls, time_taken = timer.autorange()
return time_taken / calls
```
%% Cell type:markdown id: tags:
## Comparison
%% Cell type:code id: tags:
``` python
def bar_plot(*shape):
names = tuple(all_implementations.keys())
runtimes = tuple(do_benchmark(all_implementations[name], shape) for name in names)
for runtime, name in zip(runtimes, names):
assert runtime >= runtimes[names.index('pystencils')], runtimes
speedups = tuple(runtime / min(runtimes) for runtime in runtimes)
y_pos = np.arange(len(names))
labels = tuple(f"{name} ({round(speedup, 1)} x)" for name, speedup in zip(names, speedups))
plt.text(0.5, 0.5, f"Size {shape}", horizontalalignment='center', fontsize=16,
verticalalignment='center', transform=plt.gca().transAxes)
plt.barh(y_pos, runtimes, log=True)
plt.yticks(y_pos, labels);
plt.xlabel('Runtime of single iteration')
plt.figure(figsize=(8, 8))
plt.subplot(3, 1, 1)
bar_plot(16, 16)
plt.subplot(3, 1, 2)
bar_plot(128, 128)
plt.subplot(3, 1, 3)
bar_plot(1024, 1024)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
All runtimes are plotted logarithmically. Next number next to the labels shows how much slower the version is than the fastest one. For small arrays Cython produces faster code than *pystencils*. The larger the arrays, the better pystencils gets.
This source diff could not be displayed because it is too large. You can view the blob instead.
API Reference
=============
.. toctree::
:maxdepth: 3
kernel_compile_and_call.rst
simplifications.rst
datahandling.rst
configuration.rst
field.rst
finite_differences.rst
plot.rst
ast.rst
*********************************************
For developers: AST Nodes and Transformations
*********************************************
AST Nodes
=========
.. automodule:: pystencils.astnodes
:members:
Transformations
===============
.. automodule:: pystencils.transformations
:members:
*************
Configuration
*************
.. automodule:: pystencils.cpu.cpujit
\ No newline at end of file
************
DataHandling
************
.. autoclass:: pystencils.datahandling.DataHandling
:members:
\ No newline at end of file
*****
Field
*****
.. automodule:: pystencils.field
:members:
\ No newline at end of file
******************
Finite Differences
******************
.. automodule:: pystencils.fd
:members:
*****************************************
Creating and calling kernels from Python
*****************************************
Creating kernels
----------------
.. autofunction:: pystencils.create_kernel
.. autofunction:: pystencils.create_indexed_kernel
.. autofunction:: pystencils.create_staggered_kernel
Code printing
-------------
.. autofunction:: pystencils.show_code
GPU Indexing
-------------
.. autoclass:: pystencils.gpucuda.AbstractIndexing
:members:
.. autoclass:: pystencils.gpucuda.BlockIndexing
:members:
.. autoclass:: pystencils.gpucuda.LineIndexing
:members:
**********************
Plotting and Animation
**********************
.. automodule:: pystencils.plot2d
:members:
***************************************
Assignment Collection & Simplifications
***************************************
AssignmentCollection
====================
.. autoclass:: pystencils.AssignmentCollection
:members:
Simplifications
===============
.. automodule:: pystencils.simp
:members:
Tutorials
=========
These tutorials are a good place to start if you are new to pystencils.
All tutorials and demos listed here are based on Jupyter notebooks that you can find in the pystencils repository.
It is a good idea to download them and run them directly to be able to play around with the code.
.. toctree::
:maxdepth: 1
/notebooks/01_tutorial_getting_started.ipynb
/notebooks/02_tutorial_basic_kernels.ipynb
/notebooks/03_tutorial_advection_diffusion.ipynb
/notebooks/04_tutorial_phasefield_spinodal_decomposition.ipynb
/notebooks/05_tutorial_phasefield_dentritic_growth.ipynb
/notebooks/06_tutorial_datahandling.ipynb
/notebooks/demo_assignment_collection.ipynb
/notebooks/demo_benchmark.ipynb
/notebooks/demo_wave_equation.ipynb
import os
from time import perf_counter
import subprocess
from tempfile import TemporaryDirectory
from pystencils import create_data_handling
from pystencils.backends.cbackend import CBackend
from jinja2 import Environment, FileSystemLoader
from pystencils.backends.cbackend import generate_c
script_path = os.path.dirname(os.path.realpath(__file__))
PAXX_ROOT = '/local/bauer/code/pacxx/install'
DEFAULT_PAXX_COMPILE_OPTIONS = ('-Ofast', '-march=native')
def generate_benchmark_code(target_file, kernel_ast, target):
assert target in ('cpu', 'gpu')
assert hasattr(kernel_ast, 'indexing'), "AST has to be a CUDA kernel in order to create a PACXX kernel from it"
backend = CBackend()
function_body = kernel_ast.body
f_sizes = {f.shape[-1] for f in kernel_ast.fields_accessed}
assert len(f_sizes) == 1
env = Environment(loader=FileSystemLoader(script_path))
result = env.get_template("benchmark_template.cpp").render(f_size=f_sizes.pop(),
code=backend(function_body),
target=target)
with open(target_file, 'w') as f:
f.write(result)
def pacxx_compile(source, executable, options=DEFAULT_PAXX_COMPILE_OPTIONS):
command = ['pacxx++', *options, source, '-o', executable, ]
env = os.environ.copy()
env['PATH'] = "{}:{}".format(env.get('PATH', ''), os.path.join(PAXX_ROOT, 'bin'))
env['LD_LIBRARY_PATH'] = "{}:{}".format(env.get('LD_LIBRARY_PATH', ''), os.path.join(PAXX_ROOT, 'lib'))
try:
subprocess.check_output(command, env=env, stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as e:
print(" ".join(command))
print(e.output.decode('utf8'))
raise e
def run_paxx_benchmark(executable, domain_size, iterations):
assert len(domain_size) == 3
arguments = [executable, *domain_size, iterations]
arguments = [str(e) for e in arguments]
output = subprocess.check_output(arguments)
return float(output) / iterations
def paxx_benchmark(ast, domain_size, iterations, target='cpu', compile_options=DEFAULT_PAXX_COMPILE_OPTIONS):
"""Generates, compiles and runs the kernel with PAXX
Args:
ast: pystencils AST object (has to be generated for CUDA, even when run on CPU with pacxx)
domain_size: x, y, z extent of spatial domain
iterations: number of outer iterations
target: either 'cpu' or 'gpu' to specify where pacxx should run the kernel
compile_options: compile options for pacxx
Returns:
seconds for one outer iteration
"""
with TemporaryDirectory() as base_dir:
code = os.path.join(base_dir, 'code.cpp')
executable = os.path.join(base_dir, 'bench')
generate_benchmark_code(code, ast, target)
pacxx_compile(code, executable, compile_options)
time_per_iteration = run_paxx_benchmark(executable, domain_size, iterations)
return time_per_iteration
def lbm_performance_compare(domain_size, iterations, **lb_params):
"""Runs benchmark with pacxx and with normal pystencils backends.
Args:
domain_size: 3-tuple with size of spatial domain
iterations: number of outer iterations
**lb_params: parameters passed to lbmpy to choose lattice Boltzmann algorithm & optimization options
Returns:
dictionary with measurements of time per iteration for different backends
"""
import pycuda.driver as drv
from lbmpy.creationfunctions import create_lb_ast
if 'optimization' not in lb_params:
lb_params['optimization'] = {}
lb_params['optimization']['target'] = 'cpu'
cpu_ast = create_lb_ast(**lb_params)
lb_params['optimization']['target'] = 'gpu'
gpu_ast = create_lb_ast(**lb_params)
# print kernel code of CPU or GPU version - just for comparison, files are not used
with open("pystencils_cpu_code.c", 'w') as f:
print(generate_c(cpu_ast), file=f)
with open("pystencils_gpu_code.cu", 'w') as f:
print(generate_c(gpu_ast), file=f)
cpu_kernel = cpu_ast.compile()
gpu_kernel = gpu_ast.compile()
f_sizes = {f.shape[-1] for f in cpu_ast.fields_accessed}
assert len(f_sizes) == 1
f_size = f_sizes.pop()
dh = create_data_handling(domain_size, default_target='gpu', default_layout='fzyx')
dh.add_array('src', values_per_cell=f_size)
dh.add_array('dst', values_per_cell=f_size)
dh.fill('src', 0)
dh.fill('dst', 0)
# to keep it simple we run outer loop directly from Python
# make domain size large enough, otherwise we measure the python call overhead
def run_benchmark(kernel):
dh.all_to_gpu()
for i in range(10): # warmup
dh.run_kernel(kernel)
drv.Context.synchronize()
start = perf_counter()
for i in range(iterations):
dh.run_kernel(kernel)
drv.Context.synchronize()
return (perf_counter() - start) / iterations
return {
'pystencils_cpu': run_benchmark(cpu_kernel),
'pystencils_gpu': run_benchmark(gpu_kernel),
'pacxx_cpu': paxx_benchmark(gpu_ast, domain_size, iterations, target='cpu'),
'pacxx_gpu': paxx_benchmark(gpu_ast, domain_size, iterations, target='gpu'),
}
if __name__ == '__main__':
no_opt = {
'openmp': 8, # number of threads - pacxx uses also HT cores
'split': False,
'vectorization': False,
'gpu_indexing_params': {'block_size': (64, 8, 1)},
}
only_vectorization = {
'openmp': 4,
'split': False,
'gpu_indexing_params': {'block_size': (64, 8, 1)},
'vectorization': {'instruction_set': 'avx',
'assume_inner_stride_one': True,
'nontemporal': False},
}
best = {
'openmp': 4,
'split': True,
'gpu_indexing_params': {'block_size': (64, 8, 1)},
'vectorization': {'instruction_set': 'avx',
'assume_inner_stride_one': True,
'nontemporal': True}
}
res = lbm_performance_compare(stencil='D3Q19', relaxation_rate=1.8, compressible=False,
domain_size=(512, 128, 32), iterations=500,
optimization=only_vectorization)
cpu_speedup = ((res['pacxx_cpu'] / res['pystencils_cpu']) - 1) * 100
gpu_speedup = ((res['pacxx_gpu'] / res['pystencils_gpu']) - 1) * 100
print("Time for one kernel call [s]")
for config_name, time in res.items():
print(" {0: <16}: {1}".format(config_name, time))
print("CPU {:.02f}% GPU {:.02f}%".format(cpu_speedup, gpu_speedup))
#include <PACXX.h>
#include <vector>
#include <sstream>
#include <iostream>
#include <chrono>
using namespace pacxx::v2;
size_t division_round_up(size_t a, size_t b)
{
if( a % b == 0)
return a / b;
else
return (a / b) + 1;
}
int main(int argc, char** argv)
{
{% if target == 'cpu' %}
Executor::Create<NativeRuntime>(0);
{% elif target == 'gpu' %}
Executor::Create<CUDARuntime>(0);
{% endif %}
if( argc != 5 ) {
std::cout << "Usage: ./benchmark xSize ySize zSize iterations" << std::endl;
return 1;
}
Dimension3 domainSize;
int64_t iterations;
auto &exec = Executor::get(0);
std::stringstream( argv[1] ) >> domainSize.x;
std::stringstream( argv[2] ) >> domainSize.y;
std::stringstream( argv[3] ) >> domainSize.z;
std::stringstream( argv[4] ) >> iterations;
// add ghost layers to be comparable to pystencils native backend
domainSize.x += 2;
domainSize.y += 2;
domainSize.z += 2;
int64_t totalSize = domainSize.x * domainSize.y * domainSize.z * {{f_size}};
std::vector<double> src( totalSize, 0.0 );
std::vector<double> dst( totalSize, 0.0 );
auto & dsrc = exec.allocate<double>(src.size());
auto & ddst = exec.allocate<double>(dst.size());
dsrc.upload(src.data(), src.size());
ddst.upload(dst.data(), dst.size());
double * _data_src = dsrc.get();
double * _data_dst = ddst.get();
const int64_t _size_src_0 = domainSize.x;
const int64_t _size_src_1 = domainSize.y;
const int64_t _size_src_2 = domainSize.z;
// fzyx layout
const int64_t _stride_src_0 = 1;
const int64_t _stride_src_1 = domainSize.x;
const int64_t _stride_src_2 = domainSize.x * domainSize.y;
const int64_t _stride_src_3 = domainSize.x * domainSize.y * domainSize.z;
auto pacxxKernel = [=]( range & config ) {
struct Vec3D {int x; int y; int z; };
const Vec3D blockDim = { config.get_block_size(0), config.get_block_size(1), config.get_block_size(2) };
const Vec3D blockIdx = { config.get_block(0), config.get_block(1), config.get_block(2) };
const Vec3D threadIdx = { config.get_local(0), config.get_local(1), config.get_local(2) };
{{ code|indent(8) }}
};
size_t blockSize[] = {64, 8, 1};
KernelConfiguration config( { division_round_up(domainSize.x - 2, blockSize[0]),
division_round_up(domainSize.y - 2, blockSize[1]),
division_round_up(domainSize.z -2, blockSize[2]) },
{ blockSize[0],
blockSize[1],
blockSize[2] });