WIP: Cuda autotune (!106) · Merge requests · pycodegen / pystencils

This PR introduces ~~two~~ one change:

~~rotate (32,1,1) depending on field strides to fastest dimension. So (1,1,32) for c-layout and (32,1,1) for fortran layout. So pystencils will be fast also for c-layout (this will always be performed)~~
auto-tune the block dimensions to whatever is fastest for a specific kernel on localhost. On the first kernel call, different block layouts are tried, and the kernel is called henceforth with the fastest configuration (disk_cached). This could be interesting for OpenCL, where we don't know which launch config is the fastest (on OpenCL the runtime can alternatively give a hint on that).
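
A minimal sketch of the idea, not the code in this MR: `autotune`, its parameters, and the `block` keyword are hypothetical names, and a plain dict stands in for the disk cache. Each candidate block size is timed once per kernel, and the winner is reused on later calls.

```python
import time


def autotune(kernel, candidates, args, cache, key, repeats=3,
             timer=time.perf_counter):
    """Return the fastest block size for `kernel`, measuring only once per `key`.

    `cache` is any dict-like object; a persistent (on-disk) mapping would
    play the role of the disk cache. All names here are illustrative.
    """
    if key in cache:
        return cache[key]  # already tuned: skip the trial calls

    best_config, best_time = None, float("inf")
    for config in candidates:
        kernel(*args, block=config)  # warm-up call, not timed
        start = timer()
        for _ in range(repeats):
            kernel(*args, block=config)
        # On a real GPU the device must be synchronized before reading
        # the clock, because kernel launches are asynchronous.
        elapsed = (timer() - start) / repeats
        if elapsed < best_time:
            best_config, best_time = config, elapsed

    cache[key] = best_config
    return best_config
```

The candidate list would typically contain layouts such as `(32, 1, 1)`, `(1, 1, 32)`, or `(8, 8, 1)`; which ones are worth trying is an assumption left to the caller.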

One drawback: the test calls are only correct if input and output fields do not overlap (so no in-place kernels).
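
A toy illustration of why the trial calls are unsafe for in-place kernels (plain Python lists stand in for GPU fields, and all names are hypothetical): running the same kernel several times is harmless when the output is a separate buffer, but compounds the update when input and output are the same buffer.

```python
def scale(src, dst, factor=2):
    """Toy kernel: write factor * src into dst, element by element."""
    for i, value in enumerate(src):
        dst[i] = value * factor


def run_trials(kernel, src, dst, trials):
    """Call the kernel repeatedly, as a tuning pass would."""
    for _ in range(trials):
        kernel(src, dst)
    return dst


# Separate input and output: every trial produces the same result.
separate = run_trials(scale, [1, 2, 3], [0, 0, 0], trials=3)
print(separate)  # [2, 4, 6]

# Same buffer for input and output: each trial scales the previous result.
buf = [1, 2, 3]
in_place = run_trials(scale, buf, buf, trials=3)
print(in_place)  # [8, 16, 24]
```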

Edited Oct 07, 2020 by Stephan Seitz
