WIP: Cuda autotune

Stephan Seitz requested to merge seitz/pystencils:cuda-autotune into master

This PR introduces two changes:

  • Rotate the default block size (32,1,1) so that its long side lies along the fastest dimension, as determined by the field strides: (1,1,32) for C layout and (32,1,1) for Fortran layout. This makes pystencils fast for C layout as well (this will always be performed).
  • Auto-tune the block dimensions to whatever is fastest for a specific kernel on localhost. On the first kernel call, different launch configurations are tried, and from then on the kernel is called with the fastest one (disk_cached). This could be interesting for OpenCL, where we don't know which launch configuration is the fastest (alternatively, the OpenCL runtime can give a hint on that).
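
To make the two changes concrete, here is a minimal Python sketch. It is not the code in this PR: `rotate_block`, `autotune`, and their parameters are hypothetical names chosen for illustration, the candidate list is arbitrary, and the disk cache is left out. It assumes strides are given per field axis and that `launch(block)` runs the kernel once with the given block size.

```python
import time


def rotate_block(strides, block=(32, 1, 1)):
    """Assign the largest block extent to the axis with the smallest stride.

    Hypothetical helper, not the pystencils implementation.
    Assumes len(block) == len(strides).
    """
    # Axes ordered from fastest-varying (smallest stride) to slowest.
    axes_by_stride = sorted(range(len(strides)), key=lambda i: abs(strides[i]))
    # Block extents ordered from largest to smallest.
    sizes = sorted(block, reverse=True)
    rotated = [1] * len(strides)
    for axis, size in zip(axes_by_stride, sizes):
        rotated[axis] = size
    return tuple(rotated)


def autotune(launch, candidates, repeats=3, timer=time.perf_counter):
    """Return the candidate block size with the lowest total run time.

    `launch` is an assumed callable that runs the kernel once for a given
    block size. On a real GPU the launch is asynchronous, so `launch` would
    have to synchronize before returning for the timing to be meaningful.
    """
    best, best_time = None, float("inf")
    for block in candidates:
        start = timer()
        for _ in range(repeats):
            launch(block)
        elapsed = timer() - start
        if elapsed < best_time:
            best, best_time = block, elapsed
    return best
```

For a C-layout field with byte strides `(240, 48, 8)`, `rotate_block` yields `(1, 1, 32)`; for the Fortran-layout strides `(8, 32, 160)` it yields `(32, 1, 1)`. The result of `autotune` would then be stored per kernel so that later calls skip the search.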

One drawback: the trial calls made during tuning only give correct results if input and output fields do not overlap (so no in-place kernels).
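
One way to guard against this, sketched under assumptions: before tuning, compare the memory ranges of the input and output fields and skip the trial runs if any pair overlaps. The reasoning is presumably that the trial runs execute the kernel more than once, which changes the result when a kernel reads what it has just written. The helpers below are hypothetical and not part of this PR; each field is represented as a `(start_address, size_in_bytes)` pair, which is a conservative approximation for strided views.

```python
def ranges_overlap(a, b):
    """True if two (start_address, size_in_bytes) ranges share any byte."""
    a_start, a_size = a
    b_start, b_size = b
    return a_start < b_start + b_size and b_start < a_start + a_size


def safe_to_autotune(inputs, outputs):
    """True if no input range overlaps any output range.

    Hypothetical check: when it returns False, the caller would fall back
    to a default launch configuration instead of running trial calls.
    """
    return not any(ranges_overlap(i, o) for i in inputs for o in outputs)
```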

Edited by Stephan Seitz
