WIP: Cuda autotune
This PR introduces
two one change s: rotate (32,1,1) depending on field strides to fastest dimension. So (1,1,32) for c-layout and (32,1,1) for fortran layout. So pystencils will be fast also for c-layout (this will always be performed)
- auto-tune the block dimensions to whatevers is fastest for a specific kernel on localhost. On first kernel call different layouts are tried and the kernel will be called henceforth with the fastest configuration (disk_cached). This could be intersting for OpenCL where we don't know which launch config is the fastest (on OpenCL the runtime can alternatively give a hint on that).
One drawback: the test calls are only correct if input and output fields do not overlap (so no in-place kernels).