iteration_slices cause out-of-bound memory writes on GPU
Essentially it's related to the issue #58 (closed), because the fix introduced out-of-bound memory writes.
The guards are not aware of the scaled accesses to the arrays, which in my example leads to half of the memory written to to be out of bounds.
1 FUNC_PREFIX __launch_bounds__(256) void kernel(double * RESTRICT _data_field_1)
2 {
3 if (blockDim.x*blockIdx.x + threadIdx.x + 1 < 37 && blockDim.y*blockIdx.y + threadIdx.y + 1 < 37)
4 {
5 const int64_t ctr_0 = 2*blockDim.x*blockIdx.x + 2*threadIdx.x + 1;
6 const int64_t ctr_1 = 2*blockDim.y*blockIdx.y + 2*threadIdx.y + 1;
7 _data_field_1[ctr_0 + 38*ctr_1] = 1.0;
8 }
9 }
The guard allows blockDim.x*blockIdx.x + threadIdx.x
and blockDim.y*blockIdx.y + threadIdx.y
both to have a maximum value of 36. However, the accesses are scaled, ctr_0
and ctr_1
both max out at 2 * 36 + 1 = 73
, which in the access to the field _data_field_1[ctr_0 + 38*ctr_1]
results in the last memory-access to be _data_field_1[73 + 38*73]
, way out of bounds for an array of size 38^2
. In my simulations this is screwing up all my memory. It took me ages to figure this out, because of course this also depends on the chosen domain size, sometimes resulting in visibly screwed up memory of neighboring arrays and sometimes not directly visible.
import pystencils as ps
domain_size = (36, 36)
dh_gpu = ps.create_data_handling(
domain_size=domain_size,
periodicity=True,
default_target=ps.enums.Target.GPU,
)
kernel_config_gpu = ps.CreateKernelConfig(
target=dh_gpu.default_target,
# gpu_indexing_params={"block_size": (16, 16)},
iteration_slice=([slice(1, -1, 2)] * dh_gpu.dim),
)
field_1 = dh_gpu.add_array("field_1", values_per_cell=1)
assignment = ps.Assignment(field_1.center, 1)
kernel = ps.create_kernel(assignment, config=kernel_config_gpu).compile()
ps.show_code(kernel)
dh_gpu.fill(field_1.name, 0, ghost_layers=True)
dh_gpu.to_gpu(field_1.name)
dh_gpu.run_kernel(kernel)
dh_gpu.to_cpu(field_1.name)
print(dh_gpu.gather_array(field_1.name, ghost_layers=True))