JUWELS Booster: selectDeviceBasedOnMpiRank() presumably pairs the devices with the wrong host memory
The function selectDeviceBasedOnMpiRank() seems to assign all devices (GPUs) on a node to the same host memory. This could cause performance issues for CPU-to-GPU transfers (cudaMemcpy), because the GPUs would not be communicating with their closest host memory. For example, if you allocate 4 GPUs on a node and launch the program with 4 MPI processes, cudaSetDevice() assigns all 4 GPUs to the memory of a single MPI process (process 1). This happens because gpuGetDeviceCount() returns 1 instead of 4 devices (GPUs). So far this behavior has only been tested on JUWELS Booster; further investigation is needed.
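To make the suspected effect concrete, here is a minimal sketch of the index arithmetic. It assumes that selectDeviceBasedOnMpiRank() picks the device as "node-local rank modulo device count", which this report has not confirmed. The helper names deviceIndexForRank and deviceIndicesForNode are hypothetical and not part of the code base. The sketch deliberately avoids MPI and CUDA calls so it can be compiled anywhere.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (an assumption, not the project's actual code):
// maps a node-local MPI rank to a device index the way a modulo-based
// selection would before calling cudaSetDevice().
int deviceIndexForRank(int localRank, int visibleDeviceCount) {
    assert(localRank >= 0 && visibleDeviceCount > 0);
    return localRank % visibleDeviceCount;
}

// Hypothetical helper: collects the device index that each of the
// `ranksPerNode` node-local ranks would end up selecting.
std::vector<int> deviceIndicesForNode(int ranksPerNode, int visibleDeviceCount) {
    std::vector<int> indices;
    indices.reserve(ranksPerNode);
    for (int rank = 0; rank < ranksPerNode; ++rank) {
        indices.push_back(deviceIndexForRank(rank, visibleDeviceCount));
    }
    return indices;
}
```

Under that assumption, deviceIndicesForNode(4, 4) gives a distinct index per rank, while deviceIndicesForNode(4, 1), the case reported here, gives index 0 for every rank. Whether index 0 then refers to the same physical GPU in each process depends on how the job launcher restricts device visibility per process (for example via CUDA_VISIBLE_DEVICES). Printing the result of cudaDeviceGetPCIBusId() from each rank would be one way to check this.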