Pure GPU communication scheme

Usage example:

PureGPUCommunicationScheme scheme(cudaEnabledMPIAvailable, blocks, fieldID);
scheme();
// or, split into two phases:
scheme.startCommunication();
scheme.wait();

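Design notes:

- One pack kernel per communication direction. These could perhaps be generated, so look at the code generation first.
- One send buffer on the GPU per direction.
- After packing, either:
  - copy the buffer to the CPU and send it from there (if MPI is not CUDA-enabled), or
  - send the GPU buffer off directly (if MPI is CUDA-enabled).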
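
A rough CPU reference sketch of the packing step and the transport decision, for illustration only. It shows what a single pack routine, parameterized on the direction, would write into a per-direction buffer. All names here (Field, sliceRange, packDirection, SendPath, chooseSendPath) are hypothetical and not part of the actual scheme. The real version would run as a CUDA kernel on device memory and would probably also have to deal with ghost layers, which this sketch ignores.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the field data of one block; x is the fastest index.
struct Field {
    std::size_t nx, ny, nz;
    std::vector<double> data;

    Field(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0) {}

    std::size_t index(std::size_t x, std::size_t y, std::size_t z) const {
        return x + nx * (y + ny * z);
    }
};

// Half-open index range [begin, end) along one axis.
struct Range {
    std::size_t begin, end;
};

// Direction component d selects the cells to pack along an axis of length n:
// -1 picks the first cell, +1 the last cell, 0 the full extent.
inline Range sliceRange(int d, std::size_t n) {
    assert(n > 0);
    if (d < 0) return {0, 1};
    if (d > 0) return {n - 1, n};
    return {0, n};
}

// Packs the boundary slice facing direction dir into a contiguous buffer.
// One such buffer would exist per direction; on the GPU the three loops
// would be replaced by the thread index of the pack kernel.
std::vector<double> packDirection(const Field& f, const std::array<int, 3>& dir) {
    const Range rx = sliceRange(dir[0], f.nx);
    const Range ry = sliceRange(dir[1], f.ny);
    const Range rz = sliceRange(dir[2], f.nz);

    std::vector<double> buffer;
    buffer.reserve((rx.end - rx.begin) * (ry.end - ry.begin) * (rz.end - rz.begin));
    for (std::size_t z = rz.begin; z < rz.end; ++z)
        for (std::size_t y = ry.begin; y < ry.end; ++y)
            for (std::size_t x = rx.begin; x < rx.end; ++x)
                buffer.push_back(f.data[f.index(x, y, z)]);
    return buffer;
}

// The two transport paths from the notes above.
enum class SendPath { DirectFromGPU, StagedThroughHost };

// With CUDA-enabled MPI the GPU buffer is sent directly; otherwise it is
// copied to the CPU first.
inline SendPath chooseSendPath(bool cudaEnabledMPIAvailable) {
    return cudaEnabledMPIAvailable ? SendPath::DirectFromGPU
                                   : SendPath::StagedThroughHost;
}
```

Because the direction is only a parameter here, the same routine covers faces, edges and corners. Whether that is fast enough on the GPU compared to generated per-direction kernels is an open question.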