Explore usage of `fma` for optimization on CUDA
I don't know whether nvcc automatically emits FMA (fused multiply-add) instructions when compiling with the `--use_fast_math` flag.
If not, calling `fma`/`fmaf` explicitly wherever a multiply feeds directly into an add could be an easy way to speed up compute-bound kernels.
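As a starting point, here is a minimal sketch of what the explicit form could look like. The names `axpy_plain`, `axpy_fused`, `horner_fused`, `axpy_kernel` and the `HOST_DEVICE` macro are hypothetical and not taken from this codebase. Note that the nvcc documentation describes a `--fmad` option that defaults to `true` and is also implied by `--use_fast_math`, so the plain `a * x + y` form may already be contracted into an FMA; the explicit call mainly makes the intent independent of compiler flags.

```cpp
#include <math.h>

// Lets the same helpers compile under nvcc (host + device) and under a
// plain C++ compiler. HOST_DEVICE is a hypothetical macro for this sketch.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Separate multiply and add. Whether this becomes one FMA instruction
// depends on the compiler's contraction setting (assumed: nvcc --fmad).
HOST_DEVICE inline float axpy_plain(float a, float x, float y) {
    return a * x + y;
}

// Explicit fused multiply-add: computes a * x + y with a single rounding.
HOST_DEVICE inline float axpy_fused(float a, float x, float y) {
    return fmaf(a, x, y);
}

// Horner evaluation of c[0] + c[1]*x + ... + c[n-1]*x^(n-1),
// a typical multiply-then-add chain that maps naturally onto FMA.
HOST_DEVICE inline float horner_fused(const float* c, int n, float x) {
    float acc = 0.0f;
    for (int i = n - 1; i >= 0; --i) {
        acc = fmaf(acc, x, c[i]);
    }
    return acc;
}

#ifdef __CUDACC__
// Element-wise out[i] = a * x[i] + y[i], one thread per element.
__global__ void axpy_kernel(float a, const float* x, const float* y,
                            float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = axpy_fused(a, x[i], y[i]);
    }
}
#endif
```

Before rewriting any kernels, it is probably worth checking what nvcc already generates: dumping the SASS of an existing kernel with `cuobjdump -sass` and looking for `FFMA` instructions should show whether the plain form is being fused. If it is, an explicit `fmaf` is unlikely to change performance.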