Explore usage of `fma` for optimization on CUDA

I don't know whether nvcc automatically emits FMA (fused multiply-add) instructions when compiling with the `--use_fast_math` flag. If it doesn't, it could be easy to call `fma` explicitly wherever possible to speed up compute-bound kernels.

https://devblogs.nvidia.com/lerp-faster-cuda/
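
For reference, here is a minimal sketch of the kind of rewrite the linked post describes, using linear interpolation as the example. The names `lerp_naive`, `lerp_fma`, and the `HD` macro are made up for illustration and are not from our code base. The macro only exists so the same snippet builds both with a plain C++ compiler and with nvcc. As far as I know, nvcc's `--fmad` option defaults to `true`, so a plain `a * b + c` may already be contracted into an FMA. The explicit version is still worth checking because the compiler contracts but does not rearrange the expression, so the naive form should need one more operation than the two-FMA form.

```cpp
#include <math.h>

// Hypothetical helper macro: expands to CUDA qualifiers under nvcc,
// and to nothing when compiled as ordinary C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Naive linear interpolation: (1 - t) * v0 + t * v1.
// Even with contraction enabled this is a subtract, a multiply and an FMA.
HD inline float lerp_naive(float v0, float v1, float t) {
    return (1.0f - t) * v0 + t * v1;
}

// Same value rearranged as v0 - t * v0 + t * v1 and written with
// two explicit FMAs: inner = -t * v0 + v0, outer = t * v1 + inner.
HD inline float lerp_fma(float v0, float v1, float t) {
    return fmaf(t, v1, fmaf(-t, v0, v0));
}
```

Both functions should return `v0` at `t = 0` and `v1` at `t = 1`. Whether the explicit form is actually faster in our kernels would have to be measured, for example by comparing the generated SASS with `cuobjdump -sass` and timing a compute-bound kernel both ways.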