Explore usage of `fma` for optimization on CUDA
I don't know whether nvcc automatically emits FMA (fused multiply-add)
instructions when compiling with the `--use_fast_math` flag.
If not, it could be easy to call `fma` explicitly
wherever possible to accelerate compute-bound kernels.
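
A minimal sketch of how this could be checked, not code from this project. As far as I understand the nvcc documentation, contraction of multiply-add into FMA is controlled by `--fmad` (default `true`), and `--use_fast_math` implies `--fmad=true`, so the plain expression below may already compile to a single FMA. The kernel names and the `max_abs_diff` helper are hypothetical, chosen only for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Plain expression: nvcc is allowed to contract a * x[i] + y[i] into one FMA
// when --fmad=true (assumed to be the default here).
__global__ void axpy_plain(int n, float a, const float* x, const float* y, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

// Explicit fused multiply-add: one rounding step, independent of --fmad.
__global__ void axpy_fma(int n, float a, const float* x, const float* y, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaf(a, x[i], y[i]);
}

// Hypothetical helper: runs both kernels on the same input and returns the
// largest absolute difference. CUDA error checking is omitted for brevity.
float max_abs_diff(int n) {
    std::vector<float> hx(n), hy(n), h_plain(n), h_fma(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = 1.0f + i * 1e-3f;
        hy[i] = -1.0f;
    }

    size_t bytes = n * sizeof(float);
    float *dx, *dy, *d_plain, *d_fma;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&d_plain, bytes);
    cudaMalloc(&d_fma, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;
    const float a = 1.0009765625f;  // 1 + 2^-10, exactly representable
    axpy_plain<<<grid, block>>>(n, a, dx, dy, d_plain);
    axpy_fma<<<grid, block>>>(n, a, dx, dy, d_fma);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaMemcpy(h_plain.data(), d_plain, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_fma.data(), d_fma, bytes, cudaMemcpyDeviceToHost);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(h_plain[i] - h_fma[i]));

    cudaFree(dx);
    cudaFree(dy);
    cudaFree(d_plain);
    cudaFree(d_fma);
    return worst;
}

int main() {
    std::printf("max |plain - fma| = %g\n", max_abs_diff(1024));
    return 0;
}
```

Building this once with `--fmad=true` and once with `--fmad=false` and comparing the printed difference, or inspecting the generated code with `cuobjdump -sass` for `FFMA` instructions, should show whether the plain version is already fused. If it is, rewriting kernels by hand to call `fma` is unlikely to buy much.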