AES-NI vectorization improvements

!30 (merged) didn't implement an SSE-vectorized equivalent of _mm_cvtepu64_pd because the Stack Overflow solution didn't work. That turned out to be caused by a bad optimization that GCC 5+ applies in fast-math mode. None of the other compilers (Clang, Intel, MSVC) have that issue, so we just disable fast-math for that function.
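
For reviewers unfamiliar with the trick, here is a minimal sketch of this kind of conversion. It is not copied from this MR. The function name `cvtepu64_pd_sse2`, the `NO_FAST_MATH` macro, and the use of GCC's `optimize` attribute are assumptions made for illustration, and the actual code may disable fast-math differently. The sketch builds two doubles from the high and low 32-bit halves using magic exponents (2^84 and 2^52) and then combines them with a subtraction followed by an addition. The result is only correct if those two operations run in the order written, which is presumably what reassociation under fast-math breaks.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical macro: turn fast-math off for one function on GCC only.
   Clang does not support the optimize attribute, so it is left empty there. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_FAST_MATH __attribute__((optimize("no-fast-math")))
#else
#define NO_FAST_MATH
#endif

/* Convert two unsigned 64-bit integers to two doubles using only SSE2. */
NO_FAST_MATH
static __m128d cvtepu64_pd_sse2(__m128i x)
{
    const __m128i lo_mask   = _mm_set1_epi64x(0x00000000FFFFFFFFLL);
    const __m128d magic_lo  = _mm_set1_pd(4503599627370496.0);           /* 2^52 */
    const __m128d magic_hi  = _mm_set1_pd(19342813113834066795298816.0); /* 2^84 */
    const __m128d magic_all = _mm_set1_pd(19342813118337666422669312.0); /* 2^84 + 2^52 */

    /* High 32 bits placed in the mantissa of 2^84: value is 2^84 + hi * 2^32. */
    __m128i hi = _mm_or_si128(_mm_srli_epi64(x, 32), _mm_castpd_si128(magic_hi));
    /* Low 32 bits placed in the mantissa of 2^52: value is 2^52 + lo. */
    __m128i lo = _mm_or_si128(_mm_and_si128(x, lo_mask), _mm_castpd_si128(magic_lo));

    /* Exact subtraction first, then a single rounding in the final add.
       Reordering these two operations changes the result. */
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(hi), magic_all);
    return _mm_add_pd(f, _mm_castsi128_pd(lo));
}
```

Each lane should match a plain scalar `(double)` cast of the same `uint64_t` in the default rounding mode. Note that the GCC documentation describes the `optimize` attribute as intended for debugging, so a per-file compiler flag may be the more robust option.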

Also, we now use fused multiply-add if available.
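
As a rough illustration of what "if available" could look like, the sketch below selects FMA at compile time through the `__FMA__` macro that GCC and Clang define when building with `-mfma`. The helper name `madd_pd` is hypothetical, and the MR may detect FMA support differently (for example at runtime).

```c
#include <immintrin.h>  /* SSE2 and FMA intrinsics */

/* Hypothetical helper: computes a * b + c for two packed doubles. */
static __m128d madd_pd(__m128d a, __m128d b, __m128d c)
{
#if defined(__FMA__)
    /* Fused: one instruction, one rounding. */
    return _mm_fmadd_pd(a, b, c);
#else
    /* Fallback: separate multiply and add, two roundings. */
    return _mm_add_pd(_mm_mul_pd(a, b), c);
#endif
}
```

The two paths can differ in the last bit because the fused version rounds only once.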

