reorder mulhs so they go with the corresponding muls

Supposedly some processors and compilers will fuse an adjacent mul+mulh
pair on the same operands into a single instruction.
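
A minimal sketch of the intended ordering, in plain C rather than the actual
kernel code. The names mul_lo, mul_hi, and mul_pairs are hypothetical
stand-ins for illustration, and whether any fusion happens depends on the
compiler and target:

```c
#include <stdint.h>

/* Hypothetical helpers: low and high 32 bits of a 32x32-bit product. */
static inline uint32_t mul_lo(uint32_t a, uint32_t b) { return a * b; }
static inline uint32_t mul_hi(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Each mul_hi sits directly next to the mul_lo that shares its operands,
 * instead of all the low halves first and all the high halves afterwards. */
static inline void mul_pairs(const uint32_t a[2], const uint32_t b[2],
                             uint32_t lo[2], uint32_t hi[2]) {
    lo[0] = mul_lo(a[0], b[0]);
    hi[0] = mul_hi(a[0], b[0]);
    lo[1] = mul_lo(a[1], b[1]);
    hi[1] = mul_hi(a[1], b[1]);
}
```

The results are the same in either order; only the instruction adjacency
changes.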