reorder mulhs so they go with the corresponding muls

Supposedly, some processors and compilers will fuse an adjacent mul+mulh pair into one instruction.
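
For illustration, a minimal C sketch of the pairing, not the actual generated code: each high-half multiply sits directly next to the low-half multiply on the same operands. The names mulhi32, mullo32, products_t and paired_products are hypothetical. The mulh-before-mul order follows the sequence the RISC-V M extension recommends for fusion; whether a given compiler keeps the order or a given core fuses the pair is an assumption.

```c
#include <stdint.h>

/* High 32 bits of the 32x32-bit unsigned product (what a mulh-style
 * instruction returns). */
static inline uint32_t mulhi32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Low 32 bits of the same product (a plain mul). */
static inline uint32_t mullo32(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b);
}

/* Hypothetical result holder for two full 64-bit products. */
typedef struct {
    uint32_t hi0, lo0, hi1, lo1;
} products_t;

/* Each mulh is issued right beside its matching mul, with the operands in
 * the same order, rather than grouping both mulhs and then both muls. */
products_t paired_products(uint32_t a, uint32_t x, uint32_t b, uint32_t y) {
    products_t p;
    p.hi0 = mulhi32(a, x);
    p.lo0 = mullo32(a, x);
    p.hi1 = mulhi32(b, y);
    p.lo1 = mullo32(b, y);
    return p;
}
```

The results are identical to the ungrouped order; only the instruction adjacency changes, so any benefit depends on the target.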
14 jobs for opencl in 48 minutes and 38 seconds (queued for 4 seconds)

Stage: Test

Status  Job ID   Tags                                         Name                   Duration  Coverage
passed  #578069  docker                                       arm64v8                00:03:51
passed  #578079  docker                                       arm64v9                00:30:42
passed  #578077  docker                                       build-documentation    00:01:11
passed  #578076  docker                                       flake8-lint            00:00:20
passed  #578066  AVX docker                                   latest-python          00:02:59
passed  #578073  cuda docker                                  minimal-conda          00:00:18
passed  #578074  cuda docker                                  minimal-sympy-master   00:02:02
passed  #578067  win                                          minimal-windows        00:11:10
passed  #578070  docker                                       ppc64le                00:08:35
manual  #578075  AVX cuda11 docker (allowed to fail, manual)  pycodegen-integration  -
passed  #578072  docker                                       riscv64                00:05:44
passed  #578065  AVX cuda11 docker                            tests-and-coverage     00:03:53  87.41%
passed  #578068  AVX cuda11 docker                            ubuntu                 00:02:30
failed  #578071  docker                                       arm64v9                00:10:52