import waLBerla hangs after installation
Hi waLBerla developers and contributors!
With other colleagues in the EESSI and MultiXscale projects, we are trying to build and deploy optimized waLBerla v6.1 installations. We ran into an issue when building it with two specific toolchains, which we'd like to report and hopefully get your input on.
A summary of the issue:
We are building waLBerla through EasyBuild using the foss2022b and foss2023a toolchains with two otherwise identical easyconfig files. With either toolchain, the installation proceeds until the sanity check step, which simply runs python -c "import waLBerla", at which point the import hangs. We see this happen on the EasyBuild test clusters, but not on our personal laptops or on the HPC cluster at the University of Groningen.
We tried changing the sanity check to mpirun -np 1 python -c "import waLBerla" in case the issue was with the test cluster's environment, but the same hang occurs.
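For anyone trying to reproduce this, here is a minimal sketch of how the hang could be detected and roughly located without blocking a build indefinitely. It runs the import in a fresh interpreter and arms Python's standard faulthandler watchdog, which dumps the Python traceback and exits the child when the timeout expires. The helper name probe_import and the timeout values are our own choices for illustration, not part of EasyBuild or waLBerla.

```python
import subprocess
import sys


def probe_import(statement, timeout_s=60):
    """Run `statement` in a fresh interpreter and report whether it hangs.

    Returns (status, stderr_text), where status is "ok", "failed" or "hang".
    `probe_import` is a hypothetical helper written for this report.
    """
    # faulthandler's watchdog runs in its own C thread inside the child, so it
    # can fire even if the main thread is stuck in native code. With exit=True
    # the child terminates right after dumping the traceback.
    child_code = (
        "import faulthandler\n"
        f"faulthandler.dump_traceback_later({timeout_s!r}, exit=True)\n"
        f"{statement}\n"
        "faulthandler.cancel_dump_traceback_later()\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=timeout_s + 30,  # backstop in case the watchdog never fires
        )
    except subprocess.TimeoutExpired as exc:
        err = exc.stderr or ""
        if isinstance(err, bytes):
            err = err.decode(errors="replace")
        return "hang", err
    # faulthandler prefixes its watchdog dump with a "Timeout (H:MM:SS)!" line.
    if "Timeout (" in proc.stderr:
        return "hang", proc.stderr
    return ("ok" if proc.returncode == 0 else "failed"), proc.stderr
```

Calling probe_import("import waLBerla") on an affected node should return "hang" together with the traceback that was active at the timeout. If the process is stuck inside native code, the traceback will only point at the import line itself, so this narrows down the Python side but not the C++/MPI side.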
One successful workaround is to set UCX_LOG_LEVEL=info in the sanity check so that it reads UCX_LOG_LEVEL=info python -c "import waLBerla". We don't know why changing the log level of UCX resolves this problem, and my colleague who discovered this has also opened a ticket about it in the UCX repo here.
Another workaround seems to be importing mpi4py before waLBerla. This is surprising, because mpi4py is not a dependency of waLBerla. We would rather not add mpi4py as a dependency just to work around this issue, especially without knowing the consequences.
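To compare the plain import against the two workarounds on a given system, something along these lines could be used. It is only a sketch: the helper names run_variant and compare and the 60-second timeout are hypothetical, and the exact statement used for the mpi4py workaround is an assumption (we write import mpi4py here; from mpi4py import MPI may behave differently).

```python
import os
import subprocess
import sys

# Variants described in this report: (label, extra environment, Python code).
# The exact mpi4py import statement is an assumption, see the note above.
VARIANTS = [
    ("baseline", {}, "import waLBerla"),
    ("UCX_LOG_LEVEL=info", {"UCX_LOG_LEVEL": "info"}, "import waLBerla"),
    ("mpi4py first", {}, "import mpi4py; import waLBerla"),
]


def run_variant(extra_env, code, timeout_s=60):
    """Run `code` in a fresh interpreter; return "ok", "failed" or "hang"."""
    env = dict(os.environ, **extra_env)  # inherit the environment, then override
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"


def compare(variants=VARIANTS, timeout_s=60):
    """Return a {label: status} mapping for every variant."""
    return {label: run_variant(env, code, timeout_s) for label, env, code in variants}
```

On an affected system we would expect compare() to report "hang" for the baseline and "ok" for the other two variants, matching what is described above. Note that the timeout only kills the direct child process, so a variant launched through mpirun would need extra care to clean up the processes it spawns.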
Given that we were only seeing this problem on the EasyBuild test clusters and not on other systems, and that the UCX workaround seems to work for the EESSI test clusters, we assumed import waLBerla was likely hanging due to some quirk of the EasyBuild test clusters. However, we have since received a report from another EasyBuild maintainer who saw this problem on another system. Because of this, we are no longer convinced that the cause lies with the EasyBuild clusters and their environment.
We have a summary of our attempts in our support portal, where you can find more details.
Would you have any idea of what could be causing this, or have you perhaps encountered something similar in the past? We'd love your input as we're quite confused about this problem. Thanks in advance!