MPI_ERR_TRUNCATE in WcTimingPool
I use mesh refinement with a refinement time step.
Additionally, I create two WcTimingPools and pass them to the refinement time step.
After running the simulation, I use logResultOnRoot() to print the time measurements.
auto timingRef = std::make_shared<WcTimingPool>();
auto timingRefLvl = std::make_shared<WcTimingPool>();
⋮
// setup timeloop
refinementTimeStep->enableTiming(timingRef, timingRefLvl);
timeloop->addFuncBeforeTimeStep(makeSharedFunctor(refinementTimeStep), str::refinementTimeStep);
⋮
// run timeloop
auto timing = std::make_shared<WcTimingPool>();
timeloop->run(*timing);
// print timings
timing->logResultOnRoot();
timingRef->logResultOnRoot();
timingRefLvl->logResultOnRoot();
I run my simulation with 288 processes on 8 nodes and get the following behavior:
The first two timings (timing and timingRef) print the results as expected.
The level-wise timing (timingRefLvl) fails to print with the error:
[node098:207482] * An error occurred in MPI_Reduce
[node098:207482] * reported by process [3786932225,192]
[node098:207482] * on communicator MPI_COMM_WORLD
[node098:207482] * MPI_ERR_TRUNCATE: message truncated
[node098:207482] * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node098:207482] * and potentially your MPI job)
[node089.cluster:17591] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 1741
[node089.cluster:17591] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 1741
[node089.cluster:17591] 12 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node089.cluster:17591] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The error must come from TimingPool.cpp:172-185.
I use GCC 8.3.0 and Open MPI 3.1.5.
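A possible cause, not verified against TimingPool.cpp, is that the ranks do not all register the same set of level-wise timers, for example because a rank holds no blocks on some refinement level. An element-wise MPI_Reduce requires the same count on every rank, and a count mismatch is a common trigger for MPI_ERR_TRUNCATE. The sketch below illustrates that hypothesis without MPI: each rank's pool is modeled as a plain map, and the type `LocalTimers`, the functions `countsAgree` and `unifyTimerNames`, and the timer names used with them are hypothetical, not part of the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one rank's timing pool:
// timer name -> accumulated time in seconds.
using LocalTimers = std::map<std::string, double>;

// Returns true if every rank would pass the same element count
// to an element-wise reduction over its timers.
bool countsAgree(const std::vector<LocalTimers>& ranks) {
    if (ranks.empty()) return true;
    const std::size_t n = ranks.front().size();
    return std::all_of(ranks.begin(), ranks.end(),
                       [n](const LocalTimers& t) { return t.size() == n; });
}

// Adds a zero-valued entry on each rank for every timer name that
// is registered on some other rank, so all ranks hold the same names.
// Existing entries are left untouched.
void unifyTimerNames(std::vector<LocalTimers>& ranks) {
    std::set<std::string> allNames;
    for (const auto& rank : ranks)
        for (const auto& entry : rank)
            allNames.insert(entry.first);
    for (auto& rank : ranks)
        for (const auto& name : allNames)
            rank.emplace(name, 0.0);  // no-op if the name already exists
}
```

If the hypothesis holds, comparing the number of timers in timingRefLvl across ranks before calling logResultOnRoot() should show differing counts, while timing and timingRef would show identical counts on every rank.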