How to support Checkpoint/Restart
Hi,
As the people working on the TerraNeo app(s) have started performing longer production runs on supercomputers, the question of how to support a classical checkpoint/restart mechanism is becoming more urgent. This issue is intended to collect thoughts on how we can/want/should approach this.
Some questions that immediately come to mind:
- What functionality already exists in waLBerla and HyTeG to support this? (E.g., we can already (de)serialize primitives and their attached data for migrating them via MPI.)
- Do we start by supporting only a one-file-per-process approach, or can we also provide an N-processes-to-M-checkpoint-files setting?
- How should an app indicate what data to checkpoint? Will we go for a registration approach like with `VTKOutput`? Will that data always be FE functions, or do we also (need to) support other user-defined data structures?
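To make the registration idea a bit more concrete, here is a minimal sketch of what such an interface could look like with a one-file-per-process layout. Everything in it is hypothetical: `CheckpointRegistry`, `registerVector`, `checkpointRoundTrip` and the `<base>.<rank>.ckp` naming are made up for illustration and are not existing HyTeG or waLBerla API. The rank is passed in as a plain `int` so that the example compiles without MPI, and items are registered through save/load callbacks so that user-defined data could be handled the same way as FE functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical registry, NOT an existing HyTeG or waLBerla class.
class CheckpointRegistry
{
 public:
   using Buffer = std::vector< char >;

   // Register a named item with callbacks that pack it into / unpack it from bytes.
   void add( const std::string& name, std::function< Buffer() > save, std::function< void( const Buffer& ) > load )
   {
      items_[name] = Item{ save, load };
   }

   // One file per process: each rank writes "<base>.<rank>.ckp" (assumed naming scheme).
   void write( const std::string& base, int rank ) const
   {
      std::ofstream out( fileName( base, rank ), std::ios::binary );
      if ( !out ) throw std::runtime_error( "cannot open checkpoint file for writing" );
      for ( const auto& kv : items_ )
      {
         const Buffer buf = kv.second.save();
         writeSize( out, kv.first.size() );
         out.write( kv.first.data(), static_cast< std::streamsize >( kv.first.size() ) );
         writeSize( out, buf.size() );
         out.write( buf.data(), static_cast< std::streamsize >( buf.size() ) );
      }
   }

   // Read the file of this rank and hand each record to the matching load callback.
   void read( const std::string& base, int rank ) const
   {
      std::ifstream in( fileName( base, rank ), std::ios::binary );
      if ( !in ) throw std::runtime_error( "cannot open checkpoint file for reading" );
      std::uint64_t n = 0;
      while ( readSize( in, n ) )
      {
         std::string name( n, '\0' );
         in.read( &name[0], static_cast< std::streamsize >( n ) );
         if ( !readSize( in, n ) ) throw std::runtime_error( "truncated checkpoint file" );
         Buffer buf( n );
         in.read( buf.data(), static_cast< std::streamsize >( n ) );
         if ( !in ) throw std::runtime_error( "truncated checkpoint file" );
         auto it = items_.find( name );
         if ( it != items_.end() ) it->second.load( buf ); // unknown records are skipped
      }
   }

   static std::string fileName( const std::string& base, int rank )
   {
      return base + "." + std::to_string( rank ) + ".ckp";
   }

 private:
   struct Item
   {
      std::function< Buffer() >              save;
      std::function< void( const Buffer& ) > load;
   };

   static void writeSize( std::ostream& out, std::size_t n )
   {
      const std::uint64_t v = n;
      out.write( reinterpret_cast< const char* >( &v ), sizeof v );
   }

   static bool readSize( std::istream& in, std::uint64_t& n )
   {
      in.read( reinterpret_cast< char* >( &n ), sizeof n );
      return !in.fail();
   }

   std::map< std::string, Item > items_;
};

// Convenience helper for vectors of trivially copyable values; the vector must outlive the registry.
template < typename T >
void registerVector( CheckpointRegistry& reg, const std::string& name, std::vector< T >& v )
{
   reg.add(
       name,
       [&v]() -> CheckpointRegistry::Buffer {
          CheckpointRegistry::Buffer buf( v.size() * sizeof( T ) );
          if ( !buf.empty() ) std::memcpy( buf.data(), v.data(), buf.size() );
          return buf;
       },
       [&v]( const CheckpointRegistry::Buffer& buf ) {
          assert( buf.size() % sizeof( T ) == 0 );
          v.resize( buf.size() / sizeof( T ) );
          if ( !buf.empty() ) std::memcpy( v.data(), buf.data(), buf.size() );
       } );
}

// Round trip on a single "rank": write, wipe the data, restore, compare.
bool checkpointRoundTrip( const std::string& base )
{
   std::vector< double >        coeffs = { 1.0, 2.5, -3.0 }; // stand-in for FE function data
   std::vector< std::uint64_t > step   = { 42 };             // stand-in for other app state
   const std::vector< double >  expected = coeffs;

   CheckpointRegistry reg;
   registerVector( reg, "coeffs", coeffs );
   registerVector( reg, "step", step );

   reg.write( base, 0 );
   coeffs.clear();
   step[0] = 0;
   reg.read( base, 0 );
   std::remove( CheckpointRegistry::fileName( base, 0 ).c_str() );

   return coeffs == expected && step.size() == 1 && step[0] == 42;
}
```

The raw binary format here is not portable across architectures and says nothing about restarting on a different number of processes; an N-to-M setting would presumably need a different layout (e.g. based on parallel I/O), which this sketch deliberately leaves open.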
Cheers
Marcus
Edited by Marcus Mohr