Checkpoint-restore functionality using ADIOS2

Marcus Mohr requested to merge mohr/ADIOS-Checkpointing-mc into master

This MR implements a first version of a Checkpoint/Restore functionality for HyTeG based on the ADIOS2 library.

From the HyTeG side the basic API is given by the two classes

  • AdiosCheckpointExporter
  • AdiosCheckpointImporter

The basic workflow of checkpointing is the following

  • generate an exporter object
  • register the FE functions to be included in the checkpoint with the exporter object
  • trigger writing of the checkpoint
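
The three steps above can be sketched with a small stand-in written against the standard library only. Everything in it is hypothetical and only for illustration: `MockFunction`, `MockCheckpointExporter`, the method names and the `name_levelN` key scheme are assumptions and not the actual interface of AdiosCheckpointExporter; the real calls are in CheckpointRestoreTest.cpp.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an FE function: a name plus one complete
// data buffer (including ghost values) per refinement level.
struct MockFunction
{
   std::string                            name;
   std::map< int, std::vector< double > > dataOnLevel;
};

// Hypothetical stand-in for the exporter (NOT the real HyTeG class).
class MockCheckpointExporter
{
 private:
   struct Entry
   {
      const MockFunction* func;
      int                 minLevel;
      int                 maxLevel;
   };
   std::vector< Entry > registered_;

 public:
   // Step 2: remember the function and the level range to checkpoint.
   void registerFunction( const MockFunction& func, int minLevel, int maxLevel )
   {
      registered_.push_back( Entry{ &func, minLevel, maxLevel } );
   }

   // Step 3: copy the complete buffers of all registered functions for
   // every level in [minLevel, maxLevel] into the "checkpoint".
   // The key scheme "<name>_level<N>" is an assumption for this sketch.
   std::map< std::string, std::vector< double > > storeCheckpoint() const
   {
      std::map< std::string, std::vector< double > > checkpoint;
      for ( const Entry& entry : registered_ )
      {
         for ( int level = entry.minLevel; level <= entry.maxLevel; ++level )
         {
            const std::string key = entry.func->name + "_level" + std::to_string( level );
            checkpoint[key]       = entry.func->dataOnLevel.at( level );
         }
      }
      return checkpoint;
   }
};
```

Registering a function with levels 2 to 4 for the range [2,3] and then calling `storeCheckpoint()` yields two entries in this mock, one per level; level 4 is left out.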

When writing of the checkpoint is triggered, we export one ADIOS variable for each primitive and each scalar sub-/component function to the checkpoint "file". The term "file" is not entirely accurate: since we use ADIOS2's BP format in version 5, we generate a directory that gets populated with multiple files. We include all data, i.e. also values at ghost points, in the checkpoint for two reasons: (a) this gives us a complete snapshot of the function's state, and (b) it is much easier to implement and more efficient performance-wise, since we can simply write/read complete data buffers.

Our checkpoint/restore currently supports the following kinds of functions

  • P1Function and P1VectorFunction
  • P2Function and P2VectorFunction
  • P2P1TaylorHoodFunction

The latter is handled slightly differently, as we treat the velocity and pressure components separately. The implementation works with all our usual data/value types, i.e. float, double, int32_t and int64_t. Additionally, at registration the user passes values for minLevel and maxLevel, and function data is stored for the complete range [minLevel,maxLevel].

When writing the checkpoint, additional meta-information is included. For each registered function this comprises

  • function name (string)
  • function kind (string)
  • function value type (string)
  • minLevel (integer)
  • maxLevel (integer)
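
For illustration, the per-function meta-information listed above could be held in a record like the following. This is a minimal sketch under assumptions: the struct name `FunctionDescription`, its member names, and the two helper functions are hypothetical and not taken from the MR, and the strings actually stored for kind and value type may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical record mirroring the five items listed above.
struct FunctionDescription
{
   std::string name;      // function name
   std::string kind;      // function kind, e.g. "P2VectorFunction" (assumed spelling)
   std::string valueType; // function value type, e.g. "double" (assumed spelling)
   int         minLevel;  // coarsest level stored in the checkpoint
   int         maxLevel;  // finest level stored in the checkpoint
};

// Check whether the stored range [minLevel, maxLevel] contains the
// level range one would like to restore.
inline bool coversLevelRange( const FunctionDescription& desc, int wantedMin, int wantedMax )
{
   return wantedMin <= wantedMax && desc.minLevel <= wantedMin && wantedMax <= desc.maxLevel;
}

// Human-readable one-line summary, e.g. for printing checkpoint info.
inline std::string describe( const FunctionDescription& desc )
{
   return desc.name + " (" + desc.kind + ", " + desc.valueType + ", levels " +
          std::to_string( desc.minLevel ) + "-" + std::to_string( desc.maxLevel ) + ")";
}
```

A record of this shape would be enough to decide, before any data is read, whether an FE function set up on the importing side is compatible with what the checkpoint contains.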

Meta-information on the checkpointing format, the HyTeG version, etc. is also added. The user can supplement this with additional information, passed as two vectors of strings giving attribute names and attribute values. Thus, one can e.g. add the name of the mesh file.

Importing from a checkpoint currently requires first setting up the desired FE function on a storage with the same primitives and then calling restoreFunction on the importer object. The importer also provides functionality to obtain all function descriptions stored in the checkpoint, to assist in setting up the FE functions, and to print checkpoint info to standard output.

Note that, while for import the PrimitiveStorage needs to be composed of the same primitives as during export, neither the number of MPI processes nor the parallel distribution of primitives needs to be the same.
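
Why the distribution does not matter can be illustrated with a toy model: since data is stored per primitive, a restoring process only needs to look up the primitives it owns. The sketch below is a simplification under assumptions. It uses plain integers as primitive IDs, models processes as entries of a vector instead of using MPI, and all type and function names are hypothetical; the actual implementation goes through ADIOS2.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using PrimitiveID   = std::uint64_t;                    // simplification of a primitive ID
using Buffer        = std::vector< double >;            // complete data buffer of one primitive
using Checkpoint    = std::map< PrimitiveID, Buffer >;  // data keyed by primitive, not by rank
using RankLocalData = std::map< PrimitiveID, Buffer >;  // what one process holds

// Export: every writing process contributes the buffers of its primitives.
inline Checkpoint writeCheckpoint( const std::vector< RankLocalData >& writers )
{
   Checkpoint checkpoint;
   for ( const RankLocalData& local : writers )
   {
      checkpoint.insert( local.begin(), local.end() );
   }
   return checkpoint;
}

// Import: a process fetches exactly the primitives it owns now, no matter
// which process wrote them. Throws std::out_of_range if a primitive is
// missing, i.e. if the storage is not composed of the same primitives.
inline RankLocalData restoreOnRank( const Checkpoint& checkpoint, const std::vector< PrimitiveID >& owned )
{
   RankLocalData local;
   for ( PrimitiveID id : owned )
   {
      local[id] = checkpoint.at( id );
   }
   return local;
}
```

In this model a checkpoint written by two processes holding primitives {10, 11} and {12} can be restored by three processes holding {12}, {10} and {11}, respectively.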

The best way to see how this works in practice is probably to take a look at the CheckpointRestoreTest.cpp file.

Obviously this is only a first draft and some functionality is missing, e.g. the importer currently does not provide access to the user-defined attributes the exporter might have included in the checkpoint. However, before I implement this, I wanted to have a more general discussion on the approach and how to deal with potential issues.

Cheers
Marcus