Supported Engines

This section describes the engines available in ADIOS2 and their specific parameters, which give the user extra control. Parameters are passed as key-value pairs for:

  1. Engine specific parameters

  2. Engine supported transports and parameters

Parameters are passed at:

  1. Compile time IO::SetParameters (adios2_set_parameter in C, Fortran)

  2. Compile time IO::AddTransport (adios2_set_transport_parameter in C, Fortran)

  3. Runtime Configuration Files in the ADIOS component.
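As an illustration, a runtime configuration file entry combining engine and transport parameters might look like the following (a sketch only; the io name "output" and the parameter choices are placeholders, not defaults):

```xml
<?xml version="1.0"?>
<adios-config>
  <!-- the io name must match the name passed to ADIOS::DeclareIO -->
  <io name="output">
    <engine type="BP5">
      <!-- engine-specific key-value parameters -->
      <parameter key="NumAggregators" value="4"/>
    </engine>
    <!-- transport type and its key-value parameters -->
    <transport type="File">
      <parameter key="Library" value="POSIX"/>
    </transport>
  </io>
</adios-config>
```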

BP5

The BP5 Engine writes and reads files in ADIOS2 native binary-pack (bp version 5) format. This was a new format for ADIOS 2.8, improving on the metadata operations and the memory consumption of the older BP4/BP3 formats. BP5 is the default file format as of ADIOS 2.9. As compared to the older format, BP5 provides three main advantages:

  • Lower memory consumption. Deferred Puts will use the user's buffer for I/O wherever possible, thus saving a memory copy. Aggregation uses a fixed-size shared-memory segment on each compute node instead of using MPI to send data from one process to another. Memory consumption can get close to half of BP4 in some cases.

  • Faster metadata management improves write/read performance where hundreds or more variables are added to the output.

  • Improved functionality around appending many output steps into the same file. Better performance than writing new files each step. Restart can append to an existing series by truncating unwanted steps. Readers can filter out unwanted steps to only see and process a limited set of steps. Just as in BP4, existing steps cannot be corrupted by appending new steps.

In 2.8 BP5 was a brand new file format and engine. It still does NOT support some functionality of BP4:

  • Burst buffer support for writing data.

BP5 files have the following structure given a “name” string passed as the first argument of IO::Open:

io.SetEngine("BP5");
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);

will generate:

% BP5 datasets are always a directory
name.bp/

% data and metadata files
name.bp/
        data.0
        data.1
        ...
        data.M
        md.0
        md.idx
        mmd.0

Note

BP5 file names are compatible with the Unix (/) and Windows (\\) file system naming convention for directories and files.

Note

BP5 has an mmd.0 file in the directory, which BP4 does not have.

This engine allows the user to fine tune the buffering operations through the following optional parameters:

  1. Streaming through file

    1. OpenTimeoutSecs: (Streaming mode) Reader may want to wait for the creation of the file in io.Open(). By default the Open() function returns with an error if the file is not found.

    2. BeginStepPollingFrequencySecs: (Streaming mode) Reader can set how frequently to check the file (and file system) for new steps. Default is 1 second, which may be stressful for the file system and unnecessary for the application.

  2. Aggregation

    1. AggregationType: TwoLevelShm, EveryoneWritesSerial and EveryoneWrites are three aggregation strategies. See Aggregation in BP5. The default is TwoLevelShm.

    2. NumAggregators: The number of processes that will ever write data directly to storage. The default is set to the number of compute nodes the application is running on (i.e. one process per compute node). TwoLevelShm will select a fixed number of processes per compute-node to get close to the intention of the user but does not guarantee the exact number of aggregators.

    3. AggregatorRatio: An alternative option to NumAggregators that picks every nth process as aggregator. The number of aggregators is automatically kept within 1 and the total number of processes, regardless of the value supplied here. Moreover, TwoLevelShm will select a fixed number of processes per compute node to get close to the intention of this ratio, but does not guarantee the exact number of aggregators.

    4. NumSubFiles: The number of data files to write to in the .bp/ directory. Only used by the TwoLevelShm aggregator, where the number of files can be smaller than the number of aggregators. The default is set to NumAggregators.

    5. StripeSize: The data blocks of different processes are aligned to this size (default is 4096 bytes) in the files. Its purpose is to prevent multiple processes from writing to the same file system block, which could slow down the write.

    6. MaxShmSize: Upper limit for how much shared memory an aggregator process in TwoLevelShm can allocate. For optimum performance, this should be at least 2xM +1KB where M is the maximum size any process writes in a single step. However, there is no point in allowing for more than 4GB. The default is 4GB.

  3. Buffering

    1. BufferVType: chunk or malloc, default is chunking. Chunking maintains the buffer as a list of memory blocks, either ADIOS-owned for sync-ed Puts and small Puts, or user-owned pointers of deferred Puts. Malloc maintains a single memory block and extends it (reallocates) whenever more data is buffered. Chunking incurs extra cost in I/O by having to write data in chunks (multiple write system calls), which can be helped by increasing BufferChunkSize and MinDeferredSize. Malloc incurs extra cost by reallocating memory whenever more data is buffered (by Put()), which can be helped by increasing InitialBufferSize.

    2. BufferChunkSize: (for chunk buffer type) The size of each memory buffer chunk, default is 128MB but it is worth increasing up to 2147381248 (a bit less than 2GB) if possible for maximum write performance.

    3. MinDeferredSize: (for chunk buffer type) Small user variables are always buffered, default is 4MB.

    4. InitialBufferSize: (for malloc buffer type) initial memory provided for buffering (default and minimum is 16Kb). To avoid reallocations, it is worth increasing this size to the expected maximum total size of data any process would write in any step (not counting deferred Puts).

    5. GrowthFactor: (for malloc buffer type) exponential growth factor for initial buffer > 1, default = 1.05.

  4. Managing steps

    1. AppendAfterSteps: BP5 enables overwriting some existing steps by opening in adios2::Mode::Append mode and specifying how many existing steps to keep. Default value is INT_MAX, so it always appends after the last step. -1 would achieve the same thing. If you have 10 steps in the file,

      • value 0 means starting from the beginning, truncating all existing data

      • value 1 means appending after the first step, so overwrite 2,3…10

      • value 10 means appending after all existing steps

      • value >10 means the same, append after all existing steps (gaps in steps are impossible)

      • -1 means appending after the last step, i.e. same as 10 or higher

      • -2 means removing the last step, i.e. starting from the 10th

      • -11 (and <-11) means truncating all existing data

    2. SelectSteps: BP5 reading allows seeing only selected steps. This is a space-separated list of range definitions in the form of “start:end:step”. Indexing starts from 0. If ‘end’ is ‘n’ or ‘N’, the range is unbounded. Multiple range definitions add up (their union is selected). Note that in the reading functions, counting the steps is always 0 to s-1 where s steps are presented, so even after applying this selection, the selected steps are presented as 0 to s-1. Examples:

      • “0 6 3 2” selects four steps indexed 0,2,3 and 6 (presented in reading as 0,1,2,3)

      • “1:5” selects 5 consecutive steps, skipping step 0, and starting from 1

      • “2:n” selects all steps from step 2

      • “0:n:2” selects every other step from the beginning (0,2,4,6…)

      • “0:n:3 10:n:5” selects every third step from the beginning, plus every fifth step starting from step 10.

  5. Asynchronous writing I/O

    1. AsyncOpen: true/false Call the open function asynchronously. It decreases I/O overhead when creating lots of subfiles (NumAggregators is large) and one calls io.Open() well ahead of the first write step. Only implemented for writing. Default is true.

    2. AsyncWrite: true/false Perform data writing operations asynchronously after EndStep(). Default is false. If the application calls EnterComputationBlock()/ExitComputationBlock() to indicate phases where no communication is happening, ADIOS will try to perform all data writing during those phases, otherwise it will write immediately and eagerly after EndStep().

  6. Direct I/O. Experimental, see discussion on GitHub.

    1. DirectIO: Turn on O_DIRECT when using POSIX transport. Do not use this on parallel file systems.

    2. DirectIOAlignOffset: Alignment for file offsets. Default is 512.

    3. DirectIOAlignBuffer: Alignment for memory pointers. Default is to be same as DirectIOAlignOffset.

  7. Miscellaneous

    1. StatsLevel: 1 turns on Min/Max calculation for every variable, 0 turns this off. Default is 1. It has some cost to generate this metadata so it can be turned off if there is no need for this information.

    2. MaxOpenFilesAtOnce: Specify how many subfiles a process can keep open at once. Default is unlimited. If a dataset contains more subfiles than how many open file descriptors the system allows (see ulimit -n) then one can either try to raise that system limit (set it with ulimit -n), or set this parameter to force the reader to close some subfiles to stay within the limits.

    3. Threads: Read side: Specify how many threads one process can use to speed up reading. The default value is 0, which lets the engine estimate the number of threads based on how many processes are running on the compute node and how many hardware threads are available there, up to a maximum of 16 threads. Value 1 forces the engine to read everything within the main thread of the process. Other values specify the exact number of threads the engine can use. Although multithreaded reading works in a single Get(adios2::Mode::Sync) call if the read selection spans multiple data blocks in the file, the best parallelization is achieved by using deferred mode and reading everything in PerformGets()/EndStep().
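The step-management semantics above (AppendAfterSteps and SelectSteps, item 4) can be sketched with the following illustrative models; this is one reading of the documented rules, not the engine's actual implementation:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// AppendAfterSteps: given the number of steps already in a file opened in
// Append mode, return how many existing steps are kept (new steps are
// written after the kept ones).
int StepsKept(int existingSteps, long long appendAfterSteps)
{
    if (appendAfterSteps >= 0)
    {
        // 0 truncates everything; values past the end keep everything
        return appendAfterSteps > existingSteps
                   ? existingSteps
                   : static_cast<int>(appendAfterSteps);
    }
    // negative values count from the end: -1 keeps all, -2 drops the last step
    const long long kept = existingSteps + appendAfterSteps + 1;
    return kept < 0 ? 0 : static_cast<int>(kept);
}

// SelectSteps: evaluate a space-separated list of "start[:end[:step]]"
// ranges ('n'/'N' meaning unbounded) against a file holding nSteps steps.
// Returns the union of the selected step indices.
std::set<int> EvaluateSelectSteps(const std::string &selection, int nSteps)
{
    std::set<int> selected;
    std::istringstream ranges(selection);
    std::string range;
    while (ranges >> range)
    {
        std::istringstream parts(range);
        std::string field;
        std::getline(parts, field, ':');
        const int start = std::stoi(field);
        int end = start; // a single number selects just that one step
        int stride = 1;
        if (std::getline(parts, field, ':'))
        {
            end = (field == "n" || field == "N") ? nSteps - 1 : std::stoi(field);
        }
        if (std::getline(parts, field, ':'))
        {
            stride = std::stoi(field);
        }
        for (int s = start; s <= end && s < nSteps; s += stride)
        {
            selected.insert(s);
        }
    }
    return selected;
}
```

For instance, with 10 existing steps, StepsKept(10, -2) yields 9, matching the bullet list above.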

Key                            Value Format          Default and Examples
-----------------------------  --------------------  ------------------------------------------------------------
OpenTimeoutSecs                float                 0 for ReadRandomAccess mode, 3600 for Read mode, 10.0, 5
BeginStepPollingFrequencySecs  float                 1, 10.0
AggregationType                string                TwoLevelShm, EveryoneWritesSerial, EveryoneWrites
NumAggregators                 integer >= 1          0 (one file per compute node)
AggregatorRatio                integer >= 1          not used unless set
NumSubFiles                    integer >= 1          =NumAggregators, only used when AggregationType=TwoLevelShm
StripeSize                     integer+units         4KB
MaxShmSize                     integer+units         4294762496
BufferVType                    string                chunk, malloc
BufferChunkSize                integer+units         128MB, worth increasing up to min(2GB, datasize/process/step)
MinDeferredSize                integer+units         4MB
InitialBufferSize              float+units >= 16Kb   16Kb, 10Mb, 0.5Gb
GrowthFactor                   float > 1             1.05, 1.01, 1.5, 2
AppendAfterSteps               integer >= 0          INT_MAX
SelectSteps                    string                “0 6 3 2”, “1:5”, “0:n:3 10:n:5”
AsyncOpen                      string On/Off         On, Off, true, false
AsyncWrite                     string On/Off         Off, On, true, false
DirectIO                       string On/Off         Off, On, true, false
DirectIOAlignOffset            integer >= 0          512
DirectIOAlignBuffer            integer >= 0          set to DirectIOAlignOffset if unset
StatsLevel                     integer, 0 or 1       1, 0
MaxOpenFilesAtOnce             integer >= 0          UINT_MAX, 1024, 1
Threads                        integer >= 0          0, 1, 32
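The recommendation to size InitialBufferSize to the expected data volume (for the malloc buffer type) can be illustrated with a small model of the exponential growth described above; this is a sketch of the arithmetic, not engine code:

```cpp
#include <cassert>
#include <cstddef>

// Model of the malloc buffer strategy: start at initialSize bytes and
// multiply the capacity by growthFactor on every reallocation until it
// can hold requiredSize bytes. Returns the number of reallocations.
int ReallocationsNeeded(std::size_t initialSize, double growthFactor,
                        std::size_t requiredSize)
{
    double capacity = static_cast<double>(initialSize);
    int reallocations = 0;
    while (capacity < static_cast<double>(requiredSize))
    {
        capacity *= growthFactor;
        ++reallocations;
    }
    return reallocations;
}
```

With the 16Kb default and GrowthFactor 1.05, buffering 1GB of data in one process triggers over two hundred reallocations, whereas an InitialBufferSize at or above the expected size triggers none.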

Only file transport types are supported. Optional parameters for IO::AddTransport or in runtime config file transport field:

Transport type: File

Key      Value Format   Default and Examples
-------  -------------  -------------------------------------------
Library  string         POSIX (UNIX), FStream (Windows), stdio, IME

The IME transport directly reads and writes files stored on DDN’s IME burst buffer using the IME native API. To use the IME transport, IME must be available on the target system and ADIOS2 needs to be configured with ADIOS2_USE_IME. By default, data written to the IME is automatically flushed to the parallel filesystem at every EndStep() call. You can disable this automatic flush by setting the transport parameter SyncToPFS to OFF.
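For example, selecting the IME transport and disabling the automatic per-step flush might look like this in a runtime configuration file (a sketch; the io name is a placeholder):

```xml
<io name="output">
  <engine type="BP5"/>
  <transport type="File">
    <parameter key="Library" value="IME"/>
    <!-- keep data on the burst buffer instead of flushing at every EndStep() -->
    <parameter key="SyncToPFS" value="OFF"/>
  </transport>
</io>
```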

BP4

The BP4 Engine writes and reads files in ADIOS2 native binary-pack (bp version 4) format. This was a new format for ADIOS 2.5 and improved on the metadata operations of the older BP3 format. Compared to the older format, BP4 provides three main advantages:

  • Fast and safe appending of multiple output steps into the same file. Better performance than writing new files each step. Existing steps cannot be corrupted by appending new steps.

  • Streaming through files (i.e. online processing). Consumer apps can read existing steps while the Producer is still writing new steps. Reader’s loop can block (with timeout) and wait for new steps to arrive. Same reader code can read the entire data in post or in situ. No restrictions on the Producer.

  • Burst buffer support for writing data. It can write the output to a local file system on each compute node and drain the data to the parallel file system in a separate asynchronous thread. Streaming read from the target file system is still supported when data goes through the burst buffer. Appending to an existing file on the target file system is NOT supported currently.

BP4 files have the following structure given a “name” string passed as the first argument of IO::Open:

io.SetEngine("BP4");
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);

will generate:

% BP4 datasets are always a directory
name.bp/

% data and metadata files
name.bp/
        data.0
        data.1
        ...
        data.M
        md.0
        md.idx

Note

BP4 file names are compatible with the Unix (/) and Windows (\\) file system naming convention for directories and files.

This engine allows the user to fine tune the buffering operations through the following optional parameters:

  1. Profile: turns ON/OFF profiling information right after a run

  2. ProfileUnits: set profile units according to the required measurement scale for intensive operations

  3. Threads: number of threads provided by the application for buffering; use this for variables that are very large in data size

  4. InitialBufferSize: initial memory provided for buffering (minimum is 16Kb)

  5. BufferGrowthFactor: exponential growth factor for initial buffer > 1, default = 1.05.

  6. MaxBufferSize: maximum allowable buffer size (must be larger than 16Kb). If too large adios2 will throw an exception.

  7. FlushStepsCount: users can select how often to produce the more expensive collective metadata file, in terms of steps: default is 1. Increase it to reduce the footprint of adios2 collective operations, at the price of a lower checkpoint frequency. If MaxBufferSize is not set, the buffer will keep growing until the first flush.

  8. NumAggregators (or SubStreams): Users can select how many sub-files (M) are produced during a run. The value ranges between 1 and the number of MPI processes, MPI_Size (N); adios2 will internally aggregate data buffers (N-to-M) to output the required number of sub-files. Default is 0, which lets adios2 group processes by shared-memory access (i.e. one group per compute node) and use one process per node as an aggregator. If NumAggregators is larger than the number of processes then it will be set to the number of processes.

  9. AggregatorRatio: An alternative option to NumAggregators to pick every Nth process as aggregator. An integer divider of the number of processes is required, otherwise a runtime exception is thrown.

  10. OpenTimeoutSecs: (Streaming mode) Reader may want to wait for the creation of the file in io.Open(). By default the Open() function returns with an error if file is not found.

  11. BeginStepPollingFrequencySecs: (Streaming mode) Reader can set how frequently to check the file (and file system) for new steps. Default is 1 seconds which may be stressful for the file system and unnecessary for the application.

  12. StatsLevel: Turn on/off calculating statistics for every variable (Min/Max). Default is On. It has some cost to generate this metadata so it can be turned off if there is no need for this information.

  13. StatsBlockSize: Calculate Min/Max for a given size of each process output. Default is one Min/Max per writer. More fine-grained min/max can be useful for querying the data.

  14. NodeLocal or Node-Local: For distributed file systems. Every writer process must make sure the .bp/ directory is created on the local file system. Required when writing to local disk/SSD/NVMe in a cluster. Note: the BurstBuffer* parameters are newer and should be preferred over this parameter for using local storage as temporary space.

  15. BurstBufferPath: Redirect output file to another location and drain it to the original target location in an asynchronous thread. It requires the ability to launch one thread per aggregator (see SubStreams) on the system. This feature can be used on machines that have local NVMe/SSDs on each node to accelerate the output writing speed. On Summit at OLCF, use “/mnt/bb/<username>” for the path where <username> is your user account name. Temporary files on the accelerated storage will be automatically deleted after the application closes the output and ADIOS drains all data to the file system, unless draining is turned off (see the next parameter). Note: at this time, this feature cannot be used to append data to an existing dataset on the target system.

  16. BurstBufferDrain: To write only to the accelerated storage but to not drain it to the target file system, set this flag to false. Data will NOT be deleted from the accelerated storage on close. By default, setting the BurstBufferPath will turn on draining.

  17. BurstBufferVerbose: Verbose level 1 will cause each draining thread to print a one line report at the end (to standard output) about where it has spent its time and the number of bytes moved. Verbose level 2 will cause each thread to print a line for each draining operation (file creation, copy block, write block from memory, etc).

  18. StreamReader: By default the BP4 engine parses all available metadata in Open(). An application may turn this flag on to parse a limited number of steps at once, and update metadata when those steps have been processed. If the flag is ON, reading only works in streaming mode (using BeginStep/EndStep); file reading mode will not work as there will be zero steps processed in Open().
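For instance, parameters 15-17 above could be combined to route BP4 output through node-local storage (a sketch; the burst buffer path is machine-specific and shown here as a placeholder):

```xml
<io name="output">
  <engine type="BP4">
    <!-- write to node-local storage, drain asynchronously to the target path -->
    <parameter key="BurstBufferPath" value="/mnt/bb/username"/>
    <!-- print a one-line draining report per aggregator at the end -->
    <parameter key="BurstBufferVerbose" value="1"/>
  </engine>
</io>
```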

Key                            Value Format          Default and Examples
-----------------------------  --------------------  ------------------------------------------------------------
Profile                        string On/Off         On, Off
ProfileUnits                   string                Microseconds, Milliseconds, Seconds, Minutes, Hours
Threads                        integer > 1           1, 2, 3, 4, 16, 32, 64
InitialBufferSize              float+units >= 16Kb   16Kb, 10Mb, 0.5Gb
MaxBufferSize                  float+units >= 16Kb   at EndStep, 10Mb, 0.5Gb
BufferGrowthFactor             float > 1             1.05, 1.01, 1.5, 2
FlushStepsCount                integer > 1           1, 5, 1000, 50000
NumAggregators                 integer >= 1          0 (one file per compute node), MPI_Size/2, … , 2, (N-to-1) 1
AggregatorRatio                integer >= 1          not used unless set, MPI_Size/N must be an integer value
OpenTimeoutSecs                float                 0, 10.0, 5
BeginStepPollingFrequencySecs  float                 1, 10.0
StatsLevel                     integer, 0 or 1       1, 0
StatsBlockSize                 integer > 0           a very big number, 1073741824 for blocks with 1M elements
NodeLocal                      string On/Off         Off, On
Node-Local                     string On/Off         Off, On
BurstBufferPath                string                “”, /mnt/bb/norbert, /ssd
BurstBufferDrain               string On/Off         On, Off
BurstBufferVerbose             integer, 0-2          0, 1, 2
StreamReader                   string On/Off         On, Off

Only file transport types are supported. Optional parameters for IO::AddTransport or in runtime config file transport field:

Transport type: File

Key      Value Format   Default and Examples
-------  -------------  -------------------------------------------
Library  string         POSIX (UNIX), FStream (Windows), stdio, IME

The IME transport directly reads and writes files stored on DDN’s IME burst buffer using the IME native API. To use the IME transport, IME must be available on the target system and ADIOS2 needs to be configured with ADIOS2_USE_IME. By default, data written to the IME is automatically flushed to the parallel filesystem at every EndStep() call. You can disable this automatic flush by setting the transport parameter SyncToPFS to OFF.

BP3

The BP3 Engine writes and reads files in ADIOS2 native binary-pack (bp) format. BP files are backwards compatible with ADIOS1.x and have the following structure given a “name” string passed as the first argument of IO::Open:

adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);

will generate:

% collective metadata file
name.bp

% data directory and files
name.bp.dir/
            name.bp.0
            name.bp.1
            ...
            name.bp.M

Note

BP3 file names are compatible with the Unix (/) and Windows (\\) file system naming convention for directories and files.

Caution

The default BP3 engine will check whether .bp is the extension of the first argument of IO::Open, and will add .bp and .bp.dir if it is not.

This engine allows the user to fine tune the buffering operations through the following optional parameters:

  1. Profile: turns ON/OFF profiling information right after a run

  2. ProfileUnits: set profile units according to the required measurement scale for intensive operations

  3. CollectiveMetadata: turns ON/OFF forming collective metadata during run (used by large scale HPC applications)

  4. Threads: number of threads provided by the application for buffering; use this for variables that are very large in data size

  5. InitialBufferSize: initial memory provided for buffering (minimum is 16Kb)

  6. BufferGrowthFactor: exponential growth factor for initial buffer > 1, default = 1.05.

  7. MaxBufferSize: maximum allowable buffer size (must be larger than 16Kb). If too large adios2 will throw an exception.

  8. FlushStepsCount: users can select how often to produce the more expensive collective metadata file in terms of steps: default is 1. Increase to reduce adios2 collective operations footprint, with the trade-off of reducing checkpoint frequency. Buffer size will increase until first steps count if MaxBufferSize is not set.

  9. NumAggregators (or SubStreams): Users can select how many sub-files (M) are produced during a run, ranges between 1 and the number of mpi processes from MPI_Size (N), adios2 will internally aggregate data buffers (N-to-M) to output the required number of sub-files. Default is 0, which will let adios2 to group processes per shared-memory-access (i.e. one per compute node) and use one process per node as an aggregator. If NumAggregators is larger than the number of processes then it will be set to the number of processes.

  10. AggregatorRatio: An alternative option to NumAggregators to pick every Nth process as aggregator. An integer divider of the number of processes is required, otherwise a runtime exception is thrown.

  11. Node-Local: For distributed file system. Every writer process must make sure the .bp/ directory is created on the local file system. Required for using local disk/SSD/NVMe in a cluster.

Key                 Value Format          Default and Examples
------------------  --------------------  ------------------------------------------------------------
Profile             string On/Off         On, Off
ProfileUnits        string                Microseconds, Milliseconds, Seconds, Minutes, Hours
CollectiveMetadata  string On/Off         On, Off
Threads             integer > 1           1, 2, 3, 4, 16, 32, 64
InitialBufferSize   float+units >= 16Kb   16Kb, 10Mb, 0.5Gb
MaxBufferSize       float+units >= 16Kb   at EndStep, 10Mb, 0.5Gb
BufferGrowthFactor  float > 1             1.05, 1.01, 1.5, 2
FlushStepsCount     integer > 1           1, 5, 1000, 50000
NumAggregators      integer >= 1          0 (one file per compute node), MPI_Size/2, … , 2, (N-to-1) 1
AggregatorRatio     integer >= 1          not used unless set, MPI_Size/N must be an integer value
Node-Local          string On/Off         Off, On

Only file transport types are supported. Optional parameters for IO::AddTransport or in runtime config file transport field:

Transport type: File

Key      Value Format   Default and Examples
-------  -------------  -------------------------------------------
Library  string         POSIX (UNIX), FStream (Windows), stdio, IME

HDF5

In ADIOS2, the default engine for reading and writing HDF5 files is called “HDF5”. To use this engine, you can either specify it in your xml config file, with tag <engine type=HDF5>, or set it in client code. For example, here is how to create an HDF5 reader:

adios2::IO h5IO = adios.DeclareIO("SomeName");
h5IO.SetEngine("HDF5");
adios2::Engine h5Reader = h5IO.Open(filename, adios2::Mode::Read);

To read the h5 files generated with VDS back into ADIOS2, one can use the HDF5 engine. Please make sure the HDF5 library used by ADIOS2 has version 1.11 or greater.

The h5 file generated by ADIOS2 has two levels of groups: the top group, /, and its subgroups Step0, Step1, …, StepN, where N is the number of steps. All datasets belong to the subgroups.

Any other h5 file can be read back to ADIOS as well. To be consistent, when reading back to ADIOS2, we assume a default Step0, and all datasets from the original h5 file belong to that subgroup. The full path of a dataset (from the original h5 file) is used when represented in ADIOS2.

We can pass options to the HDF5 API from the ADIOS xml configuration. Currently we support collective MPI I/O (H5CollectiveMPIO, default false) and chunk specifications. The chunk specification uses spaces to separate values, and by default, if a valid H5ChunkDim exists, it applies to all variables, unless H5ChunkVar is specified. Examples:

<parameter key="H5CollectiveMPIO" value="yes"/>
<parameter key="H5ChunkDim" value="200 200"/>
<parameter key="H5ChunkVar" value="VarName1 VarName2"/>

We suggest reading the HDF5 documentation before applying these options.

With the subfiling feature introduced in HDF5 version 1.14, the ADIOS2 HDF5 engine uses subfiles as the default h5 format, as it improves I/O in general (for example, see https://escholarship.org/uc/item/6fs7s3jb).

To use the subfiling feature, the client needs to initialize MPI via MPI_Init_thread with MPI_THREAD_MULTIPLE support.

Useful parameters from the HDF library to tune subfiles are:

H5FD_SUBFILING_IOC_PER_NODE (number of subfiles per node)
  Set H5FD_SUBFILING_IOC_PER_NODE to 0 if a regular h5 file is preferred, before using the ADIOS2 HDF5 engine.
H5FD_SUBFILING_STRIPE_SIZE
H5FD_IOC_THREAD_POOL_SIZE

SST Sustainable Staging Transport

In ADIOS2, the Sustainable Staging Transport (SST) is an engine that allows direct connection of data producers and consumers via the ADIOS2 write/read APIs. This is a classic streaming data architecture where the data passed to ADIOS on the write side (via Put() deferred and sync, and similar calls) is made directly available to a reader (via Get(), deferred and sync, and similar calls).

SST is designed for use in HPC environments and can take advantage of RDMA network interconnects to speed the transfer of data between communicating HPC applications; however, it is also capable of operating in a Wide Area Networking environment over standard sockets. SST supports full MxN data distribution, where the number of reader ranks can differ from the number of writer ranks. SST also allows multiple reader cohorts to get access to a writer’s data simultaneously.

To use this engine, you can either specify it in your xml config file, with tag <engine type=SST> or, set it in client code. For example, here is how to create an SST reader:

adios2::IO sstIO = adios.DeclareIO("SomeName");
sstIO.SetEngine("SST");
adios2::Engine sstReader = sstIO.Open(filename, adios2::Mode::Read);

and a sample code for SST writer is:

adios2::IO sstIO = adios.DeclareIO("SomeName");
sstIO.SetEngine("SST");
adios2::Engine sstWriter = sstIO.Open(filename, adios2::Mode::Write);

The general goal of ADIOS2 is to ease the conversion of a file-based application to instead use a non-file streaming interconnect, for example, data producers such as computational physics codes and consumers such as analysis applications. However, there are some uses of ADIOS2 APIs that work perfectly well with the ADIOS2 file engines, but which will not work or will perform badly with streaming. For example, SST is based upon the “step” concept and ADIOS2 applications that use SST must call BeginStep() and EndStep(). On the writer side, the Put() calls between BeginStep and EndStep are the unit of communication and represent the data that will be available between the corresponding Begin/EndStep calls on the reader.

Also, it is recommended that SST-based applications not use the ADIOS2 Get() sync method unless there is only one data item to be read per step. This is because SST implements MxN data transfer (and avoids having to deliver all data to every reader) by queueing data on the writer ranks until it is known which reader rank requires it. Normally this data fetch stage is initiated by PerformGets() or EndStep(), both of which fulfill any pending Get() deferred operations. However, unlike Get() deferred, the semantics of Get() sync require the requested data to be fetched from the writers before the call can return. If there are multiple calls to Get() sync per step, each one may require a communication with many writers, something that would only have had to happen once if Get() deferred were used instead. Thus the use of Get() sync is likely to incur a substantial performance penalty.

On the writer side, depending upon the chosen data marshaling option there may be some (relatively small) performance differences between Put() sync and Put() deferred, but they are unlikely to be as substantial as between Get() sync and Get() deferred.

Note that SST readers and writers do not necessarily move in lockstep, but depending upon the queue length parameters and queueing policies specified, differing reader and writer speeds may cause one or the other side to wait for data to be produced or consumed, or data may be dropped if allowed by the queueing policy. However, steps themselves are atomic and no step will be partially dropped, delivered to a subset of ranks, or otherwise divided.

The SST engine allows the user to customize the streaming operations through the following optional parameters:

1. RendezvousReaderCount: Default 1. This integer value specifies the number of readers for which the writer should wait before the writer-side Open() returns. The default of 1 implements an ADIOS1/flexpath style “rendezvous”, in which an early-starting reader will wait for the writer to start, or vice versa. A number >1 will cause the writer to wait for more readers and a value of 0 will allow the writer to proceed without any readers present. This value is interpreted by SST Writer engines only.

2. RegistrationMethod: Default “File”. By default, SST reader and writer engines communicate network contact information via files in a shared filesystem. Specifically, the "filename" parameter in the Open() call is interpreted as a path which the writer uses as the name of a file to which contact information is written, and from which a reader will attempt to read contact information. As with other file-based engines, file creation and access is subject to the usual considerations (directory components are interpreted, but must exist and be traversable, writer must be able to create the file and the reader must be able to read it). Generally the file so created will exist only for as long as the writer keeps the stream Open(), but abnormal process termination may leave “stale” files in those locations. These stray “.sst” files should be deleted to avoid confusing future readers. SST also offers a “Screen” registration method in which writers and readers send their contact information to, and read it from, stdout and stdin respectively. The “screen” registration method doesn’t support batch mode operations in any way, but may be useful when manually starting jobs on machines in a WAN environment that don’t share a filesystem. A future release of SST will also support a “Cloud” registration method where contact information is registered to and retrieved from a network-based third-party server so that both the shared filesystem and interactivity can be avoided. This value is interpreted by both SST Writer and Reader engines.

3. QueueLimit: Default 0. This integer value specifies the number of steps which the writer will allow to be queued before taking specific action (such as discarding data or waiting for readers to consume the data). The default value of 0 is interpreted as no limit. This value is interpreted by SST Writer engines only.

4. QueueFullPolicy: Default “Block”. This value controls what policy is invoked if a non-zero QueueLimit has been specified and new data would cause the queue limit to be reached. Essentially, the “Block” option ensures data will not be discarded and if the queue fills up the writer will block on EndStep until the data has been read. If there is one active reader, EndStep will block until data has been consumed off the front of the queue to make room for newly arriving data. If there is more than one active reader, it is only removed from the queue when it has been read by all readers, so the slowest reader will dictate progress. NOTE THAT THE NO READERS SITUATION IS A SPECIAL CASE: If there are no active readers, new timesteps are considered to have completed their active queueing immediately upon submission. They may be retained in the “reserve queue” if the ReserveQueueLimit is non-zero. However, if that ReserveQueueLimit parameter is zero, timesteps submitted when there are no active readers will be immediately discarded.

Besides “Block”, the other acceptable value for QueueFullPolicy is “Discard”. When “Discard” is specified, and an EndStep operation would add more than the allowed number of steps to the queue, some step is discarded. If there are no current readers connected to the stream, the oldest data in the queue is discarded. If there are current readers, then the newest data (i.e. the just-created step) is discarded. (The differential treatment is because SST sends metadata for each step to the readers as soon as the step is accepted and cannot reliably prevent the use of that data without a costly all-to-all synchronization operation. Discarding the newest data instead is less satisfying, but has a similar long-term effect upon the set of steps delivered to the readers.) This value is interpreted by SST Writer engines only.
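
As a sketch, a writer that must never block in EndStep but may drop the newest step once three steps are queued could be configured in a runtime XML file like the following (the io name "output" is illustrative, not prescribed by SST):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- allow at most 3 queued steps -->
            <parameter key="QueueLimit" value="3"/>
            <!-- discard rather than block when the queue is full -->
            <parameter key="QueueFullPolicy" value="Discard"/>
        </engine>
    </io>
</adios-config>
```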

5. ReserveQueueLimit: Default 0. This integer value specifies the number of steps which the writer will keep in the queue for the benefit of late-arriving readers. This may consist of timesteps that have already been consumed by any readers, as well as timesteps that have not yet been consumed. In some sense this is a target minimum queue size, while QueueLimit is a maximum size. This value is interpreted by SST Writer engines only.

6. DataTransport: Default varies. This string value specifies the underlying network communication mechanism to use for exchanging data in SST. Generally this is chosen by SST based upon what is available on the current platform. However, specifying this engine parameter allows overriding SST’s choice. Current allowed values are “UCX”, “MPI”, “RDMA”, and “WAN”. (ib and fabric are accepted as equivalent to RDMA and evpath is equivalent to WAN.) Generally both the reader and writer should be using the same network transport, and the network transport chosen may be dictated by the situation. For example, the RDMA transport generally operates only between applications running on the same high-performance interconnect (e.g. on the same HPC machine). If communication is desired between applications running on different interconnects, the Wide Area Network (WAN) option should be chosen. This value is interpreted by both SST Writer and Reader engines.

7. WANDataTransport: Default sockets. If the SST DataTransport parameter is “WAN”, this string value specifies the EVPath-level data transport to use for exchanging data. The value must be a data transport known to EVPath, such as “sockets”, “enet”, or “ib”. Generally both the reader and writer should be using the same EVPath-level data transport. This value is interpreted by both SST Writer and Reader engines.

8. ControlTransport: Default tcp. This string value specifies the underlying network communication mechanism to use for performing control operations in SST. SST can be configured to use standard TCP sockets, which are very reliable and efficient but limited in their scalability. Alternatively, SST can use a reliable UDP protocol that is more scalable but, as of ADIOS2 Release 2.4.0, still suffers from some reliability problems. (sockets is accepted as equivalent to tcp, and udp, rudp, and enet are equivalent to scalable.) Generally both the reader and writer should be using the same control transport. This value is interpreted by both SST Writer and Reader engines.
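
Putting the transport parameters together, a hypothetical configuration for two applications communicating across a wide area network might look like this (the io name is illustrative; both sides would need a matching configuration):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- cross-interconnect streaming: use the WAN data plane -->
            <parameter key="DataTransport" value="WAN"/>
            <!-- EVPath-level transport for the WAN data plane -->
            <parameter key="WANDataTransport" value="sockets"/>
            <!-- standard TCP sockets for control operations -->
            <parameter key="ControlTransport" value="tcp"/>
        </engine>
    </io>
</adios-config>
```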

9. NetworkInterface: Default NULL. In situations in which there are multiple possible network interfaces available to SST, this string value specifies which should be used to generate SST’s contact information for writers. Generally this should NOT be specified except in narrow sets of circumstances. It has no effect if specified on Reader engines. If specified, the string value should correspond to the name of a network interface, such as those listed by commands like “netstat -i”. For example, on most Unix systems, setting the NetworkInterface parameter to “lo” (or possibly “lo0”) will result in SST generating contact information that uses the network address associated with the loopback interface (127.0.0.1). This value is interpreted only by the SST Writer engine.

10. ControlInterface: Default NULL. This value is similar to the NetworkInterface parameter, but only applies to the SST layer which does messaging for control (open, close, flow and timestep management, but not actual data transfer). Generally the NetworkInterface parameter can be used to control this, but that also applies to the Data Plane. Use ControlInterface in the event of conflicting specifications.

11. DataInterface: Default NULL. This value is similar to the NetworkInterface parameter, but only applies to the SST layer which does messaging for data transfer, not control (open, close, flow and timestep management). Generally the NetworkInterface parameter can be used to control this, but that also applies to the Control Plane. Use DataInterface in the event of conflicting specifications. In the case of the RDMA data plane, this parameter controls the libfabric interface choice.

12. FirstTimestepPrecious: Default FALSE. FirstTimestepPrecious is a boolean parameter that affects the queueing of the first timestep presented to the SST Writer engine. If FirstTimestepPrecious is TRUE, then the first timestep is effectively never removed from the output queue and will be presented as a first timestep to any reader that joins at a later time. This can be used to convey run parameters or other information that every reader may need despite joining later in a data stream. Note that this queued first timestep does count against the QueueLimit parameter above, so if a QueueLimit is specified, it should be a value larger than 1. Further note that while specifying this parameter guarantees that the preserved first timestep will be made available to new readers, other reader-side operations (like requesting the LatestAvailable timestep in Engine parameters) might still cause the timestep to be skipped. This value is interpreted only by the SST Writer engine.
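
For instance, a writer that publishes run parameters in its first step and wants every late-joining reader to see them could be configured as below. Note that QueueLimit is set larger than 1, as the text above requires, since the preserved first step counts against the limit (the io name is illustrative):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- keep the first step available for late-arriving readers -->
            <parameter key="FirstTimestepPrecious" value="TRUE"/>
            <!-- must be > 1 because the precious first step occupies a slot -->
            <parameter key="QueueLimit" value="5"/>
        </engine>
    </io>
</adios-config>
```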

13. AlwaysProvideLatestTimestep: Default FALSE. AlwaysProvideLatestTimestep is a boolean parameter that affects which of the available timesteps will be provided to the reader engine. If AlwaysProvideLatestTimestep is TRUE, then if there are multiple timesteps available to the reader, older timesteps will be skipped and the reader will see only the newest available upon BeginStep. This value is interpreted only by the SST Reader engine.

14. OpenTimeoutSecs: Default 60. OpenTimeoutSecs is an integer parameter that specifies the number of seconds SST is to wait for a peer connection on Open(). Currently this is only implemented on the Reader side of SST, and is a timeout for locating the contact information file created by Writer-side Open, not for completing the entire Open() handshake. Currently this value is interpreted only by the SST Reader engine.
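
As a small sketch, a reader that should keep looking for the writer's contact file for five minutes before giving up could be configured like this (the io name "input" is illustrative):

```xml
<adios-config>
    <io name="input">
        <engine type="SST">
            <!-- wait up to 300 seconds for the writer's contact file -->
            <parameter key="OpenTimeoutSecs" value="300"/>
        </engine>
    </io>
</adios-config>
```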

15. SpeculativePreloadMode: Default AUTO. In some circumstances, SST eagerly sends all data from writers to every reader without first waiting for read requests. Generally this improves performance if every reader needs all the data, but can be very detrimental otherwise. The value AUTO for this engine parameter instructs SST to apply its own heuristic for determining if data should be eagerly sent. The value OFF disables this feature and the value ON causes eager sending regardless of the heuristic. Currently SST’s heuristic is simple: if the size of the reader cohort is less than or equal to the value of the SpecAutoNodeThreshold engine parameter (default value 1), eager sending is initiated. Currently this value is interpreted only by the SST Reader engine.

16. SpecAutoNodeThreshold: Default 1. If the size of the reader cohort is less than or equal to this value and the SpeculativePreloadMode parameter is AUTO, SST will initiate eager data sending of all data from each writer to all readers. Currently this value is interpreted only by the SST Reader engine.

17. StepDistributionMode: Default “AllToAll”. This value controls how steps are distributed, particularly when there are multiple readers. By default, the value is “AllToAll”, which means that all timesteps are to be delivered to all readers (subject to discard rules, etc.). In other distribution modes, this is not the case. For example, in “RoundRobin”, each step is delivered only to a single reader, determined in a round-robin fashion based upon the number of readers who have opened the stream at the time the step is submitted. In “OnDemand” each step is delivered to a single reader, but only upon request (with a request being initiated by the reader doing BeginStep()). Normal reader-side rules (like BeginStep timeouts) and writer-side rules (like queue limit behavior) apply.
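
For example, to farm steps out to a pool of analysis readers so that each step is processed by exactly one of them, a writer might be configured as follows (the io name is illustrative):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- each step goes to exactly one reader, in rotation -->
            <parameter key="StepDistributionMode" value="RoundRobin"/>
        </engine>
    </io>
</adios-config>
```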

Key                          Value Format  Default and Examples
---------------------------  ------------  -----------------------------------------------
RendezvousReaderCount        integer       1
RegistrationMethod           string        File, Screen
QueueLimit                   integer       0 (no queue limits)
QueueFullPolicy              string        Block, Discard
ReserveQueueLimit            integer       0 (no queue limits)
DataTransport                string        default varies by platform, UCX, MPI, RDMA, WAN
WANDataTransport             string        sockets, enet, ib
ControlTransport             string        TCP, Scalable
MarshalMethod                string        BP5, BP, FFS
NetworkInterface             string        NULL
ControlInterface             string        NULL
DataInterface                string        NULL
FirstTimestepPrecious        boolean       FALSE, true, no, yes
AlwaysProvideLatestTimestep  boolean       FALSE, true, no, yes
OpenTimeoutSecs              integer       60
SpeculativePreloadMode       string        AUTO, ON, OFF
SpecAutoNodeThreshold        integer       1

SSC Strong Staging Coupler

The SSC engine is designed specifically for strong code coupling. Currently SSC only supports a fixed IO pattern, which means that once the first step is finished, users are not allowed to write or read a data block with a start and count that have not been written or read in the first step. SSC uses a combination of one-sided and two-sided MPI methods. In all cases, user applications are required to be launched within a single mpirun or mpiexec command, using the MPMD mode.

The SSC engine takes the following parameters:

  1. OpenTimeoutSecs: Default 10. Timeout in seconds for opening a stream. The SSC engine’s open function will block until the RendezvousAppCount is reached, or timeout, whichever comes first. If it reaches the timeout, SSC will throw an exception.

  2. Threading: Default False. SSC will use threads to hide the time cost of metadata manipulation and data transfer when this parameter is set to true. SSC will check if MPI is initialized with multi-thread support enabled, and if not, SSC will force this parameter to false. Please do NOT enable threading when multiple I/O streams are opened in an application, as it will cause unpredictable errors. This parameter is only effective when writer definitions and reader selections are NOT locked. For cases where definitions and reader selections are locked, SSC has a more optimized way to do data transfers, and thus it will not use this parameter.
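
A coupled application that wants a longer rendezvous window and threaded transfers could configure SSC as below (the io name "coupling" is illustrative):

```xml
<adios-config>
    <io name="coupling">
        <engine type="SSC">
            <!-- wait up to 20 seconds for the peer application to open -->
            <parameter key="OpenTimeoutSecs" value="20"/>
            <!-- requires MPI initialized with multi-thread support -->
            <parameter key="Threading" value="true"/>
        </engine>
    </io>
</adios-config>
```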

Key              Value Format  Default and Examples
---------------  ------------  --------------------
OpenTimeoutSecs  integer       10, 2, 20, 200
Threading        bool          false, true

DataMan for Wide Area Network Data Staging

The DataMan engine is designed for data staging over the wide area network. It is intended for cases where a few writers send data to a few readers over long distances.

DataMan supports compression operators such as ZFP lossy compression and BZip2 lossless compression. Please refer to the operator section for usage.

The DataMan engine takes the following parameters:

  1. IPAddress: No default value. The IP address of the host where the writer application runs. This parameter is compulsory in wide area network data staging.

  2. Port: Default 50001. The port number on the writer host that will be used for data transfers.

  3. Timeout: Default 5. Timeout in seconds to wait for every send / receive operation. Packages not sent or received within this time are considered lost.

  4. RendezvousReaderCount: Default 1. This integer value specifies the number of readers for which the writer should wait before the writer-side Open() returns. By default, an early-starting writer will wait for the reader to start, or vice versa. A number >1 will cause the writer to wait for more readers, and a value of 0 will allow the writer to proceed without any readers present. This value is interpreted by DataMan Writer engines only.

  5. Threading: Default true for reader, false for writer. Whether to use threads for send and receive operations. Enabling threading will cause extra overhead for managing threads and buffer queues, but will improve the continuity of data steps for readers, and help overlap data transfers with computations for writers.

  6. TransportMode: Default fast. Only DataMan writers take this parameter. Readers are automatically synchronized at runtime to match the writers’ transport mode. The fast mode is optimized for latency-critical applications. It forces readers to receive only the latest step; therefore, in cases where writers are faster than readers, readers will skip some data steps. The reliable mode ensures that all steps are received by readers, sacrificing performance compared to the fast mode.

  7. MaxStepBufferSize: Default 128000000. In order to bring down the latency in wide area network staging use cases, DataMan uses a fixed receiver buffer size. This saves an extra communication operation to sync the buffer size for each step, before sending actual data. The default buffer size is 128 MB, which is sufficient for most use cases. However, in case 128 MB is not enough, this parameter must be set correctly, otherwise DataMan will fail.
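
Combining the parameters above, a writer-side DataMan configuration might look like the following. The IP address is the example value from the table below; the io name "stream" is illustrative:

```xml
<adios-config>
    <io name="stream">
        <engine type="DataMan">
            <!-- address of the host where the writer runs (compulsory) -->
            <parameter key="IPAddress" value="22.195.18.29"/>
            <parameter key="Port" value="50001"/>
            <!-- ensure readers receive every step, at some cost in latency -->
            <parameter key="TransportMode" value="reliable"/>
        </engine>
    </io>
</adios-config>
```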

Key                    Value Format  Default and Examples
---------------------  ------------  ---------------------------------
IPAddress              string        N/A, 22.195.18.29
Port                   integer       50001, 22000, 33000
Timeout                integer       5, 10, 30
RendezvousReaderCount  integer       1, 0, 3
Threading              bool          true for reader, false for writer
TransportMode          string        fast, reliable
MaxStepBufferSize      integer       128000000, 512000000, 1024000000

DataSpaces

The DataSpaces engine for ADIOS2 is experimental. DataSpaces is an asynchronous I/O transfer method within ADIOS that enables low-overhead, high-throughput data extraction from a running simulation. DataSpaces is designed for use in HPC environments and can take advantage of RDMA network interconnects to speed the transfer of data between communicating HPC applications. DataSpaces supports full MxN data distribution, where the number of reader ranks can differ from the number of writer ranks. In addition, this engine supports multiple reader and writer applications, which must be distinguished by unique values of AppID for different applications. It can be set in the XML config file with the tag <parameter key="AppID" value="2"/>. The value should be unique for each application or client.
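
In context, the AppID parameter from the text above sits inside the engine element of the config file; each client application would use its own value (the io name "SomeName" matches the DeclareIO call in the example below):

```xml
<adios-config>
    <io name="SomeName">
        <engine type="DATASPACES">
            <!-- unique per application/client sharing the DataSpaces servers -->
            <parameter key="AppID" value="2"/>
        </engine>
    </io>
</adios-config>
```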

To use this engine, you can either specify it in your XML config file, with the tag <engine type="DATASPACES">, or set it in client code. For example, here is how to create a DataSpaces reader:

adios2::IO dspacesIO = adios.DeclareIO("SomeName");
dspacesIO.SetEngine("DATASPACES");
adios2::Engine dspacesReader = dspacesIO.Open(filename, adios2::Mode::Read);

and a sample code for DataSpaces writer is:

adios2::IO dspacesIO = adios.DeclareIO("SomeName");
dspacesIO.SetEngine("DATASPACES");
adios2::Engine dspacesWriter = dspacesIO.Open(filename, adios2::Mode::Write);

To make use of the DataSpaces engine, an application job needs to also run the dataspaces_server component together with the application. The server should be configured and started before the application as a separate job in the system. For example:

aprun -n $SPROC ./dataspaces_server -s $SPROC &> log.server &

The variable $SPROC represents the number of server instances to run. The & character at the end of the line places the aprun command in the background, allowing the job script to continue and run the other applications. The server processes produce a configuration file, i.e., conf.0, that is used by the application to connect to the servers. This file contains identifying information of the master server, which coordinates the client registration and discovery process. The job script should wait for the servers to start up and produce the conf.0 file before starting the client application processes.

The server also needs a user configuration read from a text file called dataspaces.conf. It should specify how many output timesteps of the same dataset (called versions) should be kept in the server’s memory and served to readers. If this file does not exist in the current directory, the server will assume default values (only 1 timestep stored). For example:

## Config file for DataSpaces
max_versions = 5
lock_type = 3

The dataspaces_server module is a stand-alone service that runs independently of a simulation on a set of dedicated nodes in the staging area. It transfers data from the application through RDMA, and can save it to a local storage system, e.g., the Lustre file system, stream it to remote sites, e.g., auxiliary clusters, or serve it directly from the staging area to other applications. One instance of the dataspaces_server can service multiple applications in parallel. Further, the server can run in cooperative mode (i.e., multiple instances of the server cooperate to service the application in parallel and to balance load). The dataspaces_server receives notification messages from the transport method, schedules the requests, and initiates the data transfers in parallel. The server schedules and prioritizes the data transfers while the simulation is computing in order to overlap data transfers with computations, to maximize data throughput, and to minimize the overhead on the application.

Inline for zero-copy

The Inline engine provides in-process communication between the writer and reader, avoiding the copy of data buffers.

This engine is focused on the N→N case: N writers share a process with N readers, and the analysis happens ‘inline’ without writing the data to a file or copying to another buffer. It is similar to the streaming SST engine, since analysis must happen per step.

To use this engine, you can either add <engine type="Inline"> to your XML config file, or set it in your application code:

adios2::IO io = adios.DeclareIO("ioName");
io.SetEngine("Inline");
adios2::Engine inlineWriter = io.Open("inline_write", adios2::Mode::Write);
adios2::Engine inlineReader = io.Open("inline_read", adios2::Mode::Read);

Notice that unlike other engines, the reader and writer share an IO instance. Both the writer and reader must be opened before either tries to call BeginStep()/PerformPuts()/PerformGets(). There must be exactly one writer, and exactly one reader.

For successful operation, the writer will perform a step, then the reader will perform a step in the same process. When the reader starts its step, the only data it has available is that written by the writer in its process. The reader then can retrieve whatever data was written by the writer by using the double-pointer Get call:

void Engine::Get<T>(Variable<T>, T**) const;

This version of Get is only used for the inline engine. See the example below for details.

Note

Since the inline engine does not copy any data, the writer should avoid changing the data before the reader has read it.

Typical access pattern:

// ... Application data generation

inlineWriter.BeginStep();
inlineWriter.Put(var, in_data); // always use deferred mode
inlineWriter.EndStep();
// Unlike other engines, data should not be reused at this point (since ADIOS
// does not copy the data), though ADIOS cannot enforce this.
// Must wait until reader is finished using the data.

inlineReader.BeginStep();
double* out_data;
inlineReader.Get(var, &out_data);
// Now in_data == out_data.
inlineReader.EndStep();

Null

The Null Engine performs no internal work and no I/O. It was created for testing applications that produce ADIOS2 output, by making it easy to turn off the I/O. The runtime difference between a run with the Null engine and one with another engine tells us the I/O overhead of that particular output with that particular engine.

adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("Null");

or using the XML config file:

<adios-config>
    <io name="Output">
        <engine type="Null">
        </engine>
    </io>
</adios-config>

There is also a reading engine, which will not fail on Open(), but any variable/attribute inquiry returns nullptr, and any subsequent Get() calls will throw an exception in C++/Python or return an error in C/Fortran.

Note that there is also a Null transport that can be used by a BP engine instead of the default File transport. In that case, the BP engine will perform all internal work including buffering and aggregation but no data will be output at all. A run like this can be used to assess the overhead of the internal work of the BP engine.

adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
io.AddTransport("Null", {});

or using the XML config file

<adios-config>
    <io name="Output">
        <engine type="BP5">
        </engine>
        <transport type="Null">
        </transport>
    </io>
</adios-config>

Plugin Engine

For details on using the Plugin Engine, see the Plugins documentation.