Supported Engines

This section describes the engines available in ADIOS2 and their specific parameters, which give the user extra control. Parameters are passed as key-value pairs for:

  1. Engine specific parameters

  2. Engine supported transports and parameters

Parameters are passed at:

  1. Compile time IO::SetParameters (adios2_set_parameter in C, Fortran)

  2. Compile time IO::AddTransport (adios2_set_transport_parameter in C, Fortran)

  3. Runtime Configuration Files in the ADIOS component.
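As an illustration, a runtime configuration file entry combining engine and transport parameters might look like the following (a sketch only; the io name "output" and the parameter choices are placeholders, not defaults):

```xml
<?xml version="1.0"?>
<adios-config>
  <!-- the io name must match the name passed to ADIOS::DeclareIO -->
  <io name="output">
    <engine type="BP5">
      <!-- engine-specific key-value parameters -->
      <parameter key="NumAggregators" value="4"/>
    </engine>
    <!-- transport type and its key-value parameters -->
    <transport type="File">
      <parameter key="Library" value="POSIX"/>
    </transport>
  </io>
</adios-config>
```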

BP5

The BP5 Engine writes and reads files in ADIOS2 native binary-pack (bp version 5) format. This was a new format for ADIOS 2.8, improving on the metadata operations and the memory consumption of the older BP4/BP3 formats. BP5 is the default file format as of ADIOS 2.9. As compared to the older format, BP5 provides three main advantages:

  • Lower memory consumption. Deferred Puts will use the user's buffer for I/O wherever possible, thus saving a memory copy. Aggregation uses a fixed-size shared-memory segment on each compute node instead of using MPI to send data from one process to another. Memory consumption can get close to half of BP4 in some cases.

  • Faster metadata management improves write/read performance where hundreds or more variables are added to the output.

  • Improved functionality around appending many output steps into the same file. Better performance than writing new files each step. Restart can append to an existing series by truncating unwanted steps. Readers can filter out unwanted steps to only see and process a limited set of steps. Just as in BP4, existing steps cannot be corrupted by appending new steps.

In 2.8 BP5 was a brand new file format and engine. It still does NOT support some functionality of BP4:

  • Burst buffer support for writing data.

BP5 files have the following structure given a “name” string passed as the first argument of IO::Open:

io.SetEngine("BP5");
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);

will generate:

% BP5 datasets are always a directory
name.bp/

% data and metadata files
name.bp/
        data.0
        data.1
        ...
        data.M
        md.0
        md.idx
        mmd.0

Note

BP5 file names are compatible with the Unix (/) and Windows (\\) file system naming convention for directories and files.

Note

BP5 has an mmd.0 file in the directory, which BP4 does not have.

This engine allows the user to fine tune the buffering operations through the following optional parameters:

  1. Streaming through file

    1. OpenTimeoutSecs: (Streaming mode) Reader may want to wait for the creation of the file in io.Open(). By default the Open() function returns with an error if the file is not found.

    2. BeginStepPollingFrequencySecs: (Streaming mode) Reader can set how frequently to check the file (and file system) for new steps. Default is 1 second, which may be stressful for the file system and unnecessary for the application.

  2. Aggregation

    1. AggregationType: TwoLevelShm, EveryoneWritesSerial and EveryoneWrites are three aggregation strategies. See Aggregation in BP5. The default is TwoLevelShm.

    2. NumAggregators: The number of processes that will ever write data directly to storage. The default is set to the number of compute nodes the application is running on (i.e. one process per compute node). TwoLevelShm will select a fixed number of processes per compute-node to get close to the intention of the user but does not guarantee the exact number of aggregators.

    3. AggregatorRatio: An alternative option to NumAggregators that picks every nth process as aggregator. The number of aggregators is automatically kept within 1 and the total number of processes, regardless of the value supplied here. Moreover, TwoLevelShm will select a fixed number of processes per compute node to get close to the intention of this ratio, but does not guarantee the exact number of aggregators.

    4. NumSubFiles: The number of data files to write to in the .bp/ directory. Only used by the TwoLevelShm aggregator, where the number of files can be smaller than the number of aggregators. The default is set to NumAggregators.

    5. StripeSize: The data blocks of different processes are aligned to this size (default is 4096 bytes) in the files. Its purpose is to prevent multiple processes from writing to the same file system block, which could slow down the write.

    6. MaxShmSize: Upper limit for how much shared memory an aggregator process in TwoLevelShm can allocate. For optimum performance, this should be at least 2xM +1KB where M is the maximum size any process writes in a single step. However, there is no point in allowing for more than 4GB. The default is 4GB.

  3. Buffering

    1. BufferVType: chunk or malloc, default is chunking. Chunking maintains the buffer as a list of memory blocks, either ADIOS-owned for sync-ed Puts and small Puts, or user-owned pointers of deferred Puts. Malloc maintains a single memory block and extends it (reallocates) whenever more data is buffered. Chunking incurs extra cost in I/O by having to write data in chunks (multiple write system calls), which can be helped by increasing BufferChunkSize and MinDeferredSize. Malloc incurs extra cost by reallocating memory whenever more data is buffered (by Put()), which can be helped by increasing InitialBufferSize.

    2. BufferChunkSize: (for chunk buffer type) The size of each memory buffer chunk, default is 128MB but it is worth increasing up to 2147381248 (a bit less than 2GB) if possible for maximum write performance.

    3. MinDeferredSize: (for chunk buffer type) Small user variables are always buffered, default is 4MB.

    4. InitialBufferSize: (for malloc buffer type) initial memory provided for buffering (default and minimum is 16Kb). To avoid reallocations, it is worth increasing this size to the expected maximum total size of data any process would write in any step (not counting deferred Puts).

    5. GrowthFactor: (for malloc buffer type) exponential growth factor for initial buffer > 1, default = 1.05.

  4. Managing steps

    1. AppendAfterSteps: BP5 enables overwriting some existing steps by opening in adios2::Mode::Append mode and specifying how many existing steps to keep. Default value is INT_MAX, so it always appends after the last step. -1 would achieve the same thing. If you have 10 steps in the file,

      • value 0 means starting from the beginning, truncating all existing data

      • value 1 means appending after the first step, so overwrite 2,3…10

      • value 10 means appending after all existing steps

      • value >10 means the same, append after all existing steps (gaps in steps are impossible)

      • -1 means appending after the last step, i.e. same as 10 or higher

      • -2 means removing the last step, i.e. starting from the 10th

      • -11 (and <-11) means truncating all existing data

    2. SelectSteps: BP5 reading allows seeing only selected steps. This is a space-separated list of range definitions in the form of “start:end:step”. Indexing starts from 0. If ‘end’ is ‘n’ or ‘N’, the range is unbounded. Multiple range definitions add up (their union is selected). Note that in the reading functions, counting the steps is always 0 to s-1 where s steps are presented, so even after applying this selection, the selected steps are presented as 0 to s-1. Examples:

      • “0 6 3 2” selects four steps indexed 0,2,3 and 6 (presented in reading as 0,1,2,3)

      • “1:5” selects 5 consecutive steps, skipping step 0, and starting from 1

      • “2:n” selects all steps from step 2

      • “0:n:2” selects every other step from the beginning (0,2,4,6…)

      • “0:n:3 10:n:5” selects every third step from the beginning, plus every fifth step starting from step 10.

  5. Asynchronous writing I/O

    1. AsyncOpen: true/false Call the open function asynchronously. It decreases I/O overhead when creating lots of subfiles (NumAggregators is large) and one calls io.Open() well ahead of the first write step. Only implemented for writing. Default is true.

    2. AsyncWrite: true/false Perform data writing operations asynchronously after EndStep(). Default is false. If the application calls EnterComputationBlock()/ExitComputationBlock() to indicate phases where no communication is happening, ADIOS will try to perform all data writing during those phases, otherwise it will write immediately and eagerly after EndStep().

  6. Direct I/O. Experimental, see discussion on GitHub.

    1. DirectIO: Turn on O_DIRECT when using POSIX transport. Do not use this on parallel file systems.

    2. DirectIOAlignOffset: Alignment for file offsets. Default is 512.

    3. DirectIOAlignBuffer: Alignment for memory pointers. Default is to be same as DirectIOAlignOffset.

  7. Miscellaneous

    1. StatsLevel: 1 turns on Min/Max calculation for every variable, 0 turns this off. Default is 1. It has some cost to generate this metadata so it can be turned off if there is no need for this information.

    2. MaxOpenFilesAtOnce: Specify how many subfiles a process can keep open at once. Default is unlimited. If a dataset contains more subfiles than how many open file descriptors the system allows (see ulimit -n) then one can either try to raise that system limit (set it with ulimit -n), or set this parameter to force the reader to close some subfiles to stay within the limits.

    3. Threads: Read side: Specify how many threads one process can use to speed up reading. The default value is 0, which lets the engine estimate the number of threads based on how many processes are running on the compute node and how many hardware threads are available there, up to a maximum of 16 threads. Value 1 forces the engine to read everything within the main thread of the process. Other values specify the exact number of threads the engine can use. Although multithreaded reading works in a single Get(adios2::Mode::Sync) call if the read selection spans multiple data blocks in the file, the best parallelization is achieved by using deferred mode and reading everything in PerformGets()/EndStep().
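The step-management semantics above (AppendAfterSteps and SelectSteps, item 4) can be sketched with the following illustrative models; this is one reading of the documented rules, not the engine's actual implementation:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// AppendAfterSteps: given the number of steps already in a file opened in
// Append mode, return how many existing steps are kept (new steps are
// written after the kept ones).
int StepsKept(int existingSteps, long long appendAfterSteps)
{
    if (appendAfterSteps >= 0)
    {
        // 0 truncates everything; values past the end keep everything
        return appendAfterSteps > existingSteps
                   ? existingSteps
                   : static_cast<int>(appendAfterSteps);
    }
    // negative values count from the end: -1 keeps all, -2 drops the last step
    const long long kept = existingSteps + appendAfterSteps + 1;
    return kept < 0 ? 0 : static_cast<int>(kept);
}

// SelectSteps: evaluate a space-separated list of "start[:end[:step]]"
// ranges ('n'/'N' meaning unbounded) against a file holding nSteps steps.
// Returns the union of the selected step indices.
std::set<int> EvaluateSelectSteps(const std::string &selection, int nSteps)
{
    std::set<int> selected;
    std::istringstream ranges(selection);
    std::string range;
    while (ranges >> range)
    {
        std::istringstream parts(range);
        std::string field;
        std::getline(parts, field, ':');
        const int start = std::stoi(field);
        int end = start; // a single number selects just that one step
        int stride = 1;
        if (std::getline(parts, field, ':'))
        {
            end = (field == "n" || field == "N") ? nSteps - 1 : std::stoi(field);
        }
        if (std::getline(parts, field, ':'))
        {
            stride = std::stoi(field);
        }
        for (int s = start; s <= end && s < nSteps; s += stride)
        {
            selected.insert(s);
        }
    }
    return selected;
}
```

For instance, with 10 existing steps, StepsKept(10, -2) yields 9, matching the bullet list above.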

Key                            Value Format          Default and Examples
-----------------------------  --------------------  ------------------------------------------------------------
OpenTimeoutSecs                float                 0 for ReadRandomAccess mode, 3600 for Read mode, 10.0, 5
BeginStepPollingFrequencySecs  float                 1, 10.0
AggregationType                string                TwoLevelShm, EveryoneWritesSerial, EveryoneWrites
NumAggregators                 integer >= 1          0 (one file per compute node)
AggregatorRatio                integer >= 1          not used unless set
NumSubFiles                    integer >= 1          =NumAggregators, only used when AggregationType=TwoLevelShm
StripeSize                     integer+units         4KB
MaxShmSize                     integer+units         4294762496
BufferVType                    string                chunk, malloc
BufferChunkSize                integer+units         128MB, worth increasing up to min(2GB, datasize/process/step)
MinDeferredSize                integer+units         4MB
InitialBufferSize              float+units >= 16Kb   16Kb, 10Mb, 0.5Gb
GrowthFactor                   float > 1             1.05, 1.01, 1.5, 2
AppendAfterSteps               integer >= 0          INT_MAX
SelectSteps                    string                “0 6 3 2”, “1:5”, “0:n:3 10:n:5”
AsyncOpen                      string On/Off         On, Off, true, false
AsyncWrite                     string On/Off         Off, On, true, false
DirectIO                       string On/Off         Off, On, true, false
DirectIOAlignOffset            integer >= 0          512
DirectIOAlignBuffer            integer >= 0          set to DirectIOAlignOffset if unset
StatsLevel                     integer, 0 or 1       1, 0
MaxOpenFilesAtOnce             integer >= 0          UINT_MAX, 1024, 1
Threads                        integer >= 0          0, 1, 32
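The recommendation to size InitialBufferSize to the expected data volume (for the malloc buffer type) can be illustrated with a small model of the exponential growth described above; this is a sketch of the arithmetic, not engine code:

```cpp
#include <cassert>
#include <cstddef>

// Model of the malloc buffer strategy: start at initialSize bytes and
// multiply the capacity by growthFactor on every reallocation until it
// can hold requiredSize bytes. Returns the number of reallocations.
int ReallocationsNeeded(std::size_t initialSize, double growthFactor,
                        std::size_t requiredSize)
{
    double capacity = static_cast<double>(initialSize);
    int reallocations = 0;
    while (capacity < static_cast<double>(requiredSize))
    {
        capacity *= growthFactor;
        ++reallocations;
    }
    return reallocations;
}
```

With the 16Kb default and GrowthFactor 1.05, buffering 1GB of data in one process triggers over two hundred reallocations, whereas an InitialBufferSize at or above the expected size triggers none.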

Only file transport types are supported. Optional parameters for IO::AddTransport or in runtime config file transport field:

Transport type: File

Key      Value Format   Default and Examples
-------  -------------  -------------------------------------------
Library  string         POSIX (UNIX), FStream (Windows), stdio, IME

The IME transport directly reads and writes files stored on DDN’s IME burst buffer using the IME native API. To use the IME transport, IME must be available on the target system and ADIOS2 needs to be configured with ADIOS2_USE_IME. By default, data written to the IME is automatically flushed to the parallel filesystem at every EndStep() call. You can disable this automatic flush by setting the transport parameter SyncToPFS to OFF.
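For example, selecting the IME transport and disabling the automatic per-step flush might look like this in a runtime configuration file (a sketch; the io name is a placeholder):

```xml
<io name="output">
  <engine type="BP5"/>
  <transport type="File">
    <parameter key="Library" value="IME"/>
    <!-- keep data on the burst buffer instead of flushing at every EndStep() -->
    <parameter key="SyncToPFS" value="OFF"/>
  </transport>
</io>
```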

BP4

The BP4 Engine writes and reads files in ADIOS2 native binary-pack (bp version 4) format. This was a new format for ADIOS 2.5 and improved on the metadata operations of the older BP3 format. Compared to the older format, BP4 provides three main advantages:

  • Fast and safe appending of multiple output steps into the same file. Better performance than writing new files each step. Existing steps cannot be corrupted by appending new steps.

  • Streaming through files (i.e. online processing). Consumer apps can read existing steps while the Producer is still writing new steps. Reader’s loop can block (with timeout) and wait for new steps to arrive. Same reader code can read the entire data in post or in situ. No restrictions on the Producer.

  • Burst buffer support for writing data. It can write the output to a local file system on each compute node and drain the data to the parallel file system in a separate asynchronous thread. Streaming read from the target file system is still supported when data goes through the burst buffer. Appending to an existing file on the target file system is NOT supported currently.

BP4 files have the following structure given a “name” string passed as the first argument of IO::Open:

io.SetEngine("BP4");
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);

will generate:

% BP4 datasets are always a directory
name.bp/

% data and metadata files
name.bp/
        data.0
        data.1
        ...
        data.M
        md.0
        md.idx

Note

BP4 file names are compatible with the Unix (/) and Windows (\\) file system naming convention for directories and files.

This engine allows the user to fine tune the buffering operations through the following optional parameters:

  1. Profile: turns ON/OFF profiling information right after a run

  2. ProfileUnits: set profile units according to the required measurement scale for intensive operations

  3. Threads: number of threads provided by the application for buffering; use this for variables that are very large in data size

  4. InitialBufferSize: initial memory provided for buffering (minimum is 16Kb)

  5. BufferGrowthFactor: exponential growth factor for initial buffer > 1, default = 1.05.

  6. MaxBufferSize: maximum allowable buffer size (must be larger than 16Kb). If too large adios2 will throw an exception.

  7. FlushStepsCount: users can select how often to produce the more expensive collective metadata file, in terms of steps: default is 1. Increase it to reduce the footprint of adios2 collective operations, at the price of a lower checkpoint frequency. If MaxBufferSize is not set, the buffer will keep growing until the first flush.

  8. NumAggregators (or SubStreams): Users can select how many sub-files (M) are produced during a run. The value ranges between 1 and the number of MPI processes, MPI_Size (N); adios2 will internally aggregate data buffers (N-to-M) to output the required number of sub-files. Default is 0, which lets adios2 group processes by shared-memory access (i.e. one group per compute node) and use one process per node as an aggregator. If NumAggregators is larger than the number of processes then it will be set to the number of processes.

  9. AggregatorRatio: An alternative option to NumAggregators to pick every Nth process as aggregator. An integer divider of the number of processes is required, otherwise a runtime exception is thrown.

  10. OpenTimeoutSecs: (Streaming mode) Reader may want to wait for the creation of the file in io.Open(). By default the Open() function returns with an error if file is not found.

  11. BeginStepPollingFrequencySecs: (Streaming mode) Reader can set how frequently to check the file (and file system) for new steps. Default is 1 seconds which may be stressful for the file system and unnecessary for the application.

  12. StatsLevel: Turn on/off calculating statistics for every variable (Min/Max). Default is On. It has some cost to generate this metadata so it can be turned off if there is no need for this information.

  13. StatsBlockSize: Calculate Min/Max for a given size of each process output. Default is one Min/Max per writer. More fine-grained min/max can be useful for querying the data.

  14. NodeLocal or Node-Local: For distributed file systems. Every writer process must make sure the .bp/ directory is created on the local file system. Required when writing to local disk/SSD/NVMe in a cluster. Note: the BurstBuffer* parameters are newer and should be preferred over this parameter for using local storage as temporary space.

  15. BurstBufferPath: Redirect output file to another location and drain it to the original target location in an asynchronous thread. It requires the ability to launch one thread per aggregator (see SubStreams) on the system. This feature can be used on machines that have local NVMe/SSDs on each node to accelerate the output writing speed. On Summit at OLCF, use “/mnt/bb/<username>” for the path where <username> is your user account name. Temporary files on the accelerated storage will be automatically deleted after the application closes the output and ADIOS drains all data to the file system, unless draining is turned off (see the next parameter). Note: at this time, this feature cannot be used to append data to an existing dataset on the target system.

  16. BurstBufferDrain: To write only to the accelerated storage but to not drain it to the target file system, set this flag to false. Data will NOT be deleted from the accelerated storage on close. By default, setting the BurstBufferPath will turn on draining.

  17. BurstBufferVerbose: Verbose level 1 will cause each draining thread to print a one line report at the end (to standard output) about where it has spent its time and the number of bytes moved. Verbose level 2 will cause each thread to print a line for each draining operation (file creation, copy block, write block from memory, etc).

  18. StreamReader: By default the BP4 engine parses all available metadata in Open(). An application may turn this flag on to parse a limited number of steps at once, and update metadata when those steps have been processed. If the flag is ON, reading only works in streaming mode (using BeginStep/EndStep); file reading mode will not work as there will be zero steps processed in Open().
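For instance, parameters 15-17 above could be combined to route BP4 output through node-local storage (a sketch; the burst buffer path is machine-specific and shown here as a placeholder):

```xml
<io name="output">
  <engine type="BP4">
    <!-- write to node-local storage, drain asynchronously to the target path -->
    <parameter key="BurstBufferPath" value="/mnt/bb/username"/>
    <!-- print a one-line draining report per aggregator at the end -->
    <parameter key="BurstBufferVerbose" value="1"/>
  </engine>
</io>
```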

Key                            Value Format          Default and Examples
-----------------------------  --------------------  ------------------------------------------------------------
Profile                        string On/Off         On, Off
ProfileUnits                   string                Microseconds, Milliseconds, Seconds, Minutes, Hours
Threads                        integer > 1           1, 2, 3, 4, 16, 32, 64
InitialBufferSize              float+units >= 16Kb   16Kb, 10Mb, 0.5Gb
MaxBufferSize                  float+units >= 16Kb   at EndStep, 10Mb, 0.5Gb
BufferGrowthFactor             float > 1             1.05, 1.01, 1.5, 2
FlushStepsCount                integer > 1           1, 5, 1000, 50000
NumAggregators                 integer >= 1          0 (one file per compute node), MPI_Size/2, … , 2, (N-to-1) 1
AggregatorRatio                integer >= 1          not used unless set, MPI_Size/N must be an integer value
OpenTimeoutSecs                float                 0, 10.0, 5
BeginStepPollingFrequencySecs  float                 1, 10.0
StatsLevel                     integer, 0 or 1       1, 0
StatsBlockSize                 integer > 0           a very big number, 1073741824 for blocks with 1M elements
NodeLocal                      string On/Off         Off, On
Node-Local                     string On/Off         Off, On
BurstBufferPath                string                “”, /mnt/bb/norbert, /ssd
BurstBufferDrain               string On/Off         On, Off
BurstBufferVerbose             integer, 0-2          0, 1, 2
StreamReader                   string On/Off         On, Off

Only file transport types are supported. Optional parameters for IO::AddTransport or in runtime config file transport field:

Transport type: File

Key      Value Format   Default and Examples
-------  -------------  -------------------------------------------
Library  string         POSIX (UNIX), FStream (Windows), stdio, IME

The IME transport directly reads and writes files stored on DDN’s IME burst buffer using the IME native API. To use the IME transport, IME must be available on the target system and ADIOS2 needs to be configured with ADIOS2_USE_IME. By default, data written to the IME is automatically flushed to the parallel filesystem at every EndStep() call. You can disable this automatic flush by setting the transport parameter SyncToPFS to OFF.

BP3

The BP3 Engine writes and reads files in ADIOS2 native binary-pack (bp) format. BP files are backwards compatible with ADIOS1.x and have the following structure given a “name” string passed as the first argument of IO::Open:

adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);

will generate:

% collective metadata file
name.bp

% data directory and files
name.bp.dir/
            name.bp.0
            name.bp.1
            ...
            name.bp.M

Note

BP3 file names are compatible with the Unix (/) and Windows (\\) file system naming convention for directories and files.

Caution

The default BP3 engine will check whether .bp is the extension of the first argument of IO::Open, and will add .bp and .bp.dir if it is not.

This engine allows the user to fine tune the buffering operations through the following optional parameters:

  1. Profile: turns ON/OFF profiling information right after a run

  2. ProfileUnits: set profile units according to the required measurement scale for intensive operations

  3. CollectiveMetadata: turns ON/OFF forming collective metadata during run (used by large scale HPC applications)

  4. Threads: number of threads provided by the application for buffering; use this for variables that are very large in data size

  5. InitialBufferSize: initial memory provided for buffering (minimum is 16Kb)

  6. BufferGrowthFactor: exponential growth factor for initial buffer > 1, default = 1.05.

  7. MaxBufferSize: maximum allowable buffer size (must be larger than 16Kb). If too large adios2 will throw an exception.

  8. FlushStepsCount: users can select how often to produce the more expensive collective metadata file in terms of steps: default is 1. Increase to reduce adios2 collective operations footprint, with the trade-off of reducing checkpoint frequency. Buffer size will increase until first steps count if MaxBufferSize is not set.

  9. NumAggregators (or SubStreams): Users can select how many sub-files (M) are produced during a run, ranges between 1 and the number of mpi processes from MPI_Size (N), adios2 will internally aggregate data buffers (N-to-M) to output the required number of sub-files. Default is 0, which will let adios2 to group processes per shared-memory-access (i.e. one per compute node) and use one process per node as an aggregator. If NumAggregators is larger than the number of processes then it will be set to the number of processes.

  10. AggregatorRatio: An alternative option to NumAggregators to pick every Nth process as aggregator. An integer divider of the number of processes is required, otherwise a runtime exception is thrown.

  11. Node-Local: For distributed file system. Every writer process must make sure the .bp/ directory is created on the local file system. Required for using local disk/SSD/NVMe in a cluster.

Key                 Value Format          Default and Examples
------------------  --------------------  ------------------------------------------------------------
Profile             string On/Off         On, Off
ProfileUnits        string                Microseconds, Milliseconds, Seconds, Minutes, Hours
CollectiveMetadata  string On/Off         On, Off
Threads             integer > 1           1, 2, 3, 4, 16, 32, 64
InitialBufferSize   float+units >= 16Kb   16Kb, 10Mb, 0.5Gb
MaxBufferSize       float+units >= 16Kb   at EndStep, 10Mb, 0.5Gb
BufferGrowthFactor  float > 1             1.05, 1.01, 1.5, 2
FlushStepsCount     integer > 1           1, 5, 1000, 50000
NumAggregators      integer >= 1          0 (one file per compute node), MPI_Size/2, … , 2, (N-to-1) 1
AggregatorRatio     integer >= 1          not used unless set, MPI_Size/N must be an integer value
Node-Local          string On/Off         Off, On

Only file transport types are supported. Optional parameters for IO::AddTransport or in runtime config file transport field:

Transport type: File

Key      Value Format   Default and Examples
-------  -------------  -------------------------------------------
Library  string         POSIX (UNIX), FStream (Windows), stdio, IME

HDF5

In ADIOS2, the default engine for reading and writing HDF5 files is called “HDF5”. To use this engine, you can either specify it in your xml config file, with tag <engine type=HDF5>, or set it in client code. For example, here is how to create an HDF5 reader:

adios2::IO h5IO = adios.DeclareIO("SomeName");
h5IO.SetEngine("HDF5");
adios2::Engine h5Reader = h5IO.Open(filename, adios2::Mode::Read);

To read the h5 files generated with VDS back into ADIOS2, one can use the HDF5 engine. Please make sure the HDF5 library used by ADIOS2 has version 1.11 or greater.

The h5 file generated by ADIOS2 has two levels of groups: the top group, /, and its subgroups Step0, Step1, …, StepN, where N is the number of steps. All datasets belong to the subgroups.

Any other h5 file can be read back to ADIOS as well. To be consistent, when reading back to ADIOS2, we assume a default Step0, and all datasets from the original h5 file belong to that subgroup. The full path of a dataset (from the original h5 file) is used when represented in ADIOS2.

We can pass options to the HDF5 API from the ADIOS xml configuration. Currently we support collective MPI I/O (H5CollectiveMPIO, default false) and chunk specifications. The chunk specification uses spaces to separate values, and by default, if a valid H5ChunkDim exists, it applies to all variables, unless H5ChunkVar is specified. Examples:

<parameter key="H5CollectiveMPIO" value="yes"/>
<parameter key="H5ChunkDim" value="200 200"/>
<parameter key="H5ChunkVar" value="VarName1 VarName2"/>

We suggest reading the HDF5 documentation before applying these options.

With the subfiling feature introduced in HDF5 version 1.14, the ADIOS2 HDF5 engine uses subfiles as the default h5 format, as it improves I/O in general (for example, see https://escholarship.org/uc/item/6fs7s3jb).

To use the subfiling feature, the client needs to initialize MPI via MPI_Init_thread with MPI_THREAD_MULTIPLE support.

Useful parameters from the HDF library to tune subfiles are:

H5FD_SUBFILING_IOC_PER_NODE (number of subfiles per node)
  Set H5FD_SUBFILING_IOC_PER_NODE to 0 if a regular h5 file is preferred, before using the ADIOS2 HDF5 engine.
H5FD_SUBFILING_STRIPE_SIZE
H5FD_IOC_THREAD_POOL_SIZE

SST Sustainable Staging Transport

In ADIOS2, the Sustainable Staging Transport (SST) is an engine that allows direct connection of data producers and consumers via the ADIOS2 write/read APIs. This is a classic streaming data architecture where the data passed to ADIOS on the write side (via Put() deferred and sync, and similar calls) is made directly available to a reader (via Get(), deferred and sync, and similar calls).

SST is designed for use in HPC environments and can take advantage of RDMA network interconnects to speed the transfer of data between communicating HPC applications; however, it is also capable of operating in a Wide Area Networking environment over standard sockets. SST supports full MxN data distribution, where the number of reader ranks can differ from the number of writer ranks. SST also allows multiple reader cohorts to get access to a writer’s data simultaneously.

To use this engine, you can either specify it in your xml config file, with tag <engine type=SST> or, set it in client code. For example, here is how to create an SST reader:

adios2::IO sstIO = adios.DeclareIO("SomeName");
sstIO.SetEngine("SST");
adios2::Engine sstReader = sstIO.Open(filename, adios2::Mode::Read);

and a sample code for SST writer is:

adios2::IO sstIO = adios.DeclareIO("SomeName");
sstIO.SetEngine("SST");
adios2::Engine sstWriter = sstIO.Open(filename, adios2::Mode::Write);

The general goal of ADIOS2 is to ease the conversion of a file-based application to instead use a non-file streaming interconnect, for example, data producers such as computational physics codes and consumers such as analysis applications. However, there are some uses of ADIOS2 APIs that work perfectly well with the ADIOS2 file engines, but which will not work or will perform badly with streaming. For example, SST is based upon the “step” concept and ADIOS2 applications that use SST must call BeginStep() and EndStep(). On the writer side, the Put() calls between BeginStep and EndStep are the unit of communication and represent the data that will be available between the corresponding Begin/EndStep calls on the reader.

Also, it is recommended that SST-based applications not use the ADIOS2 Get() sync method unless there is only one data item to be read per step. This is because SST implements MxN data transfer (and avoids having to deliver all data to every reader) by queueing data on the writer ranks until it is known which reader rank requires it. Normally this data fetch stage is initiated by PerformGets() or EndStep(), both of which fulfill any pending Get() deferred operations. However, unlike Get() deferred, the semantics of Get() sync require the requested data to be fetched from the writers before the call can return. If there are multiple calls to Get() sync per step, each one may require a communication with many writers, something that would only have had to happen once if Get() deferred were used instead. Thus the use of Get() sync is likely to incur a substantial performance penalty.

On the writer side, depending upon the chosen data marshaling option there may be some (relatively small) performance differences between Put() sync and Put() deferred, but they are unlikely to be as substantial as between Get() sync and Get() deferred.

Note that SST readers and writers do not necessarily move in lockstep, but depending upon the queue length parameters and queueing policies specified, differing reader and writer speeds may cause one or the other side to wait for data to be produced or consumed, or data may be dropped if allowed by the queueing policy. However, steps themselves are atomic and no step will be partially dropped, delivered to a subset of ranks, or otherwise divided.

The SST engine allows the user to customize the streaming operations through the following optional parameters:

1. RendezvousReaderCount: Default 1. This integer value specifies the number of readers for which the writer should wait before the writer-side Open() returns. The default of 1 implements an ADIOS1/flexpath style “rendezvous”, in which an early-starting reader will wait for the writer to start, or vice versa. A number >1 will cause the writer to wait for more readers and a value of 0 will allow the writer to proceed without any readers present. This value is interpreted by SST Writer engines only.

2. RegistrationMethod: Default “File”. By default, SST reader and writer engines communicate network contact information via files in a shared filesystem. Specifically, the "filename" parameter in the Open() call is interpreted as a path which the writer uses as the name of a file to which contact information is written, and from which a reader will attempt to read contact information. As with other file-based engines, file creation and access is subject to the usual considerations (directory components are interpreted, but must exist and be traversable, writer must be able to create the file and the reader must be able to read it). Generally the file so created will exist only for as long as the writer keeps the stream Open(), but abnormal process termination may leave “stale” files in those locations. These stray “.sst” files should be deleted to avoid confusing future readers. SST also offers a “Screen” registration method in which writers and readers send their contact information to, and read it from, stdout and stdin respectively. The “screen” registration method doesn’t support batch mode operations in any way, but may be useful when manually starting jobs on machines in a WAN environment that don’t share a filesystem. A future release of SST will also support a “Cloud” registration method where contact information is registered to and retrieved from a network-based third-party server so that both the shared filesystem and interactivity can be avoided. This value is interpreted by both SST Writer and Reader engines.

3. QueueLimit: Default 0. This integer value specifies the number of steps which the writer will allow to be queued before taking specific action (such as discarding data or waiting for readers to consume the data). The default value of 0 is interpreted as no limit. This value is interpreted by SST Writer engines only.

4. QueueFullPolicy: Default “Block”. This value controls what policy is invoked if a non-zero QueueLimit has been specified and new data would cause the queue limit to be reached. Essentially, the “Block” option ensures data will not be discarded and if the queue fills up the writer will block on EndStep until the data has been read. If there is one active reader, EndStep will block until data has been consumed off the front of the queue to make room for newly arriving data. If there is more than one active reader, it is only removed from the queue when it has been read by all readers, so the slowest reader will dictate progress. NOTE THAT THE NO READERS SITUATION IS A SPECIAL CASE: If there are no active readers, new timesteps are considered to have completed their active queueing immediately upon submission. They may be retained in the “reserve queue” if the ReserveQueueLimit is non-zero. However, if that ReserveQueueLimit parameter is zero, timesteps submitted when there are no active readers will be immediately discarded.

Besides “Block”, the other acceptable value for QueueFullPolicy is “Discard”. When “Discard” is specified, and an EndStep operation would add more than the allowed number of steps to the queue, some step is discarded. If there are no current readers connected to the stream, the oldest data in the queue is discarded. If there are current readers, then the newest data (i.e. the just-created step) is discarded. (The differential treatment is because SST sends metadata for each step to the readers as soon as the step is accepted and cannot reliably prevent the use of that data without a costly all-to-all synchronization operation. Discarding the newest data instead is less satisfying, but has a similar long-term effect upon the set of steps delivered to the readers.) This value is interpreted by SST Writer engines only.
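
As a sketch, a writer that must never block in EndStep but may drop the newest step once three steps are queued could be configured in a runtime XML file like the following (the io name "output" is illustrative, not prescribed by SST):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- allow at most 3 queued steps -->
            <parameter key="QueueLimit" value="3"/>
            <!-- discard rather than block when the queue is full -->
            <parameter key="QueueFullPolicy" value="Discard"/>
        </engine>
    </io>
</adios-config>
```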

5. ReserveQueueLimit: Default 0. This integer value specifies the number of steps which the writer will keep in the queue for the benefit of late-arriving readers. This may consist of timesteps that have already been consumed by any readers, as well as timesteps that have not yet been consumed. In some sense this is a target minimum queue size, while QueueLimit is a maximum size. This value is interpreted by SST Writer engines only.

6. DataTransport: Default varies. This string value specifies the underlying network communication mechanism to use for exchanging data in SST. Generally this is chosen by SST based upon what is available on the current platform. However, specifying this engine parameter allows overriding SST’s choice. Current allowed values are “UCX”, “MPI”, “RDMA”, and “WAN”. (ib and fabric are accepted as equivalent to RDMA and evpath is equivalent to WAN.) Generally both the reader and writer should be using the same network transport, and the network transport chosen may be dictated by the situation. For example, the RDMA transport generally operates only between applications running on the same high-performance interconnect (e.g. on the same HPC machine). If communication is desired between applications running on different interconnects, the Wide Area Network (WAN) option should be chosen. This value is interpreted by both SST Writer and Reader engines.

7. WANDataTransport: Default sockets. If the SST DataTransport parameter is “WAN”, this string value specifies the EVPath-level data transport to use for exchanging data. The value must be a data transport known to EVPath, such as “sockets”, “enet”, or “ib”. Generally both the reader and writer should be using the same EVPath-level data transport. This value is interpreted by both SST Writer and Reader engines.

8. ControlTransport: Default tcp. This string value specifies the underlying network communication mechanism to use for performing control operations in SST. SST can be configured to use standard TCP sockets, which are very reliable and efficient but limited in their scalability. Alternatively, SST can use a reliable UDP protocol that is more scalable but, as of ADIOS2 Release 2.4.0, still suffers from some reliability problems. (sockets is accepted as equivalent to tcp, and udp, rudp, and enet are equivalent to scalable.) Generally both the reader and writer should be using the same control transport. This value is interpreted by both SST Writer and Reader engines.
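
Putting the transport parameters together, a hypothetical configuration for two applications communicating across a wide area network might look like this (the io name is illustrative; both sides would need a matching configuration):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- cross-interconnect streaming: use the WAN data plane -->
            <parameter key="DataTransport" value="WAN"/>
            <!-- EVPath-level transport for the WAN data plane -->
            <parameter key="WANDataTransport" value="sockets"/>
            <!-- standard TCP sockets for control operations -->
            <parameter key="ControlTransport" value="tcp"/>
        </engine>
    </io>
</adios-config>
```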

9. NetworkInterface: Default NULL. In situations in which there are multiple possible network interfaces available to SST, this string value specifies which should be used to generate SST’s contact information for writers. Generally this should NOT be specified except in narrow sets of circumstances. It has no effect if specified on Reader engines. If specified, the string value should correspond to the name of a network interface, such as those listed by commands like “netstat -i”. For example, on most Unix systems, setting the NetworkInterface parameter to “lo” (or possibly “lo0”) will result in SST generating contact information that uses the network address associated with the loopback interface (127.0.0.1). This value is interpreted only by the SST Writer engine.

10. ControlInterface: Default NULL. This value is similar to the NetworkInterface parameter, but only applies to the SST layer which does messaging for control (open, close, flow and timestep management, but not actual data transfer). Generally the NetworkInterface parameter can be used to control this, but that also applies to the Data Plane. Use ControlInterface in the event of conflicting specifications.

11. DataInterface: Default NULL. This value is similar to the NetworkInterface parameter, but only applies to the SST layer which does messaging for data transfer, not control (open, close, flow and timestep management). Generally the NetworkInterface parameter can be used to control this, but that also applies to the Control Plane. Use DataInterface in the event of conflicting specifications. In the case of the RDMA data plane, this parameter controls the libfabric interface choice.

12. FirstTimestepPrecious: Default FALSE. FirstTimestepPrecious is a boolean parameter that affects the queueing of the first timestep presented to the SST Writer engine. If FirstTimestepPrecious is TRUE, then the first timestep is effectively never removed from the output queue and will be presented as a first timestep to any reader that joins at a later time. This can be used to convey run parameters or other information that every reader may need despite joining later in a data stream. Note that this queued first timestep does count against the QueueLimit parameter above, so if a QueueLimit is specified, it should be a value larger than 1. Further note that while specifying this parameter guarantees that the preserved first timestep will be made available to new readers, other reader-side operations (like requesting the LatestAvailable timestep in Engine parameters) might still cause the timestep to be skipped. This value is interpreted only by the SST Writer engine.
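
For instance, a writer that publishes run parameters in its first step and wants every late-joining reader to see them could be configured as below. Note that QueueLimit is set larger than 1, as the text above requires, since the preserved first step counts against the limit (the io name is illustrative):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- keep the first step available for late-arriving readers -->
            <parameter key="FirstTimestepPrecious" value="TRUE"/>
            <!-- must be > 1 because the precious first step occupies a slot -->
            <parameter key="QueueLimit" value="5"/>
        </engine>
    </io>
</adios-config>
```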

13. AlwaysProvideLatestTimestep: Default FALSE. AlwaysProvideLatestTimestep is a boolean parameter that affects which of the available timesteps will be provided to the reader engine. If AlwaysProvideLatestTimestep is TRUE, then if there are multiple timesteps available to the reader, older timesteps will be skipped and the reader will see only the newest available upon BeginStep. This value is interpreted only by the SST Reader engine.

14. OpenTimeoutSecs: Default 60. OpenTimeoutSecs is an integer parameter that specifies the number of seconds SST is to wait for a peer connection on Open(). Currently this is only implemented on the Reader side of SST, and is a timeout for locating the contact information file created by Writer-side Open, not for completing the entire Open() handshake. Currently this value is interpreted only by the SST Reader engine.
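
As a small sketch, a reader that should keep looking for the writer's contact file for five minutes before giving up could be configured like this (the io name "input" is illustrative):

```xml
<adios-config>
    <io name="input">
        <engine type="SST">
            <!-- wait up to 300 seconds for the writer's contact file -->
            <parameter key="OpenTimeoutSecs" value="300"/>
        </engine>
    </io>
</adios-config>
```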

15. SpeculativePreloadMode: Default AUTO. In some circumstances, SST eagerly sends all data from writers to every reader without first waiting for read requests. Generally this improves performance if every reader needs all the data, but can be very detrimental otherwise. The value AUTO for this engine parameter instructs SST to apply its own heuristic for determining if data should be eagerly sent. The value OFF disables this feature and the value ON causes eager sending regardless of the heuristic. Currently SST’s heuristic is simple: if the size of the reader cohort is less than or equal to the value of the SpecAutoNodeThreshold engine parameter (default value 1), eager sending is initiated. Currently this value is interpreted only by the SST Reader engine.

16. SpecAutoNodeThreshold: Default 1. If the size of the reader cohort is less than or equal to this value and the SpeculativePreloadMode parameter is AUTO, SST will initiate eager data sending of all data from each writer to all readers. Currently this value is interpreted only by the SST Reader engine.

17. StepDistributionMode: Default “AllToAll”. This value controls how steps are distributed, particularly when there are multiple readers. By default, the value is “AllToAll”, which means that all timesteps are to be delivered to all readers (subject to discard rules, etc.). In other distribution modes, this is not the case. For example, in “RoundRobin”, each step is delivered only to a single reader, determined in a round-robin fashion based upon the number of readers who have opened the stream at the time the step is submitted. In “OnDemand” each step is delivered to a single reader, but only upon request (with a request being initiated by the reader doing BeginStep()). Normal reader-side rules (like BeginStep timeouts) and writer-side rules (like queue limit behavior) apply.
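
For example, to farm steps out to a pool of analysis readers so that each step is processed by exactly one of them, a writer might be configured as follows (the io name is illustrative):

```xml
<adios-config>
    <io name="output">
        <engine type="SST">
            <!-- each step goes to exactly one reader, in rotation -->
            <parameter key="StepDistributionMode" value="RoundRobin"/>
        </engine>
    </io>
</adios-config>
```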

Key                          Value Format  Default and Examples
---------------------------  ------------  -----------------------------------------------
RendezvousReaderCount        integer       1
RegistrationMethod           string        File, Screen
QueueLimit                   integer       0 (no queue limits)
QueueFullPolicy              string        Block, Discard
ReserveQueueLimit            integer       0 (no queue limits)
DataTransport                string        default varies by platform, UCX, MPI, RDMA, WAN
WANDataTransport             string        sockets, enet, ib
ControlTransport             string        TCP, Scalable
MarshalMethod                string        BP5, BP, FFS
NetworkInterface             string        NULL
ControlInterface             string        NULL
DataInterface                string        NULL
FirstTimestepPrecious        boolean       FALSE, true, no, yes
AlwaysProvideLatestTimestep  boolean       FALSE, true, no, yes
OpenTimeoutSecs              integer       60
SpeculativePreloadMode       string        AUTO, ON, OFF
SpecAutoNodeThreshold        integer       1

SSC Strong Staging Coupler

The SSC engine is designed specifically for strong code coupling. Currently SSC only supports a fixed IO pattern, which means that once the first step is finished, users are not allowed to write or read a data block with a start and count that have not been written or read in the first step. SSC uses a combination of one-sided and two-sided MPI methods. In all cases, user applications are required to be launched within a single mpirun or mpiexec command, using the MPMD mode.

The SSC engine takes the following parameters:

  1. OpenTimeoutSecs: Default 10. Timeout in seconds for opening a stream. The SSC engine’s open function will block until the RendezvousAppCount is reached, or timeout, whichever comes first. If it reaches the timeout, SSC will throw an exception.

  2. Threading: Default False. SSC will use threads to hide the time cost of metadata manipulation and data transfer when this parameter is set to true. SSC will check if MPI is initialized with multi-thread support enabled, and if not, SSC will force this parameter to false. Please do NOT enable threading when multiple I/O streams are opened in an application, as it will cause unpredictable errors. This parameter is only effective when writer definitions and reader selections are NOT locked. For cases where definitions and reader selections are locked, SSC has a more optimized way to do data transfers, and thus it will not use this parameter.
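
A coupled application that wants a longer rendezvous window and threaded transfers could configure SSC as below (the io name "coupling" is illustrative):

```xml
<adios-config>
    <io name="coupling">
        <engine type="SSC">
            <!-- wait up to 20 seconds for the peer application to open -->
            <parameter key="OpenTimeoutSecs" value="20"/>
            <!-- requires MPI initialized with multi-thread support -->
            <parameter key="Threading" value="true"/>
        </engine>
    </io>
</adios-config>
```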

Key              Value Format  Default and Examples
---------------  ------------  --------------------
OpenTimeoutSecs  integer       10, 2, 20, 200
Threading        bool          false, true

DataMan for Wide Area Network Data Staging

The DataMan engine is designed for data staging over the wide area network. It is intended for cases where a few writers send data to a few readers over long distances.

DataMan supports compression operators such as ZFP lossy compression and BZip2 lossless compression. Please refer to the operator section for usage.

The DataMan engine takes the following parameters:

  1. IPAddress: No default value. The IP address of the host where the writer application runs. This parameter is compulsory in wide area network data staging.

  2. Port: Default 50001. The port number on the writer host that will be used for data transfers.

  3. Timeout: Default 5. Timeout in seconds to wait for every send / receive operation. Packages not sent or received within this time are considered lost.

  4. RendezvousReaderCount: Default 1. This integer value specifies the number of readers for which the writer should wait before the writer-side Open() returns. By default, an early-starting writer will wait for the reader to start, or vice versa. A number >1 will cause the writer to wait for more readers, and a value of 0 will allow the writer to proceed without any readers present. This value is interpreted by DataMan Writer engines only.

  5. Threading: Default true for reader, false for writer. Whether to use threads for send and receive operations. Enabling threading will cause extra overhead for managing threads and buffer queues, but will improve the continuity of data steps for readers, and help overlap data transfers with computations for writers.

  6. TransportMode: Default fast. Only DataMan writers take this parameter. Readers are automatically synchronized at runtime to match the writers’ transport mode. The fast mode is optimized for latency-critical applications. It forces readers to receive only the latest step; therefore, in cases where writers are faster than readers, readers will skip some data steps. The reliable mode ensures that all steps are received by readers, sacrificing performance compared to the fast mode.

  7. MaxStepBufferSize: Default 128000000. In order to bring down the latency in wide area network staging use cases, DataMan uses a fixed receiver buffer size. This saves an extra communication operation to sync the buffer size for each step, before sending actual data. The default buffer size is 128 MB, which is sufficient for most use cases. However, in case 128 MB is not enough, this parameter must be set correctly, otherwise DataMan will fail.
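
Combining the parameters above, a writer-side DataMan configuration might look like the following. The IP address is the example value from the table below; the io name "stream" is illustrative:

```xml
<adios-config>
    <io name="stream">
        <engine type="DataMan">
            <!-- address of the host where the writer runs (compulsory) -->
            <parameter key="IPAddress" value="22.195.18.29"/>
            <parameter key="Port" value="50001"/>
            <!-- ensure readers receive every step, at some cost in latency -->
            <parameter key="TransportMode" value="reliable"/>
        </engine>
    </io>
</adios-config>
```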

Key                    Value Format  Default and Examples
---------------------  ------------  ---------------------------------
IPAddress              string        N/A, 22.195.18.29
Port                   integer       50001, 22000, 33000
Timeout                integer       5, 10, 30
RendezvousReaderCount  integer       1, 0, 3
Threading              bool          true for reader, false for writer
TransportMode          string        fast, reliable
MaxStepBufferSize      integer       128000000, 512000000, 1024000000

DataSpaces

The DataSpaces engine for ADIOS2 is experimental. DataSpaces is an asynchronous I/O transfer method within ADIOS that enables low-overhead, high-throughput data extraction from a running simulation. DataSpaces is designed for use in HPC environments and can take advantage of RDMA network interconnects to speed the transfer of data between communicating HPC applications. DataSpaces supports full MxN data distribution, where the number of reader ranks can differ from the number of writer ranks. In addition, this engine supports multiple reader and writer applications, which must be distinguished by unique values of AppID for different applications. It can be set in the XML config file with the tag <parameter key="AppID" value="2"/>. The value should be unique for each application or client.
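
In context, the AppID parameter from the text above sits inside the engine element of the config file; each client application would use its own value (the io name "SomeName" matches the DeclareIO call in the example below):

```xml
<adios-config>
    <io name="SomeName">
        <engine type="DATASPACES">
            <!-- unique per application/client sharing the DataSpaces servers -->
            <parameter key="AppID" value="2"/>
        </engine>
    </io>
</adios-config>
```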

To use this engine, you can either specify it in your XML config file, with the tag <engine type="DATASPACES">, or set it in client code. For example, here is how to create a DataSpaces reader:

adios2::IO dspacesIO = adios.DeclareIO("SomeName");
dspacesIO.SetEngine("DATASPACES");
adios2::Engine dspacesReader = dspacesIO.Open(filename, adios2::Mode::Read);

and a sample code for DataSpaces writer is:

adios2::IO dspacesIO = adios.DeclareIO("SomeName");
dspacesIO.SetEngine("DATASPACES");
adios2::Engine dspacesWriter = dspacesIO.Open(filename, adios2::Mode::Write);

To make use of the DataSpaces engine, an application job needs to also run the dataspaces_server component together with the application. The server should be configured and started before the application as a separate job in the system. For example:

aprun -n $SPROC ./dataspaces_server -s $SPROC &> log.server &

The variable $SPROC represents the number of server instances to run. The & character at the end of the line places the aprun command in the background, allowing the job script to continue and run the other applications. The server processes produce a configuration file, i.e., conf.0, that is used by the application to connect to the servers. This file contains identifying information of the master server, which coordinates the client registration and discovery process. The job script should wait for the servers to start up and produce the conf.0 file before starting the client application processes.

The server also needs a user configuration read from a text file called dataspaces.conf. It should specify how many output timesteps of the same dataset (called versions) should be kept in the server’s memory and served to readers. If this file does not exist in the current directory, the server will assume default values (only 1 timestep stored). For example:

## Config file for DataSpaces
max_versions = 5
lock_type = 3

The dataspaces_server module is a stand-alone service that runs independently of a simulation on a set of dedicated nodes in the staging area. It transfers data from the application through RDMA, and can save it to a local storage system, e.g., the Lustre file system, stream it to remote sites, e.g., auxiliary clusters, or serve it directly from the staging area to other applications. One instance of the dataspaces_server can service multiple applications in parallel. Further, the server can run in cooperative mode (i.e., multiple instances of the server cooperate to service the application in parallel and to balance load). The dataspaces_server receives notification messages from the transport method, schedules the requests, and initiates the data transfers in parallel. The server schedules and prioritizes the data transfers while the simulation is computing in order to overlap data transfers with computations, to maximize data throughput, and to minimize the overhead on the application.

Inline for zero-copy

The Inline engine provides in-process communication between the writer and reader, avoiding the copy of data buffers.

This engine is focused on the N→N case: N writers share a process with N readers, and the analysis happens ‘inline’ without writing the data to a file or copying to another buffer. It is similar to the streaming SST engine, since analysis must happen per step.

To use this engine, you can either add <engine type="Inline"> to your XML config file, or set it in your application code:

adios2::IO io = adios.DeclareIO("ioName");
io.SetEngine("Inline");
adios2::Engine inlineWriter = io.Open("inline_write", adios2::Mode::Write);
adios2::Engine inlineReader = io.Open("inline_read", adios2::Mode::Read);

Notice that unlike other engines, the reader and writer share an IO instance. Both the writer and reader must be opened before either tries to call BeginStep()/PerformPuts()/PerformGets(). There must be exactly one writer, and exactly one reader.

For successful operation, the writer will perform a step, then the reader will perform a step in the same process. When the reader starts its step, the only data it has available is that written by the writer in its process. The reader then can retrieve whatever data was written by the writer by using the double-pointer Get call:

void Engine::Get<T>(Variable<T>, T**) const;

This version of Get is only used for the inline engine. See the example below for details.

Note

Since the inline engine does not copy any data, the writer should avoid changing the data before the reader has read it.

Typical access pattern:

// ... Application data generation

inlineWriter.BeginStep();
inlineWriter.Put(var, in_data); // always use deferred mode
inlineWriter.EndStep();
// Unlike other engines, data should not be reused at this point (since ADIOS
// does not copy the data), though ADIOS cannot enforce this.
// Must wait until reader is finished using the data.

inlineReader.BeginStep();
double* out_data;
inlineReader.Get(var, &out_data);
// Now in_data == out_data.
inlineReader.EndStep();

Null

The Null Engine performs no internal work and no I/O. It was created for testing applications that produce ADIOS2 output, by making it easy to turn off the I/O. The runtime difference between a run with the Null engine and one with another engine tells us the I/O overhead of that particular output with that particular engine.

adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("Null");

or using the XML config file:

<adios-config>
    <io name="Output">
        <engine type="Null">
        </engine>
    </io>
</adios-config>

There is also a reading engine, which will not fail on Open(), but any variable/attribute inquiry returns nullptr, and any subsequent Get() calls will throw an exception in C++/Python or return an error in C/Fortran.

Note that there is also a Null transport that can be used by a BP engine instead of the default File transport. In that case, the BP engine will perform all internal work including buffering and aggregation but no data will be output at all. A run like this can be used to assess the overhead of the internal work of the BP engine.

adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
io.AddTransport("Null", {});

or using the XML config file

<adios-config>
    <io name="Output">
        <engine type="BP5">
        </engine>
        <transport type="Null">
        </transport>
    </io>
</adios-config>

Plugin Engine

For details on using the Plugin Engine, see the Plugins documentation.