Supported Engines
This section describes the available engines in ADIOS2 and their specific parameters, which give the user extra control. Parameters are passed as key-value pairs for:
Engine specific parameters
Engine supported transports and parameters
Parameters are passed at:
Compile time via IO::SetParameters (adios2_set_parameter in C, Fortran)
Compile time via IO::AddTransport (adios2_set_transport_parameter in C, Fortran)
Runtime configuration files in the ADIOS component
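For example, a minimal C++ sketch of both mechanisms (the keys shown are engine parameters described below; the values are illustrative):
adios2::ADIOS adios;
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
// Engine-specific parameters as key-value pairs
io.SetParameters({{"Threads", "2"}, {"StatsLevel", "1"}});
// Add a transport with its own key-value parameters
io.AddTransport("File", {{"Library", "POSIX"}});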
BP5
The BP5 Engine writes and reads files in ADIOS2 native binary-pack (bp version 5) format. This was a new format for ADIOS 2.8, improving on the metadata operations and the memory consumption of the older BP4/BP3 formats. BP5 is the default file format as of ADIOS 2.9. As compared to the older format, BP5 provides three main advantages:
Lower memory consumption. Deferred Puts will use the user's buffer for I/O wherever possible, thus saving a memory copy. Aggregation uses a fixed-size shared-memory segment on each compute node instead of using MPI to send data from one process to another. Memory consumption can get close to half of BP4 in some cases.
Faster metadata management improves write/read performance where hundreds or more variables are added to the output.
Improved functionality around appending many output steps into the same file. Better performance than writing new files each step. Restart can append to an existing series by truncating unwanted steps. Readers can filter out unwanted steps to only see and process a limited set of steps. Just as in BP4, existing steps cannot be corrupted by appending new steps.
In 2.8 BP5 was a brand new file format and engine. It still does NOT support some functionality of BP4:
Burst buffer support for writing data.
BP5 files have the following structure given a “name” string passed as the first argument of IO::Open:
io.SetEngine("BP5");
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);
will generate:
% BP5 datasets are always a directory
name.bp/
% data and metadata files
name.bp/
data.0
data.1
...
data.M
md.0
md.idx
mmd.0
Note
BP5 file names are compatible with the Unix (/) and Windows (\\) file system naming conventions for directories and files.
Note
BP5 has an mmd.0 file in the directory, which BP4 does not have.
This engine allows the user to fine tune the buffering operations through the following optional parameters:
Streaming through file
OpenTimeoutSecs: (Streaming mode) The reader may want to wait for the creation of the file in io.Open(). By default, the Open() function returns with an error if the file is not found.
BeginStepPollingFrequencySecs: (Streaming mode) The reader can set how frequently to check the file (and file system) for new steps. The default is 1 second, which may be stressful for the file system and unnecessary for the application (see the sketch below).
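A minimal reader-side sketch using these streaming parameters (the IO name and file name are illustrative):
adios2::IO io = adios.DeclareIO("Input");
io.SetEngine("BP5");
// Wait up to 5 minutes for the writer to create the file,
// and poll the file system for new steps every 10 seconds
io.SetParameters({{"OpenTimeoutSecs", "300"},
                  {"BeginStepPollingFrequencySecs", "10"}});
adios2::Engine reader = io.Open("name.bp", adios2::Mode::Read);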
Aggregation
AggregationType: TwoLevelShm, EveryoneWritesSerial and EveryoneWrites are three aggregation strategies. See Aggregation in BP5. The default is TwoLevelShm.
NumAggregators: The number of processes that will ever write data directly to storage. The default is set to the number of compute nodes the application is running on (i.e. one process per compute node). TwoLevelShm will select a fixed number of processes per compute-node to get close to the intention of the user but does not guarantee the exact number of aggregators.
AggregatorRatio: An alternative option to NumAggregators, picking every nth process as an aggregator. The number of aggregators is automatically kept between 1 and the total number of processes regardless of the value supplied here. Moreover, TwoLevelShm will select a fixed number of processes per compute-node to get close to the intention of this ratio but does not guarantee the exact number of aggregators.
NumSubFiles: The number of data files to write to in the .bp/ directory. Only used by the TwoLevelShm aggregator, where the number of files can be smaller than the number of aggregators. The default is set to NumAggregators.
StripeSize: The data blocks of different processes are aligned to this size (default is 4096 bytes) in the files. Its purpose is to avoid multiple processes writing to the same file system block, which could potentially slow down the write.
MaxShmSize: Upper limit for how much shared memory an aggregator process in TwoLevelShm can allocate. For optimum performance, this should be at least 2xM + 1KB, where M is the maximum size any process writes in a single step. However, there is no point in allowing more than 4GB. The default is 4GB. A configuration sketch follows this list.
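For example, a writer-side sketch of the aggregation parameters above (values are illustrative, not recommendations):
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
// Two-level shared-memory aggregation: 8 aggregator processes
// writing to 4 data files in the .bp directory
io.SetParameters({{"AggregationType", "TwoLevelShm"},
                  {"NumAggregators", "8"},
                  {"NumSubFiles", "4"}});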
Buffering
BufferVType: chunk or malloc; the default is chunking. Chunking maintains the buffer as a list of memory blocks, either ADIOS-owned (for sync-ed Puts and small Puts) or user-owned (pointers of deferred Puts). Malloc maintains a single memory block and extends it (reallocates) whenever more data is buffered. Chunking incurs extra cost in I/O by having to write data in chunks (multiple write system calls), which can be helped by increasing BufferChunkSize and MinDeferredSize. Malloc incurs extra cost by reallocating memory whenever more data is buffered (by Put()), which can be helped by increasing InitialBufferSize.
BufferChunkSize: (for chunk buffer type) The size of each memory buffer chunk, default is 128MB but it is worth increasing up to 2147381248 (a bit less than 2GB) if possible for maximum write performance.
MinDeferredSize: (for chunk buffer type) Small user variables are always buffered, default is 4MB.
InitialBufferSize: (for malloc buffer type) initial memory provided for buffering (default and minimum is 16Kb). To avoid reallocations, it is worth increasing this size to the expected maximum total size of data any process would write in any step (not counting deferred Puts).
GrowthFactor: (for malloc buffer type) exponential growth factor for initial buffer > 1, default = 1.05.
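A sketch contrasting the two buffering strategies (sizes are illustrative):
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
// Chunked buffering: larger chunks mean fewer write system calls
io.SetParameters({{"BufferVType", "chunk"},
                  {"BufferChunkSize", "1Gb"},
                  {"MinDeferredSize", "16Mb"}});
// Alternatively, a single reallocating buffer sized up front:
// io.SetParameters({{"BufferVType", "malloc"},
//                   {"InitialBufferSize", "2Gb"}});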
Managing steps
AppendAfterSteps: BP5 enables overwriting some existing steps by opening in adios2::Mode::Append mode and specifying how many existing steps to keep. Default value is MAX_INT, so it always appends after the last step. -1 would achieve the same thing. If you have 10 steps in the file,
value 0 means starting from the beginning, truncating all existing data
value 1 means appending after the first step, so overwrite 2,3…10
value 10 means appending after all existing steps
value >10 means the same, append after all existing steps (gaps in steps are impossible)
-1 means appending after the last step, i.e. same as 10 or higher
-2 means removing the last step, i.e. starting from the 10th
-11 (and <-11) means truncating all existing data
SelectSteps: BP5 reading allows for only seeing selected steps. This is a space-separated list of range definitions in the form “start:end:step”. Indexing starts from 0. If ‘end’ is ‘n’ or ‘N’, then the range is unlimited. Range definitions add up. Note that in the reading functions, counting the steps is always 0 to s-1 where s steps are presented, so even after applying this selection, the selected steps are presented as 0 to s-1. Examples:
“0 6 3 2” selects four steps indexed 0,2,3 and 6 (presented in reading as 0,1,2,3)
“1:5” selects 5 consecutive steps, skipping step 0, and starting from 1
“2:n” selects all steps from step 2
“0:n:2” selects every other step from the beginning (0,2,4,6…)
“0:n:3 10:n:5” selects every third step from the beginning and additionally every fifth step from step 10.
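For example, a sketch using these step-management parameters (file and IO names are illustrative):
// Writer: keep the first 3 existing steps, overwrite from the 4th on
adios2::IO wIO = adios.DeclareIO("Output");
wIO.SetEngine("BP5");
wIO.SetParameters({{"AppendAfterSteps", "3"}});
adios2::Engine writer = wIO.Open("name.bp", adios2::Mode::Append);

// Reader: see only every third step starting from step 2;
// the selected steps are presented on reading as 0, 1, 2, ...
adios2::IO rIO = adios.DeclareIO("Input");
rIO.SetEngine("BP5");
rIO.SetParameters({{"SelectSteps", "2:n:3"}});
adios2::Engine reader = rIO.Open("name.bp", adios2::Mode::ReadRandomAccess);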
Asynchronous writing I/O
AsyncOpen: true/false Call the open function asynchronously. It decreases I/O overhead when creating lots of subfiles (NumAggregators is large) and one calls io.Open() well ahead of the first write step. Only implemented for writing. Default is true.
AsyncWrite: true/false Perform data writing operations asynchronously after EndStep(). Default is false. If the application calls EnterComputationBlock()/ExitComputationBlock() to indicate phases where no communication is happening, ADIOS will try to perform all data writing during those phases, otherwise it will write immediately and eagerly after EndStep().
Direct I/O. Experimental, see discussion on GitHub.
DirectIO: Turn on O_DIRECT when using POSIX transport. Do not use this on parallel file systems.
DirectIOAlignOffset: Alignment for file offsets. Default is 512.
DirectIOAlignBuffer: Alignment for memory pointers. Default is to be same as DirectIOAlignOffset.
Miscellaneous
StatsLevel: 1 turns on Min/Max calculation for every variable, 0 turns this off. Default is 1. It has some cost to generate this metadata so it can be turned off if there is no need for this information.
MaxOpenFilesAtOnce: Specify how many subfiles a process can keep open at once. Default is unlimited. If a dataset contains more subfiles than how many open file descriptors the system allows (see ulimit -n) then one can either try to raise that system limit (set it with ulimit -n), or set this parameter to force the reader to close some subfiles to stay within the limits.
Threads: Read side: Specify how many threads one process can use to speed up reading. The default value is 0, to let the engine estimate the number of threads based on how many processes are running on the compute node and how many hardware threads are available on the compute node but it will use maximum 16 threads. Value 1 forces the engine to read everything within the main thread of the process. Other values specify the exact number of threads the engine can use. Although multithreaded reading works in a single Get(adios2::Mode::Sync) call if the read selection spans multiple data blocks in the file, the best parallelization is achieved by using deferred mode and reading everything in PerformGets()/EndStep().
FlattenSteps: This is a writer-side parameter that specifies that the reader should interpret multiple writer-created timesteps as a single timestep, essentially flattening all Put()s into a single step.
IgnoreFlattenSteps: This is a reader-side parameter that tells the reader to ignore any FlattenSteps parameter supplied to the writer.
| Key | Value Format | Default and Examples |
|---|---|---|
| OpenTimeoutSecs | float | 0 for ReadRandomAccess mode, 3600 for Read mode |
| BeginStepPollingFrequencySecs | float | 1, 10.0 |
| AggregationType | string | TwoLevelShm, EveryoneWritesSerial, EveryoneWrites |
| NumAggregators | integer >= 1 | 0 (one file per compute node) |
| AggregatorRatio | integer >= 1 | not used unless set |
| NumSubFiles | integer >= 1 | =NumAggregators, only used when AggregationType=TwoLevelShm |
| StripeSize | integer+units | 4KB |
| MaxShmSize | integer+units | 4294762496 |
| BufferVType | string | chunk, malloc |
| BufferChunkSize | integer+units | 128MB, worth increasing up to min(2GB, datasize/process/step) |
| MinDeferredSize | integer+units | 4MB |
| InitialBufferSize | float+units >= 16Kb | 16Kb, 10Mb, 0.5Gb |
| GrowthFactor | float > 1 | 1.05, 1.01, 1.5, 2 |
| AppendAfterSteps | integer >= 0 | INT_MAX |
| SelectSteps | string | “0 6 3 2”, “1:5”, “0:n:3 10:n:5” |
| AsyncOpen | string On/Off | On, Off, true, false |
| AsyncWrite | string On/Off | Off, On, true, false |
| DirectIO | string On/Off | Off, On, true, false |
| DirectIOAlignOffset | integer >= 0 | 512 |
| DirectIOAlignBuffer | integer >= 0 | set to DirectIOAlignOffset if unset |
| StatsLevel | integer, 0 or 1 | 1, 0 |
| MaxOpenFilesAtOnce | integer >= 0 | UINT_MAX, 1024, 1 |
| Threads | integer >= 0 | 0, 1, 32 |
| FlattenSteps | boolean | off, on, true, false |
| IgnoreFlattenSteps | boolean | off, on, true, false |
Only file transport types are supported. Optional parameters for IO::AddTransport or in a runtime config file transport field:
Transport type: File
| Key | Value Format | Default and Examples |
|---|---|---|
| Library | string | POSIX (UNIX), FStream (Windows), stdio, IME |
The IME transport directly reads and writes files stored on DDN’s IME burst buffer using the IME native API. To use the IME transport, IME must be available on the target system and ADIOS2 needs to be configured with ADIOS2_USE_IME. By default, data written to the IME is automatically flushed to the parallel filesystem at every EndStep() call. You can disable this automatic flush by setting the transport parameter SyncToPFS to OFF.
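A sketch of selecting the IME transport and disabling the automatic flush (the Library and SyncToPFS keys are those described above):
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
// Keep data on the burst buffer instead of flushing to the
// parallel file system at every EndStep()
io.AddTransport("File", {{"Library", "IME"}, {"SyncToPFS", "OFF"}});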
BP4
The BP4 Engine writes and reads files in ADIOS2 native binary-pack (bp version 4) format. This was a new format for ADIOS 2.5 and improved on the metadata operations of the older BP3 format. Compared to the older format, BP4 provides three main advantages:
Fast and safe appending of multiple output steps into the same file. Better performance than writing new files each step. Existing steps cannot be corrupted by appending new steps.
Streaming through files (i.e. online processing). Consumer apps can read existing steps while the Producer is still writing new steps. The reader’s loop can block (with timeout) and wait for new steps to arrive. The same reader code can read the entire data in post-processing or in situ. There are no restrictions on the Producer.
Burst buffer support for writing data. It can write the output to a local file system on each compute node and drain the data to the parallel file system in a separate asynchronous thread. Streaming reads from the target file system are still supported when data goes through the burst buffer. Appending to an existing file on the target file system is NOT supported currently.
BP4 files have the following structure given a “name” string passed as the first argument of IO::Open:
io.SetEngine("BP4");
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);
will generate:
% BP4 datasets are always a directory
name.bp/
% data and metadata files
name.bp/
data.0
data.1
...
data.M
md.0
md.idx
Note
BP4 file names are compatible with the Unix (/) and Windows (\\) file system naming conventions for directories and files.
This engine allows the user to fine tune the buffering operations through the following optional parameters:
Profile: turns ON/OFF profiling information right after a run
ProfileUnits: set profile units according to the required measurement scale for intensive operations
Threads: number of threads provided from the application for buffering, use this for very large variables in data size
InitialBufferSize: initial memory provided for buffering (minimum is 16Kb)
BufferGrowthFactor: exponential growth factor for initial buffer > 1, default = 1.05.
MaxBufferSize: maximum allowable buffer size (must be larger than 16Kb). If too large adios2 will throw an exception.
FlushStepsCount: users can select how often to produce the more expensive collective metadata file in terms of steps: default is 1. Increase to reduce adios2 collective operations footprint, with the trade-off of reducing checkpoint frequency. Buffer size will increase until the first FlushStepsCount steps if MaxBufferSize is not set.
NumAggregators (or SubStreams): Users can select how many sub-files (M) are produced during a run, ranging between 1 and the number of MPI processes from MPI_Size (N). adios2 will internally aggregate data buffers (N-to-M) to output the required number of sub-files. Default is 0, which lets adios2 group processes per shared-memory-access (i.e. one per compute node) and use one process per node as an aggregator. If NumAggregators is larger than the number of processes then it will be set to the number of processes.
AggregatorRatio: An alternative option to NumAggregators, picking every Nth process as an aggregator. An integer divisor of the number of processes is required, otherwise a runtime exception is thrown.
OpenTimeoutSecs: (Streaming mode) The reader may want to wait for the creation of the file in io.Open(). By default, the Open() function returns with an error if the file is not found.
BeginStepPollingFrequencySecs: (Streaming mode) The reader can set how frequently to check the file (and file system) for new steps. The default is 1 second, which may be stressful for the file system and unnecessary for the application.
StatsLevel: Turn on/off calculating statistics for every variable (Min/Max). Default is On. It has some cost to generate this metadata so it can be turned off if there is no need for this information.
StatsBlockSize: Calculate Min/Max for a given size of each process output. Default is one Min/Max per writer. More fine-grained min/max can be useful for querying the data.
NodeLocal or Node-Local: For distributed file systems. Every writer process must make sure the .bp/ directory is created on the local file system. Required when writing to local disk/SSD/NVMe in a cluster. Note: the BurstBuffer* parameters are newer and should be preferred over this parameter when using local storage as temporary space.
BurstBufferPath: Redirect output file to another location and drain it to the original target location in an asynchronous thread. It requires the ability to launch one thread per aggregator (see SubStreams) on the system. This feature can be used on machines that have local NVMe/SSDs on each node to accelerate the output writing speed. On Summit at OLCF, use “/mnt/bb/<username>” for the path where <username> is your user account name. Temporary files on the accelerated storage will be automatically deleted after the application closes the output and ADIOS drains all data to the file system, unless draining is turned off (see the next parameter). Note: at this time, this feature cannot be used to append data to an existing dataset on the target system.
BurstBufferDrain: To write only to the accelerated storage but to not drain it to the target file system, set this flag to false. Data will NOT be deleted from the accelerated storage on close. By default, setting the BurstBufferPath will turn on draining.
BurstBufferVerbose: Verbose level 1 will cause each draining thread to print a one line report at the end (to standard output) about where it has spent its time and the number of bytes moved. Verbose level 2 will cause each thread to print a line for each draining operation (file creation, copy block, write block from memory, etc).
StreamReader: By default the BP4 engine parses all available metadata in Open(). An application may turn this flag on to parse a limited number of steps at once, and update metadata when those steps have been processed. If the flag is ON, reading only works in streaming mode (using BeginStep/EndStep); file reading mode will not work as there will be zero steps processed in Open().
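For example, a sketch of writing through a node-local burst buffer (the path is illustrative; on Summit it would be /mnt/bb/<username> as noted above):
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP4");
// Write to node-local storage first; a background thread per
// aggregator drains the data to the original target location
io.SetParameters({{"BurstBufferPath", "/mnt/bb/username"},
                  {"BurstBufferVerbose", "1"}});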
| Key | Value Format | Default and Examples |
|---|---|---|
| Profile | string On/Off | On, Off |
| ProfileUnits | string | Microseconds, Milliseconds, Seconds, Minutes, Hours |
| Threads | integer > 1 | 1, 2, 3, 4, 16, 32, 64 |
| InitialBufferSize | float+units >= 16Kb | 16Kb, 10Mb, 0.5Gb |
| MaxBufferSize | float+units >= 16Kb | at EndStep, 10Mb, 0.5Gb |
| BufferGrowthFactor | float > 1 | 1.05, 1.01, 1.5, 2 |
| FlushStepsCount | integer > 1 | 1, 5, 1000, 50000 |
| NumAggregators | integer >= 1 | 0 (one file per compute node) |
| AggregatorRatio | integer >= 1 | not used unless set |
| OpenTimeoutSecs | float | 0 |
| BeginStepPollingFrequencySecs | float | 1 |
| StatsLevel | integer, 0 or 1 | 1 |
| StatsBlockSize | integer > 0 | a very big number |
| NodeLocal | string On/Off | Off, On |
| Node-Local | string On/Off | Off, On |
| BurstBufferPath | string | “”, /mnt/bb/norbert, /ssd |
| BurstBufferDrain | string On/Off | On, Off |
| BurstBufferVerbose | integer, 0-2 | 0 |
| StreamReader | string On/Off | On, Off |
Only file transport types are supported. Optional parameters for IO::AddTransport or in a runtime config file transport field:
Transport type: File
| Key | Value Format | Default and Examples |
|---|---|---|
| Library | string | POSIX (UNIX), FStream (Windows), stdio, IME |
The IME transport directly reads and writes files stored on DDN’s IME burst buffer using the IME native API. To use the IME transport, IME must be available on the target system and ADIOS2 needs to be configured with ADIOS2_USE_IME. By default, data written to the IME is automatically flushed to the parallel filesystem at every EndStep() call. You can disable this automatic flush by setting the transport parameter SyncToPFS to OFF.
BP3
The BP3 Engine writes and reads files in ADIOS2 native binary-pack (bp) format. BP files are backwards compatible with ADIOS1.x and have the following structure given a “name” string passed as the first argument of IO::Open:
adios2::Engine bpFile = io.Open("name", adios2::Mode::Write);
will generate:
% collective metadata file
name.bp
% data directory and files
name.bp.dir/
name.bp.0
name.bp.1
...
name.bp.M
Note
BP3 file names are compatible with the Unix (/) and Windows (\\) file system naming conventions for directories and files.
Caution
The default BP3 engine will check whether .bp is the extension of the first argument of IO::Open and will add .bp and .bp.dir if not.
This engine allows the user to fine tune the buffering operations through the following optional parameters:
Profile: turns ON/OFF profiling information right after a run
ProfileUnits: set profile units according to the required measurement scale for intensive operations
CollectiveMetadata: turns ON/OFF forming collective metadata during run (used by large scale HPC applications)
Threads: number of threads provided from the application for buffering, use this for very large variables in data size
InitialBufferSize: initial memory provided for buffering (minimum is 16Kb)
BufferGrowthFactor: exponential growth factor for initial buffer > 1, default = 1.05.
MaxBufferSize: maximum allowable buffer size (must be larger than 16Kb). If too large adios2 will throw an exception.
FlushStepsCount: users can select how often to produce the more expensive collective metadata file in terms of steps: default is 1. Increase to reduce adios2 collective operations footprint, with the trade-off of reducing checkpoint frequency. Buffer size will increase until the first FlushStepsCount steps if MaxBufferSize is not set.
NumAggregators (or SubStreams): Users can select how many sub-files (M) are produced during a run, ranging between 1 and the number of MPI processes from MPI_Size (N). adios2 will internally aggregate data buffers (N-to-M) to output the required number of sub-files. Default is 0, which lets adios2 group processes per shared-memory-access (i.e. one per compute node) and use one process per node as an aggregator. If NumAggregators is larger than the number of processes then it will be set to the number of processes.
AggregatorRatio: An alternative option to NumAggregators, picking every Nth process as an aggregator. An integer divisor of the number of processes is required, otherwise a runtime exception is thrown.
Node-Local: For distributed file system. Every writer process must make sure the .bp/ directory is created on the local file system. Required for using local disk/SSD/NVMe in a cluster.
| Key | Value Format | Default and Examples |
|---|---|---|
| Profile | string On/Off | On, Off |
| ProfileUnits | string | Microseconds, Milliseconds, Seconds, Minutes, Hours |
| CollectiveMetadata | string On/Off | On, Off |
| Threads | integer > 1 | 1, 2, 3, 4, 16, 32, 64 |
| InitialBufferSize | float+units >= 16Kb | 16Kb, 10Mb, 0.5Gb |
| MaxBufferSize | float+units >= 16Kb | at EndStep, 10Mb, 0.5Gb |
| BufferGrowthFactor | float > 1 | 1.05, 1.01, 1.5, 2 |
| FlushStepsCount | integer > 1 | 1, 5, 1000, 50000 |
| NumAggregators | integer >= 1 | 0 (one file per compute node) |
| AggregatorRatio | integer >= 1 | not used unless set |
| Node-Local | string On/Off | Off, On |
Only file transport types are supported. Optional parameters for IO::AddTransport or in a runtime config file transport field:
Transport type: File
| Key | Value Format | Default and Examples |
|---|---|---|
| Library | string | POSIX (UNIX), FStream (Windows), stdio, IME |
HDF5
In ADIOS2, the default engine for reading and writing HDF5 files is called “HDF5”.
To use this engine, you can either specify it in your xml config file, with tag <engine type=HDF5>, or set it in client code. For example, here is how to create an HDF5 reader:
adios2::IO h5IO = adios.DeclareIO("SomeName");
h5IO.SetEngine("HDF5");
adios2::Engine h5Reader = h5IO.Open(filename, adios2::Mode::Read);
To read back the h5 files generated with VDS into ADIOS2, one can use the HDF5 engine. Please make sure ADIOS2 is built with an HDF5 library version greater than or equal to 1.11.
The h5 file generated by ADIOS2 has two levels of groups: the top group / and its subgroups Step0 … StepN, where N is the number of steps. All datasets belong to the subgroups.
Any other h5 file can be read back to ADIOS as well. To be consistent, when reading back to ADIOS2, we assume a default Step0, and all datasets from the original h5 file belong to that subgroup. The full path of a dataset (from the original h5 file) is used when represented in ADIOS2.
We can pass options to the HDF5 API from the ADIOS XML configuration. Currently we support collective MPI I/O (H5CollectiveMPIO, default false) and chunk specifications. The chunk specification uses spaces to separate values, and by default, if a valid H5ChunkDim exists, it applies to all variables unless H5ChunkVar is specified. Examples:
<parameter key="H5CollectiveMPIO" value="yes"/>
<parameter key="H5ChunkDim" value="200 200"/>
<parameter key="H5ChunkVar" value="VarName1 VarName2"/>
We suggest reading the HDF5 documentation before applying these options.
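Since engine parameters are plain key-value pairs, the same options can presumably also be set in client code via IO::SetParameters; a sketch mirroring the XML example above:
adios2::IO h5IO = adios.DeclareIO("SomeName");
h5IO.SetEngine("HDF5");
// Same keys as in the XML example above
h5IO.SetParameters({{"H5CollectiveMPIO", "yes"},
                    {"H5ChunkDim", "200 200"},
                    {"H5ChunkVar", "VarName1 VarName2"}});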
Since the subfiling feature was introduced in HDF5 version 1.14, the ADIOS2 HDF5 engine will use subfiles as the default h5 format, as it improves I/O in general (for example, see https://escholarship.org/uc/item/6fs7s3jb). To use the subfile feature, the client needs to support MPI_Init_thread with MPI_THREAD_MULTIPLE. Useful parameters from the HDF5 library to tune subfiles are:
H5FD_SUBFILING_IOC_PER_NODE (number of subfiles per node). Set this to 0 before using the ADIOS2 HDF5 engine if the regular h5 file format is preferred.
H5FD_SUBFILING_STRIPE_SIZE
H5FD_IOC_THREAD_POOL_SIZE
SST Sustainable Staging Transport
In ADIOS2, the Sustainable Staging Transport (SST) is an engine that allows direct connection of data producers and consumers via the ADIOS2 write/read APIs. This is a classic streaming data architecture where the data passed to ADIOS on the write side (via Put() deferred and sync, and similar calls) is made directly available to a reader (via Get(), deferred and sync, and similar calls).
SST is designed for use in HPC environments and can take advantage of RDMA network interconnects to speed the transfer of data between communicating HPC applications; however, it is also capable of operating in a Wide Area Networking environment over standard sockets. SST supports full MxN data distribution, where the number of reader ranks can differ from the number of writer ranks. SST also allows multiple reader cohorts to get access to a writer’s data simultaneously.
To use this engine, you can either specify it in your xml config file, with tag <engine type=SST>, or set it in client code. For example, here is how to create an SST reader:
adios2::IO sstIO = adios.DeclareIO("SomeName");
sstIO.SetEngine("SST");
adios2::Engine sstReader = sstIO.Open(filename, adios2::Mode::Read);
and a sample code for SST writer is:
adios2::IO sstIO = adios.DeclareIO("SomeName");
sstIO.SetEngine("SST");
adios2::Engine sstWriter = sstIO.Open(filename, adios2::Mode::Write);
The general goal of ADIOS2 is to ease the conversion of a file-based application to instead use a non-file streaming interconnect, for example, data producers such as computational physics codes and consumers such as analysis applications. However, there are some uses of ADIOS2 APIs that work perfectly well with the ADIOS2 file engines, but which will not work or will perform badly with streaming. For example, SST is based upon the “step” concept and ADIOS2 applications that use SST must call BeginStep() and EndStep(). On the writer side, the Put() calls between BeginStep and EndStep are the unit of communication and represent the data that will be available between the corresponding BeginStep/EndStep calls on the reader.
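For example, a minimal writer-side step loop (a sketch; sstIO and var are assumed to be set up as in the writer sample above, and nSteps and data are application-defined):
// Assumes: adios2::IO sstIO configured with SetEngine("SST"),
// adios2::Variable<double> var, std::vector<double> data, size_t nSteps
adios2::Engine sstWriter = sstIO.Open("stream", adios2::Mode::Write);
for (size_t step = 0; step < nSteps; ++step)
{
    // Everything Put between BeginStep and EndStep forms one atomic
    // unit, delivered as a whole to the corresponding reader step
    sstWriter.BeginStep();
    sstWriter.Put(var, data.data());
    sstWriter.EndStep();
}
sstWriter.Close();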
Also, it is recommended that SST-based applications not use the ADIOS2 Get() sync method unless there is only one data item to be read per step. This is because SST implements MxN data transfer (and avoids having to deliver all data to every reader) by queueing data on the writer ranks until it is known which reader rank requires it. Normally this data fetch stage is initiated by PerformGets() or EndStep(), both of which fulfill any pending Get() deferred operations. However, unlike Get() deferred, the semantics of Get() sync require the requested data to be fetched from the writers before the call can return. If there are multiple calls to Get() sync per step, each one may require a communication with many writers, something that would have only had to happen once if Get() deferred were used instead. Thus the use of Get() sync is likely to incur a substantial performance penalty.
On the writer side, depending upon the chosen data marshaling option, there may be some (relatively small) performance differences between Put() sync and Put() deferred, but they are unlikely to be as substantial as between Get() sync and Get() deferred.
Note that SST readers and writers do not necessarily move in lockstep, but depending upon the queue length parameters and queueing policies specified, differing reader and writer speeds may cause one or the other side to wait for data to be produced or consumed, or data may be dropped if allowed by the queueing policy. However, steps themselves are atomic and no step will be partially dropped, delivered to a subset of ranks, or otherwise divided.
The SST engine allows the user to customize the streaming operations through the following optional parameters:
1. RendezvousReaderCount: Default 1. This integer value specifies
the number of readers for which the writer should wait before the
writer-side Open() returns. The default of 1 implements an ADIOS1/flexpath
style “rendezvous”, in which an early-starting reader will wait for the
writer to start, or vice versa. A number >1 will cause the writer to wait
for more readers and a value of 0 will allow the writer to proceed without
any readers present. This value is interpreted by SST Writer engines only.
2. RegistrationMethod: Default “File”. By default, SST reader and
writer engines communicate network contact information via files in a shared
filesystem. Specifically, the "filename"
parameter in the Open()
call is
interpreted as a path which the writer uses as the name of a file to which
contact information is written, and from which a reader will attempt to read
contact information. As with other file-based engines, file creation and
access is subject to the usual considerations (directory components are
interpreted, but must exist and be traversable, writer must be able to
create the file and the reader must be able to read it). Generally the file
so created will exist only for as long as the writer keeps the stream
Open(), but abnormal process termination may leave “stale” files in those
locations. These stray “.sst” files should be deleted to avoid confusing
future readers. SST also offers a “Screen” registration method in which
writers and readers send their contact information to, and read it from,
stdout and stdin respectively. The “screen” registration method doesn’t
support batch mode operations in any way, but may be useful when manually
starting jobs on machines in a WAN environment that don’t share a
filesystem. A future release of SST will also support a “Cloud”
registration method where contact information is registered to and retrieved
from a network-based third-party server so that both the shared filesystem
and interactivity can be avoided. This value is interpreted by both SST
Writer and Reader engines.
3. QueueLimit: Default 0. This integer value specifies the number
of steps which the writer will allow to be queued before taking specific
action (such as discarding data or waiting for readers to consume the
data). The default value of 0 is interpreted as no limit. This value is
interpreted by SST Writer engines only.
4. QueueFullPolicy: Default “Block”. This value controls what
policy is invoked if a non-zero QueueLimit has been specified and
new data would cause the queue limit to be reached. Essentially, the
“Block” option ensures data will not be discarded and if the queue
fills up the writer will block on EndStep until the data has been
read. If there is one active reader, EndStep will block until data
has been consumed off the front of the queue to make room for newly
arriving data. If there is more than one active reader, it is only
removed from the queue when it has been read by all readers, so the
slowest reader will dictate progress. NOTE THAT THE NO READERS
SITUATION IS A SPECIAL CASE: If there are no active readers, new
timesteps are considered to have completed their active queueing
immediately upon submission. They may be retained in the “reserve
queue” if the ReserveQueueLimit is non-zero. However, if that
ReserveQueueLimit parameter is zero, timesteps submitted when there
are no active readers will be immediately discarded.
Besides “Block”, the other acceptable value for QueueFullPolicy is “Discard”. When “Discard” is specified, and an EndStep operation would add more than the allowed number of steps to the queue, some step is discarded. If there are no current readers connected to the stream, the oldest data in the queue is discarded. If there are current readers, then the newest data (i.e. the just-created step) is discarded. (The differential treatment is because SST sends metadata for each step to the readers as soon as the step is accepted and cannot reliably prevent the use of that data without a costly all-to-all synchronization operation. Discarding the newest data instead is less satisfying, but has a similar long-term effect upon the set of steps delivered to the readers.) This value is interpreted by SST Writer engines only. A sketch combining QueueLimit and QueueFullPolicy appears after this list.
5. ReserveQueueLimit: Default 0. This integer value specifies the
number of steps which the writer will keep in the queue for the benefit
of late-arriving readers. This may consist of timesteps that have
already been consumed by any readers, as well as timesteps that have not
yet been consumed. In some sense this is target queue minimum size,
while QueueLimit is a maximum size. This value is interpreted by SST
Writer engines only.
6. DataTransport: Default varies. This string value specifies
the underlying network communication mechanism to use for exchanging
data in SST. Generally this is chosen by SST based upon what is
available on the current platform. However, specifying this engine
parameter allows overriding SST’s choice. Current allowed values are
“UCX”, “MPI”, “RDMA”, and “WAN”. (ib and fabric are accepted as
equivalent to RDMA and evpath is equivalent to WAN.)
Generally both the reader and writer should be using the same network
transport, and the network transport chosen may be dictated by the
situation. For example, the RDMA transport generally operates only
between applications running on the same high-performance interconnect
(e.g. on the same HPC machine). If communication is desired between
applications running on different interconnects, the Wide Area Network
(WAN) option should be chosen. This value is interpreted by both SST
Writer and Reader engines.
7. WANDataTransport: Default sockets. If the SST DataTransport parameter is “WAN”, this string value specifies
the EVPath-level data transport to use for exchanging data. The value
must be a data transport known to EVPath, such as “sockets”,
“enet”, or “ib”. Generally both the reader and writer should
be using the same EVPath-level data transport. This value is
interpreted by both SST Writer and Reader engines.
8. ControlTransport: Default tcp. This string value specifies
the underlying network communication mechanism to use for performing
control operations in SST. SST can be configured to use standard TCP
sockets, which are very reliable and efficient, but which are limited
in their scalability. Alternatively, SST can use a reliable UDP
protocol, that is more scalable, but as of ADIOS2 Release 2.4.0 still
suffers from some reliability problems. (sockets is accepted as equivalent to tcp, and udp, rudp, and enet are equivalent to scalable.) Generally both the reader and writer
should be using the same control transport. This value is interpreted
by both SST Writer and Reader engines.
9. NetworkInterface: Default NULL. In situations in which
there are multiple possible network interfaces available to SST, this
string value specifies which should be used to generate SST’s contact
information for writers. Generally this should NOT be specified
except for narrow sets of circumstances. It has no effect if
specified on Reader engines. If specified, the string value should
correspond to a name of a network interface, such as are listed by
commands like “netstat -i”. For example, on most Unix systems,
setting the NetworkInterface parameter to “lo” (or possibly “lo0”)
will result in SST generating contact information that uses the
network address associated with the loopback interface (127.0.0.1).
This value is interpreted only by the SST Writer engine.
10. ControlInterface: Default NULL. This value is similar to the
NetworkInterface parameter, but only applies to the SST layer which does
messaging for control (open, close, flow and timestep management, but not
actual data transfer). Generally the NetworkInterface parameter can be used
to control this, but that also applies to the Data Plane. Use
ControlInterface in the event of conflicting specifications.
11. DataInterface: Default NULL. This value is similar to the
NetworkInterface parameter, but only applies to the SST layer which does
messaging for data transfer, not control (open, close, flow and timestep
management). Generally the NetworkInterface parameter can be used to
control this, but that also applies to the Control Plane. Use DataInterface
in the event of conflicting specifications. In the case of the RDMA data
plane, this parameter controls the libfabric interface choice.
12. FirstTimestepPrecious: Default FALSE.
FirstTimestepPrecious is a boolean parameter that affects the queueing
of the first timestep presented to the SST Writer engine. If
FirstTimestepPrecious is TRUE, then the first timestep is
effectively never removed from the output queue and will be presented
as a first timestep to any reader that joins at a later time. This
can be used to convey run parameters or other information that every
reader may need despite joining later in a data stream. Note that
this queued first timestep does count against the QueueLimit parameter
above, so if a QueueLimit is specified, it should be a value larger
than 1. Further note that while specifying this parameter guarantees that
the preserved first timestep will be made available to new readers,
other reader-side operations (like requesting the LatestAvailable
timestep in Engine parameters) might still cause the timestep to be skipped.
This value is interpreted only by the SST Writer engine.
13. AlwaysProvideLatestTimestep: Default FALSE.
AlwaysProvideLatestTimestep is a boolean parameter that affects which
of the available timesteps will be provided to the reader engine. If
AlwaysProvideLatestTimestep is TRUE, then if there are multiple
timesteps available to the reader, older timesteps will be skipped and
the reader will see only the newest available upon BeginStep.
This value is interpreted only by the SST Reader engine.
14. OpenTimeoutSecs: Default 60. OpenTimeoutSecs is an integer
parameter that specifies the number of seconds SST is to wait for a peer
connection on Open(). Currently this is only implemented on the Reader side
of SST, and is a timeout for locating the contact information file created
by Writer-side Open, not for completing the entire Open() handshake.
Currently this value is interpreted only by the SST Reader engine.
15. SpeculativePreloadMode: Default AUTO. In some
circumstances, SST eagerly sends all data from writers to every
reader without first waiting for read requests. Generally this
improves performance if every reader needs all the data, but can be
very detrimental otherwise. The value AUTO for this engine
parameter instructs SST to apply its own heuristic for determining if
data should be eagerly sent. The value OFF disables this feature
and the value ON causes eager sending regardless of heuristic.
Currently SST’s heuristic is simple. If the size of the reader cohort
is less than or equal to the value of the SpecAutoNodeThreshold
engine parameter (Default value 1), eager sending is initiated.
Currently this value is interpreted only by the SST Reader engine.
16. SpecAutoNodeThreshold: Default 1. If the size of the
reader cohort is less than or equal to this value and the
SpeculativePreloadMode
parameter is AUTO, SST will initiate
eager data sending of all data from each writer to all readers.
Currently this value is interpreted only by the SST Reader engine.
17. StepDistributionMode: Default “AllToAll”. This value
controls how steps are distributed, particularly when there are
multiple readers. By default, the value is “AllToAll”, which
means that all timesteps are to be delivered to all readers (subject
to discard rules, etc.). In other distribution modes, this is not the
case. For example, in “RoundRobin”, each step is delivered
only to a single reader, determined in a round-robin fashion based
upon the number of readers who have opened the stream at the time the
step is submitted. In “OnDemand” each step is delivered to a
single reader, but only upon request (with a request being initiated
by the reader doing BeginStep()). Normal reader-side rules (like
BeginStep timeouts) and writer-side rules (like queue limit behavior) apply.
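As referenced in item 4 above, a writer-side sketch combining the queue parameters (values are illustrative):
adios2::IO sstIO = adios.DeclareIO("Output");
sstIO.SetEngine("SST");
// Queue at most 5 steps; when full, discard rather than block EndStep
sstIO.SetParameters({{"QueueLimit", "5"},
                     {"QueueFullPolicy", "Discard"}});
adios2::Engine sstWriter = sstIO.Open("stream", adios2::Mode::Write);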
| Key | Value Format | Default and Examples |
|---|---|---|
| RendezvousReaderCount | integer | 1 |
| RegistrationMethod | string | File, Screen |
| QueueLimit | integer | 0 (no queue limits) |
| QueueFullPolicy | string | Block, Discard |
| ReserveQueueLimit | integer | 0 (no queue limits) |
| DataTransport | string | default varies by platform, UCX, MPI, RDMA, WAN |
| WANDataTransport | string | sockets, enet, ib |
| ControlTransport | string | TCP, Scalable |
| MarshalMethod | string | BP5, BP, FFS |
| NetworkInterface | string | NULL |
| ControlInterface | string | NULL |
| DataInterface | string | NULL |
| FirstTimestepPrecious | boolean | FALSE, true, no, yes |
| AlwaysProvideLatestTimestep | boolean | FALSE, true, no, yes |
| OpenTimeoutSecs | integer | 60 |
| SpeculativePreloadMode | string | AUTO, ON, OFF |
| SpecAutoNodeThreshold | integer | 1 |
SSC Strong Staging Coupler
The SSC engine is designed specifically for strong code coupling. Currently SSC only supports a fixed I/O pattern: once the first step is finished, users are not allowed to write or read a data block with a start and count that have not been written or read in the first step. SSC uses a combination of one-sided and two-sided MPI methods. In all cases, user applications are required to be launched within a single mpirun or mpiexec command, using the MPMD mode.
The SSC engine takes the following parameters:
OpenTimeoutSecs: Default 10. Timeout in seconds for opening a stream. The SSC engine’s open function will block until the RendezvousAppCount is reached, or timeout, whichever comes first. If it reaches the timeout, SSC will throw an exception.
Threading: Default False. SSC will use threads to hide the time cost for metadata manipulation and data transfer when this parameter is set to true. SSC will check if MPI is initialized with multi-thread enabled, and if not, then SSC will force this parameter to be false. Please do NOT enable threading when multiple I/O streams are opened in an application, as it will cause unpredictable errors. This parameter is only effective when writer definitions and reader selections are NOT locked. For cases where definitions and reader selections are locked, SSC has a more optimized way to do data transfers, and thus it will not use this parameter.
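A sketch of opening an SSC stream with a longer timeout (the stream and IO names are illustrative):
adios2::IO sscIO = adios.DeclareIO("Coupling");
sscIO.SetEngine("SSC");
// Wait up to 60 seconds for the coupled application to open the stream
sscIO.SetParameters({{"OpenTimeoutSecs", "60"}});
adios2::Engine sscWriter = sscIO.Open("coupling_stream", adios2::Mode::Write);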
| Key | Value Format | Default and Examples |
|---|---|---|
| OpenTimeoutSecs | integer | 10, 2, 20, 200 |
| Threading | bool | false, true |
DataMan for Wide Area Network Data Staging
The DataMan engine is designed for data staging over the wide area network. It is supposed to be used in cases where a few writers send data to a few readers over long distance.
DataMan supports compression operators such as ZFP lossy compression and BZip2 lossless compression. Please refer to the operator section for usage.
The DataMan engine takes the following parameters:
IPAddress: No default value. The IP address of the host where the writer application runs. This parameter is compulsory in wide area network data staging.
Port: Default 50001. The port number on the writer host that will be used for data transfers.
Timeout: Default 5. Timeout in seconds to wait for every send / receive operation. Packages not sent or received within this time are considered lost.
RendezvousReaderCount: Default 1. This integer value specifies the number of readers for which the writer should wait before the writer-side Open() returns. By default, an early-starting writer will wait for the reader to start, or vice versa. A number >1 will cause the writer to wait for more readers, and a value of 0 will allow the writer to proceed without any readers present. This value is interpreted by DataMan Writer engines only.
Threading: Default true for reader, false for writer. Whether to use threads for send and receive operations. Enabling threading will cause extra overhead for managing threads and buffer queues, but will improve the continuity of data steps for readers, and help overlap data transfers with computations for writers.
TransportMode: Default fast. Only DataMan writers take this parameter. Readers are automatically synchronized at runtime to match the writers’ transport mode. The fast mode is optimized for latency-critical applications. It enforces readers to only receive the latest step. Therefore, in cases where writers are faster than readers, readers will skip some data steps. The reliable mode ensures that all steps are received by readers, by sacrificing performance compared to the fast mode.
MaxStepBufferSize: Default 128000000. In order to bring down the latency in wide area network staging use cases, DataMan uses a fixed receiver buffer size. This saves an extra communication operation to sync the buffer size for each step, before sending actual data. The default buffer size is 128 MB, which is sufficient for most use cases. However, in case 128 MB is not enough, this parameter must be set correctly, otherwise DataMan will fail.
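A writer-side sketch for WAN staging (the IP address and stream name are illustrative):
adios2::IO dmIO = adios.DeclareIO("WANStream");
dmIO.SetEngine("DataMan");
// The writer host's IP and port; readers connect using the same values
dmIO.SetParameters({{"IPAddress", "192.168.1.100"},
                    {"Port", "50001"},
                    {"TransportMode", "reliable"}});
adios2::Engine dmWriter = dmIO.Open("stream", adios2::Mode::Write);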
| Key | Value Format | Default and Examples |
|---|---|---|
| IPAddress | string | N/A, 22.195.18.29 |
| Port | integer | 50001, 22000, 33000 |
| Timeout | integer | 5, 10, 30 |
| RendezvousReaderCount | integer | 1, 0, 3 |
| Threading | bool | true for reader, false for writer |
| TransportMode | string | fast, reliable |
| MaxStepBufferSize | integer | 128000000, 512000000, 1024000000 |
DataSpaces
The DataSpaces engine for ADIOS2 is experimental. DataSpaces is an asynchronous I/O transfer method within ADIOS that enables low-overhead, high-throughput data extraction from a running simulation. DataSpaces is designed for use in HPC environments and can take advantage of RDMA network interconnects to speed the transfer of data between communicating HPC applications. DataSpaces supports full MxN data distribution, where the number of reader ranks can differ from the number of writer ranks. In addition, this engine supports multiple reader and writer applications, which must be distinguished by unique values of AppID. This can be set in the xml config file with tag <parameter key="AppID" value="2"/>. The value should be unique for each application or client.
To use this engine, you can either specify it in your xml config file, with tag <engine type=DATASPACES>, or set it in client code. For example, here is how to create a DataSpaces reader:
adios2::IO dspacesIO = adios.DeclareIO("SomeName");
dspacesIO.SetEngine("DATASPACES");
adios2::Engine dspacesReader = dspacesIO.Open(filename, adios2::Mode::Read);
and a sample code for DataSpaces writer is:
adios2::IO dspacesIO = adios.DeclareIO("SomeName");
dspacesIO.SetEngine("DATASPACES");
adios2::Engine dspacesWriter = dspacesIO.Open(filename, adios2::Mode::Write);
To make use of the DataSpaces engine, an application job needs to also run the dataspaces_server component together with the application. The server should be configured and started before the application as a separate job in the system. For example:
aprun -n $SPROC ./dataspaces_server -s $SPROC &> log.server &
The variable $SPROC represents the number of server instances to run. The & character at the end of the line places the aprun command in the background, allowing the job script to continue and run the other applications. The server processes produce a configuration file, i.e., conf.0, that is used by the application to connect to the servers. This file contains identifying information of the master server, which coordinates the client registration and discovery process. The job script should wait for the servers to start up and produce the conf.0 file before starting the client application processes.
The server also needs a user configuration read from a text file called dataspaces.conf. It specifies how many output timesteps of the same dataset (called versions) should be kept in the server’s memory and served to readers. If this file does not exist in the current directory, the server will assume default values (only 1 timestep stored). For example:
## Config file for DataSpaces
max_versions = 5
lock_type = 3
The dataspaces_server module is a stand-alone service that runs independently of a simulation on a set of dedicated nodes in the staging area. It transfers data from the application through RDMA, and can save it to a local storage system, e.g., the Lustre file system, stream it to remote sites, e.g., auxiliary clusters, or serve it directly from the staging area to other applications. One instance of the dataspaces_server can service multiple applications in parallel. Further, the server can run in cooperative mode (i.e., multiple instances of the server cooperate to service the application in parallel and to balance load). The dataspaces_server receives notification messages from the transport method, schedules the requests, and initiates the data transfers in parallel. The server schedules and prioritizes the data transfers while the simulation is computing in order to overlap data transfers with computations, to maximize data throughput, and to minimize the overhead on the application.
Inline for zero-copy
The Inline engine provides in-process communication between the writer and reader, avoiding the copy of data buffers.
This engine is focused on the N → N case: N writers share a process with N readers, and the analysis happens ‘inline’ without writing the data to a file or copying to another buffer. It is similar to the streaming SST engine, since analysis must happen per step.
To use this engine, you can either add <engine type=Inline>
to your XML config file, or set it in your application code:
adios2::IO io = adios.DeclareIO("ioName");
io.SetEngine("Inline");
adios2::Engine inlineWriter = io.Open("inline_write", adios2::Mode::Write);
adios2::Engine inlineReader = io.Open("inline_read", adios2::Mode::Read);
Notice that unlike other engines, the reader and writer share an IO instance.
Both the writer and reader must be opened before either tries to call BeginStep()
/PerformPuts()
/PerformGets()
.
There must be exactly one writer, and exactly one reader.
For successful operation, the writer will perform a step, then the reader will perform a step in the same process.
When the reader starts its step, the only data it has available is that written by the writer in its process.
The reader can then retrieve whatever data was written by the writer by using the double-pointer Get call:
void Engine::Get<T>(Variable<T>, T**) const;
This version of Get is only used for the inline engine. See the example below for details.
Note
Since the inline engine does not copy any data, the writer should avoid changing the data before the reader has read it.
Typical access pattern:
// ... Application data generation
inlineWriter.BeginStep();
inlineWriter.Put(var, in_data); // always use deferred mode
inlineWriter.EndStep();
// Unlike other engines, data should not be reused at this point (since ADIOS
// does not copy the data), though ADIOS cannot enforce this.
// Must wait until reader is finished using the data.
inlineReader.BeginStep();
double* out_data;
inlineReader.Get(var, &out_data);
// Now in_data == out_data.
inlineReader.EndStep();
Null
The Null engine performs no internal work and no I/O. It was created for testing applications that produce ADIOS2 output, by easily turning the I/O off. The runtime difference between a run with the Null engine and a run with another engine tells us the I/O overhead of that particular output with that particular engine.
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("Null");
or using the XML config file:
<adios-config>
<io name="Output">
<engine type="Null">
</engine>
</io>
</adios-config>
There is a reading Null engine as well, which will not fail, but any variable/attribute inquiry returns nullptr, and any subsequent Get() call will throw an exception in C++/Python or return an error in C/Fortran.
Note that there is also a Null transport that can be used by a BP engine instead of the default File transport. In that case, the BP engine will perform all internal work including buffering and aggregation but no data will be output at all. A run like this can be used to assess the overhead of the internal work of the BP engine.
adios2::IO io = adios.DeclareIO("Output");
io.SetEngine("BP5");
io.AddTransport("Null", {});
or using the XML config file
<adios-config>
<io name="Output">
<engine type="BP5">
</engine>
<transport type="Null">
</transport>
</io>
</adios-config>
Plugin Engine
For details on using the Plugin Engine, see the Plugins documentation.