- Build: Update protobuf detection to work for newer versions
- Bug fix: Extend error handling to account for inability to create directories during session creation.
- Bug fix: add missing sync for partitions.
- Require DDS 3.12 for the new submit features.
- Add expendable & RMSJobIDs infos to logs and gRPC replies.
- Reenable minAgents setting during submit.
- Handle number of submitted agents info returned by DDS during submit.
- In case of Slurm, if the agent submit request returns with a number of hosts lower than requested but higher than nMin, the number of expected agents will be updated and ODC will wait only for the corresponding number of slots to be active.
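
The Slurm shortfall rule above can be sketched as a small helper. This is only an illustration: the function name and the exception are hypothetical, and the strict "higher than nMin" comparison follows the wording of the entry (whether equality is also accepted is not stated).

```python
def expected_agents(requested: int, submitted: int, n_min: int) -> int:
    """Return how many agents to wait for after a submit (hypothetical helper)."""
    if submitted >= requested:
        # Everything that was asked for has been granted.
        return requested
    if submitted > n_min:
        # Fewer hosts than requested, but still above nMin: wait only for these.
        return submitted
    # Assumption: a shortfall at or below nMin is treated as a submit failure.
    raise RuntimeError(
        f"only {submitted} of {requested} agents submitted (nMin={n_min})"
    )
```

For example, `expected_agents(10, 7, 5)` yields 7, so only the slots of 7 agents would be awaited.
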
- Fix bug with current collection count going into negative when multiple devices from a single collection fail.
- Temporarily disable nMin during DDS agent submission - this avoids setting minnodes in sbatch.
- Potential workaround for hanging DDS Agent Info request: Make getAgentInfo use async DDS calls.
- Potential workaround for hanging DDS Agent Info request: Unsubscribe from DDS request events on timeout.
- Bugfix: avoid race condition in wait for async calls
- Log agent info in case of topology activation timeout
- Log topology generation failures with content aware severity: fatal when "fatal" is present in the output, error when "error" is present in the output, warning otherwise.
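
A minimal sketch of the content-aware severity rule described above. The function name is hypothetical, and matching case-insensitively is an assumption; the entry only states which keywords are looked for.

```python
def classify_topo_gen_severity(output: str) -> str:
    """Pick a log severity from the output of a failed topology generation."""
    text = output.lower()  # assumption: the keyword match ignores case
    if "fatal" in text:
        return "fatal"
    if "error" in text:
        return "error"
    return "warning"
```

Note that "fatal" is checked first, so output containing both keywords is logged as fatal.
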
- Breaking Change: Support expendable collections. When nMin is 0, collection is considered to be expendable and failure of all collection members will not trigger a global error. Previously nMin of 0 was the default, which would mean no nMin defined. Now this behaviour occurs when nMin is -1, which is also the new default.
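
The three nMin cases above (undefined, expendable, and a positive minimum) can be illustrated as follows. This is a sketch under assumptions, not the controller's actual logic: the function name is hypothetical, and the rules "any failure is an error when nMin is undefined" and "an error is raised once fewer than nMin members remain" are assumed readings of the entry.

```python
NMIN_UNDEFINED = -1  # new default: no nMin defined
NMIN_EXPENDABLE = 0  # collection is expendable


def triggers_global_error(n_min: int, n_total: int, n_failed: int) -> bool:
    """Decide whether failed collection members should raise a global error."""
    if n_failed == 0:
        return False
    if n_min == NMIN_UNDEFINED:
        # Assumption: without nMin, any failure is a global error.
        return True
    # With nMin == 0 the remaining count can never drop below nMin,
    # so even a failure of all members is tolerated.
    remaining = n_total - n_failed
    return remaining < n_min
```
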
- Fix a deprecation warning with C++20
- Add more details in the log on failed tasks/collections: host & working directory
- CustomCommands: Adapt to the changes in https://github.com/google/flatbuffers/releases/tag/v23.5.8 by dropping the (unused) JSON commands format.
- Rename GrpcController -> GrpcServer (non-breaking).
- Additional debug info for request timeouts.
- gRPC controller: log request before lock to provide better feedback if the lock can't be acquired.
- Logger: add trace severity ("trc" value for --severity for CLI tools). Trace is lower than debug. This severity is not forwarded to infoLogger at all.
- Log detailed device list (response to getState --detailed) only in trace severity.
- Check for negative timeout values and produce an error if they occur.
- GetState: log the request and its results with debug severity only, unless there is an error.
- GetState: do not error if the topology is not yet initialized, instead return aggregatedState = Undefined, detailedState = [], without error.
- odc-grpc-server: add --infologger-severity cmd option to control which severities are passed to infoLogger (default is inf).
- Apply nMin/expendable checks for hanging devices
- Fix regression in resource parameter extraction and add tests for the breaking case
- Fixed known issue from 0.78.0-beta: agents are now shut down on failed tasks/collections from expendable/nMin triggers.
- Bugfix: updated Topology Ops to allow ignoring of tasks. Used during ignore events in Topology, to avoid command races.
- InfoLogger: Set message levels (same translation as in AliceO2)
- New Feature: Async handling for nMin and expendable tasks. Async GetState requests report proper state, taking expendable tasks & nMin into account.
- New Feature: Allow resource extraction from the topology file. Derives number of agents, agent slots and other requirements from the given topology file. To use this:
  - Launch the server (grpc or cli) with `--rms <slurm/ssh/localhost>` and `--zones <name>:<cfgFilePath>:<envFilePath>` (latter is the same as --zones for the Slurm plugin, but without the number of slots).
  - Set `extractTopoResources` to `true` in the Run request (or --extract-topo-resources true for CLI clients). Off by default.
  - The Run request will not use the `plugin` and `resources` parameters. These can be left empty.
  - For the server, `--rp` is not used. It doesn't have to be set.

  This is completely optional - previous approach works as before.
- Breaking Change: Remove request triggers (unused).
- Breaking Change: nMin of 0 does nothing in the main Controller. Previously empty group would be allowed to be executed. In practice this was unused because topology creation tools did not allow for this case.
- Bugfix: Change-/WaitForState Operation: handle unexpected exiting state
- Bugfix: WaitForState Operation: fix timer cancellation
- Bugfix: Topology: Include exited tasks when resetting op count
- Tests: Extend unit tests
- Tests: Add parameter test for epn topo
- Tests: Add testsuite for requirements tests
- Tests: Add testsuite for parameter tests
Known Issues:
- DDS agents are not shut down if their tasks/collections fail, but the session is still ongoing.
- Bugfix: handle return value of FairMQ's ChangeState
- Improvement: Fail early on resource plugin failures.
- Improvement: Include agent startup time in the agent info output.
- Bugfix: Fix incorrect slot count on submission failures.
- Bugfix: Fix some CLI tools/commands not stopping execution after --help.
- Remove async mode of the gRPC controller. The `--sync` CLI parameter now does nothing and is deprecated.
- Remove logger from the odc-rp-epn-slurm plugin. `--logdir`, `--severity` and `--infologger` parameters now do nothing and are deprecated.
- Include list of hosts in the Run/Submit result.
- The topology generation script will have an `ODC_TOPO_GEN_CMD` environment variable set, containing the command line used to execute the script.
- Cleanup output of topology generation failure: remove cmd (duplicate), shorten stdout, focus stderr.
- Allow defining expendable tasks, failure of which will be ignored and excluded from the aggregated state report. To define an expendable task, add a custom DDS requirement in the topology file, whose name starts with `odc_expendable_` and value `true`, e.g.:

      <declrequirement name="odc_expendable_task" type="custom" value="true" />

  And use the requirement in the task declaration:

      <decltask name="Processor">
        <exe>odc-ex-processor</exe>
        <requirements>
          <name>odc_expendable_task</name>
        </requirements>
      </decltask>
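
To illustrate the expendable-task convention, the sketch below scans a topology for tasks that reference a custom requirement whose name starts with `odc_expendable_` and whose value is `true`. It is not how ODC itself detects them. The `<topology>` wrapper, the second task `Sampler` and the function name are hypothetical additions for the example.

```python
import xml.etree.ElementTree as ET

EXPENDABLE_PREFIX = "odc_expendable_"

# Minimal example topology; the root element and the "Sampler" task are illustrative.
TOPOLOGY = """
<topology name="Example">
  <declrequirement name="odc_expendable_task" type="custom" value="true" />
  <decltask name="Processor">
    <exe>odc-ex-processor</exe>
    <requirements>
      <name>odc_expendable_task</name>
    </requirements>
  </decltask>
  <decltask name="Sampler">
    <exe>odc-ex-sampler</exe>
  </decltask>
</topology>
"""


def expendable_tasks(xml_text: str) -> list[str]:
    """Return names of tasks that use an expendable requirement."""
    root = ET.fromstring(xml_text)
    # Collect requirement names that mark a task as expendable.
    flagged = {
        req.get("name")
        for req in root.iter("declrequirement")
        if req.get("type") == "custom"
        and (req.get("name") or "").startswith(EXPENDABLE_PREFIX)
        and req.get("value") == "true"
    }
    result = []
    for task in root.iter("decltask"):
        used = {n.text.strip() for n in task.findall("./requirements/name") if n.text}
        if used & flagged:
            result.append(task.get("name"))
    return result
```

With the example topology, only `Processor` is reported as expendable.
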
- Adjust timeout on subsequent async ops within one request to avoid delaying execution beyond original timeout value.
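
One common way to keep a chain of operations within the original timeout is to fix a deadline once and hand each subsequent operation only the time that is left. The class below is a sketch of that idea, not ODC's implementation. The class name and the injectable `clock` parameter are hypothetical.

```python
import time


class TimeoutBudget:
    """Tracks the time left for a request made of several sequential operations."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._clock = clock
        # The deadline is fixed once, when the request starts.
        self._deadline = clock() + timeout_s

    def remaining(self) -> float:
        """Timeout to pass to the next operation; never negative."""
        return max(0.0, self._deadline - self._clock())
```

Each async op would then be started with `budget.remaining()` instead of the full request timeout.
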
- Include run number into the exiting task log entry, when it is valid (primarily during Running state).
- epnc plugin removed.
- restore file is now updated after session shutdown, to avoid inconsistent entries.
- bugfix: crash during submit recovery when no agents were launched.
- defaults removed from --cmds, --cf batch options.
- defaults removed from --topo, --script, --content, --prop, --plugin, --resources options.
- defaults removed from --res of the odc-rp-epn-slurm plugin.
- improved resource validation in the Slurm plugin. More errors are caught with improved error messages.
- disallow repeated Run requests per partition. Only single Run request is allowed, until partition is shut down.
- allow skipping 'n' in the resource field, and pick up core scheduling settings from the topology.
- State change errors are now logged as fatal.
- fixed: wrong severity logged in some cases.
- fixed: repeated task failures are counted multiple times despite being ignored.
- fixed: Use of moved-from value during Submit.
- fixed: --restore-dir working incorrectly with missing trailing slash.
- fixed: odc-epn-topo: calib was missing prependExe.
- Require DDS 3.7.15.
- Add support for core-based scheduling. Includes some breaking changes in odc-epn-topo:
  - `--recogroup` & `--calibgroup` arguments are removed. Agent groups are now set dynamically and internally by the tool.
  - Instead of the above, `--recozone` & `--calibzone` must be set, whose values must correspond to the zones used at ODC server start in the resource plugin's `--zones`.
  - To request core-based scheduling for calib topos, add `:<ncores>` in the arguments, e.g. `--calib calib_mft.xml:20`. This is actually non-breaking - omitting `:<ncores>` will simply not add any core-based scheduling.
- Add `--restore-dir` parameter to control where restore files are located.
- Add session history file. By default it is in `$HOME/.ODC/history`, can be changed via `--history-dir`.
- Honor nMin on SetProperties command.
- Split the log output of the topology generation script provided via gRPC request on new lines.
- SetProperties command reply includes correct topology state, instead of `Undefined`.
- Include DDS session ID on Shutdown command reply.
- Bugfix: Fixed incorrect tracking of nMin parameter after submission failures.
- Bugfix: Fixed invalid transitions being treated as task failures leading to task termination on nMin handling.
- Fail earlier for devices that fail between state transitions.
- Fix tasks incorrectly being labeled as failed during nMin handling.
- Honor nMin during activation.
- Add 'ignored' task status to log & grpc reply.
- Reduce log verbosity during activation.
- Split longer log entries into multiple.
- Add host/node info to `GetState --detailed` gRPC response and log.
- Allow grpc::GetState to be processed asynchronously.
- Remove `<requiredSlots>` from resource definitions. All slots are now required.
- (impl) cmds: Disable successful TransitionStatus (unused).
- (impl) cmds: Remove StateChangeExitingReceived command - use OnTaskDone instead.
- (impl) cmds: Replace CurrentState with StateChange.
- Fix several failing tests.
- Fix broken SetProperties parsing in Cli controllers.
- Cleanup session data on Shutdown command.
- execute(): fix race that can lead to incomplete output.
- execute(): fill stdout/stderr also on timeout.
- Fix more missing includes in InfoLogger module
- Fix missing includes in the InfoLogger module
- Require gRPC 1.1 (eaa122f)
- Require Protobuf 3.15 (eaa122f)
- Require CMake 3.21 (eaa122f) to fix alisw/alidist#4164
- Require DDS 3.7.12
- Breaking change: due to changes in DDS, slurm configuration has to be split in two files - Slurm config and environment config. Former is only for `#SBATCH` parameters, latter is for any kind of other environment setup or per-host calls. Example server cmd line: `odc-grpc-server --host "127.0.0.1:7777" --rp "slurm:/home/user/ODC/install/bin/odc-rp-epn-slurm --zones online:2:/home/user/slurm-online.cfg:/home/user/slurm-env.cfg calib:2:/home/user/slurm-calib.cfg:/home/user/slurm-env.cfg"`.
- Partition ID is now passed via DDS as the Slurm `--job-name` parameter.
- nMin parameter is honored during `.Submit` command.
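
The zone entries in the example command line above appear to follow a `name:slots:slurmCfg:envCfg` layout. The sketch below parses one such entry. The type and function names are hypothetical, and treating all four fields as mandatory is an assumption.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    name: str
    slots: int
    slurm_cfg: str  # file with #SBATCH parameters
    env_cfg: str    # file with environment setup / per-host calls


def parse_zone(spec: str) -> Zone:
    """Parse one '--zones' entry such as 'online:2:/path/slurm.cfg:/path/env.cfg'."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected name:slots:cfg:env, got {spec!r}")
    name, slots, cfg, env = parts
    return Zone(name, int(slots), cfg, env)
```
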
- Fix compilation error when readline is not available.
- Require DDS 3.7.10.
- Always log a list of DDS agent details on submit.
- Increase controller heartbeat interval from 0.6 seconds to 10 minutes.
- Do not change gRPC verbosity depending on configured severity, or in any other way.
- Reduce log verbosity of the Status command.
- Log failed tasks/collections line by line.
- Honor nMin parameter during device state transitions.
- odc-grpc-client: log only to stdout, not to the file.
- Use timeout value of the corresponding request.
- Require DDS 3.7.5
- Fail earlier if devices crash or transition to Error state.
- Reduce verbosity of GetState responses.
- Add stdout to failed topology generation error.
- Add .help command to CLI.
- odc-epn-topo: add support for --nmin parameter. Currently still unused in ODC controller, usage to be added in the next release.
- odc-grpc-server: make System, Facility and Role InfoLogger fields customizable via --infologger-system, --infologger-facility and --infologger-role cmd args.
- odc-grpc-client: Add missing Run request parameters for plugin and resource.
- Promote topology generation script errors to fatal and output them line by line.
- Update odc-epn-topo to handle agent groups.
- Annotate incoming gRPC request logs with client info.
- Add odc-rp-epn-slurm plugin that translates AliECS resource descriptions into DDS Slurm RMS plugin configurations.
- Serialize only running sessions to the restore file.
- Require DDS 3.7.1 (for agent group names).
- Log task and collection state stats.
- Increase log rotation size to 100MB.
- Tune log severity levels.
- Add `--sshopt` option for `odc-rp-epn`.
- Require DDS `3.5.21`.
- Log messages of state change and set property requests contain additional information about the device's runtime including hostname and working directory.
- Caching of additional information of tasks given by DDS in activate request.
- Modified: improve list of failed devices and log messages for `SetProperties`.
- Modified: improve log messages of state change requests.
- Added: `odc-epn-topo` learned new option `recown` allowing to specify worker node requirement for reco collection.
- Modified: `odc-reco-topo` add error monitoring task to calibration collection.
- Modified: Log gRPC replies with error severity in case of a failure.
- Modified: set `System` field of InfoLogger to `ODC`.
- Added: session restore. Managed sessions are serialized to a file, specified by ID. On restart ODC tries to attach to the running sessions. If that fails, the shutdown trigger is called.
- Added: optionally filter running DDS sessions in `Status` request.
- Modified: log crash of the task as `fatal` instead of `error`.
- Modified: use channel severity Boost log instead of severity Boost log.
- Added: whenever possible set `partition ID` for the log message.
- Added: set `Partition` and `Run` fields of the `InfoLogger`.
- Modified: require `InfoLogger` `2.2.0`.
- Added: optional `runnr` field to each request. If specified then run number will appear in logs.
- Added: optional `timeout` field to each request. If specified then timeout value from request is used. Otherwise, the default global timeout is used.
- Added: `odc-epn-topo` learned a new option `--mon` allowing to optionally include error monitoring tool.
- Modified: use sync version of `SetProperties` instead of async one.
- Modified: use sync version of `ChangeState` instead of async one.
- Added: docs on how to run on the EPN cluster.
- Added: print task path on task done event from DDS.
- Modified: bump DDS to 3.5.18.
- odc-rp-epn: creation of DDS SSH config file.
- Fixed: fix bug which prevents proper parsing of `.prop` request when using `odc-grpc-client`.
- Added: new `content` and `script` parameters to `Activate`, `Update` and `Run` requests. One can set either a topology filepath, content or shell commands. If `content` is set then ODC creates a temp topology file with that content. If `script` is set then ODC executes the script and saves stdout to a temp topology file.
- Added: `odc-rp-epn` plugin supports array of resources.
- Added: new `cmake` option `BUILD_EPN_PLUGIN` which switches on/off building of EPN resource plugin.
- Modified: `cmake` option `BUILD_PLUGINS` renamed to `BUILD_DEFAULT_PLUGINS`.
- Added: `odc-epn-topo` an EPN topology merging tool.
- Fixed: Prevent parallel execution of multiple requests per partition. Resolves GH-24.
- Fixed: registration of resource plugins
- Modified: `fairmq::sdk::Topology` migrated to ODC.
- Modified: FairMQ `DDS` plugin migrated to ODC and renamed to `ODC`.
- Modified: DDS FairMQ examples migrated to ODC.
- Added: Request triggers. Request trigger is an external executable which can be registered and started whenever a particular request is processed. Add `--rt` option for `odc-cli-server` and `odc-grpc-server` allowing to register request triggers.
- Added: New EPN resource plugin `odc-rp-epn` which gets a list of nodes via gRPC from `epnc` service and creates SSH config file.
- Modified: Resource plugin can be registered as a command line, not only a path.
- Fixed: Fix deadlocks in `Topology` dtor.
- Modified: Require DDS 3.5.16.
- Added: optional `InfoLogger` support.
- Added: Subscribe to DDS TaskDone events.
- Added: Improve device state change logging.
- Modified: require DDS 3.5.14.
- Fixed: crash of `Activate` and `Submit` requests for empty session.
- Fixed: `Shutdown` request in case session was stopped by `dds-session` or `dds-commander` was killed.
- Added: More functional tests.
- Modified: require DDS 3.5.13.
- Added: Status request which returns a list of partition/session statuses, i.e. DDS session status, aggregated topology state.
- gRPC: async server implementation. Async server allows better control of threads. Only a single request is processed at a time. Multiple connections to the server are allowed. Async is a default for `odc-grpc-server`. Use `--sync` option to set sync mode.
- Fixed: Linux compilation error.
- Added: `odc-topo` adds a single instance requirement for DPL collection.
- Added: initial version of `alfa-ci` integration.
- Added: batch execution of requests from a configuration file. Filepath to a configuration file can be specified via `--cf` option added to `odc-grpc-client` and `odc-cli-server`.
- Added: batch execution for interactive mode. The set of options is the same as for executable. Either `--cmds` containing an array of requests or `--cf` containing a filepath to commands configuration file.
- Added: `.sleep` command allowing to sleep for some time between the requests. This is useful for testing and batch execution.
- Added: optional `readline` support with command completion and searchable history.
- Added: request/response logs for `odc-grpc-server`.
- Fixed: setting a timeout for odc-cli-server.
- Added: Resource plugins.
- Examples: Update topology creation example.
- gRPC: set severity for gRPC library. Setting severity to dbg, inf, err also sets corresponding value of GRPC_VERBOSITY. For command line use `--severity` option of `odc-grpc-server` and `odc-grpc-client`. Resolves GH-14.
- Added: new CMake build options: `BUILD_GRPC_CLIENT`, `BUILD_GRPC_SERVER`, `BUILD_CLI_SERVER`, `BUILD_EXAMPLES`. In order to build without `Protobuf` and `gRPC` dependencies one has to explicitly disable building of `odc-grpc-server` and `odc-grpc-client` via `cmake` command line options `-DBUILD_GRPC_CLIENT=OFF` and `-DBUILD_GRPC_SERVER=OFF`.
- Added: `--version` argument for all executables.
- Modified: new C++ standard requirement - C++17.
- Added: new Command Line Interface of `odc-cli-server` and `odc-grpc-client`. New interface was adapted for the multi-partition case: each request now contains a list of command line options.
- Added: support of multiple partitions. In DDS terminology partition translates to a DDS session. ODC internally manages a mapping between a partition ID and corresponding DDS session.
- Modified: protocol was adapted to support multiple partitions (DDS sessions). Each request to ODC and each reply from ODC contains a `string partitionid` which uniquely identifies a partition. `runid` is removed.
- Modified: improved error propagation. Use proper error codes and messages.
- Fixed: reset topology on shutdown. Topology can be activated and stopped multiple times.
- Added: execution of a sequence of commands in a batch mode. New "--batch" and "--cmds" command line arguments.
- Fixed: command line options of set properties request.
- Added: new Run request which combines Initialize, Submit and Activate. Run request always creates a new DDS session.
- Added: new GetState request which returns a current aggregated state of FairMQ devices.
- Modified: bump DDS version to 3.5.1
- Modified: documentation of the proto file.
- Modified: StateChangeRequest changed to StateRequest.
- Modified: StateChangeReply changed to StateReply. New `state` field containing aggregated device state was added.
- Modified: implement SetPropertiesRequest instead of SetPropertyRequest. Multiple properties can be set with a single request.
- Added: possibility to attach to the running DDS session.
- Added: DDS session ID in each reply.
- Added: set request timeout via command line interface.
- Added: unit tests.
- Modified: minimum required DDS version is 3.4
The first stable internal release.