As a follow-up to the status revision needed in Autosubmit (BSC-ES/autosubmit#2947), the API also needs an update to better orchestrate the status tracking process.
To better understand what needs to be done, here is a summary of how the API processes all the statuses:
- The API gets all the entries of `experiment` and `experiment_status`
- For each experiment:
  - Get the modified time of the pkl file (for 4.2.0 this will be replaced by the max time in the modified column of the jobs table)
  - If the retrieved time is within the last 10 minutes, the experiment is considered `RUNNING`
  - Else, if the retrieved time is within the last hour, the API looks at the modified times of the `*_run.log` files to see whether any of them was modified within the last 5 minutes
  - Else, it is set as `NOT RUNNING`
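
To make the decision flow above easier to discuss, here is a minimal sketch of it as a pure function. It is not the actual API code: the name `classify_status` and its parameters are hypothetical, and it assumes that a recent `*_run.log` in the one-hour branch means `RUNNING` and that everything else falls back to `NOT RUNNING`, which the summary does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

RUNNING = "RUNNING"
NOT_RUNNING = "NOT RUNNING"

# Thresholds taken from the summary above.
PKL_FRESH = timedelta(minutes=10)
PKL_RECENT = timedelta(hours=1)
LOG_FRESH = timedelta(minutes=5)


def classify_status(
    now: datetime,
    pkl_mtime: Optional[datetime],
    run_log_mtimes: Iterable[datetime],
) -> str:
    """Hypothetical helper mirroring the status heuristic described above."""
    if pkl_mtime is None:
        # Assumption: no pkl file (or no jobs rows) means nothing is running.
        return NOT_RUNNING

    age = now - pkl_mtime
    if age <= PKL_FRESH:
        return RUNNING

    if age <= PKL_RECENT:
        # Fallback "heartbeat": any *_run.log modified in the last 5 minutes.
        # Assumption: a fresh log means RUNNING, otherwise NOT RUNNING.
        if any(now - mtime <= LOG_FRESH for mtime in run_log_mtimes):
            return RUNNING

    return NOT_RUNNING
```

Keeping the heuristic free of file system and DB access would make it possible to swap the pkl time for the max `modified` value of the jobs table without touching the decision logic.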
@ntorqulu, to avoid overwriting the `DELETED`/`ARCHIVED` statuses from BSC-ES/autosubmit#2980, your idea of first filtering the `RUNNING` (and `NOT RUNNING`) experiments seems great.
However, now we also have to discuss:
- how to replace the `*_run.log` files "heartbeat" to rely less on the file system and more on the DB
- how to avoid race conditions where the API can overwrite statuses set by Autosubmit (e.g. by double-checking in the API, right before the update, for possibly outdated statuses)
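
As one possible starting point for the race condition discussion, the "double check before update" could be pushed into the `UPDATE` statement itself, so that the write only succeeds if the status is still the one the API read earlier. The sketch below uses `sqlite3` and a hypothetical `experiment_status(exp_id, status)` table; the real schema, column names, and DB backend may differ, and `update_status_if_unchanged` is an illustrative name, not an existing function.

```python
import sqlite3

# Only these statuses are owned by the API heuristic; anything else
# (e.g. DELETED/ARCHIVED) is left untouched.
TRACKED = ("RUNNING", "NOT RUNNING")


def update_status_if_unchanged(
    conn: sqlite3.Connection,
    exp_id: int,
    seen_status: str,
    new_status: str,
) -> bool:
    """Write new_status only if the row still holds seen_status.

    Returns True if the row was updated, False if it was skipped because
    the status is not tracked or was changed by someone else meanwhile.
    """
    if seen_status not in TRACKED or new_status not in TRACKED:
        return False

    # Compare-and-set: the WHERE clause re-checks the status atomically,
    # so a concurrent change by Autosubmit makes this a no-op.
    cur = conn.execute(
        "UPDATE experiment_status SET status = ? "
        "WHERE exp_id = ? AND status = ?",
        (new_status, exp_id, seen_status),
    )
    conn.commit()
    return cur.rowcount == 1
```

With this shape, an experiment that Autosubmit marks as `ARCHIVED` between the API's read and its write is simply skipped, and the API can pick up the new value on its next pass.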