
introduce fire_and_forget #127

Merged: 1 commit, Aug 29, 2024


francesconazzaro
Collaborator

No description provided.

@francesconazzaro francesconazzaro merged commit 9f4fcbd into main Aug 29, 2024
8 of 9 checks passed
@francesconazzaro francesconazzaro deleted the investigate-broker-restart branch August 29, 2024 09:14

@ecmwf-cobarzan ecmwf-cobarzan left a comment

Recipe to reproduce clone runs (2024.08.29, ~9AM UK time):

  • configure cds-accept-cci2 as a development stack (using all latest images)
  • use:
    • CADS_BROKER_REF: main
    • CADS_WORKER_REF: main
  • use 3 workers (i.e. 12 processes capacity in total)
  • launch 14 requests (or more), running for ~2-3 minutes each (I used 14 identical "reanalysis-era5-pressure-levels" requests, avoiding the cache with a random key/value in the request)
  • do a random series of:
    • QoS updates, which trigger a broker restart (push them very soon after launching these rather short requests, so that the restart occurs while some jobs are still running)
    • manual restarts of the broker:
      • some very soon after the previous restart (less than 5s)
      • others allowing the broker to fully restart (~30s after the previous restart)
      • others during or after the QoS-triggered restart
    • cancel some requests on the client side:
      • while running
      • while downloading the results
  • track the logs (in Splunk or in the pods) and determine the number of uploaded results:
    • when there are more uploaded results than successful requests, it is a clear sign that clone runs occurred
  • alternatively, track the number of uploaded results directly in the object store:
    • to make it easier to track:
      • empty the cci2-accept-cache before starting the experiment
      • make sure you are the only user of the stack
      • make sure these are the only running requests (including the ADS and other heads).
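
For the request-launch step in the recipe above, a minimal sketch of how identical requests could be made to bypass the cache. The helper name `make_cache_busting_requests`, the `nocache` key and the request fields are all hypothetical, chosen for illustration only; the recipe only says that a random key/value was added to each request.

```python
import copy
import uuid


def make_cache_busting_requests(base_request, count, nonce_key="nocache"):
    """Return `count` copies of `base_request`, each with a unique random value.

    `nonce_key` is a hypothetical key name: any extra key/value that changes
    the request body should be enough to make each request distinct.
    """
    requests = []
    for _ in range(count):
        request = copy.deepcopy(base_request)  # leave the original untouched
        request[nonce_key] = uuid.uuid4().hex  # random value, unique per copy
        requests.append(request)
    return requests


# Illustrative request body (assumed fields, not taken from the recipe).
base = {"product_type": "reanalysis", "variable": "temperature"}
batch = make_cache_busting_requests(base, 14)
```

Each element of `batch` could then be submitted against the "reanalysis-era5-pressure-levels" dataset with whichever client is in use.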

The above recipe, when using CADS_BROKER_REF: investigate-broker-restart (instead of main), did not result in any clone runs.
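
The check used to tell the two outcomes apart (more uploaded results than successful requests) could be sketched as below. This assumes the upload events have already been extracted from the logs or the object store, and, for the per-request variant, that each upload can be attributed to a request identifier; both function names are hypothetical.

```python
from collections import Counter


def has_clone_runs(n_uploaded, n_successful):
    """True when more results were uploaded than requests succeeded."""
    return n_uploaded > n_successful


def find_clone_runs(upload_request_ids):
    """Map each request id with more than one upload to its upload count.

    `upload_request_ids` is assumed to hold one entry per uploaded result.
    """
    counts = Counter(upload_request_ids)
    return {rid: n for rid, n in counts.items() if n > 1}


# Made-up example: request "a" produced three uploads, so it ran as a clone.
uploads = ["a", "b", "a", "c", "a"]
duplicates = find_clone_runs(uploads)
```

The count-based check matches the recipe as written; the per-request variant is only useful if request identifiers are available in the tracked logs.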
