Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Release] Fix promotion jobs race condition #16633

Open
wants to merge 5 commits into
base: compatible
Choose a base branch
from

Conversation

dkijania
Copy link
Member

@dkijania dkijania commented Feb 17, 2025

We are promoting debian packages within packages.o1test.net by moving debians from channel A to B (e.g. unstable to alpha). Usually we have 8 packages to move. Currently it is done in parallel as this was default mode of any buildkite job. Also it was challenging to make all jobs run in sequential mode with existing dhall framework code. Fortunately after some previous refactoring we are now able to define promotion job with dependency to previous one.
We have 16 jobs which are usually spawned :

  • promote {logproc,deamon,archive,rosetta} for {focal,bullseye} = 8
  • verify promotion {logproc,deamon,archive,rosetta} for {focal,bullseye} = 8

promote jobs are run in a non blocking fashion . Verification jobs are waiting for particular promotion job to be finished. Below a simplified description should give a hint about order:

  • LogProc Promotion From A to B:
    • depends_on: null
  • Daemon Promotion From A to B:
    • depends_on: null
  • Archive Promotion From A to B:
    • depends_on: null
      ....
  • LogProc Promotion Verification From A to B:
    • depends_on: Archive Promotion From A to B:
  • Daemon Promotion From A to B:
    • depends_on: Archive Promotion From A to B:
  • Archive Promotion From A to B:
    • depends_on: Archive Promotion From A to B

That flow was problematic since our debian repo was under "heavy" load of 8 concurrent jobs which modifies a single Release file hosting on s3. We have locking mechanism implemented and s3 is not permitting any deadlock/interruption in write. However, there is a problem when uploading debian package with ~150 MB takes a while and usually Release file is uploaded first. This can lead to a issue where two jobs are modifying Release file and then while debian are still uploading one job is failed because waiting time for lock was too big. This can leave Release file in a wrong state. Fix is easy but requires manual intervention (rerun manifest fix via deb-s3). We can of course automate fixing manifest always before uploading debian but it would be nicer not to conflict it in a first place and we would need to do it almost before every interaction with deb-s3 which take some time.

Fix for above situation would be to make all promotion jobs sequential while leaving promotion verifications concurrent but dependent on corresponding promotion jobs:

  • LogProc Promotion From A to B:
    • depends_on: null
  • Daemon Promotion From A to B:
    • depends_on: LogProc Promotion From A to B
  • Archive Promotion From A to B:
    • depends_on: Archive Promotion From A to B
      ....
  • LogProc Promotion Verification From A to B:
    • depends_on: Archive Promotion From A to B
  • Daemon Promotion From A to B:
    • depends_on: Archive Promotion From A to B
  • Archive Promotion From A to B:
    • depends_on: Archive Promotion From A to B

Please note that wait between promotion and verification jobs are done on pipeline setup not in dhall files.

@dkijania dkijania self-assigned this Feb 17, 2025
This reverts commit 8d97d69339c1faeecc7171455b29b2b21d7b4ade.
@dkijania
Copy link
Member Author

!ci-build-me

@dkijania
Copy link
Member Author

@dkijania
Copy link
Member Author

!ci-build-me

@dkijania dkijania marked this pull request as ready for review February 17, 2025 17:45
@dkijania dkijania requested a review from a team as a code owner February 17, 2025 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants