[Release] Fix promotion jobs race condition #16633
Open
We promote Debian packages within packages.o1test.net by moving them from channel A to channel B (e.g. unstable to alpha). Usually there are 8 packages to move. Currently the promotions run in parallel, because that was the default mode for any Buildkite job. It was also challenging to make all jobs run sequentially with the existing Dhall framework code. Fortunately, after some earlier refactoring, we can now define a promotion job that depends on the previous one.
Usually 16 jobs are spawned:
Promotion jobs run in a non-blocking fashion. Each verification job waits for its particular promotion job to finish. The simplified description below gives a hint about the order:
....
Archive Promotion From A to B:
Archive Promotion From A to B:
Archive Promotion From A to B
That flow was problematic because our Debian repo was under "heavy" load from 8 concurrent jobs, all modifying a single Release file hosted on S3. We have a locking mechanism in place, and S3 does not permit deadlocks or interrupted writes. However, uploading a ~150 MB Debian package takes a while, and the Release file is usually uploaded first. This can lead to a situation where two jobs modify the Release file and, while the packages are still uploading, one job fails because it waited too long for the lock. That can leave the Release file in a wrong state. The fix is easy but requires manual intervention (rerunning the manifest fix via deb-s3). We could of course automate the manifest fix so that it always runs before uploading a package, but it would be nicer to avoid the conflict in the first place, and we would need to run the fix before almost every interaction with deb-s3, which takes some time.
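The lock timeout described above can be illustrated with a rough, deterministic model. This is only a sketch under simplifying assumptions: it treats the lock as held for the whole upload, and the function name `simulate`, the 60-second upload time, and the 30-second timeout are hypothetical values chosen for illustration, not taken from deb-s3 or the pipeline.

```python
from typing import List


def simulate(upload_seconds: List[float], lock_timeout: float, concurrent: bool) -> List[str]:
    """Model jobs competing for one lock (hypothetical, simplified).

    Assumption: a job holds the lock for its entire upload.
    In concurrent mode every job starts at t=0; in sequential mode
    each job starts only after the previous one has released the lock.
    """
    results = []
    lock_free_at = 0.0  # time at which the lock becomes available
    for upload in upload_seconds:
        start = 0.0 if concurrent else lock_free_at
        wait = max(0.0, lock_free_at - start)
        if wait > lock_timeout:
            # The job gives up waiting; it never takes the lock.
            results.append("timeout")
            continue
        lock_free_at = start + wait + upload
        results.append("ok")
    return results


if __name__ == "__main__":
    uploads = [60.0, 60.0, 60.0]  # hypothetical upload durations in seconds
    print("concurrent:", simulate(uploads, lock_timeout=30.0, concurrent=True))
    print("sequential:", simulate(uploads, lock_timeout=30.0, concurrent=False))
```

In this model the concurrent run times out for every job that has to wait longer than the timeout, while the sequential run never waits on the lock at all.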
The fix for the above situation is to make all promotion jobs sequential, while leaving the promotion verifications concurrent but dependent on their corresponding promotion jobs:
LogProc Promotion From A to B
Archive Promotion From A to B
....
Archive Promotion From A to B
Archive Promotion From A to B
Archive Promotion From A to B
Please note that the wait between promotion and verification jobs is configured in the pipeline setup, not in the Dhall files.
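The dependency layout described above can be sketched as a small model. This is not the Dhall code from this PR: the `Step` class, the `build_steps` helper, and the `promote-`/`verify-` key scheme are hypothetical, and the `key`/`depends_on` field names merely mirror the attributes Buildkite uses for step dependencies. In the real pipeline the promotion-to-verification wait lives in the pipeline setup; it is modelled here as a plain dependency for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical stand-in for a pipeline step."""
    key: str
    depends_on: List[str] = field(default_factory=list)


def build_steps(packages: List[str]) -> List[Step]:
    """Chain promotion jobs one after another; each verification job
    depends only on the promotion job for the same package."""
    steps: List[Step] = []
    previous_promote: Optional[str] = None
    for pkg in packages:
        promote_key = f"promote-{pkg}"
        # A promotion waits for the previous promotion, so only one job
        # touches the Release file at a time.
        deps = [previous_promote] if previous_promote else []
        steps.append(Step(promote_key, deps))
        # Verifications stay concurrent with each other.
        steps.append(Step(f"verify-{pkg}", [promote_key]))
        previous_promote = promote_key
    return steps


if __name__ == "__main__":
    for step in build_steps(["logproc", "archive"]):
        print(step.key, "depends on", step.depends_on)
```

With this shape, a verification job can start as soon as its own promotion finishes, while the next promotion in the chain proceeds independently of any verification.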