Welcome to GitHub merge queue simulator. This tool is designed to model the behavior of GitHub's merge queue and provide guidance on how to dial in various settings to optimize for throughput while balancing wait time and load on CI resources.
To use this tool effectively you need to know a few things about your PR workloads. This information comes from observing your project and analysing historical data. The things you need to know:
- Duration of the PR workload
- Average CI failure rate
- Target throughput
Once you have a good handle on what these numbers are, you can start simulating queue behavior.
Merge queue simulator is a python script that runs a large number of iterations to model GitHub's merge queue behavior using queue size, probability of failure, wait time to enter the queue and its probability, and randomly selecting queue position to fail a job. The longer the duration, the more stable the predictions tend to be. The tool computes PR throughput by tracking the number of successful PR moving through the queue over actual execution time (the last iteration may exceed allowed simulation duration). Average wait time is computed using Little's Law.
Build the project:
poetry install
The following shows an example of tuning the queue size to determine optimal configuration for throughput and wait times.
Tip
This scenario demonstrates how oversizing the queue can increase wait times and cost of CI resources.
Simulation 1: Run a simulation with the following parameters to get a baseline:
- queue size of
5
- each job takes
30 minutes
to execute - job failure probability of
10%
- no jobs waiting to enter the queue at any time
poetry run simulate --min-queue-size 5 --max-queue-size 5 --job-duration 30 --failure-probability 0.1
Job duration : 30m
Job failure probability : 0.1
Jobs waiting to enter the queue probability : 0
Jobs waiting to enter the queue : 0
Simulation duration : 10000h
Q Size | Throughput (PR/h) | Avg Time to Merge (m) | Median Time to Merge (m) | PRs Lost (PR/h)
-------|-------------------|-----------------------|--------------------------|-----------------
5 | 7.3 | 48.6 | 30.0 | 2.7
Simulation 2: Run a second simulation that shrinks the queue size to 3 and puts 2 jobs into the queue waiting line with a small probability of that occurrence:
- queue size of
3
- each job takes
30 minutes
to execute (same) - job failure probability of
10%
(same) 10%
probability of2
jobs waiting to enter the queue, representing the reduction in the queue size shifted to queue waiting line
poetry run simulate --min-queue-size 3 --max-queue-size 3 --job-duration 30 --failure-probability 0.1 --jobs_waiting_to_enter 2 --jobs_waiting_to_enter_probability 0.1
Job duration : 30m
Job failure probability : 0.1
Jobs waiting to enter the queue probability : 0.1
Jobs waiting to enter the queue : 2
Simulation duration : 10000h
Q Size | Throughput (PR/h) | Avg Time to Merge (m) | Median Time to Merge (m) | PRs Lost (PR/h)
-------|-------------------|-----------------------|--------------------------|-----------------
3 | 4.9 | 40.0 | 30.0 | 1.1
Note
- There is an expected reduction in throughput due to smaller queue size.
- There's also a significant drop in the average wait time.
- The median number staying the same suggests most workloads are not impacted but few workloads are likely to see an improvement in wait times.
- There's a significat drop in lost PRs which reduces cost in the CI system.
GitHub merge queue uses an optimistic concurrency model. The best way to understand how it works is to visualize a scenario where a queue is building multiple jobs and a failure occurs in the middle of the queue. Let's consider a scenario with 3 PRs:
PR1
builds commitA
which is its head commit.PR2
builds commitB
, and it includesPR1
's commitA
anticipatingPR1
build success.PR3
builds all commitsA
B
andC
.
Queue with 3 PRs:
block-beta
columns 1
block:group1
columns 2
a["PR1"] b["commit A"]
end
block:group2
columns 2
c["PR2"] d["commit A + commit B"]
end
block:group3
columns 2
e["PR3"] f["commit A + commit B + commit C"]
end
When PR2
build fails, it is removed from the queue, and all jobs that include its commit in the queue are restarted.
PR1
is unaffected.PR2
build is removed from the queue.PR3
build is restarted with commitsA
andC
, no commitB
.
Queue goes to 2 PRs:
block-beta
columns 1
block:group1
columns 2
a["PR1"] b["commit A"]
end
block:group2
columns 2
c["PR3"] d["commit A + commit C"]
end