
Gitlab CI Autoscaling Setup


Fixed-size Openstack Setup

Since we got rid of our Kubernetes cluster (which was also on AWS and so still expensive), we now use a fixed-size collection of Gitlab runners living on our Openstack cluster, whose hardware we have already bought.

These are set up using methods developed for the vg project.

New shared runners

We have shared Gitlab runners that run one task at a time on 8 cores, set up like this:

SSH_KEY_NAME=anovak-swords
SERVER_NAME=anovak-gitlab-runner-shared-6
FLAVOR=m1.medium

openstack --os-cloud openstack server create --image ubuntu-22.04-LTS-x86_64 --flavor ${FLAVOR} --key-name ${SSH_KEY_NAME} --wait ${SERVER_NAME}
# There is no way to find a free floating IP that already exists without fighting over it.
# Assignment steals the IP if it was already assigned elsewhere.
# See <https://stackoverflow.com/q/36497218>
IP_ID=$(openstack --os-cloud openstack floating ip create ext-net --format value --column id | head -n1)
openstack --os-cloud openstack server add floating ip ${SERVER_NAME} ${IP_ID}
sleep 60
INSTANCE_IP="$(openstack --os-cloud openstack floating ip show ${IP_ID} --column floating_ip_address --format value)"
ssh-keygen -R ${INSTANCE_IP}
ssh ubuntu@${INSTANCE_IP}
sudo su -

# Runner token for this new runner, from the Gitlab web UI:
RUNNER_TOKEN=

systemctl stop docker.socket || true
systemctl stop docker.service || true
systemctl stop ephemeral-setup.service || true
rm -Rf /var/lib/docker

cat >/etc/systemd/system/ephemeral-setup.service <<'EOF'
[Unit]
Description=bind mounts ephemeral directories
Before=docker.service
Requires=mnt.mount
After=mnt.mount

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=mkdir -p /mnt/ephemeral/var/lib/docker
ExecStart=mkdir -p /var/lib/docker
ExecStart=mount --bind /mnt/ephemeral/var/lib/docker /var/lib/docker
ExecStop=umount /var/lib/docker

[Install]
RequiredBy=docker.service
EOF

systemctl daemon-reload

systemctl enable ephemeral-setup.service
systemctl start docker.socket || true
systemctl start docker.service || true

TASK_MEMORY=25G
TASKS_PER_NODE=1
CPUS_PER_TASK=8
bash -c "export DEBIAN_FRONTEND=noninteractive; curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash"
DEBIAN_FRONTEND=noninteractive apt update && sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y
DEBIAN_FRONTEND=noninteractive apt install -y docker.io gitlab-runner
gitlab-runner register --non-interactive --url https://ucsc-ci.com --token "${RUNNER_TOKEN}" --limit "${TASKS_PER_NODE}" --executor docker --docker-privileged --docker-memory "${TASK_MEMORY}" --docker-cpus "${CPUS_PER_TASK}" --docker-image docker:dind
sed -i "s/concurrent = 1/concurrent = ${TASKS_PER_NODE}/g" /etc/gitlab-runner/config.toml
echo "  output_limit = 40960" >>/etc/gitlab-runner/config.toml
gitlab-runner restart
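
It's worth sanity-checking that the new runner actually registered and can reach the server; the stock gitlab-runner CLI has commands for this (run as root on the runner VM):

# List the runners configured on this machine and check that they can contact the server.
gitlab-runner list
gitlab-runner verify
# Watch the service log while a test job runs.
journalctl -u gitlab-runner -f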

New runners should probably be set up as shared runners to prevent resources from being reserved but idle.

Old dedicated Toil runners

These have been destroyed and are no longer used.

Server creation at one point used:

SSH_KEY_NAME=anovak-swords
SERVER_NAME=anovak-gitlab-runner-toil-2
openstack --os-cloud openstack server create --image ubuntu-22.04-LTS-x86_64 --flavor m1.huge --key-name ${SSH_KEY_NAME} --wait ${SERVER_NAME}
while true ; do
    IP_ID=$(openstack --os-cloud openstack floating ip list --long --status DOWN --network ext-net --format value --column ID | head -n1)
    while [[ "${IP_ID}" == "" ]] ; do
        openstack --os-cloud openstack floating ip create ext-net
        IP_ID=$(openstack --os-cloud openstack floating ip list --long --status DOWN --network ext-net --format value --column ID | head -n1)
    done
    openstack --os-cloud openstack server add floating ip ${SERVER_NAME} ${IP_ID} || continue
    break
done
INSTANCE_IP="$(openstack --os-cloud openstack floating ip show ${IP_ID} --column floating_ip_address --format value)"
sleep 60
ssh ubuntu@${INSTANCE_IP}
TASK_MEMORY=15G
TASKS_PER_NODE=7
bash -c "export DEBIAN_FRONTEND=noninteractive; curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash"
DEBIAN_FRONTEND=noninteractive apt update && sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y
DEBIAN_FRONTEND=noninteractive apt install -y docker.io gitlab-runner
gitlab-runner register --non-interactive --url https://ucsc-ci.com --token "${RUNNER_TOKEN}" --limit "${TASKS_PER_NODE}" --executor docker --docker-privileged --docker-memory "${TASK_MEMORY}" --docker-image docker:dind
sed -i "s/concurrent = 1/concurrent = ${TASKS_PER_NODE}/g" /etc/gitlab-runner/config.toml
gitlab-runner restart

This is not recommended anymore; use the simpler code on the vg wiki instead, which can't accidentally steal IP addresses through a race condition. But note the flavor used here and the lack of a CPU limit for each task.
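
For completeness, tearing one of these runners down looked roughly like this (a sketch, reusing the SERVER_NAME and IP_ID variables from the creation commands above):

# On the runner VM: deregister this machine's runners from the Gitlab server.
gitlab-runner unregister --all-runners
# From a machine with Openstack access: delete the VM and release its floating IP.
openstack --os-cloud openstack server delete ${SERVER_NAME}
openstack --os-cloud openstack floating ip delete ${IP_ID}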

Old Kubernetes-based setup

Since we were spending too much money with our AWS-based CI setup, I deployed a new set of autoscaling Gitlab runners on our Kubernetes cluster, the compute for which we buy in bulk at a lower cost.

To do this, I followed this tutorial on how to do it using a Helm chart. This is pretty easy with Helm 3, since Helm doesn't actually need to be pre-installed on the cluster. The main problems come from the inability to make arbitrary configuration changes; you can only do what the chart supports.

The basic structure is the same as in the AWS deployment: a persistent runner runs (this time as a Kubernetes pod), signs up to do jobs, and then runs the jobs in their own containers (this time as Kubernetes pods).

With this setup, the ENTRYPOINT of the Docker container that the .gitlab-ci.yml file asks to run in never runs, so I made some changes to quay.io/vgteam/dind and quay.io/vgteam/vg_ci_prebake to provide startdocker and stopdocker commands to start/stop the daemon in the container. I added these to the .gitlab-ci.yml of Toil:

image: quay.io/vgteam/vg_ci_prebake:latest

before_script:
  - startdocker || true
...

after_script:
  - stopdocker || true

Unfortunately, for reasons I have not yet been able to work out, starting Docker this way in a Kubernetes container requires the container to be privileged. When starting via the ENTRYPOINT on ordinary non-Kubernetes Docker this isn't the case, so in theory we should be able to overcome it, but I just gave up and let the containers run as privileged, which we can do on our own cluster.

I also had to adjust the Toil tests to not rely on having AWS access via an IAM role assigned to the hosting instances. The Kubernetes pods don't run on machines with IAM roles that they can use, but there's also no way to tell Gitlab to mount the Kubernetes secret we usually use for AWS access inside the CI jobs' containers. I switched everything to use Gitlab-managed secret credential files instead, both for AWS access to actually test AWS, and for the credentials we formerly kept in the AWS secrets manager.
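
The general pattern is to store each credentials file as a File-type CI/CD variable in Gitlab and copy it into place at the start of every job. A sketch of the before_script commands, where AWS_CREDENTIALS is a hypothetical variable name and not necessarily what the Toil CI actually uses:

# Gitlab exposes a File-type variable as an environment variable whose value is
# the path to a temporary file holding the secret's contents.
mkdir -p ~/.aws
cp "${AWS_CREDENTIALS}" ~/.aws/credentials
chmod 600 ~/.aws/credentials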

To actually set up the Gitlab runner, I grabbed the runner registration token for the Gitlab entity I wanted to own the runner (in this case, the DataBiosphere organization) and made a values.yml file with it to configure the Helm chart:

checkInterval: 30
concurrent: 20
imagePullPolicy: Always
rbac:
  create: false
  serviceAccountName: toil-svc
gitlabUrl: https://ucsc-ci.com/
runnerRegistrationToken: "!!!PASTE TOKEN HERE!!!"
runners:
  config: |
    [[runners]]
      name = "Kubernetes Runner"
      output_limit = 40960
      [runners.kubernetes]
        namespace = "toil"
        image = "quay.io/vgteam/vg_ci_prebake"
        poll_timeout = 86400
        privileged = true
        service_account = "toil-svc"
        cpu_limit = "4000m"
        cpu_request = "4000m"
        memory_limit = "15Gi"
        memory_request = "15Gi"
        ephemeral_storage_limit = "20Gi"
        ephemeral_storage_request = "20Gi"
        service_cpu_limit = "4000m"
        service_cpu_request = "4000m"
        service_memory_limit = "8Gi"
        service_memory_request = "8Gi"
        service_ephemeral_storage_limit = "20Gi"
        service_ephemeral_storage_request = "20Gi"
        helper_cpu_limit = "500m"
        helper_cpu_request = "500m"
        helper_memory_limit = "256M"
        helper_memory_request = "256M"
        helper_ephemeral_storage_limit = "20Gi"
        helper_ephemeral_storage_request = "20Gi"

Or in the old format:

cat >values.yml <<EOF
imagePullPolicy: Always
gitlabUrl: "https://ucsc-ci.com/"
runnerRegistrationToken: "!!!PASTE_TOKEN_HERE!!!"
concurrent: 10
checkInterval: 30
rbac:
  create: false
  serviceAccountName: toil-svc
runners:
  image: "quay.io/vgteam/vg_ci_prebake"
  privileged: true
  pollTimeout: 86400
  outputLimit: 40960
  namespace: toil
  serviceAccountName: toil-svc
  builds:
    cpuLimit: 4000m
    memoryLimit: 15Gi
    cpuRequests: 4000m
    memoryRequests: 15Gi
  services:
    cpuLimit: 4000m
    memoryLimit: 15Gi
    cpuRequests: 4000m
    memoryRequests: 15Gi
EOF

I upped the pollTimeout substantially from what was given in the tutorial, because the runner isn't clever enough to work out when the Kubernetes cluster is busy. It will happily sign up for 10 jobs at once from Gitlab, and then not be able to get any of its pods to start because there's no room. Upping the timeout lets it wait for a long time with the pods in Kubernetes' queue.

Note that for Helm chart versions 0.23+ you should, and for versions 1.0+ you MUST, use the new configuration syntax, like this example for the VG runner:

checkInterval: 30
concurrent: 10
imagePullPolicy: Always
rbac:
  create: false
  serviceAccountName: vg-svc
gitlabUrl: https://ucsc-ci.com/
runnerRegistrationToken: "!!!PASTE_TOKEN_HERE!!!"
runners:
  cache:
    secretName: shared-s3-credentials-literal
  config: |
    [[runners]]
      name = "Kubernetes Runner"
      output_limit = 40960
      [runners.kubernetes]
        namespace = "vg"
        image = "quay.io/vgteam/vg_ci_prebake"
        poll_timeout = 86400
        privileged = true
        service_account = "vg-svc"
        cpu_limit = "8000m"
        cpu_request = "8000m"
        memory_limit = "25Gi"
        memory_request = "25Gi"
        ephemeral_storage_limit = "35Gi"
        ephemeral_storage_request = "10Gi"
        service_cpu_limit = "4000m"
        service_cpu_request = "4000m"
        service_memory_limit = "2Gi"
        service_memory_request = "2Gi"
        service_ephemeral_storage_limit = "35Gi"
        service_ephemeral_storage_request = "10Gi"
        helper_cpu_limit = "500m"
        helper_cpu_request = "500m"
        helper_memory_limit = "256M"
        helper_memory_request = "256M"
        helper_ephemeral_storage_limit = "35Gi"
        helper_ephemeral_storage_request = "10Gi"
      [runners.cache]
        Type = "s3"
        Path = "vg_ci/cache"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "vg-data"
          BucketLocation = "us-west-2"
          Insecure = false
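
The shared-s3-credentials-literal secret named under runners.cache has to exist in the vg namespace before the chart is installed. A sketch of how such a secret can be created (the accesskey/secretkey key names are what the chart's S3 cache support expects; the environment variables here are placeholders for the real credentials):

kubectl -n vg create secret generic shared-s3-credentials-literal \
  --from-literal=accesskey="${AWS_ACCESS_KEY_ID}" \
  --from-literal=secretkey="${AWS_SECRET_ACCESS_KEY}"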

After making the values.yml file, I used Helm 3 to deploy:

helm install --namespace toil gitlab-toil-kubernetes-runner -f values.yml gitlab/gitlab-runner

For this to work, I had to have access to create configmaps resources on the cluster, which Erich hadn't granted yet.

An Error: Kubernetes cluster unreachable message can be solved by adding --kubeconfig ~/.kube/path.to.your.config to the helm command.
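
To confirm the deployment actually came up, commands like these are enough (the release and namespace names match the install command above):

# Check the release and make sure the runner pod reaches Running.
helm -n toil status gitlab-toil-kubernetes-runner
kubectl -n toil get pods
# Tail the logs of whichever runner pod appears, to watch it register and poll for jobs.
kubectl -n toil logs -f "$(kubectl -n toil get pods -o name | grep gitlab-runner | head -n1)"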

I had to tweak the values.yml a few times to get it working. To apply the changes, I ran:

helm upgrade --recreate-pods --namespace toil gitlab-toil-kubernetes-runner -f values.yml gitlab/gitlab-runner

Note that every time you do this (or the pod restarts) it registers a new runner with Gitlab and gets rid of the old one. So if you had it paused before, it will become unpaused and start running jobs.

Also, to make this work, I had to get Erich to set a sensible default disk request and limit of 10 GB on the namespace. The Helm chart only allows you to set the CPU and memory requests and limits, and was using a disk limit of 500 Mi, which was much too small. Unfortunately, this setting has to be configured at the namespace level, so it affects anything else in the namespace that doesn't specify its own disk request/limit. This setting is now configured on the toil and vg namespaces.
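
For reference, a namespace-wide default like that is the kind of thing a cluster admin sets with a Kubernetes LimitRange. A sketch of what the 10 GB default might look like (not the actual manifest that was applied, and the object name here is made up):

kubectl -n toil apply -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-ephemeral-storage
spec:
  limits:
    - type: Container
      default:
        ephemeral-storage: 10Gi
      defaultRequest:
        ephemeral-storage: 10Gi
EOF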

Updating

If you need to update the runner (for example, to change a registration token, or to upgrade to a new Gitlab), you can:

  1. Do helm -n toil list to find the release to update. We will use gitlab-toil-kubernetes-runner in our example.
  2. Get the existing configuration with helm -n toil get values -o yaml gitlab-toil-kubernetes-runner >values.yml.
  3. Do helm repo update to make sure your gitlab repo is up to date. If you are on a different machine than the one you originally deployed from, you might need to add the gitlab repo first, as explained in the Gitlab documentation (see the example after this list).
  4. Upgrade to the newest version of the chart: helm -n toil upgrade --recreate-pods gitlab-toil-kubernetes-runner -f values.yml gitlab/gitlab-runner
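
Adding the gitlab chart repo on a fresh machine (referenced in step 3) looks like this:

# Register the Gitlab chart repository and refresh the local index.
helm repo add gitlab https://charts.gitlab.io
helm repo update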

Old Old AWS-Based Setup

I have set up an autoscaling Gitlab runner on AWS to run multiple tests in parallel.

I am basically following the tutorial at https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/

The tutorial has you create a "bastion" instance, on which you install the Gitlab Runner, using the "docker+machine" runner type. Then the bastion instance uses Docker Machine to create and destroy other instances to do the actual testing, as needed, but from the Gitlab side it looks like a single "runner" executing multiple tests.

I created a t2.micro instance named gitlab-ci-bastion, in the gitlab-ci-runner security group, with the gitlab-ci-runner IAM role, using the Ubuntu 18.04 image. I gave it a 20 GB root volume. I protected it from termination. It got IP address 54.218.250.217.

I made sure to authorize the "ci" SSH key to access it, in ~/.ssh/authorized_keys.
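
Authorizing the key just means appending its public half to that file (the filename here is a placeholder):

# Append the shared "ci" public key to the ubuntu user's authorized keys on the bastion.
cat ci_key.pub >> ~/.ssh/authorized_keys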

Then I installed Gitlab Runner and Docker. I had to run each command separately; copy-pasting the whole block did not work.

curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash

sudo apt-get -y -q install gitlab-runner

sudo apt-get -y -q install docker.io

sudo usermod -a -G docker gitlab-runner 

sudo usermod -a -G docker ubuntu 

Then I installed Docker Machine. Version 0.16.1 was current:

curl -L https://github.com/docker/machine/releases/download/v0.16.1/docker-machine-`uname -s`-`uname -m` >/tmp/docker-machine &&
chmod +x /tmp/docker-machine &&
sudo mv /tmp/docker-machine /usr/local/bin/docker-machine

Then I disconnected and SSHed back in. At that point I could successfully run docker ps.
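
A few standard commands to confirm everything is installed and the group change took effect:

docker ps
docker-machine version
gitlab-runner --version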

Then I went and got the Gitlab registration token from the Gitlab web UI. I decided to register the runner to the DataBiosphere group, instead of just the Toil project.

Then I registered the Gitlab Runner with the main Gitlab server, substituting the real token for ##CENSORED## below.

sudo gitlab-ci-multi-runner register -n \
  --url https://ucsc-ci.com/ \
  --registration-token ##CENSORED## \
  --executor docker+machine \
  --description "docker-machine-runner" \
  --docker-image "quay.io/vgteam/dind" \
  --docker-privileged

As soon as the runner registered with the Gitlab server, I found it in the web UI and paused it, so it wouldn't start trying to run jobs until I had it configured properly.

I also at some point updated the packages on the bastion machine:

sudo apt update && sudo apt upgrade -y

I edited the /etc/gitlab-runner/config.toml file to actually configure the runner. After a bit of debugging, I got it looking like this.

# Let the runner run 10 jobs in parallel
concurrent = 10
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "docker-machine-runner"
  url = "https://ucsc-ci.com/"
  # Leave the pre-filled value here from your config.toml, or replace
  # with the registration token you are using if copy-pasting this one.
  token = "##CENSORED##"
  executor = "docker+machine"
  # Run no more than 10 machines at a time.
  limit = 10
  [runners.docker]
    tls_verify = false
    # We reuse this image because it is Ubuntu with Docker 
    # available and virtualenv installed.
    image = "quay.io/vgteam/vg_ci_prebake"
    # t2.xlarge has 16 GB
    memory = "15g"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
  [runners.machine]
    IdleCount = 0
    IdleTime = 60
    # Max builds per machine before recreating
    MaxBuilds = 10
    MachineDriver = "amazonec2"
    MachineName = "gitlab-ci-machine-%s"
    MachineOptions = [
      "amazonec2-iam-instance-profile=gitlab-ci-runner",
      "amazonec2-region=us-west-2",
      "amazonec2-zone=a",
      "amazonec2-use-private-address=true",
      # Make sure to fill in your own owner details here!
      "amazonec2-tags=Owner,[email protected],Name,gitlab-ci-runner-machine",
      "amazonec2-security-group=gitlab-ci-runner",
      "amazonec2-instance-type=t2.xlarge",
      "amazonec2-root-size=80"
    ]

To enable this to work, I had to add some IAM policies to the gitlab-ci-runner role. It already had the AWS built-in AmazonS3ReadOnlyAccess, to let the tests read test data from S3. I gave it the AWS built-in AmazonEC2FullAccess to allow the bastion to create the machines. I also gave it gitlab-ci-runner-passrole, which I had to talk cluster-admin into creating for me, which allows the bastion to pass on the gitlab-ci-runner role to the machines it creates. That policy had the following contents:

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "VisualEditor0",
           "Effect": "Allow",
           "Action": "iam:PassRole",
           "Resource": "arn:aws:iam::719818754276:role/gitlab-ci-runner"
       }
   ]
}

After getting all the policies attached to the role, I rebooted the bastion machine to get it to actually start up the Gitlab Runner daemon:

sudo shutdown -r now

Then when it came back up I unpaused it in the Gitlab web interface, and it started running jobs. A few jobs failed, and to debug them I set the docker image to the vg_ci_prebake that vg uses (to provide packages like python-virtualenv) and added python3-dev to the packages that that image carries.

Docker Maintenance

To make more changes to the image, commit to https://github.com/vgteam/vg_ci_prebake and Quay will automatically rebuild it. If you don't have rights to do that and don't want to wait around for a PR, clone the repo, edit it, and make a new Quay project to build your own version.

Future Work

One change I have not yet made is to set a high output_limit, as described in https://stackoverflow.com/a/53541010, in case the CI logs get too long.
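
A sketch of how that would look on the bastion, using sed to drop the setting directly under the [[runners]] header so it lands in the right TOML table (the value is in kilobytes, so 40960 is about 40 MB, matching the other runners above):

# Insert output_limit right after the [[runners]] section header.
sudo sed -i '/^\[\[runners\]\]/a output_limit = 40960' /etc/gitlab-runner/config.toml
# Restart so the new limit takes effect.
sudo gitlab-runner restart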

I also have not yet destroyed the old shell runner. I want to leave it in place until we are confident in the new system.

Github Web Hooks

It's also useful to connect your Github repo to your Gitlab repo with a web hook on the Github side, to speed up pull mirroring. The Gitlab docs for how to do this are here, starting from step 4, where you make a token on Gitlab and configure a hook to be called by Github. We've been using a Gitlab user named "vgbot" with sufficient access to each project to refresh the mirroring, and getting access tokens for it using Gitlab administrator powers.