executor logs and concurrency issues #93

Open
Destrocamil opened this issue Nov 21, 2024 · 11 comments

@Destrocamil

Destrocamil commented Nov 21, 2024

Hey!
I'm so glad that gitlab-tart-executor exists, but I am currently facing 2 issues that I could use some help with 👍

I am having issues when trying to run 2 or more jobs simultaneously. The one that starts first works fine. The second one keeps reporting:

The number of VMs exceeds the system limit (other running VMs: gitlab-8437130944, gitlab-8438646366)
Failed to retrieve IP address of VM "gitlab-8438646369" in 60 seconds: tart command returned non-zero exit code: "no IP address found, is your VM running?", will re-try...

My current configuration is

(...)
  [runners.custom]
    prepare_exec = "/opt/homebrew/bin/gitlab-tart-executor"
    prepare_args = ["prepare", "--concurrency", "4", "--cpu","auto", "--memory","auto"]
    run_exec = "/opt/homebrew/bin/gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "/opt/homebrew/bin/gitlab-tart-executor"
    cleanup_args = ["cleanup"]

Because of this issue, I tried to find some logs but found none; at least, I am not aware of where they are.
Thanks!

@edigaryev
Contributor

edigaryev commented Nov 21, 2024

The number of VMs exceeds the system limit

Are you running macOS VMs?

It's not possible to run more than 2 such VMs at a time due to a limitation of the underlying Virtualization.framework.

It is possible, however, to run 2 macOS + many Linux VMs on a single host.

You can try decreasing the --concurrency to 2 to work around this.

@Destrocamil
Author

Thanks for the quick reply.
I changed the concurrency back to 2 but I am still facing the same issue.
2 jobs running,
job 1:

2024/11/22 11:41:50 Cloning and configuring a new VM...
2024/11/22 11:41:50 Waiting for the VM to boot and be SSH-able...
2024/11/22 11:42:02 Was able to SSH!
2024/11/22 11:42:02 Installing GitLab Runner...

job 2:

2024/11/22 11:41:56 Cloning and configuring a new VM...
2024/11/22 11:41:56 Waiting for the VM to boot and be SSH-able...
The number of VMs exceeds the system limit (other running VMs: gitlab-8437130944, gitlab-8445654884)
Failed to retrieve IP address of VM "gitlab-8445655773" in 60 seconds: tart command returned non-zero exit code: "no IP address found, is your VM running?", will re-try...

Am I doing something wrong?!

@edigaryev
Contributor

edigaryev commented Nov 22, 2024

The number of VMs exceeds the system limit (other running VMs: gitlab-8437130944, gitlab-8445654884)

You have more than 2 VMs running, not sure why.

Can you try shutting down the GitLab Runner, and then stopping and deleting the existing gitlab-* VMs by hand (using tart list and tart delete)?
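
The manual cleanup suggested above could be scripted roughly as follows. This is only a sketch, not part of gitlab-tart-executor: the helper names (`stale_vm_names`, `cleanup_commands`, `main`) are made up for illustration, and it assumes that `tart list --quiet` prints one VM name per line and that leftover VMs carry the `gitlab-` prefix seen in the error messages in this thread.

```python
import subprocess

VM_PREFIX = "gitlab-"  # prefix seen in the VM names quoted in this thread


def stale_vm_names(list_output, prefix=VM_PREFIX):
    """Return the names from a one-name-per-line listing that start with prefix."""
    names = [line.strip() for line in list_output.splitlines()]
    return [name for name in names if name.startswith(prefix)]


def cleanup_commands(names):
    """Build the stop and delete commands for each VM without running them."""
    commands = []
    for name in names:
        commands.append(["tart", "stop", name])
        commands.append(["tart", "delete", name])
    return commands


def main():
    # Assumption: `tart list --quiet` prints only VM names, one per line.
    listing = subprocess.run(
        ["tart", "list", "--quiet"],
        capture_output=True, text=True, check=True,
    ).stdout
    for command in cleanup_commands(stale_vm_names(listing)):
        print("running:", " ".join(command))
        # Stopping a VM that is already stopped may fail, so a non-zero
        # exit code is not treated as fatal here.
        subprocess.run(command, check=False)
```

Nothing is executed on import; call `main()` on the runner host, and only after GitLab Runner has been shut down, since this would also remove the VMs of jobs that are still running.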

@Destrocamil
Author

You were right, seems some machine was hanging. Thanks 👍

@Destrocamil
Author

Reopening the issue with more questions!
So, you were completely right both about the limit of VMs running and the need to change the concurrency flag, it did the trick.
However, let's consider this scenario:

  • I trigger 2 parallel builds
  • they start to run without any problem and do their thing
  • after 5 minutes, I trigger a third build. This one hangs because the other 2 are still running, which is expected
  • after another 5 minutes, build 1 finishes. There is now an empty slot for another VM
  • the third build fails to start, even after the 10 × 60 seconds timeout

So, manually, I can see tart stopping and deleting the machines without any issue, but with gitlab-tart-executor I have no idea why it keeps the machines around for some time, even after a few minutes (at least 5). Can I check the logs somewhere? Can I force the cleanup of the machines?

Just for reference, this was my test case:
- triggered the xxx-app and xxx-development-app jobs at the same time (14h09)
- jobs are picked up instantly
- waiting 5 minutes before triggering a new one
- triggered a third job (14h15)
- got the message that there are 2 running VMs (which is expected)
- waiting until one of the first jobs finishes to check how the third one behaves
- the third job waits 60 seconds * 10 before it exits
- 14h18: the xxx-app job finished; the third job in the queue should start within a few seconds (once the retry kicks in)
- xxx-development-app finished building after 11 minutes; the third job remains in the queue
- it seems the cleanup process takes some minutes?
- third job: 8 minutes in the queue, still not picked up
- the third job failed to start
- after 15 minutes, no VMs are running and the third job can be triggered again

Destrocamil reopened this Nov 26, 2024

@edigaryev
Contributor

edigaryev commented Dec 3, 2024

after another 5 minutes, build 1 finishes

  1. Anything fishy in build 1 logs? E.g. Failed to stop VM: [...] or Failed to delete VM: [...] messages?

  2. Can you verify that the VMs that are not automatically cleaned up by GitLab Tart Executor do indeed correspond to already finished jobs with no errors in the jobs logs?

  3. Which Tart version are you running?

@Destrocamil
Author

Destrocamil commented Dec 4, 2024

Thanks for your reply.
For 1, nothing fishy there, just the standard "uploading artifacts" and "job succeeded".
For 2, I will double check that, but I am positive the info corresponds.
For 3, I am using the latest one available from brew.

@Destrocamil
Author

Destrocamil commented Jan 7, 2025

What input do you need from my side, @edigaryev?
For 2, I was able to confirm that corresponds.

@edigaryev
Contributor

For 2, I was able to confirm that corresponds.

Do I understand correctly that you have observed both:

  1. A successfully finished job
  2. An errored job, where the error is The number of VMs exceeds the system limit (other running VMs: gitlab-[...], gitlab-[...]) and one of the VM names in the error points to the job ID of a successfully finished job

?
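
A check like the one described above can be done by hand, but a small helper makes it less error-prone. The following is a rough sketch rather than anything shipped with gitlab-tart-executor: the function names are hypothetical, and it assumes that each VM name in the error message has the form `gitlab-<job ID>`, as the names in this thread suggest.

```python
import re

VM_PREFIX = "gitlab-"
# Captures the comma-separated VM names inside "(other running VMs: ...)".
LIMIT_ERROR = re.compile(r"other running VMs: ([^)]*)\)")


def blocking_job_ids(log_line):
    """Extract job IDs from the VM names listed in the system-limit error."""
    match = LIMIT_ERROR.search(log_line)
    if match is None:
        return []
    names = [name.strip() for name in match.group(1).split(",")]
    # Assumption: the part after the prefix is the GitLab job ID.
    return [name[len(VM_PREFIX):] for name in names if name.startswith(VM_PREFIX)]


def finished_but_still_running(log_line, finished_job_ids):
    """Return the blocking job IDs that belong to already finished jobs."""
    return [job_id for job_id in blocking_job_ids(log_line)
            if job_id in finished_job_ids]
```

Feeding it the error line from the failed job and the set of job IDs that GitLab reports as finished returns the IDs whose VMs were apparently not cleaned up.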

@Destrocamil
Author

Yes, that is correct.
Since this issue is more than 2 months old, I'll start everything from the beginning somewhere around Thursday/Friday and will update here accordingly.

@aofldl

aofldl commented Jan 8, 2025

@Destrocamil when we first started with Apple virtualization we ran into this limit like you did, and sometimes even just starting up 2 VMs manually at a time would give errors like this, but work at other times. Very frustrating.

I know this won't answer your issue, but what we ended up doing for our runners was going with M2 Mac minis and using a concurrency of 1 with all resources given to the VM. This results in a one-job-per-runner setup that has been rock-solid stable and also faster, as resources are essentially dedicated.

Our original plan of a couple of beefy machines to run the jobs was shot due to Apple virtualization limits. We are also not a company doing 5000 builds an hour, so this works for us.

Tart has been a godsend versus our old setup. Love it.
