
Concurrency issue #1521

Closed
bitchecker opened this issue Sep 7, 2024 · 22 comments
@bitchecker
Contributor

Describe the bug
Hi,
using the provider to create multiple resources at the same time from the same template returns a timeout error:

│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:0035D677:03AB00C3:66DC5987:qmclone:132:root@pam:" failed to complete with exit code: can't lock file '/var/lock/qemu-server/lock-133.conf' - got timeout
│ 

To Reproduce
I created a module that manages the proxmox_virtual_environment_vm resource, and in my final code I invoke that module 4 times.

The plan goes through fine, showing all the resources that will be created, but when I run an apply I get the reported error.

To work around the error, I need to run apply multiple times (getting the reported error N-1 times) until all the resources are created.

  • Single or clustered Proxmox: single
  • Proxmox version: 8.2.4
  • Provider version (ideally it should be the latest version): latest
  • Terraform/OpenTofu version: v1.9.5
  • OS (where you run Terraform/OpenTofu from): Fedora 40
@bitchecker bitchecker added the 🐛 bug Something isn't working label Sep 7, 2024
@bpg
Owner

bpg commented Sep 8, 2024

Hi @bitchecker

Seems to be similar to #995. You can try playing with the "parallelism" parameter, or perhaps updating your scsi_hardware to virtio-scsi-single.
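For reference, a minimal sketch of both suggestions (only scsi_hardware and the clone source come from this thread; the resource name, node name and other values are placeholders):

resource "proxmox_virtual_environment_vm" "example" {
  name          = "example-vm"
  node_name     = "proxmox"            # placeholder node name
  scsi_hardware = "virtio-scsi-single" # instead of the default controller

  clone {
    vm_id = 132 # the template ID, mirroring the task output above; adjust as needed
  }
}

Parallelism is set on the CLI, e.g. terraform apply -parallelism=1.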

@bitchecker
Contributor Author

Hi @bpg,
with the scsi_hardware config the situation is a little better, but the issue is still present!

Of course, setting the parallelism option is a good workaround, even though build times are longer and the final result isn't great to look at:
[screenshot]

@bpg
Owner

bpg commented Sep 11, 2024

@bitchecker, have you tried using the virtio disk interface as described here?

Ultimately, this is not an issue with the provider but rather a bottleneck in the PVE I/O subsystem, which is exacerbated by Terraform’s parallel provisioning of VMs.

You could also try moving your VM source (template or disk image) to a different physical datastore, if you have that option. I have found that doing so drastically improves the performance of VM creation when I create and destroy dozens of them in acceptance tests.
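For illustration, a sketch of what switching the disk to the virtio interface might look like (attribute names follow the provider's disk block; the datastore and size values are placeholders):

resource "proxmox_virtual_environment_vm" "example" {
  # ...

  disk {
    interface    = "virtio0"   # instead of "scsi0"
    datastore_id = "local-zfs" # placeholder datastore
    size         = 20          # placeholder size in GB
  }
}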

@bpg bpg added the ⌛ pending author's response Requested additional information from the reporter label Sep 11, 2024
@bitchecker
Contributor Author

Hi @bpg,
no, I haven't tried the virtio disk interface yet, because that would be a breaking change in the module and would destroy and rebuild all guests!

I don't think this can be related to an I/O issue, because the server is running on NVMe drives. While running apply I can see that the provider tries to assign the same ID more than once... so the error makes sense, because you can't have multiple guests with the same ID.

@bitchecker
Contributor Author

Testing on a test VM, I get this error:

module.test.proxmox_virtual_environment_vm.virtual_machine[0]: Modifying... [id=105]
╷
│ Error: deletion of disks not supported. Please delete disk by hand. Old interface was "scsi0"
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[0],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵

@bitchecker
Contributor Author

Building a new machine, I get the same error with the virtio interface as well:

│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:001D0CCF:021A2E8B:66E32A62:qmclone:132:root@pam:" failed to complete with exit code: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[3],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵
╷
│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:001D0CD1:021A2E9D:66E32A63:qmclone:132:root@pam:" failed to complete with exit code: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[4],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵
╷
│ Error: error updating VM: received an HTTP 599 response - Reason: Too many redirections
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[0],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵

So nothing changes between scsi0 and virtio0!

@bpg
Owner

bpg commented Sep 16, 2024

I don't think this can be related to an I/O issue, because the server is running on NVMe drives. While running apply I can see that the provider tries to assign the same ID more than once... so the error makes sense, because you can't have multiple guests with the same ID.

Do VM templates have fixed vm_ids, or are they dynamically generated by the provider (e.g., when the vm_id is omitted)?

@bitchecker
Contributor Author

The VM template is of course the same for all the machines and does not have a fixed ID; Proxmox assigned one when I created it.

For the VMs we're using the same logic... in the code I request 4 machines and let Proxmox assign the VMIDs.

@bpg
Owner

bpg commented Sep 17, 2024

Ah, that could be the reason. Under the hood, the provider retrieves the VM IDs from PVE before creating the VMs. This operation is not atomic, so when multiple VMs without IDs are being created in parallel, there’s a chance of assigning the same ID to two or more VMs.

There are a few reasons why the provider handles it this way instead of relying on PVE to allocate the IDs. Let me see if this can be improved.

@bpg bpg added acknowledged and removed ⌛ pending author's response Requested additional information from the reporter labels Sep 17, 2024
@bitchecker
Contributor Author

Hi,
I also think that the main reason could be the API calls being spawned in parallel, as I suggested here.

@bpg
Owner

bpg commented Sep 18, 2024

Hm... the problem lies with PVE's clone API: it requires a new VM ID. The same is reflected in the UI, where the VM ID is mandatory:
[screenshot]

This requirement has driven the current implementation. As a result, any VM creation, whether new or clone, first obtains an ID (or uses the provided one if it's set in the configuration) before calling the create or clone API.

From the initial error message, I assume you're cloning a template in your use case. So if you don't want to use parallelism = 1, I would suggest reworking your configuration to generate the vm_id using a count argument or something similar. For example:

resource "proxmox_virtual_environment_vm" "test_vm" {
  count     = 3
  name      = "test-vm-${count.index}"
  vm_id     = 100 + count.index

  clone {
    vm_id = 123
  }
  ...
}

The regular VM create can be improved to make it more reliable, and I'll address it at some point.

@bpg bpg self-assigned this Sep 22, 2024
@bpg
Owner

bpg commented Sep 30, 2024

@bpg
Owner

bpg commented Oct 4, 2024

The duplicate IDs should be mitigated by #1557.
That said, I'm not 100% confident that the ID duplicates are the root cause of the observed behaviour. FWIW, this could be a coincidence.

Anyway, I'm going to close this ticket. Please test with the new 0.66 release and reopen if the issue is still there.

@bpg bpg closed this as completed Oct 4, 2024
@bitchecker
Contributor Author

Hi @bpg,
I can't find any updates in the documentation (https://registry.terraform.io/providers/bpg/proxmox/latest/docs/resources/virtual_environment_vm).

Another point: in your PR I can see that with or without random_vm_ids = true the result is the same 😅
The only difference is the completely random IDs.

@bpg
Owner

bpg commented Oct 4, 2024

Hi @bitchecker

I've added a new section here, as this is a provider-wide feature.
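For reference, a sketch of enabling the option in the provider configuration (the endpoint and credentials are placeholders; only random_vm_ids comes from this thread):

provider "proxmox" {
  endpoint = "https://pve.example.com:8006/" # placeholder endpoint
  # ... authentication settings ...

  random_vm_ids = true
}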

Another point: in your PR I can see that with or without random_vm_ids = true the result is the same.

Just to confirm, "result is the same" means you've tried the new version in your environment, and you see the same error `can't lock file '/var/lock/qemu-server/lock-XXXX.conf'`?

@bitchecker
Contributor Author

Hi @bpg,

Just to confirm, "result is the same" means you've tried the new version in your environment, and you see the same error `can't lock file '/var/lock/qemu-server/lock-XXXX.conf'`?

Oh, no... I was just pointing out that the first output you posted (without random_vm_ids) runs without any error 😅

@bitchecker
Contributor Author

I just tried adding the new option and I can confirm there are no regressions on existing guests!

@bitchecker
Contributor Author

bitchecker commented Oct 10, 2024

Hi @bpg,
I just tried to spawn 10 guests with random_vm_ids enabled in the provider section, but the issue is still present!

Please re-open this issue.

@bpg
Owner

bpg commented Oct 10, 2024

Hi @bpg,

I just tried to spawn 10 guests with random_vm_ids enabled in the provider section, but the issue is still present!

Please re-open this issue.

What error do you see? There were two different lock errors in the previous reports: a lock on the VM config and a lock on the storage. There was also the "599 response" error, which is a different issue.

@bitchecker
Contributor Author

From what I can see, it seems the "lock problem" has now moved to the template clone operation:

╷
│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:0024DE49:1099932D:6708478D:qmclone:120:root@pam:" failed to complete with exit code: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[2],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵
╷
│ Error: error updating VM: received an HTTP 500 response - Reason: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[6],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵

@bitchecker
Contributor Author

Hi @bpg,
any news on this?

@bpg
Owner

bpg commented Oct 25, 2024

Hey @bitchecker,

I was able to intermittently reproduce this storage lock error during tests with high concurrency—specifically, when creating 8 or more machines from the same clone on the same physical storage. However, this issue is mitigated when clones/target VMs are distributed across different storage devices or when using high-throughput Ceph storage, as in my case.

In my opinion, this error points to an I/O bottleneck within the PVE storage subsystem. The most effective solution seems to be reducing the throughput on the storage devices that are experiencing the lock. I recommend utilizing the parallelism setting or moving the clone source to a different storage to alleviate the issue.

While I could add retry logic to the provider’s code for handling the clone operation, it won't guarantee reliability. It’s difficult to predict the duration of the storage lock and determine an appropriate retry strategy. Another option would be to throttle parallel cloning within the provider, but that would essentially replicate the parallelism functionality that Terraform already offers.
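For example, a sketch of pointing clones at a different target datastore, combined with reduced parallelism on the CLI (the datastore_id option inside the clone block and the datastore name are assumptions/placeholders, not confirmed in this thread):

resource "proxmox_virtual_environment_vm" "example" {
  # ...

  clone {
    vm_id        = 132
    datastore_id = "other-nvme" # assumed option: clone onto a different datastore than the template's
  }
}

Reduced concurrency is then set on the CLI, e.g. terraform apply -parallelism=2.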
