
Concurrency issue #1521

Closed
bitchecker opened this issue Sep 7, 2024 · 22 comments
@bitchecker
Contributor

Describe the bug
Hi,
using the provider to create multiple resources at the same time from the same template returns a timeout error:

│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:0035D677:03AB00C3:66DC5987:qmclone:132:root@pam:" failed to complete with exit code: can't lock file '/var/lock/qemu-server/lock-133.conf' - got timeout
│ 

To Reproduce
I created a module that manages the proxmox_virtual_environment_vm resource, and in my final code I invoke that module 4 times.

The plan goes through fine, showing all the resources that will be created, but when I run an apply I get the reported error.

To work around the error, I need to run apply multiple times (getting the reported error N-1 times) until all the resources are created.

  • Single or clustered Proxmox: single
  • Proxmox version: 8.2.4
  • Provider version (ideally it should be the latest version): latest
  • Terraform/OpenTofu version: v1.9.5
  • OS (where you run Terraform/OpenTofu from): Fedora 40
@bitchecker bitchecker added the 🐛 bug Something isn't working label Sep 7, 2024
@bpg
Owner

bpg commented Sep 8, 2024

Hi @bitchecker

Seems to be similar to #995. You can try playing with the "parallelism" parameter, or perhaps updating your scsi_hardware to virtio-scsi-single.
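For reference, a minimal sketch of both suggestions (only scsi_hardware and the clone source come from this thread; the resource name, node name and other values are placeholders):

resource "proxmox_virtual_environment_vm" "example" {
  name          = "example-vm"
  node_name     = "proxmox"            # placeholder node name
  scsi_hardware = "virtio-scsi-single" # instead of the default controller

  clone {
    vm_id = 132 # the template ID, mirroring the task output above; adjust as needed
  }
}

Parallelism is set on the CLI, e.g. terraform apply -parallelism=1.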

@bitchecker
Contributor Author

Hi @bpg,
with the scsi_hardware config the situation is a little better, but the issue is still present!

Of course, setting the parallelism option is a good workaround, even though build times are longer and the final result isn't great to look at:
[screenshot]

@bpg
Owner

bpg commented Sep 11, 2024

@bitchecker, have you tried using the virtio disk interface as described here?

Ultimately, this is not an issue with the provider but rather a bottleneck in the PVE I/O subsystem, which is exacerbated by Terraform’s parallel provisioning of VMs.

You could also try moving your VM source (template or disk image) to a different physical datastore, if you have that option. I have found that doing so drastically improves the performance of VM creation when I create and destroy dozens of them in acceptance tests.
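For illustration, a sketch of what switching the disk to the virtio interface might look like (attribute names follow the provider's disk block; the datastore and size values are placeholders):

resource "proxmox_virtual_environment_vm" "example" {
  # ...

  disk {
    interface    = "virtio0"   # instead of "scsi0"
    datastore_id = "local-zfs" # placeholder datastore
    size         = 20          # placeholder size in GB
  }
}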

@bpg bpg added the ⌛ pending author's response Requested additional information from the reporter label Sep 11, 2024
@bitchecker
Contributor Author

Hi @bpg,
no, I haven't tried the virtio disk interface yet, because that would be a breaking change in the module and would destroy and rebuild all guests!

I don't think this can be related to an I/O issue, because the server is running on NVMe drives. While running apply I can see that the provider tries to assign the same ID more than once... so the error makes sense, because you can't have multiple guests with the same ID.

@bitchecker
Contributor Author

Testing on a test VM, I get this error:

module.test.proxmox_virtual_environment_vm.virtual_machine[0]: Modifying... [id=105]
╷
│ Error: deletion of disks not supported. Please delete disk by hand. Old interface was "scsi0"
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[0],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵

@bitchecker
Contributor Author

Building a new machine, I get the same error with the virtio interface as well:

│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:001D0CCF:021A2E8B:66E32A62:qmclone:132:root@pam:" failed to complete with exit code: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[3],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵
╷
│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:001D0CD1:021A2E9D:66E32A63:qmclone:132:root@pam:" failed to complete with exit code: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[4],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵
╷
│ Error: error updating VM: received an HTTP 599 response - Reason: Too many redirections
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[0],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵

So nothing changes between scsi0 and virtio0!

@bpg
Owner

bpg commented Sep 16, 2024

I don't think this can be related to an I/O issue, because the server is running on NVMe drives. While running apply I can see that the provider tries to assign the same ID more than once... so the error makes sense, because you can't have multiple guests with the same ID.

Do VM templates have fixed vm_ids, or are they dynamically generated by the provider (e.g., when the vm_id is omitted)?

@bitchecker
Contributor Author

The VM template is of course the same for all the machines and does not have a fixed ID; Proxmox assigned one when I created it.

For the VMs we're using the same logic... in the code I request 4 machines and let Proxmox assign the VMIDs.

@bpg
Owner

bpg commented Sep 17, 2024

Ah, that could be the reason. Under the hood, the provider retrieves the VM IDs from PVE before creating the VMs. This operation is not atomic, so when multiple VMs without IDs are being created in parallel, there’s a chance of assigning the same ID to two or more VMs.

There are a few reasons why the provider handles it this way instead of relying on PVE to allocate the IDs. Let me see if this can be improved.

@bpg bpg added acknowledged and removed ⌛ pending author's response Requested additional information from the reporter labels Sep 17, 2024
@bitchecker
Contributor Author

Hi,
I also think that the main reason could be the API calls being spawned in parallel, as I suggested here.

@bpg
Owner

bpg commented Sep 18, 2024

Hm... the problem lies with PVE's clone API: it requires a new VM ID. The same is reflected in the UI, where the VM ID is mandatory:
[screenshot]

This requirement has driven the current implementation. As a result, any VM creation, whether new or clone, first obtains an ID (or uses the provided one if it's set in the configuration) before calling the create or clone API.

From the initial error message, I assume you're cloning a template in your use case. So if you don't want to use parallelism = 1, I would suggest reworking your configuration to generate the vm_id using a count argument or something similar. For example:

resource "proxmox_virtual_environment_vm" "test_vm" {
  count     = 3
  name      = "test-vm-${count.index}"
  vm_id     = 100 + count.index

  clone {
    vm_id = 123
  }
  ...
}

The regular VM create can be improved to make it more reliable, and I'll address it at some point.

@bpg bpg self-assigned this Sep 22, 2024
@bpg
Owner

bpg commented Sep 30, 2024

@bpg
Owner

bpg commented Oct 4, 2024

The duplicate IDs should be mitigated by #1557.
That said, I'm not 100% confident that the ID duplicates are the root cause of the observed behaviour. FWIW, this could be a coincidence.

Anyway, I'm going to close this ticket. Please test with the new 0.66 release and reopen if the issue is still there.

@bpg bpg closed this as completed Oct 4, 2024
@bitchecker
Contributor Author

Hi @bpg,
I can't find any updates in the documentation (https://registry.terraform.io/providers/bpg/proxmox/latest/docs/resources/virtual_environment_vm).

Another point: in your PR I can see that with or without random_vm_ids = true the result is the same 😅
The only difference is the completely random IDs.

@bpg
Owner

bpg commented Oct 4, 2024

Hi @bitchecker

I've added a new section here, as this is a provider-wide feature.
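For reference, a sketch of enabling the option in the provider configuration (the endpoint and credentials are placeholders; only random_vm_ids comes from this thread):

provider "proxmox" {
  endpoint = "https://pve.example.com:8006/" # placeholder endpoint
  # ... authentication settings ...

  random_vm_ids = true
}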

Another point: in your PR I can see that with or without random_vm_ids = true the result is the same.

Just to confirm, "result is the same" means you've tried the new version in your environment, and you see the same error `can't lock file '/var/lock/qemu-server/lock-XXXX.conf'`?

@bitchecker
Contributor Author

Hi @bpg,

Just to confirm, "result is the same" means you've tried the new version in your environment, and you see the same error `can't lock file '/var/lock/qemu-server/lock-XXXX.conf'`?

Oh, no... I was just pointing out that the first output you posted (without random_vm_ids) runs without any error 😅

@bitchecker
Contributor Author

I just tried adding the new option and I can confirm there are no regressions on existing guests!

@bitchecker
Contributor Author

bitchecker commented Oct 10, 2024

Hi @bpg,
I just tried to spawn 10 guests with random_vm_ids enabled in the provider section, but the issue is still present!

Please re-open this issue.

@bpg
Owner

bpg commented Oct 10, 2024

Hi @bpg,

I just tried to spawn 10 guests with random_vm_ids enabled in the provider section, but the issue is still present!

Please re-open this issue.

What error do you see? There were two different lock errors in the previous reports: a lock on the VM config and a lock on the storage. There was also the "599 response" error, which is a different issue.

@bitchecker
Contributor Author

From what I can see, it seems the "lock problem" has now moved to the template clone operation:

╷
│ Error: error waiting for VM clone: All attempts fail:
│ #1: task "UPID:proxmox:0024DE49:1099932D:6708478D:qmclone:120:root@pam:" failed to complete with exit code: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[2],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵
╷
│ Error: error updating VM: received an HTTP 500 response - Reason: can't lock file '/var/lock/pve-manager/pve-storage-local-zfs' - got timeout
│ 
│   with module.test.proxmox_virtual_environment_vm.virtual_machine[6],
│   on .terraform/modules/test/main.tf line 13, in resource "proxmox_virtual_environment_vm" "virtual_machine":
│   13: resource "proxmox_virtual_environment_vm" "virtual_machine" {
│ 
╵

@bitchecker
Contributor Author

Hi @bpg,
any news on this?

@bpg
Owner

bpg commented Oct 25, 2024

Hey @bitchecker,

I was able to intermittently reproduce this storage lock error during tests with high concurrency—specifically, when creating 8 or more machines from the same clone on the same physical storage. However, this issue is mitigated when clones/target VMs are distributed across different storage devices or when using high-throughput Ceph storage, as in my case.

In my opinion, this error points to an I/O bottleneck within the PVE storage subsystem. The most effective solution seems to be reducing the throughput on the storage devices that are experiencing the lock. I recommend utilizing the parallelism setting or moving the clone source to a different storage to alleviate the issue.

While I could add retry logic to the provider’s code for handling the clone operation, it won't guarantee reliability. It’s difficult to predict the duration of the storage lock and determine an appropriate retry strategy. Another option would be to throttle parallel cloning within the provider, but that would essentially replicate the parallelism functionality that Terraform already offers.
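For example, a sketch of pointing clones at a different target datastore, combined with reduced parallelism on the CLI (the datastore_id option inside the clone block and the datastore name are assumptions/placeholders, not confirmed in this thread):

resource "proxmox_virtual_environment_vm" "example" {
  # ...

  clone {
    vm_id        = 132
    datastore_id = "other-nvme" # assumed option: clone onto a different datastore than the template's
  }
}

Reduced concurrency is then set on the CLI, e.g. terraform apply -parallelism=2.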
