"check_restart.grace" is firing early for "group.service.check" that have a task defined #25318

Open · Yokutto opened this issue Mar 7, 2025 · 0 comments

When a service is used inside a group stanza, the check.check_restart.grace timer starts as soon as the allocation is created, while the workloads are still in the "starting" phase. On its own this behavior works as intended, but it becomes inconsistent when a task is provided for that service.

Users can define a task in either the service or the check stanza (by default, the check inherits the value from service.task if it is defined); a minimal sketch of both placements follows this paragraph. The problem is that if that task depends on a sidecar or an init container, the grace period runs against the overall allocation rather than waiting for the specific task to start.
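
A minimal sketch of the two placements (both task attributes exist in the Nomad job specification; the surrounding values are illustrative):

service {
  provider = "nomad"
  port     = "api-http"
  task     = "echo"   # set on the service; checks inherit this value

  check {
    task = "echo"     # or set on the check directly, overriding service.task
    # ...
  }
}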

What is happening:

  • The grace timer starts at allocation creation, not after the start of the task that the check targets
  • For tasks that depend on a sidecar or an init container, this means the grace timer is already counting down while the main task is still waiting for its lifecycle dependencies to finish

Expected Behavior:
If the service or check stanza defines a task attribute, Nomad should wait for the "task started" event from that task before starting the grace timer.

Additional Notes:

  • The issue occurs with both the Nomad and Consul service providers
  • When the service is defined inside a task stanza, the grace period works as intended, starting after the "task started" event (see the sketch after this list)
  • For services using the Consul provider with consul-connect, the sidecar container may be slow to start and warm up, resulting in inconsistent grace periods across replicas and occasional premature restarts
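
As a point of comparison, a minimal sketch of the variant that behaves as intended, with the same service and check moved from the group stanza into the task stanza (adapted from the job file below):

task "echo" {
  driver = "docker"

  config {
    image = "docker.io/hashicorp/http-echo:latest"
    args  = ["-text=☕"]
  }

  service {
    provider = "nomad"
    port     = "api-http"

    check {
      type     = "http"
      path     = "/health"
      interval = "2s"
      timeout  = "3s"

      check_restart {
        limit = 5
        grace = "10s"   # with a task-level service, this timer starts after "task started"
      }
    }
  }
}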

Steps to Reproduce:

  1. Deploy the job definition below
  2. In the UI, observe that Nomad fires a restart event for the check's task ("echo") before that task has started, while it is still waiting for the prestart task ("sleep") to complete

Job file

# sandbox.nomad.hcl

job "sandbox" {
  datacenters = ["dc1"]

  group "hashicorp" {
    network {
      mode = "bridge"

      port "api-http" {
        to = 5678
      }
    }

    service {
      provider = "nomad"
      port     = "api-http"

      check {
        task     = "echo"
        type     = "http"
        path     = "/health"
        port     = "api-http"
        interval = "2s"
        timeout  = "3s"

        check_restart {
          limit = 5
          grace = "10s"
        }
      }
    }

    task "sleep" {
      driver = "docker"

      lifecycle {
        hook = "prestart"
      }

      config {
        image   = "docker.io/library/alpine:latest"
        command = "sleep"
        args    = ["30"]
      }
    }

    task "echo" {
      driver = "docker"

      config {
        image = "docker.io/hashicorp/http-echo:latest"
        args  = ["-text=☕"]
      }
    }
  }
}
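
For reference, the same behavior can be observed from the CLI instead of the UI (standard Nomad commands; <alloc-id> is a placeholder for the allocation ID shown by job status):

nomad job run sandbox.nomad.hcl
nomad job status sandbox        # note the allocation ID
nomad alloc status <alloc-id>   # task events show the restart for "echo" while it is still pending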

Environment information

Nomad version (client)
Nomad v1.9.6
BuildDate 2025-02-11T18:55:10Z
Revision 7f8b44963d36d025520348d7f24735774d26f13b+CHANGES

Nomad version (server)
Nomad v1.9.5
BuildDate 2025-01-14T18:35:12Z
Revision 0b7bb8b60758981dae2a78a0946742e09f8316f5+CHANGES
