"check_restart.grace" is firing early for "group.service.check" that have a task defined #25318

Open · Yokutto opened this issue Mar 7, 2025 · 0 comments

When a service is used inside a group stanza, the check.check_restart.grace timer starts as soon as the allocation is created, while the workloads are still in the "starting" phase. On its own this behavior works as intended, but it becomes inconsistent when a task is provided for that service.

Users can define a task in either the service or the check stanza (by default, the check inherits the value from service.task if it is defined); a minimal sketch of both placements follows this paragraph. The problem is that if that task depends on a sidecar or an init container, the grace period runs against the overall allocation rather than waiting for the specific task to start.
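
A minimal sketch of the two placements (both task attributes exist in the Nomad job specification; the surrounding values are illustrative):

service {
  provider = "nomad"
  port     = "api-http"
  task     = "echo"   # set on the service; checks inherit this value

  check {
    task = "echo"     # or set on the check directly, overriding service.task
    # ...
  }
}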

What is happening:

  • The grace timer starts at allocation creation, not after the start of the task that the check targets
  • For tasks that depend on a sidecar or an init container, this means the grace timer is already counting down while the main task is still waiting for its lifecycle dependencies to finish

Expected Behavior:
If the service or check stanza defines a task attribute, Nomad should wait for the "task started" event from that task before starting the grace timer.

Additional Notes:

  • The issue occurs with both the Nomad and Consul service providers
  • When the service is defined inside a task stanza, the grace period works as intended, starting after the "task started" event (see the sketch after this list)
  • For services using the Consul provider with consul-connect, the sidecar container may be slow to start and warm up, resulting in inconsistent grace periods across replicas and occasional premature restarts
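
As a point of comparison, a minimal sketch of the variant that behaves as intended, with the same service and check moved from the group stanza into the task stanza (adapted from the job file below):

task "echo" {
  driver = "docker"

  config {
    image = "docker.io/hashicorp/http-echo:latest"
    args  = ["-text=☕"]
  }

  service {
    provider = "nomad"
    port     = "api-http"

    check {
      type     = "http"
      path     = "/health"
      interval = "2s"
      timeout  = "3s"

      check_restart {
        limit = 5
        grace = "10s"   # with a task-level service, this timer starts after "task started"
      }
    }
  }
}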

Steps to Reproduce:

  1. Deploy the job definition below
  2. In the UI, observe that Nomad fires a restart event for the check's task ("echo") before that task has started, while it is still waiting for the prestart task ("sleep") to complete

Job file

# sandbox.nomad.hcl

job "sandbox" {
  datacenters = ["dc1"]

  group "hashicorp" {
    network {
      mode = "bridge"

      port "api-http" {
        to = 5678
      }
    }

    service {
      provider = "nomad"
      port     = "api-http"

      check {
        task     = "echo"
        type     = "http"
        path     = "/health"
        port     = "api-http"
        interval = "2s"
        timeout  = "3s"

        check_restart {
          limit = 5
          grace = "10s"
        }
      }
    }

    task "sleep" {
      driver = "docker"

      lifecycle {
        hook = "prestart"
      }

      config {
        image   = "docker.io/library/alpine:latest"
        command = "sleep"
        args    = ["30"]
      }
    }

    task "echo" {
      driver = "docker"

      config {
        image = "docker.io/hashicorp/http-echo:latest"
        args  = ["-text=☕"]
      }
    }
  }
}
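
For reference, the same behavior can be observed from the CLI instead of the UI (standard Nomad commands; <alloc-id> is a placeholder for the allocation ID shown by job status):

nomad job run sandbox.nomad.hcl
nomad job status sandbox        # note the allocation ID
nomad alloc status <alloc-id>   # task events show the restart for "echo" while it is still pending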

Environment information

Nomad version (client)
Nomad v1.9.6
BuildDate 2025-02-11T18:55:10Z
Revision 7f8b44963d36d025520348d7f24735774d26f13b+CHANGES

Nomad version (server)
Nomad v1.9.5
BuildDate 2025-01-14T18:35:12Z
Revision 0b7bb8b60758981dae2a78a0946742e09f8316f5+CHANGES
