@ehigham ehigham commented Jun 12, 2025

The following SQL triggers and procedures were not updated to reflect
the changes to job cancellation made when job groups were merged:

  • jobs_after_update
  • cancel_batch
  • schedule_job
  • mark_job_creating
  • mark_job_started

These raise an error if there is more than one cancelled job group in the
batch, preventing jobs from being marked complete even if they executed
successfully.

This change drops and re-creates these entities with updated job
cancellation logic.
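
The idea behind the updated logic, going by this PR's original title (joining job_groups_self_and_ancestors to decide whether a job has been cancelled), can be sketched as follows. This is an illustration only and not the SQL from this PR: it runs against an in-memory SQLite database, the schema is a simplified assumption (the real batch database is MySQL and has more columns), and the job_groups_cancelled table layout and the is_cancelled helper are hypothetical.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_groups_self_and_ancestors (
    batch_id INTEGER, job_group_id INTEGER, ancestor_id INTEGER
);
CREATE TABLE job_groups_cancelled (batch_id INTEGER, job_group_id INTEGER);
""")

# Job group tree for batch 1: 0 (root) -> 1 -> 2, and 0 -> 3.
# Each job group is listed against itself and every ancestor.
conn.executemany(
    "INSERT INTO job_groups_self_and_ancestors VALUES (1, ?, ?)",
    [(0, 0), (1, 1), (1, 0), (2, 2), (2, 1), (2, 0), (3, 3), (3, 0)],
)
# Two cancelled job groups in the same batch (the case that triggered the errors).
conn.executemany("INSERT INTO job_groups_cancelled VALUES (1, ?)", [(1,), (3,)])


def is_cancelled(batch_id: int, job_group_id: int) -> bool:
    # A job group counts as cancelled if it or any ancestor is cancelled.
    # EXISTS yields a single value however many cancelled rows match.
    row = conn.execute(
        """
        SELECT EXISTS (
            SELECT 1
            FROM job_groups_self_and_ancestors AS a
            JOIN job_groups_cancelled AS c
              ON c.batch_id = a.batch_id AND c.job_group_id = a.ancestor_id
            WHERE a.batch_id = ? AND a.job_group_id = ?
        )
        """,
        (batch_id, job_group_id),
    ).fetchone()
    return bool(row[0])
```

Because the check is an existence test, it behaves the same whether one or several job groups in the batch are cancelled: in this sketch job group 2 is cancelled through its parent 1, while the root group 0 is not.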

Fixes #14864

Security Assessment

This change impacts the Hail Batch instance as deployed by Broad Institute in GCP.
This change has a medium security impact, affecting

  • the stored SQL procedures in the batch database, and
  • error handling in the batch worker.

This change does not expose new functionality that could be exploited.

Appsec Review

  • Required: The impact has been assessed and approved by appsec

@ehigham ehigham force-pushed the ehigham/14864-mark-job-started branch 6 times, most recently from be8ffa6 to 16012f8 June 16, 2025 18:44

Contributor

Is this the original cancel_job_group? (unless I missed it in those two places you linked, which is very likely)

Member Author

I believe so yes

Member Author

ehigham commented Jun 18, 2025

Tested by submitting stress tests to batch services and monitoring the
batch-driver logs for errors. The services were running in

  • the default k8s namespace, to reproduce the errors, and
  • a dev deployment patched with this branch.

The stress test is below and requires #14817.

The default-namespace deployment logged errors matching those in #14864,
whereas the patched instance did not. I claim that this change fixes that issue.

from random import choice, random
from typing import Callable, Optional, TypeVar, Union

import hailtop.batch as hb


def flip(p: float) -> bool:
    # Bernoulli trial: True with probability p.
    return random() <= p


A = TypeVar('A')


def prob(p: float, f: Callable[[], A]) -> Optional[A]:
    # Evaluate f with probability p; otherwise return None.
    return f() if flip(p) else None


def job_group_name(a: Union[hb.Batch, hb.JobGroup]) -> str:
    return a.attributes['name'] if isinstance(a, hb.JobGroup) else ''


def job_group_tree(root: Union[hb.Batch, hb.JobGroup], depth: int, arity: int) -> None:
    # Recursively attach job groups to root; recursion stops once depth reaches 1.
    if depth <= 1:
        return

    # With probability 0.5, cancel the job group after its first failed job.
    failure_threshold = prob(0.5, lambda: 1)
    group_name = f'{job_group_name(root)}/{depth}'
    jg = root.create_job_group(attributes={'name': group_name}, cancel_after_n_failures=failure_threshold)
    j = jg.new_bash_job(name=f'({group_name}, 0)')
    j.command(f'sleep {choice(range(4))}; exit {1 if flip(0.1) else 0}')

    for m in range(arity):
        jm = jg.new_bash_job(name=f'({group_name}, {m + 1})')
        jm = jm.depends_on(j)
        jm.command(f'sleep {choice(range(4))}; exit {1 if flip(0.1) else 0}')
        jm._always_run = flip(0.1)

        if flip(0.01):
            jm._machine_type = 'n1-standard-1'
            jm.spot(flip(0.5))

    # Create `arity` child job groups, each one level shallower.
    for _ in range(arity):
        job_group_tree(jg, depth - 1, arity)


def stress():
    b = hb.Batch(name='stress', backend=hb.ServiceBackend(billing_project='ehigham-trial'))
    job_group_tree(b, depth=3, arity=100)
    b.run(open=False, wait=False)


if __name__ == "__main__":
    stress()
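
For a sense of scale, the size of the submitted batch follows from the recursion in job_group_tree above: each call with depth greater than 1 creates one job group containing 1 + arity jobs and then recurses arity times. The helper below, tree_size, is a hypothetical name introduced only for this calculation and is not part of the stress test.

```python
from typing import Tuple


def tree_size(depth: int, arity: int) -> Tuple[int, int]:
    """Return (job_groups, jobs) created by job_group_tree(root, depth, arity)."""
    if depth <= 1:
        # Mirrors the early return in job_group_tree: nothing is created.
        return (0, 0)
    child_groups, child_jobs = tree_size(depth - 1, arity)
    # One group with a head job plus `arity` dependents, then `arity` subtrees.
    return (1 + arity * child_groups, (1 + arity) + arity * child_jobs)


groups, jobs = tree_size(depth=3, arity=100)
print(groups, jobs)  # 101 10201
```

With depth=3 and arity=100 the stress test therefore submits 101 job groups and 10201 jobs, enough that several job groups are likely to be cancelled within the same batch.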

@ehigham ehigham changed the title from "[batch] Join job_groups_self_and_ancestors to determine if job has been cancelled" to "[batch] Fix job_group cancellation logic in SQL procedures" Jun 18, 2025
@ehigham ehigham requested review from cjllanwarne and grohli June 18, 2025 16:54
@ehigham ehigham added full-deploy Requires a full deployment at this commit before following commits can be deployed migration labels Jun 18, 2025
@ehigham ehigham force-pushed the ehigham/14864-mark-job-started branch 3 times, most recently from 136b0a6 to e1a7bf6 June 18, 2025 17:54
@ehigham ehigham force-pushed the ehigham/14864-mark-job-started branch from e1a7bf6 to 6ca2389 June 18, 2025 18:10
@ehigham ehigham requested review from a team and grohli June 18, 2025 20:37
Contributor

@grohli grohli left a comment

Looks great, thanks for the improvement Ed

@sarahgibs sarahgibs left a comment

approved by security.

@hail-ci-robot hail-ci-robot merged commit b460d95 into hail-is:main Jun 23, 2025
2 checks passed
@ehigham ehigham deleted the ehigham/14864-mark-job-started branch June 23, 2025 18:59
Development

Successfully merging this pull request may close these issues.

Batch services fails to mark job started when the batch has more than one failed job group
