-
Notifications
You must be signed in to change notification settings - Fork 256
[batch] Fix job_group cancellation logic in SQL procedures #14919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] Fix job_group cancellation logic in SQL procedures #14919
Conversation
be8ffa6 to
16012f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the original cancel_job_group? (unless I missed it in those two places you linked, which is very likely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so yes
|
Tested by submitting stress tests and monitoring batch-driver logs
The stress test is below and requires #14817. The default-namespace deployment logged errors matching that of #14864 from random import choice, random
from typing import Callable, Optional, TypeVar, Union
import hailtop.batch as hb
def flip(p):
return random() <= p
A = TypeVar('A')
def prob(p: float, f: Callable[[], A]) -> Optional[A]:
return f() if flip(p) else None
def job_group_name(a: Union[hb.Batch, hb.JobGroup]) -> str:
return a.attributes['name'] if isinstance(a, hb.JobGroup) else ''
def job_group_tree(root: Union[hb.Batch, hb.JobGroup], depth: int, arity: int) -> hb.JobGroup:
if depth <= 1:
return
failure_threashold = prob(0.5, lambda: 1)
group_name = f'{job_group_name(root)}/{depth}'
jg = root.create_job_group(attributes={'name': group_name}, cancel_after_n_failures=failure_threashold)
j = jg.new_bash_job(name=f'({group_name}, 0)')
j.command(f'sleep {choice(range(4))}; exit {1 if flip(0.1) else 0}')
for m in range(arity):
jm = jg.new_bash_job(name=f'({group_name}, {m + 1})')
jm = jm.depends_on(j)
jm.command(f'sleep {choice(range(4))}; exit {1 if flip(0.1) else 0}')
jm._always_run = flip(0.1)
if flip(0.01):
jm._machine_type = 'n1-standard-1'
jm.spot(flip(0.5))
for n in range(arity):
job_group_tree(jg, depth - 1, arity)
def stress():
b = hb.Batch(name='stress', backend=hb.ServiceBackend(billing_project='ehigham-trial'))
job_group_tree(b, depth=3, arity=100)
b.run(open=False, wait=False)
if __name__ == "__main__":
stress() |
job_groups_self_and_ancestors to determine if job has been cancelled136b0a6 to
e1a7bf6
Compare
e1a7bf6 to
6ca2389
Compare
grohli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for the improvement Ed
sarahgibs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved by security.
The following sql triggers and procedures were not updated to reflect
changes to job cancellation when job groups merged:
jobs_after_updatecancel_batchschedule_jobmark_job_creatingmark_job_startedThese error if there's more than 1 cancelled job group in the batch,
preventing jobs being marked complete even if they had been executed
sucessfully.
This change drops and re-creates these entities with updated job
cancellation logic.
Fixes #14864
Security Assessment
This change impacts the Hail Batch instance as deployed by Broad Institute in GCP.
This change has a medium security impact, affecting
This change does not expose new functionality that could be exploited.
Appsec Review