Individual node failure causes whole workflow failures due to queue draining #2

@ChristopherWilks

Description

This has been a known issue for a while, but I'm finally documenting it here.

The problem is that if even one worker on one node fails for a non-job-specific reason (e.g. running out of disk space), the failing worker rapidly attempts every remaining job on the queue. This quickly starves the workers on other nodes until they prematurely exit or simply idle. The queue itself will either eventually recover as failed jobs are made visible again, or the jobs will be dropped into the DLQ, from which they must be manually reloaded onto the main queue.
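A toy model of the drain, assuming an SQS-style queue with a receive-count limit. The function name, the `max_receives` threshold, and the single-broken-worker setup are illustrative, not rail's actual code; the point is that a worker that fails every job near-instantly cycles through receives far faster than healthy workers can claim work, so jobs burn their retries and land in the DLQ:

```python
import queue

def simulate_drain(jobs, max_receives=3):
    """Model one broken worker (e.g. out of local disk) that is the only
    consumer fast enough to receive messages. Every attempt fails for a
    node-local reason, so each job's receive count climbs until the job
    is dropped to the dead-letter queue."""
    q = queue.Queue()
    for job in jobs:
        q.put((job, 0))            # (job, receive_count)
    dlq = []
    while not q.empty():
        job, receives = q.get()
        receives += 1              # broken worker receives and fails the job
        if receives >= max_receives:
            dlq.append(job)        # retries exhausted: dropped to the DLQ
        else:
            q.put((job, receives)) # made visible again for another retry
    return dlq

# Every job ends up in the DLQ without a healthy worker ever touching it.
print(simulate_drain(["job%d" % i for i in range(5)]))
```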

Initially, the parent worker would exit after all child workers exited (either cleanly or not).
Subsequent changes to cluster.py, made for running on JHPCE due to other failures, changed this behavior, forcing potentially endless restarts of failed child worker processes.

In the Stampede2 environment, with the current requirement (3/24/2020) to write many of our small temporary files to the node's local /tmp, it's clear that we need a per-node limit on worker failures in the cluster.py code. The same applies in the MARCC environment, which is also constrained by local disk space (typically /dev/shm), and on AWS EC2, where the local NVMe drives of the c5d instances have limited space.
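A minimal sketch of what such a per-node cap could look like, assuming the parent launches one subprocess per child worker (as cluster.py's parent does). `MAX_NODE_FAILURES`, `run_workers`, and the restart policy are hypothetical, not the actual cluster.py interface; the idea is that once a node accumulates enough failed exits, the parent stops restarting children and exits so the node stops draining the queue:

```python
import subprocess
import sys
import time

MAX_NODE_FAILURES = 4  # illustrative threshold, not a real cluster.py value

def run_workers(cmds):
    """Launch one child worker per command; restart failed children until
    the node-wide failure cap is hit, then terminate everything and exit."""
    failures = 0
    procs = {i: subprocess.Popen(cmd) for i, cmd in enumerate(cmds)}
    while procs:
        for i, p in list(procs.items()):
            rc = p.poll()
            if rc is None:
                continue           # child still running
            del procs[i]
            if rc != 0:
                failures += 1
                if failures >= MAX_NODE_FAILURES:
                    # Likely a node-level fault (disk full, etc.):
                    # give up on this node instead of draining the queue.
                    for other in procs.values():
                        other.terminate()
                    sys.exit(1)
                # Below the cap: restart the failed worker.
                procs[i] = subprocess.Popen(cmds[i])
        time.sleep(0.1)
    return failures                # all children exited cleanly
```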

The current solution is to revert to the original behavior: quit the parent process after all child workers have exited, whether by error or cleanly. This appears to be working on AWS EC2 but not on Stampede2, which needs further investigation.
