DAOS-19058 pydaos: torch surface worker errors in parallel_list #18414
enakta wants to merge 4 commits into
|
Ticket title is 'pytorch parallel_list does not surface worker process errors, causing silent hangs' |
Worker processes spawned by _Dfs.parallel_list may raise exceptions that never reach the calling process. This results in an indefinite hang during Dataset and IterableDataset construction, with no error surfaced to the user.
Replace the manual Process + Queue scheme and its queued/processed counter with a multiprocessing.Pool driven by imap_unordered. Pool re-raises worker exceptions in the parent when their results are consumed, so a worker error now propagates as a raised OSError instead of a deadlock, and the Pool context manager reaps all workers on any exit path.
`concurrent.futures.ProcessPoolExecutor` would be even better, but its initializer/initargs arguments are unavailable before Python 3.7, and the target runtime includes EL8.8 / Python 3.6.
Features: pytorch
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
|
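For readers who want to see the mechanism the description relies on, here is a minimal sketch of a Pool driven by imap_unordered surfacing a worker failure in the parent. It is not the code from this PR: `FAKE_TREE`, `list_dir`, `_init_worker` and this flat `parallel_list` are hypothetical stand-ins, the listing is not recursive, and the explicit `fork` start method is an assumption made only so the sketch runs as a single script on Linux.

```python
import errno
import multiprocessing
import posixpath

# Hypothetical in-memory namespace standing in for a DFS container.
# A value of None marks a directory whose listing fails.
FAKE_TREE = {
    "/a": ["x.bin", "y.bin"],
    "/b": None,
}

_WORKER_LABEL = None


def _init_worker(label):
    # Per-process setup hook (assumption: the real code would use a hook
    # like this for per-worker state). Pool accepts initializer/initargs
    # on Python 3.6 as well.
    global _WORKER_LABEL
    _WORKER_LABEL = label


def list_dir(path):
    # Runs in a worker process; raises OSError for a "broken" directory.
    entries = FAKE_TREE.get(path)
    if entries is None:
        raise OSError(errno.EIO, "simulated listing failure", path)
    return [posixpath.join(path, name) for name in entries]


def parallel_list(dirs, workers=2):
    # Assumption: "fork" is chosen here only to keep the sketch in one file.
    ctx = multiprocessing.get_context("fork")
    found = []
    # Leaving the with-block terminates the pool, including when an
    # exception escapes the loop below.
    with ctx.Pool(processes=workers, initializer=_init_worker,
                  initargs=("demo",)) as pool:
        # An exception raised in a worker is re-raised here, in the
        # parent, when the failing result is consumed.
        for entries in pool.imap_unordered(list_dir, dirs):
            found.extend(entries)
    return sorted(found)
```

Calling `parallel_list(["/a", "/b"])` in this sketch raises OSError in the caller rather than blocking, because consuming the failed result re-raises the worker's exception.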
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18414/2/testReport/ |
daltonbohning
left a comment
Your branch is ~150 commits behind master, so you should merge the latest master.
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: enakta <140368024+enakta@users.noreply.github.com>
Features: pytorch
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18414/4/testReport/ |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18414/4/execution/node/990/log |