@ehigham ehigham commented Sep 24, 2025

Change Description

This PR refactors the Backend interface to improve separation of concerns between driver and worker contexts. The key changes include:

  1. Removes the static Backend.instance singleton pattern in favour of explicit context passing
  2. Repurposes BackendContext as DriverRuntimeContext to implement mapCollectPartitions (formerly parallelizeAndComputeWithIndex)
  3. Removes canExecuteParallelTasksOnDriver from Backend; this is now an implementation detail of mapCollectPartitions
  4. Broadcasts globals in a separate file for ServiceBackend
  5. Disables all semantic-hash code when the feature is not enabled
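
As a rough illustration of items 1 to 3, here is a minimal sketch and not the PR's code: `DriverContext`, `LocalDriverContext`, `ExplicitContextExample`, and the signature given to `mapCollectPartitions` are all assumptions made for the example. It shows how passing the context explicitly lets the decision about parallel execution live inside the implementation instead of on a global `Backend.instance`.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical stand-in for a driver-side runtime context; not Hail's real trait.
trait DriverContext {
  // Apply `f` to every (globals, partition context, index) and collect (result, index) pairs.
  def mapCollectPartitions(
    globals: Array[Byte],
    contexts: IndexedSeq[Array[Byte]],
  )(f: (Array[Byte], Array[Byte], Int) => Array[Byte]): IndexedSeq[(Array[Byte], Int)]
}

// Whether tasks run in parallel on the driver is a private detail of the
// implementation, not a flag queried on a global backend.
final class LocalDriverContext(parallel: Boolean) extends DriverContext {
  override def mapCollectPartitions(
    globals: Array[Byte],
    contexts: IndexedSeq[Array[Byte]],
  )(f: (Array[Byte], Array[Byte], Int) => Array[Byte]): IndexedSeq[(Array[Byte], Int)] =
    if (parallel) {
      val tasks = contexts.zipWithIndex.map { case (ctx, i) => Future((f(globals, ctx, i), i)) }
      Await.result(Future.sequence(tasks), Duration.Inf)
    } else
      contexts.zipWithIndex.map { case (ctx, i) => (f(globals, ctx, i), i) }
}

// Callers receive the context as a parameter rather than reaching for a singleton.
object ExplicitContextExample {
  def partitionLengths(ctx: DriverContext, parts: IndexedSeq[Array[Byte]]): IndexedSeq[Int] =
    ctx
      .mapCollectPartitions(Array.emptyByteArray, parts)((_, part, _) => Array(part.length.toByte))
      .sortBy(_._2)
      .map(_._1.head.toInt)
}
```

Swapping `new LocalDriverContext(parallel = true)` for `parallel = false` changes how the work is scheduled without touching any caller.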

Security Assessment

This change cannot impact the Hail Batch instance as deployed by Broad Institute in GCP

@ehigham ehigham marked this pull request as ready for review September 24, 2025 20:31
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch 3 times, most recently from c9c4d18 to fc30260 Compare September 25, 2025 01:53
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from 56ea4b5 to b43a853 Compare September 25, 2025 13:49
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from fc30260 to 42c64d4 Compare September 25, 2025 13:49
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from b43a853 to 6d4f99d Compare September 25, 2025 16:30
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from 42c64d4 to 84cf0b7 Compare September 25, 2025 16:30
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from 6d4f99d to dde9456 Compare September 25, 2025 16:47
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from 84cf0b7 to 0ddefc3 Compare September 25, 2025 16:47
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from dde9456 to b109248 Compare September 25, 2025 17:08
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from 0ddefc3 to b0e430a Compare September 25, 2025 17:08
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from b109248 to f5845a4 Compare September 25, 2025 17:09
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch 2 times, most recently from 8a65cd3 to f415254 Compare September 25, 2025 18:37
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from f5845a4 to d4d42dd Compare September 25, 2025 18:37
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from f415254 to e8f4948 Compare September 26, 2025 01:06
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from d4d42dd to 89ce817 Compare September 26, 2025 01:06
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from 5d45dc2 to d1d700d Compare October 1, 2025 19:42
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from df2cde7 to dce17cf Compare October 1, 2025 20:27
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from d1d700d to e65c195 Compare October 1, 2025 20:29
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch 3 times, most recently from f67084e to de4b953 Compare October 1, 2025 20:38
@ehigham ehigham force-pushed the ehigham/vanquish-hail-context branch from e65c195 to 34b2e0e Compare October 1, 2025 20:38
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from de4b953 to fbe209a Compare October 2, 2025 14:20
Base automatically changed from ehigham/vanquish-hail-context to main October 3, 2025 14:28
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch 3 times, most recently from 495256b to 7dc5bd7 Compare October 3, 2025 20:21

Can you comment on the reason for 4?

@patrick-schultz patrick-schultz left a comment

I like this a lot. Just have a few questions/comments, including the earlier one about the motivation for the change to globals handling.

subparts,
(idx, result: Array[Byte]) => buffer += result -> subparts(idx),
)
(failure, buffer.result().sortBy(_._2))

I have a wip branch that introduces a pattern to avoid the copy when sorting the result of an ArraySeq builder like this (which we do somewhat often). I made a note to come back and use it here too.
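
One possible shape for such a pattern is sketched below. It is a guess for illustration only: the WIP branch is not shown in this thread, and `SortInPlace.sortedByIndex` is a hypothetical helper. The idea is to build a plain array, sort it in place, and wrap it without copying, assuming Scala 2.13's `ArraySeq.unsafeWrapArray`.

```scala
import scala.collection.immutable.ArraySeq
import scala.util.Sorting

object SortInPlace {
  // Collect (payload, partitionIndex) pairs, then sort by index without the
  // extra copy that `sortBy` on an immutable sequence would allocate.
  def sortedByIndex(results: Iterator[(Array[Byte], Int)]): ArraySeq[(Array[Byte], Int)] = {
    val builder = Array.newBuilder[(Array[Byte], Int)]
    results.foreach(builder += _)
    val arr = builder.result()
    // In-place sort of the backing array, ordered by the partition index.
    Sorting.quickSort(arr)(Ordering.by[(Array[Byte], Int), Int](_._2))
    // Wrapping is safe only because `arr` never escapes this method,
    // so nothing can mutate it after it is exposed as immutable.
    ArraySeq.unsafeWrapArray(arr)
  }
}
```

`quickSort` is not stable, which should not matter here as long as each partition index appears once.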


ehigham commented Oct 8, 2025

One reason to upload the globals separately is that, if the globals are large, we get more parallelism by writing/reading them at the same time as the PartitionFn.

Subjectively, if I serialised the globals into the PartitionFn, the worker would invoke it with an empty byte array as the globals. Maybe that's fine, but it feels like a bit of a gotcha: we'd have to remember that the driver handles serialisation that way. I'm probably overthinking it.
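
To make the trade-off concrete, here is a minimal sketch of the separate-file approach. It is not the ServiceBackend code: the file names, the `SeparateGlobals` helpers, and the use of local temp files in place of cloud storage are all assumptions. The two writes are issued as independent futures so they can overlap, and the worker side reads real globals rather than being invoked with an empty array.

```scala
import java.nio.file.{Files, Path}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SeparateGlobals {
  // Driver side: write globals and the serialised function as two files, concurrently.
  def writeSeparately(dir: Path, globals: Array[Byte], fn: Array[Byte]): (Path, Path) = {
    val writeGlobals = Future(Files.write(dir.resolve("globals"), globals))
    val writeFn = Future(Files.write(dir.resolve("f"), fn))
    Await.result(writeGlobals.zip(writeFn), Duration.Inf)
  }

  // Worker side: both inputs are read explicitly; nothing is hidden inside `fn`.
  def readBoth(globalsPath: Path, fnPath: Path): (Array[Byte], Array[Byte]) =
    (Files.readAllBytes(globalsPath), Files.readAllBytes(fnPath))
}
```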

@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from 7dc5bd7 to 4beb603 Compare October 8, 2025 21:06
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch 3 times, most recently from 09d0468 to f6adcc0 Compare October 9, 2025 02:44
@ehigham ehigham force-pushed the ehigham/move-parallelize-and-compute-with-index branch from f6adcc0 to 2b6f236 Compare October 9, 2025 14:14

For large enough files, we should already be getting the optimal amount of parallelism by chunking up the file, so I don't think manually breaking it up should have much effect. But I also can't think of any negatives to writing the globals separately, so I'm fine with leaving it. It might pave the way for someday having each partition read in only the broadcast values it actually needs, as Spark does.


ehigham commented Oct 9, 2025

Are we doing any chunking currently?

@patrick-schultz

I don't know for sure, but if this is the same code we use to localize inputs in batch jobs, then yes, that chunking is what has been causing timeout issues (because we're too eager to time out and retry chunks).


ehigham commented Oct 9, 2025

That chunking happens in the Python FS implementation. From a quick glance at our Scala FSs, I'm not sure that we do.
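
For reference, chunked reading in the sense discussed above amounts to splitting a file into byte ranges that can be fetched independently. A minimal sketch follows; the `Chunking.chunkRanges` helper and any chunk size passed to it are hypothetical and not taken from either the Python or the Scala file-system code.

```scala
object Chunking {
  // Split [0, length) into half-open (start, end) ranges of at most `chunkSize` bytes.
  def chunkRanges(length: Long, chunkSize: Long): IndexedSeq[(Long, Long)] = {
    require(length >= 0 && chunkSize > 0, "length must be non-negative and chunkSize positive")
    (0L until length by chunkSize).map(start => (start, math.min(start + chunkSize, length)))
  }
}
```

Each range could then be requested on its own task, with the last range shortened to end at the file length.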


@chrisvittal chrisvittal left a comment

Hurray for removing another bit of global state. Thanks for doing this.

@hail-ci-robot hail-ci-robot merged commit e829711 into main Oct 10, 2025
3 checks passed
@hail-ci-robot hail-ci-robot deleted the ehigham/move-parallelize-and-compute-with-index branch October 10, 2025 16:11