-
Notifications
You must be signed in to change notification settings - Fork 256
[query] remove global Backend field
#15107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] remove global Backend field
#15107
Conversation
c9c4d18 to
fc30260
Compare
56ea4b5 to
b43a853
Compare
fc30260 to
42c64d4
Compare
b43a853 to
6d4f99d
Compare
42c64d4 to
84cf0b7
Compare
6d4f99d to
dde9456
Compare
84cf0b7 to
0ddefc3
Compare
dde9456 to
b109248
Compare
0ddefc3 to
b0e430a
Compare
b109248 to
f5845a4
Compare
8a65cd3 to
f415254
Compare
f5845a4 to
d4d42dd
Compare
f415254 to
e8f4948
Compare
d4d42dd to
89ce817
Compare
5d45dc2 to
d1d700d
Compare
df2cde7 to
dce17cf
Compare
d1d700d to
e65c195
Compare
f67084e to
de4b953
Compare
e65c195 to
34b2e0e
Compare
de4b953 to
fbe209a
Compare
495256b to
7dc5bd7
Compare
|
Can you comment on the reason for 4? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this a lot. Just have a few questions/comments, including the earlier one about the motivation for the change to globals handling.
| subparts, | ||
| (idx, result: Array[Byte]) => buffer += result -> subparts(idx), | ||
| ) | ||
| (failure, buffer.result().sortBy(_._2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a wip branch that introduces a pattern to avoid the copy when sorting the result of an ArraySeq builder like this (which we do somewhat often). I made a note to come back and use it here too.
|
One reason to upload globals separately is if globals is large, we'd get increased parallelism from writing/reading those while writing/reading the Subjectively, if I serialised the globals into the |
7dc5bd7 to
4beb603
Compare
09d0468 to
f6adcc0
Compare
f6adcc0 to
2b6f236
Compare
|
For large enough files, we should be getting the optimal amount of parallelism by chunking up the file, so I don't think manually breaking it up should have much effect. But I also can't think of any negatives to writing the globals separately, so I'm fine with leaving it. It might pave the way for someday having each partition only read in the broadcasted values they actually need, as spark does. |
|
Are we doing any chunking currently? |
|
I don't know for sure, but if this is the same code we use to localize inputs in batch jobs, then yes, that chunking is what has been causing timeout issues (because we're too eager to timeout and retry chunks). |
|
That chunking happens in the python fs implementation. From a quick glance at our |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hurray for removing another bit of global state. Thanks for doing this.

Change Description
This PR refactors the Backend interface to improve separation of concerns between driver and worker contexts. The key changes include:
Backend.instancesingleton pattern in favour of explicit context passingBackendContext asDriverRuntimeContextto implementmapCollectPartitions (formallyparallelizeAndComputeWithIndex)canExecuteParallelTasksOnDriver fromBackend - this is now an implementation detail inmapCollectPartitions.ServiceBackendSecurity Assessment
This change cannot impact the Hail Batch instance as deployed by Broad Institute in GCP