[SPARK-53525][CONNECT] Spark Connect ArrowBatch Result Chunking #52271
Conversation
@@ -819,6 +828,19 @@ message ReattachOptions {
bool reattachable = 1;
}

message ResultChunkingOptions {
Should the client be able to set chunk size?
Also, can we name this ResultOptions? I can also see us setting a max arrow batch size here.
Yes, I introduced a client-side option preferred_arrow_chunk_size (this commit).
.build()
response.setArrowBatch(batch)
responseObserver.onNext(response.build())
for (i <- 0 until numChunks) {
Please don't use for comprehensions... Either use an actual loop, or the functional equivalent this translates into.
Yeah, right. Updated to an explicit loop using foreach.
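For illustration, a self-contained sketch of that loop shape; the names bytes and maxChunkSize are illustrative, not the PR's exact identifiers:

// Illustrative sketch of slicing a serialized Arrow batch into
// size-bounded chunks with an explicit foreach loop.
object ChunkLoopSketch {
  def main(args: Array[String]): Unit = {
    val bytes = Array.tabulate[Byte](10)(_.toByte) // stand-in batch data
    val maxChunkSize = 3
    val numChunks = (bytes.length + maxChunkSize - 1) / maxChunkSize

    (0 until numChunks).foreach { i =>
      val from = i * maxChunkSize
      val to = math.min(from + maxChunkSize, bytes.length)
      println(s"chunk $i covers bytes [$from, $to), length ${to - from}")
    }
  }
}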
val to = math.min(from + maxChunkSize, bytes.length)
val length = to - from

val response = proto.ExecutePlanResponse
You can reuse the builder.
Yes, updated.
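For reference, a small sketch of the builder-reuse pattern, using protobuf's BytesValue wrapper as a stand-in for the generated ExecutePlanResponse builder:

import com.google.protobuf.{ByteString, BytesValue}

// Stand-in sketch: reuse one protobuf builder across chunk iterations
// instead of allocating a new builder per chunk.
object BuilderReuseSketch {
  def main(args: Array[String]): Unit = {
    val builder = BytesValue.newBuilder()
    val chunks = Seq(Array[Byte](1, 2, 3), Array[Byte](4, 5))
    chunks.foreach { chunk =>
      builder.clear() // reset per-chunk fields on the shared builder
      val msg = builder.setValue(ByteString.copyFrom(chunk)).build()
      println(s"chunk message of ${msg.getValue.size()} bytes")
    }
  }
}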
" While spark.connect.grpc.arrow.maxBatchSize determines the max size of a result batch," + | ||
" maxChunkSize defines the max size of each individual chunk that is part of the batch" + | ||
" that will be sent in a response. This allows the server to send large rows to clients." + | ||
" However, excessively large plans remain unsupported due to Spark internals and JVM" + |
Remove these two lines. They are not related to the conf.
Sounds good, removed.

// Execute plan.
@volatile var done = false
val responses = mutable.Buffer.empty[proto.ExecutePlanResponse]
If done needs volatile, then this should be synchronized... Unless you are relying on some happens-before cleverness with the volatile variable (if so, then you need to document this)...
done here doesn't need volatile. I added volatile because other test cases have it. Removed it, as it is not needed in our case.
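For context, a minimal sketch (illustrative names, not the PR's test code) of the happens-before pattern the reviewer alludes to: plain writes made before a volatile write become visible to a thread that subsequently observes the volatile as true.

import scala.collection.mutable

// Minimal sketch: the volatile write to `done` publishes the earlier
// plain write to `responses` (happens-before ordering in the JMM).
object VolatileHandoffSketch {
  val responses = mutable.Buffer.empty[String]
  @volatile var done = false

  def main(args: Array[String]): Unit = {
    new Thread(() => {
      responses += "response-1" // plain write...
      done = true               // ...published by the volatile write
    }).start()
    while (!done) Thread.onSpinWait()
    println(responses) // guaranteed to see "response-1"
  }
}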
}

// Reassemble the chunks into a single Arrow batch and validate its content.
val batchData: ByteString =
When you build this for Scala, please use a concatenating InputStream or something like that.
Yeah, it's a good point, updated the test as well.
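For reference, one way to concatenate chunk payloads without first copying them into a single array is java.io.SequenceInputStream; a sketch with made-up chunk data, not the PR's actual test:

import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import scala.jdk.CollectionConverters._

// Sketch: expose several chunk byte arrays as one contiguous stream,
// so a reader can consume them as a single input.
object ConcatChunksSketch {
  def main(args: Array[String]): Unit = {
    val chunks = Seq(Array[Byte](1, 2), Array[Byte](3), Array[Byte](4, 5))
    val combined = new SequenceInputStream(
      chunks.map(c => new ByteArrayInputStream(c): InputStream)
        .iterator.asJavaEnumeration)
    println(combined.readAllBytes().toSeq) // prints 1, 2, 3, 4, 5
  }
}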
@xi-db are you adding Scala support in a follow-up?
Yes, I'm implementing the Scala support as a follow-up.
What changes were proposed in this pull request?
Currently, we enforce gRPC message limits on both the client and the server. These limits are largely meant to protect both sides from potential OOMs by rejecting abnormally large messages. However, there are cases in which the server incorrectly sends oversized messages that exceed these limits and cause execution failures.
Specifically, the large-message issue from server to client that we're solving here comes from the Arrow batch data in ExecutePlanResponse being too large. It occurs when a single Arrow row exceeds the 128MB message limit: Arrow cannot partition a batch below row granularity, so it has to return the single large row in one gRPC message.
To improve Spark Connect stability, this PR implements chunking of large Arrow batches when returning query results from the server to the client, ensuring each ExecutePlanResponse chunk stays within the size limit; the client reassembles the chunks of a batch before parsing them as an Arrow batch.
(Scala client changes are being implemented in a follow-up PR.)
To reproduce the existing issue we are solving here, run this code on Spark Connect:
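The original snippet is not captured in this excerpt; a hypothetical sketch that should trigger the same failure, assuming a running Spark Connect server and the Scala client, is:

import org.apache.spark.sql.SparkSession

// Hypothetical reproduction: produce a single row of roughly 300MB,
// which exceeds the 128MB gRPC message limit when the server returns
// it in one ExecutePlanResponse.
object LargeRowRepro {
  def main(args: Array[String]): Unit = {
    // Assumes a Spark Connect server at the given address.
    val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    val df = spark.sql("SELECT repeat('a', 300 * 1024 * 1024) AS big_col")
    // Collecting forces the ~300MB row back to the client.
    println(df.collect().head.getString(0).length)
  }
}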
It fails with a StatusCode.RESOURCE_EXHAUSTED error carrying the message "Received message larger than max (314570608 vs. 134217728)", because the server is trying to send an ExecutePlanResponse of ~300MB to the client.
With the improvement introduced by this PR, the above code runs successfully and prints the expected result.
Why are the changes needed?
It improves Spark Connect stability when returning large rows.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests on both the server side and the client side.
Was this patch authored or co-authored using generative AI tooling?
No.