[FLINK-39754][core] Fix int overflow in DataOutputSerializer.resize #28252

Open
sauliusvl wants to merge 2 commits into apache:master from sauliusvl:FLINK-39754-resize-overflow


@sauliusvl

What is the purpose of the change

Fixes FLINK-39754. DataOutputSerializer.resize() uses int arithmetic for buffer.length * 2. Once buffer.length crosses Integer.MAX_VALUE / 2 (~1.07 GB), doubling overflows to a negative int, Math.max then picks buffer.length + minCapacityAdd, and every subsequent resize grows the buffer by a handful of bytes instead of doubling — doing a full System.arraycopy of the ~1+ GB buffer each call. On large heaps this manifests as a silent O(n²) hang until buffer.length + minCapacityAdd itself overflows and the existing catch (NegativeArraySizeException) translates it to an IOException.
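The wrap-around described above can be reproduced with plain int arithmetic. This is a minimal sketch, not Flink's source: the class `ResizeOverflowDemo` and the method `naiveNewLength` are hypothetical names that only mirror the growth expression quoted in the description.

```java
// Illustration only: mirrors the growth expression described above, not Flink's actual code.
final class ResizeOverflowDemo {

    /** Old-style growth: the doubling is done in int arithmetic and can wrap negative. */
    static int naiveNewLength(int currentLength, int minCapacityAdd) {
        return Math.max(currentLength * 2, currentLength + minCapacityAdd);
    }

    public static void main(String[] args) {
        int small = 1024;
        int large = 1_200_000_000; // above Integer.MAX_VALUE / 2 (1_073_741_823)

        System.out.println(naiveNewLength(small, 8)); // 2048: doubling wins
        System.out.println(large * 2);                // -1894967296: wrapped to a negative int
        System.out.println(naiveNewLength(large, 8)); // 1200000008: grows by only 8 bytes
    }
}
```

Once the doubled value is negative, `Math.max` always returns the small additive step, which is why each later resize copies the whole buffer to gain only a few bytes.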

Brief change log

  • Extract the size computation from resize(int) into a @VisibleForTesting package-private static helper computeNewBufferLength(int, int).
  • The helper uses long arithmetic, validates against a new MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8 cap (matching java.util.ArrayList), and jumps to the cap when doubling would overflow — so serializations that just barely fit under 2 GB still complete instead of grinding through a linear-step resize loop.
  • Remove the now-unreachable catch (NegativeArraySizeException) block from resize. The existing OutOfMemoryError retry path is preserved (it addresses an independent concern — doubled size exceeding available heap).
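The change log above can be sketched roughly as follows. This is a reconstruction from the description, not the code in the PR: only the method signature, the first `long` addition, and the `Integer.MAX_VALUE - 8` cap come from the text, while the enclosing class `BufferGrowthSketch`, the exception message, and the exact comparison used to decide when to jump to the cap are assumptions.

```java
import java.io.IOException;

// Sketch under stated assumptions; the real helper lives in DataOutputSerializer.
final class BufferGrowthSketch {

    /** Cap named in the change log (the same value java.util.ArrayList uses). */
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    static int computeNewBufferLength(int currentLength, int minCapacityAdd) throws IOException {
        // long arithmetic: neither the sum nor the doubling below can wrap.
        long requiredLen = (long) currentLength + minCapacityAdd;
        if (requiredLen > MAX_ARRAY_SIZE) {
            // Assumed message wording; the PR only says the failure is an IOException.
            throw new IOException(
                    "Required buffer size " + requiredLen
                            + " exceeds the maximum array size " + MAX_ARRAY_SIZE);
        }

        long doubledLen = (long) currentLength * 2;
        if (doubledLen > MAX_ARRAY_SIZE) {
            // Assumption: jump straight to the cap instead of taking small linear steps.
            return MAX_ARRAY_SIZE;
        }
        return (int) Math.max(doubledLen, requiredLen);
    }
}
```

With this shape, a 1.2 GB buffer that needs 8 more bytes grows once to the cap, and only a request that cannot fit under the cap fails.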

Verifying this change

This change added tests and can be verified as follows:

  • Five pure-arithmetic unit tests on computeNewBufferLength in DataInputOutputSerializerTest covering: normal doubling, minCapacityAdd-dominated growth, jump-to-cap when currentLength * 2 would overflow, exact-cap boundary, and IOException when the required size exceeds the cap. No multi-GB allocations required.
  • Existing DataInputOutputSerializerTest tests continue to pass, confirming the normal write/read paths through resize() are unchanged.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no (DataOutputSerializer is unannotated / internal)
  • The serializers: no (this is the byte-buffer growth path, not record (de)serialization logic)
  • The runtime per-record code paths (performance sensitive): no (the helper runs only on buffer growth, not per record; the buggy linear-step path it replaces is what was previously degrading performance)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no behavior change for serializations < ~1 GB. Serializations that previously silently O(n²)-hung near 2 GB now either complete cleanly (one final grow to the cap) or fail with an actionable IOException instead of an opaque NegativeArraySizeException-derived message.
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude (Anthropic, Opus 4.7) via Zed editor

buffer.length * 2 uses int arithmetic and overflows to a negative value once the buffer crosses Integer.MAX_VALUE / 2 (~1.07 GB). Math.max then picks buffer.length + minCapacityAdd, so every subsequent resize grows the buffer by a handful of bytes instead of doubling, doing a full System.arraycopy of the ~1+ GB buffer each call. On large heaps this manifests as a silent O(n^2) hang until buffer.length + minCapacityAdd itself overflows and the existing NegativeArraySizeException catch translates it to an IOException.

Extract the size computation into a package-private static helper that uses long arithmetic, caps at Integer.MAX_VALUE - 8 (matching java.util.ArrayList), and jumps to the cap once doubling would overflow so serializations that just barely fit under 2 GB still complete.
@flinkbot

flinkbot commented May 25, 2026

Collaborator

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

public class DataOutputSerializer implements DataOutputView, MemorySegmentWritable {

/**
* Maximum array length the JVM can allocate. Some VMs reserve a few header bytes, so we cap

Contributor

Do you have a reference for "Some VMs reserve a few header bytes"? I would have thought Integer.MAX_VALUE should be the value; I have not heard of needing to use slightly less.

Author

I think the LLM learned it from the openjdk source code, this part specifically

} catch (OutOfMemoryError ee) {
// still not possible. give an informative exception message that reports the
// size
throw new IOException(

Contributor

Should we pre-allocate this exception in case we cannot create a new IOException because we have no memory? Or just let the OutOfMemoryError percolate up?

Author

It's my understanding that we'd end up here if a potentially multi-gigabyte array allocation failed, not because we've accumulated tons of small objects and exhausted the heap, so there should be enough memory to create a new exception object here. Pre-allocating would also drop the stack trace and the message.
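The stack-trace point can be illustrated with a small sketch. The names `PreallocatedExceptionDemo`, `PREALLOCATED`, `failPreallocated`, and `failFresh` are hypothetical and not part of Flink; the example relies only on the standard behaviour that a `Throwable` records its stack trace when it is constructed, not when it is thrown.

```java
import java.io.IOException;

// Illustration only: compares a pre-allocated exception with one created at the failure site.
final class PreallocatedExceptionDemo {

    // Hypothetical pre-allocated instance: its stack trace is captured here, during class init,
    // and its message cannot mention the size that later failed to allocate.
    static final IOException PREALLOCATED = new IOException("buffer allocation failed");

    static IOException failPreallocated() {
        return PREALLOCATED;
    }

    static IOException failFresh(long requestedSize) {
        // Created at the failure site: the trace points here and the message carries the size.
        return new IOException("Failed to allocate buffer of size " + requestedSize);
    }

    public static void main(String[] args) {
        // Top frame of each trace: the static initializer versus the method that failed.
        System.out.println(failPreallocated().getStackTrace()[0].getMethodName());
        System.out.println(failFresh(1L << 31).getStackTrace()[0].getMethodName());
    }
}
```

The pre-allocated instance reports the class initializer as its origin no matter which call path hits the failure, which is the loss of diagnostic detail the reply refers to.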

*/
@VisibleForTesting
static int computeNewBufferLength(int currentLength, int minCapacityAdd) throws IOException {
long requiredLen = (long) currentLength + minCapacityAdd;

Contributor

I think we should check that these values are positive. I am not sure whether 0 for either value would cause issues.

Author

Added precondition checks for good measure; no code path currently calls this with zero or negative values.

Add Preconditions checks that currentLength is non-negative and
minCapacityAdd is positive, matching the contract of
java.util.ArraysSupport.newLength. Addresses review feedback.

Generated-by: Claude (Anthropic, Opus 4.8) via Zed editor
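The argument contract described in this commit message might look like the sketch below. It uses plain `IllegalArgumentException` checks as a stand-in for the Preconditions checks the commit mentions; the class `GrowthArgumentChecks`, the method `checkGrowthArguments`, and the message texts are hypothetical.

```java
// Sketch only: plain-Java stand-in for the precondition checks described in the commit message.
final class GrowthArgumentChecks {

    static void checkGrowthArguments(int currentLength, int minCapacityAdd) {
        // currentLength may be 0 (an empty buffer) but never negative.
        if (currentLength < 0) {
            throw new IllegalArgumentException(
                    "currentLength must be non-negative, was " + currentLength);
        }
        // A resize is only requested when at least one more byte is needed.
        if (minCapacityAdd <= 0) {
            throw new IllegalArgumentException(
                    "minCapacityAdd must be positive, was " + minCapacityAdd);
        }
    }
}
```

Rejecting these inputs up front keeps the later `long` arithmetic free of sign-related special cases.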

@Jackeyzhe Jackeyzhe left a comment

Contributor

Thanks for the PR, LGTM

@github-actions github-actions bot added the community-reviewed label (PR has been reviewed by the community.) Jun 12, 2026

4 participants