Skip to content

Add S3 write support with hybrid storage for BP5 #4831

Open
eisenhauer wants to merge 14 commits into ornladios:master from eisenhauer:s3-write-support

Conversation

@eisenhauer (Member) commented Feb 2, 2026

S3 Multipart Upload

  • FileAWSSDK write support using S3 multipart upload API
  • Zero-copy uploads via AWS SDK's PreallocatedStreamBuf
  • Configurable part sizes (min_part_size, max_part_size)
  • Buffering for small writes to meet S3's 5 MiB minimum part size (sketched below)
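
The small-write buffering can be pictured roughly as follows. This is a hedged sketch, not the transport's actual code: the buffer type and method names are illustrative, and only the 5 MiB constraint (S3's minimum size for a non-final part) comes from the PR.

#include <cstddef>
#include <vector>

// Hypothetical illustration of the small-write path (not the FileAWSSDK code):
// bytes are accumulated locally and only handed to the multipart-upload API
// once S3's 5 MiB minimum size for a non-final part has been reached.
constexpr std::size_t MinPartSize = 5 * 1024 * 1024; // 5 MiB

struct PartBuffer
{
    std::vector<char> bytes;

    // Returns true when the accumulated bytes are large enough to upload as one part.
    bool Append(const char *data, std::size_t size)
    {
        bytes.insert(bytes.end(), data, data + size);
        return bytes.size() >= MinPartSize;
    }
};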

Hybrid Storage (Local Metadata + S3 Data)

  • New DataTransport engine parameter to route data files to S3
  • Metadata files (md.idx, md.0, mmd.0) stay local for fast access
  • BP5Reader/Writer: separate FilePool for metadata vs data
  • S3 parameters: S3Endpoint, S3Bucket

Other Fixes

  • ParseArgs: quoted value support for URLs with colons
  • FileAWSSDK: proper path handling with bucket parameter
  • Transport cleanup order fix in DoClose

Usage

io.SetEngine("BP5");
io.SetParameter("DataTransport", "awssdk");
io.SetParameter("S3Endpoint", "http://localhost:9000");
io.SetParameter("S3Bucket", "mybucket");

eisenhauer changed the title from "Add S3 multipart upload write support for AWS SDK transport" to "Add S3 write support with hybrid storage for BP5" on Feb 3, 2026
eisenhauer force-pushed the s3-write-support branch 5 times, most recently from 945e589 to faced1c, on February 4, 2026 03:51
pnorbert previously approved these changes Feb 5, 2026
else
{
// Use same transport for data as metadata
m_DataTransportsParameters = m_IO.m_TransportsParameters;

Contributor:

So the metadata is always stored locally?

eisenhauer (Member, Author):

In this implementation, yes. There's no reason we couldn't have a pure-S3 version, though some things probably get a little messier. For example, while actively writing, the metadata files would have to grow above 5 MiB before we could upload even a portion of them. That's no issue with a normal shutdown, but the possibility of crash recovery would be virtually eliminated. It would be easy to have a copy-to-S3 and/or copy-from-S3 utility, things like that.

eisenhauer and others added 9 commits February 6, 2026 10:48
Implement write support for FileAWSSDK transport using S3 multipart
upload API. This enables writing data to S3-compatible storage.

Key features:
- Multipart upload with configurable part sizes (min_part_size, max_part_size)
- Zero-copy uploads using AWS SDK's PreallocatedStreamBuf
- Buffering for small writes to meet S3's 5MB minimum part size
- Direct upload path for large writes to avoid unnecessary copies
- Proper cleanup of multipart uploads on close

Parameters:
- min_part_size: Minimum part size for uploads (default 5MB, S3 minimum)
- max_part_size: Maximum part size for uploads (default 5GB, S3 maximum)

Unit tests included (disabled by default, require S3-compatible endpoint):
- Basic write/read roundtrip
- Large file multipart upload (15MB)
- Many small writes with buffer accumulation
- Mixed size writes
- Boundary condition writes
- Very small writes (1KB chunks)
- Configurable part size

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- BP5Reader/Writer: separate FilePool for metadata vs data files
- DataTransport parameter routes data files to alternate transport
- FileAWSSDK: fix path handling with bucket parameter
- ParseArgs: support quoted values for URLs with colons
- Fix transport cleanup order in DoClose to avoid mutex issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix S3_MIN_PART_SIZE: use 5 MiB (5*1024*1024) not 5 MB (5*1000*1000)
  S3 requires non-final parts to be at least 5 MiB, causing EntityTooSmall
  errors with the incorrect decimal value
- Fix S3_MAX_PART_SIZE: use 5 GiB (5*1024*1024*1024) for consistency

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
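
(Spelled out as a C++ sketch, using the constant names from the commit message; the actual declarations in the transport may differ.)

// Binary (IEC) sizes required by S3, not decimal MB/GB:
constexpr unsigned long long S3_MIN_PART_SIZE = 5ULL * 1024 * 1024;        // 5 MiB = 5,242,880 bytes
constexpr unsigned long long S3_MAX_PART_SIZE = 5ULL * 1024 * 1024 * 1024; // 5 GiB = 5,368,709,120 bytes
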
This reverts commit 770e884.
eisenhauer and others added 5 commits February 6, 2026 13:32
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Modify awssdk cache setup and campaign reader S3 parameters to enable reading campaigns where data is on S3. It does not work yet for files in tar files.
pnorbert self-requested a review February 13, 2026 14:02