
Conversation

@sbassam sbassam commented Nov 7, 2025

Note

Increase file size limit to 50.1GB, bump multipart target part size to 250MB, add sliding-window concurrent uploads, add download timeout, and enhance file validation progress; tests updated accordingly.

  • Uploads:
    • Increase TARGET_PART_SIZE_MB to 250 and file limit MAX_FILE_SIZE_GB to 50.1.
    • Refactor the multipart upload to sliding-window concurrency with a _submit_part helper; cap concurrency via the executor; track progress with tqdm (see the sketch below the Note).
  • Downloads:
    • Add request_timeout=3600 to streamed downloads.
  • File Validation:
    • Use line-iteration for UTF-8 check in _check_utf8.
    • Add tqdm progress to JSONL validation loop.
  • Tests:
    • Update multipart part calculations (e.g., 500MB -> 2 parts; 50GB -> ~205 parts) and size-limit assertions.
    • Adjust as_completed mocking for sliding-window logic.

Written by Cursor Bugbot for commit d104aa1.
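For context, here is a minimal sketch of the sliding-window pattern described above, written as a standalone function rather than the PR's actual method (the helper names, the ThreadPoolExecutor choice, and the future-to-index bookkeeping are assumptions for illustration):

```python
import concurrent.futures

from tqdm import tqdm


def upload_parts_sliding_window(parts, upload_single_part, max_concurrent_parts):
    """Upload parts with at most max_concurrent_parts uploads in flight at once."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=max_concurrent_parts
    ) as executor, tqdm(total=len(parts), desc="Uploading parts") as pbar:
        in_flight = {}  # future -> part index
        part_index = 0

        def submit_part(index):
            # Counterpart of the PR's _submit_part helper.
            in_flight[executor.submit(upload_single_part, parts[index])] = index

        # Fill the initial window.
        while part_index < min(max_concurrent_parts, len(parts)):
            submit_part(part_index)
            part_index += 1

        # Each completed part frees a slot for the next one (sliding window).
        while in_flight:
            done, _ = concurrent.futures.wait(
                in_flight, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in done:
                results[in_flight.pop(future)] = future.result()
                pbar.update(1)
                if part_index < len(parts):
                    submit_part(part_index)
                    part_index += 1

    # Multipart completion needs parts in order, so return results by part index.
    return [results[i] for i in range(len(parts))]
```

Each completed future immediately frees a slot for the next part, so at most max_concurrent_parts uploads are in flight and only that many parts need to be held in memory at once.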

@sbassam sbassam marked this pull request as ready for review November 7, 2025 03:21
@sbassam sbassam requested a review from vorobyov01 November 8, 2025 03:38

@nikita-smetanin nikita-smetanin left a comment

Hi Soroush, PR looks nice, I left a few suggestions :)


# Submit next part if available
if part_index < len(parts):
    part_info = parts[part_index]

Would be great to rewrite this to deduplicate this code piece with the one above. I think you can either use a single for loop to submit tasks and wait on a result once enough are already in flight, or use executor.map with buffersize to limit concurrent tasks.
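One way to read this suggestion, sketched under the assumption that upload_single_part and max_concurrent_parts are passed in explicitly (in the PR they live on self, and reading the part data is omitted here for brevity): a single loop submits every part and blocks for a free slot whenever the window is full, which removes the duplicated submit block.

```python
import concurrent.futures


def upload_parts(executor, parts, upload_single_part, max_concurrent_parts):
    """Single submission loop: block for a free slot whenever the window is full."""
    in_flight = set()
    results = []
    for part_info in parts:
        if len(in_flight) >= max_concurrent_parts:
            # Window is full: wait until at least one upload finishes.
            done, in_flight = concurrent.futures.wait(
                in_flight, return_when=concurrent.futures.FIRST_COMPLETED
            )
            results.extend(f.result() for f in done)
        in_flight.add(executor.submit(upload_single_part, part_info))
    # Drain whatever is still running.
    results.extend(f.result() for f in concurrent.futures.as_completed(in_flight))
    return results
```

Results arrive in completion order here, so the caller would still sort parts by index before completing the multipart upload. The executor.map alternative with buffersize gives the same bounded look-ahead, but that parameter only exists on newer Python versions (3.14+, if I recall correctly), so the explicit loop is the more portable option.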

@connermanuel connermanuel removed their request for review November 12, 2025 20:20
@vorobyov01

Thanks for the PR, looks good to me! Please make sure to address this comment:
#396 (comment)

@nikita-smetanin nikita-smetanin left a comment

Thanks for the changes! Let's ship it

    self._upload_single_part, part_info, part_data
)
# Submit initial batch limited by max_concurrent_parts
for i in range(min(self.max_concurrent_parts, len(parts))):

I'd update it to while part_index < min(self.max_concurrent_parts, len(parts)): or at least replace i with _ as it's not used
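For reference, the suggested form pulled out into a tiny standalone helper so the loop shape is visible (names here are hypothetical; in the PR this lives inside the upload method and uses self.max_concurrent_parts and self._submit_part):

```python
def submit_initial_batch(parts, max_concurrent_parts, submit_part):
    """Submit the first window of parts and return the next index to submit."""
    part_index = 0
    while part_index < min(max_concurrent_parts, len(parts)):
        submit_part(parts[part_index])
        part_index += 1
    return part_index
```

This keeps part_index as the single counter for how many parts have been submitted, instead of tracking the window fill separately with i.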

@sbassam sbassam force-pushed the feat/increase-file-limit-to-50gb branch from f3ba7c9 to d104aa1 Compare November 20, 2025 20:05
@sbassam sbassam merged commit 1201470 into main Nov 20, 2025
12 checks passed
@sbassam sbassam deleted the feat/increase-file-limit-to-50gb branch November 20, 2025 20:09