Extended I/O Framework: Readers/Writers for Parquet #2229

Closed
wants to merge 1 commit into from

sayedkeika
Overview

This pull request adds support for the Parquet file format. The implementation includes both readers and writers that can handle Parquet files sequentially and concurrently.

Details

  • Single-threaded reader and writer for smaller datasets
  • Multi-threaded reader and writer for larger datasets
  • Component tests for both sequential and parallel operations that cover different schemas
  • Detailed documentation

@sayedkeika sayedkeika changed the title LDE Project - Extended I/O Framework: Readers/Writers for Parquet Extended I/O Framework: Readers/Writers for Parquet Feb 16, 2025
@mboehm7 mboehm7 closed this in d19f505 Apr 18, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in SystemDS PR Queue Apr 18, 2025
@mboehm7
Contributor

mboehm7 commented Apr 18, 2025

LGTM - thanks for the patch @sayedkeika. During the merge I resolved the merge conflicts, fixed the warnings and formatting (tabs over spaces), added additional tests (sparse data), and left a FIXME (for removing the ExampleParquetWriter). Additionally, I fixed the parallel write task to call the sequential writer instead of calling the parallel writer again; the original code only worked because the small data size resulted in a single part file.
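
For readers following the last point, here is a minimal sketch of the pattern involved. It is not the merged code: the class `PartWriterSketch`, the methods `writeSequential` and `writeParallel`, and the use of in-memory strings in place of Parquet part files are all assumptions made for illustration. It only shows the general shape, where each parallel task handles one row range and delegates to the sequential write for that range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names are taken from the SystemDS code base.
class PartWriterSketch {

	// Sequential write: serializes rows [from, to) into one "part file"
	// (represented here as a plain string, one row per line).
	static String writeSequential(int[] rows, int from, int to) {
		StringBuilder sb = new StringBuilder();
		for (int i = from; i < to; i++)
			sb.append(rows[i]).append('\n');
		return sb.toString();
	}

	// Parallel write: splits the rows into numParts ranges (numParts >= 1)
	// and writes each range in its own task.
	static List<String> writeParallel(int[] rows, int numParts) throws Exception {
		ExecutorService pool = Executors.newFixedThreadPool(numParts);
		try {
			// Rows per part, rounded up so that all rows are covered.
			int blockSize = (rows.length + numParts - 1) / numParts;
			List<Future<String>> futures = new ArrayList<>();
			for (int p = 0; p < numParts; p++) {
				final int from = Math.min(p * blockSize, rows.length);
				final int to = Math.min(from + blockSize, rows.length);
				// Each task delegates to the sequential write for its own range.
				// Calling writeParallel here instead would partition the data
				// again inside every task.
				Callable<String> task = () -> writeSequential(rows, from, to);
				futures.add(pool.submit(task));
			}
			// Collect the parts in submission order.
			List<String> parts = new ArrayList<>();
			for (Future<String> f : futures)
				parts.add(f.get());
			return parts;
		}
		finally {
			pool.shutdown();
		}
	}
}
```

In this sketch, concatenating the parts returned by `writeParallel` gives the same content as a single `writeSequential` call over all rows, which is the kind of property a test with more than one part file can check.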
