
[BUG]: Malloc exception while reading large parquet file #487

Open
shamimashik opened this issue Dec 30, 2024 · 3 comments

@shamimashik commented Dec 30, 2024

Issue Description

I'm having issues reading large checkpoint Parquet files.
I'm using code like this:
var results = rowGroupReader.Column(0).LogicalReader<string>().ReadAll(numRows);

I get the following error:
class parquet::ParquetStatusException (message: 'Out of memory: malloc of size 104478272 failed')

Environment Information

  • ParquetSharp Version: [e.g. 1.0.1]
  • .NET Framework/SDK Version: [e.g. .NET Framework 4.7.2]
  • Operating System: [e.g. Windows 10]

Steps To Reproduce

var results = rowGroupReader.Column(0).LogicalReader<string>().ReadAll(numRows);

Expected Behavior

Reading the file should not throw an exception.

Additional Context (Optional)

No response

shamimashik changed the title from "[BUG]: <title>" to "[BUG]: Malloc exception while reading large parquet file" on Dec 31, 2024
@adamreeve (Contributor)

Hi @shamimashik. Is numRows the total number of rows in the row group? Reading a smaller number of rows at a time might help reduce memory usage, e.g. something like:

const int bufferSize = 1024;
var buffer = new string[bufferSize];
using var columnReader = rowGroupReader.Column(0).LogicalReader<string>();
while (columnReader.HasNext)
{
    var rowsRead = columnReader.ReadBatch(buffer);
    var values = buffer.AsSpan(0, rowsRead);
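    // Process 'values' here before the next ReadBatch call overwrites the buffer.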
}

@pathacke

Hi @adamreeve, thanks for the response.

How can we determine an ideal buffer size that works for tables of all sizes? We might have tables with a large number of columns as well as tables with fewer columns. A table with many columns could cause an exception if the buffer size isn't appropriate.

@adamreeve (Contributor)

The buffer would be used to read one column at a time, so the number of columns shouldn't matter. I think you'd need to test with your own data to determine the buffer size that works best for you.
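
For illustration, here's a minimal sketch of that approach applied to every column of a row group, reusing the same fixed-size buffer for each one. This assumes, for simplicity, that every column is a string column; the file path is hypothetical, and the column count lookup via rowGroupReader.MetaData is how I'd expect to obtain it:

using ParquetSharp;

using var fileReader = new ParquetFileReader("data.parquet"); // hypothetical path
using var rowGroupReader = fileReader.RowGroup(0);
var numColumns = rowGroupReader.MetaData.NumColumns;

const int bufferSize = 1024;
var buffer = new string[bufferSize];

for (var columnIndex = 0; columnIndex < numColumns; ++columnIndex)
{
    // The same buffer is reused for every column, so peak memory
    // stays bounded regardless of how many columns the table has.
    using var columnReader = rowGroupReader.Column(columnIndex).LogicalReader<string>();
    while (columnReader.HasNext)
    {
        var rowsRead = columnReader.ReadBatch(buffer);
        var values = buffer.AsSpan(0, rowsRead);
        // Process 'values' before the next ReadBatch call overwrites the buffer.
    }
}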
