@Snowman-cpu

This patch addresses an issue in the current scool file creation workflow: during append operations, the bins group is always deleted and re-created, even when the bins data (i.e., the "chrom", "start", and "end" columns) has not changed. Because HDF5 does not automatically reclaim the space left behind by deleted objects, repeated appends bloat the file over time.

What’s Changed:

Conditional Bins Update:
The patch introduces a check that compares the existing bins in the file with the new bins data. If they match, the bins group is left intact, avoiding unnecessary deletion and re-creation.

Consistent Chromosome Data:
The chroms group is always updated to ensure consistency, while the bins group is only rewritten when there is an actual difference in the underlying data.
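
To illustrate the decision described above, here is a minimal sketch rather than the code in this patch. It assumes the existing and new bins have already been read into plain column mappings (for example, dicts of lists); the helper names `bins_unchanged` and `groups_to_rewrite` are hypothetical and are not part of cooler.

```python
from typing import List, Mapping, Optional, Sequence

# The three columns that define the bin table, as named in the description.
BIN_COLUMNS = ("chrom", "start", "end")

Columns = Mapping[str, Sequence]


def bins_unchanged(existing: Columns, new: Columns) -> bool:
    """Return True when all bin-defining columns are element-wise identical."""
    for col in BIN_COLUMNS:
        if col not in existing or col not in new:
            return False
        if list(existing[col]) != list(new[col]):
            return False
    return True


def groups_to_rewrite(existing: Optional[Columns], new: Columns) -> List[str]:
    """Decide which groups an append should rewrite (hypothetical helper).

    The chroms group is always refreshed; the bins group is rewritten only
    when there are no existing bins or the bin columns differ.
    """
    groups = ["chroms"]
    if existing is None or not bins_unchanged(existing, new):
        groups.append("bins")
    return groups
```

With identical bins, `groups_to_rewrite` returns only `["chroms"]`, so the bins group is never deleted and no dead space is created for it.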

Why This Matters:

Space Efficiency:
By skipping redundant writes when the bins are identical, we prevent the accumulation of dead space and reduce the need for costly file repacking operations (e.g., using h5repack).

Improved Performance:
Skipping the redundant bins write avoids unnecessary I/O, which can make repeated append operations faster and keeps the file leaner.

Testing:

The patch was tested locally by creating an initial scool file, appending cells with unchanged bins (which showed minimal file size increase), and then appending with modified bins (which resulted in a larger file increase as expected).
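
A small helper along these lines could be used to record the size change of each append during such a test. This is an assumed sketch, not the script used for the local testing; `size_delta` is a hypothetical name, and the append call itself is left to the caller.

```python
import os
from typing import Callable


def size_delta(path: str, action: Callable[[], None]) -> int:
    """Run `action` and return how many bytes the file at `path` grew by.

    A file that does not exist yet is treated as having size 0.
    """
    before = os.path.getsize(path) if os.path.exists(path) else 0
    action()
    return os.path.getsize(path) - before
```

Wrapping each append in `size_delta` makes it straightforward to compare the growth for unchanged bins against the growth for modified bins.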

Additional Notes:

HDF5 Limitations:
Even with this patch, HDF5 will not automatically reclaim space from deleted datasets. For existing files with accumulated dead space, tools like h5repack are still recommended.
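
For files that already carry dead space, repacking can be scripted. The sketch below assumes the `h5repack` command-line tool from the HDF5 distribution is installed and on `PATH`; the helper names and file names are illustrative only.

```python
import subprocess
from typing import List


def h5repack_command(src: str, dst: str) -> List[str]:
    """Build the basic `h5repack <input> <output>` invocation."""
    return ["h5repack", src, dst]


def repack(src: str, dst: str) -> None:
    """Copy `src` into a freshly written `dst`, leaving dead space behind.

    Raises CalledProcessError if h5repack exits with a non-zero status,
    and FileNotFoundError if the tool is not installed.
    """
    subprocess.run(h5repack_command(src, dst), check=True)
```

For example, `repack("cells.scool", "cells.repacked.scool")` writes a compacted copy and leaves the original file untouched, so the result can be checked before the original is replaced.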

Future Enhancements:
A longer-term improvement might involve accepting a single shared bins DataFrame with per-cell iterators, reducing memory usage and further streamlining the workflow.

Please review the changes and let me know if there are any questions or further improvements needed.
