Skip to content

Conversation

@ycli1995
Copy link
Contributor

@ycli1995 ycli1995 commented May 9, 2025

This will enable users to use BPCells matrix to save RAM.

@esiegel
Copy link
Collaborator

esiegel commented May 12, 2025

Can you give more of an explanation of what this PR is doing? Can users not use the BPCells library and then invoke the create_loupe function with their custom count matrix? Are you saying that we are saving memory because BPCells sparse matrices are more compact than dgCMatrix?

Just briefly looking at the PR, I don't like all of the duplication and I worry about depending on on a v0.3.0 of a library

@ycli1995
Copy link
Contributor Author

Sorry about that my initial PR was a bit minimal in terms of explanation. I should have provided more context on the motivation of this PR. Thanks for pointing it out!

The original create_loupe() ask for count_mat to be only dgCMatrix, which means users should load the whole matrix into R session. This could be high memory-cost when it comes to large scale data (>100k cells). With BPCells, however, users can create 10x matrix HDF5 without loading the whole matrix.

For example, to create .cloupe from a matrix of 200k cells in an h5ad file, one can first open the matrix with BPCells::open_matrix_anndata_hdf5():

count_mat <- BPCells::open_matrix_anndata_hdf5(path_to_h5ad)

This operation will not actually load the whole matrix. We can then create the .cloupe file with create_loupe():

create_loupe(count_mat, ...)

The actual logic is create_hdf5_BPCells(count_mat, h5path) -> louper_create_cloupe(h5path). In create_hdf5_BPCells(), data from input matrix are transfered to h5path using buffered stream and delayed evaluation, with the default buffer size as only 16384 bytes. Therefore it can save the memory cost at HDF5 writing.

Regarding your concerns:

  • Duplication: You are right —— I’ll try refactoring the code to reduce duplication. The current version was aiming for keeping the change clear from the original version, and for ensuring the new features behave exactly as dgCMatrix does in unit tests. But I'm happy to clean it up.
  • Dependency: BPCells installation will be checked when users call create_loupe() on matrix classes inheriting from IterableMatrix, and will raise an error if the package is not installed. This follows a similar pattern to how Seurat::Read10X_h5() checks for hdf5r.

Let me know if you'd be open to including this functionality — I'm happy to make further adjustments if you'd prefer a different direction. Also I completely understand if you’d prefer to avoid adding a dependency on BPCells at this stage. Thanks again for taking the time to review this — really appreciate your work on loupeR!

@esiegel
Copy link
Collaborator

esiegel commented May 16, 2025

After thinking about it, I think adding BPCell support would be a great feature. Seems like it is popular enough - Seurat even has vignettes encouraging using it.

Maybe for cleaning up the code we can:

  • continue with a single exported create_hdf5 function. Now it will accept either the dgCMatrix or the IterableMatrix. It is here we also should verify that the user has the optional BPCells installed and fail fast.
  • Create the internal functions write_mat_dgcmatrix and write_mat_bpcells that takes a path where the H5 should be written and writes out only the matrix data (indptr, data, indices) and not the feature data.

Thanks again

@esiegel esiegel self-requested a review May 16, 2025 17:33
@ycli1995
Copy link
Contributor Author

Thanks for the detailed reply — really appreciate the constructive feedback and your willingness to support this feature!

Your suggestions on simplifying the logic and separating out the internal write_mat_* functions make a lot of sense, and should make the code cleaner going forward.

One small thing to flag in advance: when using BPCells to output the 10x-style matrix HDF5, it always writes the barcodes and features groups in its own implementation. In details, BPCells writes the strings data with unlimited length while loupeR does not. Unfortunately, this format doesn't match what louper.exe expects, and causes errors when creating the .cloupe file.

My current workaround is to let BPCells write the matrix part, then manually delete and rewrite the barcodes and features groups using loupeR’s existing logic. It’s a bit of a hack, but avoids having to open an issue or PR on BPCells itself. This part may take me a bit of time to clean up. I’ll aim to revisit it as soon as I can. Totally agree it’s worth polishing up.

@esiegel
Copy link
Collaborator

esiegel commented May 19, 2025

One small thing to flag in advance: when using BPCells to output the 10x-style matrix HDF5, it always writes the barcodes and features groups in its own implementation. In details, BPCells writes the strings data with unlimited length while loupeR does not. Unfortunately, this format doesn't match what louper.exe expects, and causes errors when creating the .cloupe file.

My current workaround is to let BPCells write the matrix part, then manually delete and rewrite the barcodes and features groups using loupeR’s existing logic. It’s a bit of a hack, but avoids having to open an issue or PR on BPCells itself. This part may take me a bit of time to clean up. I’ll aim to revisit it as soon as I can. Totally agree it’s worth polishing up.

Thank you for pointing out the shortcoming of louper.exe and the format that it expects strings. I think using unlink to remove the datasets from the H5 file is a fine solution for now. It will lead to the files being slightly larger as unlink doesn't actually remove the bytes, but I think this is fine. Longer term, the executable should be updated to support non-fixed length strings.

@esiegel
Copy link
Collaborator

esiegel commented Jun 11, 2025

Just checking in. Any update on this?

@ycli1995
Copy link
Contributor Author

Sorry for the delayed response. I finally get back from my full-time job. T^T

The current version has merged the logic of create_hdf5_BPCells to the original create_hdf5. To keep create_hdf5() simple, I only added a conditional branch when count_mat is BPCells matrix. The write_mat() function previously used for dgCMatrix was kept unchanged.

For the unit tests, I did not merge the "can create hdf5 with BPCells" to "can create hdf5". After the hdf5 file written, the tests codes are generally the same. Do you want me to extract that part of codes into a helper function such as validate_hdf5()?

Copy link
Collaborator

@esiegel esiegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Fix the lints and I think we can land. Thanks again

R/validate.R Outdated
validate_count_mat <- function(count_mat, feature_ids = NULL) { # nolint: cyclocomp_linter.
if (!is(count_mat, "dgCMatrix")) {
return(err("count_mat must be a dgCMatrix"))
if (!is(count_mat, "dgCMatrix") & !inherits(count_mat, "IterableMatrix")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix this lint error

}
if (any(is.infinite(count_mat@x))) {
return(err("matrix values must not be infinite"))
if (inherits(count_mat, "dgCMatrix")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we avoiding this check because the matrix will be large? Or is this unsupported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only dgCMatrix has the slot x, while IterableMatrix from BPCells does not.

#' Create a sparse count_mat with BPCells
#'
#' @importFrom Matrix rsparsematrix
create_count_mat_BPCells <- function(rows, cols, valid_barcodes = FALSE) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess call this create_count_mat_bpcells to satisfy the linter.

testthat (>= 3.0.0),
Matrix,
SeuratObject (>= 5.0.0)
SeuratObject (>= 5.0.0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also need to add a REMOTES section so that the CI knows where/how to download bpcells.

Remotes:
    github::bnprks/BPCells/r

ycli1995 and others added 2 commits July 4, 2025 12:18
Copy Seurat-Object and add an extra R repository to fetch BPCells.
Using the REMOTES section within the DESCRIPTION only
worked on Windows.  On linux and mac it was failing to untar
the source.
@esiegel esiegel self-requested a review July 7, 2025 21:03
Copy link
Collaborator

@esiegel esiegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the PR. Everything is building and looks good

r-version: ${{ matrix.config.r }}
http-user-agent: ${{ matrix.config.http-user-agent }}
use-public-rspm: true
extra-repositories: 'https://bnprks.r-universe.dev'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied what Seurat-Object is doing to unbreak the CI builds

@esiegel esiegel merged commit d7994c5 into 10XGenomics:main Jul 7, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants