Right now, rsc.pp.scrublet doesn't support Dask arrays, but there's a relatively straightforward path to implementing that support (at least, from what I know).
Background:
- Scrublet only really needs to run within a single sample or batch. The grouping is provided to the function as a 'batch_key'.
- Batches are typically < 100k cells and samples < 10k cells, so each one should fit within a typical GPU's memory.
Implementation concept:
- Check whether the AnnData object holds a Dask array. If so, require that a batch_key be provided.
- Rechunk the Dask array by batch_key, giving one Dask array for each batch_key value.
- Run scrublet in memory on each GPU (.compute_chunk_sizes())
- Save results in obs as normal.
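
To make the per-batch flow above concrete, here is a minimal sketch of the split / score / write-back loop. It is not how rsc.pp.scrublet works today. `score_batch` and `per_batch_scores` are hypothetical names, and `score_batch` is only a placeholder standing in for the real in-memory scrublet call. The sketch assumes X supports row indexing with an integer array and that `np.asarray` materialises the selected rows, which is where a Dask-backed batch would be pulled into memory.

```python
import numpy as np


def score_batch(counts):
    """Hypothetical stand-in for an in-memory scrublet run on one batch.

    Returns per-cell total counts centred within the batch. This is NOT a
    doublet score, it only makes the result depend on the batch contents.
    """
    totals = counts.sum(axis=1).astype(float)
    return totals - totals.mean()


def per_batch_scores(X, batch_labels, score_fn=score_batch):
    """Run score_fn separately on each batch and return scores in cell order."""
    batch_labels = np.asarray(batch_labels)
    scores = np.empty(batch_labels.shape[0], dtype=float)
    for batch in np.unique(batch_labels):
        # Row positions of the cells that belong to this batch.
        idx = np.flatnonzero(batch_labels == batch)
        # Materialise only this batch (assumed to fit in memory).
        block = np.asarray(X[idx])
        # Write results back to the original cell positions.
        scores[idx] = score_fn(block)
    return scores
```

In the real function, `batch_labels` would come from `adata.obs[batch_key]` and the returned vector would be stored in obs as usual. The loop is sequential here, so spreading batches across GPUs would need to be layered on top.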