This document translates the goals from initial_plan.md into a concrete roadmap for implementing piscem-rs in Rust, informed by detailed review of the C++ codebase (piscem-cpp/) and the Rust sshash-rs library.
- Parity target: Rust outputs must be semantically equivalent to `piscem-cpp` outputs (not byte-identical).
- Index size target: Serialized Rust indices should be very similar in size to C++ indices (avoid meaningful disk-size bloat).
- RAD comparison path: Use `libradicl` on the `develop` branch to read and compare C++ and Rust RAD outputs.
- Project structure: `piscem-rs` should be an independent top-level project that depends on `sshash-lib`, not part of the `sshash-rs` workspace.
- Dependency source preference: Prefer a local path dependency on the checked-out `sshash-rs`, with optional Git fallback.
Implement build, serialization, load, and query of the full piscem index:
- SSHash dictionary (`sshash-lib`)
- Inverted tiling index (unitig -> packed occurrence postings)
- Optional equivalence class table
- Optional poison table
Implement mapping algorithms exactly as in piscem-cpp such that mapping semantics are identical.
Implement in this exact order:
- Single-cell RNA-seq
- Bulk RNA-seq
- Single-cell ATAC-seq
Preserve practical performance and keep serialized index sizes near C++.
    piscem-rs/
      Cargo.toml
      src/
        lib.rs
        main.rs
        cli/
          mod.rs
          build.rs
          map_scrna.rs
          map_bulk.rs
          map_scatac.rs
          poison.rs
          inspect.rs
        index/
          mod.rs
          reference_index.rs
          contig_table.rs
          eq_classes.rs
          poison_table.rs
          refinfo.rs
          formats.rs
        mapping/
          mod.rs
          engine.rs
          cache.rs
          chain_state.rs
          hits.rs
          merge_pairs.rs
          filters.rs
          streaming_query.rs   # <-- piscem-level streaming query wrapper (NEW)
          hit_searcher.rs      # <-- hit collection algorithms (NEW)
          projected_hits.rs    # <-- projected_hits + decode_hit (NEW)
        protocols/
          mod.rs
          scrna.rs
          bulk.rs
          scatac.rs
        io/
          mod.rs
          fastx.rs
          rad.rs
          threads.rs
        verify/
          mod.rs
          parity.rs
          rad_compare.rs
          index_compare.rs
      tests/
        data/
        parity_index.rs
        parity_query.rs
        parity_mapping_scrna.rs
        parity_mapping_bulk.rs
        parity_mapping_scatac.rs
- Algorithm-preserving port first
  - Prefer direct translation of control flow and state transitions for parity-critical kernels.
  - Optimize after equivalence harness is stable.
- Stable semantic contracts at module boundaries
  - `index::reference_index` exposes a read-only query API used by mapping.
  - `mapping::engine` consumes a protocol adapter (scRNA/bulk/scATAC specifics).
- Determinism controls for validation
  - Single-thread deterministic mode for strict regression checks.
  - Multi-thread mode allows output ordering differences only.
This section maps key C++ types to their planned Rust equivalents.
| C++ type | Rust module | Rust type | Notes |
|---|---|---|---|
| `piscem_dictionary` | `sshash-lib` | `Dictionary` | Already implemented |
| `basic_contig_table` | `index::contig_table` | `ContigTable` | Port EF offsets + compact_vector entries |
| `equivalence_class_map` | `index::eq_classes` | `EqClassMap` | tile_ec_ids, label_list_offsets, label_entries |
| `reference_index` | `index::reference_index` | `ReferenceIndex` | Owns Dictionary + ContigTable + RefInfo + optional EqClassMap |
| `poison_table` | `index::poison_table` | `PoisonTable` | Hash map + offsets + occurrences |
| `ref_sig_info_t` | `index::refinfo` | `RefSigInfo` | Signature metadata |

| C++ type | Rust module | Rust type | Notes |
|---|---|---|---|
| `projected_hits` | `mapping::projected_hits` | `ProjectedHits` | contigIdx, contigPos, orientation, contigLen, globalPos, k, refRange |
| `simple_hit` | `mapping::hits` | `SimpleHit` | tid, pos, is_fw, num_hits, score |
| `chain_state` | `mapping::chain_state` | `ChainState` | read_start_pos, prev_pos, curr_pos, num_hits, min_distortion |
| `sketch_hit_info` | `mapping::hits` | `SketchHitInfo` | fw/rc chain vectors with structural constraints |
| `sketch_hit_info_no_struct_constraint` | `mapping::hits` | `SketchHitInfoSimple` | Simple counting variant |
| `mapping_cache_info<S,Q>` | `mapping::cache` | `MappingCache` | Per-thread state: hit_map, accepted_hits, query, searcher |
| `hit_searcher` | `mapping::hit_searcher` | `HitSearcher` | 3 variants of k-mer hit collection |
| `piscem::streaming_query<with_cache>` | `mapping::streaming_query` | `PiscemStreamingQuery` | Wraps sshash StreamingQuery + contig table resolution |
| `MappingType` | `mapping::hits` | `MappingType` | Enum: Unmapped, SingleMapped, Orphan1, Orphan2, Pair |
| `poison_state_t` | `mapping::filters` | `PoisonState` | Poison k-mer scan between hit intervals |
| `SkipContext` | `mapping::hit_searcher` | `SkipContext` | Stateful iterator for skip-and-verify hit collection |

| sshash-rs term | piscem-cpp term | Meaning |
|---|---|---|
| `string_id` | `contig_id` | Unitig identifier |
| `kmer_id_in_string` | `kmer_id_in_contig` | K-mer position within unitig |
| `string_begin`/`string_end` | contig begin/end | Unitig boundaries in SPSS |
| `string_length()` | contig length | Unitig length |
Critical finding: The C++ piscem::streaming_query<bool with_cache> (in streaming_query.hpp) is a piscem-level wrapper around sshash's streaming query, not the same as sshash's own streaming query. It adds:
- Contig table resolution: After sshash lookup, resolves contig_id → reference occurrence span via `m_ctg_offsets`/`m_ctg_entries`.
- Unitig-end k-mer caching: `boost::concurrent_flat_map<uint64_t, lookup_result>` caches the lookup result for the last k-mer in each unitig to speed up transitions between unitigs.
- Direction/extension tracking: Tracks `m_direction`, `m_remaining_contig_bases`, `m_prev_contig_id` to decide whether to extend (cheap) or do a full lookup.
In Rust, this must be built as mapping::streaming_query::PiscemStreamingQuery wrapping sshash_lib::StreamingQueryEngine. It takes a reference to the ContigTable to resolve spans.
The hit_searcher (in hit_searcher.cpp, ~1400 lines) implements three variants:
- `get_raw_hits_sketch` (primary, newer): Uses `SkipContext` for stateful skip-and-verify. Two sub-modes:
  - `STRICT`: Conservative skip with safe-walk fallback.
  - `PERMISSIVE`: Aggressive skip with mid-point verification; on failure, falls back to every-kmer walk for the failing interval.
- `get_raw_hits_sketch_orig` (original): Uses `query_kmer_complex()` with safe-skip walk-back.
- `get_raw_hits_sketch_everykmer` (exhaustive): Queries every k-mer in the read (no skipping).
The SkipContext struct is central — it tracks: read position, contig position, expected k-mer for fast-hit checking, and manages the skip logic. It reads reference k-mers from the SPSS bit vector (piscem_bv_iterator). The Rust equivalent will need to use sshash-rs's ability to decode k-mers at specific positions in the SPSS.
The map_read() function (~300 lines) is the per-read mapping kernel:
- Calls `hit_searcher::get_raw_hits_sketch()` to collect raw hits.
- Optionally runs `poison_state_t::scan_raw_hits()` to check for poison k-mers.
- Runs the `collect_mappings_from_hits` lambda:
  - First pass: builds `hit_map` (tid → SketchHitInfo) from projected hits.
  - If too many hits (> `max_hit_occ`), opens recovery mode with higher threshold.
  - Applies ambiguous-hit EC filtering if EC table is present.
  - Final best-hit selection based on hit count and structural constraints.
- Populates the `accepted_hits` vector of `SimpleHit`.
projected_hits::decode_hit(v) resolves a packed contig table entry v into a reference position + orientation. The 4-case orientation logic (contigFW × contigOrientation) is critical for semantic parity.
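The four cases can be illustrated with a minimal sketch. It is derived from the geometry of placing a unitig on a reference, not copied from `projected_hits.hpp`, so the field names (`kmer_fw_on_contig`, `contig_fw_on_ref`), the `RefPos` type, and the exact offsets are assumptions to verify against the C++ before porting.

```rust
/// Hypothetical result type: position and strand of the k-mer on the reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RefPos {
    pub pos: u32,
    pub is_fw: bool,
}

/// Hypothetical subset of the projected-hit fields needed for decoding.
#[derive(Debug, Clone, Copy)]
pub struct ProjectedHitSketch {
    pub contig_pos: u32,         // k-mer offset within the unitig
    pub contig_len: u32,         // unitig length in bases
    pub kmer_fw_on_contig: bool, // orientation of the query k-mer on the unitig
    pub k: u32,
}

impl ProjectedHitSketch {
    /// `entry_pos` is where this unitig occurrence starts on the reference;
    /// `contig_fw_on_ref` is the orientation bit of that occurrence.
    pub fn decode_hit(&self, entry_pos: u32, contig_fw_on_ref: bool) -> RefPos {
        // Forward occurrence: offsets carry over directly.
        // Reverse occurrence: the k-mer's position is mirrored within the unitig.
        let pos = if contig_fw_on_ref {
            entry_pos + self.contig_pos
        } else {
            entry_pos + self.contig_len - (self.contig_pos + self.k)
        };
        // The k-mer is forward on the reference exactly when both orientations agree.
        let is_fw = contig_fw_on_ref == self.kmer_fw_on_contig;
        RefPos { pos, is_fw }
    }
}
```

Keeping the two booleans separate until the final comparison makes each of the four cases easy to cover with a unit test.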
sshash-rs uses dispatch_on_k!(k, K => { ... }) to go from runtime k to const generic K. In piscem-rs, the ReferenceIndex will learn k at load time, and all downstream code paths (streaming query, hit searcher, mapping engine) must be invoked inside a dispatch_on_k! block. This suggests the main mapping entry point will be generic over K, instantiated at load time.
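As a rough illustration of the runtime-to-const-generic boundary, the sketch below uses a hand-written `match` as a stand-in for a `dispatch_on_k!`-style macro. The function names and the set of supported `k` values are hypothetical; the real macro in `sshash-rs` defines its own supported range.

```rust
/// Hypothetical K-generic kernel. A real entry point would run the mapping
/// pipeline; this one only counts the k-mers in a read of the given length.
fn kmers_in_read<const K: usize>(read_len: usize) -> usize {
    // A read shorter than K contains no k-mers.
    read_len.saturating_sub(K - 1)
}

/// Hand-written stand-in for a dispatch macro: the runtime `k` is matched once,
/// and everything called inside each arm is monomorphized for that K.
fn dispatch_kmers_in_read(k: usize, read_len: usize) -> Option<usize> {
    match k {
        15 => Some(kmers_in_read::<15>(read_len)),
        23 => Some(kmers_in_read::<23>(read_len)),
        31 => Some(kmers_in_read::<31>(read_len)),
        // Unsupported k: surface this as an error at load time.
        _ => None,
    }
}
```

Doing the dispatch once at the outermost entry point keeps the inner code free of runtime branches on `k`.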
C++ uses global-ish patterns:
- `PiscemIndexUtils::ref_shift()` / `PiscemIndexUtils::pos_mask()`: global bit widths for decoding contig entries.
- `CanonicalKmer::k()`: global k-mer size.
In Rust, these should be stored as fields on ReferenceIndex (or a shared IndexParams struct) and passed through to code that needs them. ref_shift and pos_mask derive from ContigTable::ref_len_bits.
- `sshash-lib` (local path dependency preferred)
- `needletail` (FASTA/FASTQ)
- `rayon` or `crossbeam` (parallel work scheduling)
- `clap` (CLI)
- `tracing` + `tracing-subscriber` (logging)
- `anyhow` / `thiserror` (error handling)
- `smallvec` (small fixed-capacity vectors in hot paths; needed for `sketch_hit_info` chain vectors)
- `ahash` (fast hash maps where appropriate)
- `memmap2` (efficient large index IO)
- `nohash-hasher` or similar (for integer-keyed maps in hit_map, matching C++ `ankerl::unordered_dense`)
- `sucds` or custom (Elias-Fano and compact_vector implementations for contig table)
- Integrate/bridge to `libradicl` (`develop`) for robust RAD read/compare path.
- Implementation options:
  - preferred: Rust-native RAD reader if practical and parity-safe;
  - fallback: FFI bridge to `libradicl` for decode/normalization used by parity tests.
Primary (local development):

    [dependencies]
    sshash-lib = { path = "./sshash-rs/crates/sshash-lib" }

Optional CI/repro fallback:

    [dependencies]
    sshash-lib = { git = "https://github.com/COMBINE-lab/sshash-rs", package = "sshash-lib" }

Objective: Build the infrastructure to verify semantic equivalence early and continuously.
- Initialize `piscem-rs` crate skeleton and command structure.
- Add local-path `sshash-lib` dependency.
- Build parity harness:
  - invoke C++/Rust runs on same fixtures,
  - normalize outputs,
  - compare semantics (not bytes).
- Add RAD comparison utility path based on `libradicl` `develop`.
- Can run one command that reports parity pass/fail on a toy dataset.
Objective: Implement full Rust index loading/saving/query substrate equivalent to C++.
Port basic_contig_table to Rust:
- `m_ctg_offsets` → Elias-Fano monotone sequence (encodes cumulative posting list boundaries per unitig)
- `m_ctg_entries` → Compact vector of bit-packed entries (each entry = `ref_position | (orientation_bit << ref_len_bits)`)
- `m_ref_len_bits` → `u64`: number of bits for reference position encoding
- `contig_entries(contig_id)` → returns iterator/slice over entries for given unitig
Decision needed: Use sucds crate for Elias-Fano, or port the C++ bits::elias_fano / bits::compact_vector directly? (See Q1 in open questions.)
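To make the layout concrete, here is a minimal sketch that uses plain `Vec`s in place of the Elias-Fano sequence and the compact vector. The type name `ContigTableSketch` and the `pack`/`unpack` helpers are hypothetical; the entry layout follows the description above (`ref_position | (orientation_bit << ref_len_bits)`) and should be checked against `basic_contig_table` before it is relied on.

```rust
/// Plain-Vec stand-in for the succinct contig table (hypothetical type).
pub struct ContigTableSketch {
    /// Cumulative posting-list boundaries; length is number of unitigs + 1.
    pub offsets: Vec<usize>,
    /// Packed entries for all unitigs, concatenated.
    pub entries: Vec<u64>,
    /// Bits used for the reference position; must be below 63 in this sketch.
    pub ref_len_bits: u32,
}

impl ContigTableSketch {
    /// Pack a position and orientation bit using the layout described above.
    pub fn pack(&self, ref_pos: u64, orientation: bool) -> u64 {
        ref_pos | ((orientation as u64) << self.ref_len_bits)
    }

    /// Split a packed entry back into (position, orientation bit).
    pub fn unpack(&self, entry: u64) -> (u64, bool) {
        let pos_mask = (1u64 << self.ref_len_bits) - 1;
        (entry & pos_mask, ((entry >> self.ref_len_bits) & 1) == 1)
    }

    /// Entries for one unitig, delimited by two consecutive offsets.
    pub fn contig_entries(&self, contig_id: usize) -> &[u64] {
        &self.entries[self.offsets[contig_id]..self.offsets[contig_id + 1]]
    }
}
```

A succinct implementation would keep the same three methods and swap only the storage, which also gives the parity tests a simple reference implementation to compare against.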
Port reference metadata load/save:
- `m_ref_names: Vec<String>`: reference sequence names
- `m_ref_lens: Vec<u64>`: reference sequence lengths
- Serialized as `.refinfo` file (C++ format: binary length-prefixed strings + u64 array)
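A length-prefixed layout of this kind can be sketched as a round trip over `std::io`. The byte layout below (little-endian `u64` count, then length-prefixed names, then lengths) is an assumption chosen for illustration; it is not claimed to match the C++ `.refinfo` bytes.

```rust
use std::io::{self, Read, Write};

#[derive(Debug, PartialEq, Eq)]
pub struct RefInfo {
    pub ref_names: Vec<String>,
    pub ref_lens: Vec<u64>,
}

fn write_u64<W: Write>(w: &mut W, v: u64) -> io::Result<()> {
    w.write_all(&v.to_le_bytes())
}

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

impl RefInfo {
    /// Assumed layout: count, then (name length, name bytes) per reference,
    /// then one u64 length per reference. Expects names and lens of equal length.
    pub fn save<W: Write>(&self, w: &mut W) -> io::Result<()> {
        write_u64(w, self.ref_names.len() as u64)?;
        for name in &self.ref_names {
            write_u64(w, name.len() as u64)?;
            w.write_all(name.as_bytes())?;
        }
        for &len in &self.ref_lens {
            write_u64(w, len)?;
        }
        Ok(())
    }

    pub fn load<R: Read>(r: &mut R) -> io::Result<Self> {
        let n = read_u64(r)? as usize;
        let mut ref_names = Vec::with_capacity(n);
        for _ in 0..n {
            let len = read_u64(r)? as usize;
            let mut buf = vec![0u8; len];
            r.read_exact(&mut buf)?;
            let name = String::from_utf8(buf)
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            ref_names.push(name);
        }
        let mut ref_lens = Vec::with_capacity(n);
        for _ in 0..n {
            ref_lens.push(read_u64(r)?);
        }
        Ok(Self { ref_names, ref_lens })
    }
}
```

Because binary compatibility with the C++ files is not required, a round-trip test like this plus a semantic comparison of names and lengths is enough for parity.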
Assemble the full index:
- `Dictionary` (loaded via `sshash-lib`)
- `ContigTable` (loaded from `.ctab` equivalent)
- `RefInfo` (loaded from `.refinfo` equivalent)
- Optional `EqClassMap` (loaded from `.ectab` equivalent)
- `ref_shift` / `pos_mask` computed from `contig_table.ref_len_bits`
Key method: query(kmer_iter, streaming_query) -> ProjectedHits
- Takes a k-mer iterator and piscem streaming query
- Returns `ProjectedHits` with contig info + reference range
Port equivalence_class_map:
- `tile_ec_ids` → compact vector (unitig tile → EC id)
- `label_list_offsets` → Elias-Fano (EC id → offset into label entries)
- `label_entries` → compact vector (reference IDs in each EC)
- Methods: `entries_for_ec(ec_id)`, `entries_for_tile(tile_id)`, `ec_for_tile(tile_id)`
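The three arrays and three methods compose as in the sketch below, which again uses plain `Vec`s rather than succinct structures. The type name `EqClassMapSketch` and the integer widths are assumptions; only the field and method names come from the list above.

```rust
/// Plain-Vec stand-in for the equivalence class table (hypothetical type).
pub struct EqClassMapSketch {
    /// For each unitig tile, the id of its equivalence class.
    pub tile_ec_ids: Vec<u32>,
    /// For each EC id, the start of its labels; length is number of ECs + 1.
    pub label_list_offsets: Vec<usize>,
    /// Concatenated reference ids for all equivalence classes.
    pub label_entries: Vec<u32>,
}

impl EqClassMapSketch {
    pub fn ec_for_tile(&self, tile_id: usize) -> u32 {
        self.tile_ec_ids[tile_id]
    }

    pub fn entries_for_ec(&self, ec_id: u32) -> &[u32] {
        let begin = self.label_list_offsets[ec_id as usize];
        let end = self.label_list_offsets[ec_id as usize + 1];
        &self.label_entries[begin..end]
    }

    /// Two-step lookup: tile -> EC id -> reference ids.
    pub fn entries_for_tile(&self, tile_id: usize) -> &[u32] {
        self.entries_for_ec(self.ec_for_tile(tile_id))
    }
}
```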
Port build_contig_table.cpp logic:
- Walk all unitigs from Dictionary, resolve reference occurrences, build packed posting lists
- Construct Elias-Fano offsets and compact vector entries
- Serialize to piscem-rs native format
- Rust-built index passes semantic index comparison against C++.
- On-disk index sizes are in expected range (no major inflation).
Objective: Implement poison table generation and query semantics equivalent to C++ optional behavior.
Port poison_table from poison_table.hpp:
- Hash map: canonical k-mer → offset into occurrence array
- Offset array + `poison_occ_t` vector (unitig_id, begin_offset, end_offset entries)
- Query methods: `key_exists()`, `key_occurs_in_unitig()`, `key_occurs_in_unitig_between()`
- Build: `build_from_occs()` processes raw poison k-mer occurrences
- Serialize: `.poison` file
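The query side of such a table might look like the sketch below. The occurrence record (`unitig_id`, `unitig_pos`), the half-open `[lo, hi)` interval, and the `PoisonTableSketch` type are all assumptions made for illustration; the actual `poison_occ_t` fields and interval conventions must be taken from `poison_table.hpp`.

```rust
use std::collections::HashMap;

/// Hypothetical occurrence record: the unitig a poison k-mer touches and where.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PoisonOcc {
    pub unitig_id: u32,
    pub unitig_pos: u32,
}

pub struct PoisonTableSketch {
    /// Canonical k-mer (as a packed integer) -> index into `offsets`.
    pub key_to_idx: HashMap<u64, usize>,
    /// offsets[i]..offsets[i + 1] delimits the occurrences of key i.
    pub offsets: Vec<usize>,
    pub occs: Vec<PoisonOcc>,
}

impl PoisonTableSketch {
    fn occs_for(&self, key: u64) -> &[PoisonOcc] {
        match self.key_to_idx.get(&key) {
            Some(&i) => &self.occs[self.offsets[i]..self.offsets[i + 1]],
            None => &[],
        }
    }

    pub fn key_exists(&self, key: u64) -> bool {
        self.key_to_idx.contains_key(&key)
    }

    pub fn key_occurs_in_unitig(&self, key: u64, unitig_id: u32) -> bool {
        self.occs_for(key).iter().any(|o| o.unitig_id == unitig_id)
    }

    /// True if the key occurs in the unitig at a position in [lo, hi).
    pub fn key_occurs_in_unitig_between(&self, key: u64, unitig_id: u32, lo: u32, hi: u32) -> bool {
        self.occs_for(key)
            .iter()
            .any(|o| o.unitig_id == unitig_id && o.unitig_pos >= lo && o.unitig_pos < hi)
    }
}
```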
- Poison-enabled runs show semantic parity with C++ reference.
Objective: Port exact mapping behavior from C++ core mapping utilities.
Port piscem::streaming_query<with_cache> as a piscem-rs wrapper around sshash_lib::StreamingQueryEngine:
- Holds reference to `ContigTable` for span resolution
- Implements `query_lookup()`: performs sshash lookup, then resolves `contig_span` from contig table
- Tracks direction, remaining contig bases, previous contig ID for extension optimization
- Unitig-end cache: `DashMap<u64, LookupResult>` or similar concurrent map (replaces `boost::concurrent_flat_map`)
Port projected_hits struct and the 4-case decode_hit(v) orientation logic:
- Extract orientation bit and reference position from packed entry
- 4 cases: (contigFW × contigOrientation) → (ref_pos, ref_fw)
Port hit_searcher (~1400 lines of C++):
- `SkipContext` struct: read_pos, contig_pos, expected_kmer, skip logic
- Primary variant `get_raw_hits_sketch` with STRICT/PERMISSIVE
- Original variant `get_raw_hits_sketch_orig`
- Exhaustive variant `get_raw_hits_sketch_everykmer`
- Left/right raw hit vectors: `Vec<(i32, ProjectedHits)>`
Port order: get_raw_hits_sketch (PERMISSIVE) first, then STRICT, then orig, then everykmer.
Port mapping_cache_info and map_read():
- `hit_map: HashMap<u32, SketchHitInfo>` (or nohash variant for integer keys)
- `accepted_hits: Vec<SimpleHit>`
- Two-pass collection with occ recovery
- Ambiguous-hit EC filtering
- Best-hit selection
Port merge_se_mappings():
- Merge left/right single-end mappings for paired-end reads
- Fragment length computation, concordance checks
- `MappingType` determination (Unmapped/SingleMapped/Orphan1/Orphan2/Pair)
- Single-thread mapping outputs semantically identical for controlled fixtures.
- Protocol enum: ChromiumV2, V2_5P, V3, V3_5P, V4_3P, Custom
- Barcode and UMI extraction from read sequences based on geometry
- scRNA RAD emission path
- Paired-end mapping dispatch
- Parity test set for multiple protocol variants
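Geometry-based extraction can be sketched as slicing fixed-length fields off the start of the technical read. The `Geometry` struct and `extract_bc_umi` function are hypothetical names; the lengths shown (16 bp barcode with a 10 bp UMI for Chromium V2, 12 bp UMI for V3) are the commonly documented 10x layouts and should be confirmed against the protocol definitions in piscem-cpp.

```rust
/// Hypothetical fixed-layout geometry: barcode first, then UMI, on read 1.
pub struct Geometry {
    pub bc_len: usize,
    pub umi_len: usize,
}

pub const CHROMIUM_V2: Geometry = Geometry { bc_len: 16, umi_len: 10 };
pub const CHROMIUM_V3: Geometry = Geometry { bc_len: 16, umi_len: 12 };

/// Returns (barcode, UMI) slices, or None if the read is too short.
pub fn extract_bc_umi<'a>(read1: &'a [u8], g: &Geometry) -> Option<(&'a [u8], &'a [u8])> {
    if read1.len() < g.bc_len + g.umi_len {
        return None;
    }
    let (bc, rest) = read1.split_at(g.bc_len);
    Some((bc, &rest[..g.umi_len]))
}
```

Returning borrowed slices avoids per-read allocation; a custom geometry would need a more general description than two lengths.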
- Reuse core engine with bulk-specific options and output semantics.
- Add scATAC-specific technical-sequence extraction and mapping/reporting behavior.
- `bin_pos` helper for position-based binning.
- Per-protocol parity suites passing.
Objective: Reach production-ready speed/memory behavior without sacrificing equivalence.
- Profile hotspots and tune data structures/allocation patterns.
- Validate index size deltas and compactness.
- Add larger regression datasets and CI gates.
- Document reproducibility and benchmark methodology.
- Evaluate unitig-end cache effectiveness and tuning.
- Stable parity + acceptable performance + acceptable on-disk size behavior.
- Compare meaning, not bytes.
- For multithreaded outputs, compare as multisets after normalization.
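An order-insensitive comparison of normalized records can be sketched as a multiset equality check. The function name `multiset_eq` is hypothetical, and the record type is left generic; in the parity harness it would be whatever normalized tuple is extracted from the C++ and Rust outputs.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// True if `a` and `b` contain the same records with the same multiplicities,
/// regardless of order.
pub fn multiset_eq<T: Eq + Hash>(a: &[T], b: &[T]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Count each record in `a`.
    let mut counts: HashMap<&T, usize> = HashMap::new();
    for x in a {
        *counts.entry(x).or_insert(0) += 1;
    }
    // Consume one count per record in `b`; any shortfall means a mismatch.
    for y in b {
        match counts.get_mut(y) {
            Some(c) if *c > 0 => *c -= 1,
            _ => return false,
        }
    }
    // Equal lengths and no shortfall imply every count reached zero.
    true
}
```

Counting instead of sorting avoids requiring an ordering on records and keeps duplicates significant, which matters when the same read maps to several targets.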
- Compare decoded posting lists for sampled/full unitigs:
- unitig -> [(orientation, ref_id, pos), ...]
- Compare equivalence-class interpretation.
- Compare reference metadata (`ref names`, `ref lengths`).
- Track size deltas (`rust_size / cpp_size`) as an explicit metric.
- For randomized and fixture k-mer queries, compare:
- found/not found,
- projected contig id/offset/orientation,
- any optional poison behavior.
- Compare RAD-derived semantic records through `libradicl` normalization.
- Single-thread: exact semantic identity.
- Multi-thread: order-insensitive equivalence only.
- Initial target: match C++ within practical range on representative datasets.
- Follow-up target: improve hot-path performance where possible without changing semantics.
- Keep Rust index disk size close to C++ equivalent.
- Treat persistent, significant size inflation as a release blocker for format/layout tuning.
- Create `piscem-rs` crate scaffold and command skeleton.
- Add local-path `sshash-lib` dependency and compile smoke test.
- Add logging/error conventions and configuration.
- Build fixture runner for C++ vs Rust index/query/map.
- Add normalized comparators for index artifacts.
- Add RAD semantic comparator using `libradicl` `develop` path.
- Implement ContigTable: Elias-Fano offsets + compact vector entries.
- Implement RefInfo: ref names/lengths serialization.
- Implement ReferenceIndex: assemble Dictionary + ContigTable + RefInfo.
- Implement EqClassMap: tile → EC → label entries.
- Implement index build pipeline from FASTA input.
- Implement `decode_hit()` orientation logic (4-case).
- Implement PiscemStreamingQuery wrapping sshash StreamingQueryEngine.
- Add contig table span resolution after sshash lookup.
- Add direction/extension tracking and unitig-end caching.
- Implement SkipContext stateful iterator.
- Port `get_raw_hits_sketch` (PERMISSIVE mode first).
- Port `get_raw_hits_sketch` (STRICT mode).
- Port `get_raw_hits_sketch_orig`.
- Port `get_raw_hits_sketch_everykmer`.
- Implement poison table builder (edge mode parity first).
- Integrate poison-aware query/mapping path.
- Implement `poison_state_t::scan_raw_hits()` equivalent.
- Port `map_read()` per-read mapping kernel.
- Port `sketch_hit_info` with structural constraint chains.
- Port `sketch_hit_info_no_struct_constraint` simple counting.
- Port `mapping_cache_info` per-thread cache.
- Port acceptance filters and tie-breakers exactly.
- Port `merge_se_mappings()` paired-end merge semantics.
- Implement scRNA protocol path (barcode + UMI).
- Implement bulk RNA path.
- Implement scATAC path.
- Add comprehensive parity test matrix in CI.
- Add benchmark suite and disk-size tracking reports.
- Write user-facing docs for all commands and options.
- Index binary compatibility: Should Rust load C++-serialized indices (`.sshash`, `.ctab`, `.refinfo`), or build its own format only? Binary compat enables parity testing against the same index but requires matching the exact C++ serialization.
  - No, binary compatibility is not required, only semantic equivalence of the indices. However, as noted above, efficiency is paramount, so the Rust files should not be much larger than the C++ files.
- Primary hit searcher variant: Which `get_raw_hits_sketch` variant/strategy is the default in production use? (Assumed: `get_raw_hits_sketch` with PERMISSIVE.)
  - Yes, this is the default strategy (and the one we want to focus on getting right first).
- Test data and C++ binary: Are toy datasets and a compiled C++ piscem binary available? Or should the parity harness build piscem-cpp from source?
  - There are toy datasets available, and piscem-cpp can be built from source if the executable needs to be run. The `test_data` directory contains 5 important things:
    1. a folder `gencode_pc_v44_dbg` that contains the input (segment and sequence) file that is necessary for building the sshash component of the index and the inverted tiling index; this is the input to the piscem-cpp `build` command
    2. a folder `gencode_pc_v44_index_nopoison`, which contains the files generated by running the `build` command without poison/decoy sequences
    3. a folder `gencode_pc_v44_index_with_poison`, which contains the files generated by running the `build` command and then running `build-poison-table` on the constructed index (giving `GRCh38.primary_assembly.genome.fa.gz` as the decoy sequence)
    4. the file `GRCh38.primary_assembly.genome.fa.gz`, used as the poison/decoy sequence for the with-poison index above
    5. the file `gencode.v49.pc_transcripts.fa.gz`, which we should not need directly, but which is the source file from which the de Bruijn graph in 1) was generated.
- `libradicl` integration: How should we depend on `libradicl` `develop`? Git dependency? Local path?
  - We should depend on it as a git dependency (on the develop branch). If we decide it needs to be modified for some purpose, I will pull it in locally and we can then depend on and change the local version (whose changes I will push back upstream).
- Structural constraints default: Is `sketch_hit_info` (structural constraints enabled) or `sketch_hit_info_no_struct_constraint` the default? Should we port both from the start?
  - Without structural constraints is the default. We will eventually want both, but we should start with no structural constraints.
- Custom geometry parsing: How important is `CUSTOM` protocol geometry for the initial implementation?
  - We will eventually want this, but we can delay it until after other features are done.
- Elias-Fano / compact_vector implementation: Use the `sucds` crate, or port the C++ `bits::` implementations directly? The C++ uses specific template instantiations (`elias_fano<false, false>`, `compact_vector`).
  - Neither; `sshash-rs` already pulls in `sux-rs` and `cseq`, and we can feel comfortable relying on the latest version of either (or both) of these. In general, we should first prefer a solution from `sux-rs` (https://github.com/vigna/sux-rs), then anything from `bsuccinct-rs` (https://github.com/beling/bsuccinct-rs).
- Semantic drift in nuanced mapping logic
  - Mitigation: golden parity tests at each stage, deterministic single-thread mode.
- Index size inflation in Rust serialization
  - Mitigation: explicit size KPI tracking, bit-packing and succinct structures from start.
- RAD parsing mismatch
  - Mitigation: use `libradicl` `develop` as canonical decode path for comparison.
- Dependency friction between local and CI setups
  - Mitigation: support both local path and Git dependency configuration for `sshash-lib`.
- Const-generic K propagation complexity (NEW)
  - `sshash-rs` requires compile-time `K` via `dispatch_on_k!`. All downstream code (streaming query, hit searcher, mapping engine) must be generic over `K` or invoked inside a dispatch block.
  - Mitigation: establish the `dispatch_on_k!` boundary early (at `ReferenceIndex::load` or mapping entry point) and keep inner code K-generic.
- piscem::streaming_query layer mismatch (NEW)
  - The sshash-rs `StreamingQuery` does NOT include contig table resolution or unitig-end caching. This is a separate layer that must be built in piscem-rs.
  - Mitigation: clearly separate sshash-level query from piscem-level query in the module structure.
- Global mutable state patterns in C++ (NEW)
  - C++ uses `PiscemIndexUtils::ref_shift()`, `CanonicalKmer::k()` as global state. Rust must thread these through as struct fields.
  - Mitigation: define `IndexParams { k, ref_shift, pos_mask }` early and pass by reference.
- Create initial `Cargo.toml` + module skeleton for `piscem-rs`. (Done)
- Wire local-path `sshash-lib` dependency and add build command. (Done)
- Resolve open questions above (especially Q1: binary compatibility).
- Implement Phase 1A: `ContigTable` with Elias-Fano offsets and compact vector entries.
- Implement Phase 1B: `RefInfo` load/save.
- Implement Phase 1C: `ReferenceIndex` assembling Dictionary + ContigTable + RefInfo.
- Add first parity test: load same reference, compare decoded posting lists.