- Phase 1: Core Infrastructure -- Complete
- Phase 2: Evidence Sources -- Complete
- Phase 3: LLM Synthesis Layer -- Complete
- Phase 4: Gap Detection & Analysis -- Complete
- Phase 5: User Interfaces -- Complete
- Phase 6: Validation & Hardening -- In Progress
- Phase 7: Expansion -- Not Started
- Project scaffolding (pyproject.toml, hatchling build)
- Variant normalization and input parsing
- Amino acid code conversion (3-letter/1-letter)
- Variant type classification (missense, nonsense, indel, frameshift)
- Async HTTP client infrastructure (httpx)
- Centralized constants and configuration
- Logging module (configurable via env var or CLI flag)
- Pydantic data models for all evidence types
- MyVariant.info integration (ClinVar, COSMIC, gnomAD, AlphaMissense, CADD)
- CIViC API client (variant-drug assertions by evidence level)
- VICC MetaKB client (aggregated knowledge bases)
- CGI biomarker annotations (local TSV lookup)
- FDA drug label parsing (biomarker-drug associations)
- cBioPortal integration (prevalence, co-mutations, tumor-specific studies)
- DepMap integration (CRISPR essentiality, PRISM drug sensitivity, cell line models)
- Cancer Hotspots API
- ClinicalTrials.gov client
- PubMed literature search
- Semantic Scholar literature search
- OncoTree tumor type ontology
- EvidenceAggregator with parallel fetching and graceful degradation
- LLM service with multi-provider support (litellm)
- Research dossier synthesis (5-section structured output)
- Paper relevance scoring (fast model)
- Variant knowledge extraction from literature
- Cross-source drug analysis (separate LLM call)
- Prompt engineering for grounded synthesis (no hallucination of drugs/data)
- Match specificity awareness in prompts (variant vs codon vs gene level)
- JSON response parsing with repair logic
- Evidence gap detection with severity scoring (CRITICAL through INFORMATIONAL)
- Gap categories (therapeutic, functional, biological, literature)
- Overall evidence quality assessment
- Research priority computation
- Well-characterized vs knowledge gap identification
- Conflicting evidence detection
- Acquired resistance mutation handling (context-aware)
- Targetable sensitizing variant tracking (avoid false conflict flags)
- CLI with `insight` command (single variant)
- CLI with `batch` command (JSON input, multiple variants)
- CLI with `annotate` command (VCF file annotation)
- CLI with `version` command
- Rich terminal output (panels, color coding, gap severity indicators)
- Streamlit web UI with tabbed evidence display
- LLM synthesis rendering in Streamlit
- JSON export from both CLI and UI
- Docker deployment (Dockerfile for HuggingFace Spaces)
- Manual validation across 15 variant/source checks (100% match)
- Unit tests for LLM service (mocked)
- Systematic validation pipeline with domain expert review
- Automated testing across representative variant sets
- Negation detection in FDA label parsing
- Edge case handling for rare variant-disease pairings
- Structural variant support (fusions, amplifications, copy-number variants)
- Additional tumor types beyond current coverage
- Pre-fetching and caching for top cancer genes
- CSV and Markdown output formats (CLI currently outputs JSON only)
- Per-variant output splitting in batch mode
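
One item listed above, parallel fetching with graceful degradation, can be illustrated with a minimal sketch. It is not the project's `EvidenceAggregator`: the function names (`aggregate`, `fetch_ok`, `fetch_broken`), the return shape, and the timeout value are all assumptions made for illustration, and the stand-in fetchers replace real HTTP clients. The idea shown is that every source is queried concurrently and a failing or slow source is recorded as an error rather than aborting the whole lookup.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical type for a source client: takes a variant string, returns evidence.
Fetcher = Callable[[str], Awaitable[dict]]


async def fetch_ok(variant: str) -> dict:
    """Stand-in for a source that responds normally (illustrative only)."""
    await asyncio.sleep(0)
    return {"variant": variant, "hits": 1}


async def fetch_broken(variant: str) -> dict:
    """Stand-in for a source that is down (illustrative only)."""
    raise ConnectionError("source unavailable")


async def aggregate(
    variant: str, sources: Dict[str, Fetcher], timeout: float = 5.0
) -> dict:
    """Query all sources concurrently; keep partial results if some fail."""

    async def guarded(fetch: Fetcher) -> dict:
        # Bound each source individually so one slow API cannot stall the rest.
        return await asyncio.wait_for(fetch(variant), timeout)

    names = list(sources)
    # return_exceptions=True makes gather return exceptions as values
    # instead of raising on the first failure.
    results = await asyncio.gather(
        *(guarded(sources[name]) for name in names), return_exceptions=True
    )

    evidence: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            errors[name] = f"{type(result).__name__}: {result}"
        else:
            evidence[name] = result
    return {"variant": variant, "evidence": evidence, "errors": errors}
```

Calling `asyncio.run(aggregate("BRAF V600E", {"ok": fetch_ok, "broken": fetch_broken}))` returns the evidence from the working source alongside an `errors` entry for the failed one, so downstream steps can note which sources were unavailable.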