Plan

Progress Overview

  • Phase 1: Core Infrastructure -- Complete
  • Phase 2: Evidence Sources -- Complete
  • Phase 3: LLM Synthesis Layer -- Complete
  • Phase 4: Gap Detection & Analysis -- Complete
  • Phase 5: User Interfaces -- Complete
  • Phase 6: Validation & Hardening -- In Progress
  • Phase 7: Expansion -- Not Started

Phase 1: Core Infrastructure

  • Project scaffolding (pyproject.toml, hatchling build)
  • Variant normalization and input parsing
  • Amino acid code conversion (3-letter/1-letter)
  • Variant type classification (missense, nonsense, indel, frameshift)
  • Async HTTP client infrastructure (httpx)
  • Centralized constants and configuration
  • Logging module (configurable via env var or CLI flag)
  • Pydantic data models for all evidence types
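
The amino acid code conversion item can be illustrated with a minimal sketch. The function name `to_one_letter` and the accepted input pattern are assumptions made for illustration, not the project's actual API; only simple substitutions such as `p.Val600Glu` are handled.

```python
import re

# Standard IUPAC three-letter to one-letter amino acid codes ("Ter" = stop).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}

# Hypothetical pattern: optional "p." prefix, 3-letter ref, position, 3-letter alt or "*".
_PATTERN = re.compile(r"^(?:p\.)?([A-Za-z]{3})(\d+)([A-Za-z]{3}|\*)$")


def to_one_letter(variant: str) -> str:
    """Convert 'p.Val600Glu' to 'V600E'; return the input unchanged if it does not match."""
    match = _PATTERN.match(variant.strip())
    if match is None:
        return variant
    ref, pos, alt = match.groups()
    ref1 = THREE_TO_ONE.get(ref.capitalize())
    alt1 = "*" if alt == "*" else THREE_TO_ONE.get(alt.capitalize())
    if ref1 is None or alt1 is None:
        return variant  # unknown residue code: leave as-is rather than guess
    return f"{ref1}{pos}{alt1}"
```

Inputs that are already in one-letter form (for example `V600E`) do not match the pattern and pass through untouched, so the function is safe to apply to mixed input.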

Phase 2: Evidence Sources

  • MyVariant.info integration (ClinVar, COSMIC, gnomAD, AlphaMissense, CADD)
  • CIViC API client (variant-drug assertions by evidence level)
  • VICC MetaKB client (aggregated knowledge bases)
  • CGI biomarker annotations (local TSV lookup)
  • FDA drug label parsing (biomarker-drug associations)
  • cBioPortal integration (prevalence, co-mutations, tumor-specific studies)
  • DepMap integration (CRISPR essentiality, PRISM drug sensitivity, cell line models)
  • Cancer Hotspots API
  • ClinicalTrials.gov client
  • PubMed literature search
  • Semantic Scholar literature search
  • OncoTree tumor type ontology
  • EvidenceAggregator with parallel fetching and graceful degradation
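
Parallel fetching with graceful degradation could look roughly like the sketch below. It uses only `asyncio` from the standard library; the function name `aggregate`, the `Fetcher` type, and the timeout default are hypothetical and do not reflect the real `EvidenceAggregator` interface.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical fetcher signature: takes a variant string, returns source-specific data.
Fetcher = Callable[[str], Awaitable[Any]]


async def aggregate(
    variant: str, sources: dict[str, Fetcher], timeout: float = 10.0
) -> dict[str, Any]:
    """Query every source concurrently; a failing or slow source yields None."""

    async def guarded(name: str, fetch: Fetcher) -> tuple[str, Any]:
        try:
            return name, await asyncio.wait_for(fetch(variant), timeout)
        except Exception:
            # Degrade gracefully: one broken source must not abort the whole run.
            return name, None

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in sources.items()))
    return dict(pairs)
```

Wrapping each call individually, rather than relying on `asyncio.gather(..., return_exceptions=True)`, keeps the source name attached to each result and lets a per-source timeout apply.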

Phase 3: LLM Synthesis Layer

  • LLM service with multi-provider support (litellm)
  • Research dossier synthesis (5-section structured output)
  • Paper relevance scoring (fast model)
  • Variant knowledge extraction from literature
  • Cross-source drug analysis (separate LLM call)
  • Prompt engineering for grounded synthesis (no hallucination of drugs/data)
  • Match specificity awareness in prompts (variant vs codon vs gene level)
  • JSON response parsing with repair logic
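
As an illustration of JSON response parsing with repair logic, the sketch below handles two common failure modes of LLM output: text or code fences wrapped around the JSON object, and trailing commas. The function name `parse_llm_json` is hypothetical, and the project's actual repair rules may differ.

```python
import json
import re
from typing import Any


def parse_llm_json(text: str) -> Any:
    """Parse JSON from an LLM response, applying light repairs if needed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Repair 1: keep only the outermost {...}, dropping fences or chatter around it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    candidate = text[start : end + 1]
    # Repair 2: drop trailing commas before a closing brace or bracket.
    # Caveat: this would also alter a string value that contains ",}" or ",]".
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)  # still raises if the text is unrecoverable
```

The strict parse is attempted first so that well-formed responses are never modified by the repair steps.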

Phase 4: Gap Detection & Analysis

  • Evidence gap detection with severity scoring (CRITICAL through INFORMATIONAL)
  • Gap categories (therapeutic, functional, biological, literature)
  • Overall evidence quality assessment
  • Research priority computation
  • Well-characterized vs knowledge gap identification
  • Conflicting evidence detection
  • Acquired resistance mutation handling (context-aware)
  • Targetable sensitizing variant tracking (avoid false conflict flags)
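
One way to model gaps with severity and category is sketched below. The plan names only the endpoints of the severity scale (CRITICAL and INFORMATIONAL) and the four categories; the intermediate levels, the class names, and the `prioritize` helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Severity(IntEnum):
    CRITICAL = 4
    HIGH = 3           # assumed intermediate level
    MEDIUM = 2         # assumed intermediate level
    LOW = 1            # assumed intermediate level
    INFORMATIONAL = 0


class Category(Enum):
    THERAPEUTIC = "therapeutic"
    FUNCTIONAL = "functional"
    BIOLOGICAL = "biological"
    LITERATURE = "literature"


@dataclass(frozen=True)
class Gap:
    category: Category
    severity: Severity
    description: str


def prioritize(gaps: list[Gap]) -> list[Gap]:
    """Order gaps from most to least severe; ties keep their input order."""
    return sorted(gaps, key=lambda g: g.severity, reverse=True)
```

Using an `IntEnum` for severity makes levels directly comparable, which keeps sorting and threshold checks simple.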

Phase 5: User Interfaces

  • CLI with insight command (single variant)
  • CLI with batch command (JSON input, multiple variants)
  • CLI with annotate command (VCF file annotation)
  • CLI with version command
  • Rich terminal output (panels, color coding, gap severity indicators)
  • Streamlit web UI with tabbed evidence display
  • LLM synthesis rendering in Streamlit
  • JSON export from both CLI and UI
  • Docker deployment (Dockerfile for HuggingFace Spaces)

Phase 6: Validation & Hardening

  • Manual validation across 15 variant/source checks (100% match)
  • Unit tests for LLM service (mocked)
  • Systematic validation pipeline with domain expert review
  • Automated testing across representative variant sets
  • Negation detection in FDA label parsing
  • Edge case handling for rare variant-disease pairings
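
Negation detection in label text could start from a simple cue-based heuristic, sketched below. This is not the project's method: the cue list and the function name `is_negated` are hypothetical, and the rule (a negation cue anywhere before the biomarker mention in the same sentence) will produce false positives on sentences such as "patients who have not received prior therapy and have BRAF V600E".

```python
import re

# Illustrative cue list; a real label parser would need a curated vocabulary.
_NEGATION = re.compile(
    r"\b(not|no|without|absence of|negative for)\b", re.IGNORECASE
)


def is_negated(sentence: str, biomarker: str) -> bool:
    """Return True if a negation cue precedes the biomarker mention in the sentence."""
    idx = sentence.lower().find(biomarker.lower())
    if idx == -1:
        return False  # biomarker not mentioned at all
    return _NEGATION.search(sentence[:idx]) is not None
```

A scoped approach that limits the cue to a window of tokens around the mention would reduce false positives, at the cost of extra tuning.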

Phase 7: Expansion

  • Structural variant support (fusions, amplifications, copy-number variants)
  • Additional tumor types beyond current coverage
  • Pre-fetching and caching for top cancer genes
  • CSV and Markdown output formats (CLI currently outputs JSON only)
  • Per-variant output splitting in batch mode