OncoMind

title: OncoMind
emoji: 🚀
colorFrom: blue
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
pinned: false

OncoMind

Cancer Research Copilot for Gap Analysis

IN PROGRESS

Research intelligence for cancer variants. Find the gaps, not just the facts.

For BRAF V600E, databases already agree. For the next 10,000 variants, the key question is "what don't we know yet?"

OncoMind is a research intelligence platform that identifies evidence gaps in cancer variant knowledge—surfacing where research is thin, conflicting, or missing entirely. It's built for translational teams and small biotechs deciding which variants are worth a project, not for treating individual patients.

How the LLM fits in

The LLM is not the annotation engine—it's the synthesis layer on top of structured data:

Deterministic backbone first: Evidence is aggregated from 8+ databases (CIViC, ClinVar, CGI, VICC, FDA labels, COSMIC, cBioPortal, DepMap) with match specificity tracking (variant vs codon vs gene-level)
Gap detection is rule-based: Missing evidence, source conflicts, and tumor-type extrapolation concerns are computed deterministically before the LLM sees anything
LLM synthesizes, doesn't invent: The LLM receives pre-structured evidence blocks with explicit provenance. It generates:
- Functional/biological/therapeutic summaries grounded in the evidence provided
- Research hypotheses tied to specific identified gaps
- Cross-source drug analysis highlighting corroboration and conflicts
Calibrated to evidence quality: When evidence is sparse, the LLM is instructed to stay generic and highlight unknowns—not fill gaps with training data
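
To make the "rule-based before the LLM" idea concrete, here is a minimal sketch of deterministic gap detection. It is an illustration only, not OncoMind's actual rules: the `Gap` and `Severity` types, the `detect_gaps` function, and the two intermediate severity levels are all assumptions (this README names only CRITICAL and INFORMATIONAL).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Only CRITICAL and INFORMATIONAL are named in this README;
    # the two intermediate levels are placeholders for the sketch.
    INFORMATIONAL = 0
    MINOR = 1
    SIGNIFICANT = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Gap:
    category: str
    severity: Severity
    description: str


def detect_gaps(match_levels_by_source: dict[str, list[str]]) -> list[Gap]:
    """Rule-based gap detection over {source: [match levels found]}."""
    gaps: list[Gap] = []
    all_levels = [lvl for levels in match_levels_by_source.values() for lvl in levels]
    if not all_levels:
        gaps.append(Gap("evidence", Severity.CRITICAL, "No evidence in any source"))
    elif "variant" not in all_levels:
        gaps.append(Gap("specificity", Severity.SIGNIFICANT,
                        "Only codon- or gene-level evidence; nothing variant-specific"))
    for source, levels in match_levels_by_source.items():
        # Per-source coverage gaps only matter when some other source has data.
        if all_levels and not levels:
            gaps.append(Gap("coverage", Severity.INFORMATIONAL, f"No entries in {source}"))
    # Most severe gaps first.
    return sorted(gaps, key=lambda g: g.severity, reverse=True)


for gap in detect_gaps({"CIViC": ["gene"], "ClinVar": []}):
    print(gap.severity.name, gap.category, gap.description)
```

Because the rules run on structured counts rather than free text, the same input always produces the same gap list, and the LLM only ever sees gaps that were computed this way.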

Status: Proof-of-Concept / Architectural Demo

This demonstrates an approach to systematic evidence gap detection. It is research use only — not for diagnosis, treatment selection, or any clinical decision-making.

⚠️ SNPs and small indels only. Fusions, amplifications, and copy-number variants are not yet supported.

What Makes It Different

Feature	Typical tools	OncoMind
Primary question	"What is this variant?"	"What don't we know yet?"
Knowledge gaps	Rarely explicit	First-class outputs with severity scoring
Source conflicts	Buried in details	Detected, surfaced, and explained
Match specificity	Not tracked	Variant vs codon vs gene-level evidence labeled
Source attribution	Often missing	Every claim linked to PMID, FDA label, or DB entry
Cancer hotspots	Binary yes/no	+ Adjacent hotspot detection (±5 codons)
Research hypotheses	None	Generated with evidence basis tags
LLM synthesis	Generic summaries	Grounded in structured evidence backbone
LLM Cross-source drug analysis	Manual comparison	Corroboration, conflicts, and emerging targets surfaced
Output	Static clinical-style notes	LLM-ready context blocks with receipts
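
The adjacent-hotspot row can be illustrated with a short sketch. The function name, return labels, and the example hotspot set are hypothetical; only the ±5 codon window comes from the table above.

```python
def hotspot_status(position: int, hotspot_codons: set[int], window: int = 5) -> str:
    """Classify a codon as 'hotspot', 'adjacent', or 'none'."""
    if position in hotspot_codons:
        return "hotspot"
    # Adjacent means within `window` codons of any known hotspot.
    if any(abs(position - h) <= window for h in hotspot_codons):
        return "adjacent"
    return "none"


# Illustrative hotspot set containing only codon 600.
known = {600}
for codon in (600, 597, 606):
    print(codon, hotspot_status(codon, known))
```

Reporting "adjacent" separately from a plain yes/no keeps variants near a hotspot visible as candidates for follow-up instead of silently grouping them with everything else.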

Screenshots

Gap Analysis

LLM Research Synthesis

What This Demonstrates

Architecture & Integration

Multi-source evidence aggregation (CIViC, ClinVar, CGI, VICC, DepMap, cBioPortal) with conflict detection
Evidence hierarchy (variant > codon > gene level) with match specificity tracking
Resistance mechanism annotation using cross-database validation
Gap detection with severity scoring (CRITICAL → INFORMATIONAL)
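
As a rough picture of cross-database conflict detection, the sketch below groups therapeutic records by drug and checks whether sources agree. The record shape `(drug, source, direction)`, the direction labels, and the output keys are assumptions made for this example, not the project's schema.

```python
from collections import defaultdict


def classify_drugs(records: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Group (drug, source, direction) records.

    `direction` is assumed to be 'sensitivity' or 'resistance'.
    """
    sources_by_drug: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for drug, source, direction in records:
        sources_by_drug[drug.lower()][direction].add(source)

    result: dict[str, list[str]] = {"corroborated": [], "conflicting": [], "single_source": []}
    for drug, by_direction in sorted(sources_by_drug.items()):
        all_sources = set().union(*by_direction.values())
        if len(by_direction) > 1:
            # Both sensitivity and resistance were reported for the same drug.
            result["conflicting"].append(drug)
        elif len(all_sources) > 1:
            # Several independent sources agree on the direction.
            result["corroborated"].append(drug)
        else:
            result["single_source"].append(drug)
    return result


# Placeholder drug names; real records would come from the aggregated sources.
records = [
    ("drug_a", "CIViC", "sensitivity"),
    ("drug_a", "CGI", "sensitivity"),
    ("drug_b", "CIViC", "sensitivity"),
    ("drug_b", "VICC", "resistance"),
    ("drug_c", "CGI", "sensitivity"),
]
print(classify_drugs(records))
```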

Technical Decisions

Structured data extraction before LLM synthesis (reduces hallucination risk)
Evidence provenance tracking across 8+ databases
Context-aware research hypothesis generation
Deterministic annotation backbone + optional LLM research layer
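
One way to picture "structured extraction before LLM synthesis" is a prompt assembled entirely from pre-structured records, each carrying its own provenance. The `EvidenceItem` schema, function names, and prompt wording below are hypothetical; the actual prompts are not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceItem:
    """One pre-structured evidence record (hypothetical schema)."""
    source: str        # e.g. "CIViC", "FDA"
    match_level: str   # "variant", "codon", or "gene"
    claim: str         # short statement taken from the source
    reference: str     # PMID, FDA label, or DB entry ID


def render_evidence_block(items: list[EvidenceItem]) -> str:
    """Format evidence so every line carries its own provenance."""
    if not items:
        # Sparse evidence: say so explicitly instead of leaving room to improvise.
        return "NO STRUCTURED EVIDENCE FOUND. Highlight unknowns; do not speculate."
    return "\n".join(
        f"[{i}] ({item.source}, {item.match_level}-level, ref={item.reference}) {item.claim}"
        for i, item in enumerate(items, start=1)
    )


def build_prompt(gene: str, variant: str, items: list[EvidenceItem]) -> str:
    """Assemble the synthesis prompt from deterministic inputs only."""
    return (
        f"Summarize research evidence for {gene} {variant}.\n"
        "Use only the numbered evidence below and cite items by number.\n\n"
        f"{render_evidence_block(items)}"
    )


# Placeholder claim and reference, for illustration only.
example = [EvidenceItem("CIViC", "variant", "placeholder claim text", "REF-1")]
print(build_prompt("BRAF", "V600E", example))
```

Numbering the evidence lines gives the model something specific to cite, which makes unsupported statements easier to spot during review.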

Known Limitations (By Design)

This is a proof-of-concept, not production-ready software. Known issues include:

Validation: Needs systematic validation, both at the technical level and by a subject-matter expert (SME).
SNPs and small indels only: Fusions, amplifications, and copy-number variants are not yet supported.
Negation detection: FDA label parsing may miss negative indicators ("not demonstrated", "not approved")
Edge cases: Rare variant-disease pairings may have inconsistent evidence grading
Display formatting: Some compound identifiers (CAS numbers) may appear in clinical evidence sections
LLM variability: Research hypothesis quality varies; some may be speculative
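
The negation-detection limitation is easiest to see with a toy example. The sketch below flags sentences that mention a term together with a negation cue; the cue list, function name, and sample text are invented for illustration, and a real label parser would need a far broader, validated approach.

```python
import re

# Illustrative cue list only; real label text uses many more phrasings.
NEGATION_CUES = re.compile(
    r"\b(?:not\s+(?:been\s+)?(?:demonstrated|approved|established|indicated)"
    r"|no\s+evidence\s+of)\b",
    re.IGNORECASE,
)


def mentions_with_negation(label_text: str, term: str) -> list[tuple[str, bool]]:
    """Return (sentence, is_negated) for each sentence that mentions `term`."""
    sentences = re.split(r"(?<=[.;])\s+", label_text.strip())
    results = []
    for sentence in sentences:
        if term.lower() in sentence.lower():
            results.append((sentence, bool(NEGATION_CUES.search(sentence))))
    return results


# Invented sample text, not quoted from any real label.
sample = (
    "Efficacy has not been demonstrated in patients with wild-type BRAF. "
    "Indicated for patients with BRAF V600E mutation."
)
for sentence, negated in mentions_with_negation(sample, "BRAF"):
    print(negated, sentence)
```

A plain keyword match on "BRAF" would treat both sentences as positive evidence, which is exactly the failure mode described above.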

Production deployment would require:

Systematic validation pipeline with domain expert review
Robust regulatory text parsing (negation detection, contraindications)
Automated testing across representative variant sets
Human-in-the-loop review for high-stakes clinical use cases

Use Cases

Good for:

Exploring architectural approaches to clinical evidence synthesis
Understanding systematic challenges in multi-database genomics integration
Generating research directions for understudied variants
Portfolio demonstration of domain expertise + technical execution
Prioritizing targets, models, or combination strategies for small biotechs
Planning functional studies or resistance screens for academic labs
Triaging large variant lists from NGS or CRISPR screens

Not suitable for:

Clinical decision-making (use validated tools like OncoKB, CIViC)
Regulatory submissions
Production therapeutic recommendations without expert review

Why This Approach?

Traditional variant knowledgebases focus on summarizing what's known. OncoMind inverts this: it systematically identifies what's unknown to guide research prioritization. The gap detection architecture could inform:

Research funding decisions
Clinical trial design
Functional validation studies

The platform works in two layers:

Deterministic annotation backbone – structured evidence from knowledge bases, trials, cBioPortal, DepMap, Hotspots, literature
Optional LLM research layer – highlights gaps and drafts hypotheses, constrained by that backbone

Quick Start

Install

git clone https://github.com/dami-gupta-git/onco_mind_v0.git
cd onco_mind_v0

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -e ".[dev]"

If you hit a ModuleNotFoundError after pulling updates, reinstall with:

pip install -e . --force-reinstall

Streamlit UI (Recommended)

The interactive web interface is the easiest way to explore variants:

cd streamlit
streamlit run app.py

Features:

Enter any gene + variant + tumor type
Browse evidence by source (CIViC, VICC, CGI, FDA, DepMap, etc.)
View gap analysis with severity scoring
See match specificity (variant vs codon vs gene level)
Cross-source drug analysis identifying corroboration and conflicts
Export results to JSON

CLI

# Annotation backbone (~7s)
mind insight BRAF V600E --tumor Melanoma

# Backbone + literature search (~15s)
mind insight EGFR L858R -t NSCLC --lit

# Research mode: backbone + LLM (~20s)
mind insight MAP2K1 P124L -t Melanoma --llm

# Full: backbone + literature + LLM (~25s)
mind insight KRAS G12D -t CRC --full

# Save to JSON
mind insight EGFR L858R -t NSCLC --llm --output result.json

# Debug logging
mind insight BRAF V600E --log-level DEBUG

Modes

Mode	Flag	Output
Annotation	(none)	Structured evidence from all data sources
Literature	`--lit`	+ PubMed / Semantic Scholar hits
LLM	`--llm`	+ Research narrative and gap analysis
Full	`--full`	Annotation + literature + LLM layer
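
For triaging a list of variants, the CLI calls above can be scripted. The sketch below only builds the command lines, using flags that appear in this README (`-t`, `--lit`, `--llm`, `--full`, `--output`); the helper name, the output directory, and the file naming scheme are assumptions.

```python
import shlex
from pathlib import Path


def build_command(gene: str, variant: str, tumor: str, out_dir: Path,
                  mode_flag: str = "") -> list[str]:
    """Build one `mind insight` invocation as an argument list."""
    # Note: variants such as R248* contain characters that may need
    # sanitizing before being used in a file name.
    output = out_dir / f"{gene}_{variant}.json"
    cmd = ["mind", "insight", gene, variant, "-t", tumor]
    if mode_flag:  # "--lit", "--llm", or "--full"
        cmd.append(mode_flag)
    cmd += ["--output", str(output)]
    return cmd


variants = [("BRAF", "V600E", "Melanoma"), ("KRAS", "G12D", "CRC")]
for gene, variant, tumor in variants:
    print(shlex.join(build_command(gene, variant, tumor, Path("results"), "--llm")))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once the `results` directory exists.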

What You Get

Evidence Backbone

OncoMind constructs a structured, variant-centric evidence model:

Clinical / KB: CIViC, VICC MetaKB, ClinVar, COSMIC, CGI, FDA labels
Functional: AlphaMissense, CADD, PolyPhen2, gnomAD
Biological: cBioPortal prevalence and co-mutation structure
Preclinical: DepMap CRISPR essentiality and PRISM drug response
Trials: ClinicalTrials.gov
Literature: PubMed / Semantic Scholar summarized in a LiteratureEvidence model

Match specificity is tracked so you can separate variant-level from gene-level signals:

Match level	Meaning	Example
`variant`	Exact amino-acid change	BRAF V600E specific data
`codon`	Same residue, different change	BRAF V600K in "V600 variants"
`gene`	Gene-level-only evidence	"BRAF mutation" basket trials
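
The three match levels in the table can be sketched as a small classifier over protein-change strings. This is a simplified illustration that assumes single-letter notation and a hypothetical `match_level` helper; it is not the project's matching logic.

```python
import re
from typing import Optional

# Single-letter protein change, e.g. V600E, p.V600E, R248*, K132fs, or bare V600.
PROTEIN_CHANGE = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z*]|fs)?$")


def match_level(query: str, evidence_variant: Optional[str]) -> str:
    """Return 'variant', 'codon', or 'gene' for one evidence record."""
    if evidence_variant is None:
        return "gene"  # evidence names only the gene
    q = PROTEIN_CHANGE.match(query)
    e = PROTEIN_CHANGE.match(evidence_variant)
    if not q or not e:
        return "gene"  # unparseable labels fall back to the weakest level
    if q.groups() == e.groups():
        return "variant"
    if q.group(1, 2) == e.group(1, 2):
        return "codon"  # same residue and position, different change
    return "gene"


for evidence in ("p.V600E", "V600K", "V600", None):
    print(evidence, "->", match_level("V600E", evidence))
```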

Research Insight (LLM Layer)

When enabled, OncoMind adds a research card on top of the evidence backbone:

llm_summary – concise synthesis of function, biology, and therapeutic landscape
evidence_quality – comprehensive / moderate / limited / minimal
knowledge_gaps / well_characterized – structured view of what's missing vs solid
research_implications – short, testable hypotheses
key_references – PMIDs, trials, and KB IDs supporting the card
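
The card fields listed above map naturally onto a small data model. The sketch below uses the field names from this section, but the types, defaults, and validation are assumptions rather than the project's actual model.

```python
import json
from dataclasses import asdict, dataclass, field

# The four quality labels listed above.
EVIDENCE_QUALITY_LEVELS = ("comprehensive", "moderate", "limited", "minimal")


@dataclass
class ResearchInsight:
    llm_summary: str
    evidence_quality: str
    knowledge_gaps: list[str] = field(default_factory=list)
    well_characterized: list[str] = field(default_factory=list)
    research_implications: list[str] = field(default_factory=list)
    key_references: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject labels outside the documented set.
        if self.evidence_quality not in EVIDENCE_QUALITY_LEVELS:
            raise ValueError(f"unknown evidence_quality: {self.evidence_quality!r}")


card = ResearchInsight(llm_summary="placeholder summary", evidence_quality="limited")
print(json.dumps(asdict(card), indent=2))
```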

Cross-Source Drug Analysis (LLM Layer)

A separate LLM analysis that synthesizes therapeutic evidence across CGI, CIViC, VICC, and Literature sources:

Strongest Evidence – Drugs with corroboration across multiple independent sources, with biological rationale
Conflicting Signals – Drugs where sources disagree, with likely explanations (tumor type differences, sequential therapy, acquired mutations)
Emerging Targets – Single-source preclinical or early-phase evidence worth investigating
Key Gaps – Expected drugs not found, tumor type extrapolation concerns

This runs in parallel with the main LLM synthesis for faster response times.
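
Running the two LLM calls concurrently might look like the sketch below. The coroutine names are stand-ins and the sleeps simulate network latency; how OncoMind actually schedules the calls is not shown in this README.

```python
import asyncio


async def synthesize_research(evidence: str) -> str:
    """Stand-in for the main LLM synthesis call."""
    await asyncio.sleep(0.1)  # simulates network latency
    return f"research card for: {evidence}"


async def analyze_drugs(evidence: str) -> str:
    """Stand-in for the cross-source drug analysis call."""
    await asyncio.sleep(0.1)
    return f"drug analysis for: {evidence}"


async def run_llm_layer(evidence: str) -> tuple[str, str]:
    # Both calls start immediately, so the total wait is roughly
    # the slower of the two rather than their sum.
    card, drugs = await asyncio.gather(
        synthesize_research(evidence),
        analyze_drugs(evidence),
    )
    return card, drugs


card, drugs = asyncio.run(run_llm_layer("BRAF V600E evidence block"))
print(card)
print(drugs)
```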

Configuration

Create a .env file (or set as Hugging Face Secrets):

# Required for LLM research mode (Gemini 2.0 Flash is the default)
GOOGLE_API_KEY=your-google-api-key
# Optional: use Anthropic models instead
ANTHROPIC_API_KEY=your-anthropic-key
# Optional: use OpenAI models instead
OPENAI_API_KEY=your-openai-key
# Optional: better literature context
SEMANTIC_SCHOLAR_API_KEY=your-s2-key

Supported LLM models:

gemini/gemini-2.0-flash (default, fast)
gemini/gemini-1.5-pro
claude-sonnet-4-20250514
claude-3-5-haiku-20241022
gpt-4o-mini, gpt-4o, gpt-4-turbo
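
As an illustration of how the keys and model names relate, the sketch below picks a model based on which API key is present. The fallback order and the `pick_model` helper are assumptions; the README states only that Gemini 2.0 Flash is the default.

```python
import os
from typing import Mapping, Optional

# Model names from the list above, paired with the key each provider needs.
# The order after the default is an assumption for this sketch.
MODEL_KEYS = [
    ("gemini/gemini-2.0-flash", "GOOGLE_API_KEY"),
    ("claude-sonnet-4-20250514", "ANTHROPIC_API_KEY"),
    ("gpt-4o-mini", "OPENAI_API_KEY"),
]


def pick_model(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first model whose API key is set, or None if no key is set."""
    for model, key in MODEL_KEYS:
        if env.get(key):
            return model
    return None


print(pick_model({"ANTHROPIC_API_KEY": "your-anthropic-key"}))
```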

Supported Variant Types

Currently supports:

Missense (e.g., V600E, L858R)
Nonsense (e.g., R248*)
Small indels (e.g., E746_A750del)
Frameshift (e.g., K132fs)

Variants can be provided as simple protein changes (V600E, p.V600E) or in HGVS notation; normalization is handled under the hood.
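
A minimal sketch of what such normalization could involve is shown below: stripping the `p.` prefix and converting three-letter residue codes to single letters. The helper name is hypothetical and the real normalizer likely handles more cases.

```python
import re

# Standard three-letter to one-letter amino acid codes, plus Ter for stop.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Gln": "Q",
    "Glu": "E", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V", "Ter": "*",
}
_THREE_LETTER = re.compile("|".join(THREE_TO_ONE))


def normalize_protein_change(raw: str) -> str:
    """Reduce p.V600E, p.(Val600Glu), etc. to the short form V600E."""
    text = raw.strip()
    if text.startswith("p."):
        text = text[2:]
    text = text.strip("()")
    # Replace every three-letter residue code with its one-letter code.
    return _THREE_LETTER.sub(lambda m: THREE_TO_ONE[m.group(0)], text)


for raw in ("V600E", "p.V600E", "p.Val600Glu", "p.Glu746_Ala750del", "p.Arg248Ter"):
    print(raw, "->", normalize_protein_change(raw))
```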

Planned: fusions, amplifications, and copy-number variants.

Data Sources

Clinical & Therapeutic

Source	Data Type	Access
CIViC	Curated variant–drug associations	API / dump
VICC MetaKB	Aggregated knowledge bases	API
ClinVar	Clinical significance	Via aggregation layer
COSMIC	Somatic mutation catalog	Via aggregation layer
CGI	Biomarker annotations	Local DB
FDA	Drug approvals	Public APIs / labels
ClinicalTrials.gov	Active and historical trials	Public API

Functional & Biological

Source	Data Type	Access
cBioPortal	Co-mutation patterns, prevalence	API
AlphaMissense	Pathogenicity predictions	Precomputed scores
gnomAD	Population frequencies	Via aggregation layer

Literature

Source	Data Type	Access
Semantic Scholar	AI-powered literature search	API
PubMed	Biomedical literature	E-utilities

Preclinical Research

Source	Data Type	Access
DepMap	Gene essentiality (CRISPR), drug sensitivity (PRISM), cell line models	API / downloads

Development

pytest tests/unit/ -v
pytest tests/unit/ --cov=src/oncomind --cov-report=html

mypy src/oncomind
ruff check src/oncomind
ruff format src/oncomind

Validation

Variant	Tumor Type	Source	Expected Data	OncoMind Match
BRAF V600E	Melanoma	FDA Labels	Trametinib (Mekinist), Trametinib + Dabrafenib (Mekinist + Tafinlar), Encorafenib + Binimetinib (Braftovi + Mektovi), Vemurafenib (Zelboraf), Cobimetinib + Vemurafenib (Cotellic + Zelboraf), Dabrafenib (Tafinlar), Atezolizumab + Cobimetinib + Vemurafenib (Tecentriq + Cotellic + Zelboraf), Atezolizumab and Hyaluronidase-tqjs + Cobimetinib + Vemurafenib (Tecentriq Hybreza + Cotellic + Zelboraf)	✓
BRAF V600E	Melanoma	CIViC	Level A, 5 evidence items	✓
BRAF V600E	Melanoma	ClinVar	Pathogenic	✓
BRAF V600E	Melanoma	cBioPortal	Melanoma (MSK, Clin Cancer Res 2021)	✓
EGFR L858R	NSCLC	FDA Labels	Gefitinib (Iressa), Erlotinib (Tarceva), Afatinib (Gilotrif), Dacomitinib (Vizimpro), Osimertinib (Tagrisso), Osimertinib + pemetrexed and platinum-based chemotherapy (Tagrisso), Amivantamab (Rybrevant) + lazertinib (Lazcluze), Amivantamab (Rybrevant) + carboplatin + pemetrexed, Amivantamab and hyaluronidase-lpuj (Rybrevant Faspro)	✓
EGFR L858R	NSCLC	CIViC	Level A, 14 evidence items	✓
EGFR L858R	NSCLC	DepMap	Gene essentiality - EGFR is not essential	✓
KRAS G12C	NSCLC	FDA Labels	Sotorasib, adagrasib	✓
KRAS G12C	NSCLC	ClinVar	Pathogenic	✓
PIK3CA H1047R	Breast	FDA Labels	Inavolisib + palbociclib + fulvestrant, Alpelisib + fulvestrant, Capivasertib + fulvestrant	✓
PIK3CA H1047R	Breast	CIViC	3 Level A variant-specific items	✓
IDH1 R132H	Glioma	FDA Labels	Vorasidenib (Voranigo)	✓
IDH1 R132H	Glioma	ClinVar	Pathogenic	✓
IDH1 R132H	Glioma	Hotspots	Variant is at a known cancer hotspot	✓
ERBB2 S310F	Breast	CIViC	Level B, 3 evidence items	✓

Summary:

Total checks: 15
Matches: 15 (100%)

License

MIT License – see LICENSE.

Acknowledgments

Built on the work of CIViC, VICC, MyVariant.info, DepMap, Semantic Scholar, cBioPortal, and the broader open-data oncology community.
