Comprehensive documentation for the Multi-Modal Academic Research System.
Documentation Index - Complete guide to all documentation files, quick links, and common tasks.
All module documentation is located in docs/modules/:
- Data Collectors (16KB)
  - Collecting papers from ArXiv, PubMed Central, Semantic Scholar
  - YouTube educational video collection with transcripts
  - Podcast episode collection from RSS feeds
- Data Processors (21KB)
  - PDF text and image extraction
  - Gemini Vision for diagram analysis
  - Video content processing
- Indexing (25KB)
  - OpenSearch index management
  - Hybrid search (BM25 + semantic)
  - Embedding generation
- Database (24KB)
  - SQLite collection tracking
  - Statistics and analytics
  - Search and filtering
- API (22KB)
  - FastAPI REST endpoints
  - Request/response formats
  - Deployment guides
- Orchestration (24KB)
  - LangChain query pipeline
  - Citation extraction and tracking
  - Bibliography export
- UI (23KB)
  - Gradio interface
  - User workflows
  - Tab-by-tab guide
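The bibliography export listed under Orchestration can be pictured with a small sketch. This is not the project's exporter: the `to_bibtex` helper, the `year` field, and the citation-key scheme are all assumptions made for illustration; only `title` and `authors` appear in this README's own example.

```python
import re


def to_bibtex(paper: dict) -> str:
    """Render one paper dict as a BibTeX @article entry (hypothetical helper)."""
    authors = paper.get("authors", [])
    year = str(paper.get("year", "n.d."))  # "year" is an assumed field
    # Citation key: first author's surname plus year, lowercased, alphanumeric only.
    surname = authors[0].split()[-1] if authors else "anon"
    key = re.sub(r"[^a-z0-9]", "", (surname + year).lower())
    fields = {
        "title": paper.get("title", ""),
        "author": " and ".join(authors),  # BibTeX separates authors with "and"
        "year": year,
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"
```

Calling `to_bibtex(paper)` on each collected paper and joining the results with blank lines would give a `.bib` file body; orchestration.md describes the export the project actually provides.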
```
docs/
├── README.md                    # This file
├── DOCUMENTATION_INDEX.md       # Complete documentation index
└── modules/
    ├── data-collectors.md       # Data collection from various sources
    ├── data-processors.md       # Content processing with Gemini
    ├── indexing.md              # OpenSearch integration
    ├── database.md              # SQLite tracking
    ├── api.md                   # FastAPI REST API
    ├── orchestration.md         # LangChain query pipeline
    └── ui.md                    # Gradio interface
```
- Read UI Documentation to understand the interface
- Check Data Collectors to learn about data sources
- Explore Orchestration to understand how queries work
- Start with Documentation Index for architecture overview
- Read Indexing and Database for core infrastructure
- Study Data Processors for content processing pipeline
- Review API for programmatic access
- Go directly to API Documentation
- Check Database for data models
- See Indexing for search capabilities
- ✅ ArXiv papers with PDF download
- ✅ YouTube videos with transcripts
- ✅ Podcasts from RSS feeds
- ✅ Semantic Scholar integration
- ✅ PubMed Central open access
- ✅ PDF text extraction (PyMuPDF)
- ✅ Image extraction from PDFs
- ✅ Gemini Vision diagram analysis
- ✅ Video transcript processing
- ✅ Multi-modal content understanding
- ✅ Hybrid search (keyword + semantic)
- ✅ Field boosting (title^3, abstract^2)
- ✅ Embedding generation (384-dim)
- ✅ Multi-index search
- ✅ Aggregations and analytics
- ✅ LangChain integration
- ✅ Conversation memory
- ✅ Citation extraction
- ✅ Related query generation
- ✅ Bibliography export (BibTeX, APA)
- ✅ Gradio web UI (5 tabs)
- ✅ FastAPI REST API
- ✅ Visualization dashboard
- ✅ Collection management
- ✅ Citation tracking
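Hybrid search with field boosting, as listed above, might translate into an OpenSearch query body along these lines. The sketch assumes the k-NN plugin is enabled and that documents carry `content` and `embedding` fields; both field names and the `build_hybrid_query` helper are hypothetical, while the `title^3` and `abstract^2` boosts and the 384-dimension vector size come from the feature list.

```python
def build_hybrid_query(text: str, vector: list[float], k: int = 10) -> dict:
    """Combine a boosted keyword clause with a k-NN clause (illustrative only)."""
    if len(vector) != 384:
        raise ValueError("expected a 384-dimensional embedding")
    return {
        "size": k,
        "query": {
            "bool": {
                # Either clause may match; scores from both contribute to ranking.
                "should": [
                    {
                        "multi_match": {
                            "query": text,
                            # "content" is an assumed field name.
                            "fields": ["title^3", "abstract^2", "content"],
                        }
                    },
                    # "embedding" is an assumed knn_vector field name.
                    {"knn": {"embedding": {"vector": vector, "k": k}}},
                ]
            }
        },
    }
```

The resulting dict is shaped for the `body` argument of an OpenSearch search call; indexing.md documents how the project itself combines the two signals.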
```python
from multi_modal_rag.data_collectors import AcademicPaperCollector
from multi_modal_rag.indexing import OpenSearchManager
from multi_modal_rag.orchestration import ResearchOrchestrator

# Collect papers
collector = AcademicPaperCollector()
papers = collector.collect_arxiv_papers("machine learning", max_results=10)

# Index papers
opensearch = OpenSearchManager()
for paper in papers:
    opensearch.index_document("research_assistant", {
        'content_type': 'paper',
        'title': paper['title'],
        'abstract': paper['abstract'],
        'authors': paper['authors']
    })

# Query system
orchestrator = ResearchOrchestrator("gemini_api_key", opensearch)
result = orchestrator.process_query("Explain neural networks", "research_assistant")
print(result['answer'])
```

See: Documentation Index for more examples
- Doc: Data Processors
- Example: PDFProcessor section
- Doc: Indexing
- API: API Module - Search Endpoint
- API: API Module - Statistics
- Database: Database Module
```
┌──────────────┐
│ Data Sources │ (ArXiv, YouTube, Podcasts)
└──────┬───────┘
       │
       ↓
┌─────────────┐
│ Collectors  │ (Paper, Video, Podcast Collectors)
└──────┬──────┘
       │
       ↓
┌─────────────┐
│ Processors  │ (PDF, Video Processing + Gemini)
└──────┬──────┘
       │
       ↓
┌─────────────────────┐
│ Indexing + Database │ (OpenSearch + SQLite)
└──────┬──────────────┘
       │
       ↓
┌──────────────────┐
│ Orchestration    │ (LangChain + Citations)
└──────┬───────────┘
       │
       ↓
┌──────────────────┐
│ User Interfaces  │ (Gradio UI + FastAPI)
└──────────────────┘
```
Detailed Docs: Each arrow is documented in the respective module file.
User Query
↓
Orchestrator (orchestration.md)
↓
OpenSearch Retrieval (indexing.md)
↓
Context Formatting (orchestration.md)
↓
Gemini Generation (orchestration.md)
↓
Citation Extraction (orchestration.md)
↓
Response + Citations + Related Queries
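The citation extraction step in the flow above can be sketched as follows, assuming the generated answer marks sources with bracketed numbers such as `[1]` that index into the retrieved documents. That marker format and the `extract_citations` helper are assumptions; orchestration.md describes the actual behavior.

```python
import re


def extract_citations(answer: str, sources: list[dict]) -> list[dict]:
    """Return the sources cited in `answer`, in order of first appearance."""
    cited = []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        # Skip repeats and markers that point outside the retrieved sources.
        if number in seen or not 1 <= number <= len(sources):
            continue
        seen.add(number)
        cited.append({"number": number, **sources[number - 1]})
    return cited
```

Dropping out-of-range markers is a deliberate choice in this sketch: a model can emit a citation number that matches no retrieved document, and it is safer to omit it than to attach the wrong source.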
| Issue | Module | Doc Section |
|---|---|---|
| OpenSearch connection failed | Indexing | Troubleshooting |
| Gemini API errors | Data Processors | Troubleshooting |
| YouTube transcript unavailable | Data Collectors | Troubleshooting |
| Database locked | Database | Troubleshooting |
| CORS errors | API | Troubleshooting |
| UI won't launch | UI | Troubleshooting |
Complete Guide: Documentation Index - Troubleshooting Index
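For the "Database locked" row, two general SQLite mitigations are a longer busy timeout and write-ahead logging. The sketch below is generic `sqlite3` usage, not the project's database module; the `open_tracking_db` name and the 30-second default are hypothetical.

```python
import sqlite3


def open_tracking_db(path: str, timeout_seconds: float = 30.0) -> sqlite3.Connection:
    """Open a SQLite file with settings that reduce 'database is locked' errors."""
    # `timeout` makes a blocked connection wait for the lock instead of failing at once.
    conn = sqlite3.connect(path, timeout=timeout_seconds)
    # WAL mode lets readers keep working while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

WAL mode is stored in the database file, so it only needs to succeed once; the Troubleshooting section of database.md remains the reference for project-specific causes.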
- Indexing: Use bulk_index() for >10 documents → Indexing Docs
- Search: Limit results with k parameter → Indexing Docs
- Processing: Limit images to 5 per PDF → Processors Docs
- API: Use pagination for large result sets → API Docs
- UI: Enable queuing for better responsiveness → UI Docs
Complete Guide: Documentation Index - Performance Optimization Index
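The indexing tip above suggests batching once there are more than 10 documents. A sketch of how documents could be grouped into bulk actions is below. `to_bulk_batches` is a hypothetical helper and the batch size is arbitrary; the action shape (`_index`, `_source`) is the one accepted by `opensearchpy.helpers.bulk`, and whether the project's `bulk_index()` works this way internally is an assumption.

```python
from typing import Iterable, Iterator


def to_bulk_batches(
    index: str, docs: Iterable[dict], batch_size: int = 100
) -> Iterator[list[dict]]:
    """Yield lists of bulk actions, each list at most `batch_size` long."""
    batch = []
    for doc in docs:
        batch.append({"_index": index, "_source": doc})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

Each yielded list could then be handed to a bulk call in one request, which avoids one round trip per document.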
- `GET /api/collections` - List collections → Docs
- `GET /api/collections/{id}` - Get details → Docs
- `GET /api/statistics` - Statistics → Docs
- `GET /api/search` - Search → Docs
- `GET /viz` - Dashboard → Docs
Complete Reference: API Module Documentation
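A client call to the search endpoint might look like the sketch below. The base URL (`http://localhost:8000`, Uvicorn's default port), the `q` and `limit` query parameter names, and the JSON response are assumptions; api.md is the reference for the real parameters and response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url: str, query: str, limit: int = 10) -> str:
    """Build the GET /api/search URL (parameter names are assumed)."""
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def search(base_url: str, query: str, limit: int = 10):
    """Send the request and decode the body, assuming the API returns JSON."""
    with urlopen(search_url(base_url, query, limit), timeout=10) as response:
        return json.load(response)
```

With the API server running locally, `search("http://localhost:8000", "neural networks")` would issue the request; `search_url` alone is useful for checking the encoded URL without a server.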
Each module has complete Python API documentation:
- Collectors: Methods
- Processors: Methods
- Indexing: Methods
- Database: Methods
- Orchestrator: Methods
- UI: Event Handlers
```bash
# Data Collection
pip install arxiv yt-dlp youtube-transcript-api feedparser openai-whisper

# Processing
pip install google-generativeai pypdf pymupdf pillow

# Indexing
pip install opensearch-py sentence-transformers

# Orchestration
pip install langchain langchain-google-genai

# Interfaces
pip install fastapi uvicorn gradio
```

Detailed Info: Each module's documentation has a Dependencies section.
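To see which of the packages above are already installed, a small check like this can help. The import names are the usual ones for these distributions (for example, `pymupdf` imports as `fitz` and `pillow` as `PIL`); the `missing_packages` helper is hypothetical and not part of the project.

```python
from importlib.util import find_spec

# pip distribution name -> top-level import name
IMPORT_NAMES = {
    "arxiv": "arxiv",
    "yt-dlp": "yt_dlp",
    "youtube-transcript-api": "youtube_transcript_api",
    "feedparser": "feedparser",
    "openai-whisper": "whisper",
    "google-generativeai": "google.generativeai",
    "pypdf": "pypdf",
    "pymupdf": "fitz",
    "pillow": "PIL",
    "opensearch-py": "opensearchpy",
    "sentence-transformers": "sentence_transformers",
    "langchain": "langchain",
    "langchain-google-genai": "langchain_google_genai",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "gradio": "gradio",
}


def missing_packages(import_names: dict[str, str] = IMPORT_NAMES) -> list[str]:
    """Return the pip names whose modules cannot be found."""
    missing = []
    for package, module in import_names.items():
        try:
            found = find_spec(module) is not None
        except (ImportError, ValueError):
            # Raised when a dotted name's parent package is absent or malformed.
            found = False
        if not found:
            missing.append(package)
    return missing
```

Running `print(missing_packages())` in the project's environment lists what still needs installing; an empty list means every import name resolved.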
Each module documentation includes:
- ✅ Overview and architecture
- ✅ Complete class/function reference
- ✅ Parameters and return types
- ✅ Working code examples
- ✅ Integration patterns
- ✅ Error handling
- ✅ Performance tips
- ✅ Troubleshooting
- ✅ Dependencies
| File | Size | Lines |
|---|---|---|
| data-collectors.md | 16KB | ~550 |
| data-processors.md | 21KB | ~700 |
| indexing.md | 25KB | ~850 |
| database.md | 24KB | ~800 |
| api.md | 22KB | ~750 |
| orchestration.md | 24KB | ~800 |
| ui.md | 23KB | ~800 |
| Total | 155KB | ~5,250 |
- Follow existing structure in module docs
- Include code examples for all features
- Add troubleshooting entries
- Update this README with new sections
- Update Documentation Index
- Keep examples tested and working
- Update related sections in other docs
- Increment version numbers if major changes
- Update file sizes if significantly changed
- Check relevant module documentation
- Review Documentation Index for quick links
- Search troubleshooting sections
- Check code examples
For documentation issues:
- Specify file and section
- Describe the problem
- Suggest improvements
- Include code examples if applicable
- Documentation Version: 1.0
- Last Updated: October 2024
- Modules Documented: 7
- Total Size: ~155KB
- Code Examples: 100+
Quick Links: