Documentation

Comprehensive documentation for the Multi-Modal Academic Research System.

Quick Navigation

📚 Start Here

Documentation Index - Complete guide to all documentation files, quick links, and common tasks.

📖 Module Documentation

All module documentation is located in docs/modules/:

  1. Data Collectors (16KB)

    • Collecting papers from ArXiv, PubMed Central, Semantic Scholar
    • YouTube educational video collection with transcripts
    • Podcast episode collection from RSS feeds
  2. Data Processors (21KB)

    • PDF text and image extraction
    • Gemini Vision for diagram analysis
    • Video content processing
  3. Indexing (25KB)

    • OpenSearch index management
    • Hybrid search (BM25 + semantic)
    • Embedding generation
  4. Database (24KB)

    • SQLite collection tracking
    • Statistics and analytics
    • Search and filtering
  5. API (22KB)

    • FastAPI REST endpoints
    • Request/response formats
    • Deployment guides
  6. Orchestration (24KB)

    • LangChain query pipeline
    • Citation extraction and tracking
    • Bibliography export
  7. UI (23KB)

    • Gradio interface
    • User workflows
    • Tab-by-tab guide

Documentation Structure

docs/
├── README.md                      # This file
├── DOCUMENTATION_INDEX.md         # Complete documentation index
└── modules/
    ├── data-collectors.md        # Data collection from various sources
    ├── data-processors.md        # Content processing with Gemini
    ├── indexing.md               # OpenSearch integration
    ├── database.md               # SQLite tracking
    ├── api.md                    # FastAPI REST API
    ├── orchestration.md          # LangChain query pipeline
    └── ui.md                     # Gradio interface

Getting Started

For New Users

  1. Read UI Documentation to understand the interface
  2. Check Data Collectors to learn about data sources
  3. Explore Orchestration to understand how queries work

For Developers

  1. Start with Documentation Index for architecture overview
  2. Read Indexing and Database for core infrastructure
  3. Study Data Processors for content processing pipeline
  4. Review API for programmatic access

For API Users

  1. Go directly to API Documentation
  2. Check Database for data models
  3. See Indexing for search capabilities

Key Features Documented

Data Collection

  • ✅ ArXiv papers with PDF download
  • ✅ YouTube videos with transcripts
  • ✅ Podcasts from RSS feeds
  • ✅ Semantic Scholar integration
  • ✅ PubMed Central open access

Content Processing

  • ✅ PDF text extraction (PyMuPDF)
  • ✅ Image extraction from PDFs
  • ✅ Gemini Vision diagram analysis
  • ✅ Video transcript processing
  • ✅ Multi-modal content understanding

Search & Retrieval

  • ✅ Hybrid search (keyword + semantic)
  • ✅ Field boosting (title^3, abstract^2)
  • ✅ Embedding generation (384-dim)
  • ✅ Multi-index search
  • ✅ Aggregations and analytics
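
As a rough illustration of how the keyword and semantic halves might be combined, the sketch below builds an OpenSearch request body with a boosted `multi_match` clause and a `knn` clause. The helper name `build_hybrid_query` and the field names `content` and `embedding` are hypothetical; only the boosts (`title^3`, `abstract^2`) and the 384-dimension vector size come from the list above. The real query built by the indexing module may differ.

```python
from typing import Any

EMBEDDING_DIM = 384  # vector size listed under Search & Retrieval


def build_hybrid_query(text: str, vector: list[float], k: int = 10) -> dict[str, Any]:
    """Build a request body mixing BM25 keyword matching with k-NN vector search."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return {
        "size": k,
        "query": {
            "bool": {
                # Either clause may match; OpenSearch adds the scores together.
                "should": [
                    {
                        "multi_match": {
                            "query": text,
                            # "content" is an assumed body-text field name.
                            "fields": ["title^3", "abstract^2", "content"],
                        }
                    },
                    # "embedding" is an assumed knn_vector field name.
                    {"knn": {"embedding": {"vector": vector, "k": k}}},
                ]
            }
        },
    }
```

Summing the two scores through `bool`/`should` is the simplest way to blend them; a weighted or normalized combination is equally plausible, so treat this as one option rather than the project's method.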

Query Pipeline

  • ✅ LangChain integration
  • ✅ Conversation memory
  • ✅ Citation extraction
  • ✅ Related query generation
  • ✅ Bibliography export (BibTeX, APA)
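
To make the BibTeX half of bibliography export concrete, here is a minimal sketch that turns a paper record into an `@article` entry. The function name `to_bibtex`, the `year` field, and the citation key are assumptions for illustration; the orchestration module's own exporter is not shown in this README and may produce different fields.

```python
def to_bibtex(paper: dict, key: str) -> str:
    """Format a paper dict (title, authors, year) as a BibTeX @article entry."""
    fields = {
        "title": paper["title"],
        # BibTeX separates multiple authors with the word "and".
        "author": " and ".join(paper["authors"]),
        "year": str(paper["year"]),  # "year" is an assumed field
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"
```

Calling `to_bibtex(paper, "vaswani2017")` on a record with two authors yields an entry whose `author` field joins them with `and`.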

User Interfaces

  • ✅ Gradio web UI (5 tabs)
  • ✅ FastAPI REST API
  • ✅ Visualization dashboard
  • ✅ Collection management
  • ✅ Citation tracking

Code Examples

Quick Start: Collect and Search

import os

from multi_modal_rag.data_collectors import AcademicPaperCollector
from multi_modal_rag.indexing import OpenSearchManager
from multi_modal_rag.orchestration import ResearchOrchestrator

# Collect up to 10 ArXiv papers matching the search phrase
collector = AcademicPaperCollector()
papers = collector.collect_arxiv_papers("machine learning", max_results=10)

# Index each paper into the "research_assistant" index
opensearch = OpenSearchManager()
for paper in papers:
    opensearch.index_document("research_assistant", {
        'content_type': 'paper',
        'title': paper['title'],
        'abstract': paper['abstract'],
        'authors': paper['authors']
    })

# Query the system. The API key is read from the environment instead of being
# hard-coded; the variable name GEMINI_API_KEY is an assumption.
orchestrator = ResearchOrchestrator(os.environ["GEMINI_API_KEY"], opensearch)
result = orchestrator.process_query("Explain neural networks", "research_assistant")

print(result['answer'])

See: Documentation Index for more examples

Common Tasks

Task: Collect New Content

Task: Process PDFs with Diagrams

Task: Search Content

Task: Export Citations

Task: View Statistics

Architecture Overview

Data Flow

┌──────────────┐
│ Data Sources │ (ArXiv, YouTube, Podcasts)
└──────┬───────┘
       │
       ↓
┌─────────────┐
│  Collectors │ (Paper, Video, Podcast Collectors)
└──────┬──────┘
       │
       ↓
┌─────────────┐
│  Processors │ (PDF, Video Processing + Gemini)
└──────┬──────┘
       │
       ↓
┌─────────────────────┐
│ Indexing + Database │ (OpenSearch + SQLite)
└──────┬──────────────┘
       │
       ↓
┌──────────────────┐
│  Orchestration   │ (LangChain + Citations)
└──────┬───────────┘
       │
       ↓
┌──────────────────┐
│ User Interfaces  │ (Gradio UI + FastAPI)
└──────────────────┘

Detailed Docs: Each stage and hand-off in this flow is documented in the corresponding module file.

Query Pipeline

User Query
    ↓
Orchestrator (orchestration.md)
    ↓
OpenSearch Retrieval (indexing.md)
    ↓
Context Formatting (orchestration.md)
    ↓
Gemini Generation (orchestration.md)
    ↓
Citation Extraction (orchestration.md)
    ↓
Response + Citations + Related Queries
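
The stages above can be sketched as a single function with the retrieval and generation steps passed in as callables. This is a simplified illustration, not the `ResearchOrchestrator` implementation: `run_query`, the `[n]` citation-marker convention, and the prompt layout are all assumptions, and related-query generation is left out for brevity.

```python
import re
from typing import Callable


def run_query(
    query: str,
    retrieve: Callable[[str], list[dict]],
    generate: Callable[[str], str],
) -> dict:
    """Retrieve sources, build a numbered context, generate, then extract citations."""
    docs = retrieve(query)
    # Context formatting: number each source so the model can cite it as [n].
    context = "\n".join(
        f"[{i}] {doc['title']}: {doc['abstract']}" for i, doc in enumerate(docs, 1)
    )
    answer = generate(f"Context:\n{context}\n\nQuestion: {query}")
    # Citation extraction: keep only markers that point at a retrieved source.
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    citations = [docs[n - 1]["title"] for n in cited if 1 <= n <= len(docs)]
    return {"answer": answer, "citations": citations}
```

Passing the two stages as callables keeps the sketch runnable without OpenSearch or Gemini; in practice `retrieve` would wrap the search call and `generate` the model call.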

Troubleshooting Guide

Common Issues

| Issue                          | Module Doc      | Section         |
|--------------------------------|-----------------|-----------------|
| OpenSearch connection failed   | Indexing        | Troubleshooting |
| Gemini API errors              | Data Processors | Troubleshooting |
| YouTube transcript unavailable | Data Collectors | Troubleshooting |
| Database locked                | Database        | Troubleshooting |
| CORS errors                    | API             | Troubleshooting |
| UI won't launch                | UI              | Troubleshooting |
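
For the "Database locked" row, one common SQLite mitigation (not necessarily the fix recommended in the Database module docs) is to give writers a longer busy timeout and switch the file to write-ahead logging so readers do not block writers. The helper name `open_tracking_db` and the 30-second default are assumptions.

```python
import sqlite3


def open_tracking_db(path: str, timeout_s: float = 30.0) -> sqlite3.Connection:
    """Open a SQLite file with a busy timeout and WAL journaling enabled."""
    # timeout: how long a statement waits on a lock before raising
    # "database is locked".
    conn = sqlite3.connect(path, timeout=timeout_s)
    # WAL lets readers proceed while a single writer holds the write lock.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```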

Complete Guide: Documentation Index - Troubleshooting Index

Performance Optimization

Quick Tips

  • Indexing: Use bulk_index() for >10 documents → Indexing Docs
  • Search: Limit results with k parameter → Indexing Docs
  • Processing: Limit images to 5 per PDF → Processors Docs
  • API: Use pagination for large result sets → API Docs
  • UI: Enable queuing for better responsiveness → UI Docs
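
The bulk-indexing tip usually goes together with batching, so that a large collection is sent in fixed-size chunks rather than one request per document. Below is a small, library-free sketch of such a batcher; the name `batched` and the default size of 100 are assumptions, and the signature of `bulk_index()` is not shown in this README, so the call itself is left to the caller.

```python
from typing import Iterable, Iterator


def batched(docs: Iterable[dict], size: int = 100) -> Iterator[list[dict]]:
    """Yield lists of at most `size` documents, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: list[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

Each yielded list can then be handed to the bulk call described in the Indexing docs.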

Complete Guide: Documentation Index - Performance Optimization Index

API Reference

REST Endpoints

  • GET /api/collections - List collections → Docs
  • GET /api/collections/{id} - Get details → Docs
  • GET /api/statistics - Statistics → Docs
  • GET /api/search - Search → Docs
  • GET /viz - Dashboard → Docs
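
As a hedged illustration of calling these endpoints from a client, the helper below only builds request URLs, so it runs without a server. The base URL `http://localhost:8000` and the query parameter names `q` and `limit` are hypothetical; check the API module documentation for the real parameters.

```python
from urllib.parse import urlencode, urljoin


def endpoint_url(base_url: str, path: str, **params) -> str:
    """Join a base URL and an endpoint path, appending encoded query parameters."""
    url = urljoin(base_url, path)
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical usage; parameter names are assumptions.
search = endpoint_url("http://localhost:8000", "/api/search", q="neural networks", limit=5)
detail = endpoint_url("http://localhost:8000", "/api/collections/42")
```

The resulting strings can be passed to any HTTP client, for example `urllib.request.urlopen`.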

Complete Reference: API Module Documentation

Python API

Each module has complete Python API documentation:

Dependencies

Core Libraries

# Data Collection
pip install arxiv yt-dlp youtube-transcript-api feedparser openai-whisper

# Processing
pip install google-generativeai pypdf pymupdf pillow

# Indexing
pip install opensearch-py sentence-transformers

# Orchestration
pip install langchain langchain-google-genai

# Interfaces
pip install fastapi uvicorn gradio
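
After installing, a quick way to see which packages are still missing is to probe for their import names. The sketch below does this with the standard library only. The mapping from pip names to import names reflects the usual names for these packages and is an assumption, not something this README states; `missing_packages` is a hypothetical helper.

```python
import importlib.util

# pip package name -> module name used in `import` (assumed mapping)
REQUIRED = {
    "arxiv": "arxiv",
    "yt-dlp": "yt_dlp",
    "youtube-transcript-api": "youtube_transcript_api",
    "feedparser": "feedparser",
    "openai-whisper": "whisper",
    "google-generativeai": "google.generativeai",
    "pypdf": "pypdf",
    "pymupdf": "fitz",
    "pillow": "PIL",
    "opensearch-py": "opensearchpy",
    "sentence-transformers": "sentence_transformers",
    "langchain": "langchain",
    "langchain-google-genai": "langchain_google_genai",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "gradio": "gradio",
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return the pip names whose modules cannot be located."""
    missing = []
    for pip_name, module in required.items():
        try:
            found = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is itself absent.
            found = False
        if not found:
            missing.append(pip_name)
    return missing
```

Printing `missing_packages()` gives a list that can be passed straight to `pip install`.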

Detailed Info: Each module doc has a Dependencies section

Documentation Standards

Each module documentation includes:

  • ✅ Overview and architecture
  • ✅ Complete class/function reference
  • ✅ Parameters and return types
  • ✅ Working code examples
  • ✅ Integration patterns
  • ✅ Error handling
  • ✅ Performance tips
  • ✅ Troubleshooting
  • ✅ Dependencies

File Sizes

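| File               | Size  | Lines  |
|--------------------|-------|--------|
| data-collectors.md | 16KB  | ~550   |
| data-processors.md | 21KB  | ~700   |
| indexing.md        | 25KB  | ~850   |
| database.md        | 24KB  | ~800   |
| api.md             | 22KB  | ~750   |
| orchestration.md   | 24KB  | ~800   |
| ui.md              | 23KB  | ~800   |
| Total              | 155KB | ~5,250 |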

Contributing

Adding Documentation

  1. Follow existing structure in module docs
  2. Include code examples for all features
  3. Add troubleshooting entries
  4. Update this README with new sections
  5. Update Documentation Index

Updating Documentation

  1. Keep examples tested and working
  2. Update related sections in other docs
  3. Increment the version number for major changes
  4. Update file sizes if significantly changed

Support

Getting Help

  1. Check relevant module documentation
  2. Review Documentation Index for quick links
  3. Search troubleshooting sections
  4. Check code examples

Reporting Issues

For documentation issues:

  1. Specify file and section
  2. Describe the problem
  3. Suggest improvements
  4. Include code examples if applicable

Version Information

  • Documentation Version: 1.0
  • Last Updated: October 2024
  • Modules Documented: 7
  • Total Size: ~155KB
  • Code Examples: 100+

Quick Links: