The API module provides a FastAPI-based REST API for visualizing and accessing collected research data. It includes endpoints for retrieving collections, viewing statistics, searching content, and serving a visualization dashboard.
multi_modal_rag/api/
├── api_server.py # FastAPI application
└── static/
└── visualization.html # Frontend dashboard (optional)
File: multi_modal_rag/api/api_server.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
app = FastAPI(
title="Multi-Modal Research Data API",
description="API for visualizing collected research data",
version="1.0.0"
)Features:
- CORS enabled for frontend access
- Automatic OpenAPI/Swagger documentation
- Integrated database manager
- RESTful endpoints
Method 1: Direct Execution
python -m multi_modal_rag.api.api_serverMethod 2: Using uvicorn
uvicorn multi_modal_rag.api.api_server:app --host 0.0.0.0 --port 8000Method 3: Using start script
python start_api_server.pyAccess Points:
- API: http://localhost:8000
- Swagger Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Returns API information and available endpoints.
Response:
{
"message": "Multi-Modal Research Data API",
"endpoints": {
"collections": "/api/collections",
"statistics": "/api/statistics",
"search": "/api/search",
"visualization": "/viz"
}
}Example:
curl http://localhost:8000/Retrieves all collections with optional filtering and pagination.
Query Parameters:
content_type(str, optional): Filter by type ('paper', 'video', 'podcast')limit(int, optional): Max results (1-1000). Default: 100offset(int, optional): Offset for pagination. Default: 0
Response:
{
"count": 25,
"collections": [
{
"id": 1,
"content_type": "paper",
"title": "Attention Is All You Need",
"source": "arxiv",
"url": "https://arxiv.org/abs/1706.03762",
"collection_date": "2024-10-02T14:30:00",
"metadata": {
"query": "transformer models",
"categories": ["cs.CL", "cs.LG"]
},
"status": "collected",
"indexed": true
},
// ... more collections
]
}Examples:
# Get all collections
curl http://localhost:8000/api/collections
# Get only papers
curl http://localhost:8000/api/collections?content_type=paper
# Get videos with pagination
curl http://localhost:8000/api/collections?content_type=video&limit=20&offset=0
# Get second page
curl http://localhost:8000/api/collections?limit=50&offset=50Python Client:
import requests
# Get all collections
response = requests.get("http://localhost:8000/api/collections")
data = response.json()
print(f"Total collections: {data['count']}")
for item in data['collections']:
print(f" - {item['title']} ({item['content_type']})")
# Filter by type
response = requests.get(
"http://localhost:8000/api/collections",
params={"content_type": "paper", "limit": 50}
)
papers = response.json()['collections']Retrieves detailed information for a specific collection.
Path Parameters:
collection_id(int): Collection ID
Response:
{
"id": 1,
"content_type": "paper",
"title": "Attention Is All You Need",
"source": "arxiv",
"url": "https://arxiv.org/abs/1706.03762",
"collection_date": "2024-10-02T14:30:00",
"metadata": {
"query": "transformer models",
"categories": ["cs.CL", "cs.LG"]
},
"status": "collected",
"indexed": true,
"details": {
"id": 1,
"collection_id": 1,
"arxiv_id": "1706.03762",
"pmc_id": null,
"abstract": "The dominant sequence transduction models...",
"authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
"published_date": "2017-06-12",
"categories": ["cs.CL", "cs.LG"],
"pdf_path": "data/papers/1706.03762.pdf"
}
}Error Responses:
// 404 Not Found
{
"detail": "Collection not found"
}
// 500 Internal Server Error
{
"detail": "Error message here"
}Examples:
# Get collection details
curl http://localhost:8000/api/collections/1
# Handle errors
curl http://localhost:8000/api/collections/99999
# Returns 404 with "Collection not found"Python Client:
import requests
def get_collection_details(collection_id: int):
response = requests.get(
f"http://localhost:8000/api/collections/{collection_id}"
)
if response.status_code == 200:
data = response.json()
print(f"Title: {data['title']}")
print(f"Type: {data['content_type']}")
if 'details' in data:
if data['content_type'] == 'paper':
details = data['details']
print(f"Authors: {', '.join(details['authors'])}")
print(f"Abstract: {details['abstract'][:200]}...")
else:
print(f"Error: {response.status_code}")
get_collection_details(1)Retrieves database statistics.
Response:
{
"by_type": {
"paper": 150,
"video": 75,
"podcast": 30
},
"indexed": 200,
"not_indexed": 55,
"recent_7_days": 25,
"collection_history": [
{
"type": "paper",
"source": "arxiv",
"total": 150
},
{
"type": "video",
"source": "youtube",
"total": 75
},
{
"type": "podcast",
"source": "rss",
"total": 30
}
]
}Example:
curl http://localhost:8000/api/statisticsPython Client:
import requests
response = requests.get("http://localhost:8000/api/statistics")
stats = response.json()
print("=== Database Statistics ===")
print(f"\nContent by Type:")
for content_type, count in stats['by_type'].items():
print(f" {content_type}: {count}")
print(f"\nIndexing Status:")
print(f" Indexed: {stats['indexed']}")
print(f" Not Indexed: {stats['not_indexed']}")
total = stats['indexed'] + stats['not_indexed']
percentage = (stats['indexed'] / total * 100) if total > 0 else 0
print(f" Completion: {percentage:.1f}%")
print(f"\nRecent Activity:")
print(f" Last 7 days: {stats['recent_7_days']} new items")Visualization Example:
import matplotlib.pyplot as plt
# Pie chart of content types
response = requests.get("http://localhost:8000/api/statistics")
stats = response.json()
labels = list(stats['by_type'].keys())
sizes = list(stats['by_type'].values())
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Collection Distribution by Type')
plt.show()Searches collections by title or source.
Query Parameters:
q(str, required): Search query (min 1 character)limit(int, optional): Max results (1-500). Default: 50
Response:
{
"query": "transformer",
"count": 12,
"results": [
{
"id": 1,
"content_type": "paper",
"title": "Attention Is All You Need",
"source": "arxiv",
"url": "https://arxiv.org/abs/1706.03762",
"collection_date": "2024-10-02T14:30:00",
"metadata": {...},
"status": "collected",
"indexed": true
},
// ... more results
]
}Examples:
# Search by title keyword
curl "http://localhost:8000/api/search?q=transformer"
# Search by source
curl "http://localhost:8000/api/search?q=arxiv&limit=100"
# Search with special characters (URL encoded)
curl "http://localhost:8000/api/search?q=neural%20networks"Python Client:
import requests
def search_collections(query: str, limit: int = 50):
response = requests.get(
"http://localhost:8000/api/search",
params={"q": query, "limit": limit}
)
data = response.json()
print(f"Query: '{data['query']}'")
print(f"Found {data['count']} results\n")
for item in data['results']:
print(f" {item['id']}: {item['title']}")
print(f" Type: {item['content_type']}, Source: {item['source']}")
print()
# Search for papers about attention
search_collections("attention", limit=10)
# Search for specific source
search_collections("youtube")Serves the HTML visualization dashboard.
Response: HTML page with interactive visualizations
Features:
- Charts showing collection distribution
- Filter by content type
- Search functionality
- Recent activity timeline
- Statistics cards
Access:
# Open in browser
open http://localhost:8000/vizFallback Response (if visualization.html not found):
<html>
<body>
<h1>Visualization page not found</h1>
<p>Please ensure visualization.html exists in the static directory</p>
</body>
</html>Dashboard Implementation:
The visualization page should be located at:
multi_modal_rag/api/static/visualization.html
Example Dashboard HTML:
<!DOCTYPE html>
<html>
<head>
<title>Research Collection Dashboard</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
.stats-card {
display: inline-block;
padding: 20px;
margin: 10px;
background: #f0f0f0;
border-radius: 8px;
}
.chart-container {
width: 400px;
height: 400px;
display: inline-block;
}
</style>
</head>
<body>
<h1>Research Collection Dashboard</h1>
<div id="stats"></div>
<div id="charts"></div>
<script>
// Fetch statistics
fetch('/api/statistics')
.then(response => response.json())
.then(stats => {
// Display stats cards
document.getElementById('stats').innerHTML = `
<div class="stats-card">
<h3>Total Papers</h3>
<p>${stats.by_type.paper || 0}</p>
</div>
<div class="stats-card">
<h3>Total Videos</h3>
<p>${stats.by_type.video || 0}</p>
</div>
<div class="stats-card">
<h3>Indexed</h3>
<p>${stats.indexed}</p>
</div>
`;
// Create pie chart
const ctx = document.createElement('canvas');
document.getElementById('charts').appendChild(ctx);
new Chart(ctx, {
type: 'pie',
data: {
labels: Object.keys(stats.by_type),
datasets: [{
data: Object.values(stats.by_type),
backgroundColor: ['#FF6384', '#36A2EB', '#FFCE56']
}]
}
});
});
</script>
</body>
</html>Health check endpoint for monitoring.
Response:
{
"status": "healthy"
}Example:
curl http://localhost:8000/healthUse Cases:
- Load balancer health checks
- Container orchestration (Kubernetes)
- Monitoring systems
- CI/CD pipelines
The API is configured to allow cross-origin requests:
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Allow all origins
allow_credentials=True,
allow_methods=["*"], # Allow all HTTP methods
allow_headers=["*"], # Allow all headers
)Security Note: For production, restrict allow_origins to specific domains:
allow_origins=[
"https://yourdomain.com",
"http://localhost:3000"
]| Code | Description | When Used |
|---|---|---|
| 200 | OK | Successful request |
| 404 | Not Found | Collection ID doesn't exist |
| 422 | Unprocessable Entity | Invalid parameters |
| 500 | Internal Server Error | Database or server error |
{
"detail": "Error message describing what went wrong"
}Examples:
from fastapi import HTTPException
# Collection not found
raise HTTPException(status_code=404, detail="Collection not found")
# Invalid parameters (handled by FastAPI automatically)
# Query parameter validation fails → 422
# Database error
raise HTTPException(status_code=500, detail=str(e))import React, { useState, useEffect } from 'react';
function CollectionsList() {
const [collections, setCollections] = useState([]);
const [loading, setLoading] = useState(true);
useEffect(() => {
fetch('http://localhost:8000/api/collections?limit=50')
.then(response => response.json())
.then(data => {
setCollections(data.collections);
setLoading(false);
});
}, []);
if (loading) return <div>Loading...</div>;
return (
<div>
<h1>Collections ({collections.length})</h1>
{collections.map(item => (
<div key={item.id}>
<h3>{item.title}</h3>
<p>Type: {item.content_type}</p>
<p>Source: {item.source}</p>
</div>
))}
</div>
);
}import requests
import pandas as pd
import matplotlib.pyplot as plt
# Fetch all collections
response = requests.get("http://localhost:8000/api/collections?limit=1000")
collections = response.json()['collections']
# Convert to DataFrame
df = pd.DataFrame(collections)
# Analysis
print("Collections by Type:")
print(df['content_type'].value_counts())
print("\nCollections by Source:")
print(df['source'].value_counts())
# Visualization
df['content_type'].value_counts().plot(kind='bar')
plt.title('Collections by Type')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()
# Time series analysis
df['collection_date'] = pd.to_datetime(df['collection_date'])
df.set_index('collection_date', inplace=True)
df.resample('D').size().plot()
plt.title('Collections Over Time')
plt.show()import click
import requests
@click.group()
def cli():
"""Research Collection CLI"""
pass
@cli.command()
@click.option('--type', help='Filter by content type')
@click.option('--limit', default=10, help='Number of results')
def list_collections(type, limit):
"""List collections"""
params = {'limit': limit}
if type:
params['content_type'] = type
response = requests.get('http://localhost:8000/api/collections', params=params)
data = response.json()
click.echo(f"Found {data['count']} collections\n")
for item in data['collections']:
click.echo(f"[{item['id']}] {item['title']}")
click.echo(f" Type: {item['content_type']}, Source: {item['source']}")
click.echo()
@cli.command()
@click.argument('query')
def search(query):
"""Search collections"""
response = requests.get('http://localhost:8000/api/search', params={'q': query})
data = response.json()
click.echo(f"Query: '{data['query']}' - {data['count']} results\n")
for item in data['results']:
click.echo(f"• {item['title']} ({item['content_type']})")
@cli.command()
def stats():
"""Show statistics"""
response = requests.get('http://localhost:8000/api/statistics')
data = response.json()
click.echo("=== Statistics ===\n")
click.echo("By Type:")
for t, count in data['by_type'].items():
click.echo(f" {t}: {count}")
click.echo(f"\nIndexed: {data['indexed']}")
click.echo(f"Not Indexed: {data['not_indexed']}")
if __name__ == '__main__':
cli()Usage:
python cli_tool.py list-collections --type paper --limit 5
python cli_tool.py search "neural networks"
python cli_tool.py statsTypical Response Times:
GET /: <5msGET /api/collections: 10-50ms (100 items)GET /api/collections/{id}: 5-20msGET /api/statistics: 20-100ms (aggregations)GET /api/search: 50-200ms (LIKE query)
-
Pagination: Always use
limitparameter# Good response = requests.get('/api/collections?limit=50') # Bad (loads all) response = requests.get('/api/collections?limit=10000')
-
Caching: Implement Redis caching for statistics
from fastapi_cache import FastAPICache from fastapi_cache.decorator import cache @app.get("/api/statistics") @cache(expire=300) # Cache for 5 minutes async def get_statistics(): ...
-
Database Indexing: Add indexes to frequently queried fields
CREATE INDEX idx_content_type ON collections(content_type); CREATE INDEX idx_indexed ON collections(indexed);
-
Async Database Operations: Use async SQLite library
import aiosqlite @app.get("/api/collections") async def get_collections(...): async with aiosqlite.connect(db_path) as db: async with db.execute("SELECT ...") as cursor: rows = await cursor.fetchall()
-
CORS: Restrict origins
allow_origins=["https://yourdomain.com"]
-
Authentication: Add API key or OAuth
from fastapi.security import APIKeyHeader api_key_header = APIKeyHeader(name="X-API-Key") @app.get("/api/collections") async def get_collections(api_key: str = Depends(api_key_header)): if api_key != os.getenv("API_KEY"): raise HTTPException(401, "Invalid API key") ...
-
Rate Limiting: Prevent abuse
from slowapi import Limiter from slowapi.util import get_remote_address limiter = Limiter(key_func=get_remote_address) @app.get("/api/search") @limiter.limit("10/minute") async def search(...): ...
-
HTTPS: Use SSL/TLS in production
uvicorn app:app --host 0.0.0.0 --port 443 --ssl-keyfile key.pem --ssl-certfile cert.pem
Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "multi_modal_rag.api.api_server:app", "--host", "0.0.0.0", "--port", "8000"]docker-compose.yml:
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
volumes:
- ./data:/app/data
environment:
- DATABASE_PATH=/app/data/collections.dbRun:
docker-compose up -ddeployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: research-api
spec:
replicas: 3
selector:
matchLabels:
app: research-api
template:
metadata:
labels:
app: research-api
spec:
containers:
- name: api
image: research-api:latest
ports:
- containerPort: 8000
env:
- name: DATABASE_PATH
value: "/data/collections.db"
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: research-data-pvcfrom fastapi import FastAPI, Query, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import HTMLResponse
import uvicornInstallation:
pip install fastapi uvicorn[standard]Error: OSError: [Errno 48] Address already in use
Solution: Use different port or kill existing process
# Find process
lsof -i :8000
# Kill process
kill -9 <PID>
# Or use different port
uvicorn app:app --port 8001Error: Access to fetch at 'http://localhost:8000' from origin 'http://localhost:3000' has been blocked by CORS policy
Solution: Ensure CORS middleware is configured correctly
app.add_middleware(
CORSMiddleware,
allow_origins=["http://localhost:3000"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)Cause: Invalid query parameters
Example:
# Missing required parameter 'q'
curl http://localhost:8000/api/search
# Returns: {"detail":[{"loc":["query","q"],"msg":"field required",...}]}Solution: Provide required parameters
curl "http://localhost:8000/api/search?q=test"FastAPI automatically generates interactive API documentation:
Swagger UI: http://localhost:8000/docs
- Interactive testing
- Try endpoints directly in browser
- View request/response schemas
ReDoc: http://localhost:8000/redoc
- Clean, readable documentation
- Three-panel layout
- Better for sharing with team
OpenAPI Schema: http://localhost:8000/openapi.json
- Machine-readable API specification
- Use for client generation
- Import into Postman/Insomnia