An intelligent video retrieval system that transforms news archives into searchable knowledge bases using multimodal AI. Search through hours of video content using natural language and get precise answers with exact timestamps.
- Natural Language Queries: Ask questions like you would ask a colleague
- Multimodal Understanding: Simultaneously analyzes audio, visuals, and text
- Semantic Retrieval: Finds content by meaning, not just keywords
- Exact Timestamping: Returns precise video segments for playback
| Modality | Technology | What It Captures |
|---|---|---|
| 🔊 Audio | OpenAI Whisper | Transcribed dialogue, speaker identification |
| 🖼️ Visual | GPT-4o Vision | Scene descriptions, activities, objects |
| 📝 Text | EasyOCR | On-screen text, tickers, chyrons, banners |
| 🏷️ Metadata | spaCy + GPT-4o | Named entities, topics, classifications |
- Sliding Window Segmentation: 20-second chunks with 50% overlap
- Scene Change Detection: Skips redundant Vision API calls by comparing frames with mean squared error (MSE)
- Parallel Processing: Audio, visual, and text extraction can run concurrently
- Vector Embeddings: Semantic storage with ChromaDB
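As an illustration of the sliding-window idea, here is a minimal sketch of how chunk boundaries could be computed from a video's duration. The function name `sliding_windows` is hypothetical, and the way the final, shorter chunk is handled is an assumption; the actual logic lives in app/core/video_processor.py and may differ.

```python
from typing import List, Tuple

WINDOW_SIZE = 20  # seconds per chunk
STEP_SIZE = 10    # seconds between chunk starts (gives 50% overlap)


def sliding_windows(duration: float,
                    window: float = WINDOW_SIZE,
                    step: float = STEP_SIZE) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering a video of the given duration."""
    duration = float(duration)
    if duration <= 0 or window <= 0 or step <= 0:
        return []
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        # Assumption: the last chunk is clipped to the end of the video.
        end = min(start + window, duration)
        chunks.append((start, end))
        if end >= duration:
            break  # this chunk already reaches the end of the video
        start += step
    return chunks


print(sliding_windows(45))
# [(0.0, 20.0), (10.0, 30.0), (20.0, 40.0), (30.0, 45.0)]
```

Because each chunk starts 10 seconds after the previous one, any moment in the video (apart from the first and last few seconds) falls inside two chunks, which reduces the chance that a sentence is cut at a chunk boundary.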
news_video_search/
├── 📂 app/ # Core backend logic
│ ├── config.py # Environment & configuration
│ ├── process_videos.py # ⚡ Master pipeline (run this first)
│ ├── rag_search.py # RAG answer generation
│ ├── 📂 services/ # External API integrations
│ │ ├── audio_service.py # Whisper transcription
│ │ ├── vision_service.py # GPT-4o visual analysis
│ │ └── embedding_service.py # Vector embedding generation
│ └── 📂 core/ # Processing algorithms
│ ├── video_processor.py # Sliding window segmentation
│ ├── ner_analyzer.py # Named Entity Recognition
│ ├── ocr_processor.py # On-screen text extraction
│ └── tag_generator.py # Automatic topic classification
├── 📂 data/ # Data storage (auto-created)
│ ├── videos/ # 🎬 Place .mp4 files here
│ ├── vector_db/ # ChromaDB vector storage
│ └── generated_tags.json # Auto-generated taxonomy tags
├── 📂 frontend/
│ └── streamlit_app.py # 🌐 Web interface
├── requirements.txt # Python dependencies
├── .env.example # Environment template
└── README.md # This file
# Clone repository
git clone https://github.com/yourusername/news-video-search.git news_video_search
cd news_video_search
# Create virtual environment
python -m venv venv
# Activate environment
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env
# Edit .env file with your API key
# Add: OPENAI_API_KEY=sk-proj-your-api-key-here
# Install spaCy model for NER
python -m spacy download en_core_web_sm
Place your .mp4 video files in the data/videos/ directory:
# Create directory if needed
mkdir -p data/videos
# Add your news videos here
# Supported formats: .mp4, .mov, .avi
Run the ingestion pipeline (this may take time depending on video length):
python -m app.process_videos
✅ This automatically:
- Segments videos into 20-second chunks
- Transcribes audio with Whisper
- Analyzes visual scenes with GPT-4o Vision
- Extracts on-screen text with EasyOCR
- Stores embeddings in ChromaDB
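The audio, visual, and OCR steps above do not depend on each other, so they could be run concurrently for each chunk. The sketch below shows one way to do that with a thread pool. The three analyzer functions are hypothetical stand-ins for the real services (Whisper, GPT-4o Vision, EasyOCR), not the project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical stand-ins for the real services; each returns a string.
def transcribe_audio(chunk_id: str) -> str:
    return f"transcript of {chunk_id}"


def describe_scene(chunk_id: str) -> str:
    return f"scene description of {chunk_id}"


def extract_text(chunk_id: str) -> str:
    return f"on-screen text of {chunk_id}"


ANALYZERS: Dict[str, Callable[[str], str]] = {
    "transcript": transcribe_audio,
    "visual": describe_scene,
    "ocr": extract_text,
}


def analyze_chunk(chunk_id: str) -> Dict[str, str]:
    """Run every modality analyzer for one chunk concurrently."""
    with ThreadPoolExecutor(max_workers=len(ANALYZERS)) as pool:
        futures = {name: pool.submit(fn, chunk_id) for name, fn in ANALYZERS.items()}
        # result() blocks until the analyzer finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}


result = analyze_chunk("video1_chunk_0")
print(result["transcript"])  # transcript of video1_chunk_0
```

Threads suit the network-bound API calls. A CPU-bound step such as local OCR may benefit more from a process pool.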
Create topic classifications for better filtering:
python -m app.core.tag_generator
Start the search application:
streamlit run frontend/streamlit_app.py
🌐 Open browser at: http://localhost:8501
"Show me segments about economic policies"
"Find climate change discussions"
"Show me sports highlights"
"Find interviews with the President"
"Show me when the peace treaty was signed"
"Find speeches by the Prime Minister"
"Show me footage from Ukraine"
"Find segments filmed in Washington D.C."
"Show me events in India"
"What was discussed about the recent election results?"
"Show me the debate about healthcare reform"
"Find moments when the stock market was mentioned"
# 1. Segmentation
Video → 20s chunks (50% overlap)
# 2. Multimodal Analysis
Audio → Whisper → Transcript
Visual → GPT-4o Vision → Scene description
Text → EasyOCR → On-screen text extraction
# 3. Metadata Enrichment
NER → People, Organizations, Locations
Tagging → Topic classification (Politics, Sports, etc.)
# 4. Vector Storage
Combined text → OpenAI embeddings → ChromaDB
- Query Processing: User question → vector embedding
- Semantic Search: Find top 3 relevant video chunks
- Context Assembly: Combine transcripts, descriptions, OCR
- Answer Generation: GPT-4o generates response using retrieved context
- Result Delivery: Answer + exact timestamps + source video
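The retrieval steps above can be sketched in plain Python. In the real system ChromaDB performs the similarity search and OpenAI supplies the embeddings; this stand-in only illustrates the ranking and context-assembly logic. The dictionary fields (`video`, `start`, `end`, `text`, `embedding`) and the function names are assumptions made for the example, and the answer-generation call is omitted.

```python
import math
from typing import Any, Dict, List, Sequence

TOP_K = 3  # the pipeline retrieves the top 3 chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             chunks: List[Dict[str, Any]],
             k: int = TOP_K) -> List[Dict[str, Any]]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(chunks,
                    key=lambda c: cosine_similarity(query_vec, c["embedding"]),
                    reverse=True)
    return ranked[:k]


def build_context(chunks: List[Dict[str, Any]]) -> str:
    """Join retrieved chunks into one prompt context, keeping source and timestamps."""
    return "\n".join(
        f"[{c['video']} {c['start']:.0f}s-{c['end']:.0f}s] {c['text']}" for c in chunks
    )


# Toy 2-dimensional embeddings; real ones come from the embedding model.
chunks = [
    {"video": "news.mp4", "start": 0, "end": 20, "text": "Budget debate", "embedding": [1.0, 0.0]},
    {"video": "news.mp4", "start": 10, "end": 30, "text": "Weather report", "embedding": [0.0, 1.0]},
    {"video": "news.mp4", "start": 20, "end": 40, "text": "Tax policy", "embedding": [0.9, 0.1]},
]
top = retrieve([1.0, 0.0], chunks, k=2)
print(build_context(top))
# [news.mp4 0s-20s] Budget debate
# [news.mp4 20s-40s] Tax policy
```

Keeping the video name and timestamps inside the context string lets the generated answer cite the exact segment it drew from.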
Modify in app/core/video_processor.py:
WINDOW_SIZE = 20 # seconds per chunk
STEP_SIZE = 10 # seconds between chunk starts (10 s overlap = 50%)
MAX_CHUNKS = 100 # limit per video
Adjust in app/services/vision_service.py:
MSE_THRESHOLD = 1000 # Lower = more sensitive to changes
Configure in app/config.py:
CHROMA_DB_DIR = "data/vector_db"
EMBEDDING_MODEL = "text-embedding-3-small"
- Scene Detection: Only call Vision API when scenes change significantly
- Batch Processing: Process multiple videos sequentially
- Local Models: Option to replace the Whisper API with a locally installed Whisper model
- Parallel Processing: Audio, visual, and text extraction can be parallelized
- Caching: Results cached to avoid reprocessing
- Incremental Updates: Only process new video segments
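To make the scene-detection optimization concrete, the sketch below selects which frames would be sent to the Vision API based on MSE. It assumes frames are flat sequences of grayscale pixel intensities and that each frame is compared with the last analyzed frame; both are assumptions for illustration, and the real code in app/services/vision_service.py presumably works on image arrays instead.

```python
from typing import List, Optional, Sequence

MSE_THRESHOLD = 1000  # lower = more sensitive to changes


def mse(frame_a: Sequence[float], frame_b: Sequence[float]) -> float:
    """Mean squared error between two equally sized frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def frames_to_analyze(frames: Sequence[Sequence[float]],
                      threshold: float = MSE_THRESHOLD) -> List[int]:
    """Return indices of frames that differ enough from the last analyzed frame."""
    selected: List[int] = []
    last: Optional[Sequence[float]] = None
    for i, frame in enumerate(frames):
        # The first frame is always analyzed; later ones only on a large change.
        if last is None or mse(last, frame) > threshold:
            selected.append(i)
            last = frame
    return selected


dark = [10] * 4
slightly_brighter = [20] * 4  # MSE against dark is 100, below the threshold
bright = [200] * 4            # MSE against dark is 36100, above the threshold
print(frames_to_analyze([dark, slightly_brighter, bright, bright]))  # [0, 2]
```

Only two of the four frames would trigger a Vision API call here, which is the saving the scene-detection step aims for.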
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/
# Code formatting
black app/ frontend/ tests/
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
- OpenAI for Whisper and GPT-4o APIs
- ChromaDB for vector storage solutions
- Streamlit for the web framework
- EasyOCR for text extraction capabilities