This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
NewsCrawler is a multi-platform news and content crawler collection designed for educational purposes only. It supports:
- News Crawlers: 12 platforms (WeChat, Toutiao, NetEase, Sohu, Tencent, Lenny's Newsletter, Naver Blog, Detik News, Quora, BBC, CNN, Twitter/X)
- Video Downloaders: Stock media platforms (Pexels, Pixabay, Coverr, Mixkit)
- Web UI: Modern FastAPI + Vue 3 interface for easy extraction
Important: This codebase is for educational and research purposes only. Do not use for commercial purposes or large-scale crawling.
NewsCrawler/
├── news_crawler/ # Consolidated crawler modules
│ ├── wechat_news/ # WeChat crawler
│ ├── toutiao_news/ # Toutiao crawler
│ ├── netease_news/ # NetEase News
│ ├── sohu_news/ # Sohu News
│ ├── tencent_news/ # Tencent News
│ ├── lennysnewsletter/ # Lenny's Newsletter
│ ├── naver_news/ # Naver Blog
│ ├── detik_news/ # Detik News
│ ├── quora/ # Quora
│ ├── bbc_news/ # BBC News
│ ├── cnn_news/ # CNN News
│ └── twitter_news/ # Twitter/X (guest token or optional cookie auth)
│
├── news-extractor-ui/ # Web UI Application (NEW)
│ ├── backend/ # FastAPI backend
│ │ ├── app/
│ │ │ ├── api/ # API routes
│ │ │ ├── adapters/ # Crawler adapters
│ │ │ ├── services/ # Business logic
│ │ │ └── main.py
│ │ ├── pyproject.toml
│ │ └── run.py
│ │
│ └── frontend/ # Vue 3 frontend
│ ├── src/
│ │ ├── components/ # Vue components
│ │ ├── services/ # API services
│ │ ├── types/ # TypeScript types
│ │ └── App.vue
│ ├── package.json
│ └── vite.config.ts
│
├── video_crawler/ # Video downloader modules
│ ├── pexel/
│ ├── pixabay/
│ ├── cover_video/
│ └── mixkit_video/
│
├── libs/ # Shared utilities
│ ├── playwright_driver.py
│ ├── drissionpage_driver.py
│ └── tools.py
│
├── data/ # Output directory
├── call_example.py # CLI usage examples
├── pyproject.toml # Root project config (uv)
└── uv.lock # Dependency lock file
This project uses uv for dependency management.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh # macOS/Linux
# or: pip install uv
# Sync root project dependencies
uv sync

Backend (FastAPI):
cd news-extractor-ui/backend
uv sync # Install backend dependencies
uv run run.py # Start backend on port 8000

Frontend (Vue 3):
cd news-extractor-ui/frontend
npm install # Install frontend dependencies
npm run dev # Start frontend on port 3000
npm run build # Build for production

Access: Open browser to http://localhost:3000
# Run CLI examples
uv run call_example.py
# Direct module execution
uv run -m news_crawler.wechat_news.wechat_news
uv run -m news_crawler.toutiao_news.toutiao_news

# Run a script (no venv activation needed)
uv run script.py
# Add/remove dependencies
uv add package-name
uv remove package-name
# Update dependencies
uv sync --upgrade

The Web UI provides a modern interface for news extraction:
Backend (FastAPI):
- API Layer (app/api/): RESTful endpoints; POST /api/extract is the main extraction endpoint
- Adapter Layer (app/adapters/): Wraps crawler modules (WeChatAdapter, ToutiaoAdapter, etc.) and provides a unified interface for different platforms
- Service Layer (app/services/): detector.py auto-detects the platform from the URL; extractor.py orchestrates the extraction process; markdown_converter.py converts JSON to Markdown
Frontend (Vue 3 + TypeScript):
- Components: PlatformSelector.vue (platform selection cards), UrlInputNew.vue (URL input with validation), ExtractProgress.vue (real-time progress display), ResultViewerNew.vue (multi-tab result viewer: JSON/Markdown/Preview)
- Services: api.ts (Axios-based API client)
- Types: index.ts (TypeScript interfaces for API responses)
Key Features:
- Auto-detect platform from URL patterns
- Real-time extraction progress
- Dual-format output (JSON + Markdown)
- Responsive design with modern UI
- Platform-specific logo display
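As an illustration of the markdown_converter.py role described above, here is a minimal sketch; the function name and exact formatting are assumptions, not the project's actual implementation:

```python
# Hypothetical sketch: convert a NewsItem-style dict to Markdown.
# The real markdown_converter.py may format metadata and media differently.
def to_markdown(item: dict) -> str:
    lines = [f"# {item['title']}", ""]
    meta = item.get("meta_info") or {}
    if meta.get("author_name"):
        lines += [f"*{meta['author_name']}* ({meta.get('publish_time', '')})", ""]
    for block in item.get("contents", []):
        if block["type"] == "text":
            lines.append(block["content"])
        elif block["type"] == "image":
            lines.append(f"![{block.get('desc') or ''}]({block['content']})")
        elif block["type"] == "video":
            lines.append(f"[video]({block['content']})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```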
All crawlers follow a consistent architecture:
- RequestHeaders (Pydantic): User-Agent and Cookie management
- ContentType (Enum): TEXT, IMAGE, VIDEO
- ContentItem (Pydantic): Structured content with type/content/desc
- NewsMetaInfo (Pydantic): Author, publish time, etc.
- NewsItem (Pydantic): Complete article data model
- Crawler Class: Main implementation
  - __init__(url, save_path, headers)
  - fetch_content(): HTTP with retry
  - parse_html_to_news_meta(): Extract metadata
  - parse_html_to_news_content(): Extract content
  - save_as_json(): Save to data/
  - run(): Main entry point
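A stdlib sketch of how the data models above fit together; the project uses Pydantic, so dataclasses stand in here, and the optional defaults are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class ContentType(str, Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"

@dataclass
class ContentItem:
    type: ContentType
    content: str
    desc: Optional[str] = None

@dataclass
class NewsMetaInfo:
    author_name: Optional[str] = None
    author_url: Optional[str] = None
    publish_time: Optional[str] = None

@dataclass
class NewsItem:
    title: str
    news_url: str
    news_id: str
    meta_info: NewsMetaInfo = field(default_factory=NewsMetaInfo)
    contents: List[ContentItem] = field(default_factory=list)
    texts: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
```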
- URL: https://mp.weixin.qq.com/s/{id}
- Special: Supports both traditional and SSR-rendered pages
  - Traditional: Parse from #js_content
  - SSR (XHS-style): Parse window.__QMTPL_SSR_DATA__
- HTTP: Uses curl_cffi with Chrome impersonation
- Parser: WechatContentParser class handles nested content
- Features: List detection, paragraph structure preservation
- URL: https://www.toutiao.com/article/{id}/
- Parser: parsel with XPath
- Simple: article/p/img/video extraction
- URL: https://news.detik.com/internasional/d-{id}/...
- Parser: .detail__body-text extraction
- Cover: Extracts cover media from .detail__media
- URL: https://x.com/{username}/status/{id} or https://twitter.com/{username}/status/{id}
- API: Uses X internal GraphQL API (TweetResultByRestId)
- Authentication: Supports two modes:
  - Guest Token mode (default): No authentication needed; can access public tweets
  - Cookie mode: For protected tweets; requires Cookie authentication (auth_token + ct0)
- Features:
  - Supports regular tweets, long tweets (note_tweet), and articles
  - Extracts images, videos (highest bitrate), and quoted tweets
  - Handles both twitter.com and x.com URLs
  - Automatic fallback: tries Guest Token first, then Cookie auth if available
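The fallback order can be sketched as follows; `fetch` is a hypothetical callable standing in for the real GraphQL request, and the environment variable names match the configuration described in this section:

```python
import os

def extract_tweet(url, fetch):
    """Try Guest Token first; fall back to Cookie auth only if configured.

    `fetch(url, mode=...)` is a hypothetical helper that raises on failure.
    """
    try:
        return fetch(url, mode="guest")
    except Exception:
        has_cookie = bool(
            os.environ.get("TWITTER_COOKIE")
            or (os.environ.get("TWITTER_AUTH_TOKEN") and os.environ.get("TWITTER_CT0"))
        )
        if has_cookie:
            return fetch(url, mode="cookie")
        raise
```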
Cookie Configuration (Optional - only needed for protected tweets):
# Option 1: Full cookie string (recommended for Web UI)
export TWITTER_COOKIE="guest_id=xxx; auth_token=abc123; ct0=xyz789; ..."
# Option 2: Separate tokens
export TWITTER_AUTH_TOKEN=abc123
export TWITTER_CT0=xyz789

Getting the Cookie (only if needed):
- Log in to x.com
- Open browser DevTools (F12) → Network tab
- Click any request and find Cookie: in the Request Headers
- Copy the entire cookie value
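Pulling auth_token and ct0 out of the copied cookie string amounts to splitting key=value pairs; this helper is hypothetical, and the project may parse it differently:

```python
# Hypothetical helper: extract auth_token and ct0 from a raw Cookie header.
def parse_twitter_cookie(cookie: str) -> dict:
    pairs = {}
    for part in cookie.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            pairs[key] = value
    return {"auth_token": pairs.get("auth_token"), "ct0": pairs.get("ct0")}
```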
OLD (before restructuring):
from wechat_news import WeChatNewsCrawler
from toutiao_news import ToutiaoNewsCrawler

NEW (current structure):
from news_crawler.wechat_news import WeChatNewsCrawler
from news_crawler.toutiao_news import ToutiaoNewsCrawler

All crawler imports now go through the news_crawler package.
All crawlers output JSON to data/ directory:
{
"title": "string",
"news_url": "string",
"news_id": "string",
"meta_info": {
"author_name": "string",
"author_url": "string",
"publish_time": "string"
},
"contents": [
{"type": "text|image|video", "content": "...", "desc": "..."}
],
"texts": ["array of text strings"],
"images": ["array of image URLs"],
"videos": ["array of video URLs"]
}

Extract news content from URL.
Request:
{
"url": "https://mp.weixin.qq.com/s/xxxxx",
"output_format": "markdown", // or "json"
"cookie": "optional_cookie_string" // Required for Twitter/X
}

Response:
{
"status": "success",
"data": { /* NewsItem JSON */ },
"platform": "wechat",
"extracted_at": "2024-10-15T10:30:00",
"markdown": "# Title\n\nContent..."
}

API Documentation: Visit http://localhost:8000/docs for Swagger UI
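A minimal client-side sketch of calling the endpoint; the payload builder mirrors the request schema above and can be wired to any HTTP client:

```python
import json
from typing import Optional

# Build the JSON body for POST /api/extract (fields from the request schema).
def build_extract_request(url: str, output_format: str = "markdown",
                          cookie: Optional[str] = None) -> str:
    payload = {"url": url, "output_format": output_format}
    if cookie:  # only needed for protected tweets on Twitter/X
        payload["cookie"] = cookie
    return json.dumps(payload)
```

For example, assuming the requests library is installed: `requests.post("http://localhost:8000/api/extract", data=build_extract_request(url), headers={"Content-Type": "application/json"})`.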
1. Create Adapter (backend/app/adapters/):

   from news_crawler.platform_news import PlatformNewsCrawler

   class PlatformAdapter:
       def extract(self, url: str) -> NewsItem:
           crawler = PlatformNewsCrawler(url)
           return crawler.run()

2. Register in Detector (backend/app/services/detector.py):

   URL_PATTERNS = {
       'platform': r'platform\.com',
   }

3. Register in Extractor (backend/app/services/extractor.py):

   ADAPTERS = {
       'platform': PlatformAdapter(),
   }

4. Add Frontend Logo (frontend/public/logos/platform.png)

5. Update Platform Selector (frontend/src/components/PlatformSelector.vue):

   { id: 'platform', name: 'Platform Name', icon: '/logos/platform.png', description: 'Description' }
Each crawler has a test section in __main__:

uv run -m news_crawler.platform_news.platform_news

Backend:
cd news-extractor-ui/backend
uv add package-name

Frontend:
cd news-extractor-ui/frontend
npm install package-name

- Python 3.8+
- uv - Package manager
- FastAPI - Web framework
- Pydantic - Data validation
- curl_cffi / requests - HTTP
- parsel - HTML parsing
- tenacity - Retry logic
- Vue 3 - Framework (Composition API)
- TypeScript - Type safety
- Vite - Build tool
- Axios - HTTP client
- Playwright (optional) - Browser automation for header extraction
- Uses tenacity with the @retry decorator
- 3 attempts, 1 second wait between retries
- Applied to both fetch_content() and run()
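The same policy can be reproduced with a stdlib stand-in; the project itself uses tenacity's @retry decorator, so this sketch only illustrates the 3-attempt / 1-second behavior:

```python
import time
from functools import wraps

def retry(attempts=3, wait=1.0):
    """Stdlib stand-in for tenacity's retry policy (3 attempts, fixed 1s wait)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # exhausted all attempts
                    time.sleep(wait)
        return wrapper
    return decorator
```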
- Default headers work without authentication
- Optional: Use playwright_driver.get_headers() for fresh cookies
- Cookies may expire; use auto-extraction when needed
Check for window.__QMTPL_SSR_DATA__ in the HTML:
- Present → parse_ssr_content()
- Absent → parse_html_to_news_content()
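The dispatch amounts to a substring check on the fetched page; a sketch (the real crawler calls the chosen method rather than returning its name):

```python
# Decide which WeChat parser to use based on the page type.
def choose_parser(html: str) -> str:
    if "window.__QMTPL_SSR_DATA__" in html:
        return "parse_ssr_content"
    return "parse_html_to_news_content"
```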
- Preserve structure (paragraphs, lists)
- Duplicate storage: structured contents + flattened texts/images/videos
- Handle nested media tags
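The duplicate-storage rule can be sketched with a hypothetical helper that derives the flattened lists from the structured contents:

```python
# Derive the flattened texts/images/videos lists from structured contents.
def flatten(contents):
    buckets = {"text": [], "image": [], "video": []}
    for item in contents:
        buckets[item["type"]].append(item["content"])
    return buckets["text"], buckets["image"], buckets["video"]
```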
# Python
__pycache__/
*.pyc
.venv/
data/
# Frontend
news-extractor-ui/frontend/node_modules/
news-extractor-ui/frontend/dist/
news-extractor-ui/frontend/package-lock.json
# Backend
news-extractor-ui/backend/.venv/
news-extractor-ui/backend/uv.lock
# System
.DS_Store
git status
git add .
git commit -m "description"
git push origin main

This repository is for educational and research purposes only. Users must:
- Not use for commercial purposes
- Not perform large-scale crawling
- Respect target websites' robots.txt and terms of service
- Understand that responsibility for misuse lies with the user
- 🆕 Twitter/X Crawler (news_crawler/twitter_news/): Added support for extracting tweets
  - Uses X internal GraphQL API (TweetResultByRestId)
  - Dual authentication modes:
    - Guest Token mode (default): No auth needed for public tweets
    - Cookie mode (optional): For protected tweets
  - Supports regular tweets, long tweets (note_tweet), articles, and quoted tweets
  - Extracts images and videos (highest bitrate)
- 🔧 Web UI Update: Added Twitter platform with optional Cookie input
  - Cookie input field shown when the Twitter platform is detected
  - Cookie is optional; public tweets work without it
  - Updated platform count to 12
- 📝 Core Package Update: Added TwitterAdapter with optional cookie parameter
- 🆕 MCP Server (news_extractor_mcp/): Added Model Context Protocol server for AI Agent integration
  - 4 tools: extract_news, detect_news_platform, list_supported_platforms, batch_extract_news
  - 1 resource: platforms://list
  - Full Claude Desktop integration support
- 🔧 Code Refactoring: Extracted shared code to news_extractor_core_pkg/
  - Moved adapters, services, and models to the core package
  - Backend now depends on the core package (cleaner architecture)
  - Eliminated code duplication between Web UI and MCP Server
- 📊 New Architecture: Three-tier structure
  - news_crawler/: Original crawler implementations
  - news_extractor_core_pkg/: Shared business logic
  - news_extractor_mcp/: MCP server for AI agents
  - news-extractor-ui/backend/: Web API (simplified)
- 📝 Documentation: Added MCP_INTEGRATION_SUMMARY.md and news_extractor_mcp/README.md
- Restructured crawler modules into the news_crawler/ directory
- Added comprehensive Web UI (FastAPI + Vue 3)
- Implemented platform auto-detection
- Added Markdown export functionality
- Unified logo display across UI
- Updated README with dual-mode usage instructions
- Fixed uv command formats (uv run script.py, not uv run python script.py)