A comprehensive tool for scraping, processing, and organizing Thirdweb TypeScript API documentation into a structured local markdown repository.
- Web Scraping: Traverses the Thirdweb documentation site to extract content
- Markdown Conversion: Converts HTML content to clean, well-formatted Markdown
- Intelligent Categorization: Organizes documentation into meaningful categories:
  - UI Components
  - React Hooks
  - Core Functions
  - Advanced Topics
- Index Generation: Creates navigation indexes for each category
- Content Cleaning: Removes unnecessary boilerplate and formats code blocks
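
The traversal step can be pictured as collecting same-site links from each fetched page. The sketch below is not taken from `improved_scraper.py`: `LinkCollector`, `extract_doc_links`, and the `docs.example` URLs are hypothetical names chosen for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href> tags that stay under a URL prefix."""

    def __init__(self, page_url, prefix):
        super().__init__()
        self.page_url = page_url  # URL of the page being parsed
        self.prefix = prefix      # only links starting with this are kept
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop #fragments so a page is listed once.
        absolute, _fragment = urldefrag(urljoin(self.page_url, href))
        if absolute.startswith(self.prefix) and absolute not in self.links:
            self.links.append(absolute)


def extract_doc_links(html, page_url, prefix):
    """Return the in-scope links found in one HTML page, in document order."""
    parser = LinkCollector(page_url, prefix)
    parser.feed(html)
    return parser.links
```

A crawler built on this would keep a queue of unvisited URLs and feed each downloaded page through `extract_doc_links` until the queue is empty.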
- Improved Scraper (`improved_scraper.py`): Main scraper with enhanced functionality
- Reorganization Tool (`reorganize_docs.py`): Sorts and categorizes documentation files
- Markdown Cleaner (`markdown_cleaner.py`): Cleans and formats scraped Markdown files
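
As a rough idea of what a Markdown cleaning pass can involve, the following sketch trims trailing whitespace, collapses runs of blank lines, and drops known boilerplate lines while leaving fenced code untouched. It is an assumption-laden illustration rather than the contents of `markdown_cleaner.py`; the `BOILERPLATE_LINES` entries and the `clean_markdown` name are hypothetical.

```python
FENCE = "`" * 3  # Markdown code-fence marker

# Hypothetical boilerplate strings; the real cleaner's rules may differ.
BOILERPLATE_LINES = {"Edit this page", "Was this page helpful?"}


def clean_markdown(text):
    """Return text with boilerplate and extra blank lines removed."""
    cleaned = []
    in_fence = False
    for line in text.splitlines():
        is_fence = line.lstrip().startswith(FENCE)
        if in_fence and not is_fence:
            cleaned.append(line)  # code lines are kept verbatim
            continue
        if is_fence:
            in_fence = not in_fence
            cleaned.append(line.rstrip())
            continue
        stripped = line.rstrip()
        if stripped.strip() in BOILERPLATE_LINES:
            continue  # drop page chrome left over from the site
        if stripped == "" and cleaned and cleaned[-1] == "":
            continue  # keep at most one blank line in a row
        cleaned.append(stripped)
    return "\n".join(cleaned).strip("\n") + "\n"
```

Tracking the fence state matters because whitespace and blank lines inside code samples are often significant and should not be normalized.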
- Python 3.x
- Required libraries listed in `requirements.txt`
To set up the virtual environment:
./setup_venv.sh
For the complete process (scraping, cleaning, and organizing):
./run_improved_scraper.sh
If you already have scraped documentation and want to reorganize it:
python reorganize_docs.py
The scraped content is organized as follows:
thirdweb_typescript_docs/
├── UI Components/
│ ├── 00_index.md
│ ├── Component1.md
│ └── ...
├── React Hooks/
│ ├── 00_index.md
│ ├── Hook1.md
│ └── ...
├── Core Functions/
│ ├── 00_index.md
│ ├── Function1.md
│ └── ...
└── Advanced Topics/
  ├── 00_index.md
  ├── Topic1.md
  └── ...
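
Each category folder in the layout above holds a `00_index.md`. A minimal sketch of how such an index could be generated is shown here; the index format (a heading plus one link per page) and the `write_category_index` name are assumptions, not the project's actual output.

```python
from pathlib import Path

INDEX_NAME = "00_index.md"


def write_category_index(category_dir):
    """Write an index file that links to every other .md file in the folder."""
    category_dir = Path(category_dir)
    pages = sorted(
        p for p in category_dir.glob("*.md") if p.name != INDEX_NAME
    )
    lines = [f"# {category_dir.name}", ""]
    for page in pages:
        # Link text is the file name without its extension.
        lines.append(f"- [{page.stem}]({page.name})")
    index_path = category_dir / INDEX_NAME
    index_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return index_path
```

The `00_` prefix keeps the index at the top of an alphabetical file listing, and excluding it from `pages` makes the function safe to rerun.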
- `ScraperBuildGuide.md`: Detailed guide for building similar documentation scrapers
- `reorganize_docs.py`: Script for categorizing documentation based on content patterns
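
Categorizing by content patterns can be sketched as an ordered list of regular expressions where the first match wins. The rules below are hypothetical and deliberately crude; they are not the patterns used by `reorganize_docs.py`, and the sample identifiers in the usage note are made up.

```python
import re

# Hypothetical rules, checked in order; the first matching pattern wins.
CATEGORY_PATTERNS = [
    # A call to something named like a React hook, e.g. useExample(
    ("React Hooks", re.compile(r"\buse[A-Z]\w*\s*\(")),
    # A self-closing JSX tag, e.g. <ExampleButton ... />
    ("UI Components", re.compile(r"<[A-Z]\w*[^<>]*/>")),
    # A function declaration or an awaited call
    ("Core Functions", re.compile(r"\bfunction\b|\bawait\s+\w+\(")),
]
DEFAULT_CATEGORY = "Advanced Topics"


def categorize(markdown_text):
    """Return the category name for one scraped Markdown page."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(markdown_text):
            return category
    return DEFAULT_CATEGORY
```

For example, `categorize("const x = useExample({})")` falls under React Hooks, while a page matching none of the patterns lands in Advanced Topics. Ordering the rules matters because a single page can contain both hook calls and JSX.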
This project helps developers maintain an up-to-date local copy of Thirdweb documentation for:
- Offline access
- Training AI models on the Thirdweb TypeScript SDK
- Creating customized knowledge bases
- Enhancing developer workflows with searchable documentation