APIAS (AI Powered API Documentation Scraper) is a tool that extracts API documentation from various sources and converts it into structured formats. Key features:
- Scrape API documentation from web pages
- Support for multiple documentation formats
- AI-powered content extraction and structuring
- Command-line interface for easy use
- Multiple output formats (Markdown, JSON, YAML)
- Batch processing mode with interactive TUI
To use APIAS you will need:
- Python 3.10 or higher (Python 3.9 is not supported)
- OpenAI API key (for AI-powered extraction)
The fastest way to install APIAS is using uv:
# Install as a tool (recommended for CLI usage)
uv tool install apias --python=3.10
# Or install in a project
uv add apias
Alternatively, install with pip:
pip install apias
Verify that your Python version meets the requirement:
python --version  # Should be 3.10 or higher
Once installed, APIAS can be used directly from Python:
import apias
from apias.config import APIASConfig
from apias.apias import Scraper, clean_html
# Check version
print(f"APIAS version: {apias.__version__}")
# Basic scraping with Scraper class
scraper = Scraper(quiet=True)
html_content, mime_type = scraper.scrape("https://api.example.com/docs")
# Clean and process the HTML
if html_content:
cleaned = clean_html(html_content)
print(f"Scraped {len(cleaned)} characters")
# Using configuration
config = APIASConfig(
model="gpt-5-nano",
num_threads=5,
quiet=True
)
print(f"Using model: {config.model}")For full programmatic API documentation, see API.md.
# Scrape a single page
apias --url https://api.example.com/docs
# Scrape multiple pages from a website (batch mode)
apias --url https://example.com --mode batch
# Limit how many pages to scrape
apias --url https://example.com --mode batch --limit 50
# Estimate costs before processing (no API calls made)
apias --url https://example.com --mode batch --estimate-cost
# Use a configuration file
apias --url https://example.com --config apias_config.yaml
# Resume a previous scraping session
apias --url https://example.com --mode batch --resume
# Scrape only (no AI processing)
apias --url https://example.com --mode batch --scrape-only
# Filter URLs with whitelist/blacklist patterns
apias --url https://example.com --mode batch --whitelist "*/api/*" --blacklist "*/legacy/*"
# Force specific retry count (for testing)
apias --url https://example.com --force-retry-count 3
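The `--whitelist` and `--blacklist` flags above take shell-style wildcard patterns. If you want to preview which URLs such patterns would keep, here is a rough sketch using Python's `fnmatch` (an assumption about the matching semantics; APIAS's exact rules may differ):

```python
from fnmatch import fnmatch


def url_allowed(url: str, whitelist: list[str], blacklist: list[str]) -> bool:
    """Keep a URL if it matches a whitelist pattern and no blacklist pattern."""
    if whitelist and not any(fnmatch(url, p) for p in whitelist):
        return False
    return not any(fnmatch(url, p) for p in blacklist)


urls = [
    "https://example.com/api/users",
    "https://example.com/legacy/v1/users",
    "https://example.com/blog/post",
]
print([u for u in urls if url_allowed(u, ["*/api/*"], ["*/legacy/*"])])
# ['https://example.com/api/users']
```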
Think of APIAS like a team of workers in a factory! APIAS can be configured using a YAML file. Generate an example with:
apias --generate-config
This creates `apias_config.yaml`, which you can edit.
num_threads: 5 # Default: 5 workers
Imagine you have a big pile of web pages to process. `num_threads` is like choosing how many workers to hire:
+---> Worker 1 ---> processes page A
|
Your Pages -------->+---> Worker 2 ---> processes page B
(waiting) |
+---> Worker 3 ---> processes page C
|
+---> Worker 4 ---> processes page D
|
+---> Worker 5 ---> processes page E
- num_threads: 1 = One worker, processes pages one by one (slow but gentle on the website)
- num_threads: 5 = Five workers processing 5 pages at the same time (faster!)
- num_threads: 10 = Ten workers (even faster, but uses more computer power)
Warning: Don't use more than 10-15 threads! Too many workers might:
- Overwhelm the website you're scraping (they might block you!)
- Hit OpenAI rate limits (the AI can only handle so many requests)
- Use too much memory on your computer
Recommendation: Start with 5. Increase to 10 if everything works smoothly.
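Conceptually, the worker pool behaves like the sketch below (illustrative only, not APIAS internals; `process_page` is a stand-in for the real scrape-and-extract work):

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(url: str) -> str:
    # Stand-in for fetching one page and running AI extraction on it
    return f"processed {url}"


pages = [f"https://example.com/docs/page-{i}" for i in range(20)]

# num_threads workers pull pages until the pile is empty
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(process_page, pages))

print(len(results))  # 20
```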
max_retries: 3 # Default: 3 attempts
Sometimes things fail (network hiccups, busy servers, and so on). `max_retries` is how many times APIAS will try again before giving up:
Attempt 1: "Hey server, give me this page!"
Server: "Sorry, I'm busy!" (FAIL)
Attempt 2: *waits 1 second* "Okay, how about now?"
Server: "Still busy!" (FAIL)
Attempt 3: *waits 2 seconds* "Please?"
Server: "Here you go!" (SUCCESS!)
- max_retries: 0 = Never retry (give up immediately on any error)
- max_retries: 3 = Try up to 3 times before giving up
- max_retries: 5 = Very persistent, keeps trying longer
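In code, that dialogue is a loop with a doubling wait between attempts. A sketch of the general pattern (using `requests` for illustration; not APIAS's exact implementation):

```python
import time

import requests


def fetch_with_retries(url: str, max_retries: int = 3) -> str | None:
    """Try up to max_retries times, doubling the wait after each failure."""
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text  # SUCCESS!
        except requests.RequestException:
            if attempt == max_retries:
                return None  # give up
            time.sleep(delay)  # wait 1s, then 2s, ...
            delay *= 2
```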
chunk_size: 50000 # Default: 50,000 characters
Web pages can be HUGE. We can't send a giant page to the AI all at once (it would choke!), so we cut it into smaller pieces called "chunks":
Giant Web Page (200,000 characters)
====================================
Gets cut into pieces:
[ Chunk 1 ] [ Chunk 2 ] [ Chunk 3 ] [ Chunk 4 ]
(50,000) (50,000) (50,000) (50,000)
| | | |
v v v v
AI AI AI AI
| | | |
v v v v
[Result 1] [Result 2] [Result 3] [Result 4]
Then all results get merged back together!
- chunk_size: 30000 = Smaller pieces (more API calls, but safer for complex pages)
- chunk_size: 50000 = Default balance
- chunk_size: 100000 = Bigger pieces (fewer API calls, but might hit token limits)
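The size math itself is simple; a minimal chunking sketch (APIAS may split more carefully, e.g. on element boundaries, but the piece sizes work like this):

```python
def split_into_chunks(text: str, chunk_size: int = 50_000) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


page = "x" * 200_000              # a giant 200,000-character page
chunks = split_into_chunks(page)  # four 50,000-character pieces
print([len(c) for c in chunks])   # [50000, 50000, 50000, 50000]
```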
model: gpt-5-nano # Default: fast, affordable, and highly capable
OpenAI GPT-5 models offer excellent quality at different price points. Prices shown below are approximate and may change; check OpenAI Pricing for current rates:
| Model | Context | Input | Output | Best For |
|---|---|---|---|---|
| `gpt-5-nano` | 272K | Very Low | Very Low | Most scraping tasks (recommended default) |
| `gpt-5-mini` | 272K | Low | Low | Complex documentation |
| `gpt-5` | 272K | Medium | Medium | Premium quality extraction |
| `gpt-5.1` | 272K | Medium | Medium | Agentic tasks, coding (newest) |
| `gpt-5-pro` | 400K | High | High | Extended context, highest quality |
Note: Most GPT-5 models support 128K output tokens; `gpt-5-pro` supports 272K output tokens. The `gpt-5-nano` model offers the best cost-performance ratio for API documentation scraping.
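Switching models is a one-line change with the `APIASConfig` shown in the quickstart (the choice of `gpt-5-mini` here is just an example for complex documentation):

```python
from apias.config import APIASConfig

# Trade a little cost for quality on complex documentation
config = APIASConfig(model="gpt-5-mini", num_threads=5, quiet=True)
print(f"Using model: {config.model}")
```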
limit: 50 # Only scrape up to 50 pages (null = no limit)
In batch mode, a website might have thousands of pages. Use `limit` to control how many are scraped:
# Command line:
apias --url https://example.com --mode batch --limit 100
# Or in config file:
limit: 100
Before committing to a full extraction, you can estimate costs without making any OpenAI API calls:
apias --url https://example.com --mode batch --estimate-cost
This will:
- Scrape all pages (respecting `--limit` if set)
- Calculate total input tokens from page content
- Display three cost scenarios based on real-world usage data:
┌─────────────────────────────────────────────────────────────┐
│ Cost Estimation │
├─────────────────────────────────────────────────────────────┤
│ Input Tokens: 1,234,567 │
├─────────────────────────────────────────────────────────────┤
│ Scenario │ Output Tokens │ Input Cost │ Total Cost │
├─────────────────┼───────────────┼────────────┼──────────────┤
│ Conservative │ 716,249 │ $0.06 │ $0.35 │
│ Average │ 2,271,603 │ $0.06 │ $0.97 │
│ Worst Case │ 14,592,582 │ $0.06 │ $5.90 │
└─────────────────────────────────────────────────────────────┘
Cost Scenarios Explained:
| Scenario | Output Ratio | Description |
|---|---|---|
| Conservative | 0.58x input | P50 median - half of jobs cost this or less |
| Average | 1.84x input | Mean across all extractions |
| Worst Case | 11.82x input | P95 - only 5% of jobs exceed this |
Tip: The Conservative estimate is typically accurate for well-structured API documentation. Use the Worst Case estimate for budget planning with complex or messy HTML.
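The scenario math is plain arithmetic: projected output tokens are a fixed ratio of input tokens, and each side is priced per million. The sketch below reproduces the example table above; the per-million rates ($0.05 input, $0.40 output) are assumptions inferred from the table's numbers, not official pricing:

```python
INPUT_RATE = 0.05 / 1_000_000   # assumed $ per input token (inferred from the table)
OUTPUT_RATE = 0.40 / 1_000_000  # assumed $ per output token (inferred from the table)

SCENARIOS = {"Conservative": 0.58, "Average": 1.84, "Worst Case": 11.82}


def estimate(input_tokens: int) -> None:
    input_cost = input_tokens * INPUT_RATE
    for name, ratio in SCENARIOS.items():
        output_tokens = int(input_tokens * ratio)
        total = input_cost + output_tokens * OUTPUT_RATE
        print(f"{name:<13} {output_tokens:>12,}  ${total:.2f}")


estimate(1_234_567)
# Conservative       716,048  $0.35
# Average          2,271,603  $0.97
# Worst Case      14,592,581  $5.90
```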
Here are a few example configurations. A balanced starting point:
num_threads: 3
max_retries: 3
chunk_size: 50000
model: gpt-5-nano
limit: null
For large jobs where speed matters more:
num_threads: 8
max_retries: 5
chunk_size: 40000
model: gpt-5-nano
limit: 500
For fragile or rate-limited sites, stay gentle:
num_threads: 2
max_retries: 5
retry_delay: 2.0
chunk_size: 30000
model: gpt-5-nano
For unattended or scripted runs:
num_threads: 5
no_tui: true
quiet: true
auto_resume: true
You can also use environment variables:
# Required: Your OpenAI API key
export OPENAI_API_KEY="sk-your-key-here"
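# Optional sanity check (plain shell, not an APIAS feature):
# warn early if the key is missing
test -n "$OPENAI_API_KEY" || echo "OPENAI_API_KEY is not set"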
# Then run APIAS
apias --url https://example.com
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
For security issues, please see our Security Policy.
See CHANGELOG.md for a list of changes.
- API Documentation: API Reference
- Issues: GitHub Issues
- PyPI: https://pypi.org/project/apias/
