Technical documentation for the Browser.AI framework, a browser automation system that makes websites accessible to AI agents.
It provides technical specifications, architecture diagrams, implementation guides, and workflow patterns for all Browser.AI components.
| Document | Description | Focus Area |
|---|---|---|
| Architecture Overview | High-level system design and component relationships | System Architecture |
| Agent Implementation | AI agent service, orchestration, and specializations | Agent System |
| DOM Processing | DOM analysis, element extraction, and processing workflows | DOM Service |
| Controller & Actions | Action system, registry patterns, and browser operations | Action Framework |
| Browser Management | Playwright integration, context management, and configurations | Browser Layer |
| Workflows & Integration | End-to-end workflows, integration patterns, and real-world scenarios | Implementation Patterns |
Browser.AI follows a modular architecture with four core components:
graph TB
subgraph "Browser.AI Framework"
Agent[🤖 AI Agent Service<br/>Task orchestration & LLM integration]
Controller[🎮 Action Controller<br/>Browser action registry & execution]
Browser[Browser Service<br/>Enhanced Playwright wrapper]
DOM[DOM Processing<br/>Element extraction & analysis]
end
subgraph "External Services"
LLM[🧠 Language Models<br/>OpenAI, Anthropic, Ollama, etc.]
PW[🎭 Playwright Engine<br/>Cross-browser automation]
end
Agent --> Controller
Agent --> LLM
Controller --> Browser
Browser --> DOM
Browser --> PW
style Agent fill:#e1f5fe
style Controller fill:#f3e5f5
style Browser fill:#e8f5e8
style DOM fill:#fff3e0
import asyncio

from browser_ai import Agent, Browser
from langchain_openai import ChatOpenAI


async def main():
    # Initialize components
    llm = ChatOpenAI(model="gpt-4-turbo")
    browser = Browser()

    # Create agent
    agent = Agent(
        task="Search for Python tutorials on Google and extract top 5 results",
        llm=llm,
        browser=browser,
        use_vision=True
    )

    # Execute task
    result = await agent.run()
    print(f"Task completed: {result.is_done}")
    print(f"Results: {result.extracted_content}")


if __name__ == "__main__":
    asyncio.run(main())

from browser_ai import Agent, BrowserConfig, Controller
from langchain_anthropic import ChatAnthropic
# Custom browser configuration
browser_config = BrowserConfig(
    headless=False,
    disable_security=True,
    extra_chromium_args=['--disable-blink-features=AutomationControlled']
)

# Custom controller with excluded actions
controller = Controller(exclude_actions=['search_google'])

# Claude-3 with custom settings
llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0)

agent = Agent(
    task="Complex web automation with custom actions",
    llm=llm,
    controller=controller,
    browser_config=browser_config,
    use_vision=True,
    max_failures=3
)

The Agent Service is the central orchestrator that combines intelligent planning with robust execution:
Key Features:
- Step-by-step task execution with planning
- Comprehensive conversation persistence
- Advanced prompt engineering with dynamic context
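To make the idea of conversation persistence concrete, the sketch below writes the messages exchanged at each step to a separate JSON file so a run can be inspected afterwards. `ConversationLog`, its file naming scheme, and the message shape are hypothetical names chosen for illustration; they are not taken from the Browser.AI source.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ConversationLog:
    """Hypothetical helper: persists the messages exchanged at each agent step."""

    directory: Path
    steps_written: int = 0

    def _path_for(self, step: int) -> Path:
        # Zero-padded names keep the files sorted in step order.
        return self.directory / f"step_{step:03d}.json"

    def save_step(self, messages: list) -> Path:
        """Write one step's messages to disk and return the file path."""
        self.directory.mkdir(parents=True, exist_ok=True)
        self.steps_written += 1
        path = self._path_for(self.steps_written)
        path.write_text(json.dumps(messages, indent=2), encoding="utf-8")
        return path

    def load_step(self, step: int) -> list:
        """Read back the messages saved for a given 1-based step number."""
        return json.loads(self._path_for(step).read_text(encoding="utf-8"))
```

One file per step keeps each record small and lets a failed run be replayed up to the step where it went wrong.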
The Controller provides a flexible, extensible framework for browser actions:
- Registry Pattern: Decorator-based action registration
- Type Safety: Pydantic parameter validation
- Extensibility: Easy custom action development
- Built-in Actions: Navigation, interaction, tab management, and completion
Available Actions:
- Navigation: go_to_url, search_google, go_back
- Interaction: click_element, input_text, scroll, send_keys
- Tab Management: switch_tab, open_tab, close_tab
- Task Completion: done, custom output models
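The registry pattern described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the Browser.AI implementation: `ActionRegistry`, `GoToUrlParams`, and the `action` decorator are hypothetical names, and a standard-library dataclass stands in for a Pydantic model so the example runs without extra packages.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class GoToUrlParams:
    """Parameter model for the example action (stand-in for a Pydantic model)."""

    url: str

    def __post_init__(self) -> None:
        # Reject anything that is not an http(s) URL before the action runs.
        if not isinstance(self.url, str) or not self.url.startswith(("http://", "https://")):
            raise ValueError(f"invalid url: {self.url!r}")


class ActionRegistry:
    """Hypothetical registry mapping action names to their model and handler."""

    def __init__(self, exclude_actions: Optional[list] = None) -> None:
        self._exclude = set(exclude_actions or [])
        self._actions: dict = {}

    def action(self, description: str, param_model: type) -> Callable:
        """Decorator that registers the wrapped function under its own name."""

        def decorator(func: Callable) -> Callable:
            if func.__name__ not in self._exclude:
                self._actions[func.__name__] = (description, param_model, func)
            return func

        return decorator

    def names(self) -> list:
        return sorted(self._actions)

    def execute(self, name: str, raw_params: dict) -> Any:
        if name not in self._actions:
            raise KeyError(f"unknown or excluded action: {name}")
        _, param_model, func = self._actions[name]
        params = param_model(**raw_params)  # validation happens on construction
        return func(params)


# Usage: one registered action and one that is excluded by name.
registry = ActionRegistry(exclude_actions=["search_google"])


@registry.action("Navigate to a URL", GoToUrlParams)
def go_to_url(params: GoToUrlParams) -> str:
    return f"navigated to {params.url}"


@registry.action("Search Google", GoToUrlParams)
def search_google(params: GoToUrlParams) -> str:
    return f"searched via {params.url}"
```

Validating parameters at the registry boundary means a malformed action proposed by the model fails before any browser operation is attempted.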
The Browser Service is an enhanced Playwright wrapper with the following capabilities:
- Multiple Connection Types: Local launch, CDP, WebSocket, existing instances
- Context Management: Isolated browsing contexts with state tracking
- Advanced Features: Download handling, file uploads, screenshot capture
- Performance Optimization: Resource management and memory cleanup
Connection Options:
- Local browser launch (default)
- Chrome DevTools Protocol (CDP)
- WebSocket connections (remote browsers)
- Existing Chrome instance reuse
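One way to picture how the four connection options relate is a small resolver that picks a mode from whichever setting is present. The sketch below is purely illustrative: `ConnectionSettings`, its field names, `ConnectionMode`, and the precedence order are all assumptions, not the Browser.AI configuration API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConnectionMode(Enum):
    LOCAL_LAUNCH = "local_launch"
    CDP = "cdp"
    WEBSOCKET = "websocket"
    EXISTING_CHROME = "existing_chrome"


@dataclass(frozen=True)
class ConnectionSettings:
    """Hypothetical settings object; field names are illustrative only."""

    cdp_url: Optional[str] = None
    wss_url: Optional[str] = None
    chrome_instance_path: Optional[str] = None


def resolve_connection_mode(settings: ConnectionSettings) -> ConnectionMode:
    """Pick a connection mode; the precedence order here is an assumption."""
    if settings.cdp_url:
        return ConnectionMode.CDP
    if settings.wss_url:
        return ConnectionMode.WEBSOCKET
    if settings.chrome_instance_path:
        return ConnectionMode.EXISTING_CHROME
    # Nothing configured: fall back to launching a local browser (the default).
    return ConnectionMode.LOCAL_LAUNCH
```

For example, `resolve_connection_mode(ConnectionSettings(wss_url="wss://remote.example/browser"))` selects the WebSocket mode, while an empty `ConnectionSettings()` falls back to a local launch.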
DOM Processing analyzes the page by combining JavaScript and Python processing:
- Interactive Element Detection: Smart identification of clickable elements
- Visual Highlighting: Color-coded element highlighting for AI analysis
- Coordinate Systems: Viewport and page coordinate mapping
- Element History: Cross-page element tracking and identity management
Processing Pipeline:
- JavaScript injection for DOM analysis
- Interactive element identification and filtering
- Coordinate calculation and viewport analysis
- Python DOM tree construction with relationships
- Selector map generation for action targeting
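The Python side of this pipeline can be sketched as follows, assuming the injected JavaScript returns a flat list of element records. The record keys (`tag`, `xpath`, `x`, `y`, `visible`, `role`), the `ElementNode` class, and `build_selector_map` are hypothetical; tree construction is left out to keep the example short.

```python
from dataclasses import dataclass

# Tags treated as interactive in this sketch; a real detector would check more signals.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


@dataclass
class ElementNode:
    tag: str
    xpath: str
    viewport_x: float
    viewport_y: float
    page_x: float
    page_y: float
    highlight_index: int


def is_interactive(raw: dict) -> bool:
    """Interactive if the tag is a known control or the element has a button role."""
    return raw["tag"] in INTERACTIVE_TAGS or raw.get("role") == "button"


def build_selector_map(raw_elements: list, scroll_x: float, scroll_y: float) -> dict:
    """Filter interactive, visible elements and index them for action targeting."""
    selector_map = {}
    for raw in raw_elements:
        if not is_interactive(raw) or not raw.get("visible", True):
            continue
        index = len(selector_map)  # indices are assigned in document order
        selector_map[index] = ElementNode(
            tag=raw["tag"],
            xpath=raw["xpath"],
            viewport_x=raw["x"],
            viewport_y=raw["y"],
            # Page coordinates are viewport coordinates shifted by the scroll offset.
            page_x=raw["x"] + scroll_x,
            page_y=raw["y"] + scroll_y,
            highlight_index=index,
        )
    return selector_map
```

An action such as click_element can then refer to an element by its integer index, and the controller looks up the matching XPath in the selector map.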
sequenceDiagram
participant User
participant Agent
participant LLM
participant Controller
participant Browser
participant DOM
User->>Agent: Submit Task
Agent->>LLM: Generate Plan
loop Step Execution
Agent->>Browser: Get Current State
Browser->>DOM: Extract Elements
DOM-->>Browser: DOM State
Browser-->>Agent: Complete State
Agent->>LLM: Decide Action
LLM-->>Agent: Action Decision
Agent->>Controller: Execute Action
Controller->>Browser: Browser Operation
Browser-->>Controller: Result
Controller-->>Agent: Action Result
end
Agent-->>User: Task Complete
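The step loop in the diagram can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than the framework's code: `run_agent_loop`, the three callables, and `ActionResult` are hypothetical stand-ins for the Browser, LLM, and Controller roles, and counting consecutive failures is only one plausible reading of a `max_failures` setting.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ActionResult:
    """Hypothetical result record returned by the controller stand-in."""

    is_done: bool = False
    extracted_content: str = ""
    error: Optional[str] = None


def run_agent_loop(
    task: str,
    get_state: Callable[[], str],
    decide_action: Callable[[str, str, list], str],
    execute_action: Callable[[str], ActionResult],
    max_steps: int = 10,
    max_failures: int = 3,
) -> list:
    """Observe, decide, act; stop on completion, step limit, or repeated failure."""
    history = []
    consecutive_failures = 0
    for _ in range(max_steps):
        state = get_state()                           # Browser + DOM: current page state
        action = decide_action(task, state, history)  # LLM: choose the next action
        result = execute_action(action)               # Controller: run it in the browser
        history.append(result)
        if result.error:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                break
            continue
        consecutive_failures = 0  # assumption: a success resets the failure count
        if result.is_done:
            break
    return history
```

Passing the history back into `decide_action` is what lets each decision take earlier results into account, which matches the feedback arrows in the diagram.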
The documentation includes comprehensive examples for:
- Form Filling: Multi-step form completion with validation
- Data Extraction: Cross-site data mining with pagination handling
- E-Commerce Automation: Product research, cart management, checkout processes
- Social Media Management: Content posting, scheduling, engagement tracking
- Automated Testing: Regression testing, screenshot comparison, validation
# LLM API Keys
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
# Logging Configuration
BROWSER_AI_LOGGING_LEVEL=info  # result | debug | info

from browser_ai import BrowserConfig
# Assumption: ProxySettings is Playwright's proxy settings type; adjust this
# import if Browser.AI exposes its own.
from playwright.async_api import ProxySettings

config = BrowserConfig(
    headless=False,            # Visual mode for development
    disable_security=True,     # Bypass web security for automation
    extra_chromium_args=[      # Custom Chrome arguments
        '--disable-blink-features=AutomationControlled',
        '--user-agent=Custom Browser AI Agent'
    ],
    proxy=ProxySettings(       # Proxy configuration
        server="http://proxy:8080",
        username="user",
        password="pass"
    )
)

For developers new to Browser.AI, we recommend this learning progression:
- Start with Architecture Overview
  - Understand the high-level system design
  - Learn component relationships and data flow
- Explore Agent Implementation
  - Master the central orchestration system
  - Learn prompt engineering and conversation management
- Study DOM Processing
  - Understand element extraction and analysis
  - Learn the JavaScript/Python integration
- Examine Controller & Actions
  - Master the action system and registry patterns
  - Learn to create custom actions
- Review Browser Management
  - Understand Playwright integration
  - Learn advanced configuration options
- Practice with Workflows & Integration
  - Implement real-world automation scenarios
  - Learn advanced integration patterns
This documentation is updated continuously. Contributions are welcome in these areas:
- Additional Workflow Examples: Real-world automation scenarios
- Performance Optimization: Advanced tuning and scaling patterns
- Integration Guides: Third-party service integrations
- Troubleshooting: Common issues and solutions
- Best Practices: Lessons learned from production deployments
Each documentation section includes:
- Class Definitions: Complete API specifications
- Method Signatures: Parameter types and return values
- Configuration Options: All available settings
- Usage Examples: Practical implementation patterns
- Mermaid Diagrams: Visual architecture representations
Browser.AI excels in these automation scenarios:
- Web Scraping: Intelligent content extraction with AI-driven navigation
- E-Commerce Automation: Product research, price monitoring, inventory management
- Form Processing: Automated form filling with validation and error handling
- Testing & QA: Regression testing with visual validation
- Social Media Management: Content posting, engagement, analytics collection
- Research & Data Collection: Systematic information gathering across multiple sources
- Workflow Automation: Business process automation with decision-making capabilities