feat: Web Pipeline Engine — declarative browser automation with YAML adapters #1967
Closed
Description
Summary
Add a declarative web automation pipeline system to AutoBot. Agents, the knowledge system, and users can define web scraping/automation tasks as YAML pipelines with composable steps. The system includes XHR interception, accessibility snapshots, auth strategy detection, AI-powered API discovery, and self-extending adapter synthesis.
Inspired by: jackwener/opencli (reference for patterns only — no code dependency)
Design doc: docs/plans/2026-03-20-web-pipeline-engine-design.md
Use Cases
- Agent-driven web research — agents autonomously scrape sites to gather structured data
- User-triggered web automation — users define scraping tasks through chat UI
- Knowledge ingestion — pull structured data from websites into AutoBot's knowledge base
Phase Checklist
Phase 1: XHR Interceptor + Accessibility Snapshot
- `services/web_pipeline/interceptor.py`: JS interceptor generator + result collector
- `services/web_pipeline/snapshot.py`: accessibility tree capture via Playwright
- Register `intercept_api` MCP tool in `browser_mcp.py`
- Register `page_snapshot` MCP tool in `browser_mcp.py`
- Add `inject_js()` and `get_accessibility_tree()` to `playwright_service.py`
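As a rough illustration of the "JS interceptor generator + result collector" item, the sketch below builds a JavaScript snippet from Python that wraps `fetch` and `XMLHttpRequest` and buffers matching JSON responses. The function names, the `__autobot_captured` buffer, and the substring-based URL match are assumptions made for this example, not the design doc's actual interface.

```python
import json


def build_interceptor_js(url_pattern: str, store_name: str = "__autobot_captured") -> str:
    """Return JS that records JSON responses whose URL contains url_pattern.

    Hypothetical helper: names and matching rule are illustrative only.
    """
    # json.dumps yields a correctly quoted and escaped JS string literal.
    pattern = json.dumps(url_pattern)
    store = json.dumps(store_name)
    return f"""
(() => {{
  // Shared buffer on window so a later collector call can read it.
  const store = window[{store}] = window[{store}] || [];
  if (window.__autobot_patched) return;  // patch only once per page
  window.__autobot_patched = true;

  const origFetch = window.fetch;
  window.fetch = async (...args) => {{
    const res = await origFetch.apply(window, args);
    try {{
      if (res.url.includes({pattern})) {{
        store.push({{ url: res.url, body: await res.clone().json() }});
      }}
    }} catch (e) {{ /* non-JSON body: skip */ }}
    return res;
  }};

  const origOpen = XMLHttpRequest.prototype.open;
  XMLHttpRequest.prototype.open = function (method, url, ...rest) {{
    this.addEventListener("load", () => {{
      try {{
        if (String(url).includes({pattern})) {{
          store.push({{ url: String(url), body: JSON.parse(this.responseText) }});
        }}
      }} catch (e) {{ /* non-JSON body: skip */ }}
    }});
    return origOpen.call(this, method, url, ...rest);
  }};
}})();
"""


def build_collector_js(store_name: str = "__autobot_captured") -> str:
    """Return a JS expression that drains the buffer in place and returns its items."""
    store = json.dumps(store_name)
    return f"(window[{store}] || []).splice(0)"
```

The generated strings would be handed to whatever injection hook the Playwright service exposes (the checklist proposes `inject_js()`); the collector drains with `splice(0)` so the interceptor keeps writing to the same array afterwards.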
Phase 2: Pipeline Engine + Step Library
- `services/web_pipeline/engine.py`: PipelineEngine + PipelineContext
- `services/web_pipeline/template.py`: safe expression evaluator (restricted AST)
- `services/web_pipeline/steps/`: step handlers (navigate, fetch, evaluate, intercept, snapshot, select, map, filter, sort, limit, click, type, wait)
- `services/web_pipeline/loader.py`: YAML adapter discovery + loading
- `services/web_pipeline/adapters/`: adapter directory structure
- Register `run_pipeline` MCP tool in `browser_mcp.py`
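The "safe expression evaluator (restricted AST)" item could look roughly like the following: parse the expression with `ast`, then walk the tree and evaluate only an allowlist of node types, rejecting calls and attribute access. The allowlist chosen here and the `safe_eval` name are assumptions for illustration; the sketch also assumes Python 3.9+, where a subscript's index is a plain expression node.

```python
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Eq: operator.eq, ast.NotEq: operator.ne,
            ast.Lt: operator.lt, ast.LtE: operator.le,
            ast.Gt: operator.gt, ast.GtE: operator.ge}
_UNARY_OPS = {ast.Not: operator.not_, ast.USub: operator.neg}


def safe_eval(expr: str, variables: dict):
    """Evaluate a small expression language; anything off the allowlist raises ValueError."""

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise NameError(f"unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](ev(node.operand))
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            return _CMP_OPS[type(node.ops[0])](ev(node.left), ev(node.comparators[0]))
        if isinstance(node, ast.BoolOp):
            # Simplification: returns a bool rather than Python's last-operand value.
            values = [ev(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Subscript):
            return ev(node.value)[ev(node.slice)]
        # Calls, attribute access, lambdas, comprehensions, etc. land here.
        raise ValueError(f"disallowed expression: {type(node).__name__}")

    return ev(ast.parse(expr, mode="eval"))
```

Because function calls and attribute access never match a branch, an expression such as `__import__('os').system('id')` is rejected before anything runs. A production evaluator would likely also cap expression size and operand magnitude.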
Phase 3: Auth Strategy Cascade
- `services/web_pipeline/auth.py`: AuthStrategy enum + AuthCascade (5-tier)
- Strategy-aware step execution in `engine.py`
- Redis caching of detected strategies per domain (24h TTL)
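One way to read the cascade plus per-domain caching is sketched below: try strategies from cheapest to most expensive, remember the first one that works, and skip probing while the cached entry is fresh. The issue does not name the five tiers, so the enum members here are placeholders; the in-memory `StrategyCache` stands in for Redis, and `probe` is a hypothetical callback that reports whether a strategy works for a domain.

```python
import time
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class AuthStrategy(Enum):
    # Placeholder tier names (assumption); ordered cheapest first.
    PUBLIC = 1
    COOKIE = 2
    HEADER = 3
    INTERCEPT = 4
    UI = 5


class StrategyCache:
    """In-memory stand-in for the Redis cache, with a per-entry expiry."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[AuthStrategy, float]] = {}

    def get(self, domain: str) -> Optional[AuthStrategy]:
        entry = self._entries.get(domain)
        if entry is None:
            return None
        strategy, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[domain]  # expired: force re-detection
            return None
        return strategy

    def set(self, domain: str, strategy: AuthStrategy) -> None:
        self._entries[domain] = (strategy, self._clock() + self._ttl)


def detect_strategy(domain: str,
                    probe: Callable[[str, AuthStrategy], bool],
                    cache: StrategyCache) -> Optional[AuthStrategy]:
    """Return the first strategy that probe() accepts, consulting the cache first."""
    cached = cache.get(domain)
    if cached is not None:
        return cached
    for strategy in AuthStrategy:  # Enum iterates in definition order
        if probe(domain, strategy):
            cache.set(domain, strategy)
            return strategy
    return None
```

With Redis, the same behaviour would typically come from storing the strategy name under a per-domain key with an expiry of 24 hours, so no manual timestamp check is needed.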
Phase 4: API Discovery (Explore)
- `services/web_pipeline/explore.py`: site explorer + network analyzer
- `services/web_pipeline/field_roles.py`: field heuristics + endpoint scoring
- Register `explore_site` MCP tool in `browser_mcp.py`
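To make "field heuristics + endpoint scoring" concrete, here is a minimal sketch: guess a semantic role for each field name, then score a captured JSON payload by how many distinct roles its records expose. The role table, the tokenisation, and the scoring formula are all invented for this example and are not taken from the design doc.

```python
import re
from typing import Any, List, Optional

# Hypothetical role hints: role -> field-name tokens that suggest it.
ROLE_HINTS = {
    "id": ("id", "uuid", "slug"),
    "title": ("title", "name", "headline"),
    "url": ("url", "link", "href"),
    "time": ("time", "date", "created", "updated"),
}


def guess_role(field_name: str) -> Optional[str]:
    """Map a field name to a role by matching whole tokens, not substrings."""
    # Split camelCase, then pull out alphabetic tokens: "userId" -> ["user", "id"].
    snake = re.sub(r"([a-z])([A-Z])", r"\1_\2", field_name).lower()
    tokens = re.findall(r"[a-z]+", snake)
    for role, hints in ROLE_HINTS.items():
        if any(token in hints for token in tokens):
            return role
    return None


def find_records(payload: Any) -> List[dict]:
    """Return the first non-empty list of dicts at the top level or one key deep."""
    candidates = [payload]
    if isinstance(payload, dict):
        candidates.extend(payload.values())
    for candidate in candidates:
        if (isinstance(candidate, list) and candidate
                and all(isinstance(item, dict) for item in candidate)):
            return candidate
    return []


def score_endpoint(payload: Any) -> int:
    """Score = distinct roles in the first record, plus 1 if there are several records."""
    records = find_records(payload)
    if not records:
        return 0
    roles = {guess_role(key) for key in records[0]} - {None}
    return len(roles) + (1 if len(records) > 1 else 0)
```

Token matching rather than substring matching avoids false positives such as `width` being read as an `id` field. An explorer could rank intercepted endpoints by this score and keep the top few as adapter candidates.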
Phase 5: Adapter Registry + Auto-Synthesis
- `services/web_pipeline/registry.py`: AdapterRegistry (CRUD)
- `services/web_pipeline/synthesizer.py`: AdapterSynthesizer
- `services/web_pipeline/adapters/builtin/`: shipped adapters
- `services/web_pipeline/adapters/discovered/`: gitignored auto-generated adapters
- Register `synthesize_adapters`, `register_adapter`, `list_adapters` MCP tools
- Knowledge base integration via `web_crawler_connector.py`
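A registry with CRUD operations over adapters might be shaped like the sketch below. It keeps adapters in memory and tags each with a `source` mirroring the `builtin/` and `discovered/` directories; the `Adapter` fields, method names, and overwrite rule are assumptions for illustration, and a real registry would presumably persist to the YAML files on disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Adapter:
    # Hypothetical shape; the real adapter schema lives in the design doc.
    name: str
    domain: str
    steps: List[dict] = field(default_factory=list)
    source: str = "discovered"  # or "builtin"


class AdapterRegistry:
    """In-memory create/read/update/delete over adapters, keyed by name."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, adapter: Adapter, overwrite: bool = False) -> None:
        # Create, or update when overwrite=True.
        if adapter.name in self._adapters and not overwrite:
            raise ValueError(f"adapter already registered: {adapter.name}")
        self._adapters[adapter.name] = adapter

    def get(self, name: str) -> Optional[Adapter]:
        return self._adapters.get(name)

    def list_adapters(self, source: Optional[str] = None) -> List[Adapter]:
        # Sorted by name so listings are stable across calls.
        found = [a for a in self._adapters.values()
                 if source is None or a.source == source]
        return sorted(found, key=lambda a: a.name)

    def remove(self, name: str) -> bool:
        """Delete an adapter; returns False if it was not registered."""
        return self._adapters.pop(name, None) is not None
```

Requiring an explicit `overwrite=True` is one way to keep an auto-synthesized adapter from silently replacing a shipped one with the same name.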
Architecture
User/Agent Request
↓
Browser MCP (7 new tools)
↓
Pipeline Engine (YAML → step executor)
├─ navigate, evaluate, fetch (existing Playwright)
├─ intercept, snapshot (Phase 1)
├─ select, map, filter, sort, limit (Phase 2)
└─ auth cascade (Phase 3)
↓
Playwright Service → Browser VM (.25)
↓
Results → Agent / Knowledge Base / Chat UI
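The "YAML → step executor" layer in the diagram can be pictured as a registry of named step handlers applied in order, each receiving the previous step's output. The sketch below covers only the pure data steps (select, filter, map, sort, limit) and takes the pipeline as an already-parsed list of single-key dicts, as a YAML loader would produce. The argument shapes for each step are assumptions, not the adapter schema from the design doc.

```python
from typing import Any, Callable, Dict, List

STEP_HANDLERS: Dict[str, Callable[[Any, Any], Any]] = {}


def step(name: str):
    """Decorator that registers a handler under a step name."""
    def decorator(fn):
        STEP_HANDLERS[name] = fn
        return fn
    return decorator


@step("select")
def _select(data, path: str):
    # Walk a dotted path such as "data.items" through nested dicts.
    for key in path.split("."):
        data = data[key]
    return data


@step("filter")
def _filter(data, arg: dict):
    # Hypothetical argument shape: keep rows whose field is >= min.
    return [row for row in data if row[arg["field"]] >= arg["min"]]


@step("map")
def _map(data, mapping: dict):
    # mapping is {output_key: source_key}.
    return [{out: row[src] for out, src in mapping.items()} for row in data]


@step("sort")
def _sort(data, arg: dict):
    return sorted(data, key=lambda row: row[arg["by"]],
                  reverse=arg.get("reverse", False))


@step("limit")
def _limit(data, n: int):
    return data[:n]


def run_pipeline(steps: List[dict], data: Any = None) -> Any:
    """Apply each step in order, feeding every step the previous step's output."""
    for spec in steps:
        (name, arg), = spec.items()  # each step is a single-key mapping
        handler = STEP_HANDLERS.get(name)
        if handler is None:
            raise ValueError(f"unknown step: {name}")
        data = handler(data, arg)
    return data
```

Browser-facing steps such as navigate or intercept would register through the same decorator but delegate to the Playwright service, so new step types can be added without touching the executor loop.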
Deliverables
| Phase | New MCP Tools | New Files | Modified Files |
|---|---|---|---|
| 1 | intercept_api, page_snapshot | 3 | 2 |
| 2 | run_pipeline | 8+ | 1 |
| 3 | none | 1 | 3 |
| 4 | explore_site | 2 | 1 |
| 5 | synthesize_adapters, register_adapter, list_adapters | 3+ | 3 |
| Total | 7 new tools | ~17 files | ~10 files |