feat: Web Pipeline Engine — declarative browser automation with YAML adapters #1967

@mrveiss

Summary

Add a declarative web automation pipeline system to AutoBot. Agents, the knowledge system, and users can define web scraping/automation tasks as YAML pipelines with composable steps. The system includes XHR interception, accessibility snapshots, auth strategy detection, AI-powered API discovery, and self-extending adapter synthesis.

Inspired by: jackwener/opencli (reference for patterns only — no code dependency)

Design doc: docs/plans/2026-03-20-web-pipeline-engine-design.md

Use Cases

  • Agent-driven web research — agents autonomously scrape sites to gather structured data
  • User-triggered web automation — users define scraping tasks through chat UI
  • Knowledge ingestion — pull structured data from websites into AutoBot's knowledge base

Phase Checklist

Phase 1: XHR Interceptor + Accessibility Snapshot

  • services/web_pipeline/interceptor.py — JS interceptor generator + result collector
  • services/web_pipeline/snapshot.py — accessibility tree capture via Playwright
  • Register intercept_api MCP tool in browser_mcp.py
  • Register page_snapshot MCP tool in browser_mcp.py
  • Add inject_js() and get_accessibility_tree() to playwright_service.py
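A minimal sketch of what the "JS interceptor generator" in interceptor.py could look like, assuming the generator returns a script that patches fetch and XMLHttpRequest and stores matching responses on a window property. The function name build_interceptor_js, the template, and the __captured store name are illustrative assumptions, not part of the design doc.

```python
import json

# JavaScript template; __STORE__ and __PATTERN__ are replaced with JSON-encoded literals.
_INTERCEPTOR_TEMPLATE = """
(() => {
  const store = (window[__STORE__] = window[__STORE__] || []);
  const matches = (url) => String(url).includes(__PATTERN__);
  const record = (url, body) => store.push({ url: String(url), body });

  // Patch fetch: clone the response so the page can still read the body.
  const origFetch = window.fetch;
  window.fetch = async (...args) => {
    const res = await origFetch.apply(window, args);
    if (matches(res.url)) {
      res.clone().text().then((body) => record(res.url, body)).catch(() => {});
    }
    return res;
  };

  // Patch XMLHttpRequest: remember the URL at open(), read the body on load.
  const origOpen = XMLHttpRequest.prototype.open;
  XMLHttpRequest.prototype.open = function (method, url, ...rest) {
    this.addEventListener("load", () => {
      try {
        if (matches(url)) record(url, this.responseText);
      } catch (err) { /* non-text responseType: skip */ }
    });
    return origOpen.call(this, method, url, ...rest);
  };
})();
"""


def build_interceptor_js(url_pattern: str, store: str = "__captured") -> str:
    """Return a script that records responses whose URL contains url_pattern.

    Hypothetical helper: json.dumps is used so both values become valid
    JavaScript string literals regardless of quotes in the input.
    """
    script = _INTERCEPTOR_TEMPLATE.replace("__STORE__", json.dumps(store))
    return script.replace("__PATTERN__", json.dumps(url_pattern))
```

With Playwright for Python, a script like this could be installed with page.add_init_script(script=...) before navigation and the results read back with page.evaluate("window.__captured"). How the planned inject_js() wraps those calls is not specified in this issue.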

Phase 2: Pipeline Engine + Step Library

  • services/web_pipeline/engine.py — PipelineEngine + PipelineContext
  • services/web_pipeline/template.py — safe expression evaluator (restricted AST)
  • services/web_pipeline/steps/ — step handlers (navigate, fetch, evaluate, intercept, snapshot, select, map, filter, sort, limit, click, type, wait)
  • services/web_pipeline/loader.py — YAML adapter discovery + loading
  • services/web_pipeline/adapters/ — adapter directory structure
  • Register run_pipeline MCP tool in browser_mcp.py
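The "safe expression evaluator (restricted AST)" in template.py could work along these lines. This is a sketch under assumptions: the supported grammar (names, constants, dict-style attribute access, subscripts, + and -, single comparisons, and/or) is a guess at a reasonable minimum, safe_eval is a hypothetical name, and the code assumes Python 3.9+ where ast.Subscript.slice is a plain expression node.

```python
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub}
_CMP_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def safe_eval(expr: str, names: dict):
    """Evaluate expr by walking its AST; any node type not listed here is rejected."""

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in names:
                raise ValueError(f"unknown name: {node.id}")
            return names[node.id]
        if isinstance(node, ast.Attribute):
            # item.title is treated as a dict key lookup, never as getattr().
            base = ev(node.value)
            if not isinstance(base, dict) or node.attr.startswith("_"):
                raise ValueError(f"disallowed attribute: {node.attr}")
            return base[node.attr]
        if isinstance(node, ast.Subscript):
            return ev(node.value)[ev(node.slice)]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            return _CMP_OPS[type(node.ops[0])](ev(node.left), ev(node.comparators[0]))
        if isinstance(node, ast.BoolOp):
            # Simplification: every operand is evaluated and a plain bool is returned.
            values = [bool(ev(v)) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        raise ValueError(f"disallowed expression: {type(node).__name__}")

    return ev(ast.parse(expr, mode="eval"))
```

Because function calls, lambdas, comprehensions, and underscore-prefixed attributes all fall through to the final ValueError, an expression such as __import__('os') is rejected before anything executes.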

Phase 3: Auth Strategy Cascade

  • services/web_pipeline/auth.py — AuthStrategy enum + AuthCascade (5-tier)
  • Strategy-aware step execution in engine.py
  • Redis caching of detected strategies per domain (24h TTL)
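The cascade-plus-cache behaviour described above can be sketched as follows. The five tier names are placeholders (the issue only says "5-tier"), the probe callback is hypothetical, and an in-memory dict with expiry timestamps stands in for Redis so the example runs on its own.

```python
import time
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class AuthStrategy(Enum):
    # Placeholder tier names, ordered from least to most invasive.
    PUBLIC = 1
    COOKIE = 2
    HEADER = 3
    INTERCEPT = 4
    UI = 5


class AuthCascade:
    """Try each strategy in order and cache the first that works, per domain."""

    def __init__(
        self,
        probe: Callable[[str, AuthStrategy], bool],
        ttl_seconds: float = 24 * 3600,  # 24h TTL, as stated in the checklist
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        # domain -> (strategy, expiry timestamp); stand-in for a Redis key with TTL
        self._cache: Dict[str, Tuple[AuthStrategy, float]] = {}

    def detect(self, domain: str) -> Optional[AuthStrategy]:
        cached = self._cache.get(domain)
        if cached is not None and cached[1] > self._clock():
            return cached[0]
        for strategy in AuthStrategy:  # Enum iterates in definition order
            if self._probe(domain, strategy):
                self._cache[domain] = (strategy, self._clock() + self._ttl)
                return strategy
        return None
```

Injecting the clock keeps expiry testable without sleeping. In a Redis-backed version the expiry would presumably be delegated to the key's TTL rather than tracked by hand.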

Phase 4: API Discovery (Explore)

  • services/web_pipeline/explore.py — site explorer + network analyzer
  • services/web_pipeline/field_roles.py — field heuristics + endpoint scoring
  • Register explore_site MCP tool in browser_mcp.py
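One way the "field heuristics + endpoint scoring" in field_roles.py might work is sketched below. The role names, hint substrings, and scoring weights are invented for illustration; the issue does not define them.

```python
from typing import Any, Dict, List

# Hypothetical hints: a key containing one of these substrings suggests the role.
ROLE_HINTS = {
    "title": ("title", "name", "headline"),
    "url": ("url", "link", "href"),
    "time": ("time", "date", "created", "published"),
    "author": ("author", "user", "owner"),
}


def guess_roles(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each recognised role to the first record key that looks like it."""
    roles: Dict[str, str] = {}
    for key in record:
        lowered = key.lower()
        for role, hints in ROLE_HINTS.items():
            if role not in roles and any(hint in lowered for hint in hints):
                roles[role] = key
                break
    return roles


def find_records(payload: Any) -> List[dict]:
    """Return the first non-empty list of dicts at the top level of a JSON payload."""
    if isinstance(payload, list):
        candidates = [payload]
    elif isinstance(payload, dict):
        candidates = [v for v in payload.values() if isinstance(v, list)]
    else:
        candidates = []
    for candidate in candidates:
        if candidate and all(isinstance(item, dict) for item in candidate):
            return candidate
    return []


def score_endpoint(payload: Any) -> int:
    """Score a captured response: more records and more recognised roles rank higher."""
    records = find_records(payload)
    if not records:
        return 0
    return len(records) + 10 * len(guess_roles(records[0]))
```

Under this scheme an endpoint that returns a list of article-like objects outranks one that returns a settings blob, which is the kind of ordering an explorer would need when choosing which captured request to turn into an adapter.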

Phase 5: Adapter Registry + Auto-Synthesis

  • services/web_pipeline/registry.py — AdapterRegistry (CRUD)
  • services/web_pipeline/synthesizer.py — AdapterSynthesizer
  • services/web_pipeline/adapters/builtin/ — shipped adapters
  • services/web_pipeline/adapters/discovered/ — gitignored auto-generated adapters
  • Register synthesize_adapters, register_adapter, list_adapters MCP tools
  • Knowledge base integration via web_crawler_connector.py
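A rough sketch of the CRUD surface AdapterRegistry might expose, assuming adapters are identified by name and tagged with their origin (builtin or discovered). The Adapter fields and the rule that a discovered adapter may not replace a builtin one are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Adapter:
    name: str
    domain: str
    steps: List[dict]
    source: str = "discovered"  # or "builtin"


class AdapterRegistry:
    """In-memory registry; a real one would presumably persist to the adapters/ directories."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, adapter: Adapter, overwrite: bool = False) -> None:
        if not adapter.steps:
            raise ValueError("adapter needs at least one step")
        existing = self._adapters.get(adapter.name)
        if existing is not None:
            if not overwrite:
                raise KeyError(f"adapter already registered: {adapter.name}")
            # Assumed policy: auto-generated adapters never shadow shipped ones.
            if existing.source == "builtin" and adapter.source != "builtin":
                raise PermissionError("discovered adapter cannot replace a builtin one")
        self._adapters[adapter.name] = adapter

    def get(self, name: str) -> Adapter:
        return self._adapters[name]

    def list_adapters(self, domain: Optional[str] = None) -> List[str]:
        return sorted(
            name for name, adapter in self._adapters.items()
            if domain is None or adapter.domain == domain
        )

    def delete(self, name: str) -> None:
        del self._adapters[name]
```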

Architecture

User/Agent Request
  ↓
Browser MCP (7 new tools)
  ↓
Pipeline Engine (YAML → step executor)
  ├─ navigate, evaluate, fetch (existing Playwright)
  ├─ intercept, snapshot (Phase 1)
  ├─ select, map, filter, sort, limit (Phase 2)
  └─ auth cascade (Phase 3)
  ↓
Playwright Service → Browser VM (.25)
  ↓
Results → Agent / Knowledge Base / Chat UI
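The "YAML → step executor" layer in the diagram can be sketched as a registry of step handlers applied in sequence. The example below covers only the data-shaping steps (select, map, filter, sort, limit) and takes the pipeline as an already-parsed list of single-key mappings, which is what a YAML list of steps would load into. The argument shape of each step is an assumption, and the browser steps (navigate, fetch, intercept, and so on) are left out because they need a live Playwright session.

```python
from typing import Any, Callable, Dict, List

StepHandler = Callable[[Any, Any], Any]  # (current data, step argument) -> new data
STEPS: Dict[str, StepHandler] = {}


def step(name: str) -> Callable[[StepHandler], StepHandler]:
    """Register a handler under the step name used in adapter files."""
    def decorator(fn: StepHandler) -> StepHandler:
        STEPS[name] = fn
        return fn
    return decorator


@step("select")
def select_step(data: Any, path: str) -> Any:
    # Walk a dotted path such as "data.items" through nested dicts.
    for key in path.split("."):
        data = data[key]
    return data


@step("filter")
def filter_step(data: List[dict], rule: dict) -> List[dict]:
    # Simplified rule shape (assumed): keep rows whose field is at least "min".
    return [row for row in data if row[rule["field"]] >= rule["min"]]


@step("sort")
def sort_step(data: List[dict], rule: dict) -> List[dict]:
    return sorted(data, key=lambda row: row[rule["by"]], reverse=rule.get("desc", False))


@step("map")
def map_step(data: List[dict], fields: Dict[str, str]) -> List[dict]:
    # fields maps output key -> source key.
    return [{out: row[src] for out, src in fields.items()} for row in data]


@step("limit")
def limit_step(data: List[Any], count: int) -> List[Any]:
    return data[:count]


def run_pipeline(pipeline: List[Dict[str, Any]], data: Any = None) -> Any:
    """Feed the output of each step into the next one."""
    for entry in pipeline:
        if len(entry) != 1:
            raise ValueError(f"each step must have exactly one key: {entry!r}")
        (name, arg), = entry.items()
        if name not in STEPS:
            raise ValueError(f"unknown step: {name}")
        data = STEPS[name](data, arg)
    return data
```

A pipeline of select, filter, sort, map, and limit then reads top to bottom like the adapter file it came from, and adding a new step type only requires registering another handler.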

Deliverables

Phase  | New MCP Tools                                        | New Files | Modifies
1      | intercept_api, page_snapshot                         | 3         | 2
2      | run_pipeline                                         | 8+        | 1
3      | -                                                    | 1         | 3
4      | explore_site                                         | 2         | 1
5      | synthesize_adapters, register_adapter, list_adapters | 3+        | 3
Total  | 7 new tools                                          | ~17 files | ~10 files
