Refactor: pip install --upgrade DocAgent #2076
## DocAgent Refactor

### Current state of DocAgent

The existing DocAgent follows a swarm architecture with multiple specialized agents (Triage, Task Manager, Parser, Data Ingestion, Query, Error, and Summary agents). While this design provides a clear separation of concerns, it introduces several production-readiness issues. The new design will instead be organized into four layers.
### How do we solve this problem?
Instead of processing documents at runtime, the new architecture will use an event-driven approach: documents are ingested in response to triggered events such as button clicks, file uploads, etc.

```python
# Before: synchronous processing during query
user_query = "What's in this PDF?"
# Agent processes PDF → chunks → vectorizes → stores → queries (slow!)

# After: event-driven ingestion
ingestion_service.ingest_document("large_report.pdf")  # Async event
# Later...
user_query = "What's in this PDF?"
# Agent queries pre-processed data (fast!)
```
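The event-driven flow above can be sketched with a queue and a background worker thread. The class and method names mirror the proposal but are assumptions for illustration, not the final AG2 API.

```python
# Minimal sketch: ingest_document() only enqueues an event; the actual
# processing happens in a background worker, so callers never block.
import queue
import threading

class DocumentIngestionService:
    def __init__(self):
        self._events = queue.Queue()
        self.processed = []
        threading.Thread(target=self._worker, daemon=True).start()

    def ingest_document(self, path: str) -> None:
        """Enqueue the document and return immediately (non-blocking)."""
        self._events.put(path)

    def _worker(self) -> None:
        while True:
            path = self._events.get()
            # Real implementation: parse -> chunk -> embed -> store.
            self.processed.append(path)
            self._events.task_done()

    def wait_idle(self) -> None:
        """Block until every queued document has been processed."""
        self._events.join()

service = DocumentIngestionService()
service.ingest_document("large_report.pdf")  # returns immediately
service.wait_idle()
```

A production version would use a real message bus or task queue, but the contract is the same: enqueue at event time, process in the background, query pre-processed data later.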
```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

@dataclass
class StorageConfig:
    storage_type: str = "local"  # "local", "s3", "azure", "gcs", "minio"
    base_path: Path = field(default_factory=lambda: Path("./storage"))
    bucket_name: str | None = None
    credentials: dict[str, Any] | None = None

@dataclass
class RAGConfig:
    rag_type: str = "vector"  # "vector", "structured", "graph"
    backend: str = "chromadb"  # "chromadb", "weaviate", "neo4j", "inmemory"
    collection_name: str | None = None
    embedding_model: str = "all-MiniLM-L6-v2"
```
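Because backends are selected by plain strings in these dataclasses, a small factory can map `RAGConfig.backend` to a concrete engine class. The sketch below redefines a trimmed `RAGConfig` so it is self-contained, and the engine classes are placeholders, not the real implementations.

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:  # trimmed copy of the dataclass above, for a self-contained demo
    rag_type: str = "vector"
    backend: str = "chromadb"

# Placeholder engine classes standing in for the real query engines.
class ChromaQueryEngine: ...
class WeaviateQueryEngine: ...

_ENGINES = {"chromadb": ChromaQueryEngine, "weaviate": WeaviateQueryEngine}

def make_query_engine(cfg: RAGConfig):
    """Resolve the backend string to an engine instance, failing loudly."""
    try:
        return _ENGINES[cfg.backend]()
    except KeyError:
        raise ValueError(f"unknown RAG backend: {cfg.backend!r}")

engine = make_query_engine(RAGConfig(backend="weaviate"))
```

Failing with an explicit error on an unknown backend string keeps configuration typos from silently falling back to a default store.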
```python
config = DocAgentConfig(
    rag=RAGConfig(
        rag_type="vector",
        backend="weaviate",
        embedding_model="all-MiniLM-L6-v2"
    ),
    storage=StorageConfig(
        storage_type="s3",
        bucket_name="my-docs-bucket"
    ),
    processing=ProcessingConfig(
        chunk_size=1024,
        max_file_size=500 * 1024 * 1024  # 500MB
    )
)
```

### Example usage

```python
from autogen.agents.experimental.document_agent import DocAgent2, DocumentIngestionService
from autogen.agents.experimental.document_agent.core import DocAgentConfig

# Configure for production use
config = DocAgentConfig(
    rag=RAGConfig(backend="weaviate", rag_type="vector"),
    storage=StorageConfig(storage_type="s3", bucket_name="company-docs")
)

# Initialize query engine (supports multiple backends)
query_engine = WeaviateQueryEngine(config.rag)

# Create ingestion service (handles document processing)
ingestion_service = DocumentIngestionService(query_engine, config)

# Process documents asynchronously (event-driven)
ingestion_service.ingest_document("large_manual.pdf")  # Non-blocking

# Create query agent (fast, no document processing)
doc_agent = DocAgent2(
    query_engine=query_engine,
    config=config
)

# Query pre-processed documents
response = doc_agent.query("What are the safety procedures?")
```

TODOs:
Rough FS structure:

```
document_agent/
├── core/
│   ├── __init__.py
│   ├── base_interfaces.py      # Extract interfaces from existing code
│   └── config.py               # Configuration from existing code
├── ingestion/
│   ├── __init__.py
│   ├── document_processor.py   # Move from parser_utils.py + docling_doc_ingest_agent.py
│   └── chunking_strategies.py  # Extract from existing parsing logic
├── storage/
│   ├── __init__.py
│   └── local_storage.py        # Move from document_utils.py
├── rag/
│   ├── __init__.py
│   ├── base_rag.py             # Extract from chroma_query_engine.py + inmemory_query_engine.py
│   └── vector_rag.py           # Move chroma_query_engine.py
└── agents/
    ├── __init__.py
    ├── doc_agent.py            # Simplified version of document_agent.py
    └── ingestion_agent.py      # Move from docling_doc_ingest_agent.py
```
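One possible shape for `core/base_interfaces.py` is a pair of abstract base classes that the `rag/` and `storage/` packages implement. The method names below are assumptions for illustration; the actual interfaces would be extracted from the existing code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile

class RAGQueryEngine(ABC):
    """Interface implemented by rag/ backends (chroma, weaviate, ...)."""
    @abstractmethod
    def add_docs(self, chunks: list[str]) -> None: ...

    @abstractmethod
    def query(self, question: str) -> list[str]: ...

class StorageBackend(ABC):
    """Interface implemented by storage/ backends (local, s3, ...)."""
    @abstractmethod
    def save(self, path: Path, data: bytes) -> None: ...

    @abstractmethod
    def load(self, path: Path) -> bytes: ...

class LocalStorage(StorageBackend):
    """Sketch of what storage/local_storage.py could look like."""
    def __init__(self, base: Path):
        self.base = base

    def save(self, path: Path, data: bytes) -> None:
        target = self.base / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def load(self, path: Path) -> bytes:
        return (self.base / path).read_bytes()

# Demo against a throwaway temp directory.
store = LocalStorage(Path(tempfile.mkdtemp()))
store.save(Path("manuals/intro.txt"), b"hello")
```

Keeping the interfaces in `core/` means `agents/` can depend on the abstractions alone, so swapping S3 for local storage (or Weaviate for Chroma) never touches agent code.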
The refactored DocAgent transforms from a research prototype into a production/enterprise-ready AG2 feature.
@marklysze can you help review? Thank you!
Why are these changes needed?
Identified enterprise-readiness issues: the base refactor solves the first two problems defined above, runtime performance and resource waste, by decoupling data ingestion from the parent architecture.
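The decoupling can be sketched as follows: the ingestion service writes into a query engine, the agent only reads from it, and the two sides never call each other. All class names and the toy keyword retrieval here are illustrative placeholders, not the final AG2 API.

```python
class InMemoryQueryEngine:
    """Shared store: ingestion writes chunks, the agent reads them."""
    def __init__(self):
        self._chunks: list[str] = []

    def add_chunks(self, chunks: list[str]) -> None:
        self._chunks.extend(chunks)

    def query(self, question: str) -> list[str]:
        # Toy keyword-overlap retrieval, standing in for vector search.
        words = set(question.lower().split())
        return [c for c in self._chunks if words & set(c.lower().split())]

class DocumentIngestionService:
    """Write side: parses and stores documents ahead of query time."""
    def __init__(self, engine):
        self.engine = engine

    def ingest_document(self, text: str) -> None:
        # Real implementation: parse, chunk, embed. Here: split on sentences.
        self.engine.add_chunks([s.strip() for s in text.split(".") if s.strip()])

class DocAgent2:
    """Read side: queries pre-processed data, never parses documents."""
    def __init__(self, query_engine):
        self.query_engine = query_engine

    def query(self, question: str) -> str:
        hits = self.query_engine.query(question)
        return hits[0] if hits else "No relevant documents found."

engine = InMemoryQueryEngine()
DocumentIngestionService(engine).ingest_document(
    "Wear safety goggles. Keep exits clear."
)
agent = DocAgent2(engine)
answer = agent.query("goggles safety")
```

Because both sides share only the engine interface, ingestion can run on a different schedule (or a different machine) from the agent answering queries.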
Related issue number
closes #2078
Checks