@priyansh4320
Collaborator

@priyansh4320 priyansh4320 commented Sep 2, 2025

Why are these changes needed?

Identified Enterprise-readiness issues:

  • Runtime Performance: Large document ingestion happens synchronously during user interactions, creating poor UX.
  • Resource Waste: New vector storage processes are created for every document, even if already processed.
  • Limited Storage Support: Only local file paths are supported, missing cloud storage capabilities.
  • Single RAG Backend: Limited to ChromaDB without enterprise alternatives like Weaviate or graph-based approaches.

The base refactor solves the first two problems above, runtime performance and resource waste, by decoupling data ingestion from the parent architecture.

    import os
    from pathlib import Path

    # Setup
    llm_config = LLMConfig(model="o3-mini", api_type="openai", api_key=os.getenv("OPENAI_API_KEY"))

    # Initialize components
    query_engine = VectorChromaQueryEngine(collection_name="new_collection")
    ingestion_service = DocumentIngestionService(query_engine=query_engine)
    doc_agent = DocAgent(llm_config=llm_config, query_engine=query_engine)

    # Test document
    doc_path = "test/agentchat/contrib/graph_rag/Toast_financial_report.pdf"

    if Path(doc_path).exists():
        # Step 1: Ingest document
        print("Step 1: Ingesting document...")
        result = ingestion_service.ingest_document(doc_path)
        # print(f"Ingestion: {result}")

        # Step 2: Query document
        print("\nStep 2: Querying document...")
        response = doc_agent.run(message="What is the fiscal year 2024 financial summary? ", max_turns=1)

Example output:

DocAgent (to DocAgent):

What is the fiscal year 2024 financial summary?

--------------------------------------------------------------------------------
_User (to chat_manager):

What is the fiscal year 2024 financial summary?

--------------------------------------------------------------------------------

Next speaker: QueryAgent


>>>>>>>> USING AUTO REPLY...
QueryAgent (to chat_manager):

***** Suggested tool call (call_VrLT1PH5lY4fdVLLKzgwoEZ9): execute_rag_query *****
Arguments: 
{}
**********************************************************************************

--------------------------------------------------------------------------------

Next speaker: _Group_Tool_Executor


>>>>>>>> EXECUTING FUNCTION execute_rag_query...
Call ID: call_VrLT1PH5lY4fdVLLKzgwoEZ9
Input arguments: {}

>>>>>>>> EXECUTED FUNCTION execute_rag_query...
Call ID: call_VrLT1PH5lY4fdVLLKzgwoEZ9
Input arguments: {}
Output:
{'content': "The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million and total liabilities of $807 million. The stockholders' equity stands at $1,420 million. Current assets amount to $1,802 million, with cash and cash equivalents at $761 million. The company has an accumulated deficit of $1,636 million and additional paid-in capital of $3,053 million. Total current liabilities are $748 million."}
_Group_Tool_Executor (to chat_manager):

***** Response from calling tool (call_VrLT1PH5lY4fdVLLKzgwoEZ9) *****
{'content': "The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million and total liabilities of $807 million. The stockholders' equity stands at $1,420 million. Current assets amount to $1,802 million, with cash and cash equivalents at $761 million. The company has an accumulated deficit of $1,636 million and additional paid-in capital of $3,053 million. Total current liabilities are $748 million."}
**********************************************************************

--------------------------------------------------------------------------------

Next speaker: QueryAgent


>>>>>>>> USING AUTO REPLY...
QueryAgent (to chat_manager):

The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, is as follows:

- Total Assets: $2,227 million
- Total Liabilities: $807 million
- Stockholders' Equity: $1,420 million
- Current Assets: $1,802 million
- Cash and Cash Equivalents: $761 million
- Accumulated Deficit: $1,636 million
- Additional Paid-in Capital: $3,053 million
- Total Current Liabilities: $748 million.

--------------------------------------------------------------------------------

Next speaker: SummaryAgent


>>>>>>>> USING AUTO REPLY...
SummaryAgent (to chat_manager):

Ingestions:
1. The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million and total liabilities of $807 million. The stockholders' equity stands at $1,420 million. Current assets amount to $1,802 million, with cash and cash equivalents at $761 million. The company has an accumulated deficit of $1,636 million and additional paid-in capital of $3,053 million. Total current liabilities are $748 million.

Queries:
1. What is the fiscal year 2024 financial summary?
Answer: The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million, total liabilities of $807 million, stockholders' equity of $1,420 million, current assets of $1,802 million, cash and cash equivalents of $761 million, an accumulated deficit of $1,636 million, additional paid-in capital of $3,053 million, and total current liabilities of $748 million.

--------------------------------------------------------------------------------

>>>>>>>> TERMINATING RUN (4f10222b-717c-4c1c-bccf-c83aa3666058): No next speaker selected
DocAgent (to DocAgent):

Ingestions:
1. The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million and total liabilities of $807 million. The stockholders' equity stands at $1,420 million. Current assets amount to $1,802 million, with cash and cash equivalents at $761 million. The company has an accumulated deficit of $1,636 million and additional paid-in capital of $3,053 million. Total current liabilities are $748 million.

Queries:
1. What is the fiscal year 2024 financial summary?
Answer: The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million, total liabilities of $807 million, stockholders' equity of $1,420 million, current assets of $1,802 million, cash and cash equivalents of $761 million, an accumulated deficit of $1,636 million, additional paid-in capital of $3,053 million, and total current liabilities of $748 million.

--------------------------------------------------------------------------------

>>>>>>>> TERMINATING RUN (d35f8d2a-e639-4d99-bfd5-0ffb1c3bb7f1): Maximum turns (1) reached
Answer: Ingestions:
1. The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million and total liabilities of $807 million. The stockholders' equity stands at $1,420 million. Current assets amount to $1,802 million, with cash and cash equivalents at $761 million. The company has an accumulated deficit of $1,636 million and additional paid-in capital of $3,053 million. Total current liabilities are $748 million.

Queries:
1. What is the fiscal year 2024 financial summary?
Answer: The financial summary for Toast, Inc. for the fiscal year 2024, as of September 30, includes total assets of $2,227 million, total liabilities of $807 million, stockholders' equity of $1,420 million, current assets of $1,802 million, cash and cash equivalents of $761 million, an accumulated deficit of $1,636 million, additional paid-in capital of $3,053 million, and total current liabilities of $748 million.

Related issue number

closes #2078

Checks

@joggrbot
Contributor

joggrbot bot commented Sep 2, 2025

📝 Documentation Analysis

All docs are up to date! 🎉


✅ Latest commit analyzed: 34358b4 | Powered by Joggr

@priyansh4320
Collaborator Author

priyansh4320 commented Sep 2, 2025

DocAgent Refactor

Current state of DocAgent

The existing DocAgent follows a swarm architecture with multiple specialized agents (Triage, Task Manager, Parser, Data Ingestion, Query, Error, and Summary agents). While this design provides clear separation of concerns, it introduces several production-readiness issues:

  • Runtime Performance: Large document ingestion happens synchronously during user interactions, creating poor UX.
  • Resource Waste: New vector storage processes are created for every document, even if already processed.
  • Limited Storage Support: Only local file paths are supported, missing cloud storage capabilities.
  • Single RAG Backend: Limited to ChromaDB without enterprise alternatives like Weaviate or graph-based approaches.

The new design will feature four layers:

  1. Query Layer: Handles user interactions and RAG queries
  2. Ingestion Layer: Processes documents asynchronously via events
  3. Storage Layer: Abstracts storage backends (local, cloud, blob)
  4. RAG Layer: Supports multiple RAG strategies (vector, structured, graph)

### How do we solve this problem?
  1. Event-Driven Ingestion

Instead of processing documents at query time, the new architecture uses an event-driven approach: documents are ingested in response to triggering events such as a button click or a file upload.

# Before: Synchronous processing during query
user_query = "What's in this PDF?"
# Agent processes PDF → chunks → vectorizes → stores → queries (slow!)

# After: Event-driven ingestion
ingestion_service.ingest_document("large_report.pdf")  # Async event
# Later...
user_query = "What's in this PDF?"
# Agent queries pre-processed data (fast!)
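
One way to picture the event-driven flow above is a small service that hands ingestion to a background worker and skips files whose content it has already seen. This is only a sketch under stated assumptions: `BackgroundIngestionService`, `submit`, and the `process` callback are hypothetical names, not the API added in this PR, and a SHA-256 content hash stands in for whatever deduplication the real service may use.

```python
import hashlib
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from threading import Lock
from typing import Callable


class BackgroundIngestionService:
    """Hypothetical sketch: ingest off the query path, once per unique file content."""

    def __init__(self, process: Callable[[Path], None], max_workers: int = 2) -> None:
        self._process = process  # e.g. parse -> chunk -> embed -> store
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._seen = {}  # content hash -> Future of the first ingestion
        self._lock = Lock()

    @staticmethod
    def _digest(path: Path) -> str:
        # Reads the whole file on the calling thread; fine for a sketch.
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def submit(self, path: str) -> Future:
        """Schedule ingestion and return at once; duplicates reuse the first Future."""
        path_obj = Path(path)
        key = self._digest(path_obj)
        with self._lock:
            if key not in self._seen:
                self._seen[key] = self._executor.submit(self._process, path_obj)
            return self._seen[key]

    def shutdown(self) -> None:
        """Wait for pending ingestions and release the worker threads."""
        self._executor.shutdown(wait=True)
```

A caller would invoke `submit(...)` from the upload or button-click handler and only call `.result()` on the returned future if it needs to wait for ingestion to finish before querying.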
  2. Decoupled Storage
    The storage layer will be separated from the query logic, so users can configure cloud storage without changing the core agent logic.
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

@dataclass
class StorageConfig:
    storage_type: str = "local"  # "local", "s3", "azure", "gcs", "minio"
    base_path: Path = field(default_factory=lambda: Path("./storage"))
    bucket_name: str | None = None
    credentials: dict[str, Any] | None = None
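
To make the storage abstraction more concrete, here is a minimal sketch of how a backend could be selected from the `storage_type` string. The `StorageBackend` protocol, the `LocalStorage` class, and the `create_storage` factory are hypothetical names used for illustration; the PR may structure this differently, and only the local backend is sketched.

```python
from pathlib import Path
from typing import Protocol


class StorageBackend(Protocol):
    """Hypothetical interface the ingestion and query layers would depend on."""

    def save(self, name: str, data: bytes) -> str: ...
    def load(self, name: str) -> bytes: ...
    def exists(self, name: str) -> bool: ...


class LocalStorage:
    """Local-filesystem backend rooted at base_path."""

    def __init__(self, base_path: Path) -> None:
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, data: bytes) -> str:
        target = self.base_path / name
        target.write_bytes(data)
        return str(target)

    def load(self, name: str) -> bytes:
        return (self.base_path / name).read_bytes()

    def exists(self, name: str) -> bool:
        return (self.base_path / name).is_file()


def create_storage(storage_type: str, base_path: Path) -> StorageBackend:
    """Pick a backend by name; cloud backends are left unimplemented in this sketch."""
    if storage_type == "local":
        return LocalStorage(base_path)
    raise NotImplementedError(f"storage_type {storage_type!r} is not sketched here")
```

Because callers only see the protocol, adding an S3 or Azure backend would mean adding one class and one branch in the factory, without touching the agent code.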
  3. Multiple RAG Backends
    The new architecture supports three RAG strategies through a unified interface and can be configured for any backend.
from dataclasses import dataclass

@dataclass
class RAGConfig:
    rag_type: str = "vector"  # "vector", "structured", "graph"
    backend: str = "chromadb"  # "chromadb", "weaviate", "neo4j", "inmemory"
    collection_name: str | None = None
    embedding_model: str = "all-MiniLM-L6-v2"
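
The "unified interface" idea can be sketched as a protocol plus a registry keyed by `(rag_type, backend)`. Everything below is an illustrative assumption: `QueryEngine`, `add_texts`, `KeywordQueryEngine`, and `create_query_engine` are hypothetical names, and the toy engine does simple word-overlap matching rather than real vector search.

```python
from typing import Callable, Protocol


class QueryEngine(Protocol):
    """Hypothetical minimal interface shared by all RAG backends."""

    def add_texts(self, texts: list[str]) -> None: ...
    def query(self, question: str) -> str: ...


class KeywordQueryEngine:
    """Toy stand-in for a real backend: returns the text sharing the most words."""

    def __init__(self) -> None:
        self._texts: list[str] = []

    def add_texts(self, texts: list[str]) -> None:
        self._texts.extend(texts)

    def query(self, question: str) -> str:
        if not self._texts:
            return ""
        words = set(question.lower().split())
        return max(self._texts, key=lambda t: len(words & set(t.lower().split())))


# Registry mapping (rag_type, backend) to a zero-argument factory.
_REGISTRY: dict[tuple[str, str], Callable[[], QueryEngine]] = {
    ("vector", "inmemory"): KeywordQueryEngine,
}


def create_query_engine(rag_type: str, backend: str) -> QueryEngine:
    """Look up and build the engine for a configuration, or fail with a clear error."""
    try:
        factory = _REGISTRY[(rag_type, backend)]
    except KeyError:
        raise ValueError(
            f"no engine registered for rag_type={rag_type!r}, backend={backend!r}"
        ) from None
    return factory()
```

With this shape, the agent would call `create_query_engine(config.rag_type, config.backend)` and stay unaware of which backend it is talking to.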
  4. Configuration & Interfaces
    Unified Configuration: the DocAgentConfig consolidates all settings in one place:
config = DocAgentConfig(
    rag=RAGConfig(
        rag_type="vector",
        backend="weaviate",
        embedding_model="all-MiniLM-L6-v2"
    ),
    storage=StorageConfig(
        storage_type="s3",
        bucket_name="my-docs-bucket"
    ),
    processing=ProcessingConfig(
        chunk_size=1024,
        max_file_size=500 * 1024 * 1024  # 500MB
    )
)
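
As a rough illustration of how such a unified configuration could be composed and checked before use, the sketch below nests the three config objects in one dataclass. The field defaults for `ProcessingConfig`, the `validate` method, and the rule that non-local storage needs a bucket name are assumptions made for this example, not behavior confirmed by the PR.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class RAGConfig:
    rag_type: str = "vector"
    backend: str = "chromadb"
    collection_name: Optional[str] = None
    embedding_model: str = "all-MiniLM-L6-v2"


@dataclass
class StorageConfig:
    storage_type: str = "local"
    base_path: Path = field(default_factory=lambda: Path("./storage"))
    bucket_name: Optional[str] = None


@dataclass
class ProcessingConfig:
    chunk_size: int = 512  # assumed default
    max_file_size: int = 100 * 1024 * 1024  # assumed default, in bytes


@dataclass
class DocAgentConfig:
    """Hypothetical container; each section falls back to its own defaults."""

    rag: RAGConfig = field(default_factory=RAGConfig)
    storage: StorageConfig = field(default_factory=StorageConfig)
    processing: ProcessingConfig = field(default_factory=ProcessingConfig)

    def validate(self) -> None:
        """Fail early on settings that cannot work (assumed rules)."""
        if self.processing.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if self.storage.storage_type != "local" and not self.storage.bucket_name:
            raise ValueError("bucket_name is required for non-local storage")
```

Calling `validate()` once at startup would surface a missing bucket name before any document is ingested, rather than at the first upload.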

Example usage:

from autogen.agents.experimental.document_agent import DocAgent2, DocumentIngestionService
from autogen.agents.experimental.document_agent.core import DocAgentConfig

# Configure for production use
config = DocAgentConfig(
    rag=RAGConfig(backend="weaviate", rag_type="vector"),
    storage=StorageConfig(storage_type="s3", bucket_name="company-docs")
)

# Initialize query engine (supports multiple backends)
query_engine = WeaviateQueryEngine(config.rag)

# Create ingestion service (handles document processing)
ingestion_service = DocumentIngestionService(query_engine, config)

# Process documents asynchronously (event-driven)
ingestion_service.ingest_document("large_manual.pdf")  # Non-blocking

# Create query agent (fast, no document processing)
doc_agent = DocAgent2(
    query_engine=query_engine,
    config=config
)

# Query pre-processed documents
response = doc_agent.query("What are the safety procedures?")

todos:
  • initial refactoring plan:
  1. Extract base interfaces from existing query engines
  2. Move document processing to separate ingestion module
  3. Simplify DocAgent to be query-only
  4. Create separate ingestion service using existing code

Rough file-system structure:

document_agent/
├── core/
│   ├── __init__.py
│   ├── base_interfaces.py          # Extract interfaces from existing code
│   └── config.py                   # Configuration from existing code
├── ingestion/
│   ├── __init__.py
│   ├── document_processor.py       # Move from parser_utils.py + docling_doc_ingest_agent.py
│   └── chunking_strategies.py      # Extract from existing parsing logic
├── storage/
│   ├── __init__.py
│   └── local_storage.py            # Move from document_utils.py
├── rag/
│   ├── __init__.py
│   ├── base_rag.py                 # Extract from chroma_query_engine.py + inmemory_query_engine.py
│   └── vector_rag.py               # Move chroma_query_engine.py
└── agents/
    ├── __init__.py
    ├── doc_agent.py                # Simplified version of document_agent.py
    └── ingestion_agent.py          # Move from docling_doc_ingest_agent.py
  • step 2: Add a database storage layer: create the storage abstraction layer, add blob storage support (S3, Azure, GCS), and implement MinIO/DynamoDB bucket support.

  • step 4: Add structured RAG support: add postgresDBqueryengine support, implement structured query capabilities, and create a structured RAG strategy.

  • step 5: Add a Graph RAG backend with event-based knowledge graph creation, and support Cypher queries for data retrieval.

  • step 6: Add a unit test module for the new DocAgent.


The refactor turns DocAgent from a research prototype into a production/enterprise-ready AG2 feature with the following benefits:
  • Performance: Queries no longer wait on ingestion, since documents are pre-processed.
  • Scalability: Cloud storage support handles enterprise document volumes.
  • Flexibility: Multiple RAG backends for different use cases.
  • Maintainability: Clear separation of concerns and unified configuration.
  • Production Ready: Event-driven architecture supports real-world orchestrations.

@priyansh4320 priyansh4320 changed the title [Refactor]: pip install --upgrade DocAgent Refactor: pip install --upgrade DocAgent Sep 2, 2025
@qingyun-wu
Collaborator

@marklysze can you help review? Thank you!

@priyansh4320 priyansh4320 force-pushed the docagent-base-refactor branch from 87d30b5 to da2723a Compare September 3, 2025 19:51
@priyansh4320 priyansh4320 self-assigned this Sep 7, 2025
@codecov

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 71.70418% with 88 lines in your changes missing coverage. Please review.

Files with missing lines                                 Patch %   Lines
...tal/document_agent/ingestion/document_processor.py    28.57%    55 Missing ⚠️
...ts/experimental/document_agent/agents/doc_agent.py    74.48%    24 Missing and 1 partial ⚠️
...xperimental/document_agent/core/base_interfaces.py    82.60%    8 Missing ⚠️

Files with missing lines                                 Coverage Δ
...imental/document_agent/agents/ingestion_service.py    100.00% <100.00%> (ø)
.../agents/experimental/document_agent/core/config.py    100.00% <100.00%> (ø)
...xperimental/document_agent/core/base_interfaces.py    82.60% <82.60%> (ø)
...ts/experimental/document_agent/agents/doc_agent.py    74.48% <74.48%> (ø)
...tal/document_agent/ingestion/document_processor.py    28.57% <28.57%> (ø)

... and 41 files with indirect coverage changes




Development

Successfully merging this pull request may close these issues.

[Refactor]: decouple Data ingestion process from DocAgent.

4 participants