
Continuously updated paper list on advancements in Data Agents. Companion repo to our paper "A Survey of Data Agents: Emerging Paradigm or Overstated Hype?"


HKUSTDial/awesome-data-agents


🌟 Awesome Data Agents 🌟


Note

Curated papers and resources on Data Agents. This is the companion repo and paper list for our survey, A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [Paper]

We also release slides for a recent talk (Chinese): [Slides]

If you find our work useful or inspiring, please give us a star ⭐️ and cite our survey:

@misc{zhu2025surveydataagentsemerging,
      title={A Survey of Data Agents: Emerging Paradigm or Overstated Hype?}, 
      author={Yizhang Zhu and Liangwei Wang and Chenyu Yang and Xiaotian Lin and Boyan Li and Wei Zhou and Xinyu Liu and Zhangyang Peng and Tianqi Luo and Yu Li and Chengliang Chai and Chong Chen and Shimin Di and Ju Fan and Ji Sun and Nan Tang and Fugee Tsung and Jiannan Wang and Chenglin Wu and Yanwei Xu and Shaolei Zhang and Yong Zhang and Xuanhe Zhou and Guoliang Li and Yuyu Luo},
      year={2025},
      eprint={2510.23587},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2510.23587}, 
}


🎯 Introduction

Teaser

The rapid advancement of large language models (LLMs) has spurred the emergence of data agents: autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" is currently used ambiguously and inconsistently, conflating simple query responders with sophisticated autonomous architectures. This ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth.

Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation.

Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.

🪜 Levels of Data Agents

As mentioned above, to bring clarity to the diverse landscape of data agents, we propose a hierarchical taxonomy based on their degree of autonomy. This framework maps the progressive shift of responsibility from human to agent, defining the distinct roles each plays at every stage, as summarized in the overview figure and the table below.

| Level | Degree of Autonomy | Human Role | Data Agent Role |
|-------|--------------------|------------|-----------------|
| L0 | Manual/No Autonomy | Dominator (Solo) | N/A (None) |
| L1 | Assisted | Dominator (Integrating) | Assistant (Responder) |
| L2 | Partial Autonomy | Dominator (Orchestrating) | Executor (Procedural) |
| L3 | Conditional Autonomy | Supervisor (Overseeing) | Dominator (Autonomous) |
| L4 | High Autonomy | Onlooker (Delegating) | Dominator (Proactive) |
| L5 | Full Autonomy | N/A (None) | Dominator (Generative) |
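
For readers who want to tag systems by level programmatically, the table can be encoded as a small lookup structure. The following is only a minimal sketch: the names `AutonomyLevel`, `Roles`, `ROLES`, and `agent_dominates` are illustrative assumptions and do not come from the survey.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AutonomyLevel(IntEnum):
    """Six autonomy levels, from L0 (manual) to L5 (full autonomy)."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


@dataclass(frozen=True)
class Roles:
    """One table row; None stands for an 'N/A (None)' cell."""
    degree: str
    human: Optional[str]
    agent: Optional[str]


# Hypothetical encoding of the taxonomy table, row by row.
ROLES = {
    AutonomyLevel.L0: Roles("Manual/No Autonomy", "Dominator (Solo)", None),
    AutonomyLevel.L1: Roles("Assisted", "Dominator (Integrating)", "Assistant (Responder)"),
    AutonomyLevel.L2: Roles("Partial Autonomy", "Dominator (Orchestrating)", "Executor (Procedural)"),
    AutonomyLevel.L3: Roles("Conditional Autonomy", "Supervisor (Overseeing)", "Dominator (Autonomous)"),
    AutonomyLevel.L4: Roles("High Autonomy", "Onlooker (Delegating)", "Dominator (Proactive)"),
    AutonomyLevel.L5: Roles("Full Autonomy", None, "Dominator (Generative)"),
}


def agent_dominates(level: AutonomyLevel) -> bool:
    """True where the table assigns the Dominator role to the data agent."""
    agent_role = ROLES[level].agent
    return agent_role is not None and agent_role.startswith("Dominator")


for level in AutonomyLevel:
    print(level.name, ROLES[level].degree, agent_dominates(level))
```

Using an `IntEnum` keeps the levels ordered, so a comparison such as `AutonomyLevel.L2 < AutonomyLevel.L3` mirrors the progression in the table.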

The transition between these levels represents more than incremental progress; each step up the hierarchy requires a significant evolutionary leap, as shown below. These leaps involve fundamental shifts in a data agent's capabilities: gaining environmental perception (L1→L2), achieving autonomous orchestration and taking the dominant role in the task (L2→L3), attaining proactive self-governance once supervision is removed (L3→L4), and innovating or pioneering new paradigms (L4→L5).

Leaps

📑 Paper List

Leaps

We index papers by autonomy level, then by data-related tasks across Data Management, Data Preparation, and Data Analysis. Most existing work clusters in L1–L3, while L4–L5 are aspirational. We also list relevant surveys and tutorials.

💬 L0-L1: From Manual Labor to Preliminary Assistance

At L0, data-related tasks are performed entirely by human experts without any automation. The process is completely human-driven and requires extensive domain knowledge and solid technical expertise, which makes it highly specialized and time-consuming.

L1

At L1, data agents begin to provide preliminary, single-point assistance through typical question-answering interactions. While they can help with atomic tasks such as generating code snippets, they lack environmental perception and require considerable human validation, editing, and optimization.

Data Management

Configuration Tuning
Query Optimization
System Diagnosis

Data Preparation

Data Cleaning
Data Integration
Data Discovery

Data Analysis

TableQA
NL2SQL
NL2VIS
Unstructured Data Analysis
Report Generation

Comparison of L1 Data Agents

ICL: In-Context Learning; RAG: Retrieval-Augmented Generation; SFT: Supervised Fine-Tuning; RL: Reinforcement Learning. Data complexity dimensions include Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.

🌏 L2: Perceive the Environment

L2

At L2, data agents gain the ability to perceive and interact with their environment, including data lakes, code interpreters, APIs, and other resources. In addition, L2 data agents can possess memory, invoke external tools, and adaptively optimize their actions based on environmental feedback, enabling partial autonomy in task-specific procedures. At this level, they evolve from simple responders to procedural executors operating within human-orchestrated pipelines, where humans remain responsible for managing the overall workflow and still retain dominance over data-related tasks.
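
The perceive-act-feedback behavior described above can be sketched with an in-memory SQLite database standing in for the environment. This is a toy illustration under stated assumptions, not a method from any surveyed paper: `perceive_schema`, `run_with_feedback`, and `stub_propose` are hypothetical names, and `stub_propose` is a hard-coded stand-in for an LLM call.

```python
import sqlite3


def perceive_schema(conn):
    """Environmental perception: read table and column names from the database."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = [col[1] for col in columns]  # col[1] is the column name
    return schema


def run_with_feedback(conn, question, propose, max_attempts=3):
    """Procedural execution: propose SQL, run it, and feed errors back."""
    schema = perceive_schema(conn)
    feedback = None
    for _ in range(max_attempts):
        sql = propose(question, schema, feedback)
        try:
            return conn.execute(sql).fetchall()  # tool invocation
        except sqlite3.Error as exc:
            feedback = str(exc)  # environmental feedback for the next attempt
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {feedback}")


def stub_propose(question, schema, feedback):
    """Hypothetical stand-in for an LLM: guesses a column name first,
    then falls back to the perceived schema once an error comes back."""
    if feedback is None:
        return "SELECT SUM(amount) FROM orders"  # 'amount' does not exist
    column = schema["orders"][-1]
    return f"SELECT SUM({column}) FROM orders"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 15.5)])
result = run_with_feedback(conn, "What is the total order value?", stub_propose)
print(result)
```

Note that the human still fixes the procedure here (one question, one retry loop); the agent only executes within it, which is what keeps this sketch at L2 rather than L3.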

Data Management

Configuration Tuning
Query Optimization
System Diagnosis

Data Preparation

Data Cleaning
Data Integration
Data Discovery

Data Analysis

TableQA
NL2SQL
NL2VIS
Unstructured Data Analysis
Report Generation

Comparison of L2 Data Agents

RAG: Retrieval-Augmented Generation; Percept: Environmental Perception; Plan: Planning; Mem: Memory; Tool: Tool invocation; Reflect: Self-reflection mechanism; MAS: Multi-agent system; SFT: Supervised Fine-Tuning; RL: Reinforcement Learning. Data complexity dimensions include Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.

🤖 Proto-L3: Striving for Autonomous Data Agents

L3

L3 data agents are expected to autonomously orchestrate tailored data pipelines for a wide range of diverse and comprehensive data-related tasks under supervision, extending beyond human-defined workflows or specific tasks. This level marks a critical transition in which the data agent assumes a dominant role in data-related tasks, while humans act as supervisors overseeing the data agents' operation. To date, no existing system has fully realized such versatile, self-directed orchestration capabilities that define a complete L3 data agent. However, emerging efforts from both academia and industry are beginning to address these challenges, giving rise to what we term "Proto-L3" data agents.
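
To make "orchestrating a pipeline" concrete, the sketch below searches a small operator registry for a sequence that reaches a goal state, instead of following a human-written workflow. It is a deliberately simplified illustration: the registry `OPERATORS`, its state names, and `plan_pipeline` are assumptions made for this example, and real Proto-L3 systems are far richer than a breadth-first search.

```python
from collections import deque

# Hypothetical operator registry: name -> (required input state, produced state).
OPERATORS = {
    "clean": ("raw", "cleaned"),
    "integrate": ("cleaned", "integrated"),
    "analyze": ("integrated", "insights"),
    "summarize": ("cleaned", "insights"),
    "report": ("insights", "report"),
}


def plan_pipeline(start, goal, operators=OPERATORS):
    """Breadth-first search for a shortest operator sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for name, (needs, gives) in operators.items():
            if needs == state and gives not in seen:
                seen.add(gives)
                queue.append((gives, path + [name]))
    return None  # no operator sequence reaches the goal


pipeline = plan_pipeline("raw", "report")
print(pipeline)
```

The search is limited to predefined operators, which is exactly the restriction that the "Undef Ops." dimension of the comparison is meant to capture.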

Academia Research

Industry Products

Comparison of Proto-L3 Data Agents from Academia Research and Industry Products

Open-source: open-source availability; Undef Ops.: ability to use operators that are not predefined. The comparison also covers data-related task coverage across data management, preparation, and analysis, and data complexity dimensions: Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.

🔮 L4-L5: Vision of Proactive and Generative Data Agents (Prospect)

L4: Vision of Proactive Data Agents

At L4, data agents are expected to achieve a high level of autonomy and reliability, eliminating the need for human supervision and explicit task instructions. They can proactively identify issues worthy of investigation through continuous monitoring and exploration of data lakes, and selectively orchestrate pipelines to tackle self-discovered problems. At this level, data agents take initiative in their operations while humans fully delegate responsibility, becoming onlookers.
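
As a toy illustration of "continuous monitoring" that raises its own tasks, the sketch below flags values that deviate sharply from a trailing window and turns each into a self-initiated task description. The z-score rule, the threshold, and the names `discover_issues` and `daily_row_counts` are assumptions for this example only; the survey does not prescribe a detection method.

```python
from statistics import mean, stdev


def discover_issues(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing window
    by more than `threshold` sample standard deviations."""
    issues = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            issues.append(i)
    return issues


# Hypothetical metric: rows ingested per day, with one spike.
daily_row_counts = [100, 102, 98, 101, 99, 100, 250, 101]
tasks = [
    f"investigate anomaly at position {i}"
    for i in discover_issues(daily_row_counts)
]
print(tasks)
```

No user request triggers the task list here; it is derived from the data alone. A real L4 agent would additionally have to decide which findings are worth pursuing and orchestrate a pipeline for them.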

L4

L5: The Ultimate Vision of Ubiquitous and Generative Data Agents

At L5, the ultimate level, data agents are envisioned to go beyond applying existing methods: they invent novel solutions and pioneer new paradigms. In doing so, they advance the state of the art in data management, preparation, and analysis, making any form of human involvement unnecessary.

L5

📚 Survey and Tutorial

🔬 Research Opportunities

Enhancing the core components of data agents can involve advancing five key aspects: Perception (e.g., environmental understanding), Planning (e.g., task decomposition and reflection), Actions (e.g., autonomous pipeline orchestration), Tools (e.g., tool invocation and discovery), and Memory (e.g., strategic knowledge retention).

Our survey also finds that, despite significant progress, a substantial gap still separates current "Proto-L3" systems from true L3 autonomy. Bridging this gap, and eventually progressing to L4/L5, requires addressing several fundamental challenges, which represent key research opportunities for the field. For more information, please refer to our [Survey] and the [Slides] of a recent talk (in Chinese).

The Critical L2-to-L3 Transition

The primary bottleneck lies in elevating data agents from procedural executors (L2) to autonomous dominators (L3). As highlighted in the figure above, this requires significant advancements across five key aspects of agentic architecture: Perception, Planning, Actions, Tools, and Memory. Successfully bridging this gap involves overcoming four main deficiencies:

  • 1. Limited Autonomy in Pipeline Orchestration: Current data agents heavily rely on a predefined set of operators and tools. While some systems demonstrate "Tool Evolution" through recombination, a key opportunity lies in automatic data-skill discovery. Future research should focus on enabling data agents to generate, validate, and deploy emergent skills ab initio, transcending fixed toolsets.

  • 2. Incomplete Coverage of the Data Lifecycle: Most Proto-L3 systems are "predominantly centered on data analysis". Crucial data management tasks—such as configuration tuning, system diagnosis, and query optimization—remain "largely unaddressed". A major research direction is enhancing data agent versatility to create "data experts" that can reason about and autonomously orchestrate tasks across the full data lifecycle.

  • 3. Deficiencies in Advanced Reasoning: Data agents exhibit strong tactical capabilities (e.g., fixing immediate errors) but lack strategic "higher-order reasoning", which can trap them in "unproductive loops". This points to a need for research on integrated causal reasoning and meta-reasoning. Furthermore, data agents require sophisticated memory architectures that capture abstract strategic knowledge, not just task execution history.

  • 4. Inadequate Adaptation to Dynamic Environments: Most systems are designed and evaluated against static data and tasks. They lack the ability to "genuinely self-evolve" when faced with data drift or changing schemas. Developing effective and human-free adaptation methods—and robust evaluation benchmarks for dynamic conditions—remains a promising and critical research direction.
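
For the fourth deficiency, a first step toward adapting to changing schemas is simply noticing the change. The sketch below compares an expected schema with an observed one; `diff_schema` and the column names are hypothetical, and a real system would also need to act on the difference rather than only report it.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report what changed."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "added": sorted(observed_cols - expected_cols),
        "removed": sorted(expected_cols - observed_cols),
        "retyped": sorted(
            col
            for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


# Hypothetical schemas before and after an upstream change.
expected = {"id": "int", "price": "float", "region": "str"}
observed = {"id": "int", "price": "str", "country": "str"}

drift = diff_schema(expected, observed)
print(drift)
```

A dynamic benchmark could inject such changes between runs and check whether an agent detects and recovers from them without human help.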

The L4-L5 Vision: The Research Odyssey Ahead

Moving beyond L3 supervision towards proactive (L4) and generative (L5) data agents presents a long-term research odyssey. Key frontiers include:

  • Autonomous Problem Discovery (L4): Shifting from reacting to user tasks to proactively identifying valuable tasks. This requires endowing data agents with intrinsic motivation and "curiosity" to independently monitor and explore data lakes to discover anomalies or opportunities worthy of investigation.

  • Long-Horizon and Holistic Planning (L4): Moving beyond local, step-by-step optimization to long-horizon planning. L4 data agents must be able to make strategic trade-offs, such as balancing the immediate cost of data cleaning against long-term analytical benefits.

  • Generative Innovation (L5): The ultimate vision is a data agent that can innovate and pioneer new paradigms. Instead of merely applying existing methods, an L5 data agent would recognize the limitations of current approaches and invent novel algorithms, theories, or frameworks to advance the state of the art.
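
The cleaning-versus-benefit trade-off mentioned under long-horizon planning can be made concrete with a small worked calculation. The additive model, the optional discount factor, and the name `net_value_of_cleaning` are assumptions chosen for illustration; they are not taken from the survey.

```python
def net_value_of_cleaning(cleaning_cost, benefit_per_analysis,
                          num_future_analyses, discount=1.0):
    """Discounted benefit over future analyses minus the one-off cleaning cost.

    Analysis t (t = 1..n) contributes benefit_per_analysis * discount**t.
    """
    future_benefit = sum(
        benefit_per_analysis * discount ** t
        for t in range(1, num_future_analyses + 1)
    )
    return future_benefit - cleaning_cost


# Hypothetical numbers: cleaning costs 10 units, each later analysis gains 3.
short_horizon = net_value_of_cleaning(10, 3, num_future_analyses=2)
long_horizon = net_value_of_cleaning(10, 3, num_future_analyses=5)
print(short_horizon, long_horizon)
```

With only two future analyses, cleaning loses 4 units; with five, it gains 5. A step-by-step optimizer that looks at a single analysis would never clean, which is the kind of local decision that long-horizon planning is meant to avoid.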