Note
Curated papers and resources on Data Agents. Companion repo and paper list for our survey on data agents - A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [Paper]
We also release slides for a recent talk (Chinese): [Slides]
If you find our work useful or inspiring, please kindly give us a star ⭐️ and cite our survey:
@misc{zhu2025surveydataagentsemerging,
title={A Survey of Data Agents: Emerging Paradigm or Overstated Hype?},
author={Yizhang Zhu and Liangwei Wang and Chenyu Yang and Xiaotian Lin and Boyan Li and Wei Zhou and Xinyu Liu and Zhangyang Peng and Tianqi Luo and Yu Li and Chengliang Chai and Chong Chen and Shimin Di and Ju Fan and Ji Sun and Nan Tang and Fugee Tsung and Jiannan Wang and Chenglin Wu and Yanwei Xu and Shaolei Zhang and Yong Zhang and Xuanhe Zhou and Guoliang Li and Yuyu Luo},
year={2025},
eprint={2510.23587},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2510.23587},
}
- 🎯 Introduction
- 🪜 Levels of Data Agents
- 📑 Paper List
- 🔬 Research Opportunities
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents — autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" currently suffers from terminological ambiguity and inconsistent adoption, conflating simple query responders with sophisticated autonomous architectures. This terminological ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth.
Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation.
Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
As mentioned above, to bring clarity to the diverse landscape of data agents, we propose a hierarchical taxonomy based on their degree of autonomy. This framework maps the progressive shift of responsibility from human to agent, defining the distinct roles each plays at every stage, as summarized in the overview figure and the table below.
| Level | Degree of Autonomy | Human Role | Data Agent Role |
|---|---|---|---|
| L0 | Manual/No Autonomy | Dominator (Solo) | N/A (None) |
| L1 | Assisted | Dominator (Integrating) | Assistant (Responder) |
| L2 | Partial Autonomy | Dominator (Orchestrating) | Executor (Procedural) |
| L3 | Conditional Autonomy | Supervisor (Overseeing) | Dominator (Autonomous) |
| L4 | High Autonomy | Onlooker (Delegating) | Dominator (Proactive) |
| L5 | Full Autonomy | N/A (None) | Dominator (Generative) |
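For readers who prefer code to prose, the taxonomy can also be read as a small lookup structure. The sketch below simply mirrors the table above; the `AutonomyLevel` and `RoleSplit` names are our own illustrative choices, not part of the survey:

```python
from dataclasses import dataclass
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """Autonomy levels L0-L5, mirroring the table above."""
    L0 = 0  # Manual / No Autonomy
    L1 = 1  # Assisted
    L2 = 2  # Partial Autonomy
    L3 = 3  # Conditional Autonomy
    L4 = 4  # High Autonomy
    L5 = 5  # Full Autonomy

@dataclass(frozen=True)
class RoleSplit:
    human_role: str
    agent_role: str

TAXONOMY = {
    AutonomyLevel.L0: RoleSplit("Dominator (Solo)", "N/A (None)"),
    AutonomyLevel.L1: RoleSplit("Dominator (Integrating)", "Assistant (Responder)"),
    AutonomyLevel.L2: RoleSplit("Dominator (Orchestrating)", "Executor (Procedural)"),
    AutonomyLevel.L3: RoleSplit("Supervisor (Overseeing)", "Dominator (Autonomous)"),
    AutonomyLevel.L4: RoleSplit("Onlooker (Delegating)", "Dominator (Proactive)"),
    AutonomyLevel.L5: RoleSplit("N/A (None)", "Dominator (Generative)"),
}
```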
The transitions between these levels represent more than incremental progress; each step up the hierarchy requires a significant evolutionary leap, as shown below. These leaps involve fundamental shifts in a data agent's capabilities: gaining environmental perception (L1→L2), achieving autonomous orchestration and assuming dominance over the task (L2→L3), attaining proactive self-governance with supervision removed (L3→L4), and innovating to pioneer new paradigms (L4→L5).
We index papers by autonomy level, then by data-related tasks across Data Management, Data Preparation, and Data Analysis. Most existing work clusters in L1–L3, while L4–L5 are aspirational. We also list relevant surveys and tutorials.
At the L0 level, data-related tasks are performed entirely by human experts without any automation. The process is fully human-driven, requiring extensive domain knowledge and solid technical expertise, which makes it highly specialized and time-consuming.
At the L1 level, data agents begin to provide preliminary, single-point assistance through typical question-answering interactions. While they can help with atomic tasks such as generating code snippets, they lack environmental perception and require considerable human validation, editing, and optimization.
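To make the L1 boundary concrete, the toy sketch below shows what single-point assistance amounts to: one prompt, one response, no environment access. The `call_llm` helper is a hypothetical placeholder for whatever LLM API is in use:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a single-shot LLM API call."""
    raise NotImplementedError

def l1_assist(user_question: str) -> str:
    # One-shot assistance: the agent only responds to the question; it never
    # inspects the database, executes the code, or verifies the result.
    prompt = (
        "You are a SQL assistant. Write a SQL query for the request below.\n"
        f"Request: {user_question}"
    )
    draft = call_llm(prompt)
    # At L1, validation, editing, and optimization remain the human's job.
    return draft
```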
- LLMTune: Accelerate Database Knob Tuning with Large Language Models - arXiv 2024
- LATuner: An LLM-Enhanced Database Tuning System Based on Adaptive Surrogate Model - ECML PKDD 2024
- GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization - VLDB 2024
- λ-Tune: Harnessing Large Language Models for Automated Database System Tuning - SIGMOD 2025
- E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model - VLDB 2025
- DB-GPT: Large Language Model Meets Database - Data Science and Engineering 2024
- LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency - VLDB 2024
- Query Rewriting via Large Language Models - arXiv 2024
- Query Rewriting via LLMs - arXiv 2025
- Can Large Language Models Be Query Optimizer for Relational Databases? - arXiv 2025
- A Query Optimization Method Utilizing Large Language Models - arXiv 2025
- E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence, and Efficiency - arXiv 2025
- DBG-PT: A Large Language Model Assisted Query Performance Regression Debugger - VLDB 2024
- Automatic Database Configuration Debugging using Retrieval-Augmented Language Models - SIGMOD 2025
- Can Foundation Models Wrangle Your Data? - VLDB 2022
- RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes - VLDB Demo 2024
- Data Imputation with Limited Data Redundancy Using Data Lakes - VLDB 2025
- UniDM: A Unified Framework for Data Manipulation with Large Language Models - MLSys 2024
- LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs - ADBIS 2024
- Table-GPT: Table-tuned GPT for Diverse Table Tasks - SIGMOD 2024
- Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration - ICDE 2024
- Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing - EMNLP 2024
- ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models - VLDB 2024
- RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph - arXiv 2024
- Cocoon: Semantic Table Profiling Using Large Language Models - HILDA 2024
- AutoDDG: Automated Dataset Description Generation Using Large Language Models - arXiv 2025
- Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System - SIGMOD 2025
- Evaluating Knowledge Generation and Self-refinement Strategies for LLM-Based Column Type Annotation - ADBIS 2025
- Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models - EMNLP 2025
- Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning - SIGIR 2023
- Binding Language Models in Symbolic Languages - ICLR 2023
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study - WSDM 2024
- TableLlama: Towards Open Large Generalist Models for Tables - NAACL 2024
- DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction - NeurIPS 2023
- Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation - VLDB 2023
- ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought - EMNLP 2023
- The Dawn of Natural Language to SQL: Are We Fully Ready? - VLDB 2024
- Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models - IEEE Access 2023
- Generating Analytic Specifications for Data Visualization from Natural Language Queries using Large Language Models - VIS 2024
- Prompt4Vis: Prompting Large Language Models with Example Mining for Tabular Data Visualization - VLDB 2025
- nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning - NeurIPS 2025
- LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering - EMNLP 2024
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval - ICLR 2024
- PDFTriage: Question Answering over Long, Structured Documents - EMNLP 2024
- Unifying Multimodal Retrieval via Document Screenshot Embedding - EMNLP 2024
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation - NAACL 2025
- Docopilot: Improving Multimodal Models for Document-Level Understanding - CVPR 2025
- DataTales: Investigating the use of Large Language Models for Authoring Data-Driven Articles - VIS 2023
- Enhancing Data Literacy On-demand: LLMs as Guides for Novices in Chart Interpretation - TVCG 2024
- ReportGPT: Human-in-the-Loop Verifiable Table-to-Text Generation - EMNLP Industry 2025
- InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions - EuroVis 2025
- VizTA: Enhancing Comprehension of Distributional Visualization with Visual-Lexical Fused Conversational Interface - EuroVis 2025
- ChartLens: Fine-grained Visual Attribution in Charts - ACL 2025
ICL: In-Context Learning; RAG: Retrieval-Augmented Generation; SFT: Supervised Fine-Tuning; RL: Reinforcement Learning. Data complexity dimensions include Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.
At L2, data agents gain the ability to perceive and interact with their environment, including data lakes, code interpreters, APIs, and other resources. In addition, L2 data agents can possess memory, invoke external tools, and adaptively optimize their actions based on environmental feedback, enabling partial autonomy in task-specific procedures. At this level, they evolve from simple responders to procedural executors operating within human-orchestrated pipelines, where humans remain responsible for managing the overall workflow and still retain dominance over data-related tasks.
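A minimal sketch of such a procedural executor is shown below: a perceive-act-retry loop over a human-chosen tool set, driven by environmental feedback. All names here (`call_llm`, the tool registry, the reply format) are illustrative assumptions rather than any surveyed system's API:

```python
def call_llm(prompt: str) -> str:           # hypothetical LLM call (placeholder)
    raise NotImplementedError

def l2_execute_step(task: str, tools: dict, max_retries: int = 3) -> str:
    """Run one human-assigned pipeline step, acting on tool feedback.

    `tools` maps names (e.g. "run_sql", "profile_table") to callables; both
    the tool set and the step itself are still chosen by the human.
    """
    observation = ""
    for _ in range(max_retries):
        decision = call_llm(
            f"Task: {task}\n"
            f"Available tools: {list(tools)}\n"
            f"Last observation: {observation or 'none'}\n"
            "Reply as 'TOOL <name> <argument>' or 'DONE <answer>'."
        )
        if decision.startswith("DONE"):
            return decision.removeprefix("DONE").strip()  # step finished
        _, name, arg = decision.split(maxsplit=2)         # parse the tool call
        try:
            observation = str(tools[name](arg))           # act on the environment
        except Exception as exc:
            observation = f"error: {exc}"                 # feed errors back
    return observation  # unresolved after retries: escalate to the human
```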
- Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation - arXiv 2024
- LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model - arXiv 2025
- Rabbit: Retrieval-Augmented Generation Enables Better Automatic Database Knob Tuning - ICDE 2025
- MCTuner: Spatial Decomposition-Enhanced Database Tuning via LLM-Guided Exploration - arXiv 2025
- SERAG: Self-Evolving RAG System for Query Optimization - SIGMOD Workshop 2025
- QUITE: A Query Rewrite System Beyond Rules with LLM Agents - arXiv 2025
- R-Bot: An LLM-Based Query Rewrite System - VLDB 2025
- Cracking SQL Barriers: An LLM-based Dialect Translation System - SIGMOD 2025
- Panda: Performance Debugging for Databases using LLM Agents - CIDR 2024
- D-Bot: Database Diagnosis System using Large Language Models - VLDB 2024
- DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs - arXiv 2025
- SketchFill: Sketch-Guided Code Generation for Imputing Derived Missing Values - arXiv 2024
- IterClean: An Iterative Data Cleaning Framework with Large Language Models - ACM-TURC 2024
- AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework - VLDB 2025
- AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark - EMNLP Findings 2025
- CleanAgent: Automating Data Standardization with LLM-based Agents - VLDB Workshop 2025
- Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets - ICLR Workshop 2025
- Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation - VLDB 2025
- Agent-OM: Leveraging LLM Agents for Ontology Matching - VLDB 2024
- Ontology Matching with Large Language Models and Prioritized Depth-First Search - Information Fusion 2025
- Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching - COLING 2025
- Chorus: Foundation Models for Unified Data Discovery and Exploration - VLDB 2024
- Data-driven Discovery with Large Generative Models - arXiv 2024
- DATALORE: Can a Large Language Model Find All Lost Scrolls in a Data Repository? - ICDE 2024
- LEDD: Large Language Model-Empowered Data Discovery in Data Lakes - arXiv 2025
- Automated Metadata Generation Using Large Language Models: A GPT-4 Case Study for Enterprise Data Profiling - Journal for Engineering and Computer Science 2025
- Automatic database description generation for Text-to-SQL - arXiv 2025
- Towards Operationalizing Heterogeneous Data Discovery - arXiv 2025
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data - EMNLP 2023
- ReAcTable: Enhancing ReAct for Table Question Answering - VLDB 2024
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding - ICLR 2024
- AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models - VLDB 2024
- Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning - ACL 2025
- ST-Raptor: LLM-Powered Semi-Structured Table Question Answering - SIGMOD 2026
- MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL - COLING 2023
- ChatBI: Towards Natural Language to Complex Business Intelligence - arXiv 2024
- CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL - ICLR 2024
- Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search - ICML 2025
- OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment - SIGMOD 2025
- ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Consensus Enforcement, and Column Exploration - arXiv 2025
- DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework - arXiv 2025
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization - ACL 2024
- Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback - EMNLP 2024
- NVAGENT: Automated Data Visualization from Natural Language via Collaborative Agent Workflow - ACL 2025
- DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning - VIS 2025
- Active Retrieval Augmented Generation - NeurIPS 2023
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts - ICML 2024
- GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models - EMNLP 2024
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection - ICLR 2024
- REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering - EMNLP 2024
- QUEST: Query Optimization in Unstructured Document Analysis - VLDB 2025
- Doctopus: Budget-aware Structural Table Extraction from Unstructured Documents - VLDB 2025
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling - arXiv 2025
- DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis - arXiv 2025
- DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts - EMNLP 2024
- From Data to Story: Towards Automatic Animated Data Video Creation with LLM-based Multi-Agent Systems - VIS Workshop 2024
- LightVA: Lightweight Visual Analytics with LLM Agent-Based Task Planning and Execution - TVCG 2024
- Multimodal DeepResearcher: Generating Text–Chart Interleaved Reports with an Agentic Framework - arXiv 2025
- DAgent: A Relational Database-Driven Data Analysis Report Generation Agent - arXiv 2025
- ChartInsighter: An Approach for Mitigating Hallucination in Time-Series Chart Summary Generation With a Benchmark Dataset - TVCG 2025
- ProactiveVA: Proactive Visual Analytics with LLM-Based UI Agent - VIS 2025
- VOICE: Visual Oracle for Interaction, Conversation, and Explanation - TVCG 2025
- NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting - VIS 2025
RAG: Retrieval-Augmented Generation; Percept: Environmental Perception; Plan: Planning; Mem: Memory; Tool: Tool invocation; Reflect: Self-reflection mechanism; MAS: Multi-agent system; SFT: Supervised Fine-Tuning; RL: Reinforcement Learning. Data complexity dimensions include Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.
L3 data agents are expected to autonomously orchestrate tailored data pipelines for a wide range of diverse and comprehensive data-related tasks under supervision, extending beyond human-defined workflows or specific tasks. This level marks a critical transition in which the data agent assumes a dominant role in data-related tasks, while humans act as supervisors overseeing the data agents' operation. To date, no existing system has fully realized such versatile, self-directed orchestration capabilities that define a complete L3 data agent. However, emerging efforts from both academia and industry are beginning to address these challenges, giving rise to what we term "Proto-L3" data agents.
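The control flow that distinguishes L3 from L2 can be sketched as follows: the agent drafts and executes its own pipeline while the human merely approves or vetoes. This is a conceptual sketch only; the plan format and helpers are assumptions, since no complete L3 system exists yet:

```python
def call_llm(prompt: str) -> str:           # hypothetical LLM call (placeholder)
    raise NotImplementedError

def l3_orchestrate(goal: str, operators: dict, approve=input) -> list:
    """Agent-proposed pipeline, human-supervised execution (Proto-L3 style)."""
    # 1. The agent, not the human, drafts the pipeline for the stated goal.
    plan = call_llm(
        f"Goal: {goal}\n"
        f"Available operators: {list(operators)}\n"
        "Return one operator name per line, in execution order."
    ).splitlines()

    # 2. The human is only a supervisor: they oversee, but do not orchestrate.
    if approve(f"Proposed pipeline {plan}. Run it? [y/N] ").lower() != "y":
        return []

    # 3. The agent executes its own plan; the supervisor can still intervene.
    return [operators[step.strip()]() for step in plan if step.strip()]
```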
- AgenticData: An Agentic Data Analytics System for Heterogeneous Data - arXiv 2025
- DeepAnalyze: Agentic Large Language Models for Autonomous Data Science - arXiv 2025
- AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries - CIDR 2025
- iDataLake: An LLM-Powered Analytics System on Data Lakes - IEEE Data Engineering Bulletin 2025
- SiriusBI: A Comprehensive LLM-Powered Solution for Data Analytics in Business Intelligence - VLDB 2025
- Data Interpreter: An LLM Agent For Data Science - ACL Findings 2024
- JoyAgent — JDCHO
- TabTab — TabTab AI
- Assist. DS Agent — Databricks
- Data Agent — Bytedance
- BigQuery — Google
- Cortex — Snowflake
- Xata Agent — Xata
The comparison covers open-source availability; Undef. Ops., the ability to use operators that are not predefined; data-related task coverage across data management, preparation, and analysis; and data complexity dimensions: Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.
At L4, data agents are expected to achieve a high level of autonomy and reliability, eliminating the need for human supervision and explicit task instructions. They can proactively identify issues worthy of investigation through continuous monitoring and exploration of data lakes, and selectively orchestrate pipelines to tackle self-discovered problems. At this level, data agents take initiative in their operations while humans fully delegate responsibility, becoming onlookers.
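Because no L4 system exists, the following sketch is deliberately speculative; every name in it is a placeholder for a capability an L4 agent would need. It only illustrates the shift from reacting to user requests to a continuous monitor-discover-act loop:

```python
import time

def l4_proactive_loop(data_lake, find_anomalies, build_pipeline, notify,
                      poll_seconds: int = 3600) -> None:
    """Hypothetical L4 loop: monitor, self-discover issues, act, then report.

    `data_lake`, `find_anomalies`, `build_pipeline`, and `notify` are
    placeholders for capabilities an L4 agent would need, not a real API.
    """
    while True:
        snapshot = data_lake.scan()             # continuous monitoring
        for issue in find_anomalies(snapshot):  # self-discovered problems
            pipeline = build_pipeline(issue)    # agent-orchestrated remedy
            outcome = pipeline.run()
            notify(issue, outcome)              # humans remain onlookers
        time.sleep(poll_seconds)
```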
At the ultimate level of L5, beyond applying existing methods, data agents are envisioned to be capable of inventing novel solutions and pioneering new paradigms. In doing so, they advance the state-of-the-art in data management, preparation, and analysis, making any form of human involvement unnecessary.
- A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? - TKDE 2025
- Natural Language to SQL: State of the Art and Open Problems - VLDB Tutorial 2025
- A Survey of LLM × DATA - arXiv 2025
- LLM/Agent-as-Data-Analyst: A Survey - arXiv 2025
- Large Language Model-based Data Science Agent: A Survey - arXiv 2025
- LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions - arXiv 2025
- Large Language Models for Data Science: A Survey - arXiv 2025
- A Survey on Large Language Model-based Agents for Statistics and Data Science - The American Statistician 2025
- Large Language Models for Data Discovery and Integration: Challenges and Opportunities - IEEE Data Eng. Bull. 2025
- Data+AI: LLM4Data and Data4LLM - SIGMOD Tutorial 2025
- LLM for Data Management - VLDB Tutorial 2024
- Large Language Models for Data Annotation and Synthesis: A Survey - ACL 2024
- Data Management for Machine Learning: A Survey - TKDE 2023
- LLM As DBA - arXiv 2023
- Demystifying Artificial Intelligence for Data Preparation - SIGMOD Tutorial 2023
Enhancing the core components of data agents can involve advancing five key aspects: Perception (e.g., environmental understanding), Planning (e.g., task decomposition and reflection), Actions (e.g., autonomous pipeline orchestration), Tools (e.g., tool invocation and discovery), and Memory (e.g., strategic knowledge retention).
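These five aspects can be viewed as the abstract interface a data agent must implement. The protocol below is a conceptual sketch of that decomposition only, not an API defined in the survey or in any surveyed system:

```python
from typing import Any, Protocol

class DataAgentCore(Protocol):
    """Conceptual interface for the five core components named above."""

    def perceive(self, environment: Any) -> Any:
        """Perception: understand schemas, data lakes, and runtime state."""

    def plan(self, goal: str, observation: Any) -> list:
        """Planning: decompose the goal into steps and reflect on failures."""

    def act(self, step: str) -> Any:
        """Actions: autonomously orchestrate and execute pipeline steps."""

    def use_tool(self, name: str, **kwargs: Any) -> Any:
        """Tools: discover and invoke external operators and services."""

    def remember(self, key: str, value: Any) -> None:
        """Memory: retain strategic knowledge, not just execution history."""
```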
Our survey also identifies and discusses a key observation: while significant progress has been made, a substantial chasm still separates current "Proto-L3" systems from true L3 autonomy. Bridging this gap, and eventually progressing to L4/L5, requires addressing several fundamental challenges, which represent key research opportunities for the field. For more information, please refer to our [Survey] and the [Slides] of a recent talk (in Chinese).
The primary bottleneck lies in elevating data agents from procedural executors (L2) to autonomous dominators (L3). As highlighted in the figure above, this requires significant advancements across five key aspects of agentic architecture: Perception, Planning, Actions, Tools, and Memory. Successfully bridging this gap involves overcoming four main deficiencies:
1. Limited Autonomy in Pipeline Orchestration: Current data agents heavily rely on a predefined set of operators and tools. While some systems demonstrate "Tool Evolution" through recombination, a key opportunity lies in automatic data-skill discovery. Future research should focus on enabling data agents to generate, validate, and deploy emergent skills ab initio, transcending fixed toolsets.
2. Incomplete Coverage of the Data Lifecycle: Most Proto-L3 systems are "predominantly centered on data analysis". Crucial data management tasks—such as configuration tuning, system diagnosis, and query optimization—remain "largely unaddressed". A major research direction is enhancing data agent versatility to create "data experts" that can reason about and autonomously orchestrate tasks across the full data lifecycle.
3. Deficiencies in Advanced Reasoning: Data agents exhibit strong tactical capabilities (e.g., fixing immediate errors) but lack strategic "higher-order reasoning", which can trap them in "unproductive loops". This highlights a need for research in integrated causal reasoning and meta-reasoning. Furthermore, data agents require sophisticated memory architectures that capture abstract strategic knowledge, not just task execution history.
4. Inadequate Adaptation to Dynamic Environments: Most systems are designed and evaluated against static data and tasks. They lack the ability to "genuinely self-evolve" when faced with data drift or changing schemas. Developing effective and human-free adaptation methods—and robust evaluation benchmarks for dynamic conditions—remains a promising and critical research direction.
Moving beyond L3 supervision towards proactive (L4) and generative (L5) data agents presents a long-term research odyssey. Key frontiers include:
- Autonomous Problem Discovery (L4): Shifting from reacting to user tasks to proactively identifying valuable tasks. This requires endowing data agents with intrinsic motivation and "curiosity" to independently monitor and explore data lakes to discover anomalies or opportunities worthy of investigation.
- Long-Horizon and Holistic Planning (L4): Moving beyond local, step-by-step optimization to long-horizon planning. L4 data agents must be able to make strategic trade-offs, such as balancing the immediate cost of data cleaning against long-term analytical benefits.
- Generative Innovation (L5): The ultimate vision is a data agent that can innovate and pioneer new paradigms. Instead of merely applying existing methods, an L5 data agent would recognize the limitations of current approaches and invent novel algorithms, theories, or frameworks to advance the state-of-the-art.