Note
Curated papers and resources on Data Agents. Companion repo and paper list for our survey on data agents - A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [Paper]
We also release slides for a recent talk (Chinese): [Slides]
If you find our work useful or inspiring, please kindly give us a star ⭐️ and cite our survey:
@misc{zhu2025surveydataagentsemerging,
title={A Survey of Data Agents: Emerging Paradigm or Overstated Hype?},
author={Yizhang Zhu and Liangwei Wang and Chenyu Yang and Xiaotian Lin and Boyan Li and Wei Zhou and Xinyu Liu and Zhangyang Peng and Tianqi Luo and Yu Li and Chengliang Chai and Chong Chen and Shimin Di and Ju Fan and Ji Sun and Nan Tang and Fugee Tsung and Jiannan Wang and Chenglin Wu and Yanwei Xu and Shaolei Zhang and Yong Zhang and Xuanhe Zhou and Guoliang Li and Yuyu Luo},
year={2025},
eprint={2510.23587},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2510.23587},
}
- 🎯 Introduction
- 🪜 Levels of Data Agents
- 📑 Paper List
- 🔬 Research Opportunities
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents — autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" currently suffers from terminological ambiguity and inconsistent adoption, conflating simple query responders with sophisticated autonomous architectures. This terminological ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth.
Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation.
Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
As mentioned above, to bring clarity to the diverse landscape of data agents, we propose a hierarchical taxonomy based on their degree of autonomy. This framework maps the progressive shift of responsibility from human to agent, defining the distinct roles each plays at every stage, as summarized in the overview figure and the table below.
| Level | Degree of Autonomy | Human Role | Data Agent Role |
|---|---|---|---|
| L0 | Manual/No Autonomy | Dominator (Solo) | N/A (None) |
| L1 | Assisted | Dominator (Integrating) | Assistant (Responder) |
| L2 | Partial Autonomy | Dominator (Orchestrating) | Executor (Procedural) |
| L3 | Conditional Autonomy | Supervisor (Overseeing) | Dominator (Autonomous) |
| L4 | High Autonomy | Onlooker (Delegating) | Dominator (Proactive) |
| L5 | Full Autonomy | N/A (None) | Dominator (Generative) |
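For readers who prefer code to prose, the taxonomy can also be read as a small lookup structure. The sketch below simply mirrors the table above; the `AutonomyLevel` and `RoleSplit` names are our own illustrative choices, not part of the survey:

```python
from dataclasses import dataclass
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """Autonomy levels L0-L5, mirroring the table above."""
    L0 = 0  # Manual / No Autonomy
    L1 = 1  # Assisted
    L2 = 2  # Partial Autonomy
    L3 = 3  # Conditional Autonomy
    L4 = 4  # High Autonomy
    L5 = 5  # Full Autonomy

@dataclass(frozen=True)
class RoleSplit:
    human_role: str
    agent_role: str

TAXONOMY = {
    AutonomyLevel.L0: RoleSplit("Dominator (Solo)", "N/A (None)"),
    AutonomyLevel.L1: RoleSplit("Dominator (Integrating)", "Assistant (Responder)"),
    AutonomyLevel.L2: RoleSplit("Dominator (Orchestrating)", "Executor (Procedural)"),
    AutonomyLevel.L3: RoleSplit("Supervisor (Overseeing)", "Dominator (Autonomous)"),
    AutonomyLevel.L4: RoleSplit("Onlooker (Delegating)", "Dominator (Proactive)"),
    AutonomyLevel.L5: RoleSplit("N/A (None)", "Dominator (Generative)"),
}
```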
The transitions between these levels represent more than incremental progress; each step up the hierarchy requires a significant evolutionary leap, as shown below. These leaps involve fundamental shifts in a data agent's capabilities: gaining environmental perception (L1→L2), achieving autonomous orchestration and assuming dominance over the task (L2→L3), attaining proactive self-governance with supervision removed (L3→L4), and innovating to pioneer new paradigms (L4→L5).
We index papers by autonomy level, then by data-related tasks across Data Management, Data Preparation, and Data Analysis. Most existing work clusters in L1–L3, while L4–L5 are aspirational. We also list relevant surveys and tutorials.
At the L0 level, data-related tasks are performed entirely by human experts without any automation. The process is fully human-driven, requiring extensive domain knowledge and solid technical expertise, which makes it highly specialized and time-consuming.
At the L1 level, data agents begin to provide preliminary, single-point assistance through typical question-answering interactions. While they can help with atomic tasks such as generating code snippets, they lack environmental perception and require considerable human validation, editing, and optimization.
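To make the L1 boundary concrete, the toy sketch below shows what single-point assistance amounts to: one prompt, one response, no environment access. The `call_llm` helper is a hypothetical placeholder for whatever LLM API is in use:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a single-shot LLM API call."""
    raise NotImplementedError

def l1_assist(user_question: str) -> str:
    # One-shot assistance: the agent only responds to the question; it never
    # inspects the database, executes the code, or verifies the result.
    prompt = (
        "You are a SQL assistant. Write a SQL query for the request below.\n"
        f"Request: {user_question}"
    )
    draft = call_llm(prompt)
    # At L1, validation, editing, and optimization remain the human's job.
    return draft
```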
- LLMTune: Accelerate Database Knob Tuning with Large Language Models - arXiv 2024
- LATuner: An LLM-Enhanced Database Tuning System Based on Adaptive Surrogate Model - ECML PKDD 2024
- GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization - VLDB 2024
- λ-Tune: Harnessing Large Language Models for Automated Database System Tuning - SIGMOD 2025
- E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model - VLDB 2025
- DB-GPT: Large Language Model Meets Database - Data Science and Engineering 2024
- LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency - VLDB 2024
- Query Rewriting via Large Language Models - arXiv 2024
- Query Rewriting via LLMs - arXiv 2025
- Can Large Language Models Be Query Optimizer for Relational Databases? - arXiv 2025
- A Query Optimization Method Utilizing Large Language Models - arXiv 2025
- E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence, and Efficiency - arXiv 2025
- DBG-PT: A Large Language Model Assisted Query Performance Regression Debugger - VLDB 2024
- Automatic Database Configuration Debugging using Retrieval-Augmented Language Models - SIGMOD 2025
- Can Foundation Models Wrangle Your Data? - VLDB 2022
- RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes - VLDB Demo 2024
- Data Imputation with Limited Data Redundancy Using Data Lakes - VLDB 2025
- UniDM: A Unified Framework for Data Manipulation with Large Language Models - MLSys 2024
- LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs - ADBIS 2024
- Table-GPT: Table-tuned GPT for Diverse Table Tasks - SIGMOD 2024
- Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration - ICDE 2024
- Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing - EMNLP 2024
- ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models - VLDB 2024
- RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph - arXiv 2024
- Cocoon: Semantic Table Profiling Using Large Language Models - HILDA 2024
- AutoDDG: Automated Dataset Description Generation Using Large Language Models - arXiv 2025
- Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System - SIGMOD 2025
- Evaluating Knowledge Generation and Self-refinement Strategies for LLM-Based Column Type Annotation - ADBIS 2025
- Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models - EMNLP 2025
- Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning - SIGIR 2023
- Binding Language Models in Symbolic Languages - ICLR 2023
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study - WSDM 2024
- TableLlama: Towards Open Large Generalist Models for Tables - NAACL 2024
- DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction - NeurIPS 2023
- Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation - VLDB 2023
- ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought - EMNLP 2023
- The Dawn of Natural Language to SQL: Are We Fully Ready? - VLDB 2024
- Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models - IEEE Access 2023
- Generating Analytic Specifications for Data Visualization from Natural Language Queries using Large Language Models - VIS 2024
- Prompt4Vis: Prompting Large Language Models with Example Mining for Tabular Data Visualization - VLDB 2025
- nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning - NeurIPS 2025
- LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering - EMNLP 2024
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval - ICLR 2024
- PDFTriage: Question Answering over Long, Structured Documents - EMNLP 2024
- Unifying Multimodal Retrieval via Document Screenshot Embedding - EMNLP 2024
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation - NAACL 2025
- Docopilot: Improving Multimodal Models for Document-Level Understanding - CVPR 2025
- DataTales: Investigating the use of Large Language Models for Authoring Data-Driven Articles - VIS 2023
- Enhancing Data Literacy On-demand: LLMs as Guides for Novices in Chart Interpretation - TVCG 2024
- ReportGPT: Human-in-the-Loop Verifiable Table-to-Text Generation - EMNLP Industry 2025
- InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions - EuroVis 2025
- VizTA: Enhancing Comprehension of Distributional Visualization with Visual-Lexical Fused Conversational Interface - EuroVis 2025
- ChartLens: Fine-grained Visual Attribution in Charts - ACL 2025
ICL: In-Context Learning; RAG: Retrieval-Augmented Generation; SFT: Supervised Fine-Tuning; RL: Reinforcement Learning. Data complexity dimensions include Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.
At L2, data agents gain the ability to perceive and interact with their environment, including data lakes, code interpreters, APIs, and other resources. In addition, L2 data agents can possess memory, invoke external tools, and adaptively optimize their actions based on environmental feedback, enabling partial autonomy in task-specific procedures. At this level, they evolve from simple responders to procedural executors operating within human-orchestrated pipelines, where humans remain responsible for managing the overall workflow and still retain dominance over data-related tasks.
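A minimal sketch of such a procedural executor is shown below: a perceive-act-retry loop over a human-chosen tool set, driven by environmental feedback. All names here (`call_llm`, the tool registry, the reply format) are illustrative assumptions rather than any surveyed system's API:

```python
def call_llm(prompt: str) -> str:           # hypothetical LLM call (placeholder)
    raise NotImplementedError

def l2_execute_step(task: str, tools: dict, max_retries: int = 3) -> str:
    """Run one human-assigned pipeline step, acting on tool feedback.

    `tools` maps names (e.g. "run_sql", "profile_table") to callables; both
    the tool set and the step itself are still chosen by the human.
    """
    observation = ""
    for _ in range(max_retries):
        decision = call_llm(
            f"Task: {task}\n"
            f"Available tools: {list(tools)}\n"
            f"Last observation: {observation or 'none'}\n"
            "Reply as 'TOOL <name> <argument>' or 'DONE <answer>'."
        )
        if decision.startswith("DONE"):
            return decision.removeprefix("DONE").strip()  # step finished
        _, name, arg = decision.split(maxsplit=2)         # parse the tool call
        try:
            observation = str(tools[name](arg))           # act on the environment
        except Exception as exc:
            observation = f"error: {exc}"                 # feed errors back
    return observation  # unresolved after retries: escalate to the human
```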
- Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation - arXiv 2024
- LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model - arXiv 2025
- Rabbit: Retrieval-Augmented Generation Enables Better Automatic Database Knob Tuning - ICDE 2025
- MCTuner: Spatial Decomposition-Enhanced Database Tuning via LLM-Guided Exploration - arXiv 2025
- SERAG: Self-Evolving RAG System for Query Optimization - SIGMOD Workshop 2025
- QUITE: A Query Rewrite System Beyond Rules with LLM Agents - arXiv 2025
- R-Bot: An LLM-Based Query Rewrite System - VLDB 2025
- Cracking SQL Barriers: An LLM-based Dialect Translation System - SIGMOD 2025
- Panda: Performance Debugging for Databases using LLM Agents - CIDR 2024
- D-Bot: Database Diagnosis System using Large Language Models - VLDB 2024
- DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs - arXiv 2025
- SketchFill: Sketch-Guided Code Generation for Imputing Derived Missing Values - arXiv 2024
- IterClean: An Iterative Data Cleaning Framework with Large Language Models - ACM-TURC 2024
- AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework - VLDB 2025
- AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark - EMNLP Findings 2025
- CleanAgent: Automating Data Standardization with LLM-based Agents - VLDB Workshop 2025
- Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets - ICLR Workshop 2025
- Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation - VLDB 2025
- Agent-OM: Leveraging LLM Agents for Ontology Matching - VLDB 2024
- Ontology Matching with Large Language Models and Prioritized Depth-First Search - Information Fusion 2025
- Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching - COLING 2025
- Chorus: Foundation Models for Unified Data Discovery and Exploration - VLDB 2024
- Data-driven Discovery with Large Generative Models - arXiv 2024
- DATALORE: Can a Large Language Model Find All Lost Scrolls in a Data Repository? - ICDE 2024
- LEDD: Large Language Model-Empowered Data Discovery in Data Lakes - arXiv 2025
- Automated Metadata Generation Using Large Language Models: A GPT-4 Case Study for Enterprise Data Profiling - Journal for Engineering and Computer Science 2025
- Automatic database description generation for Text-to-SQL - arXiv 2025
- Towards Operationalizing Heterogeneous Data Discovery - arXiv 2025
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data - EMNLP 2023
- ReAcTable: Enhancing ReAct for Table Question Answering - VLDB 2024
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding - ICLR 2024
- AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models - VLDB 2024
- Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning - ACL 2025
- ST-Raptor: LLM-Powered Semi-Structured Table Question Answering - SIGMOD 2026
- MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL - COLING 2023
- ChatBI: Towards Natural Language to Complex Business Intelligence - arXiv 2024
- CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL - ICLR 2024
- Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search - ICML 2025
- OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment - SIGMOD 2025
- ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Consensus Enforcement, and Column Exploration - arXiv 2025
- DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework - arXiv 2025
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization - ACL 2024
- Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback - EMNLP 2024
- NVAGENT: Automated Data Visualization from Natural Language via Collaborative Agent Workflow - ACL 2025
- DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning - VIS 2025
- Active Retrieval Augmented Generation - NeurIPS 2023
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts - ICML 2024
- GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models - EMNLP 2024
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection - ICLR 2024
- REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering - EMNLP 2024
- QUEST: Query Optimization in Unstructured Document Analysis - VLDB 2025
- Doctopus: Budget-aware Structural Table Extraction from Unstructured Documents - VLDB 2025
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling - arXiv 2025
- DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis - arXiv 2025
- DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts - EMNLP 2024
- From Data to Story: Towards Automatic Animated Data Video Creation with LLM-based Multi-Agent Systems - VIS Workshop 2024
- LightVA: Lightweight Visual Analytics with LLM Agent-Based Task Planning and Execution - TVCG 2024
- Multimodal DeepResearcher: Generating Text–Chart Interleaved Reports with an Agentic Framework - arXiv 2025
- DAgent: A Relational Database-Driven Data Analysis Report Generation Agent - arXiv 2025
- ChartInsighter: An Approach for Mitigating Hallucination in Time-Series Chart Summary Generation With a Benchmark Dataset - TVCG 2025
- ProactiveVA: Proactive Visual Analytics with LLM-Based UI Agent - VIS 2025
- VOICE: Visual Oracle for Interaction, Conversation, and Explanation - TVCG 2025
- NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting - VIS 2025
RAG: Retrieval-Augmented Generation; Percept: Environmental Perception; Plan: Planning; Mem: Memory; Tool: Tool invocation; Reflect: Self-reflection mechanism; MAS: Multi-agent system; SFT: Supervised Fine-Tuning; RL: Reinforcement Learning. Data complexity dimensions include Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.
L3 data agents are expected to autonomously orchestrate tailored data pipelines for a wide range of diverse and comprehensive data-related tasks under supervision, extending beyond human-defined workflows or specific tasks. This level marks a critical transition in which the data agent assumes a dominant role in data-related tasks, while humans act as supervisors overseeing the data agents' operation. To date, no existing system has fully realized such versatile, self-directed orchestration capabilities that define a complete L3 data agent. However, emerging efforts from both academia and industry are beginning to address these challenges, giving rise to what we term "Proto-L3" data agents.
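The control flow that distinguishes L3 from L2 can be sketched as follows: the agent drafts and executes its own pipeline while the human merely approves or vetoes. This is a conceptual sketch only; the plan format and helpers are assumptions, since no complete L3 system exists yet:

```python
def call_llm(prompt: str) -> str:           # hypothetical LLM call (placeholder)
    raise NotImplementedError

def l3_orchestrate(goal: str, operators: dict, approve=input) -> list:
    """Agent-proposed pipeline, human-supervised execution (Proto-L3 style)."""
    # 1. The agent, not the human, drafts the pipeline for the stated goal.
    plan = call_llm(
        f"Goal: {goal}\n"
        f"Available operators: {list(operators)}\n"
        "Return one operator name per line, in execution order."
    ).splitlines()

    # 2. The human is only a supervisor: they oversee, but do not orchestrate.
    if approve(f"Proposed pipeline {plan}. Run it? [y/N] ").lower() != "y":
        return []

    # 3. The agent executes its own plan; the supervisor can still intervene.
    return [operators[step.strip()]() for step in plan if step.strip()]
```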
- AgenticData: An Agentic Data Analytics System for Heterogeneous Data - arXiv 2025
- DeepAnalyze: Agentic Large Language Models for Autonomous Data Science - arXiv 2025
- AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries - CIDR 2025
- iDataLake: An LLM-Powered Analytics System on Data Lakes - IEEE Data Engineering Bulletin 2025
- SiriusBI: A Comprehensive LLM-Powered Solution for Data Analytics in Business Intelligence - VLDB 2025
- Data Interpreter: An LLM Agent For Data Science - ACL Findings 2024
- JoyAgent — JDCHO
- TabTab — TabTab AI
- Assist. DS Agent — Databricks
- Data Agent — Bytedance
- BigQuery — Google
- Cortex — Snowflake
- Xata Agent — Xata
The comparison covers open-source availability; Undef. Ops., the ability to use operators that are not predefined; data-related task coverage across data management, preparation, and analysis; and data complexity dimensions: Multi-source (Multis.), Heterogeneous (Hete.), and Multimodal (Multim.) data support.
At L4, data agents are expected to achieve a high level of autonomy and reliability, eliminating the need for human supervision and explicit task instructions. They can proactively identify issues worthy of investigation through continuous monitoring and exploration of data lakes, and selectively orchestrate pipelines to tackle self-discovered problems. At this level, data agents take initiative in their operations while humans fully delegate responsibility, becoming onlookers.
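Because no L4 system exists, the following sketch is deliberately speculative; every name in it is a placeholder for a capability an L4 agent would need. It only illustrates the shift from reacting to user requests to a continuous monitor-discover-act loop:

```python
import time

def l4_proactive_loop(data_lake, find_anomalies, build_pipeline, notify,
                      poll_seconds: int = 3600) -> None:
    """Hypothetical L4 loop: monitor, self-discover issues, act, then report.

    `data_lake`, `find_anomalies`, `build_pipeline`, and `notify` are
    placeholders for capabilities an L4 agent would need, not a real API.
    """
    while True:
        snapshot = data_lake.scan()             # continuous monitoring
        for issue in find_anomalies(snapshot):  # self-discovered problems
            pipeline = build_pipeline(issue)    # agent-orchestrated remedy
            outcome = pipeline.run()
            notify(issue, outcome)              # humans remain onlookers
        time.sleep(poll_seconds)
```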
At the ultimate level of L5, beyond applying existing methods, data agents are envisioned to be capable of inventing novel solutions and pioneering new paradigms. In doing so, they advance the state-of-the-art in data management, preparation, and analysis, making any form of human involvement unnecessary.
- A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? - TKDE 2025
- Natural Language to SQL: State of the Art and Open Problems - VLDB Tutorial 2025
- A Survey of LLM × DATA - arXiv 2025
- LLM/Agent-as-Data-Analyst: A Survey - arXiv 2025
- Large Language Model-based Data Science Agent: A Survey - arXiv 2025
- LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions - arXiv 2025
- Large Language Models for Data Science: A Survey - arXiv 2025
- A Survey on Large Language Model-based Agents for Statistics and Data Science - The American Statistician 2025
- Large Language Models for Data Discovery and Integration: Challenges and Opportunities - IEEE Data Eng. Bull. 2025
- Data+AI: LLM4Data and Data4LLM - SIGMOD Tutorial 2025
- LLM for Data Management - VLDB Tutorial 2024
- Large Language Models for Data Annotation and Synthesis: A Survey - ACL 2024
- Data Management for Machine Learning: A Survey - TKDE 2023
- LLM As DBA - arXiv 2023
- Demystifying Artificial Intelligence for Data Preparation - SIGMOD Tutorial 2023
Enhancing the core components of data agents can involve advancing five key aspects: Perception (e.g., environmental understanding), Planning (e.g., task decomposition and reflection), Actions (e.g., autonomous pipeline orchestration), Tools (e.g., tool invocation and discovery), and Memory (e.g., strategic knowledge retention).
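These five aspects can be viewed as the abstract interface a data agent must implement. The protocol below is a conceptual sketch of that decomposition only, not an API defined in the survey or in any surveyed system:

```python
from typing import Any, Protocol

class DataAgentCore(Protocol):
    """Conceptual interface for the five core components named above."""

    def perceive(self, environment: Any) -> Any:
        """Perception: understand schemas, data lakes, and runtime state."""

    def plan(self, goal: str, observation: Any) -> list:
        """Planning: decompose the goal into steps and reflect on failures."""

    def act(self, step: str) -> Any:
        """Actions: autonomously orchestrate and execute pipeline steps."""

    def use_tool(self, name: str, **kwargs: Any) -> Any:
        """Tools: discover and invoke external operators and services."""

    def remember(self, key: str, value: Any) -> None:
        """Memory: retain strategic knowledge, not just execution history."""
```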
Our survey also identifies and discusses a key observation: while significant progress has been made, a substantial chasm still separates current "Proto-L3" systems from true L3 autonomy. Bridging this gap, and eventually progressing to L4/L5, requires addressing several fundamental challenges, which represent key research opportunities for the field. For more information, please refer to our [Survey] and the [Slides] of a recent talk (in Chinese).
The primary bottleneck lies in elevating data agents from procedural executors (L2) to autonomous dominators (L3). As highlighted in the figure above, this requires significant advancements across five key aspects of agentic architecture: Perception, Planning, Actions, Tools, and Memory. Successfully bridging this gap involves overcoming four main deficiencies:
1. Limited Autonomy in Pipeline Orchestration: Current data agents heavily rely on a predefined set of operators and tools. While some systems demonstrate "Tool Evolution" through recombination, a key opportunity lies in automatic data-skill discovery. Future research should focus on enabling data agents to generate, validate, and deploy emergent skills ab initio, transcending fixed toolsets.
2. Incomplete Coverage of the Data Lifecycle: Most Proto-L3 systems are "predominantly centered on data analysis". Crucial data management tasks—such as configuration tuning, system diagnosis, and query optimization—remain "largely unaddressed". A major research direction is enhancing data agent versatility to create "data experts" that can reason about and autonomously orchestrate tasks across the full data lifecycle.
3. Deficiencies in Advanced Reasoning: Data agents exhibit strong tactical capabilities (e.g., fixing immediate errors) but lack strategic "higher-order reasoning", which can trap them in "unproductive loops". This highlights a need for research in integrated causal reasoning and meta-reasoning. Furthermore, data agents require sophisticated memory architectures that capture abstract strategic knowledge, not just task execution history.
4. Inadequate Adaptation to Dynamic Environments: Most systems are designed and evaluated against static data and tasks. They lack the ability to "genuinely self-evolve" when faced with data drift or changing schemas. Developing effective and human-free adaptation methods—and robust evaluation benchmarks for dynamic conditions—remains a promising and critical research direction.
Moving beyond L3 supervision towards proactive (L4) and generative (L5) data agents presents a long-term research odyssey. Key frontiers include:
- Autonomous Problem Discovery (L4): Shifting from reacting to user tasks to proactively identifying valuable tasks. This requires endowing data agents with intrinsic motivation and "curiosity" to independently monitor and explore data lakes to discover anomalies or opportunities worthy of investigation.
- Long-Horizon and Holistic Planning (L4): Moving beyond local, step-by-step optimization to long-horizon planning. L4 data agents must be able to make strategic trade-offs, such as balancing the immediate cost of data cleaning against long-term analytical benefits.
- Generative Innovation (L5): The ultimate vision is a data agent that can innovate and pioneer new paradigms. Instead of merely applying existing methods, an L5 data agent would recognize the limitations of current approaches and invent novel algorithms, theories, or frameworks to advance the state-of-the-art.