🍓 Alibaba International Digital Commerce 🍓
📝 HSCodeComp Paper | 📝 DeepWideSearch Paper | 🤗 HSCodeComp Dataset | 🤗 DeepWideSearch Dataset
Marco DeepResearch is a comprehensive initiative from Alibaba International Digital Commerce that advances real-world AI agent capabilities through challenging benchmarks and practical applications. Our work bridges the gap between AI agents and human experts by exposing and addressing critical limitations in domain-specific reasoning, hierarchical rule application, and large-scale information seeking.
We introduce a suite of benchmarks, frameworks and models that evaluate and advance agents across fundamental dimensions essential for real-world deployment:
- 🏆 HSCodeComp: Tests hierarchical rule application with 95.0% human performance vs. 46.8% best AI (SmolAgent + GPT-5 VLM)
- 🏆 DeepWideSearch: Challenges deep-and-wide information seeking with 414 avg. information units and 4.21 avg. reasoning depth
- 🏆 Table-as-Search: Production-ready hierarchical multi-agent framework demonstrating the "scissor gap effect" on challenging benchmarks
- 🏆 UMEM: Self-evolving memory system that avoids the "rote memorization trap" through joint optimization of memory extraction and management
These benchmarks and frameworks reveal and address fundamental gaps in current AI systems for:
- Complex hierarchical decision-making in vertical domains (tariff, legal, medical, taxation)
- Simultaneous wide-scale exploration and deep multi-hop reasoning
- Structured information organization and synthesis
- Generalizable long-term memory that evolves without overfitting
- [2026-02] 🎉 Released UMEM (Unified Memory Extraction and Management) - a self-evolving memory framework that jointly optimizes extraction and management for generalizable agent memory.
- [2026-02] 🎉 Released Table-as-Search - a structured planning framework for long-horizon agentic information seeking.
- [2026-02] 🏆 DeepWideSearch: A-MapReduce adopts DeepWideSearch as the primary benchmark for wide-search systems, achieving 79.09% Core Entity Accuracy, 51.78% Column F1, and 4.43% Success Rate (SOTA among open-source frameworks), setting a new standard for evaluating agentic search capabilities with reproducible metrics.
- [2025-10] 🔥 Initial release of Marco DeepResearch with DeepWideSearch and HSCodeComp benchmarks.
Real-world deployment showcases how our research frameworks address critical challenges in Alibaba International Digital Commerce Group's business scenarios.
Challenge: BD tasks require both breadth (finding many qualified merchants across platforms) and depth (multi-hop extraction of contact details from official sites). ReAct-style baselines suffer from unclear planning, state confusion, and coverage gaps on the DeepWideSearch benchmark.
Our Solution: Table-as-Search — We formalize long-horizon search as table completion: explicit state tracking, clear planning from the partially-filled table, and hierarchical orchestration of Wide (tabular) and Deep (multi-hop) sub-agents.
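To make the table-completion framing concrete, here is a minimal Python sketch (illustrative only — `Table`, `wide_agent`, and `deep_agent` are hypothetical stand-ins for the framework's actual sub-agents, not the repository's API):

```python
# Minimal sketch of the table-completion loop (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Table:
    columns: list[str]                       # target schema, e.g. ["merchant", "website", "contact"]
    rows: list[dict] = field(default_factory=list)

    def missing_cells(self):
        """Yield (row, column) pairs that still need to be filled."""
        for row in self.rows:
            for col in self.columns:
                if not row.get(col):
                    yield row, col

def table_as_search(query: str, columns: list[str], wide_agent, deep_agent) -> Table:
    table = Table(columns=columns)
    # Wide phase: enumerate candidate entities as partially filled rows.
    for entity in wide_agent(query):
        table.rows.append({columns[0]: entity})
    # Deep phase: multi-hop extraction for each missing attribute;
    # the partially filled table *is* the explicit planning state.
    for row, col in table.missing_cells():
        row[col] = deep_agent(entity=row[columns[0]], attribute=col)
    return table
```

The key design choice this sketch highlights is that the partially filled table doubles as the planning state, giving the orchestrator explicit state tracking that ReAct-style scratchpads lack.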
Results: On our real-world Business Development datasets, Table-as-Search delivers a 40%+ gain on hard tasks (Success Rate 15.2% → 55.8%), with Entity Recall of 89.3% (vs. 62.1%) and Attribute Completeness of 85.7% (vs. 58.4%). Deployed in BD workflows, it significantly improves working efficiency.
Performance across task difficulties: Table-as-Search (blue) vs. Multi-Agent ReAct (orange) and Single-Agent (gray) baselines.
The Problem: Hierarchical Rule Application in a Vertical Domain
Predicting a destination country’s 10-digit HS code and tariff from incomplete product information (e.g., from an ERP or catalog) requires hierarchical rule application: tariff rules have vague boundaries and implicit logic, making precise application challenging for agents. See our benchmark paper HSCodeComp for task formulation and related work.
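For intuition, HS codes are hierarchical: digits 1–2 give the chapter, 1–4 the heading, 1–6 the internationally harmonized subheading, and the remaining digits are country-specific tariff lines — which is why rule application must descend level by level. A tiny illustrative decomposition (the example code below is for illustration, not an asserted classification):

```python
# Decompose a 10-digit HS code into its hierarchical levels.
def hs_levels(code: str) -> dict[str, str]:
    assert len(code) == 10 and code.isdigit()
    return {
        "chapter": code[:2],      # e.g. 61 = knitted apparel
        "heading": code[:4],      # e.g. 6109 = T-shirts, knitted
        "subheading": code[:6],   # last internationally harmonized level
        "tariff_line": code,      # digits 7-10 are country-specific
    }

print(hs_levels("6109100012"))
# {'chapter': '61', 'heading': '6109', 'subheading': '610910', 'tariff_line': '6109100012'}
```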
Our Approach: Benchmark First, Then Tool-Augmented Agents
We first established the HSCodeComp benchmark and found that state-of-the-art agents perform poorly—far below human experts. We then designed an agent-based framework with Marco as the orchestrator: (1) multi-modal input parsing (titles, attributes, images → normalized attributes), (2) retrieval-augmented reasoning via Deep Search (historical labels, expert knowledge, customs rulings), (3) tool-integrated verification (tariff lookup, chapter/section notes, ruling validation), and (4) structured output with an auditable evidence trail.
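Sketched in Python, the orchestration reads roughly as follows (every helper here is a hypothetical stub standing in for the real component, not the repository's API, so the structure runs end-to-end on dummy data):

```python
# Schematic of the four-stage pipeline described above (all helpers are stubs).
def parse_multimodal(title, attributes, images):           # (1) input parsing -> normalized attributes
    return {"title": title, **attributes}

def deep_search(attrs):                                     # (2) retrieval-augmented reasoning
    return [{"source": "customs_ruling", "text": "..."}]

def verify(code, attrs, evidence):                          # (3) tool-integrated verification
    return len(code) == 10                                  # placeholder check

def classify_hs_code(title, attributes, images=()):
    attrs = parse_multimodal(title, attributes, images)
    evidence = deep_search(attrs)
    candidates = ["6109100012"]                             # placeholder candidate codes
    verified = [c for c in candidates if verify(c, attrs, evidence)]
    return {"hs_code": verified[0], "evidence_trail": evidence}  # (4) auditable output

print(classify_hs_code("cotton t-shirt", {"material": "100% cotton"}))
```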
Results: Clear Gain vs. Baselines, Large Gap vs. Humans
On 10-digit HS code accuracy, Marco Agent reaches 65.0% Top-1, outperforming GPT-5–based agents (46.8%), Agentorchestra (41.3%), and Claude Sonnet 4 (11.9%). As shown below, tool-augmented decision-making substantially improves over general-purpose agents. Nevertheless, a large gap remains versus human experts (95.0%), indicating significant room for further improvement.
HSCodeComp benchmark (10-digit accuracy): Marco Agent (65.0%) vs. baselines and human experts (95.0%).
The Problem: Rules Are Subtle and Shifting
In e-commerce, commodity-auditing rules are multi-modal, subtle, and constantly evolving. When the agent’s decision disagrees with expert labels (e.g., misidentifying a branded product as "counterfeit"), fixing the behavior used to require 3–5 days of manual tuning.
Our Approach: Self-Evolving Agent + UMEM
The Self-Evolving Agent learns from the gap between agent judgment and expert ground truth: it extracts nuances and integrates them into long-term memory. The engine is UMEM (Unified Memory Extraction and Management), which distills interaction traces into actionable and generalizable insights instead of merely retrieving past data. The loop is Action → Rewarding (compare with ground truth, detect bad cases) → Memory Extraction (reflect, generate candidate rules) → Validation (safety gate, then update memory or retry).
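A condensed view of this loop (a minimal sketch under our own naming; the agent interface and `passes_safety_gate` are hypothetical, not UMEM's actual interface):

```python
# Condensed self-evolution loop: Action -> Rewarding -> Extraction -> Validation.
def passes_safety_gate(rule, memory):
    """Hypothetical validation gate: reject duplicate rules as a trivial stand-in."""
    return rule not in memory

def self_evolve(agent, memory, task, ground_truth, max_retries=3):
    for _ in range(max_retries):
        decision = agent.act(task, memory)                  # Action
        if decision == ground_truth:                        # Rewarding: compare with expert label
            return memory
        rule = agent.reflect(task, decision, ground_truth)  # Memory Extraction: candidate rule
        if passes_safety_gate(rule, memory):                # Validation: safety gate
            memory.append(rule)                             # update long-term memory
        # otherwise retry with a fresh reflection
    return memory
```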
Results: 30–50× Faster Tuning, Quality Gains
The workflow is compressed from 3–5 days to a ~10-minute autonomous cycle. The self-evolving agent outperforms human-optimized baselines by +11% on white-background image auditing and +2% on short-title review. On benchmarks, UMEM consistently outperforms state-of-the-art memory baselines (ReMem, Memp) across environments. We also evaluate UMEM on other reasoning benchmarks (figure below); extensive results show that UMEM learns highly generalizable memory that improves performance on future tasks.
UMEM vs. baselines (e.g., Gemini 2.5 Flash setting): UMEM improves performance across evaluation setups.
| Benchmark | HuggingFace | GitHub | Paper |
|---|---|---|---|
| HSCodeComp | 🤗 AIDC-AI/HSCodeComp | 📁 HSCodeComp/data | 📝 arXiv |
| DeepWideSearch | 🤗 AIDC-AI/DeepWideSearch | 📁 DeepWideSearch/data | 📝 arXiv |
| Table-as-Search | 🤗 Table-as-Search Paper | 📁 Table-as-Search Codebase | 📝 arXiv |
| UMEM | 🤗 UMEM Paper | 📁 UMEM Codebase | 📝 arXiv |
Marco-DeepResearch/
├── Marco-DeepResearch-Family/ # Unified directory for all projects
│ ├── HSCodeComp/ # Hierarchical rule application benchmark
│ │ ├── data/ # 632 expert-annotated product samples
│ │ ├── eval/ # Evaluation scripts
│ │ └── README.md
│ ├── DeepWideSearch/ # Deep-and-wide information seeking benchmark
│ │ ├── data/ # 220 complex multi-hop queries
│ │ ├── eval/ # Evaluation scripts
│ │ ├── scripts/ # Batch evaluation tools
│ │ └── README.md
│ ├── Table-as-Search/ # Hierarchical multi-agent framework
│ │ ├── tools/ # Core tool implementations
│ │ ├── prompts/ # Agent prompt templates
│ │ └── README.md
│ ├── UMEM/ # Self-evolving memory system
│ │ ├── verl/ # Core source code
│ │ ├── umem_scripts/ # Training and evaluation scripts
│ │ └── README.md
│ ├── README.md # Family overview (English)
│ └── README_zh.md # Family overview (Chinese)
├── assets/ # Shared resources and visualizations
└── README.md # Main project README
Each project has its own dependencies. Navigate to the specific project directory:
# For HSCodeComp
cd Marco-DeepResearch-Family/HSCodeComp
pip install -r requirements.txt
# For DeepWideSearch
cd Marco-DeepResearch-Family/DeepWideSearch
pip install -r requirements.txt
# For Table-as-Search
cd Marco-DeepResearch-Family/Table-as-Search
pip install -r requirements.txt
# For UMEM
cd Marco-DeepResearch-Family/UMEM
pip install -r requirements.txt
pip install -e .

HSCodeComp:
cd Marco-DeepResearch-Family/HSCodeComp
python eval/test_llm.py \
--model_name your_model \
--data_path data/test_data.jsonl \
--output_path results/

DeepWideSearch:
cd Marco-DeepResearch-Family/DeepWideSearch
bash scripts/batch_eval.sh

Table-as-Search:
cd Marco-DeepResearch-Family/Table-as-Search
python run_widesearch_inference.py --query "your query" --instance-id "test_001"

UMEM:
cd Marco-DeepResearch-Family/UMEM
bash umem_scripts/run_eval.sh

For detailed setup and usage instructions, please refer to:
- HSCodeComp README - Hierarchical rule application evaluation
- DeepWideSearch README - Deep-wide search evaluation
- Table-as-Search README - Framework usage and deployment
- UMEM README - Memory system training and evaluation
The Marco DeepResearch initiative encompasses multiple benchmarks and frameworks addressing distinct challenges in real-world agent systems. Visit our Marco DeepResearch Family directory for detailed information about each project:
- 📑 HSCodeComp: Hierarchical rule application in e-commerce domain
- 🌐 DeepWideSearch: Deep-and-wide agentic information seeking
- 📊 Table-as-Search: Production-ready hierarchical multi-agent framework
- 🧠 UMEM: Unified memory extraction and management for self-evolving agents
Main contributors are from AI Business, Alibaba International Digital Commerce. For questions or collaboration, please contact:
Special Thanks:
- HSCodeComp: Human tariff experts for meticulous annotation (hourly wage: >$34/hr)
- DeepWideSearch: Built upon the open-source WideSearch framework by ByteDance-Seed (MIT License)
This project is licensed under the Apache-2.0 License. See LICENSE for details.
Our datasets are constructed using publicly accessible data sources:
- HSCodeComp: Product data from real e-commerce platforms
- DeepWideSearch: Built upon BrowseComp, BrowseComp-ZH, and WideSearch datasets
Due to the complexity of these tasks and diverse data sources, we cannot guarantee complete freedom from copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us for prompt resolution.
If you find our work useful, please consider citing:
@misc{yang2025hscodecomprealisticexpertlevelbenchmark,
title={HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application},
author={Yiqian Yang and Tian Lan and Qianghuai Jia and Li Zhu and Hui Jiang and Hang Zhu and Longyue Wang and Weihua Luo and Kaifu Zhang},
year={2025},
eprint={2510.19631},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.19631},
}
@misc{lan2025deepwidesearchbenchmarkingdepthwidth,
title={DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking},
author={Tian Lan and Bin Zhu and Qianghuai Jia and Junyang Ren and Haijun Li and Longyue Wang and Zhao Xu and Weihua Luo and Kaifu Zhang},
year={2025},
eprint={2510.20168},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.20168},
}
@misc{lan2026tableassearchformulatelonghorizonagentic,
title={Table-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion},
author={Tian Lan and Felix Henry and Bin Zhu and Qianghuai Jia and Junyang Ren and Qihang Pu and Haijun Li and Longyue Wang and Zhao Xu and Weihua Luo},
year={2026},
eprint={2602.06724},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.06724},
}
@misc{ye2026umemunifiedmemoryextraction,
title={UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory},
author={Yongshi Ye and Hui Jiang and Feihu Jiang and Tian Lan and Yichao Du and Biao Fu and Xiaodong Shi and Qianghuai Jia and Longyue Wang and Weihua Luo},
year={2026},
eprint={2602.10652},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.10652},
}
