
🌐🔍DeepScholar-Bench: A Live Benchmark for Generative Research Synthesis

📊 Dataset | 📄 Paper | 🏆 Live Leaderboard | 🤖 DeepResearch Preview


DeepScholar-Bench provides a live benchmark dataset and holistic evaluation of generative research synthesis, an emerging capability among AI systems designed for DeepResearch.

This repository provides:

  1. Dataset Scripts - which let you collect new datasets from recent, high-quality arXiv papers using our automated data-collection pipeline. You can set your own configurations (e.g., valid date ranges and arXiv categories) to customize your dataset.
  2. An Evaluation Suite - for measuring the performance of long-form research synthesis answers. Our evaluation framework supports a holistic set of metrics, which show high agreement with human annotations. The suite is built on the LOTUS framework for LLM-based data processing, which provides a library for LLM-based evaluations and can be used directly to instantiate your own LLM judges; a minimal sketch of such a judge follows below.

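As a taste of how a LOTUS-based judge can be put together, here is a minimal sketch using LOTUS semantic operators. The rubric, column names, and data below are hypothetical illustrations, not the actual judges shipped in this repo's eval module:

import pandas as pd
import lotus
from lotus.models import LM

# Configure the judge model used by LOTUS's semantic operators
lotus.settings.configure(lm=LM(model="gpt-4o"))

# Hypothetical table of system answers to be judged
answers = pd.DataFrame({
    "answer": [
        "Related work: retrieval-augmented generation was introduced by ...",
        "Prior systems for research synthesis include ...",
    ]
})

# sem_map applies an LLM instruction to each row; here it acts as a simple judge
judged = answers.sem_map(
    "Rate the organization of this related-works section on a 1-5 scale "
    "and briefly justify the score: {answer}"
)
print(judged)
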
If you run into any problems with the code in this repo, the leaderboard, or the dataset, please raise an issue and we will address it promptly. If you would like to add your AI system to the DeepScholar-Bench leaderboard, please fill out this form.

🚀 Quick Start

To get started, make sure you are using Python 3.10. Clone the repository and install the dependencies as follows:

# Clone the repository
git clone git@github.com:guestrin-lab/deepscholar-bench.git
cd deepscholar-bench

# Install dependencies
conda create -n dsbench python=3.10 -y
conda activate dsbench
pip install -r requirements.txt

Basic Usage

1. Collect Research Data

# Collect recent AI papers since May 1, 2025
python -m data_pipeline.main \
    --categories cs.AI \
    --start-date 2025-05-01

2. Evaluate Research Generation Systems

# Evaluate the answers generated by deepscholar_base_gpt_4.1, using gpt-4o as the judge model
# for the organization, nugget coverage, reference coverage, and citation precision metrics
python -m eval.main \
    --modes deepscholar_base \
    --evals organization nugget_coverage reference_coverage cite_p \
    --input_folder tests/baselines_results/deepscholar_base_gpt_4.1 \
    --output_folder results \
    --dataset_path dataset/related_works_combined.csv \
    --model_name gpt-4o

For more details and a full introduction, please see our Dataset Scripts Description and our Evaluation Library Description.

📚 DeepScholar Base

DeepScholar Base is our baseline research synthesis pipeline that generates comprehensive literature reviews from a research topic. It demonstrates a modular approach to deep research with the following stages:

Pipeline Overview

  1. Search - Performs agentic or recursive web search to find relevant academic papers
  2. Filter - Uses semantic filtering and ranking to select the most relevant results (a minimal sketch follows this list)
  3. Intro Generation - Creates an introductory section summarizing the research landscape
  4. Taxonomization - Categorizes references into meaningful groups with category summaries
  5. Insight Generation - Extracts key insights from each document
  6. Final Report - Synthesizes everything into a cohesive research report
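
To illustrate the Filter stage, here is a minimal sketch of semantic filtering plus top-k ranking with LOTUS operators. The query, columns, and data are hypothetical; the pipeline's actual prompts and models differ:

import pandas as pd
import lotus
from lotus.models import LM

# Configure the LM used by the semantic operators
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

# Hypothetical search results handed over by the Search stage
papers = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "abstract": ["...", "..."],
})

# Keep only results the LM judges relevant, then rank the survivors
relevant = papers.sem_filter("{abstract} is relevant to generative research synthesis")
top = relevant.sem_topk("Rank {abstract} by relevance to generative research synthesis", K=10)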

Usage

from deepscholar_base import deepscholar_base
from deepscholar_base.configs import Configs
from lotus.models import LM
import asyncio

# Configure the pipeline
configs = Configs(
    lm=LM(model="gpt-5-mini", temperature=1.0, max_tokens=10000)
)

# Run the pipeline
async def main():
    final_output, docs_df, stats = await deepscholar_base(
        configs,
        "What are the latest developments in the field of AI?"
    )
    print(final_output)

asyncio.run(main())

Configuration Options

Parameter                            Default  Description
use_agentic_search                   True     Use agentic search (vs. recursive search)
enable_web_search                    True     Enable web search for papers
per_query_max_search_results_count   10       Max results per search query
use_sem_filter                       True     Apply semantic filtering
use_sem_topk                         True     Use semantic top-k ranking
final_max_results_count              30       Max papers in the final report
categorize_references                True     Group references into categories
generate_insights                    True     Generate insights from documents

You can also configure separate LMs for different pipeline stages (search_lm, filter_lm, taxonomize_lm, generation_lm) for fine-grained control.
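
For example, here is a sketch of a customized configuration. The field names are taken from the table and the stage-LM names above, but the exact types and signatures are assumptions; check deepscholar_base.configs for the authoritative definitions:

from deepscholar_base.configs import Configs
from lotus.models import LM

configs = Configs(
    lm=LM(model="gpt-4o"),                   # default LM for stages without an override
    search_lm=LM(model="gpt-4o-mini"),       # cheaper model for the Search stage
    filter_lm=LM(model="gpt-4o-mini"),       # cheaper model for the Filter stage
    use_agentic_search=False,                # fall back to recursive search
    per_query_max_search_results_count=20,   # widen each search query
    final_max_results_count=50,              # allow more papers in the final report
)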

🤝 Contributing

We welcome contributions to DeepScholar-Bench! Please feel free to submit a PR for code contributions. If you would like to add your AI system to the DeepScholar-Bench leaderboard, please fill out this form.

Citation

If you use DeepScholar-Bench in academic work, we would greatly appreciate it if you cite it as follows:

@article{patel2025deepscholarbench,
      title={DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis}, 
      author={Liana Patel and Negar Arabzadeh and Harshit Gupta and Ankita Sundar and Ion Stoica and Matei Zaharia and Carlos Guestrin},
      year={2025},
      url={https://arxiv.org/abs/2508.20033}, 
}