⭐ Please remember to star this repo if you find it useful and cite our work if you use it in your research! ⭐
🩺 If you have any questions or feedback, please create an issue! 🩺
This repository contains the official code to reconstruct HealthChat-11K, a curated dataset of approximately 11,000 real-world conversations in which users seek healthcare information from Large Language Models (LLMs). The goal of this work is to provide a high-quality resource for systematically studying and improving health conversations between humans and AI systems such as LLMs. HealthChat-11K accompanies the EMNLP 2025 Findings paper "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets.
This codebase fetches conversational data from large-scale source datasets and merges it with our detailed annotations to produce the final, ready-to-use dataset.
- Release `v1.0.0` of the master annotations and dataset artifacts generation script.
- Complete additional, minor taxonomy revisions and update master annotations.
- Release `v2.0.0` of the master annotations and dataset artifacts generation script.
- Release `v2.1.0`: dual-license-by-upstream-source reissue (same data and schema as `v2.0.0`; only license metadata changed). See Licensing for details.
The final dataset is a composition of three parts: two large-scale source datasets and our own layer of annotations. The script in this repository automates the process of combining them.
- Source Datasets (The Raw Text): Our conversations are filtered from two public datasets, LMSYS-Chat-1M and WildChat-1M.
- HealthChat Annotations (Our Contribution): We provide a master annotation file containing our core analysis, including a clinician-driven taxonomy, specialty classifications, and sycophancy analysis. This file is hosted on the Hugging Face Hub.
- Final Dataset (The Output): The script in this repo uses our annotations file to pull the correct conversations from the source datasets and generate the final, merged `HealthChat-11K_v2.1.0.jsonl` file.
This project uses Conda for environment management. The following steps will create a clean environment and install all necessary dependencies.
STEP 1: Clone the repository

    git clone https://github.com/yahskapar/HealthChat.git
    cd HealthChat

STEP 2: Run the setup script

This will create a `healthchat` conda environment with Python 3.13 and install the required packages.

    bash setup.sh

STEP 3: Activate the environment

    conda activate healthchat

Once the setup is complete, you can generate the full HealthChat-11K dataset and the accompanying review files by running the main script.

    python generate_artifacts.py

This will perform the following steps:
- Download the master annotation file (`v2.1.0`) from the Hugging Face Hub.
- Stream the source datasets (`lmsys-chat-1m` and `WildChat-1M`) to find the required conversations.
- Merge the source data with the annotations.
- Save all generated files into a new directory named `HealthChat-11K_v2.1.0_artifacts/`.
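The stream-and-merge steps above can be illustrated with a minimal sketch. This is not the code in `generate_artifacts.py`; it only shows the general idea of scanning a large stream of source records once and attaching annotations to the ones that match. The function name `merge_annotations`, the `id_field` key, and the example fields (`text`, `specialty`) are all hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, Iterator


def merge_annotations(
    source_records: Iterable[dict],
    annotations: Dict[str, dict],
    id_field: str = "conversation_id",  # hypothetical key name, not confirmed by the repo
) -> Iterator[dict]:
    """Yield each source record that has an annotation, with the annotation attached."""
    remaining = set(annotations)
    for record in source_records:
        conv_id = record.get(id_field)
        if conv_id in remaining:
            remaining.discard(conv_id)
            # Source fields first, annotation fields layered on top.
            yield {**record, **annotations[conv_id]}
            if not remaining:
                break  # every annotated conversation was found; stop streaming early


if __name__ == "__main__":
    # Toy data with hypothetical fields, standing in for a streamed source dataset.
    toy_annotations = {"a1": {"specialty": "cardiology"}}
    toy_source = [
        {"conversation_id": "zz", "text": "unrelated chat"},
        {"conversation_id": "a1", "text": "health question"},
    ]
    for merged in merge_annotations(toy_source, toy_annotations):
        print(merged)
```

Because the function accepts any iterable, it works the same way on an in-memory list or on a lazily streamed dataset, and it stops reading as soon as every annotated conversation has been matched.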
This output directory will contain:
- `HealthChat-11K_v2.1.0.jsonl`: The final, complete dataset.
- `HealthChat-11K_v2.1.0_full_review.csv`: A CSV with every conversation turn for review.
- `HealthChat-11K_v2.1.0_sycophancy_review.csv`: A CSV with leading questions seeking treatment (LQST) annotations marked for review.
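Once the script has finished, the JSONL file can be read with the Python standard library alone. The sketch below assumes only that the file holds one JSON object per line, which is what the `.jsonl` extension implies; the helper name `iter_jsonl` is an illustrative choice, and no particular record schema is assumed.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Directory and file names are taken from the output description above.
    dataset_path = Path("HealthChat-11K_v2.1.0_artifacts") / "HealthChat-11K_v2.1.0.jsonl"
    if dataset_path.exists():
        conversations = list(iter_jsonl(dataset_path))
        print(f"Loaded {len(conversations)} conversations")
    else:
        print(f"{dataset_path} not found; run generate_artifacts.py first")
```

Reading lazily with a generator keeps memory use low if you only need a single pass over the data; wrap the call in `list(...)` when you need random access.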
If you use the HealthChat dataset or the code in this toolbox for your research, please cite our work.
@article{paruchuri2025s,
title={"What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets},
author={Paruchuri, Akshay and Aziz, Maryam and Vartak, Rohit and Ali, Ayman and Uchehara, Best and Liu, Xin and Chatterjee, Ishan and Agrawal, Monica},
journal={arXiv preprint arXiv:2506.21532},
year={2025}
}

Starting with v2.1.0, the dataset is dual-licensed by upstream source. Each subset inherits the terms of its upstream corpus; no new restrictive prose has been added. Downstream users can filter by the existing `dataset_source` field to select the subset that matches their use case.
- Code: All source code in this repository (e.g., `generate_artifacts.py`, `setup.sh`) is licensed under the MIT License.
- Data Annotations, LMSYS-derived subset (`dataset_source == "lmsys"`, ~61.8% of conversations / 6,780 conversations): Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. The non-commercial restriction is inherited from LMSYS-Chat-1M's upstream terms; commercial use is not permitted on this subset.
- Data Annotations, WildChat-derived subset (`dataset_source == "wildchat"`, ~38.2% of conversations / 4,185 conversations): Licensed under the Open Data Commons Attribution License 1.0 (ODC-BY 1.0), matching WildChat-1M's upstream license (relicensed by AI2 from the prior ImpACT terms on 2024-06-26, retroactive). Commercial use is permitted with attribution.
Attribution: When using either subset, please cite the HealthChat-11K paper (see Citation) and the applicable upstream dataset (LMSYS-Chat-1M and/or WildChat-1M).
Users are responsible for independently complying with each upstream dataset's terms in addition to the annotation license above.
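Selecting a subset by `dataset_source`, as described in this section, can be sketched as follows. The field name and its two values (`"lmsys"` and `"wildchat"`) come from the text above; the helper names `select_subset` and `count_by_source`, and the `id` field in the toy records, are hypothetical. The records are assumed to be plain dictionaries already loaded from the JSONL file.

```python
from collections import Counter

# License per subset, as stated in the Licensing section.
SUBSET_LICENSES = {
    "lmsys": "CC BY-NC-SA 4.0",
    "wildchat": "ODC-BY 1.0",
}


def select_subset(records, source):
    """Return only the records whose dataset_source equals `source`."""
    if source not in SUBSET_LICENSES:
        raise ValueError(f"unknown dataset_source: {source!r}")
    return [r for r in records if r.get("dataset_source") == source]


def count_by_source(records):
    """Count records per dataset_source value."""
    return Counter(r.get("dataset_source") for r in records)


if __name__ == "__main__":
    # Toy records standing in for the loaded dataset.
    records = [
        {"id": 1, "dataset_source": "lmsys"},
        {"id": 2, "dataset_source": "wildchat"},
        {"id": 3, "dataset_source": "wildchat"},
    ]
    commercial_ok = select_subset(records, "wildchat")
    print(len(commercial_ok), "records under", SUBSET_LICENSES["wildchat"])
    print(count_by_source(records))
```

For a commercial use case, keeping only the `"wildchat"` subset would match the ODC-BY 1.0 terms described above; the upstream dataset terms still apply independently.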