- 🗓️ 2025/09/04: Updated leaderboard with 8 models (93 models in total)! View the full list of added models
- 🗓️ 2025/07/22: Updated leaderboard with 9 models (85 models in total)! View the full list of added models
- 🗓️ 2025/06/03: Updated leaderboard with 21 models (76 models in total)! View the full list of added models
- 🗓️ 2025/04/28: BRIDGE Leaderboard V1.0.0 is now live!
- 🗓️ 2025/04/28: Our paper BRIDGE is now available on arXiv!
Large Language Models (LLMs) have demonstrated transformative potential in healthcare, yet concerns remain about their reliability and clinical validity across diverse clinical tasks, specialties, and languages. To support timely and trustworthy evaluation, and building upon our systematic review of global clinical text resources, we introduce BRIDGE, a multilingual benchmark comprising 87 real-world clinical text tasks that span nine languages and more than one million samples. Furthermore, we construct this leaderboard of LLMs for clinical text understanding by systematically evaluating 95 state-of-the-art LLMs (as of 2025/10/27).
Key Features: Real-world Clinical Text, 9 Languages, 9 Task types, 14 Clinical specialties, 7 Clinical document types, and 20 Clinical applications covering 6 clinical stages of patient care.
More details can be found in our BRIDGE paper and systematic review, and the comprehensive leaderboard is available at BRIDGE Leaderboard.
This project is led and maintained by the team of Prof. Jie Yang and Prof. Kueiyu Joshua Lin at Harvard Medical School and Brigham and Women's Hospital.
All fully open-access datasets in BRIDGE are available in BRIDGE-Open. To ensure leaderboard fairness, we release the following data for each task: five completed samples that serve as few-shot examples, and all test samples with their instructions and input information. Due to the privacy and security considerations of clinical data, regulated-access datasets cannot be published directly. Therefore, detailed descriptions of all tasks and their corresponding data sources are available in Appendix Section 5 (BRIDGE Dataset and Task Information) of our BRIDGE paper. Importantly, all 87 datasets have been verified to be either fully open-access or publicly accessible via reasonable request.
Additionally, we provide a Python script (dataset_download.py) to download the dataset from Hugging Face; alternatively, you can download it manually from the Hugging Face page.
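If you prefer to fetch the data programmatically yourself, a minimal sketch using the huggingface_hub library is shown below. The repo_id is an assumption and should be replaced with the actual BRIDGE-Open repository name shown on Hugging Face; the target folder follows the dataset_raw convention used later in this README.

```python
# Minimal sketch; the repo_id is a placeholder for the real BRIDGE-Open repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="YLab-Open/BRIDGE-Open",  # hypothetical repo id; replace with the real one
    repo_type="dataset",
    local_dir="dataset_raw",          # folder expected by the evaluation scripts
)
```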
Install the required packages; the main requirements include:
python==3.12
If you need to run gpt-oss models, please install:
pip install vllm==0.10.1+gptoss (details in the gpt-oss vLLM Usage Guide)
Otherwise, you can install vLLM directly:
pip install vllm==0.8.3
The requirements.txt file lists all the dependencies we used to conduct our experiments on H100 GPUs.
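As a quick, optional sanity check that vLLM is installed and can load a model, the following sketch runs a single short generation. The model name is just a small placeholder and is not part of BRIDGE.

```python
# Optional vLLM smoke test; the model below is a small placeholder, not a BRIDGE model.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any locally available model works
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Summarize: the patient presents with chest pain."], params)
print(outputs[0].outputs[0].text)
```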
- Put the downloaded data into the `dataset_raw` folder.
- Edit the `BRIDGE.yaml` file to specify the tasks you want to evaluate.
- Edit the `run.sh` file to specify the model you want to evaluate.
- Run the `run.sh` file to start the evaluation, which will automatically load the model and run inference on the specified tasks (`main.py`).
- Result folder: All inference results will be saved in the `result` folder, which will be created automatically. The structure of the result folder is `result -> task -> model -> experiment` (see the sketch after this list).
- Result extraction: We develop an automated script for each task to extract results from the standardized LLM outputs; details can be found in the `dataset` folder: `classification.py`, `extraction.py`, and `generation.py`.
- Evaluation metrics: We provide an evaluation function for different tasks; details can be found in the `metric` folder: `classification.py`, `extraction.py`, and `generation.py`.
- Evaluation script: Run `evaluate_BRIDGE.py` to evaluate all tasks. The performance of each task will be saved in the `performance` folder.
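To make the result layout concrete, here is a small hypothetical helper (not part of the repo) that walks the `result` folder and lists the task/model/experiment combinations that have been produced, which can be handy before running `evaluate_BRIDGE.py`.

```python
# Hypothetical helper (not part of the repo): list completed inference runs
# following the result -> task -> model -> experiment folder structure.
from pathlib import Path

result_root = Path("result")
for task_dir in sorted(p for p in result_root.iterdir() if p.is_dir()):
    for model_dir in sorted(p for p in task_dir.iterdir() if p.is_dir()):
        for exp_dir in sorted(p for p in model_dir.iterdir() if p.is_dir()):
            print(f"task={task_dir.name}  model={model_dir.name}  experiment={exp_dir.name}")
```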
If you would like to submit your model's results to the BRIDGE Leaderboard and demonstrate its performance, please send the generated result folder to us, and we will update the leaderboard accordingly. The leaderboard is updated regularly, and we will notify you via email once your results are added.
We welcome and greatly value contributions and collaborations from the community! If you have clinical text datasets that you would like to share for broader exploration, please contact us! We are committed to expanding BRIDGE while strictly adhering to appropriate data use agreements and ethical guidelines. Let's work together to advance the responsible application of LLMs in medicine!
BRIDGE is a non-profit, researcher-led benchmark that requires substantial resources (e.g., high-performance GPUs, a dedicated team) to sustain. To support open and impactful academic research that advances clinical care, we welcome your contributions. Please contact Prof. Jie Yang at [email protected] to discuss donation opportunities.
If you have any questions about BRIDGE or the leaderboard, feel free to reach out!
- Leaderboard Managers: Jiageng Wu ([email protected]), Kevin Xie ([email protected]), Bowen Gu ([email protected])
- Benchmark Managers: Jiageng Wu, Bowen Gu
- Project Lead: Jie Yang ([email protected])
If you find this leaderboard useful for your research and applications, please cite the following papers:
@article{BRIDGE-benchmark,
title={BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text},
author={Wu, Jiageng and Gu, Bowen and Zhou, Ren and Xie, Kevin and Snyder, Doug and Jiang, Yixing and Carducci, Valentina and Wyss, Richard and Desai, Rishi J and Alsentzer, Emily and Celi, Leo Anthony and Rodman, Adam and Schneeweiss, Sebastian and Chen, Jonathan H. and Romero-Brufau, Santiago and Lin, Kueiyu Joshua and Yang, Jie},
year={2025},
journal={arXiv preprint arXiv:2504.19467},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.19467},
}

If you use the datasets in BRIDGE, please also cite the original papers of the datasets, which can be found in our BRIDGE paper.



