pRAGe

📄 Code repository for Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models

🎉 Paper accepted at KnowledgeableLMs, an ACL 2024 Workshop.

The recent surge in the accessibility of large language models (LLMs) to the general population can lead to untrackable use of such models for medical recommendations.

Language generation with LLMs has two key problems:

first, LLMs are prone to hallucination, so any medical use requires scientific and factual grounding;
second, LLMs place a heavy demand on computational resources because of their very large model size.

We introduce pRAGe, a pipeline for Retrieval Augmented Generation and evaluation of medical paraphrase generation using Small Language Models (SLMs). We study the effectiveness of SLMs and the impact of an external knowledge base on medical paraphrase generation in French.

Setup Environment

Create a virtual environment (e.g. ragenv), activate it, and install the dependencies listed in requirements.txt.

python3 -m venv ~/ragenv
source ~/ragenv/bin/activate
pip3 install -r requirements.txt

Files Description

🗂️data contains:

Refomed-KB.zip: a 1.7M-token knowledge base automatically extracted from Wikipedia articles for 1,253 medical terms from RefoMED (the test list).
- for every term (for example, asthma), Refomed-KB contains the top-3 Wikipedia extracts, named asthma-0.txt, asthma-1.txt, and asthma-2.txt.
RefoMED dataset (Buhnila, 2023): an open-source dataset of 6,297 pairs of unique medical terms and their corresponding sub-sentential paraphrases in French.
- refomed_test.csv: list used for test and evaluation
- refomed_train.csv: list used for fine-tuning BioMistral and BARTHEZ
- refomed_val.csv: list used for validation
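
The file naming described above for Refomed-KB can be sketched in Python as follows. This is only an illustrative sketch, not code from the repository: the helper names extract_paths and load_extracts are hypothetical, the directory passed as kb_dir is assumed to be wherever Refomed-KB.zip was unzipped, and the term is assumed to map directly to the file stem (how multi-word or accented terms are named is not stated here).

```python
from pathlib import Path


def extract_paths(kb_dir, term, top_k=3):
    """Build the expected paths <term>-0.txt ... <term>-(top_k - 1).txt.

    Hypothetical helper: assumes the term is used verbatim as the file stem.
    """
    return [Path(kb_dir) / f"{term}-{rank}.txt" for rank in range(top_k)]


def load_extracts(kb_dir, term, top_k=3):
    """Read the extracts that exist on disk, in rank order, skipping missing files."""
    texts = []
    for path in extract_paths(kb_dir, term, top_k):
        if path.is_file():
            texts.append(path.read_text(encoding="utf-8"))
    return texts
```

For example, load_extracts("Refomed-KB", "asthma") would return the contents of asthma-0.txt, asthma-1.txt, and asthma-2.txt if those files are present in a local Refomed-KB directory (the directory name is an assumption).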

💻notebooks contains the Python code for inference, fine-tuning, pRAGe settings, and data visualization.

📊plots contains data visualization plots.

💻scripts contains the Python code for evaluating the experiments and generating reports.

Citations

Please cite our work:

pRAGe Paper

@inproceedings{buhnila2024retrieve,
  title={Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models},
  author={Buhnila, Ioana and Sinha, Aman and Constant, Matthieu},
  booktitle={Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)},
  pages={189--203},
  year={2024}
}

RefoMED Paraphrase Dataset

@phdthesis{buhnila2023methode,
  title={Une m{\'e}thode automatique de construction de corpus de reformulation},
  author={Buhnila, Ioana},
  year={2023},
  school={Universit{\'e} de Strasbourg}
}
