Code repository: "Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models" [KnowledgeableLM Workshop @ ACL 2024]

ATILF-UMR7118/pRAGe

pRAGe

📄 Code repository for Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models

🎉 Paper accepted at KnowledgeableLMs, an ACL 2024 Workshop.

The recent surge in the accessibility of large language models (LLMs) to the general population can lead to untraceable use of such models for medical recommendations.

Language generation with LLMs has two key problems:

  • first, LLMs are prone to hallucination, so any medical use requires scientific and factual grounding;
  • second, their very large size places heavy demands on computational resources.

We introduce pRAGe, a pipeline for Retrieval Augmented Generation and evaluation of medical paraphrase generation using Small Language Models (SLMs). We study the effectiveness of SLMs and the impact of an external knowledge base on medical paraphrase generation in French.

[Figure: pRAGe method overview (method-prage)]

Setup Environment

Create a virtual environment (e.g. ragenv) and install the requirements file.

python3 -m venv ~/ragenv
source ~/ragenv/bin/activate
pip3 install -r requirements.txt

Files Description

🗂️data contains:

  • Refomed-KB.zip: a 1.7M-token knowledge base automatically extracted from Wikipedia articles for 1,253 medical terms from RefoMED (the test list).

    • for every term, Refomed-KB contains the top-3 Wikipedia extracts; for example, for asthma: asthma-0.txt, asthma-1.txt, asthma-2.txt.
  • RefoMED dataset (Buhnila, 2023): an open-source dataset of 6,297 pairs of unique medical terms and their corresponding sub-sentential paraphrases in French.

    • refomed_test.csv: list used for test and evaluation
    • refomed_train.csv: list used for fine-tuning BioMistral and BARTHEZ
    • refomed_val.csv: list used for validation
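
The file naming described above can be used to look up the extracts for a given term. The following is a minimal sketch, not code from this repository: the function name `load_term_extracts`, the `kb_dir` argument, and the assumption that the unzipped archive is a flat directory of `<term>-<rank>.txt` files are all hypothetical and inferred only from the asthma example.

```python
from pathlib import Path


def load_term_extracts(kb_dir, term, top_k=3):
    """Return the text of the extracts stored for `term`, in rank order.

    Assumption: files are named `<term>-<rank>.txt` with ranks
    0..top_k-1, following the asthma-0.txt example. The real archive
    layout may differ. Missing files are skipped rather than raising.
    """
    extracts = []
    for rank in range(top_k):
        path = Path(kb_dir) / f"{term}-{rank}.txt"
        if path.is_file():
            extracts.append(path.read_text(encoding="utf-8"))
    return extracts
```

For example, `load_term_extracts("data/Refomed-KB", "asthma")` would return up to three strings, assuming the archive was unzipped to that (hypothetical) path. How terms with spaces or accents are encoded in file names is not stated in this README, so check the archive before relying on this.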

💻notebooks contains the Python code for inference, fine-tuning, pRAGe settings, and data visualization.

📊plots contains data visualization plots.

💻scripts contains the Python code for evaluating the experiments and generating reports.

Citations

Please cite our work:

pRAGe Paper

@inproceedings{buhnila2024retrieve,
  title={Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models},
  author={Buhnila, Ioana and Sinha, Aman and Constant, Matthieu},
  booktitle={Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)},
  pages={189--203},
  year={2024}
}

RefoMED Paraphrase Dataset

@phdthesis{buhnila2023methode,
  title={Une m{\'e}thode automatique de construction de corpus de reformulation},
  author={Buhnila, Ioana},
  year={2023},
  school={Universit{\'e} de Strasbourg}
}
