This repository contains the implementation of a research project based on the paper: Rapid Word Learning Through Meta In-Context Learning. The project focuses on Meta-training for IN-context learNing Of Words (Minnow), which trains language models to generate new examples of a word's usage given a few in-context examples.
- Python 3.11
- Required packages: see requirements.txt.
You can download the meta-learning and language modeling datasets here.
For other evaluation datasets, please follow the instructions below.
Create .txt file(s) containing one sentence per line.
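As a minimal sketch of this format, the snippet below writes a list of sentences to a .txt file, one per line. The function name `write_sentences` is hypothetical, and the whitespace normalization and skipping of empty sentences are assumptions of this sketch, not requirements stated by this repository.

```python
from pathlib import Path


def write_sentences(sentences, path):
    """Write one sentence per line; return the number of lines written."""
    lines = []
    for sentence in sentences:
        # Collapse internal newlines and repeated spaces so that each
        # sentence occupies exactly one line.
        line = " ".join(sentence.split())
        if line:  # skip empty sentences (an assumption of this sketch)
            lines.append(line)
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

For example, `write_sentences(["the dog barks .", "look at\nthe ball !"], "my_corpus.txt")` would produce a two-line file.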
Use the code here.
Download and unzip the data.
Then, run
python babylm_data_processing.py --data_path ${BABYLM_DATA_PATH}
where ${BABYLM_DATA_PATH} is the path to the BabyLM split, such as babylm_data/babylm_10M.
data_processing.py processes the preprocessed .txt dataset above and generates the training, validation, and test sets. Its argument --dataset ${DATA_PATH} is the path to the directory containing the preprocessed dataset. Its argument --word_use_data_dir ${WORD_USE_DATA_DIR} is the directory that will contain the generated dataset; it defaults to word_use_data. The generated dataset is written to ${WORD_USE_DATA_DIR}/${DATA_PATH}/word, referred to below as ${DATASET_DIR}.
Note that data_processing.py uses the spaCy en_core_web_trf model to produce POS tags for the whole preprocessed dataset and caches them in a file. This may take a long time, so you may run the model on GPU by setting use_gpu_for_spacy_model = True in the code. However, doing so changes the locale because of a CUDA bug, so once you have obtained the cache you will have to set use_gpu_for_spacy_model = False again and rerun the script.
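The caching pattern described in this note can be sketched as follows. This is not the code in data_processing.py: the function names `cached_pos_tags` and `spacy_tagger` and the pickle cache format are hypothetical. The spaCy calls used (`spacy.prefer_gpu`, `spacy.load`, `nlp.pipe`, `token.pos_`) are standard spaCy API, and the tagger is passed in as a function so that the caching logic can be tried without spaCy installed.

```python
import pickle
from pathlib import Path


def spacy_tagger(use_gpu=False):
    """Build a tagging function backed by spaCy's en_core_web_trf model."""
    import spacy  # imported lazily; requires spaCy and the model to be installed

    if use_gpu:
        spacy.prefer_gpu()  # must be called before loading; falls back to CPU
    nlp = spacy.load("en_core_web_trf")

    def tag(sentences):
        # One list of coarse POS tags per input sentence.
        return [[token.pos_ for token in doc] for doc in nlp.pipe(sentences)]

    return tag


def cached_pos_tags(sentences, cache_path, tag_fn):
    """Return POS tags for sentences, computing them only if no cache exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    tags = tag_fn(sentences)
    with cache_path.open("wb") as f:
        pickle.dump(tags, f)
    return tags
```

With this pattern, the expensive tagging pass runs once (for instance with `spacy_tagger(use_gpu=True)`), and later runs read the cache without loading the model at all.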
python data_processing.py --dataset ${CHILDES_DATA_PATH} --lower --remove_sents_less_than_n_words 1 --plot_word_frequency --plot_pos --min_n_examples 5 --max_freq 200 --seed 0
python data_processing.py --dataset ${BABYLM_DATA_PATH} --lower --remove_sents_less_than_n_words 1 --remove_sents_longer_than_n_tokens 70 --plot_word_frequency --plot_pos --min_n_examples 5 --max_freq 15 --seed 0
where ${BABYLM_DATA_PATH} is the path to the BabyLM 10M split, such as babylm_data/babylm_10M.
python data_processing.py --dataset ${BABYLM_DATA_PATH} --lower --remove_sents_less_than_n_words 1 --remove_sents_longer_than_n_tokens 70 --plot_word_frequency --plot_pos --min_n_examples 10 --max_freq 100 --split_ratio 96 2 2 --seed 0
where ${BABYLM_DATA_PATH} is the path to the BabyLM 100M split, such as babylm_data/babylm_100M.
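To illustrate what the selection and split flags plausibly control, the sketch below keeps words whose number of example sentences lies between a lower and an upper bound, then divides items into train, validation, and test sets by ratio. Reading --min_n_examples and --max_freq as inclusive bounds on the number of uses, and splitting at the word level, are assumptions; the function names are hypothetical and this is not the logic of data_processing.py.

```python
import random
from collections import Counter


def select_words(tokenized_sentences, min_n_examples, max_freq):
    """Keep words used in at least min_n_examples and at most max_freq sentences."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(set(tokens))  # count each word once per sentence
    return sorted(w for w, c in counts.items() if min_n_examples <= c <= max_freq)


def split_by_ratio(items, ratio, seed=0):
    """Shuffle items reproducibly and split them proportionally to ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratio)
    splits, start, cumulative = [], 0, 0
    for r in ratio:
        cumulative += r
        # Integer boundaries; the last one always equals len(items).
        end = len(items) * cumulative // total
        splits.append(items[start:end])
        start = end
    return splits
```

Under these assumptions, `split_by_ratio(words, (96, 2, 2), seed=0)` would mirror `--split_ratio 96 2 2 --seed 0`, giving 96%, 2%, and 2% of the selected words to the three sets.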
This command gets the statistics of the dataset in ${DATASET_DIR}. It excludes words with frequency <= 9 from the vocabulary (so they are treated as unks), plots the length distribution between 0 and 70, and plots the distribution of the number of uses between 10 and 100:
python data_loading.py stat --data ${DATASET_DIR} --freq_cutoff 9 --length_range 0 70 --n_uses_range 10 100
Download the dataset file and rename it to chimeras.txt (we will use ${PATH} to refer to its path). Please be sure not to modify the file in any way (including opening and resaving it with another encoding). The file is in mixed UTF-8 and ISO-8859-1 encoding.
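Because the file mixes UTF-8 and ISO-8859-1, decoding it with a single codec can fail or corrupt characters, which is why resaving it is discouraged. One common way to read such a file, shown here as a sketch and not necessarily what generation_evaluation_data_processing.py does, is to read the raw bytes and decode each line as UTF-8, falling back to ISO-8859-1 when that fails. The function name and the per-line granularity are assumptions.

```python
def decode_mixed(raw: bytes):
    """Decode bytes line by line: try UTF-8 first, then fall back to ISO-8859-1."""
    lines = []
    for raw_line in raw.split(b"\n"):
        try:
            lines.append(raw_line.decode("utf-8"))
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so this never fails.
            lines.append(raw_line.decode("iso-8859-1"))
    return lines
```

It would be called on the untouched file contents, for example `decode_mixed(open("chimeras.txt", "rb").read())`.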
Then process the dataset file:
python generation_evaluation_data_processing.py chimeras ${PATH}
It will generate a JSON file with the same stem in the same folder as ${PATH}, which can be loaded by our code.
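The output location follows from the input path: same folder, same stem, with a .json extension. A short sketch of that naming rule with pathlib is given below; the helper name is hypothetical and the script's own implementation may differ.

```python
from pathlib import Path


def json_output_path(input_path):
    """Return the path in the same folder with the same stem and a .json suffix."""
    return Path(input_path).with_suffix(".json")
```

For instance, an input of data/chimeras.txt maps to data/chimeras.json under this rule.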
Download the dataset from the CoLLEGe paper site (you may need to ask the authors for permission to access it). It should be a HuggingFace dataset (we will use ${PATH} to refer to its path).
Then process the dataset file:
python generation_evaluation_data_processing.py defgen ${PATH}
It will generate a JSON file with the same stem in the same folder as ${PATH}, which can be loaded by our code.
Download and extract the dataset as described here. Say the path to the extracted directory is ${DEFINITION_DATA_PATH}. Let PATH=${DEFINITION_DATA_PATH}/oxford/test (you may replace test with another split or oxford with another subfolder if you want).
Then process the dataset file:
python generation_evaluation_data_processing.py ishiwatari ${PATH}
It will generate a JSON file with the same stem in the same folder as ${PATH}, which can be loaded by our code.
Run main.py. You can use runner.py to create and submit Slurm jobs:
python runner.py --job_name_base meta-word --config runner_config/config.py --run_name_flag name --submit
You may read and change the config file as needed.
You may need to change the Slurm script header to fit your environment. The default header file is runner_config/header.slurm.
You can download the pretrained checkpoints here.
Simply change the config to runner_config/finetune_config.py:
python runner.py --job_name_base meta-word --config runner_config/finetune_config.py --run_name_flag name --submit
You can download the finetuned models from Hugging Face: wwtforever/Meta-Llama-3-8B-Minnow-babylm-10m (finetuned from Llama-3-8B) and wwtforever/Meta-Llama-3-8B-Instruct-Minnow-babylm-10m (finetuned from Llama-3-8B-Instruct). To evaluate them, replace the pretrained_model argument with these names in the Llama_finetuned grid in runner_config/evaluation_config.py, and check the other arguments there for how to run the evaluation.
Run evaluation.py. On a dataset (given by --data_dir), it first evaluates the classification accuracies (the number(s) of classes are given by --eval_n_classes). Then, for each word in the dataset, it generates the next example (or definition) by greedy decoding, top-p sampling, and beam search. You can use runner.py to create and submit Slurm jobs:
python runner.py --job_name_base meta-word-eval --config runner_config/evaluation_config.py --submit
The standard output will consist of the classification accuracies and the examples along with the generations. It should be saved (in a Slurm output file, for example) for the further evaluations below.
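Of the three decoding strategies mentioned above, top-p (nucleus) sampling is the least self-explanatory: at each step it keeps the smallest set of most probable tokens whose cumulative probability reaches p, renormalizes, and samples from that set. The sketch below shows the idea on a plain list of probabilities. It is illustrative only and is not the decoding code used by evaluation.py; all names are hypothetical.

```python
import random


def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest nucleus with mass >= p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    nucleus, mass = [], 0.0
    for index, prob in ranked:
        nucleus.append((index, prob))
        mass += prob
        if mass >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {index: prob / mass for index, prob in nucleus}


def sample_top_p(probs, p, rng=random):
    """Sample one token index from the nucleus of probs."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

With p = 1.0 this reduces to ordinary sampling from the full distribution, while a very small p keeps only the most probable token and therefore behaves like greedy decoding.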
Run evaluate_generation.py on the output file(s) from evaluation.py. You may also read evaluate_generation_runner.py for how to run evaluate_generation.py in different settings, and modify/run it for your evaluations.
Run compare_generation.py on the two output files from evaluation.py.