Latest commit

64471c1 · Nov 1, 2023

Megatron-LM

Readme

Original Megatron-LM readme

Installation

cd ./data
make

Data Preprocessing

docs from Megatron-LM

Changes in preprocess_data.py:

  • The preprocess_data.py script has been moved to the megatron folder.
  • Tokenizers from HuggingFace Transformers are supported.
  • The input can be a folder containing multiple json/jsonl files.
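To make the folder input concrete, here is a minimal sketch of what such an input directory might look like and how its files could be enumerated. It assumes each line of a json/jsonl file is a JSON object with a "text" field (the default JSON key in upstream Megatron-LM); the helpers `write_jsonl`, `list_input_files`, and `iter_documents` are hypothetical names used for illustration and are not taken from preprocess_data.py.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, texts):
    """Write one JSON object per line, each with a "text" field (assumed key)."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def list_input_files(input_path):
    """Hypothetical helper: return the json/jsonl files behind an --input path."""
    p = Path(input_path)
    if p.is_file():
        return [p]
    # Sorted so that the processing order is deterministic.
    return sorted(f for f in p.iterdir() if f.suffix in {".json", ".jsonl"})


def iter_documents(input_path, key="text"):
    """Yield the text of every document, reading each file line by line."""
    for file in list_input_files(input_path):
        with open(file, encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    yield json.loads(line)[key]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        train = Path(tmp) / "train"
        train.mkdir()
        write_jsonl(train / "part-0.jsonl", ["First document.", "Second document."])
        write_jsonl(train / "part-1.json", ["Third document."])
        print(list(iter_documents(train)))
```

Files with other extensions are ignored in this sketch; whether the actual script filters by extension in the same way is an assumption.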

Example usage with an HF tokenizer:

python preprocess_data.py \
       --input ./train \
       --output-prefix ./train \
       --dataset-impl mmap \
       --tokenizer-type HFTokenizer \
       --tokenizer-name-or-path bert-base-uncased \
       --split-sentences --workers 8
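
As a rough guide to what the command above produces, the sketch below derives the output file names from the output prefix. It assumes the naming scheme of upstream Megatron-LM (prefix, JSON key, and "sentence" or "document" level, with .bin and .idx extensions) and the default JSON key "text"; this fork may differ, and `expected_output_files` is a hypothetical helper, not part of the script.

```python
def expected_output_files(output_prefix, json_keys=("text",), split_sentences=False):
    """Return the .bin/.idx paths expected under the assumed upstream naming."""
    # Upstream names the level "sentence" when --split-sentences is set.
    level = "sentence" if split_sentences else "document"
    files = []
    for key in json_keys:
        base = f"{output_prefix}_{key}_{level}"
        files += [base + ".bin", base + ".idx"]
    return files


if __name__ == "__main__":
    print(expected_output_files("./train", split_sentences=True))
    # ['./train_text_sentence.bin', './train_text_sentence.idx']
```

With `--dataset-impl mmap`, the .bin file would hold the token ids and the .idx file the index needed to read them back.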