Notes: Preparing GLUE data as input for MIGraphX
This page provides examples of how GLUE benchmark data can be preprocessed for use in MIGraphX (or other frameworks).
GLUE can be downloaded from http://www.gluebenchmark.com. In particular, the "STARTER CODE" link there provides a download script that can be invoked as
python download_glue_data.py --data_dir glue_data --tasks all
Running this script creates a directory "glue_data" with a subdirectory for each individual task. Within the MRPC subdirectory one finds a set of tab-separated values (.tsv) files, including one each for training (train.tsv), validation (dev.tsv) and testing (test.tsv).
While GLUE tasks all seem to have .tsv files that contain sentences and labels, the exact field assignments vary between tasks. In the case of MRPC, the first field is the label and the fourth and fifth fields are sentences. These sentences can be extracted using the following awk expression and shell script(*)
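# -F'\t' makes awk split on tabs; the default splits on any whitespace,
# which would break the sentences into individual words.
awk -F'\t' '{ print $1 "|" $4 "|" $5 }' glue_data/MRPC/dev.tsv | while IFS='|' read -r label sentence1 sentence2
do
    echo "label = $label"
    echo "sentence1 = $sentence1"
    echo "sentence2 = $sentence2"
done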
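(*) Setting IFS to "\t" most likely fails because the shell does not expand \t inside ordinary quotes, so IFS ends up holding a backslash and the letter t rather than a tab; in bash, IFS=$'\t' sets a real tab. Using awk to insert "|" as a separating token also works.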
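The same extraction can also be sketched in Python, which avoids the separator workaround. This is a minimal sketch, not the script we used: the function name read_mrpc is hypothetical, and it assumes the file starts with a column header line and has the label in the first field and the sentences in the fourth and fifth fields, as described above for MRPC.

```python
import csv


def read_mrpc(path):
    """Yield (label, sentence1, sentence2) tuples from an MRPC-style .tsv file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: sentences may contain unbalanced double quotes,
        # so quote characters must be treated as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # assumption: the first line is a column header
        for row in reader:
            if len(row) < 5:
                continue  # skip blank or malformed lines
            yield row[0], row[3], row[4]
```

For example, `for label, s1, s2 in read_mrpc("glue_data/MRPC/dev.tsv"): print(label, s1, s2)` prints one record per line. Other GLUE tasks would need different field indices.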
Now that we've parsed out each individual sentence from the .tsv file, what remains is turning the sentences into tokens. The pytorch_transformers package provides a BERT tokenizer that can be invoked from Python. I have encapsulated it in the following Python script
#!/usr/bin/python3
#
# convert each line of stdin into a comma separated list of token IDs
import sys

from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
for line in sys.stdin:
    # split the sentence into WordPiece tokens, then map them to vocabulary IDs
    tokens = tokenizer.tokenize(line)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(*token_ids, sep=',')
This script reads sentences from stdin, one per line, and prints a comma separated list of token IDs for each sentence.
Once we have the tokenized sentences and their labels, they can be fed to a model in a variety of ways. For example, the token lists can be substituted into the hard-coded inputs used in our 0.4 release example. In our own scripts, we also parsed these tokens further to prepare input vectors that can be passed more directly to MIGraphX.
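As an illustration of that last step, the following sketch turns two comma separated token ID lines into fixed-length, BERT-style input vectors. It is not the script we used. The function names are hypothetical, and the sequence length (128) and the special token IDs ([CLS] = 101, [SEP] = 102, [PAD] = 0) are assumptions that should be checked against the vocabulary and the input shape of the model actually being run.

```python
def parse_ids(line):
    """Parse one comma separated line of token IDs into a list of ints."""
    line = line.strip()
    return [int(tok) for tok in line.split(",")] if line else []


def build_pair_inputs(ids_a, ids_b, max_len=128, cls_id=101, sep_id=102, pad_id=0):
    """Combine two token ID lists into fixed-length input vectors.

    The defaults (max_len, cls_id, sep_id, pad_id) are assumptions, not
    values taken from the MIGraphX examples.
    """
    if max_len < 3:
        raise ValueError("max_len must leave room for three special tokens")
    ids_a, ids_b = list(ids_a), list(ids_b)
    # Trim the longer sentence until the pair plus [CLS], [SEP], [SEP] fits.
    while len(ids_a) + len(ids_b) > max_len - 3:
        longer = ids_a if len(ids_a) >= len(ids_b) else ids_b
        longer.pop()
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + sentence 1 + [SEP]; segment 1 covers sentence 2 + [SEP].
    segment_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    input_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    return (input_ids + [pad_id] * padding,
            input_mask + [0] * padding,
            segment_ids + [0] * padding)
```

Each returned list has exactly max_len entries, so the three lists can be converted to integer arrays of a fixed shape before being handed to the model.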