Notes: Preparing GLUE data as input for MIGraphX

mvermeulen edited this page Sep 4, 2019 · 1 revision

Overview

This page provides examples of how GLUE benchmark data can be preprocessed for use in MIGraphX (or other frameworks).

Get GLUE

GLUE can be downloaded from http://www.gluebenchmark.com. In particular, the "STARTER CODE" link points to a download script that can be invoked as

python download_glue_data.py --data_dir glue_data --tasks all

Running this script creates a directory "glue_data" with a subdirectory for each individual task. Within the MRPC task one finds a set of tab-separated values (.tsv) files, including one each for training (train.tsv), validation (dev.tsv), and testing (test.tsv).

Parsing and tokenizing sentences

While the GLUE tasks all seem to have .tsv files that contain sentences and labels, the exact field assignments vary between tasks. In the case of MRPC, the first field is the label and the fourth and fifth fields are the sentences. These sentences can be extracted using the following awk expression and shell script (*)

IFS='|'
# -F'\t' splits on tabs (the default splits on any whitespace, which would
# break apart the sentences); NR > 1 skips the header row
cat glue_data/MRPC/dev.tsv | awk -F'\t' 'NR > 1 { print $1 "|" $4 "|" $5 }' | while read label sentence1 sentence2
do
   echo "label = $label"
   echo "sentence1 = $sentence1"
   echo "sentence2 = $sentence2"
done

(*) Setting IFS directly as "\t" didn't work because that assigns the two literal characters backslash and t rather than a tab; in bash, IFS=$'\t' would set an actual tab. Using awk to insert "|" as a separating token sidesteps the issue.
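The same fields can also be pulled out in Python with the standard csv module, which avoids the IFS issue entirely. This is a minimal sketch, assuming the MRPC field layout described above (label in the first field, sentences in the fourth and fifth); the sample rows are illustrative only, not actual MRPC data:

```python
import csv
import io

def read_mrpc(lines):
    """Yield (label, sentence1, sentence2) tuples from MRPC-style .tsv lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    for row in reader:
        yield row[0], row[3], row[4]

# illustrative data only, not actual MRPC rows
sample = ("Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
          "1\t100\t200\tA cat sat.\tA cat was sitting.\n")
for label, s1, s2 in read_mrpc(io.StringIO(sample)):
    print(label, s1, s2)
```

To process a real file, pass an open file object instead of the io.StringIO wrapper, e.g. `with open("glue_data/MRPC/dev.tsv") as f: ... read_mrpc(f)`.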

Now that we've parsed out each individual sentence from the .tsv file, what remains is turning sentences into tokens. The pytorch_transformers package provides a BERT tokenizer, which I have wrapped in the following Python script

#!/usr/bin/python3
#
# convert each line of stdin into a comma-separated list of BERT token ids
from pytorch_transformers import BertTokenizer
import sys

tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)

for line in sys.stdin:
    tokens = tokenizer.tokenize(line)                    # split into wordpiece tokens
    token_ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary ids
    print(*token_ids, sep=',')

This script reads sentences from standard input and prints one comma-separated list of token ids per line.
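A downstream consumer can turn each output line back into an integer list with a simple split; the ids below are placeholders for illustration, not actual BERT vocabulary entries:

```python
# one line of tokenizer output (placeholder ids, for illustration only)
line = "7592,1010,2088\n"

# split on commas and convert each field back to an integer id
token_ids = [int(t) for t in line.strip().split(",")]
print(token_ids)
```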

Putting it together

Once we have the tokenized sentences along with their labels, they can be fed into MIGraphX in a variety of ways. For example, these token lists can replace the hard-coded examples we used in our 0.4 release example. In our own scripts, we also parsed these tokens further to prepare input vectors that are passed more directly to MIGraphX.
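One such preparation step is padding or truncating each token list to the fixed sequence length a compiled model expects. The sketch below assumes a maximum length and a pad id of 0; both are model-dependent choices, and the input ids are placeholders:

```python
def to_input_vector(token_ids, seq_len=128, pad_id=0):
    """Pad or truncate a list of token ids to a fixed-length input vector.

    seq_len and pad_id are model-dependent assumptions, not values
    prescribed by MIGraphX.
    """
    ids = token_ids[:seq_len]                      # truncate if too long
    return ids + [pad_id] * (seq_len - len(ids))   # pad with pad_id if too short

# placeholder token ids for illustration
vec = to_input_vector([101, 102, 103], seq_len=8)
print(vec)  # [101, 102, 103, 0, 0, 0, 0, 0]
```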