Sorry if some of this is already answered.
-
What are the numbers in train_dataset.csv, and how can we convert them into words? My current understanding: if we download the ethz-privsec numpy datasets (i.e. we're running locally and don't want to download the Pile and run load_dataset.py to build our own data), then the .npy files are what we primarily use as our actual data. Is that right?
-
How can we tell where the prefix ends and the suffix starts (e.g., in train_dataset.csv, are the three zeroes in each row some sort of divider)? My current understanding is that we're actually just using the prefix and suffix from the .npy files. If that's the case: 1) how can we convert the .npy files back into words, e.g. with a GPT-2 tokenizer? 2) More importantly, will .npy files also be provided for the validation and subsequent datasets, so that we won't need to build those ourselves?
Thank you very much.