Sorry if some of this is already answered.
-
What are the numbers in train_dataset.csv, and how can we convert them into words? My current understanding: if we download the ethz-privsec numpy datasets (i.e. we're running locally and don't want to download the Pile and run load_dataset.py to build our own data), then the .npy files are what we primarily use as our actual data. Is that right?
-
How can we tell where the prefix ends and the suffix starts (e.g., in train_dataset.csv, are the three zeroes in each row some sort of divider)? My current understanding is that we're actually just using the prefix and suffix from the .npy files. If that's the case: 1) how can we convert the .npy files back into words, e.g. with a GPT-2 tokenizer? 2) More importantly, will .npy files also be provided for the validation and subsequent datasets, so that we won't need to build those ourselves?
Thank you very much.