Train_Dataset Question #14

@ETedward

Sorry if some of this is already answered.

  • What are the numbers in train_dataset.csv, and how can we convert them into words? My understanding is that if we're running locally and don't want to download the Pile and run load_dataset.py to build our own data, we download the ethz-privsec numpy datasets and then use the .npy files as our actual data. Is that right?

  • How can we see where the prefix ends and the suffix starts (i.e., in train_dataset.csv, are the three zeroes in each row some sort of divider)? My understanding right now is that we're actually just using the prefix and suffix from the .npy files. If that is the case: 1) how can we convert the .npy files back to words, e.g. using a GPT-2 tokenizer? 2) More importantly, will .npy files also be provided for the validation and subsequent datasets, so that we won't need to build those ourselves?
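
To make question 1) concrete, here is roughly what I have in mind. It is only a minimal sketch, assuming the .npy files hold GPT-2 token IDs with one example per row. The split position and the helper names (`split_row`, `decode_rows`, `load_gpt2_tokenizer`) are my own guesses for illustration, not something taken from the repo.

```python
import numpy as np


def split_row(row, prefix_len):
    """Split one row of token IDs into (prefix, suffix) at a fixed position.

    Assumption: prefix and suffix are simply concatenated in a row and the
    boundary is a known, fixed index (prefix_len is a placeholder value).
    """
    row = np.asarray(row)
    return row[:prefix_len], row[prefix_len:]


def decode_rows(token_ids, tokenizer):
    """Decode a 2-D array of token IDs into a list of strings.

    `tokenizer` can be any object with a `decode(list_of_ints)` method,
    for example a Hugging Face GPT-2 tokenizer.
    """
    return [tokenizer.decode([int(t) for t in row]) for row in np.asarray(token_ids)]


def load_gpt2_tokenizer():
    """Load the GPT-2 tokenizer.

    Assumption: the data was tokenized with GPT-2. The import is done lazily
    because it needs the `transformers` package and downloads the vocabulary
    on first use.
    """
    from transformers import GPT2TokenizerFast

    return GPT2TokenizerFast.from_pretrained("gpt2")
```

With the real files this would be something like `decode_rows(np.load(path), load_gpt2_tokenizer())`, where `path` points at one of the downloaded .npy files. Is that the intended way to get the text back?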

Thank you very much.
