For each dataset below, follow the provided instructions to set it up. To be used to train models, the data must be converted into a dictionary that (at the very minimum) has the following keys:
type(dataset)                #dict
dataset['train']             #CSR matrix, Ntrain x Nfeatures
dataset['valid']             #CSR matrix, Nvalid x Nfeatures
dataset['test']              #CSR matrix, Ntest x Nfeatures
dataset['dim_observations']  #scalar, Nfeatures
dataset['data_type']         #typically 'bow'
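- For reference, a minimal sketch of a dictionary in this format, built from hypothetical toy data (only the key names and shapes matter):
import numpy as np
from scipy.sparse import csr_matrix
Nfeatures = 2000
dataset = {}
dataset['train'] = csr_matrix(np.random.poisson(0.1, size=(500, Nfeatures)))  #toy counts
dataset['valid'] = csr_matrix(np.random.poisson(0.1, size=(100, Nfeatures)))  #toy counts
dataset['test']  = csr_matrix(np.random.poisson(0.1, size=(100, Nfeatures)))  #toy counts
dataset['dim_observations'] = Nfeatures
dataset['data_type'] = 'bow'
#Sanity check: required keys exist and feature dimensions agree
for key in ['train', 'valid', 'test', 'dim_observations', 'data_type']:
    assert key in dataset
for split in ['train', 'valid', 'test']:
    assert dataset[split].shape[1] == dataset['dim_observations']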
- Used in Miao et al.
#Navigate to <path>/inference_introspection/optvaedatasets/rcv2_miao/
python download.py
- Preprocess the data (this step requires Torch to be installed)
th preprocessing.lua
#Navigate to <path>/inference_introspection/optvaedatasets/
python rcv2.py
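- Once rcv2.py has run, the resulting dictionary should follow the format above. A quick interactive check, assuming a hypothetical loader function load_rcv2 (verify the actual name exposed by rcv2.py):
#Run from <path>/inference_introspection/optvaedatasets/
#`load_rcv2` is a hypothetical name; use whichever loader rcv2.py actually exposes
from rcv2 import load_rcv2
dataset = load_rcv2()
print(type(dataset))                #expect: dict
print(dataset['train'].shape)       #expect: (Ntrain, Nfeatures)
print(dataset['dim_observations'])  #expect: Nfeatures
print(dataset['data_type'])         #expect: 'bow'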
#Download and set up the 20newsgroups dataset:
python newsgroups.py
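- For intuition, an illustrative sketch of building a 20newsgroups BOW dictionary with scikit-learn; newsgroups.py may use different preprocessing, vocabulary size, and splits:
#Illustrative only; not the repo's newsgroups.py
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
train = fetch_20newsgroups(subset='train')
test  = fetch_20newsgroups(subset='test')
vectorizer   = CountVectorizer(max_features=2000, stop_words='english')  #example vocabulary size
train_counts = vectorizer.fit_transform(train.data)  #CSR matrix
test_counts  = vectorizer.transform(test.data)       #CSR matrix
Nvalid  = 1000                                        #hold out part of train as validation
dataset = {'train': train_counts[:-Nvalid],
           'valid': train_counts[-Nvalid:],
           'test':  test_counts,
           'dim_observations': train_counts.shape[1],
           'data_type': 'bow'}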
- Based on the Wikipedia data dump of Huang et al.
- Download and set up the dataset from raw Wikipedia text
#In <path>/inference_introspection/optvaedatasets/
python wikicorp.py #Downloads the dataset into wikicorp/
#Navigate to <path>/inference_introspection/optvaedatasets/wikicorp/
python
>>> import nltk
>>> nltk.download() #Follow instructions and download "wordnet" and "stopwords"
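- The same resources can also be fetched non-interactively with the standard NLTK downloader:
>>> import nltk
>>> nltk.download('wordnet')    #WordNet data used for lemmatization
>>> nltk.download('stopwords')  #English stopword lists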
#Parses the text and converts it into a BOW representation
python tokenizer.py
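- At a high level, the conversion resembles the following sketch (illustrative only; tokenizer.py defines the actual tokenization, lemmatization, and output format):
#Illustrative sketch: stopword removal + lemmatization + BOW with NLTK and scikit-learn
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
stop_set   = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def tokenize(doc):
    #Lowercase, keep alphabetic tokens, drop stopwords, lemmatize the rest
    tokens = re.findall(r'[a-z]+', doc.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_set]
docs = ["Anarchism is a political philosophy ...",
        "Autism is a developmental disorder ..."]   #placeholder documents
vectorizer = CountVectorizer(tokenizer=tokenize, lowercase=False)
bow = vectorizer.fit_transform(docs)                #CSR matrix, Ndocs x Nvocab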
- Limit the vocabulary (change parameters to get different variants of this dataset); a sketch of the idea follows the commands below
ipython trust *.ipynb
ipython notebook ProcessWikicorp-learning.ipynb
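- The basic idea, as a self-contained sketch on toy data (ProcessWikicorp-learning.ipynb uses its own criteria and parameters):
#Illustrative: keep only the Ntop most frequent words in the training split
import numpy as np
import scipy.sparse
#Toy dataset in the format described at the top of this section
dataset = {split: scipy.sparse.random(100, 50000, density=0.001, format='csr')
           for split in ['train', 'valid', 'test']}
dataset['dim_observations'] = 50000
dataset['data_type'] = 'bow'
Ntop        = 20000                                             #example vocabulary size
word_counts = np.asarray(dataset['train'].sum(axis=0)).ravel()  #per-word counts in train
keep        = np.argsort(word_counts)[::-1][:Ntop]              #most frequent words first
for split in ['train', 'valid', 'test']:
    dataset[split] = dataset[split][:, keep]
dataset['dim_observations'] = len(keep)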
- Final check to ensure the data can be loaded from Python
python wikicorp.py