For each dataset below, follow the provided instructions to set it up. To be used to train models, the data must be converted into a dictionary that (at the very minimum) has the following keys:
type(dataset)                #dict
dataset['train']             #CSR matrix, Ntrain x Nfeatures
dataset['valid']             #CSR matrix, Nvalid x Nfeatures
dataset['test']              #CSR matrix, Ntest x Nfeatures
dataset['dim_observations']  #scalar, Nfeatures
dataset['data_type']         #typically 'bow'
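- For reference, a minimal sketch of a dictionary in this format, built from hypothetical toy data (only the key names and shapes matter):
import numpy as np
from scipy.sparse import csr_matrix
Nfeatures = 2000
dataset = {}
dataset['train'] = csr_matrix(np.random.poisson(0.1, size=(500, Nfeatures)))  #toy counts
dataset['valid'] = csr_matrix(np.random.poisson(0.1, size=(100, Nfeatures)))  #toy counts
dataset['test']  = csr_matrix(np.random.poisson(0.1, size=(100, Nfeatures)))  #toy counts
dataset['dim_observations'] = Nfeatures
dataset['data_type'] = 'bow'
#Sanity check: required keys exist and feature dimensions agree
for key in ['train', 'valid', 'test', 'dim_observations', 'data_type']:
    assert key in dataset
for split in ['train', 'valid', 'test']:
    assert dataset[split].shape[1] == dataset['dim_observations']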
- Used in Miao et al.
#Navigate to <path>/inference_introspection/optvaedatasets/rcv2_miao/
python download.py
- Preprocess the data (this step requires Torch to be installed)
th preprocessing.lua
#Navigate to <path>/inference_introspection/optvaedatasets/
python rcv2.py
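- Once rcv2.py has run, the resulting dictionary should follow the format above. A quick interactive check, assuming a hypothetical loader function load_rcv2 (verify the actual name exposed by rcv2.py):
#Run from <path>/inference_introspection/optvaedatasets/
#`load_rcv2` is a hypothetical name; use whichever loader rcv2.py actually exposes
from rcv2 import load_rcv2
dataset = load_rcv2()
print(type(dataset))                #expect: dict
print(dataset['train'].shape)       #expect: (Ntrain, Nfeatures)
print(dataset['dim_observations'])  #expect: Nfeatures
print(dataset['data_type'])         #expect: 'bow'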
#Download and set up the 20newsgroups dataset:
python newsgroups.py
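- For intuition, an illustrative sketch of building a 20newsgroups BOW dictionary with scikit-learn; newsgroups.py may use different preprocessing, vocabulary size, and splits:
#Illustrative only; not the repo's newsgroups.py
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
train = fetch_20newsgroups(subset='train')
test  = fetch_20newsgroups(subset='test')
vectorizer   = CountVectorizer(max_features=2000, stop_words='english')  #example vocabulary size
train_counts = vectorizer.fit_transform(train.data)  #CSR matrix
test_counts  = vectorizer.transform(test.data)       #CSR matrix
Nvalid  = 1000                                        #hold out part of train as validation
dataset = {'train': train_counts[:-Nvalid],
           'valid': train_counts[-Nvalid:],
           'test':  test_counts,
           'dim_observations': train_counts.shape[1],
           'data_type': 'bow'}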
- Based on the Wikipedia data dump of Huang et al.
- Download and set up the dataset from raw Wikipedia text
#In <path>/inference_introspection/optvaedatasets/
python wikicorp.py #Downloads the dataset into wikicorp/
#Navigate to <path>/inference_introspection/optvaedatasets/wikicorp/
python
>>> import nltk
>>> nltk.download() #Follow instructions and download "wordnet" and "stopwords"
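- The same resources can also be fetched non-interactively with the standard NLTK downloader:
>>> import nltk
>>> nltk.download('wordnet')    #WordNet data used for lemmatization
>>> nltk.download('stopwords')  #English stopword lists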
#Parses the text and converts it into a BOW representation
python tokenizer.py
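- At a high level, the conversion resembles the following sketch (illustrative only; tokenizer.py defines the actual tokenization, lemmatization, and output format):
#Illustrative sketch: stopword removal + lemmatization + BOW with NLTK and scikit-learn
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
stop_set   = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def tokenize(doc):
    #Lowercase, keep alphabetic tokens, drop stopwords, lemmatize the rest
    tokens = re.findall(r'[a-z]+', doc.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_set]
docs = ["Anarchism is a political philosophy ...",
        "Autism is a developmental disorder ..."]   #placeholder documents
vectorizer = CountVectorizer(tokenizer=tokenize, lowercase=False)
bow = vectorizer.fit_transform(docs)                #CSR matrix, Ndocs x Nvocab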
- Limit the vocabulary (change parameters to get different variants of this dataset); a sketch of the idea follows the commands below
ipython trust *.ipynb
ipython notebook ProcessWikicorp-learning.ipynb
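- The basic idea, as a self-contained sketch on toy data (ProcessWikicorp-learning.ipynb uses its own criteria and parameters):
#Illustrative: keep only the Ntop most frequent words in the training split
import numpy as np
import scipy.sparse
#Toy dataset in the format described at the top of this section
dataset = {split: scipy.sparse.random(100, 50000, density=0.001, format='csr')
           for split in ['train', 'valid', 'test']}
dataset['dim_observations'] = 50000
dataset['data_type'] = 'bow'
Ntop        = 20000                                             #example vocabulary size
word_counts = np.asarray(dataset['train'].sum(axis=0)).ravel()  #per-word counts in train
keep        = np.argsort(word_counts)[::-1][:Ntop]              #most frequent words first
for split in ['train', 'valid', 'test']:
    dataset[split] = dataset[split][:, keep]
dataset['dim_observations'] = len(keep)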
- Final check to ensure the data can be loaded from Python
python wikicorp.py