GitHub - yqsong/DatalessClassification: Dataless classification with more experiments for AIJ


This package is a Java library that implements Hierarchical Dataless Text Classification using different representations. The representations are mainly constructed from the English Wikipedia dump:

http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

This is a very large file (>10GB) and will take time to download.

Some of the processed data can be found here:

http://cogcomp.cs.illinois.edu/page/resource_view/97

--Update 20190412--
Copies of some of the files are available here:
http://home.cse.ust.hk/~yqsong/uiuc_backup/WikiLuceneIndex_word500_link30.zip
http://home.cse.ust.hk/~yqsong/uiuc_backup/wikiIndexLucene3.0.2_vivek.zip
http://home.cse.ust.hk/~yqsong/uiuc_backup/vectors-enwikitext_vivek200.zip
http://home.cse.ust.hk/~yqsong/uiuc_backup/MemoryBasedESA.zip
http://home.cse.ust.hk/~yqsong/uiuc_backup/wikiPageIDMapping.txt
--End of Update 20190412--

The processed data contains the following files:

1. Wikipedia index simple version (11G): used for edu.illinois.cs.cogcomp.classification.representation.esa.simple.SimpleESALocal

2. Wikipedia index complex version (7G): intermediate index file for edu.illinois.cs.cogcomp.classification.representation.indexer.complex.IndexWords1stStep

3. Wikipedia structured data (3G): used for edu.illinois.cs.cogcomp.classification.representation.esa.complex.DiskBasedComplexESA

4. Wikipedia word inverted index based on complex method (7G): used for edu.illinois.cs.cogcomp.classification.representation.esa.complex.DiskBasedComplexESA

5. Memory based ESA: caching each word representation (0.4G): used for edu.illinois.cs.cogcomp.classification.representation.esa.complex.MemoryBasedESA

6. Word embedding index trained based on Wikipedia (5G): used for edu.illinois.cs.cogcomp.classification.representation.word2vec.DiskBasedWordEmbedding

7. Original word2vec representation in raw format (2.2G): used for edu.illinois.cs.cogcomp.classification.representation.word2vec.MemoryBasedWordEmbedding

8. Word2vec representation for Wikipedia titles (1.4G): used for edu.illinois.cs.cogcomp.classification.densification.representation.SparseSimilarityCondensation

Always configure the paths of source files in this file:

conf/configurations.properties

used by this class:

edu.illinois.cs.cogcomp.classification.main.DatalessResourcesConfig (always create a new instance of this class before running anything else).
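
Since the resource paths all come from conf/configurations.properties, it can help to check them before a long run. The following is a minimal sketch that uses only the standard java.util.Properties API to load the file and list entries whose values do not exist on disk. The class name ConfigPathCheck is hypothetical and not part of this package, and the sketch assumes that every property value is a file or directory path, which may not hold for the real file. It does not call DatalessResourcesConfig, whose constructor and fields are not described in this README.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the package.
public class ConfigPathCheck {

    /** Loads a .properties file and returns the keys whose values are not existing paths. */
    public static List<String> missingPaths(Path propertiesFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(propertiesFile)) {
            props.load(in);
        }
        List<String> missing = new ArrayList<>();
        // Iterate over sorted keys so the report order is stable.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            String value = props.getProperty(key).trim();
            // Assumption: every value is meant to be a file or directory path.
            if (value.isEmpty() || !Files.exists(Paths.get(value))) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Default location taken from this README; override with the first argument.
        Path conf = Paths.get(args.length > 0 ? args[0] : "conf/configurations.properties");
        for (String key : missingPaths(conf)) {
            System.out.println("Path not found for property: " + key);
        }
    }
}
```

Run it from the repository root so that the default relative path resolves; an empty output means every configured value points to something that exists.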

Then for general text classification, modify and try

edu.illinois.cs.cogcomp.classification.main.Pipeline
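
For orientation, the general idea behind dataless classification is to represent both the document and each label description in the same semantic space (for example ESA concepts or word embeddings) and to pick the label whose vector is most similar to the document's. The toy sketch below illustrates that idea for a flat, single-level label set with hand-written sparse vectors and cosine similarity. DatalessToy, its methods, the concept names, and the weights are all made up for illustration; this is not the API of Pipeline or of any other class in this package.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only; names and numbers are hypothetical.
public class DatalessToy {

    /** Builds a sparse vector from alternating (String key, Number weight) arguments. */
    public static Map<String, Double> vec(Object... kv) {
        Map<String, Double> v = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            v.put((String) kv[i], ((Number) kv[i + 1]).doubleValue());
        }
        return v;
    }

    /** Cosine similarity between two sparse vectors; 0 if either is all zeros. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Returns the label whose vector is most similar to the document vector. */
    public static String classify(Map<String, Double> doc,
                                  Map<String, Map<String, Double>> labels) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : labels.entrySet()) {
            double score = cosine(doc, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // No labeled training documents: only vectors for the label descriptions.
        Map<String, Map<String, Double>> labels = new LinkedHashMap<>();
        labels.put("rec.sport.hockey", vec("Ice hockey", 0.9, "Hockey puck", 0.4));
        labels.put("sci.space", vec("NASA", 0.9, "Orbit", 0.5));

        Map<String, Double> doc = vec("Ice hockey", 0.8, "Goaltender", 0.3);
        System.out.println(classify(doc, labels));
    }
}
```

In the package itself, the vectors would come from one of the representation classes listed above rather than being written by hand.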

---------------------------------------------------------------

To replicate the experiments for AAAI14, prepare the data here:

edu.illinois.cs.cogcomp.classification.hierarchy.run.preparedata.newsgroups

and run the classification here:

edu.illinois.cs.cogcomp.classification.hierarchy.run.ml.newsgroups

Citation: On Dataless Hierarchical Text Classification. Y. Song and D. Roth. AAAI. 2014.
---------------------------------------------------------------

To replicate the experiments for NAACL15, test the code here:

edu.illinois.cs.cogcomp.classification.densification.run

Citation: Unsupervised Sparse Vector Densification for Short Text Similarity. Y. Song and D. Roth. NAACL. 2015.
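
To give a feel for why densification helps with short texts: two sparse vectors with no shared dimensions have a plain cosine of zero, even when their dimensions are closely related. If each dimension (for example a Wikipedia title, as in file 8 above) has a dense embedding, the dimensions can be matched softly instead. The sketch below shows one simple soft-matching scheme, a weighted average of best matches in embedding space, made symmetric by averaging both directions. It is an illustration under stated assumptions and not necessarily the formulation used in the paper or in SparseSimilarityCondensation; the class name DensifiedSimilarityToy, the method names, and all vectors are hypothetical, and the weights are assumed to be non-negative.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only; not the package's implementation.
public class DensifiedSimilarityToy {

    /** Cosine similarity between two dense vectors of equal length. */
    public static double cosine(double[] u, double[] v) {
        double dot = 0, nu = 0, nv = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            nu += u[i] * u[i];
            nv += v[i] * v[i];
        }
        return (nu == 0 || nv == 0) ? 0.0 : dot / Math.sqrt(nu * nv);
    }

    /**
     * For each dimension of x, finds its best match among the dimensions of y
     * in embedding space, then returns the average of those best matches
     * weighted by x's (non-negative) weights.
     */
    public static double softMatch(Map<String, Double> x, Map<String, Double> y,
                                   Map<String, double[]> emb) {
        double total = 0, weightSum = 0;
        for (Map.Entry<String, Double> ex : x.entrySet()) {
            double[] u = emb.get(ex.getKey());
            if (u == null) continue; // dimension without an embedding is skipped
            double best = 0.0;
            for (String ky : y.keySet()) {
                double[] v = emb.get(ky);
                if (v != null) best = Math.max(best, cosine(u, v));
            }
            total += ex.getValue() * best;
            weightSum += ex.getValue();
        }
        return weightSum == 0 ? 0.0 : total / weightSum;
    }

    /** Symmetric similarity: average of the two directed soft matches. */
    public static double similarity(Map<String, Double> x, Map<String, Double> y,
                                    Map<String, double[]> emb) {
        return 0.5 * (softMatch(x, y, emb) + softMatch(y, x, emb));
    }

    public static void main(String[] args) {
        Map<String, double[]> emb = new HashMap<>();
        emb.put("Ice hockey", new double[] {1.0, 0.0});
        emb.put("Hockey puck", new double[] {0.9, 0.1});
        emb.put("NASA", new double[] {0.0, 1.0});

        Map<String, Double> x = new HashMap<>();
        x.put("Ice hockey", 1.0);
        Map<String, Double> y = new HashMap<>();
        y.put("Hockey puck", 1.0);

        // x and y share no dimension, yet their soft similarity is high.
        System.out.println(similarity(x, y, emb));
    }
}
```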
---------------------------------------------------------------

Here is an example script for running a class in the package:

./script/20NGTestDataless.sh
