This is a prototype implementation of Continuous Skip-gram Models (CSGM) in plain Java using the Fork/Join framework. It was developed as part of a student project at Hochschule der Medien Stuttgart. When applied to the first 50k articles of the German Wikipedia, it yields competitive throughput compared to gensim:

| \|v\|=100 | gensim (numpy) | gensim (cython) | word2vec4j | gensim (BLAS) |
|---|---|---|---|---|
| kwords/sec | 0.16 | 180.11 | 205.11 | 309.87 |
| docs/sec | 0.11 | 138.75 | 145.19 | 238.28 |

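
To give a rough idea of how the Fork/Join framework can spread skip-gram work over several cores, here is a minimal sketch. It is not the code of this project: the class `ForkJoinSketch`, the task `PairCountTask`, the method `countPairs` and the `THRESHOLD` value are hypothetical names chosen for illustration, and a fixed context window is assumed. The sketch only counts the (center, context) pairs a corpus would produce, without training any vectors.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch only; all names here are hypothetical, not this project's API.
class ForkJoinSketch {

    /** Counts skip-gram (center, context) pairs for the documents in [from, to). */
    static class PairCountTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 2; // assumed split size, for illustration
        private final List<String> docs;
        private final int from, to, window;

        PairCountTask(List<String> docs, int from, int to, int window) {
            this.docs = docs;
            this.from = from;
            this.to = to;
            this.window = window;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: process the slice sequentially.
                long pairs = 0;
                for (int d = from; d < to; d++) {
                    String doc = docs.get(d).trim();
                    int n = doc.isEmpty() ? 0 : doc.split("\\s+").length;
                    for (int i = 0; i < n; i++) {
                        // Context words to the left and right of position i.
                        pairs += Math.min(i, window) + Math.min(n - 1 - i, window);
                    }
                }
                return pairs;
            }
            // Otherwise split the range in half and process both halves in parallel.
            int mid = (from + to) >>> 1;
            PairCountTask left = new PairCountTask(docs, from, mid, window);
            PairCountTask right = new PairCountTask(docs, mid, to, window);
            left.fork();                      // schedule the left half asynchronously
            long rightPairs = right.compute(); // compute the right half in this thread
            return left.join() + rightPairs;
        }
    }

    static long countPairs(List<String> docs, int window) {
        return ForkJoinPool.commonPool()
                .invoke(new PairCountTask(docs, 0, docs.size(), window));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b c", "d e", "f", "g h i j");
        System.out.println(countPairs(docs, 1));
    }
}
```

A real trainer would update word vectors in the leaf case instead of counting pairs, but the recursive split, `fork()` and `join()` pattern would stay the same.
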
This project is currently just a proof of concept. There are still paths to local files and folders specific to my machine, the tests are crappy, and it lacks documentation. It is really raw, but it will be refined into a full-fledged library in the future. Stay tuned!