[SYSTEMDS-3179] Add GloVe implementation #2201
GloVe implementation in SystemDS, based on mittens (https://github.com/roamanalytics/mittens/tree/master).
I suggest editing the PR's title to indicate that it is related to the issue [SYSTEMDS-3179], for example, #2206. Moreover, I suggest indicating that this PR depends on accepting #2200. Otherwise, it is not possible to import |
Improve the gloveWithCoocMatrix function by incorporating an epsilon value for initialization and implementing a tolerance threshold to mitigate overfitting.
Implement cosine similarity and accuracy computation for word embeddings
- Added `cosine_similarity` function to compute pairwise cosine similarity between word embeddings.
- Implemented `get_top` function to retrieve the top-k most similar word embeddings for each word.
- Created `accuracy` function to evaluate the overlap of top-k nearest neighbors between two sets of word embeddings.
- Utilized matrix operations for efficient computation of similarity scores.
This implementation aids in evaluating the GloVe models by measuring similarity and accuracy.
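For readers who want to see the evaluation idea outside of DML, the three helpers described in this commit can be sketched in plain Python. This is only an illustration under assumptions: the function names mirror the commit message, but the signatures, the exclusion of a word from its own neighbor list, and the overlap-based accuracy formula are guesses for the sake of the example, not the code in this PR.

```python
import math

def cosine_similarity(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in embeddings]
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            denom = norms[i] * norms[j]
            # Zero vectors get similarity 0 instead of a division error.
            sim[i][j] = dot / denom if denom > 0 else 0.0
    return sim

def get_top(sim, k):
    """For each word i, the indices of its k most similar other words."""
    top = []
    for i, row in enumerate(sim):
        # Assumption: a word is not counted as its own neighbor.
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        top.append(candidates[:k])
    return top

def accuracy(top_a, top_b):
    """Fraction of top-k neighbors shared between two neighbor lists."""
    hits = sum(len(set(a) & set(b)) for a, b in zip(top_a, top_b))
    total = sum(len(a) for a in top_a)
    return hits / total if total else 0.0

# Toy usage: three 2-d "embeddings", compared against themselves.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbors = get_top(cosine_similarity(vectors), k=1)
print(neighbors, accuracy(neighbors, neighbors))
```

The explicit loops are for readability only; the DML version described above relies on matrix operations for the same computation.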
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2201 +/- ##
============================================
+ Coverage 72.28% 72.37% +0.09%
- Complexity 44986 45316 +330
============================================
Files 1452 1466 +14
Lines 169310 170507 +1197
Branches 33038 33241 +203
============================================
+ Hits 122389 123411 +1022
- Misses 37617 37710 +93
- Partials 9304 9386 +82
Change the glove function to be compatible with the co-occurrence matrix script
Add GloVe script in Builtins.java
- This code first computes the cosine similarity of each pair of words in the GloVe result.
- In the get_top function, the top k most similar words for each word are computed.
- The result of this script is used for testing.
- This data is used to test the DML script results for GloVe word embedding.
- This file contains the top 10 most similar words for each word in the GloVe word embedding, based on mittens (https://github.com/roamanalytics/mittens/tree/master).
- The test dataset is provided under test/resources in '20news/20news_subset_untokenized.csv'.
- This test first runs the DML script to generate the top K most similar words for each word in the GloVe word embedding.
- Then, it computes the accuracy of the DML results based on the hits of the most similar words for the entire vocabulary, comparing the expected results with the DML output.
- To validate the correctness of our GloVe word embedding implementation, we employ a Controlled Overfitting Validation approach.
- This methodology addresses the inherent challenge of testing stochastic algorithms, where random initialization typically prevents direct output comparison between different runs or implementations.
Deleted the GloVe test script because it was located in the wrong folder. The corresponding test has been added to the correct directory.
This PR introduces a matrix-based implementation of GloVe (Global Vectors for Word Representation) for SystemDS, enabling the computation of word embeddings using co-occurrence statistics. The implementation follows the original GloVe paper and is based on the open-source implementation mittens.
It uses matrix operations in DML for weight computation and gradient updates, and it adds momentum-based updates to stabilize training.
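As background for the weight computation mentioned above, here is a minimal Python sketch of the weighting function and weighted least-squares objective from the original GloVe paper. The defaults `x_max=100` and `alpha=0.75` are the paper's values; the function names, the argument layout, and the absence of any scaling factor on the loss are assumptions for illustration and may differ from the DML script in this PR. The momentum-based update is not shown.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, W, C, b_w, b_c, x_max=100.0, alpha=0.75):
    """Sum over nonzero X_ij of f(X_ij) * (w_i . c_j + b_i + b_j - log X_ij)^2.

    cooc: co-occurrence counts (list of rows)
    W, C: word and context embeddings (lists of vectors)
    b_w, b_c: word and context biases
    """
    loss = 0.0
    for i, row in enumerate(cooc):
        for j, x in enumerate(row):
            if x <= 0:
                continue  # zero counts contribute nothing to the objective
            dot = sum(a * b for a, b in zip(W[i], C[j]))
            diff = dot + b_w[i] + b_c[j] - math.log(x)
            loss += glove_weight(x, x_max, alpha) * diff * diff
    return loss

# Toy usage: a 2x2 co-occurrence matrix with all-zero parameters.
cooc = [[0.0, 5.0], [5.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(glove_loss(cooc, zeros, zeros, [0.0, 0.0], [0.0, 0.0]))
```

A matrix-based implementation would compute the same quantity with elementwise operations over the whole co-occurrence matrix rather than nested loops.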
The PR depends on accepting #2200. However, users can run the gloveWithCoocMatrix function independently, provided they have a pre-computed co-occurrence matrix and an index file.