[SYSTEMDS-3179] Add GloVe implementation #2201

Open
wants to merge 9 commits into
base: main


@xixuanzhang2022 xixuanzhang2022 commented Jan 30, 2025

This PR introduces a matrix-based implementation of GloVe (Global Vectors for Word Representation) for SystemDS, enabling the computation of word embeddings using co-occurrence statistics. The implementation follows the original GloVe paper and is based on the open-source implementation mittens.

It uses efficient matrix operations for weight computation and gradient updates, and integrates adaptive learning through momentum-based updates in DML for stability.
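For readers unfamiliar with the objective, here is a minimal sketch of the weighting function and weighted least-squares loss from the original GloVe paper. It is plain Python for illustration and is not the DML code in this PR. The names `glove_weight` and `glove_loss` are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are the paper's values, which this PR may or may not use.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from the GloVe paper (defaults are the paper's, assumed here)."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooc, W, C, bw, bc, x_max=100.0, alpha=0.75):
    """Sum of f(X_ij) * (w_i . c_j + bw_i + bc_j - log X_ij)^2 over non-zero co-occurrences.

    cooc is a dense list-of-lists co-occurrence matrix, W and C are the word
    and context embedding matrices, bw and bc are the bias vectors.
    """
    total = 0.0
    for i, row in enumerate(cooc):
        for j, x in enumerate(row):
            if x <= 0:
                continue  # zero counts contribute nothing (log is undefined there)
            dot = sum(a * b for a, b in zip(W[i], C[j]))
            err = dot + bw[i] + bc[j] - math.log(x)
            total += glove_weight(x, x_max, alpha) * err * err
    return total
```

A matrix-based version would compute the same quantities with element-wise operations over the whole co-occurrence matrix instead of explicit loops.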

The PR depends on accepting #2200. However, users can run the gloveWithCoocMatrix function independently, provided they have a pre-computed co-occurrence matrix and an index file.

@cmcuza
Contributor

cmcuza commented Feb 1, 2025

I suggest editing the PR's title to indicate that it is related to the issue [SYSTEMDS-3179], as done, for example, in #2206. I also suggest stating that this PR depends on accepting #2200; otherwise, it is not possible to import scripts/builtin/cooccur.dml. Finally, I suggest adding proper tests for the proposed GloVe implementation.

@xixuanzhang2022 xixuanzhang2022 changed the title add GloVe implementation [SYSTEMDS-3179] add GloVe implementation Feb 1, 2025
Improve the gloveWithCoocMatrix function by incorporating an epsilon value for initialization and implementing a tolerance threshold to mitigate overfitting.
Implement cosine similarity and accuracy computation for word embeddings

- Added `cosine_similarity` function to compute pairwise cosine similarity between word embeddings.
- Implemented `get_top` function to retrieve the top-k most similar word embeddings for each word.
- Created `accuracy` function to evaluate the overlap of top-k nearest neighbors between two sets of word embeddings.
- Utilized matrix operations for efficient computation of similarity scores.

This implementation aids in evaluating the GloVe models by measuring similarity and accuracy.
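
As an illustration of the first two functions listed above, the following is a minimal sketch in plain Python, not the DML code from this commit. The function names mirror the commit message, but the signatures are assumptions, as is the choice to exclude each word from its own neighbor list.

```python
import math


def cosine_similarity(emb):
    """Pairwise cosine similarity between the rows of emb (a list of vectors)."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in emb]
    n = len(emb)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = norms[i] * norms[j]
            if denom > 0:  # all-zero vectors keep similarity 0.0
                sim[i][j] = sum(a * b for a, b in zip(emb[i], emb[j])) / denom
    return sim


def get_top(sim, k):
    """For each row, the indices of the k most similar other rows, best first."""
    top = []
    for i, row in enumerate(sim):
        others = [j for j in range(len(row)) if j != i]  # assumption: skip the word itself
        others.sort(key=lambda j: row[j], reverse=True)
        top.append(others[:k])
    return top
```

For example, `get_top(cosine_similarity(embeddings), 10)` would give the ten nearest neighbors of every word by index. A matrix formulation would normalize the rows once and take a single matrix product.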
@xixuanzhang2022 xixuanzhang2022 changed the title [SYSTEMDS-3179] add GloVe implementation [SYSTEMDS-3179] Add GloVe implementation Feb 3, 2025

codecov bot commented Feb 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.37%. Comparing base (f7af63f) to head (efeadc8).
Report is 11 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2201      +/-   ##
============================================
+ Coverage     72.28%   72.37%   +0.09%     
- Complexity    44986    45316     +330     
============================================
  Files          1452     1466      +14     
  Lines        169310   170507    +1197     
  Branches      33038    33241     +203     
============================================
+ Hits         122389   123411    +1022     
- Misses        37617    37710      +93     
- Partials       9304     9386      +82     


saminbassiri and others added 6 commits February 4, 2025 07:16
Change glove function to be compatible with occurrence Matrix script
Add GloVe script in Builtins.java
- This code first computes the cosine similarity of each pair of words in the GloVe result.
- In the get_top function, the top k most similar words for each word are computed.
- The result of this script is used for testing.
- This data is used to test the DML script results for GloVe word embedding.
- This file contains the top 10 most similar words for each word in the GloVe word embedding, based on the mittens implementation (https://github.com/roamanalytics/mittens/tree/master).
- The test dataset is provided under test/resources in '20news/20news_subset_untokenized.csv'.
- This test first runs the DML script to generate the top K most similar words for each word in the GloVe word embedding.  
- Then, it computes the accuracy of the DML results based on the hits of the most similar words for the entire vocabulary, comparing the expected results with the DML output.  
- To validate the correctness of our GloVe word embedding implementation, we employ a Controlled Overfitting Validation approach.  
- This methodology addresses the inherent challenge of testing stochastic algorithms, where random initialization typically prevents direct output comparison between different runs or implementations.
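
The hit-based accuracy described above can be sketched as follows. This is plain Python for illustration and one plausible reading of the metric: the fraction of expected neighbors that also appear in the DML output, pooled over the whole vocabulary. The name `topk_accuracy` and the dictionary layout are hypothetical, and the exact definition used by the test may differ.

```python
def topk_accuracy(expected, actual):
    """Fraction of expected top-k neighbors that also appear in the actual top-k lists.

    expected and actual map each word to its list of most similar words.
    The order within a list is ignored; only membership counts as a hit.
    """
    hits = 0
    total = 0
    for word, exp_neighbors in expected.items():
        got = set(actual.get(word, []))  # a missing word yields zero hits
        hits += sum(1 for w in exp_neighbors if w in got)
        total += len(exp_neighbors)
    return hits / total if total else 0.0
```

Comparing neighbor sets rather than raw vectors is what makes the check usable across implementations, since two runs can produce different embeddings that still agree on which words are close.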
Deleted the script for glove test since it is located in the wrong folder. Corresponding test added to the right directory.
Labels
None yet
Projects
Status: In Progress

3 participants