Support for MWE Schemes #38

pegasus-lynx · 2021-07-31T20:27:07Z

Hello Team,

We have been working on adding support for n-grams and skip-grams to nlcodec. This PR does the following:

Adds the following schemes: NgramScheme, SkipgramScheme, and MWEScheme (which combines NgramScheme and SkipgramScheme).
Adds a list of PMI variants so that tokens can be merged on the basis of metrics other than frequency.

The implementation cannot be merged into the code base directly, as the code is not yet properly factored. However, we would like you to review it and see how well it aligns with the purpose of the codec.
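
For readers unfamiliar with the terms, here is a minimal, self-contained sketch of n-gram extraction, skip-gram extraction, and plain PMI scoring of adjacent token pairs. It is only an illustration: the function names (`ngrams`, `skipgrams`, `pmi_scores`) and the `min_freq` parameter are hypothetical and are not the API added by this PR, and the PMI variants in the PR may be defined differently.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens: Sequence[str], skip: int = 1) -> List[Tuple[str, str]]:
    """Return token pairs with exactly `skip` tokens between them."""
    gap = skip + 1
    return [(tokens[i], tokens[i + gap]) for i in range(len(tokens) - gap)]


def pmi_scores(sentences: List[List[str]],
               min_freq: int = 1) -> Dict[Tuple[str, str], float]:
    """Plain PMI, log(p(a, b) / (p(a) * p(b))), for adjacent token pairs.

    Assumption: probabilities are simple relative frequencies over the
    given sentences; this is not necessarily any variant used in the PR.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for toks in sentences:
        unigrams.update(toks)
        bigrams.update(ngrams(toks, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), freq in bigrams.items():
        if freq < min_freq:
            continue  # skip rare pairs
        p_ab = freq / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


if __name__ == "__main__":
    corpus = [["new", "york"], ["new", "york"], ["new", "car"]]
    print(ngrams(["a", "b", "c"], 2))
    print(skipgrams(["a", "b", "c", "d"], skip=1))
    print(pmi_scores(corpus))
```

A frequency-based merge would rank candidate pairs by raw count alone, whereas a PMI-style score discounts pairs whose parts are individually very common.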

thammegowda · 2021-08-01T20:15:24Z

Hi @pegasus-lynx thanks for the pull request.

I will gladly review it, and look forward to merging it.
This week is a bit busy because ACL is going on, so I will get back to this in a few days.

Thanks again for submitting the PR! It's looking good.

thammegowda · 2021-09-07T08:11:14Z

Hey @pegasus-lynx
Sorry for the long delay.

I tried to test these new features, but I could not figure out how to use the API.
I need a simple example of how to use ngram, skipgram, and MWE.

A simple test case for creating vocab would do; here is an example:

nlcodec/tests/test_codec.py

Lines 20 to 39 in b99d023

    def test_bpe():
        vocab_size = 6000
        args = dict(inp=IO.read_as_stream(paths=[en_txt, fr_txt]),
                    level='bpe',
                    vocab_size=vocab_size,
                    min_freq=1,
                    term_freqs=False,
                    char_coverage=0.99999,
                    min_co_ev=2)
        with tempfile.TemporaryDirectory() as tmpdir:
            model_file = Path(tmpdir) / 'model.tsv'
            args['model'] = model_file
            table = nlc.learn_vocab(**args)
            assert len(table) == vocab_size
            table2, meta = nlc.Type.read_vocab(model_file)
            assert len(table2) == len(table)
            table_str = '\n'.join(x.format() for x in table)
            table2_str = '\n'.join(x.format() for x in table2)
            assert table_str == table2_str

AND/OR a sample usage e.g. https://github.com/isi-nlp/nlcodec/blob/master/docs/intro.adoc#python-api so we

Thanks.

Once again, I apologize for the delay.

thammegowda · 2021-12-23T23:44:09Z

@pegasus-lynx let me know if this is ready for a review!

pegasus-lynx · 2022-01-09T06:53:55Z

Hey @thammegowda, sorry for the delay. We have moved in a different direction with this branch. I will close this PR and create a new PR for this. As of now, the SkipScheme has some issues in decoding, so it is still not ready for review.

pegasus-lynx added 2 commits July 27, 2021 08:43

Added ngram and skipgram schemes

d1b2e76

Fixed circular import

9836984

pegasus-lynx added 4 commits December 16, 2021 19:58

Refactored code

5af79ea

Merge branch 'master' into mwe_schemes

a85dad9

Fixes for pip install -e . to work

ceca7b0

Added get_scheme function

88b261b

pegasus-lynx added 6 commits January 6, 2022 02:07

Fixed naive pmi func

8c1c77c

Added try catch to get the error while decoding

54c25d1

Disabled bpe learn kwargs to pass by

14fc9fd

Added Ext MWE Scheme. Fixed decode in skip scheme

bd9989e

Bug Fix in Ext MWE Scheme

8693928

Fixed encoding and decoding for Ext MWE

5db8a7d

pegasus-lynx added 12 commits January 13, 2022 05:04

Fixed skip scheme decoding and changes to ext mwe

f8cffbc

Changes to ExtMWE Scheme for variable lists

1fa75a1

Fix in merge_types_list

48644aa

Added kids to the types

e18a6ab

Fixed stochastic split function

886d6cb

Fixed decoding scheme for skipgrams

55bdef1

Added debugging in decode str

760955f

Fixed kids for bpe tokens

9681a84

Fixed kids in extmwe-scheme

850ff1a

Fixing ExtMweScheme loading

a2f715f

Allow max_mwes parameter

a37b499

Fixed working for max_mwes in ExtMWEScheme

1bd4280
