Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for MWE Schemes #38

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

pegasus-lynx
Copy link

Hello Team,

We have been trying to add the support for ngrams and skipgrams into nlcodec. The PR above does the following:

  • Adds the following schemes : NgramScheme , SkipgramScheme , MWEScheme ( combines both NgramScheme and SkipgramScheme )
  • Added a list of PMI variants to allow merging of tokens on the basis of the metrics other than freq.

The implementation for the above cannot directly be merged in the code base as the code is not properly factored. However, we would like you to review the code and see how does it align with the purpose of the codec.

@thammegowda
Copy link
Member

Hi @pegasus-lynx thanks for the pull request.

I will gladly review it, and look forward to merging it.
This week is a little busy one as the ACL is going on ... I will get back to this after a few days.

Thanks again for submitting the PR! Its looking good.

@thammegowda
Copy link
Member

Hey @pegasus-lynx
Sorry for the long delay.

I tried to test these new features. But I am clueless regarding how to use this API.
I need a simple example of how to use ngram, skipgram and MWE.

A simple test case for creating vocab would do; here is an example:

def test_bpe():
vocab_size = 6000
args = dict(inp=IO.read_as_stream(paths=[en_txt, fr_txt]),
level='bpe',
vocab_size=vocab_size,
min_freq=1,
term_freqs=False,
char_coverage=0.99999,
min_co_ev=2)
with tempfile.TemporaryDirectory() as tmpdir:
model_file = Path(tmpdir) / 'model.tsv'
args['model'] = model_file
table = nlc.learn_vocab(**args)
assert len(table) == vocab_size
table2, meta = nlc.Type.read_vocab(model_file)
assert len(table2) == len(table)
table_str = '\n'.join(x.format() for x in table)
table2_str = '\n'.join(x.format() for x in table2)
assert table_str == table2_str

AND/OR a sample usage e.g. https://github.com/isi-nlp/nlcodec/blob/master/docs/intro.adoc#python-api so we

Thanks.

Once again, I apologize for the delay.

@thammegowda
Copy link
Member

@pegasus-lynx let me know if this is ready for a review!

@pegasus-lynx
Copy link
Author

Hey @thammegowda, sorry for the delay. We have moved in a different direction with this branch. I will close this PR and create a new PR for this. As of now, the SkipScheme has some issues in decoding, so it is still not ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants