v0.3.9:
- byte-level BPE support
- remove support for Python 2
v0.3.8:
- multiprocessing support (get_vocab and apply_bpe)
- progress bar for learn_bpe
- seed parameter for deterministic BPE dropout
- ignore some Unicode line separators that would otherwise crash subword-nmt
v0.3.7:
- BPE dropout (Provilkov et al., 2019)
- more efficient glossaries (#69)
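
The BPE dropout and seed entries above can be illustrated with a minimal, self-contained sketch. This is not subword-nmt's implementation: the function name `apply_bpe_with_dropout`, its parameters, and the toy merge list are hypothetical, and end-of-word markers are omitted for brevity. Each candidate merge is skipped with probability `dropout`, and passing a seeded random generator makes the segmentation reproducible.

```python
import random


def apply_bpe_with_dropout(word, merges, dropout=0.0, rng=None):
    """Segment `word` with a ranked merge list (hypothetical helper).

    Each applicable merge is skipped with probability `dropout` on every
    pass, in the spirit of BPE dropout (Provilkov et al., 2019).
    """
    rng = rng or random
    # Earlier merges in the list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect adjacent pairs that are known merges and survive dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and (dropout == 0.0 or rng.random() >= dropout)
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Toy merge list, assumed for illustration only.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]

# Without dropout the segmentation is deterministic.
plain = apply_bpe_with_dropout("lower", toy_merges)

# With dropout, a seeded generator makes the result reproducible.
first = apply_bpe_with_dropout("lower", toy_merges, 0.5, random.Random(42))
second = apply_bpe_with_dropout("lower", toy_merges, 0.5, random.Random(42))
```

With `dropout=0.0` the sketch reduces to ordinary greedy BPE segmentation. With `dropout=1.0` every merge is skipped and the word falls back to single characters.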
v0.3.6:
- fix to subword-bpe command encoding
v0.3.5:
- fix to subword-bpe command under Python 2
- wider support of --total-symbols argument
v0.3.4:
- segment_tokens method to improve library usability (#52)
- support regex glossaries (#56)
- allow unicode separators (#57)
- new option --total-symbols in learn-bpe (commit 61ad8)
- fix documentation (best practices) (#60)
v0.3:
- library is now installable via pip
- fix occasional problems with UTF-8 whitespace and newlines in learn_bpe and apply_bpe:
  - do not silently convert UTF-8 newline characters into "\n"
  - do not silently convert UTF-8 whitespace characters into " "
  - UTF-8 whitespace and newline characters are now considered part of a word, and segmented by BPE
v0.2:
- different, more consistent handling of end-of-word token (commit a749a7) (#19)
- allow passing of vocabulary and frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)
- made learn_bpe.py deterministic (commit 4c54e)
- various changes to make handling of UTF more consistent between Python versions
- new command line arguments for apply_bpe.py:
  - '--glossaries' to prevent given strings from being affected by BPE
  - '--merges' to apply a subset of learned BPE operations
- new command line arguments for learn_bpe.py:
  - '--dict-input': rather than a raw text file, interpret the input as a frequency dictionary (as created by get_vocab.py)
v0.1:
- consistent cross-version unicode handling
- all scripts are now deterministic