Commit 7d02027

Instructions for generating the tokenizer configs for marian-mt. (huggingface#1225)
1 parent 392a00a commit 7d02027

2 files changed

+1404
-0
lines changed

candle-examples/examples/marian-mt/README.md

+19
@@ -17,3 +17,22 @@ cargo run --example marian-mt --release -- \
I know you are waiting for me. I will go through the forest, I will go through the
mountain. I cannot stay far from you any longer.</s>
```

## Generating the tokenizer.json files

You can use the following script to generate the `tokenizer.json` config files
from the hf-hub repos. This requires the `transformers`, `tokenizers`, and
`sentencepiece` packages to be installed, as well as the
`convert_slow_tokenizer.py` script from this directory.

```python
from convert_slow_tokenizer import MarianConverter
from transformers import AutoTokenizer


# Load the slow (sentencepiece-based) tokenizer from the hub.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en", use_fast=False)
# index=0 is saved as the French tokenizer for this fr-en model.
fast_tokenizer = MarianConverter(tokenizer, index=0).converted()
fast_tokenizer.save("tokenizer-marian-base-fr.json")
# index=1 is saved as the English tokenizer.
fast_tokenizer = MarianConverter(tokenizer, index=1).converted()
fast_tokenizer.save("tokenizer-marian-base-en.json")
```
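
To convert a different language pair, the same steps could be wrapped in a small helper. The following is only a sketch: `marian_output_names` and `convert_pair` are hypothetical names not found in this directory, and the `tokenizer-marian-base-<lang>.json` naming is assumed from the two file names used above. It also assumes repo ids of the form `Helsinki-NLP/opus-mt-<src>-<tgt>`; other naming schemes are rejected rather than guessed.

```python
from typing import Tuple


def marian_output_names(repo_id: str, prefix: str = "tokenizer-marian-base") -> Tuple[str, str]:
    """Derive (source, target) output file names from an opus-mt repo id.

    Hypothetical helper: the naming scheme is assumed from the example above.
    """
    model_name = repo_id.rsplit("/", 1)[-1]  # e.g. "opus-mt-fr-en"
    parts = model_name.split("-")
    if len(parts) != 4 or parts[:2] != ["opus", "mt"]:
        raise ValueError(f"expected 'opus-mt-<src>-<tgt>', got {model_name!r}")
    src, tgt = parts[2], parts[3]
    return f"{prefix}-{src}.json", f"{prefix}-{tgt}.json"


def convert_pair(repo_id: str) -> None:
    """Write one tokenizer.json per side of the language pair (hypothetical helper)."""
    # Imported lazily so the naming helper can be used without these packages.
    from convert_slow_tokenizer import MarianConverter
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
    # index=0 and index=1 are used exactly as in the script above.
    for index, path in enumerate(marian_output_names(repo_id)):
        MarianConverter(tokenizer, index=index).converted().save(path)
```

Under these assumptions, calling `convert_pair("Helsinki-NLP/opus-mt-fr-en")` from this directory would be expected to write the same two files as the script above.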
