feat: Support tokenizer configurations #5

Open
VoVAllen opened this issue Jun 4, 2024 · 2 comments
VoVAllen commented Jun 4, 2024

No description provided.

VoVAllen commented Jun 4, 2024

VoVAllen commented Jun 30, 2024

VoVAllen assigned cutecutecat and unassigned cutecutecat Jul 1, 2024
jwnz mentioned this issue Sep 23, 2024
usamoi pushed a commit that referenced this issue Oct 8, 2024
### Description
This PR adds better support for configuring the tokenizer (#5).

### Todo
- [x] Allow selection of a Hugging Face tokenizer; the model is downloaded from the Hub.
- [x] Jieba tokenizer (Chinese)
- [x] tiktoken
- [x] tiniestsegmenter (Japanese) [optional]
- [x] Allow switching of HF tokenizers
- [x] Add Tests for tokenizing

### Notes
~~1. Currently, once a HuggingFace tokenizer is initialized, it cannot be changed. e.g. running `SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');` after `SELECT tokenize('i have an apple', 'hf', 'sentence-transformers/LaBSE');` would still use the LaBSE tokenizer, as it has already been initialized. Need a better way to handle this.~~

~~2. There is no official Rust crate for tiktoken.~~ Used [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs).
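
For the `tiktoken` path, here is a minimal illustrative sketch of calling tiktoken-rs directly from Rust (not code from this PR; it only shows the crate's `r50k_base`/`encode_ordinary` API, where `r50k_base` is the encoding used by GPT-2):

```rust
use tiktoken_rs::r50k_base;

fn main() {
    // Load the r50k_base encoding (the BPE used by GPT-2).
    let bpe = r50k_base().expect("failed to load r50k_base encoding");

    // Encode plain text without interpreting special tokens.
    let tokens = bpe.encode_ordinary("i have an apple");
    println!("{:?}", tokens);
}
```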

### Usage
```SQL
SELECT tokenize('i have an apple', 'hf', 'google-bert/bert-base-uncased');
SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');
SELECT tokenize('i have an apple', 'tiktoken', 'gpt2');
SELECT tokenize('i have an apple', 'tiniestsegmenter', '');
SELECT tokenize('i have an apple', 'jieba', '');
SELECT tokenize('i have an apple', 'ws', '');
```
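
In each call, the second argument selects the tokenizer family and the third names the model or encoding; tokenizers that take no model (`tiniestsegmenter`, `jieba`, `ws`) are passed an empty string.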

---------

Signed-off-by: jwnz <[email protected]>