Partial sync of codebase
hauntsaninja committed Oct 3, 2024
1 parent 9f7f69d commit 4cfd51f
Showing 16 changed files with 296 additions and 138 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/build_wheels.yml
@@ -15,13 +15,13 @@ jobs:
matrix:
# cibuildwheel builds linux wheels inside a manylinux container
# it also takes care of procuring the correct python version for us
-os: [ubuntu-latest, windows-latest, macos-13]
-python-version: [38, 39, 310, 311, 312]
+os: [ubuntu-latest, windows-latest, macos-latest]
+python-version: [39, 310, 311, 312, 313]

steps:
- uses: actions/checkout@v4

-- uses: pypa/cibuildwheel@v2.18.0
+- uses: pypa/cibuildwheel@v2.21.2
env:
CIBW_BUILD: "cp${{ matrix.python-version}}-*"

@@ -37,7 +37,7 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest]
-python-version: [38, 39, 310, 311, 312]
+python-version: [39, 310, 311, 312, 313]

steps:
- uses: actions/checkout@v4
@@ -48,7 +48,7 @@ jobs:
platforms: arm64

- name: Build wheels
-uses: pypa/cibuildwheel@v2.18.0
+uses: pypa/cibuildwheel@v2.21.2
env:
CIBW_BUILD: "cp${{ matrix.python-version}}-*"
CIBW_ARCHS: aarch64
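
A minimal Python sketch of how the matrix values in the hunks above turn into cibuildwheel build selectors. The helper name `cibw_build_selector` is hypothetical; in the workflow itself the substitution is performed by the GitHub Actions expression `cp${{ matrix.python-version}}-*`, not by Python code.

```python
# Matrix values as they appear in the workflow after this commit.
PYTHON_VERSIONS = [39, 310, 311, 312, 313]


def cibw_build_selector(python_version: int) -> str:
    """Hypothetical helper: mimic the expansion of `cp${{ matrix.python-version}}-*`."""
    return f"cp{python_version}-*"


# One selector per matrix entry; each job builds only the wheels matching its selector.
selectors = [cibw_build_selector(v) for v in PYTHON_VERSIONS]
print(selectors)
```

Because 38 is no longer in the matrix, no job receives a `cp38-*` selector, which matches the "Drop support for Python 3.8" changelog entry.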
30 changes: 28 additions & 2 deletions CHANGELOG.md
@@ -2,11 +2,26 @@

This is the changelog for the open source version of tiktoken.

## [v0.8.0]

- Support for `o1-` and `chatgpt-4o-` models
- Build wheels for Python 3.13
- Add possessive quantifiers to limit backtracking in regular expressions, thanks to @l0rinc!
- Provide a better error message and type for invalid token decode
- Permit tuples in type hints
- Better error message for passing invalid input to `get_encoding`
- Better error messages during plugin loading
- Add a `__version__` attribute
- Update versions of `pyo3`, `regex`, `fancy-regex`
- Drop support for Python 3.8

## [v0.7.0]

- Support for `gpt-4o`
- Performance improvements

## [v0.6.0]

- Optimise regular expressions for a 20% performance improvement, thanks to @paplorinc!
- Add `text-embedding-3-*` models to `encoding_for_model`
- Check content hash for downloaded files
@@ -16,14 +31,17 @@ This is the changelog for the open source version of tiktoken.
Thank you to @paplorinc, @mdwelsh, @Praneet460!

## [v0.5.2]

- Build wheels for Python 3.12
- Update version of PyO3 to allow multiple imports
- Avoid permission errors when using default cache logic

## [v0.5.1]

- Add `encoding_name_for_model`, undo some renames to variables that are implementation details

## [v0.5.0]

- Add `tiktoken._educational` submodule to better document how byte pair encoding works
- Ensure `encoding_for_model` knows about several new models
- Add `decode_with_offsets`
@@ -32,23 +50,28 @@ Thank you to @paplorinc, @mdwelsh, @Praneet460!
- Update versions of dependencies

## [v0.4.0]

- Add `decode_batch` and `decode_bytes_batch`
- Improve error messages and handling

## [v0.3.3]

- `tiktoken` will now make a best effort attempt to replace surrogate pairs with the corresponding
Unicode character and will replace lone surrogates with the Unicode replacement character.

## [v0.3.2]

- Add encoding for GPT-4

## [v0.3.1]

- Build aarch64 wheels
- Make `blobfile` an optional dependency

Thank you to @messense for the environment variable that makes cargo not OOM under emulation!

## [v0.3.0]

- Improve performance by 5-20%; thank you to @nistath!
- Add `gpt-3.5-turbo` models to `encoding_for_model`
- Add prefix matching to `encoding_for_model` to better support future model versions
@@ -57,16 +80,19 @@ Thank you to @messense for the environment variable that makes cargo not OOM und
- Add packaging metadata

## [v0.2.0]
-- Add ``tiktoken.encoding_for_model`` to get the encoding for a specific model

+- Add `tiktoken.encoding_for_model` to get the encoding for a specific model
- Improve portability of caching logic

Thank you to @fritzo, @arvid220u, @khanhvu207, @henriktorget for various small corrections

## [v0.1.2]

- Avoid use of `blobfile` for public files
- Add support for Python 3.8
- Add py.typed
- Improve the public tests

## [v0.1.1]

- Initial release
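
The changelog entries above mention prefix matching in `encoding_for_model` (v0.3.0) and new `o1-` and `chatgpt-4o-` model prefixes (v0.8.0). The following is a rough sketch of how such a lookup could work, not tiktoken's actual implementation. The function name `lookup_encoding_name`, both tables, and the encoding names assigned to each entry are illustrative assumptions.

```python
from typing import Optional

# Illustrative tables only (assumed values); the real mappings live inside tiktoken.
EXACT_MODELS = {"gpt-4o": "o200k_base"}
MODEL_PREFIXES = {
    "o1-": "o200k_base",
    "chatgpt-4o-": "o200k_base",
    "gpt-3.5-turbo-": "cl100k_base",
}


def lookup_encoding_name(model_name: str) -> Optional[str]:
    """Hypothetical lookup: try an exact match first, then fall back to prefixes."""
    if model_name in EXACT_MODELS:
        return EXACT_MODELS[model_name]
    for prefix, encoding_name in MODEL_PREFIXES.items():
        if model_name.startswith(prefix):
            return encoding_name
    return None  # unknown model; a real library might raise an error instead


print(lookup_encoding_name("o1-preview"))
print(lookup_encoding_name("some-unknown-model"))
```

Prefix matching means a dated or suffixed model name can resolve without a new release, as long as it starts with a known prefix.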
4 changes: 2 additions & 2 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "tiktoken"
-version = "0.7.0"
+version = "0.8.0"
edition = "2021"
rust-version = "1.57.0"

@@ -9,7 +9,7 @@ name = "_tiktoken"
crate-type = ["cdylib"]

[dependencies]
-pyo3 = { version = "0.20.0", features = ["extension-module"] }
+pyo3 = { version = "0.22.2", default-features = false, features = ["extension-module", "macros"] }

# tiktoken dependencies
fancy-regex = "0.13.0"
1 change: 1 addition & 0 deletions README.md
@@ -128,3 +128,4 @@ setup(

Then simply `pip install ./my_tiktoken_extension` and you should be able to use your
custom encodings! Make sure **not** to use an editable install.

5 changes: 3 additions & 2 deletions pyproject.toml
@@ -1,13 +1,13 @@
[project]
name = "tiktoken"
-version = "0.7.0"
+version = "0.8.0"
description = "tiktoken is a fast BPE tokeniser for use with OpenAI's models"
readme = "README.md"
license = {file = "LICENSE"}
authors = [{name = "Shantanu Jain"}, {email = "[email protected]"}]
dependencies = ["regex>=2022.1.18", "requests>=2.26.0"]
optional-dependencies = {blobfile = ["blobfile>=2"]}
-requires-python = ">=3.8"
+requires-python = ">=3.9"

[project.urls]
homepage = "https://github.com/openai/tiktoken"
@@ -42,3 +42,4 @@ test-command = "pytest {project}/tests --import-mode=append"
[[tool.cibuildwheel.overrides]]
select = "*linux_aarch64"
test-command = """python -c 'import tiktoken; enc = tiktoken.get_encoding("gpt2"); assert enc.encode("hello world") == [31373, 995]'"""
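
This commit bumps the version to 0.8.0 in both `Cargo.toml` and `pyproject.toml`. A small sketch of how one might check that the two stay in sync is shown here. The helper `first_version` is hypothetical, and the embedded snippets are trimmed copies of the post-commit file contents. A regular expression is used only to keep the example dependency-free; a real check would more likely use a TOML parser.

```python
import re

# Trimmed excerpts of the two manifests as they read after this commit.
CARGO_TOML = """[package]
name = "tiktoken"
version = "0.8.0"
edition = "2021"
"""

PYPROJECT_TOML = """[project]
name = "tiktoken"
version = "0.8.0"
"""


def first_version(toml_text: str) -> str:
    """Hypothetical helper: return the first top-of-line `version = "..."` value."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', toml_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)


cargo_version = first_version(CARGO_TOML)
pyproject_version = first_version(PYPROJECT_TOML)
assert cargo_version == pyproject_version, "manifest versions have drifted apart"
print(cargo_version)
```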
