Commit a2124a1

Added a note to the README advising users to deduct the number of automatically added special tokens from their chunk sizes.
1 parent e4d0a97 commit a2124a1

3 files changed: 8 additions, 2 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.1.1] - 2025-02-18
+### Added
+- Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been readded as it is unlikely to be obvious to users.
+
 ## [3.1.0] - 2025-02-16
 ### Added
 - Introduced a new `cache_maxsize` argument to `chunkerify()` and `chunk()` that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to `None`, in which case the cache is unbounded.
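
For readers unfamiliar with bounded caches, the behaviour described in the 3.1.0 entry can be illustrated with the standard library. This is only a sketch of the general idea, not `semchunk`'s implementation: `make_cached_counter` and `count_words` are hypothetical names, and using `functools.lru_cache` here is an assumption made for illustration.

```python
from functools import lru_cache

def make_cached_counter(token_counter, cache_maxsize=None):
    """Wrap a token counter so repeated texts are counted only once.

    `cache_maxsize=None` leaves the cache unbounded; an integer caps the
    number of text-token count pairs that are kept.
    """
    return lru_cache(maxsize=cache_maxsize)(token_counter)

calls = []  # records every text that actually reaches the underlying counter

def count_words(text: str) -> int:
    # Hypothetical token counter: one token per whitespace-separated word.
    calls.append(text)
    return len(text.split())

counter = make_cached_counter(count_words, cache_maxsize=2)
first = counter("The quick brown fox")
second = counter("The quick brown fox")  # served from the cache
```

A bounded cache trades some repeated counting for a predictable memory footprint, which matters when chunking very large corpora.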

README.md

Lines changed: 3 additions & 1 deletion
@@ -34,7 +34,9 @@ import semchunk
 import tiktoken # `transformers` and `tiktoken` are not required.
 from transformers import AutoTokenizer # They're just here for demonstration purposes.
 
-chunk_size = 4
+chunk_size = 4 # A low chunk size is used here for demonstration purposes. Keep in mind, `semchunk`
+               # does not know how many special tokens, if any, your tokenizer adds to every input,
+               # so you may want to deduct the number of special tokens added from your chunk size.
 text = 'The quick brown fox jumps over the lazy dog.'
 
 # You can construct a chunker with `semchunk.chunkerify()` by passing the name of an OpenAI model,
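
The deduction recommended in the new README comment can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for a real tokenizer that wraps every input in two special tokens, and `model_limit` is an assumed maximum input length, neither of which comes from the commit.

```python
def toy_encode(text: str) -> list[str]:
    # Hypothetical tokenizer: whitespace split plus two special tokens.
    return ["[CLS]", *text.split(), "[SEP]"]

def special_token_overhead(encode) -> int:
    """Estimate the special-token overhead by encoding an empty string."""
    return len(encode(""))

model_limit = 512  # assumed maximum number of tokens the model accepts
overhead = special_token_overhead(toy_encode)

# Reserve room for the special tokens so that a full chunk, once encoded,
# still fits within the model's limit.
chunk_size = model_limit - overhead

# A chunk holding exactly `chunk_size` content tokens encodes to `model_limit`.
full_chunk = " ".join(["word"] * chunk_size)
encoded_length = len(toy_encode(full_chunk))
```

With a Hugging Face tokenizer, `num_special_tokens_to_add()` should report the same overhead without encoding anything, though whether it applies depends on the tokenizer in use.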

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "3.1.0"
+version = "3.1.1"
 authors = [
     {name="Isaacus", email="[email protected]"},
     {name="Umar Butler", email="[email protected]"},
