Added a note to the README advising users to deduct the number of automatically added special tokens from their chunk sizes.
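
For illustration, the deduction described in the note might look like the sketch below. The toy `tokenize()` function, the `[CLS]`/`[SEP]` markers, and `model_max_tokens` are all hypothetical stand-ins rather than part of `semchunk` or any real tokenizer. The idea is simply to measure how many special tokens are added to an empty input and subtract that from the token budget.

```python
# Hypothetical toy tokenizer, for illustration only: it splits on whitespace and
# wraps every input in two special tokens, much as some real tokenizers do.
SPECIAL_TOKENS = ['[CLS]', '[SEP]']

def tokenize(text: str, add_special_tokens: bool = True) -> list[str]:
    tokens = text.split()
    if add_special_tokens:
        return [SPECIAL_TOKENS[0], *tokens, SPECIAL_TOKENS[1]]
    return tokens

def special_token_overhead() -> int:
    """Count the special tokens added to every input by tokenizing an empty string."""
    return len(tokenize(''))

model_max_tokens = 6  # Assumed limit on tokens per model input (made up for this example).

# Deduct the special tokens so that chunks still fit once the tokenizer adds them.
chunk_size = model_max_tokens - special_token_overhead()

# A chunk holding exactly `chunk_size` content tokens fills the budget after wrapping.
chunk = 'The quick brown fox'
assert len(tokenize(chunk, add_special_tokens=False)) == chunk_size
assert len(tokenize(chunk)) == model_max_tokens
```

The resulting `chunk_size` is the value you would then pass on to the chunker. With a Hugging Face `transformers` tokenizer, `tokenizer.num_special_tokens_to_add()` should report the same overhead without the empty-string trick, though it is worth checking against your own tokenizer.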

umarbutler · umarbutler · commit a2124a1a7c10 · 2025-02-18T11:04:43.000+11:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.1.1] - 2025-02-18
+### Added
+- Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been re-added as it is unlikely to be obvious to users.
+
 ## [3.1.0] - 2025-02-16
 ### Added
 - Introduced a new `cache_maxsize` argument to `chunkerify()` and `chunk()` that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to `None`, in which case the cache is unbounded.
diff --git a/README.md b/README.md
@@ -34,7 +34,9 @@ import semchunk
 import tiktoken                        # `transformers` and `tiktoken` are not required.
 from transformers import AutoTokenizer # They're just here for demonstration purposes.
 
-chunk_size = 4
+chunk_size = 4 # A low chunk size is used here for demonstration purposes. Keep in mind, `semchunk`
+               # does not know how many special tokens, if any, your tokenizer adds to every input,
+               # so you may want to deduct the number of special tokens added from your chunk size.
 text = 'The quick brown fox jumps over the lazy dog.'
 
 # You can construct a chunker with `semchunk.chunkerify()` by passing the name of an OpenAI model,
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "3.1.0"
+version = "3.1.1"
 authors = [
     {name="Isaacus", email="support@isaacus.com"},
     {name="Umar Butler", email="umar@umar.au"},