Releases · isaacus-dev/semchunk · GitHub

28 Oct 02:13

umarbutler

v3.2.5 Latest

Changed

Switched to monthly download counts from pypistats.org, which are more accurate than the counts from pepy.tech.

26 Oct 04:51

umarbutler

v3.2.4

Fixed

Fixed splitters being sorted lexicographically rather than by length, which should produce more meaningful chunks.
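
To illustrate why the ordering matters, here is a minimal sketch in plain Python. It is not semchunk's actual code; the sample text and variable names are assumptions made for the example. It shows that picking the "largest" newline run by string comparison can select a shorter run than picking by length.

```python
import re

# Hypothetical sample text: a triple newline (a strong break) and a
# Windows-style line ending (a weak break).
text = "Title\n\n\nFirst paragraph.\r\nSecond line."

# Every run of newline or carriage return characters is a candidate splitter.
candidates = re.findall(r"[\r\n]+", text)

# String comparison is lexicographic: "\r" sorts after "\n", so the shorter
# run wins even though it marks a weaker break.
lexicographic_pick = max(candidates)

# Comparing by length selects the longest run instead.
length_pick = max(candidates, key=len)

print(repr(lexicographic_pick), repr(length_pick))
```

Here the lexicographic comparison chooses the two-character `\r\n` run, while the length comparison chooses the three-character run of newlines.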

26 Oct 04:51

umarbutler

v3.2.3

Fixed

Fixed broken Python download count shield (crflynn/pypistats.org#82).

09 Jun 08:16

umarbutler

v3.2.2

Fixed

Fixed an IndexError raised when chunking whitespace-only texts with overlapping enabled (#18).

09 Jun 08:16

umarbutler

v3.2.1

Fixed

Fixed minor typos in the README and docstrings.

20 Mar 04:46

umarbutler

v3.2.0

Changed

Significantly improved the quality of chunks produced when chunking with low chunk sizes or when chunking documents with little variation in levels of whitespace. The semchunk algorithm has a new rule that prioritizes splitting at single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at single whitespace characters in general (#17).
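
The rule can be sketched roughly as follows. This is an illustrative approximation, not semchunk's implementation: the function name `split_on_best_whitespace` and the ordered tuple `PRECEDING_CHARS` are hypothetical, and the release notes do not say which characters semchunk treats as meaningful or in what order.

```python
import re

# Assumed ordering of "hierarchically meaningful" characters, most
# significant first. The real list used by semchunk may differ.
PRECEDING_CHARS = (".", "?", "!", ";", ",", ":")


def split_on_best_whitespace(text: str) -> list[str]:
    """Split at whitespace that follows the most significant character found."""
    for char in PRECEDING_CHARS:
        # The lookbehind keeps the punctuation attached to the left-hand part.
        pattern = r"(?<=" + re.escape(char) + r")\s"
        if re.search(pattern, text):
            return re.split(pattern, text)

    # Fallback: no meaningful character precedes any whitespace, so split
    # at every single whitespace character.
    return re.split(r"\s", text)


sentence_split = split_on_best_whitespace("One two. Three four, five.")
clause_split = split_on_best_whitespace("alpha beta, gamma delta")
word_split = split_on_best_whitespace("alpha beta gamma")
```

In this sketch, text containing sentence-ending punctuation is split at sentence boundaries first, and splitting at every space happens only when no better boundary exists.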

11 Mar 06:17

umarbutler

v3.1.3

Changed

Added mention of Isaacus to the README.

Full Changelog: v3.1.2...v3.1.3

06 Mar 11:16

umarbutler

v3.1.2

Changed

Changed test model from isaacus/emubert to isaacus/kanon-tokenizer.

Full Changelog: v3.1.1...v3.1.2

18 Feb 05:02

umarbutler

v3.1.1

Added

Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note was removed in version 3.0.0 but has been re-added because the need for the deduction is unlikely to be obvious to users.
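
The arithmetic behind that advice can be sketched as below. The toy `encode` function, the `[CLS]`/`[SEP]` markers, and the 512-token limit are all assumptions chosen for illustration; a real tokenizer may add a different number of special tokens.

```python
# Assumed context window of a hypothetical model.
MODEL_MAX_TOKENS = 512


def encode(text: str, add_special_tokens: bool = True) -> list[str]:
    """Toy tokenizer: one token per word, optionally wrapped in special tokens."""
    tokens = text.split()
    if add_special_tokens:
        return ["[CLS]", *tokens, "[SEP]"]
    return tokens


# Count the tokens the tokenizer adds on its own by encoding an empty string
# with and without special tokens.
num_special_tokens = len(encode("")) - len(encode("", add_special_tokens=False))

# Reserve room for those tokens so each chunk still fits once it is encoded.
chunk_size = MODEL_MAX_TOKENS - num_special_tokens

print(num_special_tokens, chunk_size)
```

Under these assumptions the tokenizer adds two special tokens, so the chunk size passed to the chunker would be 510 rather than 512.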

16 Feb 10:14

umarbutler

v3.1.0

Added

Introduced a new cache_maxsize argument to chunkerify() and chunk() that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to None, in which case the cache is unbounded.
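
The effect of a bounded cache can be sketched with the standard library's `functools.lru_cache`. This is an analogy, not a description of how semchunk stores its cache: the helper `memoize_token_counter` and the word-based `count_tokens` are hypothetical names introduced for the example.

```python
from functools import lru_cache


def count_tokens(text: str) -> int:
    """Toy token counter: one token per whitespace-separated word."""
    return len(text.split())


def memoize_token_counter(token_counter, cache_maxsize=None):
    """Wrap a token counter in a cache; None means the cache is unbounded."""
    return lru_cache(maxsize=cache_maxsize)(token_counter)


# A cache limited to two text-token count pairs.
bounded_counter = memoize_token_counter(count_tokens, cache_maxsize=2)

for text in ["a b", "c d e", "a b", "f"]:
    bounded_counter(text)

# The repeated "a b" is served from the cache; adding "f" evicts the least
# recently used entry so the cache never holds more than two pairs.
cache_info = bounded_counter.cache_info()
print(cache_info)
```

According to the note above, the corresponding setting in semchunk is the `cache_maxsize` argument of `chunkerify()` and `chunk()`. A bound trades some repeated token counting for a cap on memory use.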
