v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more
✨ New features and improvements
- NEW: `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce the size of output docs.
- NEW: Add `ENT_ID` and `ENT_KB_ID` to `Matcher` pattern attributes.
- Support `kb_id` for entities in displaCy from `Doc` input.
- Add `Span.sents` property for spans spanning over more than one sentence.
- Add `EntityRuler.remove` to remove patterns by `id`.
- Make the `Tagger` `neg_prefix` configurable.
- Use `Language.pipe` in `Language.evaluate` for more efficient processing.
- Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
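
As a rough illustration of four of the additions above (`doc_cleaner`, the `ENT_KB_ID` pattern attribute, `EntityRuler.remove` and `Span.sents`), here is a minimal sketch. It assumes spaCy v3.2.1 or later with NumPy available and uses a blank English pipeline so that no trained model is needed. The example texts, the pattern IDs, the `Q60` knowledge base ID and the zero-filled tensor are placeholders invented for illustration and are not part of the release.

```python
import numpy
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# A blank English pipeline keeps the sketch free of model downloads.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
# doc_cleaner is added last so it runs after every component that may
# still need the data. Here it is configured to reset doc.tensor to None.
cleaner = nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

# 1. doc_cleaner: drop doc.tensor once nothing downstream needs it.
doc = nlp.make_doc("A short example.")
# Assumption: a zero array stands in for real tok2vec/transformer output.
doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")
doc = cleaner(doc)  # called directly here; nlp(text) would also run it
tensor_after_cleaning = doc.tensor

# 2. Matcher: match single tokens by the KB ID of the entity they belong to.
doc = nlp.make_doc("New York is big")
doc.ents = [Span(doc, 0, 2, label="GPE", kb_id="Q60")]  # hypothetical KB ID
matcher = Matcher(nlp.vocab)
matcher.add("NYC_TOKEN", [[{"ENT_KB_ID": "Q60"}]])
kb_matches = [doc[start:end].text for _, start, end in matcher(doc)]

# 3. EntityRuler.remove: delete all patterns registered under one id.
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Apple", "id": "apple"},
        {"label": "ORG", "pattern": "Google", "id": "google"},
    ]
)
ruler.remove("apple")
remaining_ents = [ent.text for ent in nlp("Apple and Google").ents]

# 4. Span.sents: iterate over the sentences that a span overlaps.
doc = nlp("This is one. This is two. This is three.")
span = doc[2:6]  # starts in the first sentence, ends in the second
span_sentences = [sent.text for sent in span.sents]

print(tensor_after_cleaning, kb_matches, remaining_ents, span_sentences)
```

In a trained pipeline, `doc_cleaner` would normally just sit at the end of `nlp.pipeline` and run as part of `nlp(text)` or `nlp.pipe(texts)`; calling the component directly, as in step 1, is only a convenience for showing its effect in isolation.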
🔴 Bug fixes
- Fix issue #9638: Make `JsonlCorpus` path optional again.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9658: Improve error message for incorrect `.jsonl` paths in `EntityRuler`.
- Fix issue #9674: Fix language-specific factory handling in package CLI.
- Fix issue #9694: Convert labels to strings for README in package CLI.
- Fix issue #9697: Exclude strings from source vector checks.
- Fix issue #9701: Allow `Scorer.score_spans` to handle predicted docs with missing annotation.
- Fix issue #9722: Initialize `parser` from reference parse rather than aligned example.
- Fix issue #9764: Set annotations more efficiently in `tagger` and `morphologizer`.
📖 Documentation and examples
- Various documentation updates: `init_tok2vec` after pretraining, batch contract for listeners.
- New additions to the spaCy universe:
  - `eng-spacysentiment`: Sentiment analysis for English.
  - Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
👥 Contributors
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar