Releases: explosion/spaCy
v3.3.2: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #10911, #11194: Improve speed in `precomputable_biaffine` by avoiding concatenation.
- #11276, #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11934: Add strings when initializing from labels in `EditTreeLemmatizer`.
- #11935: Restore missing error messages for beam search.
👥 Contributors
v3.2.5: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #10573: Remove Click pin following Typer updates.
- #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11935: Restore missing error messages for beam search.
👥 Contributors
v3.1.7: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #10573: Remove Click pin following Typer updates.
- #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11935: Restore missing error messages for beam search.
👥 Contributors
v3.4.3: Extended Typer support and bug fixes
✨ New features and improvements
- Extend Typer support to v0.7.x (#11720).
🔴 Bug fixes
- #11640: Handle docs with no entities in `EntityLinker`.
- #11688: Restore custom doc extension values in `Doc.to_json()` for attributes set by getters.
- #11706: Remove incorrect warning for `pipeline_package.load()`.
- #11735: Improve `spacy project` requirements checks for unsupported specifiers and requirements lines.
- #11745: Revert modifications to `spacy.load(disable=)` that could enable currently disabled components.
👥 Contributors
@aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker
v3.4.2: Latin and Luganda support, Python 3.11 wheels and more
✨ New features and improvements
- NEW: Luganda language support (#10847).
- NEW: Latin language support (#11349).
- NEW: `spacy.ConsoleLogger.v2` optionally saves training logs to JSONL (#11214).
- NEW: New operators for the `DependencyMatcher` to include matching parents or children to the left or the right of the node (#10371).
- Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
- Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
- Support CuPy v11 and add extras for `cuda11x` and `cuda-autodetect` (using `cupy-wheel`) (#11279).
- Support custom attributes for tokens and spans in `Doc.to_json()` and `Doc.from_json()` (#11125).
- Make the `enable` and `disable` options for `spacy.load()` more consistent (#11459).
- Allow a single string argument for `disable`/`enable`/`exclude` for `spacy.load()` (#11406).
- New `--url` flag for `spacy info` to print the direct download URL for a pipeline (#11175).
- Add a check for missing requirements in the `spacy project` CLI (#11226).
- Add a Levenshtein distance function (#11418).
- Improvements to the `spacy debug data` CLI for spancat data (#11504).
- Allow overriding `spacy_version` in `spacy package` metadata (#11552).
- Improve the error message when using the wrong command for `spacy project assets` (#11458).
- Ensure parent directories are created when storing the results of the `spacy pretrain` command (#11210).
- Extend support to newer versions of `natto-py` for the `ko` extra (#11222).
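The Levenshtein distance mentioned in the list above (#11418) is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that metric for illustration only; the function name `levenshtein` is hypothetical here, and this is neither spaCy's implementation nor its API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    # Keep only the previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```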
📦 Trained pipelines updates
This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in `en_core_web_*` v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).
Use `spacy download` to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. `spacy download -d en_core_web_sm-3.4.0`. You can check that you are using the new version (v3.4.1) with `spacy validate`:

    NAME             SPACY            VERSION
    en_core_web_md   >=3.4.0,<3.5.0   3.4.1    ✔
🔴 Bug fixes
- #11275: Fix Dutch noun chunks to skip overlapping spans.
- #11276: Fix regex invalid escape sequences.
- #11312: Better handling of unexpected types in `SetPredicate`.
- #11460: Fix config validation failures caused by NVTX pipeline wrappers.
- #11506: Avoid unwanted side effects in `Doc.__init__`.
- #11540: Preserve missing entity annotation in augmenters.
- #11592: Fix issues with DVC commands.
- #11631: Fix initialization for `pymorphy2_lookup` lemmatizer mode for Russian and Ukrainian.
⚠️ Backwards incompatibilities
- If you're using a custom component that does not return a `Doc` type, an error will now be raised (#11424).
- If you're using a dot in a factory name, an error is raised as this is not supported (#11336).
📖 Documentation and examples
- Added documentation for the new experimental coref component.
- Added Ukrainian trained pipelines to the website.
- Added documentation for the `spacy.models_and_pipes_with_nvtx_range.v1` callback.
- Fix English pipeline names in v3.4 release notes.
- Various fixes to the `Example` API documentation.
- Extensions and improvements to the `displacy` docs.
- Fix the example command for `spacy project dvc`.
- Update example code for `spacy-wordnet`.
- Improve API documentation around the `initialize()` function for pipeline components.
- Fix various typos and inconsistencies.
- spaCy universe additions:
  - concepCy: A spaCy wrapper for ConceptNet.
  - spaCy partial tagger: build a CRF tagger with a partially annotated dataset.
  - Zshot: Zero and Few shot named entity & relationships recognition.
👥 Contributors
@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy
v2.3.8: Updates for Python 3.10 and 3.11
✨ New features and improvements
- Updates and binary wheels for Python 3.10 and 3.11.
👥 Contributors
v3.4.1: Fix compatibility with CuPy v9.x
🔴 Bug fixes
- Fix issue #11137: Fix compatibility with CuPy v9.x.
📖 Documentation and examples
- spaCy universe additions:
  - BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.
  - English Interpretation Sentence Pattern: English interpretation for accurate translation from English to Japanese.
👥 Contributors
@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic
v3.4.0: Updated types, speed improvements and pipelines for Croatian
✨ New features and improvements
- Support for mypy 0.950+ and pydantic v1.9 (#10786).
- Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
- Min/max `{n,m}` operator for `Matcher` patterns (#10981).
- Language updates:
- Improved speed of vector lookups (#10992).
- For the parser, use C `saxpy`/`sgemm` provided by the `Ops` implementation in order to use Accelerate through `thinc-apple-ops` (#10773).
- Improved speed of `Example.get_aligned_parse` and `Example.get_aligned` (#10952).
- Improved speed of `StringStore` lookups (#10938).
- Updated `spacy project clone` to try both `main` and `master` branches by default (#10843).
- Added confidence threshold for named entity linker (#11016).
- Improved handling of Typer optional default values for `init_config_cli` (#10788).
- Added cycle detection in parser projectivization methods (#10877).
- Added counts for NER labels in `debug data` (#10960).
- Support for adding NVTX ranges to `TrainablePipe` components (#10965).
- Support env variable `SPACY_NUM_BUILD_JOBS` to specify the number of build jobs to run in parallel with `pip` (#11073).
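One of the items above adds cycle detection to the parser's projectivization methods (#10877). As a rough illustration of what detecting a cycle in a dependency tree means, the sketch below follows head pointers from each token and reports a cycle if it returns to a token already on the current path. The function name `has_cycle` and the convention that a root token points to itself are assumptions made for this example; this is not spaCy's internal code.

```python
from typing import List


def has_cycle(heads: List[int]) -> bool:
    """Return True if following head pointers ever loops without reaching a root.

    heads[i] is the index of token i's head; a root points to itself.
    """
    for start in range(len(heads)):
        seen = {start}
        current = start
        while heads[current] != current:  # stop once we reach a root
            current = heads[current]
            if current in seen:
                return True
            seen.add(current)
    return False


print(has_cycle([1, 1, 1]))  # False: token 1 is the root
print(has_cycle([1, 2, 0]))  # True: 0 -> 1 -> 2 -> 0
```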
📦 Trained pipelines updates
We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.
| Package | UPOS | Parser LAS | NER F |
|---|---|---|---|
| `hr_core_news_sm` | 96.6 | 77.5 | 76.1 |
| `hr_core_news_md` | 97.3 | 80.1 | 81.8 |
| `hr_core_news_lg` | 97.5 | 80.4 | 83.0 |
🙏 Special thanks to @gtoffoli for help with the new pipelines!
The English pipelines have new word vectors:
| Package | Model Version | TAG | Parser LAS | NER F |
|---|---|---|---|---|
| `en_core_web_md` | v3.3.0 | 97.3 | 90.1 | 84.6 |
| `en_core_web_md` | v3.4.0 | 97.2 | 90.3 | 85.5 |
| `en_core_web_lg` | v3.3.0 | 97.4 | 90.1 | 85.3 |
| `en_core_web_lg` | v3.4.0 | 97.3 | 90.2 | 85.6 |
All CNN pipelines have been extended to add whitespace augmentation.
🔴 Bug fixes
- Fix issue #10960: Support hyphens in NER labels.
- Fix issue #10994: Fix horizontal spacing for spans in displaCy.
- Fix issue #11013: Check for any token with a vector in `Doc.has_vector`, distinguish 0-vectors and missing vectors in `similarity` warnings.
- Fix issue #11056: Don't use `get_array_module` in `textcat`.
- Fix issue #11092: Fix vertical alignment for spans in displaCy.
🚀 Notes about upgrading from v3.3
- `Doc.has_vector` now matches `Token.has_vector` and `Span.has_vector`: it returns `True` if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
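To make the behavioural change concrete, here is a plain-Python sketch that contrasts the two checks. The classes and function names (`ToyToken`, `ToyDoc`, `has_vector_v33`, `has_vector_v34`) are hypothetical stand-ins, not spaCy objects; they only model the rule described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyToken:
    text: str
    vector: Optional[List[float]] = None  # None models "no vector for this token"


@dataclass
class ToyDoc:
    tokens: List[ToyToken] = field(default_factory=list)
    vocab_has_vectors: bool = False


def has_vector_v33(doc: ToyDoc) -> bool:
    # Earlier rule: only ask whether the vocab contains any vectors at all.
    return doc.vocab_has_vectors


def has_vector_v34(doc: ToyDoc) -> bool:
    # Newer rule: at least one token in the doc must have a vector.
    return any(token.vector is not None for token in doc.tokens)


doc = ToyDoc(
    tokens=[ToyToken("xyzzy"), ToyToken("plugh")],  # neither token has a vector
    vocab_has_vectors=True,
)
print(has_vector_v33(doc))  # True
print(has_vector_v34(doc))  # False
```

Code that used `Doc.has_vector` as a proxy for "this pipeline ships vectors" may therefore see different results on docs made up entirely of tokens without vectors.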
📖 Documentation and examples
- spaCy universe additions:
  - Aim-spacy: An Aim-based spaCy experiment tracker.
  - Asent: Fast, flexible and transparent sentiment analysis.
  - spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
  - spacy-report: Generates interactive reports for spaCy models.
👥 Contributors
@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere
v3.3.1: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more
✨ New features and improvements
- Add the SpanRuler component. This component saves a list of matched spans to `Doc.spans[spans_key]`.
- Support for JSON serialization and deserialization of `Doc` objects.
- Add span analysis to `debug data`.
- Allow data assets to be made optional in a spaCy project.
- Prebuilt macOS ARM64 wheels are now available for all spaCy dependencies distributed by @explosion.
🔴 Bug fixes
- Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted `Doc` objects.
- Fix issue #10685: Fix serialization of `SpanGroup` objects that share the same name within one `SpanGroups` container.
- Fix issue #10718: Remove debug print statements in `walk_head_nodes` to avoid acquiring the GIL.
- Fix issue #10741: Make the `StringStore.__getitem__` return type dependent on its parameter type.
- Fix issue #10734: Support removal of overlapping terms in `PhraseMatcher`.
- Fix issue #10772: Override `SpanGroups.setdefault` to also support `Iterable[SpanGroup]` as the default.
- Fix issue #10817: Ensure that the term `ROOT` is in the glossary.
- Fix issue #10830: Better errors for `Doc.has_annotation` and `Matcher`.
- Fix issue #10864: Avoid pickling `Doc` inputs passed to `Language.pipe()`.
- Fix issue #10898: Fix schemas import in `Doc`.
⚠️ Backward incompatibilities
- Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline itself through the `name` attribute. For example, the following pipeline component:

      [components.transformer]
      factory = "transformer"
      name = "custom_transformer_name"

  would be registered erroneously as `custom_transformer_name`. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as `transformer`.
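The rule can be sketched with the standard library's `configparser`, which is close enough to spaCy's config syntax for this fragment. The helper `registered_name` is hypothetical and only mimics the behaviour described above: the component name comes from the section header, and a conflicting `name` setting is ignored with a warning. It does not show how spaCy itself parses configs.

```python
import configparser
import json
import warnings

CONFIG = """
[components.transformer]
factory = "transformer"
name = "custom_transformer_name"
"""


def registered_name(section: str, settings: dict) -> str:
    # The name is taken from the section header: "components.<name>".
    name_from_section = section.split(".", 1)[1]
    override = settings.get("name")
    if override is not None and override != name_from_section:
        warnings.warn(
            f"Ignoring name override {override!r}; using {name_from_section!r}"
        )
    return name_from_section


parser = configparser.ConfigParser()
parser.read_string(CONFIG)
section = "components.transformer"
# Values in the config are JSON-formatted, so decode the quoted strings.
settings = {key: json.loads(value) for key, value in parser[section].items()}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    name = registered_name(section, settings)

print(name)         # transformer
print(len(caught))  # 1
```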
👥 Contributors
@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg
v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish
✨ New features and improvements
- Improved speeds for many components, see speed benchmarks for trained pipelines:
  - Speed up parser and NER by using constant-time head lookups (#10048).
  - Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
  - Speed up parser projectivization functions (#10241).
  - Replace `Ragged` with faster `AlignmentArray` in `Example` for training (#10319).
  - Improve `Matcher` speed (#10659).
  - Improve serialization speed for empty `Doc.spans` (#10250).
- NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or using the quickstart.
- Language updates:
- Big endian support with `thinc` v8.0.14+ and `thinc-bigendian-ops`.
- Config comparisons with `spacy debug diff-config`.
- displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
- `SpanCategorizer.set_candidates` for debugging span suggesters.
- The quickstart now supports adding `spancat` and `trainable_lemmatizer` components.
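The tagger speed-up listed above (unnormalized softmax probabilities) appears to rely on a simple property: softmax is monotonic, so the highest-scoring class is the same whether or not the scores are normalized. When only the predicted label is needed at inference time, the normalization step can be skipped. Below is a small pure-Python check of that property; the scores are made-up values, and this is not spaCy's or Thinc's code.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


raw_scores = [1.2, -0.3, 4.5, 0.0]  # made-up unnormalized scores for 4 tags
probabilities = softmax(raw_scores)

# The predicted tag is identical with and without normalization.
print(argmax(raw_scores), argmax(probabilities))  # 2 2
```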
📦 Trained pipelines
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
| Package | Language | UPOS | Parser LAS | NER F |
|---|---|---|---|---|
| `fi_core_news_sm` | Finnish | 92.5 | 71.9 | 75.9 |
| `fi_core_news_md` | Finnish | 95.9 | 78.6 | 80.6 |
| `fi_core_news_lg` | Finnish | 96.2 | 79.4 | 82.4 |
| `ko_core_news_sm` | Korean | 86.1 | 65.6 | 71.3 |
| `ko_core_news_md` | Korean | 94.7 | 80.9 | 83.1 |
| `ko_core_news_lg` | Korean | 94.7 | 81.3 | 85.3 |
| `sv_core_news_sm` | Swedish | 95.0 | 75.9 | 74.7 |
| `sv_core_news_md` | Swedish | 96.3 | 78.5 | 79.3 |
| `sv_core_news_lg` | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
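The "no out-of-vocabulary words" property of the floret-based pipelines described above follows from how hashed subword embeddings work: a word is split into character n-grams, each n-gram is hashed into a fixed-size table, and the word vector is built from the rows that were hit, so any string maps to some vector. The sketch below is a deliberately simplified illustration with assumed values (table size, n-gram length, CRC32 as the hash, averaging of rows); it does not reproduce floret's actual hashing scheme or parameters.

```python
import zlib
from typing import List

N_ROWS = 50   # assumed table size for the example
WIDTH = 4     # assumed vector width
NGRAM = 3     # assumed character n-gram length

# A toy embedding table with deterministic values.
TABLE = [[(row * WIDTH + col) / 100.0 for col in range(WIDTH)] for row in range(N_ROWS)]


def subwords(word: str, n: int = NGRAM) -> List[str]:
    padded = f"<{word}>"  # mark word boundaries
    if len(padded) <= n:
        return [padded]
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]


def vector(word: str) -> List[float]:
    # Hash each n-gram to a row and average the rows.
    rows = [zlib.crc32(gram.encode("utf8")) % N_ROWS for gram in subwords(word)]
    return [sum(TABLE[r][c] for r in rows) / len(rows) for c in range(WIDTH)]


print(subwords("cat"))               # ['<ca', 'cat', 'at>']
print(len(vector("neverseenword")))  # 4
```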
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
|---|---|---|
| `da_core_news_md` | 84.9 | 94.8 |
| `de_core_news_md` | 73.4 | 97.7 |
| `el_core_news_md` | 56.5 | 88.9 |
| `fi_core_news_md` | - | 86.2 |
| `it_core_news_md` | 86.6 | 97.2 |
| `ko_core_news_md` | - | 90.0 |
| `lt_core_news_md` | 71.1 | 84.8 |
| `nb_core_news_md` | 76.7 | 97.1 |
| `nl_core_news_md` | 81.5 | 94.0 |
| `pl_core_news_md` | 87.1 | 93.7 |
| `pt_core_news_md` | 76.7 | 96.9 |
| `ro_core_news_md` | 81.8 | 95.5 |
| `sv_core_news_md` | - | 95.5 |
🔴 Bug fixes
- Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
- Fix issue #9443: Fix `Scorer.score_cats` for missing labels.
- Fix issue #9669: Fix entity linker batching.
- Fix issue #9903: Handle `_` value for UPOS in CoNLL-U converter.
- Fix issue #9904: Fix textcat loss scaling.
- Fix issue #9956: Compare all `Span` attributes consistently.
- Fix issue #10073: Add `"spans"` to the output of `doc.to_json`.
- Fix issue #10086: Add tokenizer option to allow `Matcher` handling for all special cases.
- Fix issue #10189: Allow `Example` to align whitespace annotation.
- Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
- Fix issue #10347: Update basic functionality for `rehearse`.
- Fix issue #10394: Fix `Vectors.n_keys` for floret vectors.
- Fix issue #10400: Use `meta` in `util.load_model_from_config`.
- Fix issue #10451: Fix `Example.get_matching_ents`.
- Fix issue #10460: Fix initial special cases for `Tokenizer.explain`.
- Fix issue #10521: Stream large assets on download in spaCy projects.
- Fix issue #10536: Handle unknown tags in `KoreanTokenizer` tag map.
- Fix issue #10551: Add automatic vector deduplication for `init vectors`.
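The last item above adds automatic vector deduplication to `init vectors` (#10551). The general idea, sketched here in plain Python, is to store each distinct vector once and let several keys point to the same row. The function name `deduplicate` and the sample data are hypothetical; this is not the code spaCy uses.

```python
from typing import Dict, List, Tuple


def deduplicate(
    vectors: Dict[str, List[float]]
) -> Tuple[List[List[float]], Dict[str, int]]:
    """Return unique rows plus a key -> row index mapping."""
    rows: List[List[float]] = []
    row_index: Dict[Tuple[float, ...], int] = {}
    key_to_row: Dict[str, int] = {}
    for key, vec in vectors.items():
        signature = tuple(vec)
        if signature not in row_index:
            row_index[signature] = len(rows)
            rows.append(vec)
        key_to_row[key] = row_index[signature]
    return rows, key_to_row


rows, key_to_row = deduplicate(
    {"colour": [0.1, 0.2], "color": [0.1, 0.2], "shade": [0.3, 0.4]}
)
print(len(rows))   # 2
print(key_to_row)  # {'colour': 0, 'color': 0, 'shade': 1}
```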
🚀 Notes about upgrading from v3.2
- To see the speed improvements for the `Tagger` architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
- Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
- Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
- `Doc.from_docs` now includes `Doc.tensor` by default and supports excludes with an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
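The span ordering change in the notes above can be pictured as comparing tuples of attributes. The sketch below uses a hypothetical `ToySpan` dataclass whose fields follow the order given in the note (start, end, label, KB ID). Labels are plain strings here for readability; the real `Span` comparison is implemented inside spaCy and may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ToySpan:
    # order=True compares fields in declaration order, like a tuple.
    start: int
    end: int
    label: str = ""
    kb_id: str = ""


spans = [
    ToySpan(0, 2, label="PERSON"),
    ToySpan(0, 2, label="ORG"),
    ToySpan(0, 1, label="PERSON"),
]

# Spans with the same start and end are ordered by label as well.
for span in sorted(spans):
    print(span.start, span.end, span.label)
# 0 1 PERSON
# 0 2 ORG
# 0 2 PERSON
```

If your code depends on a particular order for spans that share the same boundaries, sorting with an explicit key is a safer choice than relying on the default comparison.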
📖 Documentation and examples
- spaCy universe additions:
  - classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
  - Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
  - Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
  - EDS-NLP: spaCy components to extract information from clinical notes written in French.
  - HuSpaCy: Industrial-strength Hungarian natural language processing.
  - Klayers: spaCy as an AWS Lambda Layer.
  - Named Entity Recognition (NER) using spaCy (video).
  - Scrubadub: Remove personally identifiable information from text using spaCy.
  - spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
  - tmtoolkit: Text mining and topic modeling toolkit.
👥 Contributors
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996