Releases: explosion/spaCy
v3.0.2: CLI overrides and env variables in projects, base support for Setswana, PhraseMatcher for spans and bug fixes
✨ New features and improvements
- NEW: Base support for Setswana.
- The `PhraseMatcher` can now also be run on `Span` objects.
- Support CLI overrides and environment variables in `project.yml`: a section `env` defines environment variable names that can be used in commands. The `project run` command now also supports CLI overrides, e.g. `--vars.batch_size 128`.
- Reduce memory load when reading all vectors from file during initialization.
- Update recommended transformers in training quickstart and `init config` CLI.
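As a minimal sketch of running the `PhraseMatcher` on a `Span`, the snippet below restricts matching to the second sentence of a short text. It assumes spaCy v3.0.2 or later is installed; the example text, the key name `PERSON` and the token slice are illustrative choices and do not come from the release notes.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no trained model needed
matcher = PhraseMatcher(nlp.vocab)
# "PERSON" is just an illustrative key for this sketch
matcher.add("PERSON", [nlp.make_doc("Barack Obama"), nlp.make_doc("Angela Merkel")])

doc = nlp("Barack Obama visited Berlin. Angela Merkel stayed in Berlin.")
second_sentence = doc[5:]  # Span covering "Angela Merkel stayed in Berlin."

# Match only within the Span; as_spans=True returns Span objects instead of tuples
span_matches = [m.text for m in matcher(second_sentence, as_spans=True)]
# For comparison, match over the whole Doc
doc_matches = [m.text for m in matcher(doc, as_spans=True)]
print(span_matches, doc_matches)
```

Passing a `Span` is handy when only part of a document, such as one sentence or one section, should be searched.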
🔴 Bug fixes
- Fix issue #6826: Ensure the loss value is cast to a float.
- Fix issue #6891: Include `noun_chunks` when pickling `Vocab`.
- Fix issue #6908: Fix expected type for textcat labels.
- Fix issue #6924: Correctly pass `vocab` forward in `spacy.blank`.
- Fix issue #6950: Allow pickling `Tok2Vec` with listeners.
- Fix issue #6983: Ensure `is_same_func` works correctly for classes in component decorator.
- Fix issue #7019: Correctly handle non-float/int values in `spacy evaluate` printer.
- Fix issue #7029: Fix listener architecture with empty `Doc` in batch.
📖 Documentation and examples
- Improve installation instructions.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @peter-exos, @KoichiYasuoka, @tarskiandhutch, @reneoctavio, @melonwater211, @mapmeld and @Shumie82 for the pull requests and contributions.
v3.0.1: Bug fixes for transformer training
🔴 Bug fixes
- Fix issue #6883: Fix bug in transformer training for `Cannot get dimension 'nO' for model 'transformer': value unset`.
v3.0.0: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more
📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →
🚀 Quickstart
For the smoothest updating process, we recommend starting with a fresh virtual environment.
pip install -U spacy
- New in v3.0: New features, backwards incompatibilities and migration guide.
- Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
- Training Quickstart: Generate a training config for your specific use case.
- Benchmarks: Results and accuracy comparisons.
- Projects & Project Templates: Get started by cloning a project template.
✨ New features and improvements
- Transformer-based pipelines with support for multi-task learning.
- Retrained model families for 18+ languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
- Retrained pipelines for all supported languages, plus new core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for the contributions!
- New training workflow and config system.
- Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
- spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
- Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
- Parallel training and distributed computing with Ray.
- New built-in pipeline components: `SentenceRecognizer`, `Morphologizer`, `Lemmatizer`, `AttributeRuler` and `Transformer`.
- New and improved pipeline component API and decorators for custom components.
- Source trained components from other pipelines in your training config.
- Pre-built and more efficient binary wheels for all trained pipeline packages.
- `DependencyMatcher` for matching patterns within the dependency parse using Semgrex operators.
- Support for greedy patterns in `Matcher`.
- New data structure `SpanGroup` for efficiently storing collections of potentially overlapping spans via `Doc.spans`.
- Type hints and type-based data validation for custom registered functions.
- Various new methods, attributes and commands.
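To illustrate the `SpanGroup` item above, here is a small sketch that stores two overlapping spans under one key in `Doc.spans`. It assumes spaCy v3.0 or later; the key name `candidates`, the labels and the example sentence are made up for the illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, no trained pipeline required
doc = nlp("Welcome to the Bank of China.")
# Tokens: Welcome(0) to(1) the(2) Bank(3) of(4) China(5) .(6)

# Unlike doc.ents, a span group may contain overlapping spans.
# "candidates" is an arbitrary key chosen for this sketch.
doc.spans["candidates"] = [
    Span(doc, 3, 6, label="ORG"),  # "Bank of China"
    Span(doc, 5, 6, label="GPE"),  # "China", overlaps with the span above
]

texts = [(span.text, span.label_) for span in doc.spans["candidates"]]
print(texts)
```

Because `doc.ents` does not allow overlaps, a span group is the more natural place for candidate or nested annotations like these.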
📺 Video introductions & tutorials
- spaCy v3: State-of-the-art NLP from Prototype to Production
- spaCy v3: Design concepts explained (behind the scenes)
- spaCy v3: Custom trainable relation extraction component
📦 Trained pipelines (58)
To download a trained pipeline, you can use the `spacy download` command. See the training documentation for details on how to train your own pipelines on your data.
Name | Language | POS | TAG | LAS | UAS | NER | Sent | Size | |
---|---|---|---|---|---|---|---|---|---|
da_core_news_lg v3.0.0 | Danish | 0.97 | 0.97 | 0.78 | 0.82 | 0.82 | 0.88 | 547 MB | 📖 |
da_core_news_md v3.0.0 | Danish | 0.96 | 0.96 | 0.78 | 0.82 | 0.81 | 0.86 | 47 MB | 📖 |
da_core_news_sm v3.0.0 | Danish | 0.95 | 0.95 | 0.76 | 0.81 | 0.72 | 0.86 | 17 MB | 📖 |
de_core_news_lg v3.0.0 | German | 0.98 | 0.98 | 0.91 | 0.93 | 0.85 | 0.95 | 546 MB | 📖 |
de_core_news_md v3.0.0 | German | 0.98 | 0.98 | 0.91 | 0.93 | 0.84 | 0.95 | 47 MB | 📖 |
de_core_news_sm v3.0.0 | German | 0.98 | 0.97 | 0.90 | 0.92 | 0.82 | 0.94 | 18 MB | 📖 |
de_dep_news_trf v3.0.0 | German | 0.99 | 0.99 | 0.95 | 0.96 | n/a | 0.98 | 393 MB | 📖 |
el_core_news_lg v3.0.0 | Greek | 0.97 | 0.94 | 0.85 | 0.88 | 0.80 | 1.00 | 544 MB | 📖 |
el_core_news_md v3.0.0 | Greek | 0.96 | 0.93 | 0.84 | 0.87 | 0.79 | 1.00 | 42 MB | 📖 |
el_core_news_sm v3.0.0 | Greek | 0.94 | 0.91 | 0.81 | 0.85 | 0.72 | 1.00 | 12 MB | 📖 |
en_core_web_lg v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.86 | 0.89 | 742 MB | 📖 |
en_core_web_md v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.85 | 0.89 | 44 MB | 📖 |
en_core_web_sm v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.84 | 0.89 | 13 MB | 📖 |
en_core_web_trf v3.0.0 | English | n/a | 0.98 | 0.94 | 0.95 | 0.90 | 0.89 | 438 MB | 📖 |
es_core_news_lg v3.0.0 | Spanish | 0.99 | 0.98 | 0.88 | 0.91 | 0.90 | 1.00 | 547 MB | 📖 |
es_core_news_md v3.0.0 | Spanish | 0.99 | 0.98 | 0.88 | 0.91 | 0.90 | 1.00 | 46 MB | 📖 |
es_core_news_sm v3.0.0 | Spanish | 0.98 | 0.97 | 0.87 | 0.90 | 0.89 | 1.00 | 17 MB | 📖 |
es_dep_news_trf v3.0.0 | Spanish | 0.99 | 0.98 | 0.93 | 0.95 | n/a | 0.97 | 395 MB | 📖 |
fr_core_news_lg v3.0.0 | French | 0.98 | 0.95 | 0.86 | 0.90 | 0.82 | 0.88 | 546 MB | 📖 |
fr_core_news_md v3.0.0 | French | 0.97 | 0.94 | 0.85 | 0.89 | 0.81 | 0.87 | 45 MB | 📖 |
fr_core_news_sm v3.0.0 | French | 0.96 | 0.93 | 0.84 | 0.88 | 0.79 | 0.85 | 16 MB | 📖 |
fr_dep_news_trf v3.0.0 | French | 0.99 | 0.96 | 0.92 | 0.94 | n/a | 0.94 | 381 MB | 📖 |
it_core_news_lg v3.0.0 | Italian | 0.98 | 0.97 | 0.88 | 0.91 | 0.89 | 0.97 | 545 MB | 📖 |
[it_core_news_md ](https://spacy.io/models/it#it_core_news... |
v3.0.0rc3: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more
🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.
⚠️ ⚠️ ⚠️ Make sure to retrain your models!⚠️ ⚠️ ⚠️
This release includes changes to the config and model architectures, so if you've trained a custom pipeline with `v3.0.0rc1` or `v3.0.0rc2`, you'll need to retrain it. We recommend using the new spaCy projects system to make it easy to re-run your training process. To auto-fill and update your configs, you can use the `init fill-config` command.
🚀 Quickstart
pip install -U spacy-nightly --pre
- Introducing spaCy v3.0 nightly
- New in v3.0: New features, backwards incompatibilities and migration guide.
- Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
- Training Quickstart: Generate a training config for your specific use case.
- Benchmarks: Results and accuracy comparisons.
- Projects & Project Templates: Get started by cloning a project template.
✨ New features and improvements
- Transformer-based pipelines with support for multi-task learning.
- Retrained model families for 18 languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
- New core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for their contributions!
- New training workflow and config system.
- Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
- spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
- Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
- Parallel training and distributed computing with Ray.
- New built-in pipeline components: `SentenceRecognizer`, `Morphologizer`, `Lemmatizer`, `AttributeRuler` and `Transformer`.
- New and improved pipeline component API and decorators for custom components.
- Source trained components from other pipelines in your training config.
- `DependencyMatcher` for matching patterns within the dependency parse using Semgrex operators.
- Support for greedy patterns in `Matcher`.
- Type hints and type-based data validation for custom registered functions.
- Various new methods, attributes and commands.
⚠️ Backwards incompatibilities
For more info on how to migrate from spaCy v2.x, see the detailed migration guide.
API changes
- Pipeline package symlinks, the `link` command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like `en_core_web_sm` explicitly.
- A pipeline's `meta.json` is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the `config.cfg`, which also includes all settings used to train the pipeline.
- The `train`, `pretrain` and `debug data` commands now only take a `config.cfg`.
- `Language.add_pipe` now takes the string name of the component factory instead of the component function.
- Custom pipeline components now need to be decorated with the `@Language.component` or `@Language.factory` decorator.
- The `Language.update`, `Language.evaluate` and `TrainablePipe.update` methods now all take batches of `Example` objects instead of `Doc` and `GoldParse` objects, or raw text and a dictionary of annotations.
- The `begin_training` methods have been renamed to `initialize` and now take a function that returns a sequence of `Example` objects to initialize the model instead of a list of tuples.
- `Matcher.add` and `PhraseMatcher.add` now only accept a list of patterns as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.
- The `Doc` flags like `Doc.is_parsed` or `Doc.is_tagged` have been replaced by `Doc.has_annotation`.
- The `spacy.gold` module has been renamed to `spacy.training`.
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the more flexible `AttributeRuler`.
- The `Lemmatizer` is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
- Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
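Several of the API changes above can be seen together in a short sketch: a custom component registered with `@Language.component`, added by its string name, and a `Matcher` whose patterns are passed as a list. It assumes spaCy v3.0 or later; the component name `token_counter`, the `n_tokens` key and the `GREETING` pattern are hypothetical names chosen for this example.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher


# Custom components must be registered with a decorator in v3.
# "token_counter" is a hypothetical name used only in this sketch.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# add_pipe takes the registered string name, not the function object
nlp.add_pipe("token_counter")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
# Patterns are passed as a list; on_match is an optional keyword argument
matcher.add("GREETING", [pattern], on_match=None)

doc = nlp("Hello world, hello World!")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(nlp.pipe_names, doc.user_data["n_tokens"], matches)
```

In v2 the equivalent calls would have been `nlp.add_pipe(token_counter)` and `matcher.add("GREETING", None, pattern)`, which is the usage these changes replace.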
Removed or renamed API
Removed | Replacement |
---|---|
`Language.disable_pipes` | `Language.select_pipes`, `Language.disable_pipe`, `Language.enable_pipe` |
`Language.begin_training`, `Pipe.begin_training`, ... | `Language.initialize`, `Pipe.initialize`, ... |
`Doc.is_tagged`, `Doc.is_parsed`, ... | `Doc.has_annotation` |
`GoldParse` | `Example` |
`GoldCorpus` | `Corpus` ... |
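A brief sketch of two replacements from the table, `Language.select_pipes` and `Doc.has_annotation`, is shown below. It assumes spaCy v3.0 or later and uses the built-in `sentencizer` only because it needs no trained model; the example text is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, no training needed

text = "This is one. This is two."

# select_pipes temporarily disables components inside the with-block
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc_without = nlp(text)

doc_with = nlp(text)

# has_annotation replaces flags such as Doc.is_sentenced or Doc.is_tagged
has_sents_without = doc_without.has_annotation("SENT_START")
has_sents_with = doc_with.has_annotation("SENT_START")
print(active_inside, has_sents_without, has_sents_with)
```

After the `with` block exits, the disabled component is restored, so the second call to `nlp` sets sentence boundaries again.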
v2.3.5: Bug fixes and simpler source installs
✨ New features and improvements
- Modify `blis` and `numpy` build dependencies to simplify source installations.
- Support `cupy` v8+ in combination with `thinc` v7.4.5.
🔴 Bug fixes
- Fix issue #6443: Only set `NORM` on token in retokenizer.
- Fix issue #6453: Add `SPACY` as a `Matcher` attribute.
- Fix issue #6512: Add `nlp.max_length` check to `nlp.pipe` through `nlp.make_doc`.
- Fix issue #6515: Add missing `.pipe` methods to Chinese, Japanese, Korean and Thai tokenizers.
- Fix issue #6518: Fix subsequent pipe detection in `EntityRuler`.
- Fix issue #6523: Remove non-working `--use-chars` from train CLI.
👥 Contributors
Thanks to @KoichiYasuoka for the pull requests and contributions.
v2.3.4: Fix beam parser API
🔴 Bug fixes
- Fix issue #6446: Restore `cleanup_beam` method.
📖 Documentation and examples
- Update rule-based matching docs
👥 Contributors
Thanks to @jabortell for the pull requests and contributions.
v2.3.3: Alpha support for Macedonian and Sanskrit, updates for many languages and bug fixes
✨ New features and improvements
- NEW: Add alpha support for Macedonian and Sanskrit.
- Update language data for Croatian, Czech, English, Hebrew, Hindi, Indonesian, Swedish, Thai and Turkish.
- Add support for aarch64 and ppc64le on Linux with binary packages available on conda-forge.
🔴 Bug fixes
- Fix issue #5610: Make sure `sys.argv` exists.
- Fix issue #5643: Add `ent_id_` to strings serialized with `Doc`.
- Fix issue #5727: Clarify warning for misaligned BILUO tags.
- Fix issue #5768: Improve tag map initialization and updating.
- Fix issue #5794: Improve warnings around normalization tables.
- Fix issue #5796: Update invalid tag maps.
- Fix issue #5799: Remove hard-coded GPU ID from `pretrain`.
- Fix issue #5802: Mark Japanese documents as tagged.
- Fix issue #5823: Fix typo in unit tests.
- Fix issue #5838: Fix `EntityRenderer` to support break lines (after last entity).
- Fix issue #5843: Prefer earlier spans in `EntityRuler`.
- Fix issue #5849: Allow `Doc.char_span` to snap to token boundaries.
- Fix issue #5853: Fix span boundary handling in Spanish noun chunks.
- Fix issue #5861: Add `Span` index boundary checks.
- Fix issue #5904: Fix typos in comments.
- Fix issue #5910: Update default sentencizer characters for Armenian, Greek and Arabic.
- Fix issue #6014: Fix off-by-one error for best iteration calculation.
- Fix issue #6112: Fix overlapping German noun chunks.
- Fix issue #6148: Identify final `Matcher` pattern node by quantifier.
- Fix issue #6164: Reorder so tag map is replaced only if a custom file is provided.
- Fix issue #6218: Reproducibility for `TextCategorizer` and `Tok2Vec`.
- Fix issue #6219: Add re-enabled pipe names back to the meta before serializing.
- Fix issue #6300: Fix `on_match` callback and exclude empty match lists from results for `DependencyMatcher`.
- Fix issue #6347: Memory leak issues with `beam_parse` (requires `thinc>=7.4.3`).
- Fix issue #6373: Bugfix textcat reproducibility on GPU (requires `thinc>=7.4.3`).
- Fix issue #6405: Add all vectors to vocab before pruning.
- Fix issue #6413: Use int8_t instead of char in `Matcher`.
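The idea behind letting `Doc.char_span` snap to token boundaries (issue #5849) can be sketched in plain Python. This is not spaCy's implementation: the function `snap_char_span` and its arguments are hypothetical, and the mode names simply mirror the `strict`, `contract` and `expand` alignment modes that spaCy exposes.

```python
from typing import List, Optional, Tuple


def snap_char_span(
    token_offsets: List[Tuple[int, int]], start: int, end: int, mode: str = "strict"
) -> Optional[Tuple[int, int]]:
    """Map a character range onto a (start_token, end_token) slice.

    token_offsets holds one (start_char, end_char) pair per token, in order.
    Returns None if no valid token span exists for the chosen mode.
    """
    if mode == "strict":
        # Both boundaries must coincide exactly with token boundaries
        covered = [i for i, (s, e) in enumerate(token_offsets) if s >= start and e <= end]
        if not covered:
            return None
        first, last = covered[0], covered[-1]
        if token_offsets[first][0] != start or token_offsets[last][1] != end:
            return None
    elif mode == "contract":
        # Keep only tokens that lie completely inside the character range
        covered = [i for i, (s, e) in enumerate(token_offsets) if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the character range at all
        covered = [i for i, (s, e) in enumerate(token_offsets) if s < end and e > start]
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not covered:
        return None
    return covered[0], covered[-1] + 1


# Character offsets for the tokens of "I like New York"
offsets = [(0, 1), (2, 6), (7, 10), (11, 15)]
print(snap_char_span(offsets, 8, 15, "strict"))
print(snap_char_span(offsets, 8, 15, "contract"))
print(snap_char_span(offsets, 8, 15, "expand"))
```

The character range 8 to 15 starts in the middle of "New", so the strict mode rejects it, while the other two modes shrink or grow the range to the nearest token boundaries.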
👥 Contributors
Thanks to @abchapman93, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @BramVanroy, @chopeen, @danielvasic, @delzac, @DuyguA, @erip, @florijanstamenkovic, @graue70, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jgutix, @KKsharma99, @leyendecker, @lizhe2004, @MartinoMensio, @nipunsadvilkar, @Nuccy90, @oculusrepairo, @rahul1990gupta, @rasyidf, @robertsipek, @SamEdwardes, @snsten, @solarmist, @Stannislav, @tamuhey, @tilusnet, @vha14, @wannaphong, @zaibacu for the pull requests and contributions.
v3.0.0rc1: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more
🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.
🚀 Quickstart
pip install -U spacy-nightly --pre
- Introducing spaCy v3.0 nightly
- New in v3.0: New features, backwards incompatibilities and migration guide.
- Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
- Training Quickstart: Generate a training config for your specific use case.
- Benchmarks: Results and accuracy comparisons.
- Projects & Project Templates: Get started by cloning a project template.
✨ New features and improvements
- Transformer-based pipelines with support for multi-task learning.
- Retrained model families for 16 languages and 52 trained pipelines in total, including 6 transformer-based pipelines.
- New training workflow and config system.
- Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
- spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
- Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
- Parallel training and distributed computing with Ray.
- New built-in pipeline components: `SentenceRecognizer`, `Morphologizer`, `Lemmatizer`, `AttributeRuler` and `Transformer`.
- New and improved pipeline component API and decorators for custom components.
- Source trained components from other pipelines in your training config.
- `DependencyMatcher` for matching patterns within the dependency parse using Semgrex operators.
- Support for greedy patterns in `Matcher`.
- Type hints and type-based data validation for custom registered functions.
- Various new methods, attributes and commands.
⚠️ Backwards incompatibilities
For more info on how to migrate from spaCy v2.x, see the detailed migration guide.
API changes
- Pipeline package symlinks, the `link` command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like `en_core_web_sm` explicitly.
- A pipeline's `meta.json` is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the `config.cfg`, which also includes all settings used to train the pipeline.
- The `train`, `pretrain` and `debug data` commands now only take a `config.cfg`.
- `Language.add_pipe` now takes the string name of the component factory instead of the component function.
- Custom pipeline components now need to be decorated with the `@Language.component` or `@Language.factory` decorator.
- The `Language.update`, `Language.evaluate` and `TrainablePipe.update` methods now all take batches of `Example` objects instead of `Doc` and `GoldParse` objects, or raw text and a dictionary of annotations.
- The `begin_training` methods have been renamed to `initialize` and now take a function that returns a sequence of `Example` objects to initialize the model instead of a list of tuples.
- `Matcher.add` and `PhraseMatcher.add` now only accept a list of patterns as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.
- The `Doc` flags like `Doc.is_parsed` or `Doc.is_tagged` have been replaced by `Doc.has_annotation`.
- The `spacy.gold` module has been renamed to `spacy.training`.
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the more flexible `AttributeRuler`.
- The `Lemmatizer` is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
- Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
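The move from `GoldParse` to `Example` objects might look like the following sketch, which builds an `Example` from a raw `Doc` and a dictionary of annotations. It assumes spaCy v3.0 or later; the sentence and the entity offsets are invented for the illustration.

```python
import spacy
from spacy.training import Example  # spacy.gold was renamed to spacy.training

nlp = spacy.blank("en")
text = "Apple is opening an office in Paris"
# Character offsets: "Apple" is 0-5 and "Paris" is 30-35
annotations = {"entities": [(0, 5, "ORG"), (30, 35, "GPE")]}

# An Example pairs a predicted Doc with a reference Doc holding the gold annotations
example = Example.from_dict(nlp.make_doc(text), annotations)

gold_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
reference_has_ents = example.reference.has_annotation("ENT_IOB")
print(gold_ents, reference_has_ents)
```

A list of such `Example` objects is what `Language.update` and `Language.evaluate` expect in v3, and a function returning them is what `initialize` takes.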
Removed or renamed API
Removed | Replacement |
---|---|
`Language.disable_pipes` | `Language.select_pipes`, `Language.disable_pipe`, `Language.enable_pipe` |
`Language.begin_training`, `Pipe.begin_training`, ... | `Language.initialize`, `Pipe.initialize`, ... |
`Doc.is_tagged`, `Doc.is_parsed`, ... | `Doc.has_annotation` |
`GoldParse` | `Example` |
`GoldCorpus` | `Corpus` |
`KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | `KnowledgeBase.from_disk`, `KnowledgeBase.to_disk` |
`Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
`gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | `training.biluo_tags_to_offsets`, `training.biluo_tags_to_spans`, `training.offsets_to_biluo_tags` |
... |
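The renamed BILUO helpers from the last table row can be tried with a short round trip. This sketch assumes spaCy v3.0 or later; the sentence and the `LOC` entity are example data only.

```python
import spacy
from spacy.training import biluo_tags_to_offsets, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London.")
entities = [(7, 13, "LOC")]  # character offsets of "London"

# v2: gold.biluo_tags_from_offsets  ->  v3: training.offsets_to_biluo_tags
tags = offsets_to_biluo_tags(doc, entities)

# v2: gold.offsets_from_biluo_tags  ->  v3: training.biluo_tags_to_offsets
round_trip = biluo_tags_to_offsets(doc, tags)
print(tags, round_trip)
```

The new names read in the direction of the conversion (`X_to_Y`), which is the only change in how these helpers are called here.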
v2.3.2: Improved Korean tokenizer speed, experimental character-based pretraining and bug fixes
✨ New features and improvements
- Improve Korean tokenizer speed.
- Add experimental character-based pretraining.
🔴 Bug fixes
- Fix issue #5728: Fix French lemmatizer.
- Fix issue #5729: Fix lemmatizer for Python 2.7.
- Fix issue #5751: Fix meta serialization in train CLI.
👥 Contributors
Thanks to @graue70, @mikeizbicki, @jbesomi, @gandersen101 and @DeNeutoy for the pull requests and contributions.
v2.3.1: Alpha support for Nepali, updated Armenian and Japanese language data and bug fixes
✨ New features and improvements
- NEW: Add alpha support for Nepali.
- Refactor Japanese tokenizer and include additional custom tokenizer features.
- Update Armenian language data.
- Include spacy git commit in package and model meta for reference.
🔴 Bug fixes
- Fix issue #5620: Skip vocab in component config overrides.
- Fix issue #5634: Fix polarity of `Token.is_oov` and `Lexeme.is_oov`.
- Fix issue #5643: Add strings and `ENT_KB_ID` to `Doc` serialization.
- Fix issue #5648: Disregard special tag _SP in check for new tag map.
- Fix issue #5658: Move lemmatizer `is_base_form` to language settings.
👥 Contributors
Thanks to @myavrum, @mahnerak, @rameshhpathak, @hiroshi-matsuda-rit, @PluieElectrique, @hertelm and @alvaroabascar for the pull requests and contributions.