Releases · explosion/spaCy

14 Feb 01:50

ines

v3.0.2

b31471b

v3.0.2: CLI overrides and env variables in projects, base support for Setswana, PhraseMatcher for spans and bug fixes

✨ New features and improvements

NEW: Base support for Setswana.
The PhraseMatcher can now also be run on Span objects.
Support CLI overrides and environment variables in project.yml: an env section defines the names of environment variables that can be used in commands, and the project run command now also accepts CLI overrides, e.g. --vars.batch_size 128.
Reduce memory load when reading all vectors from file during initialization.
Update recommended transformers in training quickstart and init config CLI.
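
The dotted override syntax mentioned above (--vars.batch_size 128) can be pictured as a path into a nested mapping. The following is a minimal sketch of that idea using only the standard library. It is not spaCy's implementation, and the helper names parse_value and apply_overrides as well as the sample project dict are assumptions made for illustration.

```python
import json
from copy import deepcopy
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret a CLI string as JSON where possible, else keep it as text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config: Dict[str, Any], args: List[str]) -> Dict[str, Any]:
    """Return a copy of config with '--section.key value' pairs applied."""
    if len(args) % 2 != 0:
        raise ValueError("Overrides must come in '--name value' pairs")
    result = deepcopy(config)
    for name, raw in zip(args[0::2], args[1::2]):
        if not name.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {name!r}")
        # Split 'vars.batch_size' into parent keys and the final key.
        *parents, leaf = name[2:].split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw)
    return result


# Hypothetical stand-in for the 'vars' section of a project.yml.
project = {"vars": {"batch_size": 64, "lang": "en"}}
updated = apply_overrides(project, ["--vars.batch_size", "128"])
```

The original mapping is left untouched, so the defaults stay available while the overridden copy is used for a single run.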

🔴 Bug fixes

Fix issue #6826: Ensure the loss value is cast to a float.
Fix issue #6891: Include noun_chunks when pickling Vocab.
Fix issue #6908: Fix expected type for textcat labels.
Fix issue #6924: Correctly pass vocab forward in spacy.blank.
Fix issue #6950: Allow pickling Tok2Vec with listeners.
Fix issue #6983: Ensure is_same_func works correctly for classes in component decorator.
Fix issue #7019: Correctly handle non-float/int values in spacy evaluate printer.
Fix issue #7029: Fix listener architecture with empty Doc in batch.

📖 Documentation and examples

Improve installation instructions.
Fix various typos and inconsistencies.

👥 Contributors

Thanks to @peter-exos, @KoichiYasuoka, @tarskiandhutch, @reneoctavio, @melonwater211, @mapmeld and @Shumie82 for the pull requests and contributions.

02 Feb 10:55

adrianeboyd

v3.0.1

91a3cab

v3.0.1: Bug fixes for transformer training

🔴 Bug fixes

Fix issue #6883: Fix bug in transformer training that raised the error "Cannot get dimension 'nO' for model 'transformer': value unset".

01 Feb 13:41

ines

v3.0.0

a59f3fc

v3.0.0: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more

📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →

🚀 Quickstart

For the smoothest updating process, we recommend starting with a fresh virtual environment.

pip install -U spacy

New in v3.0: New features, backwards incompatibilities and migration guide.
Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
Training Quickstart: Generate a training config for your specific use case.
Benchmarks: Results and accuracy comparisons.
Projects & Project Templates: Get started by cloning a project template.

✨ New features and improvements

Transformer-based pipelines with support for multi-task learning.
Retrained model families for 18+ languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
Retrained pipelines for all supported languages, plus new core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for the contributions!
New training workflow and config system.
Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
Parallel training and distributed computing with Ray.
New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
New and improved pipeline component API and decorators for custom components.
Source trained components from other pipelines in your training config.
Pre-built and more efficient binary wheels for all trained pipeline packages.
DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
Support for greedy patterns in Matcher.
New data structure SpanGroup for efficiently storing collections of potentially overlapping spans via Doc.spans.
Type hints and type-based data validation for custom registered functions.
Various new methods, attributes and commands.

📺 Video introductions & tutorials

spaCy v3: State-of-the-art NLP from Prototype to Production
spaCy v3: Design concepts explained (behind the scenes)
spaCy v3: Custom trainable relation extraction component

📦 Trained pipelines (58)

To download a trained pipeline, you can use the spacy download command. See the training documentation for details on how to train your own pipelines on your data.

Name	Language	POS	TAG	LAS	UAS	NER	Sent	Size
`da_core_news_lg` v3.0.0	Danish	0.97	0.97	0.78	0.82	0.82	0.88	547 MB
`da_core_news_md` v3.0.0	Danish	0.96	0.96	0.78	0.82	0.81	0.86	47 MB
`da_core_news_sm` v3.0.0	Danish	0.95	0.95	0.76	0.81	0.72	0.86	17 MB
`de_core_news_lg` v3.0.0	German	0.98	0.98	0.91	0.93	0.85	0.95	546 MB
`de_core_news_md` v3.0.0	German	0.98	0.98	0.91	0.93	0.84	0.95	47 MB
`de_core_news_sm` v3.0.0	German	0.98	0.97	0.90	0.92	0.82	0.94	18 MB
`de_dep_news_trf` v3.0.0	German	0.99	0.99	0.95	0.96	n/a	0.98	393 MB
`el_core_news_lg` v3.0.0	Greek	0.97	0.94	0.85	0.88	0.80	1.00	544 MB
`el_core_news_md` v3.0.0	Greek	0.96	0.93	0.84	0.87	0.79	1.00	42 MB
`el_core_news_sm` v3.0.0	Greek	0.94	0.91	0.81	0.85	0.72	1.00	12 MB
`en_core_web_lg` v3.0.0	English	n/a	0.97	0.90	0.92	0.86	0.89	742 MB
`en_core_web_md` v3.0.0	English	n/a	0.97	0.90	0.92	0.85	0.89	44 MB
`en_core_web_sm` v3.0.0	English	n/a	0.97	0.90	0.92	0.84	0.89	13 MB
`en_core_web_trf` v3.0.0	English	n/a	0.98	0.94	0.95	0.90	0.89	438 MB
`es_core_news_lg` v3.0.0	Spanish	0.99	0.98	0.88	0.91	0.90	1.00	547 MB
`es_core_news_md` v3.0.0	Spanish	0.99	0.98	0.88	0.91	0.90	1.00	46 MB
`es_core_news_sm` v3.0.0	Spanish	0.98	0.97	0.87	0.90	0.89	1.00	17 MB
`es_dep_news_trf` v3.0.0	Spanish	0.99	0.98	0.93	0.95	n/a	0.97	395 MB
`fr_core_news_lg` v3.0.0	French	0.98	0.95	0.86	0.90	0.82	0.88	546 MB
`fr_core_news_md` v3.0.0	French	0.97	0.94	0.85	0.89	0.81	0.87	45 MB
`fr_core_news_sm` v3.0.0	French	0.96	0.93	0.84	0.88	0.79	0.85	16 MB
`fr_dep_news_trf` v3.0.0	French	0.99	0.96	0.92	0.94	n/a	0.94	381 MB
`it_core_news_lg` v3.0.0	Italian	0.98	0.97	0.88	0.91	0.89	0.97	545 MB
[`it_core_news_md`](https://spacy.io/models/it#it_core_news...

19 Jan 08:35

ines

v3.0.0rc3

76e25af

v3.0.0rc3: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more

Pre-release

🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.

⚠️⚠️⚠️ Make sure to retrain your models! ⚠️⚠️⚠️
This release includes changes to the config and model architectures, so if you've trained a custom pipeline with v3.0.0rc1 or v3.0.0rc2, you'll need to retrain it. We recommend using the new spaCy projects system to make it easy to re-run your training process. To auto-fill and update your configs, you can use the init fill-config command.

📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →

🚀 Quickstart

pip install -U spacy-nightly --pre

Introducing spaCy v3.0 nightly
New in v3.0: New features, backwards incompatibilities and migration guide.
Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
Training Quickstart: Generate a training config for your specific use case.
Benchmarks: Results and accuracy comparisons.
Projects & Project Templates: Get started by cloning a project template.

✨ New features and improvements

Transformer-based pipelines with support for multi-task learning.
Retrained model families for 18 languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
New core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for their contributions!
New training workflow and config system.
Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
Parallel training and distributed computing with Ray.
New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
New and improved pipeline component API and decorators for custom components.
Source trained components from other pipelines in your training config.
DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
Support for greedy patterns in Matcher.
Type hints and type-based data validation for custom registered functions.
Various new methods, attributes and commands.

⚠️ Backwards incompatibilities

For more info on how to migrate from spaCy v2.x, see the detailed migration guide.

API changes

Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
The train, pretrain and debug data commands now only take a config.cfg.
Language.add_pipe now takes the string name of the component factory instead of the component function.
Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
The spacy.gold module has been renamed to spacy.training.
The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
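
Several of the changes above can be seen together in a short sketch. It assumes spaCy v3 is installed; the component name token_counter, its counting logic and the pattern name HELLO_WORLD are invented for illustration and are not part of the release.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher


# Custom components are registered under a string name via the decorator.
@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # Store the token count in the Doc's free-form user_data dict.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# add_pipe takes the registered string name, not the function itself.
nlp.add_pipe("token_counter")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
# Patterns are passed as a list; on_match is an optional keyword argument.
matcher.add("HELLO_WORLD", [pattern], on_match=None)

doc = nlp("Hello world")
matches = matcher(doc)  # list of (match_id, start, end) tuples
```

Under v2, the equivalent calls would have passed the function object to add_pipe and the pattern as a positional argument after the callback, which is why existing code usually needs small edits in both places.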

Removed or renamed API

Removed	Replacement
`Language.disable_pipes`	`Language.select_pipes`, `Language.disable_pipe`, `Language.enable_pipe`
`Language.begin_training`, `Pipe.begin_training`, ...	`Language.initialize`, `Pipe.initialize`, ...
`Doc.is_tagged`, `Doc.is_parsed`, ...	`Doc.has_annotation`
`GoldParse`	`Example`
`GoldCorpus`	`Corpus` ...

11 Dec 14:44

adrianeboyd

v2.3.5

1d4b1de

v2.3.5: Bug fixes and simpler source installs

✨ New features and improvements

Modify blis and numpy build dependencies to simplify source installations.
Support cupy v8+ in combination with thinc v7.4.5.

🔴 Bug fixes

Fix issue #6443: Only set NORM on token in retokenizer.
Fix issue #6453: Add SPACY as a Matcher attribute.
Fix issue #6512: Add nlp.max_length check to nlp.pipe through nlp.make_doc.
Fix issue #6515: Add missing .pipe methods to Chinese, Japanese, Korean and Thai tokenizers.
Fix issue #6518: Fix subsequent pipe detection in EntityRuler.
Fix issue #6523: Remove non-working --use-chars from train CLI.

👥 Contributors

Thanks to @KoichiYasuoka for the pull requests and contributions.

26 Nov 00:36

adrianeboyd

v2.3.4

6fb3e47

v2.3.4: Fix beam parser API

🔴 Bug fixes

Fix issue #6446: Restore cleanup_beam method.

📖 Documentation and examples

Update rule-based matching docs.

👥 Contributors

Thanks to @jabortell for the pull requests and contributions.

24 Nov 17:42

adrianeboyd

v2.3.3

08fc876

v2.3.3: Alpha support for Macedonian and Sanskrit, updates for many languages and bug fixes

✨ New features and improvements

NEW: Add alpha support for Macedonian and Sanskrit.
Update language data for Croatian, Czech, English, Hebrew, Hindi, Indonesian, Swedish, Thai and Turkish.
Add support for aarch64 and ppc64le on Linux with binary packages available on conda-forge.

🔴 Bug fixes

Fix issue #5610: Make sure sys.argv exists.
Fix issue #5643: Add ent_id_ to strings serialized with Doc.
Fix issue #5727: Clarify warning for misaligned BILUO tags.
Fix issue #5768: Improve tag map initialization and updating.
Fix issue #5794: Improve warnings around normalization tables.
Fix issue #5796: Update invalid tag maps.
Fix issue #5799: Remove hard-coded GPU ID from pretrain.
Fix issue #5802: Mark Japanese documents as tagged.
Fix issue #5823: Fix typo in unit tests.
Fix issue #5838: Fix EntityRenderer to support line breaks (after the last entity).
Fix issue #5843: Prefer earlier spans in EntityRuler.
Fix issue #5849: Allow Doc.char_span to snap to token boundaries.
Fix issue #5853: Fix span boundary handling in Spanish noun chunks.
Fix issue #5861: Add Span index boundary checks.
Fix issue #5904: Fix typos in comments.
Fix issue #5910: Update default sentencizer characters for Armenian, Greek and Arabic.
Fix issue #6014: Fix off-by-one error for best iteration calculation.
Fix issue #6112: Fix overlapping German noun chunks.
Fix issue #6148: Identify final Matcher pattern node by quantifier.
Fix issue #6164: Reorder so tag map is replaced only if a custom file is provided.
Fix issue #6218: Ensure reproducibility for TextCategorizer and Tok2Vec.
Fix issue #6219: Add re-enabled pipe names back to the meta before serializing.
Fix issue #6300: Fix on_match callback and exclude empty match lists from results for DependencyMatcher.
Fix issue #6347: Fix memory leak issues with beam_parse (requires thinc>=7.4.3).
Fix issue #6373: Fix textcat reproducibility on GPU (requires thinc>=7.4.3).
Fix issue #6405: Add all vectors to vocab before pruning.
Fix issue #6413: Use int8_t instead of char in Matcher.

👥 Contributors

Thanks to @abchapman93, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @BramVanroy, @chopeen, @danielvasic, @delzac, @DuyguA, @erip, @florijanstamenkovic, @graue70, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jgutix, @KKsharma99, @leyendecker, @lizhe2004, @MartinoMensio, @nipunsadvilkar, @Nuccy90, @oculusrepairo, @rahul1990gupta, @rasyidf, @robertsipek, @SamEdwardes, @snsten, @solarmist, @Stannislav, @tamuhey, @tilusnet, @vha14, @wannaphong, @zaibacu for the pull requests and contributions.

15 Oct 15:35

ines

v3.0.0rc1

ff4267d

v3.0.0rc1: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more

Pre-release

🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.

🚀 Quickstart

pip install -U spacy-nightly --pre

Introducing spaCy v3.0 nightly
New in v3.0: New features, backwards incompatibilities and migration guide.
Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
Training Quickstart: Generate a training config for your specific use case.
Benchmarks: Results and accuracy comparisons.
Projects & Project Templates: Get started by cloning a project template.

✨ New features and improvements

Transformer-based pipelines with support for multi-task learning.
Retrained model families for 16 languages and 52 trained pipelines in total, including 6 transformer-based pipelines.
New training workflow and config system.
Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
Parallel training and distributed computing with Ray.
New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
New and improved pipeline component API and decorators for custom components.
Source trained components from other pipelines in your training config.
DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
Support for greedy patterns in Matcher.
Type hints and type-based data validation for custom registered functions.
Various new methods, attributes and commands.

⚠️ Backwards incompatibilities

For more info on how to migrate from spaCy v2.x, see the detailed migration guide.

API changes

Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
The train, pretrain and debug data commands now only take a config.cfg.
Language.add_pipe now takes the string name of the component factory instead of the component function.
Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
The spacy.gold module has been renamed to spacy.training.
The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
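
The switch from GoldParse to Example objects and from Doc flags to Doc.has_annotation can be sketched as follows. This assumes spaCy v3 is installed; the sample sentence and its entity annotation are made up for illustration.

```python
import spacy
from spacy.training import Example  # spacy.gold was renamed to spacy.training

nlp = spacy.blank("en")
text = "Apple is opening an office"
# Character offsets as (start, end, label); the annotation is invented.
annotations = {"entities": [(0, 5, "ORG")]}

# An Example pairs a predicted Doc with a reference Doc built from the dict.
example = Example.from_dict(nlp.make_doc(text), annotations)

# Doc.has_annotation replaces flags such as Doc.is_tagged.
has_tags = example.reference.has_annotation("TAG")

# Update and evaluate methods take a batch (list) of Example objects.
batch = [example]
```

Here has_tags is false because the blank pipeline never assigns tags, while the reference Doc carries the entity taken from the annotation dict.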

Removed or renamed API

Removed	Replacement
`Language.disable_pipes`	`Language.select_pipes`, `Language.disable_pipe`, `Language.enable_pipe`
`Language.begin_training`, `Pipe.begin_training`, ...	`Language.initialize`, `Pipe.initialize`, ...
`Doc.is_tagged`, `Doc.is_parsed`, ...	`Doc.has_annotation`
`GoldParse`	`Example`
`GoldCorpus`	`Corpus`
`KnowledgeBase.load_bulk`, `KnowledgeBase.dump`	`KnowledgeBase.from_disk`, `KnowledgeBase.to_disk`
`Matcher.pipe`, `PhraseMatcher.pipe`	not needed
`gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets`	`training.biluo_tags_to_offsets`, `training.biluo_tags_to_spans`, `training.offsets_to_biluo_tags`
...

13 Jul 16:09

adrianeboyd

v2.3.2

bf778f5

v2.3.2: Improved Korean tokenizer speed, experimental character-based pretraining and bug fixes

✨ New features and improvements

Improve Korean tokenizer speed.
Add experimental character-based pretraining.

🔴 Bug fixes

Fix issue #5728: Fix French lemmatizer.
Fix issue #5729: Fix lemmatizer for Python 2.7.
Fix issue #5751: Fix meta serialization in train CLI.

👥 Contributors

Thanks to @graue70, @mikeizbicki, @jbesomi, @gandersen101 and @DeNeutoy for the pull requests and contributions.

07 Jul 17:08

adrianeboyd

v2.3.1

5542915

v2.3.1: Alpha support for Nepali, updated Armenian and Japanese language data and bug fixes

✨ New features and improvements

NEW: Add alpha support for Nepali.
Refactor Japanese tokenizer and include additional custom tokenizer features.
Update Armenian language data.
Include spacy git commit in package and model meta for reference.

🔴 Bug fixes

Fix issue #5620: Skip vocab in component config overrides.
Fix issue #5634: Fix polarity of Token.is_oov and Lexeme.is_oov.
Fix issue #5643: Add strings and ENT_KB_ID to Doc serialization.
Fix issue #5648: Disregard special tag _SP in check for new tag map.
Fix issue #5658: Move lemmatizer is_base_form to language settings.

👥 Contributors

Thanks to @myavrum, @mahnerak, @rameshhpathak, @hiroshi-matsuda-rit, @PluieElectrique, @hertelm and @alvaroabascar for the pull requests and contributions.
