Releases: explosion/spaCy

v3.1.6: Workaround for Click/Typer issues

30 Mar 14:15
e147a52

🔴 Bug fixes

  • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
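
To check whether an environment has the Click release named in this note, a small standard-library script can compare the installed version against 8.1.0. This is only a sketch: the helper names are hypothetical, the exact version range spaCy pins is not stated here, and treating every Click release from 8.1.0 onward as affected is an assumption.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse the leading numeric 'X.Y.Z' part of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def is_click_8_1_or_later(version: str) -> bool:
    """Assumption: Click 8.1.0 and later may clash with Typer 0.4.0."""
    return parse_version(version) >= (8, 1, 0)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


click_version = installed_version("click")
if click_version is not None and is_click_8_1_or_later(click_version):
    print(f"Click {click_version} is installed; consider upgrading spaCy.")
```

Pre-release suffixes such as `rc1` are ignored by this parser, so it is a rough check rather than a full version comparison.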

👥 Contributors

@adrianeboyd, @honnibal, @ines

v3.2.4: Workaround for Click/Typer issues

29 Mar 18:34
b50fe5e

🔴 Bug fixes

  • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

👥 Contributors

@adrianeboyd, @honnibal, @ines

v3.2.3: Fix Tok2Vec for empty batches

01 Mar 12:13
99425de

🔴 Bug fixes

  • Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @honnibal, @ines

v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more

01 Mar 12:13
1355396

🔴 Bug fixes

  • Fix issue #9593: Use metaclass to subclass errors for easier pickling.
  • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
  • Fix issue #9979: Fix type of Lexeme.rank.
  • Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz

v3.0.8: Fix Tok2Vec for empty batches

01 Mar 12:12
f55b876

🔴 Bug fixes

  • Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines

v3.2.2: Improved NER and parser speeds, bug fixes and more

11 Feb 13:12
bbaf41f

✨ New features and improvements

  • Improved parser and ner speeds on long documents (see technical details in #10019).
  • Support for spancat components in debug data.
  • Support for ENT_IOB as a Matcher token pattern key.
  • Extended and improved types for many classes.
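
As a rough illustration of the ENT_IOB pattern key, the sketch below matches every token that begins an entity. The sentence, labels and rule name are made up for the example, and it assumes spaCy v3.2.2 or later is installed.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Build a Doc by hand and mark two entities (text and labels are made up).
doc = Doc(nlp.vocab, words=["Alice", "Smith", "visited", "Paris"])
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="GPE")]

matcher = Matcher(nlp.vocab)
# Match each token whose IOB value is "B", i.e. the first token of an entity.
matcher.add("ENTITY_START", [[{"ENT_IOB": "B"}]])

entity_starts = [doc[start:end].text for _, start, end in matcher(doc)]
print(entity_starts)
```

Here "Smith" is inside an entity ("I") and "visited" is outside ("O"), so only the first token of each entity is returned.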

🔴 Bug fixes

  • Fix issue #9735: Make floret murmurhash endian-neutral.
  • Fix issue #9738: Support string IOB values for ENT_IOB.
  • Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
  • Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
  • Fix issue #9979: Fix type for Lexeme.rank.
  • Fix issue #10026: Check for 0-size assets in spacy project.
  • Fix issue #10051: Consistently return scalars from similarity methods.
  • Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
  • Fix issue #10079: Fix label detection in debug data for components with custom names.
  • Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
  • Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
  • Fix issue #10143: Use simple suggester in spancat initialization.
  • Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
  • Fix issue #10192: Detect invalid package names in spacy package.
  • Fix issue #10223: Support mixed case in package names.
  • Fix issue #10234: Fix type in PhraseMatcher.

📖 Documentation and examples

  • Various documentation updates.
  • New spaCy version tags in spaCy universe.
  • New Dockerfile for repeatable website builds and easier local development.
  • New additions to spaCy universe:
    • Augmenty: a text augmentation library
    • Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
    • spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
    • spacypdfreader: easy PDF to text to spaCy text extraction
    • textnets: text analysis with networks

👥 Contributors

@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more

07 Dec 16:30
800737b

✨ New features and improvements

  • NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce the size of output docs.
  • NEW: ENT_ID and ENT_KB_ID to Matcher pattern attributes.
  • Support kb_id for entities in displaCy from Doc input.
  • Add Span.sents property for spans that cover more than one sentence.
  • Add EntityRuler.remove to remove patterns by id.
  • Make the Tagger neg_prefix configurable.
  • Use Language.pipe in Language.evaluate for more efficient processing.
  • Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
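
Two of the additions above, Span.sents and EntityRuler.remove, can be tried on a blank pipeline. This is a minimal sketch assuming spaCy v3.2.1 or later; the example text, the pattern and its id are invented for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical pattern with an id so that it can be removed later.
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion", "id": "explosion"}])

text = "Explosion builds tools. They are open source. Try them."
doc = nlp(text)
ents_before = [(ent.text, ent.label_) for ent in doc.ents]

# This span starts in the first sentence and ends in the second one.
span = doc[2:6]
span_sentences = [sent.text for sent in span.sents]

# Remove the pattern by its id and process the text again.
ruler.remove("explosion")
ents_after = [(ent.text, ent.label_) for ent in nlp(text).ents]

print(ents_before, span_sentences, ents_after)
```

Span.sents yields the sentences that the span overlaps, so a span crossing one sentence boundary produces two sentences.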

🔴 Bug fixes

  • Fix issue #9638: Make JsonlCorpus path optional again.
  • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
  • Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
  • Fix issue #9674: Fix language-specific factory handling in package CLI.
  • Fix issue #9694: Convert labels to strings for README in package CLI.
  • Fix issue #9697: Exclude strings from source vector checks.
  • Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
  • Fix issue #9722: Initialize parser from reference parse rather than aligned example.
  • Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

v3.2.0: Registered scoring functions, Doc input, floret vectors and more

05 Nov 15:54
0fc3dee

✨ New features and improvements

  • NEW: Registered scoring functions for each component in the config.
  • NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
  • NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
  • New overwrite config settings for entity_linker, morphologizer, tagger, sentencizer and senter.
  • New extend config setting for morphologizer that controls whether existing feature types are preserved.
  • Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
  • New package spacy-loggers for additional loggers.
  • New Irish lemmatizer.
  • New Portuguese noun chunks and updated Spanish noun chunks.
  • Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
  • Japanese reading and inflection from sudachipy are annotated as Token.morph features.
  • Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
  • LIKE_URL attribute includes the tokenizer URL pattern.
  • --n-save-epoch option for spacy pretrain.
  • Trained pipelines:
    • New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
    • Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
    • Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
    • Universal Dependencies corpora updated to v2.8.
    • Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
    • English attribute ruler patterns updated to improve Token.pos and Token.morph.
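
The Doc input and language code changes above can be sketched on blank pipelines. This assumes spaCy v3.2 or later; the word list is made up, and the sentencizer is only there to show that components still run on a pre-built Doc.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Custom tokenization: build the Doc directly instead of passing a string.
words = ["Hello", "world", "!", "Bye", "."]
pretokenized = Doc(nlp.vocab, words=words)

# The pipeline skips its own tokenizer and runs the components on the Doc.
doc = nlp(pretokenized)
tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]

# A three-letter language code is resolved to the French language class.
french = spacy.blank("fra")

print(tokens, sentences, french.lang)
```

Because the Doc is created up front, custom extension attributes could also be set on it before calling nlp.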

For more details, see the New in v3.2 usage guide.

🔴 Bug fixes

  • Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
  • Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
  • Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
  • Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

  • In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
  • The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
  • In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

v3.1.4: Python 3.10 wheels and support for AppleOps

29 Oct 14:14
006df1a

✨ New features and improvements

  • NEW: Binary wheels for Python 3.10.
  • NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
  • GPU profiling with spacy.models_with_nvtx_range.v1.
  • Full mypy integration in the CI and many type fixes across the code base.
  • Added custom Protocol classes in ty.py to define behavior of pipeline components.
  • Support for entity linking visualization in displacy.
  • Allow overriding vars in spacy project assets.
  • Standalone train function to run the training from Python scripts just like the spacy train CLI.
  • Support for spacy-transformers>=1.1.0 with improved IO.
  • Support for thinc>=8.0.11 with improved gradient clipping.

🔴 Bug fixes

  • Fix issue #5507: Improve UX for multiprocessing on GPU.
  • Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
  • Fix issue #9244: Fix vectors for 0-length spans.
  • Fix issue #9247: Improve UX for the DocBin constructor.
  • Fix issue #9254: Allow unicode in a spacy project title.
  • Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
  • Fix issue #9305: Restore tokenization timing during evaluation.
  • Fix issue #9335: Sync vocab in vectors and sourced components.
  • Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
  • Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
  • Fix issue #9437: Improve UX around Doc object creation.
  • Fix issue #9465: Fix minor issues with convert CLI.
  • Fix issue #9500: Include .pyi files in the distributed package.

📖 Documentation and examples

  • Various updates to the documentation.
  • New additions to the spaCy universe:
    • deplacy: CUI-based dependency visualizer
    • ipymarkup: Visualizations for NER and syntax trees
    • PhruzzMatcher: Find fuzzy matches
    • spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
    • spaCyOpenTapioca: Entity Linking on Wikidata
    • spacy-clausie: Clause-based information extraction system
    • "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
    • "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

👥 Contributors

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

v3.1.3: Bug fixes and UX updates

20 Sep 12:06
8bda39f

✨ New features and improvements

  • WandbLogger v3 now supports optional run_name and entity parameters.
  • Improved UX when providing invalid pos values for a Doc or Token.

🔴 Bug fixes

  • Fix issue #9001: Pass alignments to Matcher callbacks.
  • Fix issue #9009: Include component factories in third-party dependencies resolver.
  • Fix issue #9012: Correct type of config in create_pipe.
  • Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
  • Fix issue #9033: Fix verbs list for French tokenizer exceptions.
  • Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
  • Fix issue #9074: Improve UX around repo and path arguments in spacy project.
  • Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
  • Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
  • Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

📖 Documentation and examples

  • Various updates to the documentation.
  • A few additions and updates to the spaCy universe.
  • Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

👥 Contributors

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker