When initializing AnalyzerEngine in Presidio with AzureAILanguageRecognizer, the default behavior is to download spaCy models automatically if they are missing. However, in scenarios where only Azure AI is used for entity recognition, this auto-download is unnecessary and can lead to unwanted dependencies, increased startup time, and potential failures due to missing models. Do you know how to make Presidio use only the AzureAILanguageRecognizer?
A spaCy pipeline is an inherent part of Presidio, as it's used not just for detecting entities but also for tokenization, lemmatization, and more (it's used by the context-awareness mechanism, among other places). If you're not interested in downloading a large model, you can configure Presidio to use a small spaCy model, which contains the NLP pipeline components but doesn't use a model for NER.

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import AzureAILanguageRecognizer

# Set up the NLP engine with a small spaCy model
nlp_engine_provider = NlpEngineProvider(nlp_configuration={
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}]
})
nlp_engine = nlp_engine_provider.create_engine()

# Set up the recognizer registry and add only the Azure AI Language recognizer
recognizer_registry = RecognizerRegistry()
azure_ai_recognizer = AzureAILanguageRecognizer()
recognizer_registry.add_recognizer(azure_ai_recognizer)

# Set up the analyzer engine with the custom NLP engine and recognizer registry
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    registry=recognizer_registry
)

# Analyze text
text = "My name is John Doe and my phone number is 555-555-5555"
results = analyzer.analyze(text=text, language="en")

# Print the results (RecognizerResult holds offsets, not the matched text itself)
for result in results:
    print(
        f"Entity: {result.entity_type}, "
        f"Text: {text[result.start:result.end]}, "
        f"Confidence: {result.score}"
    )
```