When initializing AnalyzerEngine in Presidio with AzureAILanguageRecognizer, the default behavior is to download spaCy models automatically if they are missing. However, in scenarios where only Azure AI is used for entity recognition, this auto-download is unnecessary and can lead to unwanted dependencies, increased startup time, and potential failures due to missing models. Do you know how to make Presidio use only the AzureAILanguageRecognizer?
A spaCy pipeline is an inherent part of Presidio, as it's used not just for detecting entities but also for tokenization, lemmatization, and more (it's used by the context-awareness mechanism, among other places). If you're not interested in downloading a large model, you can configure Presidio to use a small spaCy model, which contains the NLP pipeline components but doesn't use a model for NER.

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import AzureAILanguageRecognizer

# Set up the NLP engine with a small spaCy model
nlp_engine_provider = NlpEngineProvider(nlp_configuration={
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}]
})
nlp_engine = nlp_engine_provider.create_engine()

# Set up the recognizer registry and add only the Azure AI Language recognizer
recognizer_registry = RecognizerRegistry()
azure_ai_recognizer = AzureAILanguageRecognizer()
recognizer_registry.add_recognizer(azure_ai_recognizer)

# Set up the analyzer engine with the custom NLP engine and recognizer registry
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    registry=recognizer_registry
)

# Analyze text
text = "My name is John Doe and my phone number is 555-555-5555"
results = analyzer.analyze(text=text, language="en")

# Print the results (RecognizerResult holds offsets, not the matched text itself)
for result in results:
    print(
        f"Entity: {result.entity_type}, "
        f"Text: {text[result.start:result.end]}, "
        f"Confidence: {result.score}"
    )
```