How to use lemmatizer with span categorizer? #9201

mbrunecky · 2021-09-13T17:30:44Z


Hi,
I am trying to add a lemmatizer to a span categorizer pipeline, and because the English lemmatizer uses mode="rule", I am also adding a tagger:
pipeline = ["tagger","lemmatizer","spancat"]

However (after a struggle with config.cfg), I end up with a message:

ValueError: [E143] Labels for component 'tagger' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's initialize method.

I know I could add labels, but why? In any other tagger use I did not have to do anything (I assume a callback may have done that).

Is there an example of a config.cfg that uses spancat along with the (English) lemmatizer?

Here is my config.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tagger","lemmatizer","spancat"]
# batch_size = orig 500, used 1000 but go low to save memory 
batch_size = 600
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false

[components.spancat]
factory = "spancat"
max_positive = null
spans_key = "party"
# threshold = 0.4 may have made it less 'eager' but generally .5 seems the best
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
# rows = [5000,2000,1000,1000] - used in spancat demo
# rows = [5000,2500,2500,2500] - used in NER
rows = [5000,2500,2500,2500]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.spancat.suggester]
@misc = "spacy.name_suggester.v1"
# @misc = "spacy.ngram_suggester.v1" (default; I am using a custom one)
sizes = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
#@architectures = "spacy.HashEmbedCNN.v2"
#pretrained_vectors = null
#width = 96
#depth = 4
#embed_size = 2000
#window_size = 1
#maxout_pieces = 3
#subword_features = true
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.spancat.model.tok2vec.encode.width}
upstream = "*"



[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
# patience = 1600 My Epoch is ~6 to 10k docs. To avoid bailing out before epoch 0 end, using 8000 = epoch
patience = 10000
max_epochs = 0
eval_frequency = 200
frozen_components = []
before_to_disk = null
# was defaulting to: max_steps = 20000, trying ~4 epochs
max_steps = 40000
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
# Had to override this - it was giving tag_acc=0.33, lemma_acc=0.33, spans_party*=0.33
tag_acc = 0.1
lemma_acc = 0.1
spans_sc_f = null
spans_sc_p = null
spans_sc_r = null
spans_party_f = 0.4
spans_party_p = 0.0
spans_party_r = 0.4

[pretraining]

[initialize]
vectors = "en_core_web_lg"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

#[initialize.components.spancat]
# This was needed when 'labels' could not be retrieved from training data
#[initialize.components.spancat.labels]
#@readers = "spacy.read_labels.v1"
#path = "${paths.train}/aaa_labels.json"

[initialize.tokenizer]

Answered by polm


polm · 2021-09-14T05:19:21Z


Do you want to train the tagger, or just use the pre-trained tagger from an existing pipeline? I assume you want to do the latter, in which case you should have something like this:

[components.tagger]
source = "en_core_web_lg"

And you can repeat that for the attribute ruler and lemmatizer. That will load all those components from the existing pipeline without changes. See this section in the docs. (You probably also want to freeze the components since you don't need them when training the spancat.)

The error is happening because you are using a blank tagger and your pipeline has no tok2vec for it to get input from - your only tok2vec is the one included inside the spancat, which isn't accessible from other components.

4 replies

mbrunecky Sep 14, 2021
Author

Thank you for the explanation.
Using source="en_core_web_lg" along with frozen_components=["tagger", "lemmatizer"] gets the pipeline working, BUT I am now flooded with messages:
UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for the token '6677313'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.

I am not sure if my fix is right, but I added (a frozen) attribute_ruler to my pipeline, and it fixed the messages. However, the log reports TAG_ACC =0.0 and LEMMA_ACC = 0.0 - why is it reporting 'accuracy' when I am not training those components?

On a related issue, WHY does the spancat use a separate (private?) tok2vec? The general v3 documentation says it should be the first component in the pipeline, shared by all (downstream) components. Perhaps this is because (in the absence of a better example) I based my config.cfg on the spancat demo and it is not 'right'?

mbrunecky Sep 15, 2021
Author

I am not sure fixing my pipeline with [components.*] source = "en_core_web_lg" was enough.
IS THERE ANY WAY to 'see' if the lemmatizer did anything?
My results using the pipeline with a lemmatizer are identical to those without (I mean loss, scores f,pr), even at the same 'step' (20000).
PERHAPS the 'private' tok2vec within the spancat does not use the results of the preceding pipeline.
Or something else is wrong.

polm Sep 16, 2021

> I am not sure if my fix is right, but I added (a frozen) attribute_ruler to my pipeline, and it fixed the messages.

That's fine.

> However, the log reports TAG_ACC =0.0 and LEMMA_ACC = 0.0 - why is it reporting 'accuracy' when I am not training those components?

If you give a component a value in [training.score_weights] that isn't null, scores will show up in the log, even if it's rule based / frozen / whatever.
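To make the effect of those weights concrete, here is a rough sketch of how per-metric scores could be folded into one final score. It assumes the final score is a plain weighted sum and that null (None) weights are dropped; the function name and the score values are made up for illustration, and spaCy's own training loop remains the authority on the exact behavior.

```python
def weighted_score(scores, weights):
    """Weighted sum of scores; metrics whose weight is None are ignored."""
    active = {name: w for name, w in weights.items() if w is not None}
    return sum(scores.get(name, 0.0) * w for name, w in active.items())


# Weights copied from [training.score_weights] in this thread.
weights = {
    "tag_acc": 0.1, "lemma_acc": 0.1,
    "spans_sc_f": None, "spans_sc_p": None, "spans_sc_r": None,
    "spans_party_f": 0.4, "spans_party_p": 0.0, "spans_party_r": 0.4,
}

# Invented scores: with no gold tags or lemmas, those accuracies stay at 0.0.
scores = {
    "tag_acc": 0.0, "lemma_acc": 0.0,
    "spans_party_f": 0.5, "spans_party_p": 0.5, "spans_party_r": 0.5,
}

print(weighted_score(scores, weights))
```

Under these assumptions the two 0.1 weights add nothing while the accuracies are 0.0, yet they still cap the achievable total at 0.8. Setting tag_acc and lemma_acc to null should take them out of both the log and the sum.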

> On a related issue, WHY does the spancat use a separate (private?) tok2vec? The general v3 documentation says it should be the first component in the pipeline, shared by all (downstream) components. Perhaps this is because (in the absence of a better example) I based my config.cfg on the spancat demo and it is not 'right'?

Using a private tok2vec vs a shared tok2vec is a tradeoff, and which one is right depends on your situation. Usually using a private one is simpler - it makes the configuration less complicated. Neither configuration is right or wrong. There is a section in the docs on shared embedding layers that covers this.

> My results using the pipeline with a lemmatizer are identical to those without (I mean loss, scores f,pr), even at the same 'step' (20000).

I see you want to use the output of the lemmatizer as a source of features for spancat. This is the first time in this thread you made that clear - I thought this question was just about using both at the same time, for downstream use.

If you want to use lemma attributes as input to the tok2vec you have to add it to the list of attributes in the embed section:

attrs = ["ORTH","PREFIX","SUFFIX","SHAPE", "LEMMA"]
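One detail worth checking when extending attrs: as far as I know, MultiHashEmbed expects one rows entry per attribute, so a five-element attrs needs a five-element rows. The sketch below is a stdlib-only sanity check, not how spaCy itself parses configs. The helper name check_embed_section and the rows value of 2500 for LEMMA are illustrative assumptions.

```python
import configparser
import json

# Hypothetical embed section: the thread's settings plus "LEMMA".
# The fifth rows value (2500) is an arbitrary placeholder, not a tuned number.
EMBED_CFG = """
[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2500,2500,2500,2500]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE","LEMMA"]
include_static_vectors = false
"""

SECTION = "components.spancat.model.tok2vec.embed"


def check_embed_section(cfg_text, section):
    """Return (attrs, rows) after checking that both lists have the same length."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(cfg_text)
    attrs = json.loads(parser[section]["attrs"])
    rows = json.loads(parser[section]["rows"])
    if len(attrs) != len(rows):
        raise ValueError(f"{len(attrs)} attrs but {len(rows)} rows in [{section}]")
    return attrs, rows


attrs, rows = check_embed_section(EMBED_CFG, SECTION)
print(dict(zip(attrs, rows)))
```

Running the same check on the original four-element rows with the five-element attrs raises the ValueError, which is the mismatch to watch for after adding "LEMMA".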

adrianeboyd Sep 16, 2021

You want your config to include the tok2vec along with the tagger from en_core_web_lg, and to also include the attribute ruler, which maps token.tag to token.pos for the lemmatizer in your pipeline.

If you're not using the parser (which would also require the same tok2vec), then replace_listeners makes this easier and more modular:

[components.tagger]
source = "en_core_web_lg"
replace_listeners = "model.tok2vec"

[components.attribute_ruler]
source = "en_core_web_lg"

[components.lemmatizer]
source = "en_core_web_lg"

We tried to improve the default warning filters for this particular warning in v3.1, but I see that it doesn't seem to be working as intended. The idea is that you should only see this warning once, so we should have a look at that.
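The W108 warning quoted earlier names the components that usually supply token.pos: 'tagger'+'attribute_ruler' or 'morphologizer'. That ordering rule can be sketched as a pre-flight check on a pipeline list. The function name is hypothetical and this is not something spaCy provides; it only encodes the rule of thumb from the warning text.

```python
def has_pos_source_before_lemmatizer(pipeline):
    """Return True if a component that assigns token.pos precedes "lemmatizer"."""
    if "lemmatizer" not in pipeline:
        return True  # nothing to check
    before = pipeline[: pipeline.index("lemmatizer")]
    tagger_then_ruler = (
        "tagger" in before
        and "attribute_ruler" in before
        and before.index("tagger") < before.index("attribute_ruler")
    )
    return tagger_then_ruler or "morphologizer" in before


# The pipeline from the original question: no attribute_ruler, so no token.pos.
print(has_pos_source_before_lemmatizer(["tagger", "lemmatizer", "spancat"]))

# The layout suggested above, with all three components sourced.
print(has_pos_source_before_lemmatizer(
    ["tagger", "attribute_ruler", "lemmatizer", "spancat"]
))
```

A check like this only looks at component names and order. It cannot tell whether a sourced tagger still has access to the tok2vec it was trained with.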

mbrunecky · 2021-09-20T22:50:17Z

Author

Well, I think I have my spancat pipeline with the lemmatizer working. I am not sure how to prove it - perhaps there is something to see in the saved model (or perhaps I have to look for attributes in the doc?).

I used the suggestions in #7149 by @svlandeg , ending with a pipeline:
pipeline = ["tok2vec", "tagger", "lemmatizer", "spancat"]
using 'frozen' source=en_core_web_lg components:
frozen_components = ["tok2vec", "tagger","lemmatizer"]
from en_core_web_lg along with my "spancat" using its own tok2vec layer.

I could not use @svlandeg's suggestion to simply re-train the source tok2vec, because the en_core_web_lg tok2vec uses
width = 96 depth = 4
and I found that my spancat run scores (f/p/r) are much better (+5%) when I train spancat with a 'bigger' tok2vec such as
width = 300 depth = 8

That (of course) puts a strain on my resources, especially since my 'spans' are up to 14 tokens. I am running out of 64 GB of memory unless I play games with batch_size=400 (I assume it only affects validation) and training.batcher.size (going down to 100-800).
I am not sure about that, but batch sizes seem to influence not only the resource utilization and training times, but also the result/scores.

My training run now exceeds 24 hours, and it seems that lemmatization only made results somewhat worse (~1-2%). Disappointing...

1 reply

mbrunecky Sep 22, 2021
Author

I am afraid that my 48-hour run did not do any lemmatization. The results are slightly worse, but that can be explained by using smaller training mini-batch sizes (100 to 800).
Unfortunately, the documentation for tok2vec MultiHashEmbed does not make it very clear which attrs values can be used and what they mean. Some of them are easy to guess (e.g. "ORTH"); I assumed attrs = [...,"LEMMA"] means there will be a hash derived from each token's lemma - but who knows.
It looks like my only recourse is running a document through my trained model and checking whether any token lemma (token.lemma_) is set.
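That check can be sketched as a small helper that works on anything token-like with .text and .lemma_ attributes, which is what iterating over a spaCy Doc yields. The helper name lemma_report and the stand-in tokens are assumptions for illustration, so the snippet runs without a trained pipeline.

```python
from types import SimpleNamespace


def lemma_report(tokens):
    """Return (text, lemma) pairs and the share of tokens with a non-empty lemma."""
    pairs = [(t.text, t.lemma_) for t in tokens]
    filled = sum(1 for _, lemma in pairs if lemma)
    coverage = filled / len(pairs) if pairs else 0.0
    return pairs, coverage


# Stand-in tokens; with a real pipeline, pass the Doc returned by nlp(text).
fake_doc = [
    SimpleNamespace(text="parties", lemma_="party"),
    SimpleNamespace(text="agreed", lemma_="agree"),
    SimpleNamespace(text="6677313", lemma_=""),
]

pairs, coverage = lemma_report(fake_doc)
for text, lemma in pairs:
    print(f"{text!r} -> {lemma!r}")
print(f"coverage: {coverage:.2f}")
```

With a trained model, something like lemma_report(nlp("some text")) after spacy.load on the output directory would do the same on real tokens, and nlp.pipe_names shows which components are actually in the loaded pipeline. Non-empty lemmas only show that the lemmatizer ran. They do not prove that the spancat's tok2vec uses them as features.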
