Hello there,

Thanks for your efforts in open-sourcing the code; it's vital for us as we try to reproduce the results presented in the paper.
Problem
But I've come across a `RuntimeError` when adapting the model with our private data:

```
/*/EEND-vector-clustering/eend/pytorch_backend/train.py:186: RuntimeWarning: invalid value encountered in true_divide
  fet_arr[spk] = org / norm
...
Traceback (most recent call last):
...
RuntimeError: The loss (nan) is not finite.
```
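For context, here is a minimal sketch of what I believe triggers the warning (the variable names mirror the `train.py` snippet, but the setup itself is mine, not the actual code): a speaker slot that never accumulates any embedding ends up with a zero vector and a zero count, so the normalization divides zero by zero and fills the slot with NaN.

```python
import numpy as np

# Hypothetical minimal reproduction (not the actual train.py code):
# a speaker that never received an embedding has an all-zero
# accumulator and a zero chunk count, so normalization is 0 / 0.
org = np.zeros(4, dtype=np.float32)   # accumulated embedding vector
norm = np.float32(0.0)                # number of chunks for this speaker

fet_arr_spk = org / norm              # RuntimeWarning: invalid value encountered
print(np.isnan(fet_arr_spk).all())    # True: the slot is all NaN
```

Once such a NaN row is in the embedding table, any loss computed through it is NaN as well, which matches the `RuntimeError` above.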
Detail
After some debugging, I found that the problem actually happens during the backpropagation step, when an entry in the embedding layer is left all-zero (see EEND-vector-clustering/eend/pytorch_backend/train.py, lines 173 to 186 at b3649ee).

Since the embeddings are loaded from the speaker embeddings dumped by the save_spkv_lab.py script when adapting the model, I suspect there is an issue in the save_spkv_lab function.
After some careful step-by-step checking with pdb, I found that silent speaker labels are in fact added to the all_labels variable when the speaker embeddings are dumped (EEND-vector-clustering/eend/pytorch_backend/infer.py, lines 349 to 355 at b3649ee). Even when `torch.sum(t_chunked_t[sigma[i]]) > 0`, `lab` can still be -1, which is treated as a silent speaker according to the code in EEND-vector-clustering/eend/pytorch_backend/diarization_dataset.py (lines 94 to 99 at b3649ee). This is what confuses me: it should not happen, since both lab and T/t_chunked are produced from the information in kaldi_obj.utt2spk.

Since these silent speaker labels are -1 and Python lists support negative indexing, the issue is silently ignored when the embeddings are dumped, but it causes exceptions once training begins.
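To illustrate the negative-indexing trap (a toy example with made-up names, not the repo's code): a label of -1 indexes the last row of the embedding table instead of raising an error, so the bad write goes unnoticed.

```python
import numpy as np

num_speakers = 3
fet_arr = np.zeros((num_speakers, 4))  # hypothetical speaker-embedding table
emb = np.ones(4)

lab = -1            # silent speaker label produced during dumping
fet_arr[lab] = emb  # no IndexError: Python/NumPy map -1 to the LAST row
print(fet_arr[2])   # [1. 1. 1. 1.] -- speaker 2's slot was overwritten
```

So the dump step completes without any visible error, and the corruption only surfaces later as the NaN loss during training.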
Question
I could simply fix this issue by adding a speaker label to all_labels only when lab >= 0 (i.e., skipping silent speakers) when saving the speaker embeddings; the subsequent training then runs smoothly and produces a well-performing model.

But before opening a PR, I would like to know whether you have ever come across this issue, or whether you have any idea why it happens.
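Here is a sketch of the workaround I mean (the helper name and structure are illustrative, not the actual save_spkv_lab code): filter out silent labels before they are dumped, so no -1 ever reaches the saved file.

```python
# Illustrative sketch of the fix, not the actual save_spkv_lab code:
# collect a (label, embedding) pair only for non-silent speakers.
all_labels = []
all_embeddings = []

def maybe_collect(lab, emb):
    """Append the pair only if the speaker is not silent (lab >= 0)."""
    if lab >= 0:
        all_labels.append(lab)
        all_embeddings.append(emb)

maybe_collect(2, [0.1, 0.2, 0.3])   # kept
maybe_collect(-1, [0.0, 0.0, 0.0])  # silent speaker: skipped
print(all_labels)                   # [2]
```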
Thanks!