As we know, when training DiffSinger, the token sequences are zero-padded at the end so that every sample in the batch reaches the maximum frame length.
Because of this, the FFT (feed-forward Transformer) block takes a key_padding_mask input so that attention ignores the padded positions.
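For context, here is a minimal sketch (not the DiffSinger source) of how such a mask is typically built from the per-sample lengths and passed to attention; `lengths`, `max_len`, `tokens`, and the attention dimensions are hypothetical names chosen for illustration.

```python
import torch

# Hypothetical batch: true token lengths and a zero-padded token tensor
lengths = torch.tensor([7, 5, 3])                    # real lengths per sample
max_len = 8                                          # padded (maximum) length
tokens = torch.zeros(3, max_len, dtype=torch.long)   # zero-padded token batch

# True where a position is padding, so attention skips those keys
key_padding_mask = torch.arange(max_len)[None, :] >= lengths[:, None]  # (B, T)

# Standard PyTorch attention accepts this mask directly
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
x = torch.randn(3, max_len, 256)                     # token embeddings
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
```

The question is whether the same masking should be applied inside ESM as well.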
My question is, does the ESM module also need to address this issue?
We understand that ESM learns latent representations of the different language arrangements within the token sequence. Could leaving the padded zeros unmasked negatively impact performance?
@hualizhou167 @linyueqian