
How to label vocal bursts such as <laugh>? #277

@hbwu-ntu

Description


Thanks for sharing this excellent work. I have several questions about how you obtain the vocal-burst labels in your data.
I see your model supports (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), and (whistles). May I ask:

  1. Do you train a dedicated ASR to annotate all these vocal-burst tags? If so, which model do you use as the base? Did you follow the Whisper-D approach described in https://jordandarefsky.com/blog/2024/parakeet/?
  2. How do you obtain the data to train your own Whisper-D? Approximately how many hours of human-labeled data do you use, and could you share the data sources along with one example sample?
  3. Do you perform iterative training for Whisper-D — for example, first training a v1 model on human-labeled data, then using that v1 to generate pseudo-labels and retraining the model?
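For context on what I mean by these inline tags: a minimal sketch of how such parenthesized vocal-burst annotations could be parsed out of (or stripped from) an ASR transcript, e.g. to compare a Whisper-D-style output against a vanilla transcript. The tag set and function names here are my own illustrative placeholders, not your actual pipeline:

```python
import re

# A few of the tags listed above (hypothetical subset for illustration).
BURST_TAGS = {"laughs", "clears throat", "sighs", "coughs", "mumbles"}

def extract_bursts(transcript: str):
    """Return the known vocal-burst tags found inline, e.g. '(laughs)'."""
    return [t for t in re.findall(r"\(([^)]+)\)", transcript) if t in BURST_TAGS]

def strip_bursts(transcript: str) -> str:
    """Remove all parenthesized tags, leaving the plain spoken text."""
    return re.sub(r"\s*\([^)]+\)\s*", " ", transcript).strip()

print(extract_bursts("well (laughs) I suppose (coughs) so"))  # ['laughs', 'coughs']
print(strip_bursts("well (laughs) I suppose (coughs) so"))    # 'well I suppose so'
```

Note that `strip_bursts` removes any parenthesized span, not just known tags; a real pipeline would presumably restrict stripping to the supported tag vocabulary.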
