Thanks for sharing this really good work. I have several questions about how you obtain the vocal-burst labels in your data.
I see your model supports (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles). May I ask:
- Do you train a dedicated ASR to annotate all these vocal-burst tags? If so, which model do you use as the base? Did you follow the Whisper-D approach described in https://jordandarefsky.com/blog/2024/parakeet/?
- How do you obtain the data to train your own Whisper-D? Approximately how many hours of human-labeled data do you use, and could you share the data sources along with one example sample?
- Do you perform iterative training for Whisper-D, for example first training a v1 model on human-labeled data, then using that v1 to generate pseudo-labels and retraining the model (roughly the loop sketched below)?
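
To make the last question concrete, here is a minimal, hypothetical sketch of the iterative pseudo-labeling loop I have in mind. The function names, data paths, and toy data are all placeholders I made up for illustration; they are not taken from this repository or from Whisper-D.

```python
# Hypothetical sketch of the iterative pseudo-labeling scheme asked about above.
# All names and data below are placeholders, not this project's actual pipeline.
from typing import List, Tuple

Sample = Tuple[str, str]  # (audio_path, transcript containing tags such as "(laughs)")

def train_tag_asr(data: List[Sample]) -> List[Sample]:
    """Placeholder for fine-tuning a Whisper-style ASR on tagged transcripts.
    Here it simply returns the training set so the sketch stays runnable."""
    return list(data)

def pseudo_label(model: List[Sample], audio_paths: List[str]) -> List[Sample]:
    """Placeholder for running the trained model over unlabeled audio to
    produce transcripts that include vocal-burst tags."""
    return [(path, "(laughs) placeholder transcript") for path in audio_paths]

# v1: train only on the small human-labeled seed set
human_labeled: List[Sample] = [("clip_000.wav", "hello (laughs) there")]
model_v1 = train_tag_asr(human_labeled)

# use v1 to pseudo-label a larger pool of unlabeled audio
unlabeled_audio = ["clip_001.wav", "clip_002.wav"]
pseudo_labeled = pseudo_label(model_v1, unlabeled_audio)

# v2: retrain on human labels plus (possibly filtered) pseudo-labels
model_v2 = train_tag_asr(human_labeled + pseudo_labeled)
print(f"v2 trained on {len(human_labeled)} human + {len(pseudo_labeled)} pseudo samples")
```

Is this roughly what you did, or did you train on human labels only?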