
Create conceptual replication of Liesenfeld & Dingemanse 2022 #24

mdingemanse opened this issue Jun 2, 2023 · 1 comment

mdingemanse commented Jun 2, 2023

Conceptually replicating the analysis of our Interspeech paper is a useful goal to guide scikit-talk development. To that end I'm going to create a proof of concept of our code for identifying continuers and selecting a set of utterances that can then feed into audio clip extraction and clustering analysis. I'll be using the IFADV data so that it can be fully open.

I'll start work in the playground repo and will try to get to it within the next few days. From my side this will be based on our R code; the audio extraction and clustering will be based on the code in the existing OSF repo. Both will need some porting and editing to work with the open IFADV data.

The paper:

  • Liesenfeld, Andreas, and Mark Dingemanse. 2022. “Bottom-up Discovery of Structure and Variation in Response Tokens (‘Backchannels’) across Diverse Languages.” In Proceedings of Interspeech 2022. https://doi.org/10.21437/Interspeech.2022-11288.

mdingemanse commented Jun 2, 2023

Alright @bvreede @n400peanuts @liesenf the playground repo now contains a first go at a dataset similar to the one that underlies the first half of our paper, but now using only the IFADV package.

The R code for generating this should be fairly straightforward to port to Python. I have tried to comment as needed. Let me know if you need any further guidance. To preview the steps:

  1. We add a column streak that holds a streak counter using the cumsum() function. This counter increments whenever a speaker produces the same utterance in succession.
  2. We select items that occur in streaks of >2: these are our candidate continuers.
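For the port, the two steps above might look roughly like this in pandas. This is a sketch, not the actual R code: the column names follow the CSV preview further down, and the run-detection logic (a new run starts whenever speaker or utterance changes, with cumsum() assigning run ids) is my assumption about what the cumsum()-based counter does.

```python
import pandas as pd

# Toy data standing in for the IFADV utterance table
# (hypothetical values; column names match continuers_in_streaks.csv).
df = pd.DataFrame({
    "participant": ["A", "A", "A", "B", "A", "A"],
    "utterance_stripped": ["ja", "ja", "ja", "hm", "ja", "nee"],
})

# A new run starts whenever the participant or the utterance changes;
# cumsum() over that boolean gives each run a unique id.
changed = (df["utterance_stripped"].ne(df["utterance_stripped"].shift())
           | df["participant"].ne(df["participant"].shift()))
df["run_id"] = changed.cumsum()

# Position within the run = the streak counter.
df["streak"] = df.groupby("run_id").cumcount() + 1

# Candidate continuers: items occurring in streaks of > 2.
streak_len = df.groupby("run_id")["streak"].transform("max")
candidates = df[streak_len > 2]
print(candidates)
```

On the toy data this keeps only the three consecutive "ja" rows, since the other utterances occur in runs of length 1.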

Surprise! It so happens that in the exotic language of the IFADV dataset, the top three formats found in streaks are ja, ja ja, hum, as depicted in this quick and dirty convplot of a few sample sequences:

[image: convplot of a few sample sequences showing streaks of ja, ja ja, and hum]

The selected utterances are in continuers_in_streaks.csv which looks like this:

 uid       language utterance utterance_stripped  begin    end participant
  <chr>     <chr>    <chr>     <chr>               <dbl>  <dbl> <chr>      
1 dutch-01… dutch    kch       kch                 56867  57227 spreker2 […
2 dutch-01… dutch    ja        ja                 421869 422277 spreker1 […
3 dutch-02… dutch    ja        ja                 121341 121579 spreker2 […
4 dutch-02… dutch    ja [unk_… ja                 124408 125008 spreker2 […
5 dutch-02… dutch    ja        ja                 137234 137505 spreker1 […
6 dutch-02… dutch    ja        ja                 141980 142252 spreker2 […

Next steps

In the other half of the paper, Andreas takes over, taking roughly the following steps (correct me if I'm wrong @liesenf):

  1. Use the source column to identify corresponding audio files
  2. Use the begin and end columns to identify positions at which to clip those audio files
  3. Use ffmpeg (or similar) to extract audio clips
  4. Generate spectrograms for all audio clips
  5. Use UMAP from a fork of avgn to cluster audio clips
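Steps 2–3 could be sketched like this in Python. Everything here is assumed rather than taken from the existing OSF code: the file names are invented, and begin/end are interpreted as milliseconds based on the values in the CSV preview above.

```python
def clip_cmd(source, begin_ms, end_ms, out_path):
    """Build an ffmpeg command extracting one audio clip.

    begin_ms/end_ms are assumed to be milliseconds, matching the
    begin and end columns in continuers_in_streaks.csv.
    """
    return [
        "ffmpeg", "-y",
        "-i", source,
        "-ss", f"{begin_ms / 1000:.3f}",  # clip start in seconds
        "-to", f"{end_ms / 1000:.3f}",    # clip end in seconds
        "-c", "copy",                     # no re-encoding (fine for PCM/wav)
        out_path,
    ]

# Hypothetical source and output paths for the first row of the CSV.
cmd = clip_cmd("dutch-01.wav", 56867, 57227, "clips/dutch-01_ja_001.wav")
# subprocess.run(cmd, check=True)  # run once ffmpeg is available
print(" ".join(cmd))
```

Looping this over the rows of continuers_in_streaks.csv would yield one clip per selected utterance, ready for spectrogram generation.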

Which then ultimately leads to something like this (for Dutch only):

[image: UMAP clustering of audio clips (Dutch only)]
