Supplementary material for the submission to ACL.
This paper provides a proof of concept that audio of tabletop role-playing games (TTRPGs) could serve as a challenge for diarization systems. TTRPGs are carried out mostly through conversation. Participants often alter their voices to indicate that they are speaking as a fictional character. Audio processing systems are susceptible to voice conversion, with or without technological assistance. TTRPGs present a conversational phenomenon in which voice conversion is an inherent characteristic of an immersive gaming experience. This could make it more challenging for diarizers to identify the real speaker and to recognize an impersonation as such. We present the creation of a small TTRPG audio dataset and compare it against the AMI and ICSI corpora. The performance of two diarizers, pyannote.audio and wespeaker, was evaluated. We observed that the TTRPGs' properties result in a higher confusion rate for both diarizers. Additionally, wespeaker strongly underestimates the number of speakers in the TTRPG audio files. We propose TTRPG audio as a promising challenge for diarization systems.
The required Python packages and their versions can be found in `requirements.txt`.
In this repository, we provide the following:
- Code: How the diarizer was applied
- Code: How we applied forced alignment
- Code: How we converted the forced alignment results to rttm
- Code: How the evaluation was done
- Code: How we calculated the amount of overlapping speech
- Code: How we calculated the number of interjections
- Code: How we applied the Mann-Whitney U test
- Links: The YouTube videos we used for our TTRPG dataset
- Citation
Calling the diarizer as we did in our work. The diarizer can be called by

```
python main.py -d <options>
```

The available options are:

- `-a` (or `--audio_path`): Path to the audio files that should be diarized. Required.
- `-r` (or `--result_path`): Path to the directory where the results should be saved. The result files will have the same file name as their respective audio files. Required.
- `-t` (or `--access_token`): The Hugging Face access token. You need to have access to pyannote.audio 3.1. Required.
- `-f` (or `--audio_format`): The format of the audio files. The default is `wav`.
- `-w` (or `--reference_path`): Path to the reference files. The reference files need to be `rttm` files and have the same file name as their respective audio files. Only required if `-c` is set (see below).
- `-c` (or `--consider_speaker_no`): If set, the diarizer will receive the number of expected speakers as an argument. The number is taken from the reference files.
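Reading the expected speaker count from a reference file, as the `-c` flag does, can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `speakers_in_rttm` and the sample RTTM content are made up for this example, but the RTTM `SPEAKER` line layout (speaker name in the eighth field) follows the standard format.

```python
def speakers_in_rttm(rttm_text):
    """Collect the distinct speaker labels from RTTM SPEAKER lines.
    The count could then be passed to the diarizer as the expected
    number of speakers (the behaviour described for -c above)."""
    speakers = set()
    for line in rttm_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            speakers.add(fields[7])  # 8th field holds the speaker name
    return speakers

# Hypothetical reference file content for illustration.
reference = """SPEAKER game1 1 0.00 4.20 <NA> <NA> GM <NA> <NA>
SPEAKER game1 1 4.50 2.10 <NA> <NA> Player1 <NA> <NA>
SPEAKER game1 1 7.00 3.00 <NA> <NA> GM <NA> <NA>"""

print(len(speakers_in_rttm(reference)))  # -> 2
```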
Applying forced alignment as we did in our work.

```
python main.py -fa <options>
```

The results of the forced alignment will be saved in `csv` files containing the start and end time of each word and the confidence of the alignment (`Word, Start_ms, End_ms, Score`).
The available options are:
- `-a` (or `--audio_path`): Path to the audio files that should be aligned. Required.
- `-r` (or `--result_path`): Path to the directory where the results should be saved. The result files will have the same file name as their respective audio files. Required.
- `-w` (or `--reference_path`): Path to the transcript files. The reference files need to be `txt` files and have the same file name as their respective audio files. Required.
- `-f` (or `--audio_format`): The format of the audio files. The default is `wav`.
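A result file with the columns described above can be consumed with the standard `csv` module. The snippet below is a sketch, not repository code: the sample content and the helper `low_confidence_words` (e.g. for spotting poorly aligned words) are invented for illustration; only the column names `Word, Start_ms, End_ms, Score` come from the description above.

```python
import csv
import io

# Hypothetical forced-alignment output with the columns described above.
sample = """Word,Start_ms,End_ms,Score
hello,0,420,0.97
there,450,780,0.88
traveller,800,1450,0.52
"""

def low_confidence_words(csv_text, threshold=0.6):
    """Return the words whose alignment confidence falls below `threshold`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Word"] for row in reader if float(row["Score"]) < threshold]

print(low_confidence_words(sample))  # -> ['traveller']
```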
Converting `csv` files in the form of `Word, Start_ms, End_ms, Score, Speaker` into an `rttm` file.

```
python main.py -c2r <options>
```

Entries of the same speaker that are less than 500 ms apart are merged into one entry.
The available options are:
- `-r` (or `--result_path`): Path to the `rttm` result file. Required.
- `-w` (or `--reference_path`): Path to the `csv` reference file. Required.
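The merging rule (same speaker, gap under 500 ms) and the RTTM rendering can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the repository's code: the function names, the sample rows, and the file id `recording` are invented, while the 500 ms threshold and the input column order come from the description above.

```python
def merge_segments(rows, gap_ms=500):
    """Merge consecutive entries of the same speaker that are less than
    `gap_ms` milliseconds apart (the 500 ms rule described above)."""
    merged = []
    for _word, start, end, _score, speaker in rows:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] < gap_ms:
            merged[-1][1] = end  # extend the previous entry
        else:
            merged.append([start, end, speaker])
    return merged

def to_rttm_lines(segments, file_id="recording"):
    """Render merged segments as RTTM SPEAKER lines (times in seconds)."""
    return [
        f"SPEAKER {file_id} 1 {start / 1000:.3f} {(end - start) / 1000:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
        for start, end, speaker in segments
    ]

rows = [
    ("hi", 0, 300, 0.9, "GM"),
    ("everyone", 400, 900, 0.8, "GM"),   # 100 ms gap -> merged with "hi"
    ("hello", 2000, 2400, 0.7, "Player1"),
]
segments = merge_segments(rows)
print(to_rttm_lines(segments))
```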
Evaluating the diarization by calculating the diarization error rate (DER), confusion, false alarm, and missed detection.
Additionally, the `csv` result file will contain the number of detected speakers, the actual number of speakers, the speaker ratio, and the length of the audio file(s).
```
python main.py -e <options>
```

The available options are:

- `-a` (or `--audio_path`): Path to the audio files. Required.
- `-r` (or `--result_path`): Path to the hypothesis files. The hypothesis files need to be `rttm` files and have the same file name as their respective audio files. Required.
- `-w` (or `--reference_path`): Path to the reference/ground-truth files. The reference files need to be `rttm` files and have the same file name as their respective audio files. Required.
- `-e` (or `--evaluate_file`): The `csv` file for the evaluation results. Required.
- `-f` (or `--audio_format`): The format of the audio files. The default is `wav`.
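The three DER components can be illustrated with a simplified frame-based computation on toy segments. This sketch deliberately ignores overlapping speech and optimal speaker mapping; a full evaluation would typically use a dedicated metric implementation such as pyannote's. All function names, segment values, and the 10 ms frame step are assumptions made for this example.

```python
def frame_labels(segments, total_ms, step=10):
    """Assign each `step`-ms frame its speaker label (None = silence)."""
    labels = [None] * (total_ms // step)
    for start, end, speaker in segments:
        for i in range(start // step, min(end // step, len(labels))):
            labels[i] = speaker
    return labels

def der_components(reference, hypothesis, total_ms):
    """Frame-based missed detection, false alarm, and confusion,
    each normalised by the total reference speech time."""
    ref = frame_labels(reference, total_ms)
    hyp = frame_labels(hypothesis, total_ms)
    missed = sum(1 for r, h in zip(ref, hyp) if r and not h)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if h and not r)
    confusion = sum(1 for r, h in zip(ref, hyp) if r and h and r != h)
    speech = sum(1 for r in ref if r)
    return {
        "missed": missed / speech,
        "false_alarm": false_alarm / speech,
        "confusion": confusion / speech,
        "der": (missed + false_alarm + confusion) / speech,
    }

# Toy example: speaker B is entirely missed or confused with A.
reference = [(0, 1000, "A"), (1000, 2000, "B")]
hypothesis = [(0, 1500, "A")]
print(der_components(reference, hypothesis, total_ms=2000))
# -> {'missed': 0.25, 'false_alarm': 0.0, 'confusion': 0.25, 'der': 0.5}
```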
Calculates the amount of overlapping speech in the `rttm` files.
Writes the results to a `csv` file.
```
python main.py -o <options>
```

The available options are:

- `-w` (or `--reference_path`): Path to the reference/ground-truth files. The reference files need to be `rttm` files. Required.
- `-e` (or `--evaluate_file`): The `csv` file for the evaluation results. Required.
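One way to measure overlapping speech from speaker segments is a sweep over the segment boundaries, summing the time during which at least two speakers are active. This is a self-contained sketch of that idea, not the repository's implementation; the function name and the sample segments are invented.

```python
def overlap_duration(segments):
    """Total time during which at least two speakers talk simultaneously,
    computed with a sweep over segment start/end events."""
    events = []
    for start, end, _speaker in segments:
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker becomes inactive
    events.sort()
    active = 0
    overlap = 0.0
    prev = None
    for time, delta in events:
        if prev is not None and active >= 2:
            overlap += time - prev
        active += delta
        prev = time
    return overlap

segments = [
    (0.0, 5.0, "GM"),
    (4.0, 7.0, "Player1"),   # overlaps the GM for 1 second
    (10.0, 12.0, "Player2"),
]
print(overlap_duration(segments))  # -> 1.0
```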
Calculates the number of filler words in the `txt` files.
Writes the results to a `csv` file.
```
python main.py -fw <options>
```

The available options are:

- `-w` (or `--reference_path`): Path to the reference files. The reference files need to be `txt` files. Required.
- `-e` (or `--evaluate_file`): The `csv` file for the evaluation results. Required.
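The counting step can be sketched with a hand-picked interjection list. This is only an illustration: the repository relies on spaCy (hence the model download below) rather than a fixed word list, and the `FILLERS` set and function name here are invented for the example.

```python
import re

# Hand-picked interjections for illustration only; the repository's
# implementation uses spaCy's linguistic analysis instead of a fixed list.
FILLERS = {"uh", "um", "uhm", "er", "erm", "hmm"}

def count_fillers(text):
    """Count tokens from the (assumed) filler list in a transcript line."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for tok in tokens if tok in FILLERS)

print(count_fillers("Um, I... uh, I cast, hmm, fireball?"))  # -> 3
```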
Before the calculation can be done, one has to call

```
python -m spacy download en_core_web_trf
```

Calculates the Mann-Whitney U test for the two datasets in the two given files. It is assumed that each given file contains one data point per line, as a float. The result (the Mann-Whitney U statistic and the p-value) will be printed to the console.
```
python main.py -mw <options>
```

The available options are:

- `-x` (or `--dataset_x`): Path to the file that contains the first dataset. Required.
- `-y` (or `--dataset_y`): Path to the file that contains the second dataset. Required.
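The U statistic itself is simple to compute by hand, which makes the test easy to sanity-check. The sketch below implements the textbook pairwise definition in plain Python; in practice one would use `scipy.stats.mannwhitneyu`, which also returns the p-value mentioned above. The function name and the sample values are invented for this example.

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for sample `xs` versus sample `ys`:
    count the pairs (x, y) with x > y, scoring ties as 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

xs = [1.1, 2.4, 3.0]
ys = [0.5, 2.4, 5.2]
print(mann_whitney_u(xs, ys))  # -> 4.5
```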
The links to the used YouTube videos can be found in the `links.txt` file.
If you use the contents of this repository, please cite the corresponding publication.
@InProceedings{remmetang-2025-tabletop,
author = {Lian Remme and Kevin Tang},
title = {Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge},
booktitle = {Findings of the {A}ssociation for {C}omputational {L}inguistics: {NAACL} 2025},
year = {2025},
publisher = {Association for Computational Linguistics},
month = {04},
pubstate = {forthcoming},
address = {New Mexico, USA},
}

GitHub Copilot was used to assist during code writing. Copilot helped write some documentation strings, helped decide on variable and function names, and wrote first drafts of some functions.