
Using Other Datasets

Cade Stocker edited this page Nov 28, 2025 · 3 revisions

Aria Example

To download the Aria dataset, I had to use a command like:

wget -O aria_midi.tar.gz 'https://huggingface.co/datasets/loubb/aria-midi/resolve/main/aria-midi-v1-unique-ext.tar.gz?download=true'

Downloading this file gave me a lot of trouble. Make sure you pass -O so the download is written to a .tar.gz file with a name you choose; without it, wget typically names the file after the URL, query string included. I've been using this Aria dataset, which has around 30,000 MIDI files, and reducing it to around 10,000 by removing subdirectories with commands like:

rm -r data/aria/data/c*

The files are stored in small subdirectories with names like 'aa', 'ad', 'as', etc., so deleting subdirectories by prefix is an easy way to get rid of many files at once.
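If you would rather trim to a target count than guess which prefixes to delete, something like the following might work. This is only a sketch, not what I actually ran: the function names count_midi and trim_dataset are made up for illustration, and it assumes the files end in .mid or .midi. It deletes whole subdirectories in sorted order, so try it on a copy first.

```shell
# Count MIDI files under a directory (assumes .mid / .midi extensions).
count_midi() {
    find "$1" -type f \( -iname '*.mid' -o -iname '*.midi' \) | wc -l | tr -d ' '
}

# Remove whole subdirectories of $1, in sorted order, until at most
# $2 MIDI files remain. Hypothetical helper, destructive: it runs rm -r.
trim_dataset() {
    root=$1
    target=$2
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when $root has no subdirectories.
        [ -d "$dir" ] || continue
        if [ "$(count_midi "$root")" -le "$target" ]; then
            break
        fi
        rm -r "$dir"
    done
}

# Example (path taken from the rm command above):
# trim_dataset data/aria/data 10000
```

Because it removes entire subdirectories, the final count can land a bit under the target rather than exactly on it.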

This is in reference to something I needed to do as detailed on this wiki page.

General Instructions

Just make sure your directory of MIDI files is inside a directory in csce585-midi/data/. It's fine to git clone or wget the dataset, or you can just drag the dataset into csce585-midi/data/.
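If your dataset arrives as a .tar.gz, like the Aria download, a small helper along these lines can unpack it into the data directory and confirm that MIDI files actually landed there. Treat it as a rough sketch: extract_dataset is a made-up name, the data/aria destination is just an example, and the .mid / .midi extensions are an assumption about your dataset.

```shell
# Unpack a .tar.gz into a destination directory and report how many
# MIDI files ended up there. Hypothetical helper; run from csce585-midi/.
extract_dataset() {
    archive=$1
    dest=$2
    # Listing the archive first catches truncated downloads or files
    # that are not really gzip-compressed tarballs.
    if ! tar -tzf "$archive" > /dev/null 2>&1; then
        echo "not a readable .tar.gz: $archive" >&2
        return 1
    fi
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest" || return 1
    count=$(find "$dest" -type f \( -iname '*.mid' -o -iname '*.midi' \) | wc -l | tr -d ' ')
    echo "$count MIDI files under $dest"
}

# Example (destination name is only an illustration):
# extract_dataset aria_midi.tar.gz data/aria
```

A count of 0 usually means the archive had a different layout or file extension than expected, so it is worth looking inside the destination directory before preprocessing.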

After you have your dataset in the data directory, view the preprocessing wiki page if you need any help with preprocessing.
