Data export and statistics

Exporting ChoCo as a JAMS dataset

To export ChoCo from the partition branch (our factory of ChoCo collections), run the command below, setting --n_workers to the number of parallel workers your machine can spare.

python create.py ../choco-jams \
--jams_version converted --exclude ireal-pro:forum --n_workers 4

To export ChoCo XT from the partition branch, run the same command without the --exclude option, so that the ireal-pro:forum sub-collection is kept.

python create.py ../choco-jams \
--jams_version converted --n_workers 4
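
When both exports are scripted, the two invocations above differ only in the --exclude option. A minimal sketch, assuming a hypothetical helper build_export_command (not part of ChoCo) that assembles the argument list and falls back to the number of available CPUs when no worker count is given:

```python
import os


def build_export_command(out_dir, extended=False, n_workers=None):
    """Assemble the create.py argument list for ChoCo or ChoCo XT.

    Hypothetical helper, not part of ChoCo: it only mirrors the two
    commands documented above.
    """
    if n_workers is None:
        # os.cpu_count() may return None, so fall back to one worker.
        n_workers = os.cpu_count() or 1
    cmd = ["python", "create.py", out_dir, "--jams_version", "converted"]
    if not extended:
        # Plain ChoCo leaves out the ireal-pro:forum sub-collection.
        cmd += ["--exclude", "ireal-pro:forum"]
    cmd += ["--n_workers", str(n_workers)]
    return cmd


choco_cmd = build_export_command("../choco-jams", n_workers=4)
choco_xt_cmd = build_export_command("../choco-jams", extended=True, n_workers=4)
print(" ".join(choco_cmd))
print(" ".join(choco_xt_cmd))
```

Either list can be passed to subprocess.run(cmd, check=True) from the directory that contains create.py.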

Create your own ChoCo dataset

If you want a custom subset of ChoCo, selected by the partitions to include or exclude or by specific metadata, use the options of the choco/create.py script documented below.

Dataset creation scripts for ChoCo.

positional arguments:
  out_dir               Directory where data will be exported.

optional arguments:
  -h, --help            show this help message and exit
  --jams_version {original,converted}
                        Type of JAMS files to consider from ChoCo.
  --input_meta INPUT_META
                        Path to the CSV file with the desired metadata.
  --include INCLUDE [INCLUDE ...]
                        Name of partitions to include in the dataset.
  --exclude EXCLUDE [EXCLUDE ...]
                        Name of partitions to exclude from the dataset.
  --n_workers N_WORKERS
                        Number of workers for parallel computation.
  --log_dir LOG_DIR     Directory where log files will be generated.
  --debug               Whether to print logging info messages.
  --resume              Whether to resume the transformation process.
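
For readers who want to see how these options fit together, the help text above can be reproduced with argparse. This is a reconstruction from the documented interface only: the argument types, the defaults, and the use of store_true for --debug and --resume are assumptions, and the real create.py may be written differently.

```python
import argparse


def build_parser():
    """Rebuild the documented create.py interface (types are assumed)."""
    parser = argparse.ArgumentParser(
        description="Dataset creation scripts for ChoCo.")
    parser.add_argument(
        "out_dir", help="Directory where data will be exported.")
    parser.add_argument(
        "--jams_version", choices=["original", "converted"],
        help="Type of JAMS files to consider from ChoCo.")
    parser.add_argument(
        "--input_meta",
        help="Path to the CSV file with the desired metadata.")
    # nargs="+" yields the "INCLUDE [INCLUDE ...]" form shown in the help.
    parser.add_argument(
        "--include", nargs="+",
        help="Name of partitions to include in the dataset.")
    parser.add_argument(
        "--exclude", nargs="+",
        help="Name of partitions to exclude from the dataset.")
    # Assumption: the worker count is parsed as an integer.
    parser.add_argument(
        "--n_workers", type=int,
        help="Number of workers for parallel computation.")
    parser.add_argument(
        "--log_dir", help="Directory where log files will be generated.")
    # Assumption: both switches are plain boolean flags.
    parser.add_argument(
        "--debug", action="store_true",
        help="Whether to print logging info messages.")
    parser.add_argument(
        "--resume", action="store_true",
        help="Whether to resume the transformation process.")
    return parser


args = build_parser().parse_args([
    "../choco-jams", "--jams_version", "converted",
    "--exclude", "ireal-pro:forum", "--n_workers", "4",
])
print(args.out_dir, args.exclude, args.n_workers)
```

Because --include and --exclude accept one or more names, several partitions can be listed after a single option, separated by spaces.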

Example of a custom subset of ChoCo that we use in musilar to trace musical influence.

python create.py ../../musilar/data/influence/choco-beatles --jams_version original \
--exclude chordify robbie-williams uspop2002 rwc-pop biab-internet-corpus \
jazz-corpus wikifonia --n_workers 4

Example of a custom subset including audio annotations only.

python create.py ../../musilar/data/genre/choco-audio --jams_version original \
--include isophonics schubert-winterreise billboard chordify \
robbie-williams uspop2002 rwc-pop --n_workers 4
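
The effect of --include and --exclude in the two examples above can be pictured as a filter over partition names. The sketch below is illustrative only: select_partitions is a hypothetical function, and its semantics (include keeps only the named partitions, exclude drops the named ones, neither option keeps everything) are an assumption rather than a description of create.py. It also ignores the partition:sub-collection form used in ireal-pro:forum.

```python
def select_partitions(available, include=None, exclude=None):
    """Return the partitions kept after applying include/exclude lists.

    Assumed semantics, not taken from create.py: include keeps only the
    named partitions, exclude drops the named ones.
    """
    selected = list(available)
    if include:
        wanted = set(include)
        selected = [name for name in selected if name in wanted]
    if exclude:
        unwanted = set(exclude)
        selected = [name for name in selected if name not in unwanted]
    return selected


# Partition names that appear in the examples above; this is not
# claimed to be the complete list of ChoCo partitions.
partitions = [
    "isophonics", "schubert-winterreise", "billboard", "chordify",
    "robbie-williams", "uspop2002", "rwc-pop", "biab-internet-corpus",
    "jazz-corpus", "wikifonia",
]

# Mirrors the audio-only example (--include ...).
audio_only = select_partitions(partitions, include=[
    "isophonics", "schubert-winterreise", "billboard", "chordify",
    "robbie-williams", "uspop2002", "rwc-pop",
])

# Mirrors the musilar example (--exclude ...).
influence = select_partitions(partitions, exclude=[
    "chordify", "robbie-williams", "uspop2002", "rwc-pop",
    "biab-internet-corpus", "jazz-corpus", "wikifonia",
])
print(audio_only)
print(influence)
```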

Extracting statistics from a ChoCo dataset

The computation of descriptive statistics for a ChoCo dataset is divided into two steps: (i) extraction of descriptors from every JAMS file in the given collection; (ii) aggregation of statistics per namespace (the type of annotation, such as chord, key_mode, etc.) and JAMS type (audio or score). The module responsible for this is jams_stats.py, which provides a simple CLI for both steps (see below).

Simple extractor of chord stats from JAMS files.

positional arguments:
  {extract,aggregate,plot}
                        Either extract, aggregate, plot.
  dataset               Directory where JAMS files will be read, or path to
                        the JAMS stats previously generated

optional arguments:
  -h, --help            show this help message and exit
  --namespaces NAMESPACES
                        A list of namespaces to consider for aggregation; if
                        not provided, all namespaces will be used.
  --out_dir OUT_DIR     Directory where statistics will be saved.
  --n_workers N_WORKERS
                        Number of workers for stats computation.
  --compression COMPRESSION
                        Compression rate for saving the stats file.

Assuming you have downloaded or exported a ChoCo dataset to ~/choco-jams, run the two commands below: the first extracts the descriptors from the JAMS files, and the second aggregates the resulting jams_stats.joblib file.

python jams_stats.py extract ~/choco-jams/jams --out_dir ~/choco-jams/ --n_workers 4
python jams_stats.py aggregate ~/choco-jams/jams_stats.joblib
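
To make the extract and aggregate steps concrete, here is a heavily simplified sketch of the same two-stage idea. It is not the jams_stats.py implementation: the only descriptor is the number of annotations per namespace, the functions extract and aggregate are illustrative names, no joblib file is written, and the split by JAMS type (audio or score) is left out. It relies only on the fact that a JAMS file is a JSON document whose top-level "annotations" list holds objects with a "namespace" field.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def extract(jams_path):
    """Step (i): count the annotations of each namespace in one JAMS file."""
    with open(jams_path, encoding="utf-8") as handle:
        jam = json.load(handle)
    return Counter(ann["namespace"] for ann in jam.get("annotations", []))


def aggregate(per_file_counts, namespaces=None):
    """Step (ii): sum per-file counts, optionally for selected namespaces."""
    totals = Counter()
    for counts in per_file_counts:
        for namespace, count in counts.items():
            if namespaces is None or namespace in namespaces:
                totals[namespace] += count
    return totals


# Two toy JAMS-like files, written to a temporary directory for the demo.
toy_files = {
    "a.jams": {"annotations": [{"namespace": "chord", "data": []},
                               {"namespace": "key_mode", "data": []}]},
    "b.jams": {"annotations": [{"namespace": "chord", "data": []}]},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in toy_files.items():
        Path(tmp_dir, name).write_text(json.dumps(content), encoding="utf-8")
    jams_paths = sorted(Path(tmp_dir).glob("*.jams"))
    per_file = [extract(path) for path in jams_paths]

totals = aggregate(per_file)
chord_only = aggregate(per_file, namespaces={"chord"})
print(dict(totals))
print(dict(chord_only))
```

Keeping the per-file results separate from the aggregation mirrors the CLI design: extraction is the expensive, parallelisable step, while aggregation can be rerun with different namespaces on the saved statistics.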