The synthetic dataset consists of two classes of 2000 samples each. Each class is defined by 4 distinct, class-specific sound objects that represent rhythmic
and melodic structures. Each generated audio sample is a superposition of up to 4 class-specific sound objects, 5 random sounds, and Gaussian noise with a noise strength of
\(\sigma = 0.1\). Samples are generated as superpositions of periodic sine waves, with a duration of 1 second and a synthetic sample rate of \(f_s = 16000\,\mathrm{Hz}\). Randomness
is introduced by drawing the amplitude, phase, frequency, and modulation frequency of each sound object from predefined ranges.
A detailed description of the generation procedure is provided in Chapter 4.1.1 and Appendix D of the thesis report.
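
As a rough illustration of the procedure described above, the following Python sketch superposes randomized sine-based sound objects and adds Gaussian noise. Only the sample rate, the 1-second duration, the count of 5 random sounds, the limit of 4 class-specific objects, and \(\sigma = 0.1\) come from the text. Everything else is an assumption made for illustration: the names `sound_object`, `generate_sample`, `CLASS_OBJECTS`, and `RANDOM_OBJECTS` are hypothetical, the parameter ranges are invented, the modulation is assumed to be amplitude modulation, and at least one class-specific object is assumed to be present. The actual procedure in the thesis report may differ.

```python
import numpy as np

FS = 16_000        # sample rate in Hz (from the text)
DURATION = 1.0     # sample length in seconds (from the text)
NOISE_SIGMA = 0.1  # Gaussian noise strength (from the text)


def sound_object(rng, freq_range, mod_range, amp_range=(0.2, 1.0)):
    """One sine wave with randomized amplitude, phase, frequency and modulation.

    Assumption: the modulation is a sinusoidal amplitude envelope.
    """
    t = np.arange(int(FS * DURATION)) / FS
    amp = rng.uniform(*amp_range)
    freq = rng.uniform(*freq_range)
    mod_freq = rng.uniform(*mod_range)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    envelope = 0.5 * (1.0 + np.sin(2.0 * np.pi * mod_freq * t))
    return amp * envelope * np.sin(2.0 * np.pi * freq * t + phase)


def generate_sample(rng, class_objects, random_objects, n_random=5):
    """Superpose class-specific objects, random sounds and Gaussian noise."""
    # Assumption: "up to 4" is read as between 1 and len(class_objects).
    n_class = rng.integers(1, len(class_objects) + 1)
    chosen = rng.choice(len(class_objects), size=n_class, replace=False)

    signal = np.zeros(int(FS * DURATION))
    for i in chosen:
        signal += sound_object(rng, *class_objects[i])
    for i in rng.integers(0, len(random_objects), size=n_random):
        signal += sound_object(rng, *random_objects[i])

    signal += rng.normal(0.0, NOISE_SIGMA, size=signal.shape)
    return signal


# Hypothetical (frequency range in Hz, modulation range in Hz) pairs.
CLASS_OBJECTS = [
    ((200.0, 250.0), (2.0, 4.0)),
    ((400.0, 450.0), (4.0, 6.0)),
    ((800.0, 850.0), (1.0, 2.0)),
    ((1600.0, 1700.0), (6.0, 8.0)),
]
RANDOM_OBJECTS = [
    ((100.0, 4000.0), (0.5, 10.0)),
    ((100.0, 4000.0), (0.5, 10.0)),
]

rng = np.random.default_rng(0)
sample = generate_sample(rng, CLASS_OBJECTS, RANDOM_OBJECTS)
```

Under these assumptions, a second class would be produced by passing a different list of four class-specific objects while keeping the same pool of random sounds and the same noise level.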
In the following, one example audio sample per class is presented, along with its class-specific sound objects, in the form of log-mel spectrograms.