We created the USCDiarLibri dataset, which can be used to test speaker diarization systems under various customized and randomized setups.

The USCDiarLibri dataset consists of artificial multi-party dialogs made from noisy, reverberated audio from the LibriSpeech corpus, and it is highly parameterized to allow for diverse conditions.

The USCDiarLibri script generates the dataset from an external speech corpus and an external noise dataset. Therefore, the LibriSpeech dataset and the QUT-NOISE dataset must be downloaded to a local folder before you run the data generation script.

Download and Installation

Data Preparation

(1) Download the following speech dataset:

(2) Download the following noise dataset:

(3) The directory that contains USCDiarLibri should be set up as follows.

  |   +--QUT-NOISE-TIMIT/        
  |   +--QUT-NOISE-NIST2008/     
  |   +--QUT-NOISE/
  |   +--docs/
  |   +--code/
  |   +--train-clean-100/ 
  |   +--BOOKS.TXT
  |   ...
  |   +--103/
  |   +--1034/
  |   +--1040/
  |   ...


Creating the USCDiarLibri Dataset

  • For a preconfigured dataset, run one of the provided Python scripts.
$python  # two primary speakers, total 4 speakers
$python  # two primary speakers, total 6 speakers
  • For a customized dataset, modify the parameters in the data generation script. The parameters are defined in a Python dictionary, as below:
session_dict['parameter_name'] = [Value]
  • Each parameter is described in the "Parameters and Descriptions" section.
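As an illustration of the dictionary form above, here is a minimal sketch of a filled-in configuration with a small sanity check. All values and paths are hypothetical and are not the shipped defaults. The helper `check_session_dict` is not part of the USCDiarLibri scripts, and the rule that `dialogue_prob` holds one weight per speaker plus two is an assumption drawn from the parameter description. The brackets in `[Value]` are read here as placeholder notation.

```python
# Hypothetical values for illustration; these are not the shipped defaults.
session_dict = {}
session_dict['librispeech_directory'] = '/path/to/LibriSpeech'
session_dict['noise_data_directory'] = '/path/to/QUT-NOISE'
session_dict['wav_output_directory'] = '/path/to/output'
session_dict['verbose'] = True
session_dict['num_of_prime_spkrs'] = 2
session_dict['num_of_all_spkrs'] = 4
# Order: [Silence, Overlap, speaker 1, speaker 2, speaker 3, speaker 4]
session_dict['dialogue_prob'] = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]
session_dict['number_of_spk_turns'] = 20
session_dict['dist_prob_range_prime_spk'] = [1.0, 2.0]
session_dict['dist_prob_range_bgr_spk'] = [3.0, 5.0]
session_dict['noise'] = True
session_dict['noise_gain_dB_range'] = [5, 20]
session_dict['number_of_sess'] = 10
session_dict['file_id'] = 'demo'


def check_session_dict(cfg):
    """Return a list of problems found in cfg (empty if none were found)."""
    problems = []
    if cfg['num_of_prime_spkrs'] > cfg['num_of_all_spkrs']:
        problems.append('more primary speakers than total speakers')
    # Assumption: one weight each for Silence and Overlap, plus one per speaker.
    expected = cfg['num_of_all_spkrs'] + 2
    if len(cfg['dialogue_prob']) != expected:
        problems.append('dialogue_prob should have %d entries' % expected)
    if any(p < 0 for p in cfg['dialogue_prob']):
        problems.append('dialogue_prob contains a negative weight')
    for key in ('dist_prob_range_prime_spk', 'dist_prob_range_bgr_spk',
                'noise_gain_dB_range'):
        low, high = cfg[key]
        if low > high:
            problems.append('%s has Min greater than Max' % key)
    return problems


print(check_session_dict(session_dict))
```

Running a check like this before generation catches mismatched list lengths early, which is cheaper than discovering them partway through a long run.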

Parameters and Descriptions

The following descriptions cover the parameters of the data generation script. Randomization is done session by session.

librispeech_directory: String. The directory path for the downloaded LibriSpeech data.

noise_data_directory: String. The directory path for the downloaded QUT-NOISE data.

wav_output_directory: String. The directory path for the generated .wav files.

verbose: Python Boolean: True or False. Displays messages during the data generation process.

num_of_prime_spkrs: Positive integer. The number of primary speakers. Currently, the number of primary speakers is fixed at 2.

num_of_all_spkrs: Positive integer. The total number of speakers per session. This number includes both primary speakers and interfering speakers.

dialogue_prob: Python list: probabilities for the states [Silence, Overlap, speaker 1, speaker 2, speaker 3, ..., speaker N]. If you give one state a larger probability than the others, that state appears more frequently.
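To illustrate how such a probability list can drive state selection, here is a minimal sketch using weighted sampling from the Python standard library. It is not the script's actual sampling code. The function names, the state labels, and the use of `random.choices` are assumptions made for this example.

```python
import random


def state_labels(num_speakers):
    """Labels in the same order as dialogue_prob: Silence, Overlap, speakers."""
    return ['silence', 'overlap'] + [
        'speaker%d' % i for i in range(1, num_speakers + 1)
    ]


def sample_states(dialogue_prob, count, seed=None):
    """Draw `count` states, each chosen with weight given by dialogue_prob."""
    rng = random.Random(seed)
    labels = state_labels(len(dialogue_prob) - 2)
    return rng.choices(labels, weights=dialogue_prob, k=count)


# Speaker 1 and speaker 2 are weighted three times higher than the rest.
states = sample_states([0.1, 0.1, 0.3, 0.3, 0.1, 0.1], count=1000, seed=0)
for label in state_labels(4):
    print(label, states.count(label))
```

With enough draws, the printed counts should roughly follow the weights, which is the behavior the parameter description refers to.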

number_of_spk_turns: Positive integer or -1. The number of speaker turns in a session. Use -1 to create as many turns as possible. A turn is a change of state in the artificial dialogue. For example, a session with three turns could consist of speech from speaker 1 for 2.3 sec, followed by silence for 1.8 sec, followed by speech from speaker 5 for 3.6 sec.
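The three-turn example above can be laid out on a timeline as follows. This is a minimal sketch for illustration. The `(state, duration)` representation and the `build_timeline` helper are assumptions and do not reflect the script's internal data structures.

```python
def build_timeline(turns):
    """Convert (state, duration) pairs into (state, start, end) segments."""
    timeline = []
    start = 0.0
    for state, duration in turns:
        end = start + duration
        timeline.append((state, start, end))
        start = end  # the next turn begins where this one ends
    return timeline


# The example from the description: speaker 1, then silence, then speaker 5.
turns = [('speaker1', 2.3), ('silence', 1.8), ('speaker5', 3.6)]
for state, start, end in build_timeline(turns):
    print('%-9s %5.2f -> %5.2f' % (state, start, end))
```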

dist_prob_range_prime_spk: Python list: [Min, Max]. The range of the uniform random variable for the distance between the two primary speakers.

dist_prob_range_bgr_spk: Python list: [Min, Max]. The range of the uniform random variable for the distance between the microphone and the interfering speakers.

noise: Python Boolean: True or False. Turns the background noise on or off.

noise_gain_dB_range: Python list: [Min, Max]. The range of the uniform random variable for the signal-to-noise ratio (SNR) in dB.
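As a worked illustration of an SNR range, the sketch below draws an SNR uniformly from [Min, Max] and scales a noise signal so that the speech-to-noise power ratio matches it. It uses the standard definition SNR = 10 log10(P_speech / P_noise). Whether the script computes the gain in exactly this way is an assumption, and all function names here are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def draw_snr_db(snr_range, rng):
    """Draw an SNR in dB uniformly from [Min, Max]."""
    low, high = snr_range
    return rng.uniform(low, high)


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [gain * v for v in noise]


rng = random.Random(0)
snr_db = draw_snr_db([5, 20], rng)
speech = [1.0, -1.0, 1.0, -1.0]   # toy signal, not real audio
noise = [0.5, -0.5, 0.5, -0.5]    # toy signal, not real audio
scaled = scale_noise_to_snr(speech, noise, snr_db)
achieved = 10 * math.log10(mean_power(speech) / mean_power(scaled))
print(round(snr_db, 3), round(achieved, 3))
```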

absorption range: Python list: [Min, Max]. The range of the uniform random variable for the absorption coefficient of the virtual room used to simulate the impulse response. If you set it to 0, you get an anechoic signal.

number_of_sess: Positive integer, -1 or -2. The number of sessions you want to create. If you set it to -1, the system generates the maximum number of sessions. If you want to create a specific interval of sessions, set it to -2 and specify the minimum and maximum index numbers.

start: Positive integer. Minimum index number.

end: Positive integer. Maximum index number.
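The interaction between number_of_sess, start, and end can be sketched as below. This is an illustration, not the script's logic. It assumes that session indices start at 1, that end is inclusive, and that the maximum number of sessions is known in advance as `max_sessions`. None of these details are confirmed by the descriptions above.

```python
def session_indices(number_of_sess, max_sessions, start=None, end=None):
    """Return the session indices to generate (assumed 1-based, end inclusive)."""
    if number_of_sess == -1:
        # Generate as many sessions as possible.
        return list(range(1, max_sessions + 1))
    if number_of_sess == -2:
        # Generate only the interval [start, end].
        if start is None or end is None or start > end:
            raise ValueError('start and end must be set with start <= end')
        return list(range(start, min(end, max_sessions) + 1))
    if number_of_sess > 0:
        return list(range(1, min(number_of_sess, max_sessions) + 1))
    raise ValueError('number_of_sess must be a positive integer, -1 or -2')


print(session_indices(3, max_sessions=10))
print(session_indices(-1, max_sessions=5))
print(session_indices(-2, max_sessions=10, start=4, end=6))
```

An interval option like this is convenient for splitting generation across several jobs, each producing a disjoint range of sessions.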

file_id: String. Determines the tag for the name of the output file.

Generated Dataset

The USCDiarLibri script generates three kinds of files.

  • WAV file - session_[N]_ch[M].wav : Contains the output of one microphone, which includes the speech signals of the primary speakers and interfering speakers as well as the noise.

  • JSON file - session_[N]_ch[M].json : Contains word alignment information for each channel. The information includes the aligned word, start and end time, duration of each phoneme, and ending time.

  • RTTM file - session_[N].rttm : RTTM is the annotation format used in the NIST Rich Transcription evaluations. Please refer to The Rich Transcription 2006 Spring Meeting Recognition Evaluation.
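For readers unfamiliar with RTTM, here is a minimal sketch of reading speaker segments from RTTM text. It relies on the usual field order of SPEAKER records (type, file, channel, onset, duration, two placeholder fields, speaker name). The sample lines are made up for illustration and were not produced by the USCDiarLibri script, and `parse_rttm` is a hypothetical helper.

```python
def parse_rttm(text):
    """Extract speaker segments from the SPEAKER records of RTTM text."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and record types other than SPEAKER.
        if len(fields) < 8 or fields[0] != 'SPEAKER':
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        segments.append({
            'file': fields[1],
            'channel': fields[2],
            'start': onset,
            'end': onset + duration,
            'speaker': fields[7],
        })
    return segments


# Made-up sample records for illustration only.
sample = """\
SPEAKER session_0 1 0.00 2.30 <NA> <NA> speaker1 <NA>
SPEAKER session_0 1 4.10 3.60 <NA> <NA> speaker5 <NA>
"""
for seg in parse_rttm(sample):
    print(seg['speaker'], seg['start'], round(seg['end'], 2))
```

Reading fields by position keeps the parser tolerant of RTTM variants that append extra trailing fields.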

Contact Information

Taejin Park, University of Southern California