A 100K+ Speaker Recognition Corpus and
Open-Set Speaker-Identification Benchmark

The word cloud above is generated from the videos' tags.

About

Important Notice: The released dataset contains only annotation data, including the YouTube links, timestamps, and speaker labels. We do not release any audio or visual data; it is the user's responsibility to decide whether and how to download the video data, and whether their intended use of the downloaded data is legal in their country.
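For illustration, a single annotation entry of this kind (video link, start/end timestamps, speaker label) could be parsed as in the minimal sketch below. The field names and the whitespace-separated layout are assumptions for illustration, not the released annotation schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One annotated utterance: where it lives on YouTube and who speaks.
    The field layout here is hypothetical, not the released schema."""
    youtube_id: str   # YouTube video identifier
    start: float      # segment start time, in seconds
    end: float        # segment end time, in seconds
    speaker: str      # anonymized speaker label

def parse_line(line: str) -> Segment:
    # Assumed layout: <youtube_id> <start> <end> <speaker_label>
    vid, start, end, spk = line.split()
    return Segment(vid, float(start), float(end), spk)

seg = parse_line("dQw4w9WgXcQ 12.4 18.9 spk00042")
print(seg.speaker, round(seg.end - seg.start, 1))  # spk00042 6.5
```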

10M Utterances

We annotate approximately 10M audio/video segments from videos on YouTube, covering various contexts including podcasts, live streams, live-streaming highlights, etc.

110K+ Speakers

Our dataset spans more than 15 language families, giving it a strongly multilingual character.

16K Hours

The covered scenarios align with real-life situations, and the audio/video recordings of a single speaker vary over time.

Language Distribution

Explore the linguistic characteristics of VoxBlink2! Note that the language labels are derived from a language-detection tool applied to the video tags, so they are not very accurate and are provided for reference only (see the sketch after the table below).

#  Language    Speakers   #   Language    Speakers   #   Language   Speakers
1  English     40000+     7   Vietnamese  1793       13  Japanese   992
2  Portuguese  6227       8   Korean      1544       14  Estonian   725
3  Spanish     6009       9   Italian     1519       15  Norwegian  574
4  Russian     3961       10  French      1503       16  Polish     490
5  Arabic      3467       11  German      1150       17  Tagalog    467
6  Indonesian  1864       12  Turkish     1150       18  Catalan    407
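As a rough illustration of how such labels can be produced, the sketch below runs an off-the-shelf detector over a video's tags and takes a majority vote. The langdetect package is an assumption here, standing in for whichever detection tool was actually used.

```python
# Minimal sketch: guess a video's language from its tags by majority vote.
# `langdetect` is a stand-in assumption, not necessarily the tool we used.
from collections import Counter
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic

def guess_language(tags: list[str]) -> str:
    votes = Counter()
    for tag in tags:
        try:
            votes[detect(tag)] += 1
        except Exception:
            continue  # very short or ambiguous tags may fail to detect
    return votes.most_common(1)[0][0] if votes else "unknown"

print(guess_language(["música brasileira", "entrevista com cantores", "podcast"]))
# most likely "pt"
```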

Theme

Explore the themes of VoxBlink2!

Publications

Please cite the following if you make use of the dataset.

Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li

VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark, INTERSPEECH 2024

BibTeX | Abstract | PDF

Guidance

Build your VoxBlink2

Resource

Following a pipeline similar to VoxBlink's, the annotation files can be downloaded here. Apart from the annotation files, we also provide the following:
1. The meta-information of the videos is saved in ./spk_info, including upload time, themes, video tags, etc.
2. To better support multimodal processing, we provide transcription annotations produced by Whisper-medium. We will release ASR transcripts predicted with the Whisper-large model in the near future.
3. Speaker models of different sizes can be downloaded for evaluation, pre-training, or other speaker-related tasks. The largest model achieves 0.228% EER on the Vox1-O evaluation set without any post-processing; when incorporating score normalization and QMF, it reaches 0.17% EER and 0.006 minDCF (ptar=0.01). A minimal sketch of score normalization follows this list.
If you find it difficult to collect all the data, you can request the data resource via Link, and we will decide whether to provide assistance based on your information and purpose.
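The sketch below shows one common form of the score normalization mentioned in item 3: symmetric S-norm over cosine scores. It is an illustrative assumption rather than the exact recipe behind the reported numbers (see the paper for those details), and the cohort here is random data standing in for real speaker-model embeddings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm(raw: float, enroll_cohort: np.ndarray, test_cohort: np.ndarray) -> float:
    """Symmetric score normalization: shift/scale the raw trial score by
    the statistics of each side's scores against an imposter cohort."""
    z = (raw - enroll_cohort.mean()) / enroll_cohort.std()
    t = (raw - test_cohort.mean()) / test_cohort.std()
    return 0.5 * (z + t)

# Toy usage with random vectors in place of real speaker embeddings.
rng = np.random.default_rng(0)
cohort = rng.standard_normal((500, 256))      # imposter embeddings
enroll, test = rng.standard_normal((2, 256))  # one trial's embeddings
e_scores = np.array([cosine(enroll, c) for c in cohort])
t_scores = np.array([cosine(test, c) for c in cohort])
print(s_norm(cosine(enroll, test), e_scores, t_scores))
```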

Execute

After downloading the annotation files, you can follow the guidance in the Repo and build your database with vb2_meta.tar.gz. Depending on your setup, you can follow the audio-visual or the audio-only download recipe to build the corpus; a rough audio-only sketch is shown below.
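This is a minimal stand-in for the audio-only recipe, assuming yt-dlp and ffmpeg are installed; the official scripts in the Repo are the authoritative version, and the 16 kHz mono output format is an assumption.

```python
import subprocess
import yt_dlp

def download_audio(youtube_id: str, out_path: str) -> None:
    """Fetch the best available audio track of one video with yt-dlp."""
    opts = {"format": "bestaudio/best", "outtmpl": out_path, "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={youtube_id}"])

def cut_segment(src: str, dst: str, start: float, end: float) -> None:
    """Trim one annotated utterance and resample to 16 kHz mono (assumed)."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", src,
         "-t", str(end - start), "-ar", "16000", "-ac", "1", dst],
        check=True,
    )

download_audio("dQw4w9WgXcQ", "full.m4a")  # hypothetical video ID
cut_segment("full.m4a", "utt.wav", 12.4, 18.9)
```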

License

The open-source resources and the execution scripts are licensed under CC BY-NC-SA 4.0. Detailed terms can be found in the LICENSE. The models are likewise licensed under CC BY-NC-SA 4.0, and no commercial use is allowed, because they are trained on VoxBlink2 and VoxCeleb, whose data originate from YouTube. If you have legal concerns about privacy issues in using the data, please consult a lawyer in your region. The metadata provided is accurate as of February 2024; we cannot guarantee the future availability of the videos on the YouTube platform. YouTube users with concerns regarding their videos' inclusion in our dataset may contact us via e-mail: 2018302060299@whu.edu.cn or ming.li@whu.edu.cn.

Open-Set Speaker-Identification

Please refer to the Repo and follow the guidance for evaluation. Make sure that you have collected the VoxBlink-clean set to obtain fair results. For intuition, a generic open-set decision rule is sketched below.
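The sketch scores a test embedding against every enrolled speaker and rejects to "unknown" when no score clears a threshold. This is an illustrative assumption, not the benchmark's exact protocol; the Repo defines the actual evaluation.

```python
import numpy as np

def identify(test_emb: np.ndarray, gallery: dict[str, np.ndarray],
             threshold: float = 0.4) -> str:
    """Return the best-matching enrolled speaker, or "unknown" when no
    cosine score reaches the threshold (0.4 is arbitrary here and would
    be tuned on a development set)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {spk: cosine(test_emb, emb) for spk, emb in gallery.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

# Toy usage with random vectors standing in for speaker embeddings.
rng = np.random.default_rng(1)
gallery = {f"spk{i:02d}": rng.standard_normal(256) for i in range(5)}
print(identify(rng.standard_normal(256), gallery))  # likely "unknown"
```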