A 100K+ Speaker Recognition Corpus and
Open-Set Speaker-Identification Benchmark
10M Utterances
We annotate approximately 10M audio/video segments from YouTube videos, covering a variety of contexts including podcasts, live streams, live-stream highlights, etc.
110K+ speakers
Our dataset spans more than 15 language families, making it broadly multilingual.
16K Hours
The scenarios covered reflect real-life situations, and the audio and video from a single speaker vary over time.
Language Distribution
Explore the language characteristics of VoxBlink2! Note that the language labels are derived from language detection on the video tags, so they are not very accurate and are provided for reference only.
# | Language | Speakers | # | Language | Speakers | # | Language | Speakers |
---|---|---|---|---|---|---|---|---|
1 | English | 40000+ | 7 | Vietnamese | 1793 | 13 | Japanese | 992 |
2 | Portuguese | 6227 | 8 | Korean | 1544 | 14 | Estonian | 725 |
3 | Spanish | 6009 | 9 | Italian | 1519 | 15 | Norwegian | 574 |
4 | Russian | 3961 | 10 | French | 1503 | 16 | Polish | 490 |
5 | Arabic | 3467 | 11 | German | 1150 | 17 | Tagalog | 467 |
6 | Indonesian | 1864 | 12 | Turkish | 1150 | 18 | Catalan | 407 |
Theme
Explore the themes of VoxBlink2!
Guidance
./spk_info, which includes the upload time, themes, video tags, etc.
vb2_meta.tar.gz
Depending on your needs, you can follow either the audio-visual or the audio-only download recipe to build the corpus.
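Before running either recipe, the metadata archive has to be unpacked. The following is only a minimal sketch of that step, assuming vb2_meta.tar.gz has already been downloaded locally. The helper name `extract_archive` and the output directory are hypothetical, and the internal layout of the archive is not described here, so the function simply returns whatever entry names it finds.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return its entry names.

    Hypothetical helper for illustration; not part of the official recipes.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        # Refuse entries whose path would resolve outside dest.
        for member in members:
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return [member.name for member in members]
```

A call such as `extract_archive("vb2_meta.tar.gz", "./vb2_meta")` would then unpack the archive into an assumed `./vb2_meta` directory; inspect the returned names to see what the archive actually contains.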