
VoxCeleb

1M+ speech utterances from 7K+ celebrities extracted from YouTube — the standard speaker verification benchmark.

LQS 82 (gold) · ⚠ Research-only · 1.3M speech utterances · 280 GB on disk · WAV, MP4 · First released 2017
Source: robots.ox.ac.uk · maintained by Oxford Visual Geometry Group (VGG)

About this dataset

VoxCeleb from Oxford Visual Geometry Group (VGG) is the canonical benchmark for speaker verification and identification at scale. VoxCeleb1 contains 153,516 utterances from 1,251 celebrities; VoxCeleb2 adds another 1,128,246 utterances from 6,112 celebrities — 2,000+ hours total. Audio is extracted from interview and talk-show YouTube clips using an automatic audio-visual pipeline, with speaker identity verified via face recognition. Widely used for training speaker embeddings (x-vectors, ECAPA-TDNN) and for ASR domain adaptation on celebrity speech.

Formats
WAV · MP4 · CSV

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

82 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 85
No public completeness metric; using prior for 'automated' datasets.
Uniqueness 93
Exact-hash deduplication documented by maintainer.
Validation 75
Labels generated by a trained model (here, face-recognition-based identity verification in the audio-visual pipeline).
Size adequacy 93
1,281,762 clips — exceeds 100,000 adequacy target for Audio.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 52
Average 1.0 labels per item (sparse).
Class balance 75
Moderate class skew — realistic production distribution.
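
As a rough sanity check on how the composite relates to the seven dimension scores listed above, the sketch below takes their unweighted mean. Equal weighting is an assumption made here for illustration only; this page does not state the actual weights. The unweighted mean comes out just under the published 82, which suggests the real formula weights the dimensions differently.

```python
# Dimension scores as listed on this page.
dimension_scores = {
    "completeness": 85,
    "uniqueness": 93,
    "validation": 75,
    "size_adequacy": 93,
    "format_compliance": 95,
    "label_density": 52,
    "class_balance": 75,
}


def unweighted_mean(scores):
    """Equal-weight average; an assumption, not the documented LQS formula."""
    return sum(scores.values()) / len(scores)


mean_score = unweighted_mean(dimension_scores)
print(f"unweighted mean: {mean_score:.1f}")  # unweighted mean: 81.1
```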

What it's used for

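Common tasks and benchmarks where VoxCeleb is the default or competitive choice: speaker verification, speaker identification, and training speaker embeddings such as x-vectors and ECAPA-TDNN.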

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

VoxCeleb1: 153,516 utterances / 1,251 speakers. VoxCeleb2: 1,128,246 utterances / 6,112 speakers. Total: 1.28M utterances, 7,363 unique identities, 2,000+ hours. Extracted via automatic audio-visual pipeline with face-recognition-based identity verification.
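
The totals above follow directly from the per-release figures. A minimal check, assuming the two releases share no speaker identities (the published total of 7,363 is consistent with that):

```python
# Per-release figures as published on this page.
releases = {
    "VoxCeleb1": {"utterances": 153_516, "speakers": 1_251},
    "VoxCeleb2": {"utterances": 1_128_246, "speakers": 6_112},
}

total_utterances = sum(r["utterances"] for r in releases.values())
# Summing speakers assumes no identity appears in both releases.
total_speakers = sum(r["speakers"] for r in releases.values())

print(total_utterances)  # 1281762
print(total_speakers)    # 7363

# Average utterances per speaker, per release.
for name, r in releases.items():
    per_speaker = r["utterances"] / r["speakers"]
    print(f"{name}: about {per_speaker:.0f} utterances per speaker")
```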

License

VoxCeleb is listed under CC BY-SA 4.0 (VoxCeleb1) and research-only terms (VoxCeleb2). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need audio data for production, check LabelSets' paid datasets below — every listing has an explicit commercial license.

Need commercial-licensed Audio data?

LabelSets sellers offer paid audio datasets, each with an explicit commercial license.



Frequently Asked Questions

Can VoxCeleb be used commercially?
VoxCeleb is listed under CC BY-SA 4.0 (VoxCeleb1) and research-only terms (VoxCeleb2), and LabelSets flags it as research-only. Treat commercial use as restricted and confirm the current terms with the maintainer. For a commercially licensed alternative in audio, see LabelSets' paid datasets.

How large is VoxCeleb?
VoxCeleb contains 1,281,762 speech utterances from 7,363 unique identities, totaling 2,000+ hours: 153,516 utterances from 1,251 speakers in VoxCeleb1 and 1,128,246 utterances from 6,112 speakers in VoxCeleb2. Utterances were extracted by an automatic audio-visual pipeline with face-recognition-based identity verification.

Who maintains VoxCeleb, and where is it available?
VoxCeleb is maintained by Oxford Visual Geometry Group (VGG) and is available at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
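
The tier thresholds stated above can be sketched as a small lookup. The function name `lqs_tier` is hypothetical and not part of any LabelSets API; only the cut-offs come from this page.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score to a tier using the thresholds on this page."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(82))  # gold
```

VoxCeleb's composite of 82 falls in the gold band, matching the tier shown on this listing.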