VoxCeleb

1M+ speech utterances from 7K+ celebrities extracted from YouTube — the standard speaker verification benchmark.

LQS 82 · gold · Research-only · 1.3M speech utterances · 280 GB · WAV, MP4 · Released 2017

Source: robots.ox.ac.uk · maintained by Oxford Visual Geometry Group (VGG)

About this dataset

VoxCeleb, from the Oxford Visual Geometry Group (VGG), is the canonical large-scale benchmark for speaker verification and identification. VoxCeleb1 contains 153,516 utterances from 1,251 celebrities; VoxCeleb2 adds 1,128,246 utterances from 6,112 celebrities, for more than 2,000 hours in total. Audio is extracted from interview and talk-show YouTube clips by an automatic audio-visual pipeline, with speaker identity verified via face recognition. The dataset is widely used for training speaker embeddings (x-vectors, ECAPA-TDNN) and for ASR domain adaptation on celebrity speech.

Maintainer

Oxford Visual Geometry Group (VGG)

License

CC BY-SA 4.0 (VoxCeleb1) / Research (VoxCeleb2)

Formats

WAV · MP4 · CSV

Paper

Read on arxiv.org →

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

82 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 85

No public completeness metric; using prior for 'automated' datasets.

Uniqueness 93

Exact-hash deduplication documented by maintainer.

Validation 75

Labels generated by a trained model (here, face-recognition-based speaker identity verification).

Size adequacy 93

1,281,762 clips — exceeds 100,000 adequacy target for Audio.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.

Label density 52

Average of 1.0 label per item (sparse).

Class balance 75

Moderate class skew — realistic production distribution.
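This page does not state how the seven dimensions are weighted. As a rough sanity check, the sketch below averages the dimension scores listed above with equal weights. The equal weighting and the names `DIMENSIONS` and `unweighted_composite` are assumptions made for illustration, not the LQS methodology itself.

```python
# Dimension scores as listed on this page.
DIMENSIONS = {
    "completeness": 85,
    "uniqueness": 93,
    "validation": 75,
    "size_adequacy": 93,
    "format_compliance": 95,
    "label_density": 52,
    "class_balance": 75,
}


def unweighted_composite(scores):
    """Equal-weight mean of the dimension scores.

    Assumption: every dimension counts the same. The real LQS weights
    are not given on this page.
    """
    return sum(scores.values()) / len(scores)


composite = unweighted_composite(DIMENSIONS)
print(f"{composite:.1f}")  # 81.1
```

The equal-weight mean lands near 81.1, slightly below the published 82, so the actual methodology presumably weights or rounds the dimensions differently.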

What it's used for

Common tasks and benchmarks where VoxCeleb is the default or competitive choice.

Speaker verification
Speaker identification
Speaker embedding learning
Audio-visual speaker recognition

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

VoxCeleb1: 153,516 utterances / 1,251 speakers. VoxCeleb2: 1,128,246 utterances / 6,112 speakers. Total: 1.28M utterances, 7,363 unique identities, 2,000+ hours. Extracted via automatic audio-visual pipeline with face-recognition-based identity verification.
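The totals above can be cross-checked from the per-release counts. The following is a minimal sketch; the `RELEASES` dictionary and variable names are illustrative, and the average utterances per speaker is a derived figure, not one published by the maintainer.

```python
# Per-release counts from the published statistics quoted above.
RELEASES = {
    "VoxCeleb1": {"utterances": 153_516, "speakers": 1_251},
    "VoxCeleb2": {"utterances": 1_128_246, "speakers": 6_112},
}

total_utterances = sum(r["utterances"] for r in RELEASES.values())

# Adding speaker counts treats the two releases as sharing no identities,
# which matches the 7,363 unique identities stated above.
total_speakers = sum(r["speakers"] for r in RELEASES.values())

# Derived figure: mean number of utterances per speaker across both releases.
utterances_per_speaker = total_utterances / total_speakers

print(total_utterances)                   # 1281762
print(total_speakers)                     # 7363
print(round(utterances_per_speaker, 1))   # 174.1
```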

License

VoxCeleb is distributed under CC BY-SA 4.0 (VoxCeleb1) / Research (VoxCeleb2). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: the Research terms listed for VoxCeleb2 restrict commercial use, and CC BY-SA 4.0 (VoxCeleb1) carries attribution and share-alike conditions. If you need audio data for production, check LabelSets' paid datasets below; every listing has an explicit commercial license.

Need commercial-licensed Audio data?

LabelSets sellers offer paid audio datasets with what public datasets often can't give you:

Explicit commercial license in writing
LQS-verified quality in your specific use-case
Instant download — no DUA, credentialed access, or research gating
PII scanned, deduplicated, and production-ready

Frequently Asked Questions

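VoxCeleb is distributed under CC BY-SA 4.0 (VoxCeleb1) / Research (VoxCeleb2). The Research terms on VoxCeleb2 restrict commercial use, while CC BY-SA 4.0 permits it subject to attribution and share-alike conditions. For a commercially licensed alternative in audio, see LabelSets' paid datasets.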

VoxCeleb contains 1,281,762 speech utterances: 153,516 from 1,251 speakers in VoxCeleb1 and 1,128,246 from 6,112 speakers in VoxCeleb2, covering 7,363 unique identities and 2,000+ hours. The audio was extracted via an automatic audio-visual pipeline with face-recognition-based identity verification.

VoxCeleb is maintained by Oxford Visual Geometry Group (VGG) and is available at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
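The tier thresholds above amount to a simple lookup. A minimal sketch follows; the function name `lqs_tier` is hypothetical and only the cut-offs come from this page.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score to its tier using the thresholds stated above."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(82))  # gold
```

With VoxCeleb's composite of 82, the lookup returns the gold tier shown on this page.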