Audio ML is a field with strong public benchmarks and a surprisingly thin landscape of commercially licensed training data. For many speech and audio tasks, the best available datasets are research-grade and come with licenses that complicate commercial deployment. This guide covers the major audio tasks, what technical requirements matter (sample rate, format, acoustic environment), and the best available datasets for each — including where to find commercially licensed options.
Three Distinct Audio ML Tasks (and Why They're Different)
A common mistake is treating "audio datasets" as a single category. The three main tasks have fundamentally different data requirements:
- Automatic Speech Recognition (ASR) — Transcribing spoken language to text. Requires aligned audio-transcript pairs. Model performance depends heavily on acoustic diversity (microphone types, room acoustics, background noise), speaker diversity (age, accent, language variety), and transcript accuracy. The unit of quality is word error rate (WER) on held-out test sets.
- Speaker Identification / Verification — Recognizing who is speaking, either from a known set (identification) or comparing two clips (verification). Requires audio clips labeled by speaker identity, with sufficient per-speaker utterances to learn speaker-discriminative features. The unit of quality is equal error rate (EER) on held-out speakers.
- Keyword / Command Detection — Detecting specific words or commands (wake word detection, voice command systems). Requires short audio clips with binary or multi-class labels, often with a heavy class imbalance between positive (command) and negative (background/non-command) examples. The unit of quality is false accept rate vs. false reject rate at deployment latency.
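Since ASR quality is summarized by word error rate, it helps to see how WER is actually computed. The sketch below is a minimal, pure-Python version: word-level Levenshtein edit distance divided by reference length. The `wer` function name and the example sentences are illustrative, not from any specific toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution against a four-word reference -> WER of 0.25.
print(wer("turn the lights on", "turn the light on"))  # 0.25
```

Production evaluation toolkits add text normalization (casing, punctuation, number formats) before scoring, which this sketch omits.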
Technical Requirements: Sample Rate and Format
Before evaluating any audio dataset, check these technical specs:
- Sample rate — 16kHz is the minimum for ASR; models like Whisper and wav2vec 2.0 were designed for 16kHz input. Telephone audio is often 8kHz (too low for high-accuracy ASR without upsampling). 44.1kHz or 48kHz is standard for music and general audio, but doesn't improve ASR performance. If your deployment environment is phone calls, you want data recorded at 8kHz or downsampled from higher rates in the same acoustic conditions.
- Format — WAV (PCM, 16-bit) is the universal standard for training pipelines. FLAC is lossless compressed and widely supported. MP3/AAC introduces compression artifacts and is not recommended for training data. Always check whether metadata (transcripts, speaker IDs, timestamps) is provided as separate TSV/JSON files or embedded in filename conventions.
- Acoustic conditions — Clean studio recordings train clean-condition models. If your deployment captures speech from a distance, in a noisy environment, or over telephony, you need data recorded under similar conditions, or augmentation that simulates those conditions. Don't assume a model trained on studio-quality LibriSpeech will perform well on smart speaker microphone input.
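The sample-rate and format checks above can be automated with nothing but the standard library. This sketch writes a synthetic 16 kHz / 16-bit mono PCM WAV in memory and verifies it against a typical ASR pipeline's minimum requirements; in practice you would run `wav_specs` (a hypothetical helper name) over real files on disk.

```python
import io
import wave

def wav_specs(fileobj):
    """Return (sample_rate_hz, bit_depth, channels) for a PCM WAV file."""
    with wave.open(fileobj, "rb") as w:
        return w.getframerate(), w.getsampwidth() * 8, w.getnchannels()

# Write 0.1 s of silence as 16 kHz / 16-bit mono PCM, then check it.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes per sample = 16-bit
    w.setframerate(16000)      # 16 kHz
    w.writeframes(b"\x00\x00" * 1600)
buf.seek(0)

rate, bits, channels = wav_specs(buf)
assert rate >= 16000, "sample rate below the 16 kHz ASR minimum"
print(rate, bits, channels)  # 16000 16 1
```

Running this check over an entire corpus before training catches mixed-rate datasets early, before they silently degrade model quality.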
Automatic Speech Recognition (ASR) Datasets
LibriSpeech
High quality · Free, commercial OK

LibriSpeech is the standard ASR benchmark — 960 hours of English speech derived from public-domain audiobooks from the LibriVox project. The audio quality is clean and consistent (read speech, quiet environments), the transcripts are accurate, and the CC BY 4.0 license permits commercial use with attribution. It's the baseline for nearly every published ASR paper. The limitation is that it's read speech — models trained only on LibriSpeech perform worse on conversational, spontaneous speech than their benchmark numbers suggest. For deployment in voice assistant or meeting transcription contexts, supplement with conversational data.
Mozilla Common Voice
Multilingual · CC0 — completely free

Mozilla Common Voice is the largest openly licensed multilingual speech dataset, with 20,000+ hours of validated speech across 100+ languages. It's crowd-sourced — volunteers record and validate clips — which means quality varies by language, with high-resource languages (English, German, French) having substantially more hours and validation density than low-resource ones. The CC0 license means no restrictions whatsoever, including commercial use. The main technical downside is that clips are distributed as MP3, which introduces mild compression artifacts and requires conversion for training pipelines that expect lossless audio. Highly recommended for multilingual ASR and for building baseline models in lower-resource languages.
GigaSpeech
Conversational diversity · Commercial use

GigaSpeech fills the conversational speech gap in LibriSpeech. It contains 10,000 hours of English speech from audiobooks, podcasts, and YouTube videos — covering a wide range of acoustic conditions, speaking styles, and topics. Apache 2.0 license. The diversity of recording conditions makes GigaSpeech-trained models more robust to real-world microphone and environment variation than LibriSpeech-only models. The tradeoff is slightly noisier transcripts (automatic transcription with human validation). For building a production-grade ASR model, training on a combination of LibriSpeech and GigaSpeech covers both clean-speech precision and conversational robustness.
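Combining corpora like LibriSpeech and GigaSpeech usually starts with merging their manifests into one training list. This is a minimal sketch assuming a hypothetical `path<TAB>transcript` TSV layout — the real manifest formats differ per dataset, and real pipelines would also normalize transcript casing and punctuation across corpora.

```python
import csv
import io

def read_manifest(tsv_text):
    """Parse a 'path<TAB>transcript' manifest into a list of dicts."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [{"path": p, "text": t} for p, t in rows]

# Hypothetical one-line manifests for two corpora (layouts are illustrative).
librispeech = "ls/1089-134686-0000.flac\tHE HOPED THERE WOULD BE STEW\n"
gigaspeech = "gs/POD0000001.wav\twelcome back to the show\n"

combined = read_manifest(librispeech) + read_manifest(gigaspeech)
print(len(combined))  # 2
```

Note the casing mismatch in the example: LibriSpeech transcripts are uppercase while other corpora use mixed case, which is exactly the kind of inconsistency a merge step must normalize.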
Speaker Identification and Verification Datasets
VoxCeleb 1 & 2
Speaker diversity · Free, commercial OK

VoxCeleb is the standard benchmark for speaker recognition, built from audio-visual data of celebrities (politicians, athletes, actors) speaking in real-world conditions — interviews, speeches, TV appearances. The "in-the-wild" recording conditions make it much more realistic for production speaker ID than clean studio datasets. VoxCeleb1 has 1,251 speakers with 153K utterances; VoxCeleb2 has 6,112 speakers and 1.1M utterances. CC BY 4.0. The celebrity-only composition means demographic coverage skews toward public figures — if your application requires diverse speaker demographics, you'll want to supplement with additional data.
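Speaker verification systems are scored by equal error rate: the point where the false accept rate on impostor pairs equals the false reject rate on genuine pairs. This is a minimal sketch that approximates EER by sweeping candidate thresholds over toy similarity scores; the `eer` function and the score values are illustrative, not from any benchmark toolkit.

```python
def eer(genuine_scores, impostor_scores):
    """Approximate equal error rate by sweeping candidate thresholds.

    Higher score = more likely the same speaker. Returns the error rate
    at the threshold where false accepts and false rejects are closest.
    """
    best = None
    for t in sorted(genuine_scores + impostor_scores):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Well-separated toy scores -> the two error rates cross at zero.
print(eer([0.9, 0.8, 0.95], [0.1, 0.2, 0.15]))  # 0.0
```

Real evaluations interpolate between thresholds on much larger trial lists, but the tradeoff being measured is the same.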
Keyword and Command Detection Datasets
Google Speech Commands
Standard wake word benchmark Free, commercial OKGoogle Speech Commands contains 105,000 one-second WAV clips of 35 different spoken command words ("yes," "no," "up," "down," "left," "right," "stop," "go," plus digits and others), recorded by thousands of volunteers in various acoustic conditions. It's the standard dataset for training and benchmarking small on-device keyword detection models. CC BY 4.0. The main limitation is the fixed vocabulary — 35 predefined commands. If your application requires custom wake words or a domain-specific command vocabulary, you'll need to collect custom data or augment with your specific target words.
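The heavy class imbalance in keyword data (a handful of command clips against hours of background audio) is usually handled at batch-sampling time. This is a minimal sketch of a balanced batch sampler under assumed data shapes — the `sample_batch` helper and the 25% positive fraction are illustrative choices, not a prescribed recipe.

```python
import random

def sample_batch(positives, negatives, batch_size=8, pos_fraction=0.25):
    """Draw a training batch with a fixed positive fraction, compensating
    for the positive/negative class imbalance in keyword data."""
    n_pos = int(batch_size * pos_fraction)
    batch = random.sample(positives, n_pos) + \
            random.sample(negatives, batch_size - n_pos)
    random.shuffle(batch)
    return batch

random.seed(0)
pos = [("yes", 1)] * 50           # scarce command clips
neg = [("background", 0)] * 5000  # abundant non-command clips
batch = sample_batch(pos, neg)
print(sum(label for _, label in batch))  # 2 positives per 8-clip batch
```

Without a sampler like this, a model can reach high accuracy by always predicting "no keyword" — which is exactly why false accept and false reject rates, not accuracy, are the metrics that matter here.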
LabelSets Audio Datasets — Browse Audio Datasets
Commercial license · Metadata structured

Commercial audio applications often require domain-specific data that public benchmarks don't cover: call center audio, voice command datasets for specific product vocabularies, accented speech for underrepresented dialects, or audio from specific recording environments (hands-free car microphones, smart speaker arrays, telephony). LabelSets hosts commercially licensed audio datasets with structured metadata (transcripts, speaker IDs, command labels, acoustic condition tags). All datasets include LQS quality scores covering completeness, format compliance, and label density. For teams building production voice products who need data they can legally deploy on, browse the audio catalog for currently available sets.
Evaluating Audio Dataset Quality
Before committing to any audio dataset, check these quality indicators:
- Transcript accuracy — What's the estimated word error rate of the transcriptions themselves? Datasets transcribed by professional annotators with domain knowledge have lower error rates than crowd-sourced transcriptions. Some datasets report this explicitly; others don't.
- Speaker diversity metadata — Are speaker demographics (age range, gender, accent/dialect, native language) documented? A dataset that doesn't document speaker demographics can't be evaluated for representativeness.
- Recording condition diversity — Single-microphone studio recordings produce models that fail on other hardware. Look for datasets that document or intentionally vary microphone type, room acoustics, and background noise level.
- Alignment quality — For ASR datasets, are timestamps available (forced alignment or manual)? Accurate word-level timestamps are critical for training attention-based models and for downstream tasks like keyword spotting from ASR output.
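Several of the checks above can be screened mechanically before any listening test. This is a minimal audit sketch over a dataset manifest; the row keys (`path`, `text`, `speaker_id`) are assumed for illustration and would need to match your manifest's actual schema.

```python
def audit_manifest(rows):
    """Count basic quality problems in a dataset manifest.

    Each row is a dict with assumed keys: 'path', 'text', 'speaker_id'.
    Returns a dict of issue counts.
    """
    issues = {"missing_transcript": 0, "missing_speaker": 0, "non_lossless": 0}
    for r in rows:
        if not r.get("text", "").strip():
            issues["missing_transcript"] += 1
        if not r.get("speaker_id"):
            issues["missing_speaker"] += 1
        if not r.get("path", "").endswith((".wav", ".flac")):
            issues["non_lossless"] += 1
    return issues

rows = [
    {"path": "a.wav", "text": "hello there", "speaker_id": "spk1"},
    {"path": "b.mp3", "text": "", "speaker_id": ""},
]
print(audit_manifest(rows))
# {'missing_transcript': 1, 'missing_speaker': 1, 'non_lossless': 1}
```

An audit like this won't measure transcript accuracy — that still requires sampling and human review — but it surfaces structural gaps (undocumented speakers, lossy formats, empty labels) in seconds.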
Have audio datasets you'd like to score before using them for training? Run any dataset through the free LQS quality audit tool at labelsets.ai/quality-audit — no account needed. It checks annotation completeness, format compliance, and label consistency in a few minutes.