Audio ML is a field with strong public benchmarks and a surprisingly thin landscape of commercially licensed training data. For many speech and audio tasks, the best available datasets are research-grade and come with licenses that complicate commercial deployment. This guide covers the major audio tasks, what technical requirements matter (sample rate, format, acoustic environment), and the best available datasets for each — including where to find commercially licensed options.

Three Distinct Audio ML Tasks (and Why They're Different)

A common mistake is treating "audio datasets" as a single category. The three main tasks have fundamentally different data requirements:

- Automatic speech recognition (ASR) — mapping speech audio to text. Needs large volumes of accurately transcribed speech covering the speaking styles and acoustic conditions you'll see in deployment.
- Speaker identification and verification — determining who is speaking. Needs many utterances per speaker, recorded across varied sessions and conditions, rather than maximal transcript coverage.
- Keyword and command detection — spotting a small, fixed vocabulary, often on-device. Needs short labeled clips of the target words plus negative examples, not long transcribed recordings.

Technical Requirements: Sample Rate and Format

Before evaluating any audio dataset, check these technical specs:

- Sample rate — 16kHz is the de facto standard for speech models; mixing rates forces resampling, so confirm the dataset matches your pipeline.
- Format — lossless WAV or FLAC is safest for training; compressed formats like MP3 can introduce artifacts.
- Acoustic environment — clean read speech versus conversational, in-the-wild recordings; models generalize best when training conditions resemble deployment conditions.
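These checks are easy to automate before audio enters a training pipeline. Below is a minimal sketch using Python's standard-library `wave` module to verify a WAV file's sample rate and channel count; `check_wav_specs` is a hypothetical helper name, not part of any dataset's tooling.

```python
import io
import wave

def check_wav_specs(wav_file, expected_rate=16000, expected_channels=1):
    """Return (ok, info) for a WAV file checked against expected specs."""
    with wave.open(wav_file, "rb") as wf:
        info = {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
    ok = (info["sample_rate"] == expected_rate
          and info["channels"] == expected_channels)
    return ok, info

# Demo: build a synthetic 1-second, 16kHz mono clip in memory and check it.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)              # 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)
buf.seek(0)
ok, info = check_wav_specs(buf)     # ok is True, duration_s is 1.0
```

The same check can be looped over a directory before any clip is accepted into a training manifest.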

Automatic Speech Recognition (ASR) Datasets

LibriSpeech

960 hours · English audiobooks · CC BY 4.0 · 16kHz WAV
High quality · Free, commercial OK

LibriSpeech is the standard ASR benchmark — 960 hours of English speech derived from public-domain audiobooks from the LibriVox project. The audio quality is clean and consistent (read speech, quiet environments), the transcripts are accurate, and the CC BY 4.0 license permits commercial use with attribution. It's the baseline for nearly every published ASR paper. The limitation is that it's read speech — models trained only on LibriSpeech perform worse on conversational, spontaneous speech than their benchmark numbers suggest. For deployment in voice assistant or meeting transcription contexts, supplement with conversational data.
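LibriSpeech ships its transcripts as plain-text `*.trans.txt` files, one utterance per line: the utterance ID, a space, then the transcript. A minimal parser might look like this (`parse_librispeech_transcripts` is an illustrative helper, not an official API):

```python
def parse_librispeech_transcripts(trans_text):
    """Parse the contents of a LibriSpeech *.trans.txt file into a dict
    mapping utterance ID -> transcript."""
    transcripts = {}
    for line in trans_text.strip().splitlines():
        utt_id, _, text = line.partition(" ")
        transcripts[utt_id] = text
    return transcripts

# Two lines in the trans.txt format (speaker-chapter-utterance IDs).
sample = (
    "1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER\n"
    "1089-134686-0001 STUFF IT INTO YOU HIS BELLY COUNSELLED HIM\n"
)
parsed = parse_librispeech_transcripts(sample)
```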

Mozilla Common Voice

20,000+ hours · 100+ languages · CC0 · 16kHz MP3
Multilingual · CC0 — completely free

Mozilla Common Voice is the largest openly licensed multilingual speech dataset, with 20,000+ hours of validated speech across 100+ languages. It's crowd-sourced — volunteers record and validate clips — which means quality varies by language, with high-resource languages (English, German, French) having substantially more hours and validation density than low-resource ones. CC0 license means no restrictions whatsoever, including commercial use. The MP3 format is the main technical downside — there's minor compression artifact risk in training pipelines that expect lossless audio. Highly recommended for multilingual ASR and for building baseline models in lower-resource languages.

GigaSpeech

10,000 hours · English · Apache 2.0 · Audiobooks, podcasts, YouTube
Conversational diversity · Commercial use

GigaSpeech fills the conversational speech gap in LibriSpeech. It contains 10,000 hours of English speech from audiobooks, podcasts, and YouTube videos — covering a wide range of acoustic conditions, speaking styles, and topics. Apache 2.0 license. The diversity of recording conditions makes GigaSpeech-trained models more robust to real-world microphone and environment variation than LibriSpeech-only models. The tradeoff is slightly noisier transcripts (automatic transcription with human validation). For building a production-grade ASR model, training on a combination of LibriSpeech and GigaSpeech covers both clean-speech precision and conversational robustness.
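Combining LibriSpeech and GigaSpeech in practice usually means sampling from both corpora with chosen weights so neither dominates a training batch. A minimal sketch of weighted corpus mixing (the function name and weights are illustrative, not from either dataset's tooling):

```python
import random

def mix_corpora(corpora, weights, n_samples, seed=0):
    """Draw a training mixture: each draw picks a corpus by weight,
    then a random utterance from that corpus."""
    rng = random.Random(seed)
    names = list(corpora)
    return [
        (name, rng.choice(corpora[name]))
        for name in rng.choices(names, weights=weights, k=n_samples)
    ]

corpora = {
    "librispeech": ["ls_0001.flac", "ls_0002.flac"],
    "gigaspeech": ["gs_0001.opus", "gs_0002.opus"],
}
mixture = mix_corpora(corpora, weights=[0.4, 0.6], n_samples=10)
```

Weighting by hours rather than equally is a common refinement when one corpus is much larger than the other.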

Speaker Identification and Verification Datasets

VoxCeleb 1 & 2

1,251 speakers (VoxCeleb1) / 6,112 speakers (VoxCeleb2) · CC BY 4.0 · Wild audio
Speaker diversity · Free, commercial OK

VoxCeleb is the standard benchmark for speaker recognition, built from audio-visual data of celebrities (politicians, athletes, actors) speaking in real-world conditions — interviews, speeches, TV appearances. The "in-the-wild" recording conditions make it much more realistic for production speaker ID than clean studio datasets. VoxCeleb1 has 1,251 speakers with 153K utterances; VoxCeleb2 has 6,112 speakers and 1.1M utterances. CC BY 4.0. The celebrity-only composition means demographic coverage skews toward public figures — if your application requires diverse speaker demographics, you'll want to supplement with additional data.
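Speaker verification systems trained on VoxCeleb typically compare speaker embeddings with cosine similarity against a tuned decision threshold. A bare-bones sketch of that decision rule (the embeddings and threshold here are placeholders; real systems tune the threshold on a held-out trial list):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Verification decision: accept the trial if similarity clears the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy embeddings: identical vectors are a clear accept, orthogonal a clear reject.
accept = same_speaker([0.9, 0.1, 0.4], [0.9, 0.1, 0.4])
reject = same_speaker([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```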

Keyword and Command Detection Datasets

Google Speech Commands

105K clips · 35 command words · CC BY 4.0 · 16kHz WAV · 1-second clips
Standard wake word benchmark · Free, commercial OK

Google Speech Commands contains 105,000 one-second WAV clips of 35 different spoken command words ("yes," "no," "up," "down," "left," "right," "stop," "go," plus digits and others), recorded by thousands of volunteers in various acoustic conditions. It's the standard dataset for training and benchmarking small on-device keyword detection models. CC BY 4.0. The main limitation is the fixed vocabulary — 35 predefined commands. If your application requires custom wake words or a domain-specific command vocabulary, you'll need to collect custom data or augment with your specific target words.
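Models trained on Speech Commands expect fixed one-second inputs, so variable-length recordings (including any custom data you collect) are usually zero-padded or truncated to exactly 16,000 samples. A minimal sketch:

```python
def fix_length(samples, target_len=16000):
    """Zero-pad or truncate a clip to exactly target_len samples
    (16,000 samples = 1 second at 16kHz)."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0] * (target_len - len(samples))
```

The same preprocessing applied at training and inference time keeps the model's input shape constant, which is what makes these models small enough for on-device deployment.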

LabelSets Audio Datasets

Domain-specific audio · Commercial license · LQS quality-scored · Instant download
Commercial license · Structured metadata

Commercial audio applications often require domain-specific data that public benchmarks don't cover: call center audio, voice command datasets for specific product vocabularies, accented speech for underrepresented dialects, or audio from specific recording environments (hands-free car microphones, smart speaker arrays, telephony). LabelSets hosts commercially licensed audio datasets with structured metadata (transcripts, speaker IDs, command labels, acoustic condition tags). All datasets include LQS quality scores covering completeness, format compliance, and label density. For teams building production voice products who need data they can legally deploy on, browse the audio catalog for currently available sets.

Evaluating Audio Dataset Quality

Before committing to any audio dataset, check these quality indicators:

- Annotation completeness — every clip should carry the labels your task needs (transcript, speaker ID, or command label), with few empty or placeholder fields.
- Format compliance — consistent sample rate, channel count, and codec across the whole set.
- Label consistency — uniform transcription conventions (casing, punctuation, number formatting) and agreement between annotators.
- Acoustic coverage — documented recording conditions, with enough diversity to match where the model will be deployed.
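Some of these checks can be scripted before you commit. A sketch of a completeness report over a list of metadata records (the field names are placeholders for whatever your dataset's schema actually uses):

```python
def completeness_report(records, required_fields=("audio_path", "transcript")):
    """Fraction of records where every required label field is present and non-empty."""
    complete = sum(
        all(record.get(field) for field in required_fields)
        for record in records
    )
    total = len(records)
    return {"total": total, "complete": complete,
            "ratio": complete / total if total else 0.0}

# Toy manifest: one fully labeled clip, one with a missing transcript.
records = [
    {"audio_path": "a.wav", "transcript": "turn the lights on"},
    {"audio_path": "b.wav", "transcript": ""},
]
report = completeness_report(records)
```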

Have audio datasets you'd like to score before using them for training? Run any dataset through the free LQS quality audit tool at labelsets.ai/quality-audit — no account needed. It checks annotation completeness, format compliance, and label consistency in a few minutes.