10,000 hours of transcribed English speech from podcasts, audiobooks, and YouTube.
GigaSpeech from SpeechColab is a multi-domain English ASR corpus with 10,000 hours of transcribed audio sourced from audiobooks, podcasts, and YouTube. The XL subset covers the full 10,000 hours; smaller subsets (L, M, S, XS) are provided for faster experimentation. Transcripts are normalized and punctuated.
LQS is our 7-dimension quality score, computed from the dataset's published statistics.
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
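As a rough illustration of how a composite over these seven dimensions could be formed, the sketch below takes an unweighted mean of per-dimension scores. The equal weighting, the 0-100 scale, and the names `DIMENSIONS` and `composite_score` are assumptions made for this example; the actual LQS weighting is not stated here and may differ.

```python
from statistics import fmean

# The seven dimensions named above. The snake_case keys are hypothetical.
DIMENSIONS = (
    "completeness",
    "uniqueness",
    "validation_health",
    "size_adequacy",
    "format_compliance",
    "label_density",
    "class_balance",
)


def composite_score(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the seven dimension scores.

    Assumption: every dimension is on the same scale (e.g. 0-100) and
    contributes equally. The real LQS formula may weight them differently.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return fmean(scores[d] for d in DIMENSIONS)
```

Under this assumption, a dataset that scores the same value on every dimension gets that value as its composite, and a single weak dimension lowers the composite by one seventh of the shortfall.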
Common tasks and benchmarks where GigaSpeech is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
GigaSpeech is distributed under Apache 2.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid audio datasets with what public datasets often can't give you:
Other entries in the Audio catalog.