
GigaSpeech

10,000 hours of transcribed English speech from podcasts, audiobooks, and YouTube.

LQS 85 (gold) · Commercial use OK · 8.3M speech segments · 760 GB on disk · OPUS · JSON · Released 2021
Source: github.com · maintained by SpeechColab

About this dataset

GigaSpeech from SpeechColab is a multi-domain English ASR corpus with 10,000 hours of transcribed audio sourced from audiobooks, podcasts, and YouTube. The XL subset contains the full 10,000 hours; smaller subsets (L, M, S, XS) are provided for faster experimentation. Transcripts are normalized and punctuated.

Maintainer: SpeechColab
License: Apache 2.0
Formats: OPUS · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

85 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness: 88. No public completeness metric; using prior for 'crowdsourced_qc' datasets.
Uniqueness: 93. Exact-hash deduplication documented by maintainer.
Validation: 82. Crowdsourced labels with quality-control protocol (redundancy, golden tests).
Size adequacy: 96. 8,300,000 segments, which exceeds the 100,000 adequacy target for Audio.
Format compliance: 95. Industry-standard format, drop-in compatible with mainstream tooling.
Label density: 52. Average 1.0 labels per item (sparse).
Class balance: 75. Moderate class skew, reflecting a realistic production distribution.
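
As a rough sanity check on how the seven dimension scores relate to the composite, the sketch below takes their unweighted mean. The weighting LabelSets actually uses is not stated on this page, so the equal weights are an assumption, and the variable names are illustrative only.

```python
# Dimension scores for GigaSpeech, copied from the breakdown above.
dimension_scores = {
    "completeness": 88,
    "uniqueness": 93,
    "validation": 82,
    "size_adequacy": 96,
    "format_compliance": 95,
    "label_density": 52,
    "class_balance": 75,
}

# Assumption: every dimension carries equal weight.
# The real LQS weighting is not given on this page.
unweighted_mean = sum(dimension_scores.values()) / len(dimension_scores)

published_composite = 85  # composite score shown on this page

print(f"unweighted mean:     {unweighted_mean:.1f}")
print(f"published composite: {published_composite}")
```

The unweighted mean works out to 83.0, two points below the published 85, so the composite is probably not a plain average of the seven dimensions.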

What it's used for

Common tasks and benchmarks where GigaSpeech is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

10,000 hours of audio in 8.3M segments, drawn from audiobooks (33%), podcasts (17%), and YouTube (50%). Available as XL, L, M, S, and XS subsets. Audio is sampled at 16 kHz.
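
To put these figures in perspective, the short sketch below derives the average segment length and the approximate hours per source. It uses only the numbers quoted above; the variable names are illustrative, and the per-source hours are estimates that assume the percentages apply to total hours rather than to segment counts.

```python
# Figures quoted in the sample statistics above.
total_hours = 10_000
total_segments = 8_300_000
source_percent = {"audiobooks": 33, "podcasts": 17, "YouTube": 50}

# Average segment length in seconds: total seconds divided by segment count.
avg_segment_seconds = total_hours * 3600 / total_segments

# Assumption: the percentages describe share of hours, not share of segments.
hours_by_source = {
    name: total_hours * pct // 100 for name, pct in source_percent.items()
}

print(f"average segment length: {avg_segment_seconds:.2f} s")
for name, hours in hours_by_source.items():
    print(f"{name}: ~{hours} hours")
```

Under these assumptions the average segment is a little over four seconds long, which is typical of utterance-level ASR training data.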

License

GigaSpeech is distributed under Apache 2.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Frequently Asked Questions

Can I use GigaSpeech commercially?
GigaSpeech is distributed under Apache 2.0, which generally permits commercial use. Always verify the current license terms with the maintainer (SpeechColab) before using it in a commercial product.

How large is GigaSpeech?
GigaSpeech contains 8,300,000 speech segments totaling 10,000 hours, drawn from audiobooks (33%), podcasts (17%), and YouTube (50%). It is available as XL, L, M, S, and XS subsets, with audio sampled at 16 kHz.

Who maintains GigaSpeech and where is it available?
GigaSpeech is maintained by SpeechColab and is available at https://github.com/SpeechColab/GigaSpeech. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). See the full methodology for details.
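
The tier thresholds stated above can be expressed as a small lookup. This is a minimal sketch, not LabelSets' own code; the function name `lqs_tier` is hypothetical.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score (0-100) to its tier name.

    Thresholds follow the page text: platinum >= 90, gold >= 75,
    silver >= 60, and bronze for anything below 60.
    """
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


# GigaSpeech's published composite score is 85.
print(lqs_tier(85))
```

For GigaSpeech's score of 85 this returns `gold`, matching the tier shown on this page.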