GigaSpeech

10,000 hours of transcribed English speech from podcasts, audiobooks, and YouTube.

LQS 85 · gold · ✓ Commercial OK · 8.3M speech segments · 760 GB · OPUS · JSON · Released 2021

Source: github.com · maintained by SpeechColab

About this dataset

GigaSpeech from SpeechColab is a multi-domain English ASR corpus with 10,000 hours of transcribed audio sourced from audiobooks, podcasts, and YouTube. The XL subset contains the full 10,000 hours; smaller subsets (L, M, S, XS) are provided for faster experimentation. Transcripts are normalized and punctuated.
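As a rough illustration of how the subsets relate to each other, the sketch below filters segments by subset tag. It assumes the metadata JSON lists audio files under `audios`, each with a `segments` array whose entries carry `begin_time`, `end_time`, `text_tn`, and a `subsets` list of tags such as `{XL}`. Those field names and the sample IDs are assumptions for illustration; check them against the maintainer's documentation before relying on them.

```python
import json

# Tiny inline stand-in for the real metadata file.
# Field names and IDs below are assumptions, not a verified schema.
SAMPLE = json.loads("""
{"audios": [
  {"aid": "POD0000000001", "source": "podcast", "segments": [
    {"sid": "POD0000000001_S0000001", "begin_time": 0.0, "end_time": 4.2,
     "text_tn": "HELLO AND WELCOME <PERIOD>",
     "subsets": ["{XS}", "{S}", "{M}", "{L}", "{XL}"]},
    {"sid": "POD0000000001_S0000002", "begin_time": 4.2, "end_time": 9.0,
     "text_tn": "TODAY WE TALK ABOUT SPEECH <PERIOD>",
     "subsets": ["{L}", "{XL}"]}
  ]}
]}
""")


def segments_in_subset(meta, subset):
    """Yield (audio id, segment) pairs whose tag list includes the subset."""
    tag = "{" + subset + "}"
    for audio in meta["audios"]:
        for seg in audio["segments"]:
            if tag in seg["subsets"]:
                yield audio["aid"], seg


def subset_hours(meta, subset):
    """Total duration of a subset in hours, summed from segment boundaries."""
    seconds = sum(seg["end_time"] - seg["begin_time"]
                  for _, seg in segments_in_subset(meta, subset))
    return seconds / 3600.0


if __name__ == "__main__":
    for name in ("XS", "L", "XL"):
        count = len(list(segments_in_subset(SAMPLE, name)))
        print(name, count, "segments,", round(subset_hours(SAMPLE, name) * 3600, 1), "s")
```

In this sample every XS segment is also tagged with the larger subsets, which is how the smaller subsets are assumed to nest inside XL.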

Maintainer

SpeechColab

License

Apache 2.0

Formats

OPUS · JSON

Paper

Read on arxiv.org →

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

85 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 88

No public completeness metric; using prior for 'crowdsourced_qc' datasets.

Uniqueness 93

Exact-hash deduplication documented by maintainer.

Validation 82

Crowdsourced labels with quality-control protocol (redundancy, golden tests).

Size adequacy 96

8,300,000 segments — exceeds 100,000 adequacy target for Audio.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.

Label density 52

Average 1.0 labels per item (sparse).

Class balance 75

Moderate class skew — realistic production distribution.
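The seven dimension scores above can be combined in many ways, and this page does not state the weights behind the composite. The sketch below simply takes an unweighted mean as a point of comparison; the equal weighting is an assumption for illustration, not the published LQS formula.

```python
# Dimension scores as listed on this page.
DIMENSIONS = {
    "completeness": 88,
    "uniqueness": 93,
    "validation": 82,
    "size_adequacy": 96,
    "format_compliance": 95,
    "label_density": 52,
    "class_balance": 75,
}

PUBLISHED_COMPOSITE = 85  # composite LQS shown for GigaSpeech


def unweighted_mean(scores):
    """Equal-weight average of the dimension scores (an assumed baseline)."""
    return sum(scores.values()) / len(scores)


def weakest_dimension(scores):
    """Name of the lowest-scoring dimension."""
    return min(scores, key=scores.get)


if __name__ == "__main__":
    mean = unweighted_mean(DIMENSIONS)
    print("unweighted mean:", round(mean, 1))
    print("gap to published composite:", round(PUBLISHED_COMPOSITE - mean, 1))
    print("weakest dimension:", weakest_dimension(DIMENSIONS))
```

The equal-weight mean comes out two points below the published 85, which suggests the real composite weights the dimensions unevenly, most likely giving label density less influence.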

What it's used for

Common tasks and benchmarks where GigaSpeech is the default or competitive choice.

Large-scale ASR
Multi-domain speech recognition
Punctuation restoration
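For punctuation restoration, the transcripts can serve as both the unpunctuated input and the punctuated target. The sketch below assumes punctuation is stored as inline tags (`<COMMA>`, `<PERIOD>`, `<QUESTIONMARK>`, `<EXCLAMATIONPOINT>`) and that non-speech events use tags such as `<SIL>`, `<MUSIC>`, `<NOISE>`, and `<OTHER>`. Treat the tag inventory as an assumption and confirm it against the maintainer's documentation.

```python
# Assumed tag inventory; verify against the dataset documentation.
PUNCT = {
    "<COMMA>": ",",
    "<PERIOD>": ".",
    "<QUESTIONMARK>": "?",
    "<EXCLAMATIONPOINT>": "!",
}
NON_SPEECH = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}


def strip_tags(text):
    """Remove all tags, leaving bare words (typical ASR training target)."""
    words = [t for t in text.split() if t not in PUNCT and t not in NON_SPEECH]
    return " ".join(words)


def restore_punct(text):
    """Replace punctuation tags with symbols attached to the preceding word."""
    out = []
    for tok in text.split():
        if tok in NON_SPEECH:
            continue
        if tok in PUNCT:
            if out:  # ignore a punctuation tag with no preceding word
                out[-1] += PUNCT[tok]
        else:
            out.append(tok)
    return " ".join(out)


if __name__ == "__main__":
    raw = "WELL <COMMA> IS THIS GIGASPEECH <QUESTIONMARK>"
    print(strip_tags(raw))
    print(restore_punct(raw))
```

Pairing `strip_tags` output with `restore_punct` output gives one input/target example per segment.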

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

10,000 hours (8.3M segments) drawn from audiobooks (33%), podcasts (17%), and YouTube (50%). Available as XL, L, M, S, and XS subsets. Audio is sampled at 16 kHz.
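A few derived figures follow directly from the numbers on this page. The sketch below works out the average segment length, the hours per source, and the storage per hour; the last one assumes the 760 GB figure uses decimal gigabytes.

```python
TOTAL_HOURS = 10_000
TOTAL_SEGMENTS = 8_300_000
TOTAL_GB = 760  # assumed decimal GB (1 GB = 1000 MB)
SOURCE_SHARE_PCT = {"audiobooks": 33, "podcasts": 17, "youtube": 50}


def avg_segment_seconds(hours, segments):
    """Mean segment duration in seconds."""
    return hours * 3600 / segments


def hours_by_source(hours, share_pct):
    """Split total hours by the percentage share of each source."""
    return {name: hours * pct / 100 for name, pct in share_pct.items()}


def mb_per_hour(total_gb, hours):
    """Average storage per hour of audio, in MB."""
    return total_gb * 1000 / hours


if __name__ == "__main__":
    print("avg segment (s):", round(avg_segment_seconds(TOTAL_HOURS, TOTAL_SEGMENTS), 2))
    print("hours by source:", hours_by_source(TOTAL_HOURS, SOURCE_SHARE_PCT))
    print("MB per hour:", mb_per_hour(TOTAL_GB, TOTAL_HOURS))
```

The average segment works out to a little over four seconds, which is useful when sizing batches or estimating decode time.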

License

GigaSpeech is distributed under Apache 2.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Frequently Asked Questions

Can I use GigaSpeech commercially?

GigaSpeech is distributed under Apache 2.0, which generally permits commercial use. Always verify the current license terms with the maintainer (SpeechColab) before using it in a commercial product.

How large is GigaSpeech?

GigaSpeech contains 8,300,000 speech segments totaling 10,000 hours, drawn from audiobooks (33%), podcasts (17%), and YouTube (50%). It is available as XL, L, M, S, and XS subsets, with audio sampled at 16 kHz.

Who maintains GigaSpeech, and where is it available?

GigaSpeech is maintained by SpeechColab and is available at https://github.com/SpeechColab/GigaSpeech. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.
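The tier thresholds quoted above translate into a simple lookup. This is a minimal sketch of that mapping; the function name `lqs_tier` is hypothetical and not part of any LabelSets API.

```python
def lqs_tier(score):
    """Map a composite LQS score (0-100) to its tier name."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


if __name__ == "__main__":
    # GigaSpeech's composite score on this page is 85.
    print(lqs_tier(85))
```

With a composite of 85, GigaSpeech sits in the gold tier, five points short of platinum.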
