What audio formats are supported on LabelSets?

WAV, MP3, and FLAC audio files with accompanying transcripts in TXT, JSON, or CSV format. The platform validates audio file integrity and transcript alignment.

What types of speech datasets are available?

ASR transcription datasets, speaker diarization data, emotion and sentiment in speech, keyword spotting sets, accent and dialect collections, call center recordings, and voice command datasets.

Can I sell recordings I collected for speech AI?

Yes, if you have consent from speakers and the data has been properly anonymized where required. Upload your audio with transcripts, pass verification, and earn 85% of every sale.

Audio & Speech Datasets for AI Training

Featured datasets

Voice corpora with verifiable provenance.

Live marketplace listings filtered to audio. Every card shows a signed LQS score, a consent attestation, and an originality signal.

Tasks covered

Every voice-AI task on the audit checklist.

Speech recognition (ASR)

Transcribed speech across accents, languages, and environments. Clean and noisy conditions for robust model training.

transcripts · TXT · JSON · CSV

Speaker diarization

Multi-speaker recordings with speaker-turn labels for meeting transcription, call-center AI, and podcast indexing.

labels · speaker IDs · timestamps

Emotion in speech

Clips labeled with emotional state — anger, happiness, sadness, neutral — for sentiment-aware voice assistants and call-center AI.

schemes · 4-class · 6-class · arousal/valence

Keyword & wake-word

Short clips for wake-word detection and keyword spotting, labeled with target and non-target classes across ambient conditions.

clip len · 250ms–2s

Accent & dialect

Regional accent collections for improving ASR robustness across English dialects and non-native speech.

coverage · 50+ accents

Music & environmental

Music genre classification, instrument recognition, and environmental sound detection datasets with timestamp-aligned tags.

taxonomy · AudioSet-aligned

WAV · PCM

MP3 · encoded

FLAC · lossless

TXT · transcript

JSON · aligned

CSV · metadata

What LabelSets adds

Cert fields designed for voice-AI compliance.

Voice data is among the most heavily regulated categories under the EU AI Act and GDPR. Every audio cert from LabelSets carries the consent and provenance fields your privacy-review team needs.

Speaker-consent attestation

Sellers attest to GDPR Art. 7 consent basis on every upload — with method, jurisdiction, and retention policy captured in the signed cert.

field · consent_basis · jurisdiction

Transcript-level PII scan

Transcripts run through entity-recognition and regex PII detection before publication. Every audio dataset carries automated flags plus a seller attestation.

field · pii_scanned · timestamp

Ed25519-signed provenance

Every cert carries a public-key signature + fingerprint. Buyers verify at /verify any time. Revocation registry handles post-facto consent revocations.

fingerprint · aa4c070af907e2ea
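The transcript-level PII scan described above combines entity recognition with regex detection. As a rough illustration of the regex half only, the sketch below flags email addresses and phone-like numbers in a transcript. The patterns, the `scan_transcript` function, and the shape of the returned record are assumptions made for this example, not the LabelSets pipeline; only the `pii_scanned` and `timestamp` field names come from the cert description.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns for illustration; a real scan would cover many more entity types.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scan_transcript(text):
    """Return regex PII hits together with cert-style scan fields."""
    flags = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            flags.append({"type": kind, "span": match.span(), "text": match.group()})
    return {
        "pii_scanned": True,  # field name from the cert; value semantics assumed
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "flags": flags,
    }

result = scan_transcript("Call me at +1 415 555 0100 or mail jane.doe@example.com.")
flag_types = sorted(f["type"] for f in result["flags"])
```

A scan like this only produces flags for review. It would sit alongside the entity-recognition pass and the seller attestation rather than replace them.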

FAQ

Questions privacy teams actually ask.

What audio formats are supported on LabelSets?

WAV (PCM), MP3, and FLAC audio files with accompanying transcripts in TXT, JSON, or CSV. The validation pipeline checks audio-file integrity and flags corrupted or suspiciously short clips before publication.
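To make the integrity and short-clip checks concrete, here is a minimal sketch for WAV files using only Python's standard `wave` module. The `check_wav` function, its return shape, and the 0.25-second threshold (borrowed from the 250 ms lower bound listed for keyword clips) are assumptions for illustration; the actual validation pipeline is not documented here, and MP3 or FLAC would need a third-party decoder.

```python
import io
import wave

MIN_CLIP_SECONDS = 0.25  # assumed threshold, not a documented LabelSets value

def check_wav(data: bytes, min_seconds: float = MIN_CLIP_SECONDS) -> dict:
    """Report whether WAV bytes parse as PCM audio and whether the clip looks too short."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
    except (wave.Error, EOFError):
        # Header could not be parsed: treat the file as corrupted.
        return {"ok": False, "reason": "corrupted", "seconds": None}
    seconds = frames / rate if rate else 0.0
    if seconds < min_seconds:
        return {"ok": False, "reason": "too_short", "seconds": seconds}
    return {"ok": True, "reason": None, "seconds": seconds}

def make_silence(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV in memory, used only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

This only validates the header and the declared duration. A stricter check would also decode every frame to catch truncated payloads.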

What types of speech datasets are available?

ASR transcription, speaker diarization, emotion and sentiment in speech, keyword spotting, accent and dialect collections, call-center recordings, and voice-command datasets. Music genre and environmental-sound datasets are also available under the audio category.

Can I sell recordings I collected for speech AI?

Yes, provided you have speaker consent under GDPR Art. 7 (or an equivalent local basis) and have properly anonymized identifying information where required. Upload your audio with transcripts, attest to the consent basis, pass verification, and earn 85% of every sale.

Language availability depends on what sellers have uploaded. Use the search bar on the browse page to search for a specific language or accent. You can also post a request on the Dataset Requests board and qualified sellers will be notified.

Yes. The audio cert includes consent_basis, pii_scanned, subgroup-equity metrics, and per-dimension confidence intervals — mapped directly into Article 10 data-governance documentation for high-risk AI systems.
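Before copying cert values into governance documentation, a buyer might first confirm that the documented fields are actually present. The sketch below assumes the cert has already been parsed into a Python dict. The field names `consent_basis`, `jurisdiction`, `pii_scanned`, and `timestamp` come from this page, while the `missing_cert_fields` helper and the example payload are hypothetical and the real cert schema may differ.

```python
# Field names taken from this page; the cert's real schema may differ.
DOCUMENTED_FIELDS = ("consent_basis", "jurisdiction", "pii_scanned", "timestamp")

def missing_cert_fields(cert: dict) -> list:
    """List documented cert fields that are absent or empty in a parsed cert."""
    return [name for name in DOCUMENTED_FIELDS if cert.get(name) in (None, "")]

# Hypothetical cert payload; every value here is invented for the example.
example_cert = {
    "consent_basis": "GDPR Art. 7",
    "jurisdiction": "EU",
    "pii_scanned": True,
    "timestamp": "2025-01-01T00:00:00Z",
}

gaps = missing_cert_fields(example_cert)
```

An empty result means every documented field carried a value. It says nothing about whether the signature itself is valid, which is what the /verify check is for.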

Browse all audio & speech datasets.

Live marketplace filtered by LQS score, consent attestation, language, and format. Or list your own labeled audio and start earning.

Browse datasets → · Sell your dataset

Related categories

NLP / Text · Computer Vision · Medical Imaging · Autonomous Vehicles · Financial Data

Speech corpora with speaker consent on the cert.
