6.8M+ English articles — the most-used clean text corpus for pretraining and retrieval.
The English Wikipedia XML dump is the single most widely-used clean text corpus in NLP. Updated monthly by the Wikimedia Foundation, it provides 6.8M+ articles with structured metadata (links, categories, infoboxes) and is a canonical input for pretraining LLMs, retrieval-augmented generation, and entity linking.
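As a rough illustration of working with the dump, the sketch below streams articles out of the compressed XML without loading the multi-gigabyte file into memory. It assumes the standard pages-articles dump from dumps.wikimedia.org; the local filename is illustrative, and libraries such as mwxml offer a more robust parser for production use.

```python
# Minimal sketch: stream (title, wikitext) pairs from an English
# Wikipedia pages-articles dump. The filename below is an assumption;
# download the current dump from dumps.wikimedia.org/enwiki/.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # assumed local path

def iter_articles(path):
    """Yield (title, wikitext) for main-namespace (ns=0) articles."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):  # fires on element end
            if elem.tag.endswith("}page"):
                ns = elem.find("./{*}ns")
                title = elem.find("./{*}title")
                text = elem.find("./{*}revision/{*}text")
                if ns is not None and ns.text == "0" and text is not None:
                    yield title.text, text.text or ""
                elem.clear()  # drop the page's children to bound memory

for title, wikitext in iter_articles(DUMP):
    print(title, len(wikitext))
    break  # demonstrate on the first article only
```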
LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
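The exact LQS weighting is defined by the methodology linked above; as a rough sketch of how such a composite is formed, the example below takes a weighted mean of per-dimension scores in [0, 1] and scales it to 0–100. The equal weights and the scaling are assumptions, not the published formula.

```python
# Illustrative composite over the 7 dimensions listed above.
# Equal weighting and 0-100 scaling are assumptions; the real LQS
# methodology is documented by LabelSets.
DIMENSIONS = [
    "completeness", "uniqueness", "validation_health", "size_adequacy",
    "format_compliance", "label_density", "class_balance",
]

def composite_score(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-dimension scores in [0, 1], scaled to 0-100."""
    weights = weights or {d: 1 / len(DIMENSIONS) for d in DIMENSIONS}
    total = sum(weights[d] * scores[d] for d in DIMENSIONS)
    return round(100 * total / sum(weights.values()), 1)

print(composite_score({d: 0.9 for d in DIMENSIONS}))  # -> 90.0
```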
Common tasks and benchmarks where Wikipedia (English Dump) is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
Wikipedia (English Dump) is distributed under CC BY-SA 4.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid NLP / Text datasets that provide what public datasets often can't:
Other entries in the NLP / Text catalog.