OSCAR

Multilingual web corpus spanning 166 languages, extracted from Common Crawl.

LQS 80 · gold · Commercial OK · 431M documents · 8.9 TB · JSONL · Released 2019

Source: oscar-project.org · maintained by Inria ALMAnaCH

About this dataset

OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a multilingual web corpus maintained by Inria's ALMAnaCH team. Built from Common Crawl through language identification and cleaning, it spans 166 languages and serves as a primary multilingual pretraining source. The latest release contains 8.9 TB of cleaned text.

Maintainer

Inria ALMAnaCH

License

CC0 1.0

Formats

JSONL

Paper

Read on aclanthology.org →

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

80 out of 100

gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 72

No public completeness metric; using prior for 'web_scrape' datasets.

Uniqueness 85

Near-duplicate filtering (MinHash / LSH / SimHash).

Validation 70

Unlabeled corpus — validation limited to format integrity.

Size adequacy 100

431,000,000 documents — exceeds 100,000 adequacy target for NLP / Text.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.
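
As a rough illustration of what drop-in compatibility means for JSONL, the sketch below streams records one line at a time using only Python's standard library. The file path and the `content` field name are assumptions made for illustration, not confirmed details of the OSCAR schema; check the maintainer's documentation for the actual record layout.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_characters(path: str, text_field: str = "content") -> int:
    """Sum the length of the text field across all records.

    `text_field` is a hypothetical default, not a confirmed OSCAR field name.
    Records that lack the field contribute zero characters.
    """
    return sum(len(record.get(text_field, "")) for record in iter_jsonl(path))
```

Because the file is read lazily, memory use stays flat regardless of shard size, which matters for a corpus measured in terabytes.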

Label density 0

Unlabeled corpus — label density not applicable.

Class balance 60

Unlabeled corpus — class balance not applicable.

What it's used for

Common tasks and benchmarks where OSCAR is the default or competitive choice.

Multilingual LLM pretraining
Low-resource language modeling
Language identification

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

v23.01 covers 166 languages, with 8.9 TB of cleaned text across 431M documents. Strongest languages: English, Russian, Chinese, Spanish, German, and French.
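
For a sense of scale, the published totals imply an average document size of roughly 20.6 kB. The quick calculation below assumes decimal terabytes (10^12 bytes) and that the 8.9 TB figure covers the same 431M documents; neither assumption is stated by the maintainer.

```python
# Published totals from the listing; the decimal-TB interpretation is an assumption.
TOTAL_BYTES = 8.9e12
DOCUMENTS = 431_000_000

avg_bytes = TOTAL_BYTES / DOCUMENTS
print(f"{avg_bytes / 1000:.1f} kB per document on average")
```

This is a corpus-wide mean only; per-language averages are likely to differ.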

License

OSCAR is distributed under CC0 1.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with what public datasets often can't give you:

Explicit commercial license in writing
LQS-verified quality for your specific use case
Instant download — no DUA, credentialed access, or research gating
PII scanned, deduplicated, and production-ready

Frequently Asked Questions

OSCAR is distributed under CC0 1.0, which generally permits commercial use. Always verify the current license terms with the maintainer (Inria ALMAnaCH) before using in a commercial product.

OSCAR contains 431,000,000 documents. The v23.01 release covers 166 languages and 8.9 TB of cleaned text. Strongest languages: English, Russian, Chinese, Spanish, German, and French.

OSCAR is maintained by Inria ALMAnaCH and is available at https://oscar-project.org/post/oscar-v23-01/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.
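
The tier thresholds above can be sketched as a small lookup. The function name `lqs_tier` is hypothetical, and the sketch covers only the score-to-tier mapping; how the seven dimensions are weighted into the composite is not described here and is not modeled.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score (0-100) to its tier name.

    Thresholds follow the listing: platinum (>=90), gold (>=75),
    silver (>=60), bronze (<60).
    """
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


# OSCAR's composite score of 80 falls in the gold tier.
print(lqs_tier(80))
```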
