# FineWeb-Edu — A procurement-grade audit of HuggingFace's flagship 1.3T-token corpus
Composite 73 / 100. Silver tier. World-class documentation. A circular LLM-as-judge dependency we think procurement teams should understand. An ODC-By attribution gap that almost every commercial user violates. Open methodology, signed result, recourse process documented.
pretraining profile: 78 · instruction-tuning profile: 61
## What we audited
| Field | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu |
| Size | 1.3 trillion tokens (also 5.4T variant scored separately as fineweb-edu-score-2) |
| Modality | Text (English), pretraining corpus |
| License | ODC-By 1.0 — open with attribution requirement |
| Source | Common Crawl WARC dumps, filtered through an educational-quality classifier |
| Maintainer | HuggingFace Science (Loubna Ben Allal et al.) |
| Paper | arXiv:2406.17557 |
| DOI | 10.57967/hf/2497 |
| Distribution format | Parquet with explicit schema (11 columns including score, int_score, language_score) |
| HF downloads (May 2026) | 572,057 · 1,069 likes |
## The headline finding
FineWeb-Edu is one of the best-documented open pretraining corpora in existence. The HF Science team's datasheet, ablations, and educational-classifier description set a bar that almost no other web-scale corpus meets. Documentation alone earns a 96.
The two issues an enterprise procurement team should know about are not quality issues. They are provenance issues.
- The quality classifier is not independent of the data it scores. FineWeb-Edu's "educational value" classifier was trained on annotations generated by Llama-3-70B-Instruct — a model that was itself trained on overlapping Common Crawl content. The audit framework treats this as a circular dependency: the quality oracle and the rated asset share an unobserved common parent. The paper discloses this openly; the flaw, in our view, is one of independence, not honesty. It scores 55 on classifier independence because the chain is real, not because anyone is hiding it.
- ODC-By 1.0 is "open" but requires attribution that almost no downstream model card honors. Every commercial model trained on FineWeb-Edu inherits an attribution obligation back to HuggingFace and the Common Crawl Foundation. We surveyed 40 model cards published in the last six months that disclose FineWeb-Edu in their training mix. Seven include an attribution line. The license is open; the practice around it is not. License-clarity scores 70 — the document is fine, the ecosystem behavior is the issue.
Why this audit exists. Every model card that says "trained on FineWeb-Edu" inherits the provenance, license, and PII surface of the underlying corpus. Procurement teams and model-risk reviewers cannot evaluate those surfaces from a README. The LQS framework standardizes the questions and the answers so that "FineWeb-Edu = 73 Silver" means the same thing to a buyer at Bank A as to a buyer at Bank B.
## Dimension-by-dimension reasoning
### Documentation — 96
Dataset card, datasheet, ablation tables, classifier description, sample configurations (10BT / 100BT / 350BT), and a published paper with detailed methodology. The paper documents what was kept, what was filtered, and the educational-classifier label distribution. Comparable only to the Pile v1 paper and the C4 datasheet in completeness.
### Format compliance — 95
Parquet with explicit schema: 11 columns, every column dtype declared. Multiple sampling configurations exposed (sample-10BT, sample-100BT, sample-350BT, default). MLCroissant metadata published. Loads cleanly via datasets, polars, dask. The deduction is for the absence of a Bloom-filter cross-shard dedup attestation.
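A schema guarantee like this can be checked mechanically at ingest time. The sketch below validates only the three columns the dataset card names (`score`, `int_score`, `language_score`); the remaining column names are not reproduced here, and the validator itself is a hypothetical illustration, not part of the FineWeb-Edu tooling.

```python
# Hypothetical per-row schema check. Only the three columns named in
# the dataset card are validated; the real Parquet schema declares 11.
EXPECTED_DTYPES = {
    "score": float,          # raw educational-classifier score
    "int_score": int,        # rounded score used for the >= 3 cutoff
    "language_score": float, # language-ID confidence
}

def validate_row(row: dict) -> list[str]:
    """Return a list of schema violations for one row (empty = valid)."""
    problems = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], dtype):
            problems.append(f"{col}: expected {dtype.__name__}, "
                            f"got {type(row[col]).__name__}")
    return problems

row = {"score": 3.2, "int_score": 3, "language_score": 0.98}
print(validate_row(row))  # → []
```

Running a pass like this over a sample shard is a cheap way for a procurement team to confirm the declared schema matches the delivered bytes.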
### Size adequacy — 100
1.3T tokens at the high-quality cutoff (int_score ≥ 3); 5.4T at the broader cutoff. Adequate for end-to-end pretraining of any model up to roughly 30B parameters at Chinchilla-optimal data ratios.
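The Chinchilla arithmetic behind that adequacy claim is simple to sketch. The widely cited approximation is about 20 training tokens per parameter; the exact compute-optimal ratio depends on which scaling-law fit you use, so treat the constant below as an assumption.

```python
# Rough Chinchilla-style arithmetic: ~20 training tokens per parameter.
# The exact optimal ratio varies with the scaling-law fit used.
TOKENS_PER_PARAM = 20

def chinchilla_tokens(n_params: float) -> float:
    """Approximate token budget for compute-optimal pretraining."""
    return TOKENS_PER_PARAM * n_params

corpus_tokens = 1.3e12  # FineWeb-Edu at the int_score >= 3 cutoff
for params in (7e9, 30e9, 70e9):
    need = chinchilla_tokens(params)
    fits = "fits" if need <= corpus_tokens else "needs more data"
    print(f"{params / 1e9:.0f}B params -> {need / 1e12:.1f}T tokens ({fits})")
```

At 30B parameters the budget is 0.6T tokens, comfortably inside 1.3T with headroom; at 70B the budget (1.4T) already exceeds the high-quality cutoff, which is why the "roughly 30B" ceiling in the audit is conservative rather than tight.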
### Reproducibility — 88
The filtering classifier weights are public. The trained-on annotations (Llama-3-70B-Instruct outputs) are not redistributable, so the exact training corpus for the classifier cannot be reproduced. Everything downstream of the classifier weights is fully reproducible. Above average for a 1T+ token corpus; not best-in-class because the classifier-training step has a closed-model dependency.
### Provenance chain — 62
Common Crawl → FineWeb → educational-classifier filter → FineWeb-Edu. Each hop is documented; together they form a chain whose root (Common Crawl) inherits the entire open web with all its provenance gaps. We score the chain documentation as strong but the upstream surface as inherently limited. This is the dimension where any web-scale corpus is going to land in the 50–65 range; it's not unique to FineWeb-Edu.
### Classifier independence — 55
The educational-value classifier was trained on Llama-3-70B-Instruct annotations. Llama-3 was trained on a corpus that overlaps Common Crawl. The dependency is not simply: classifier → rates data. It is: (model trained on data) → annotates samples of data → classifier trained on those annotations → classifier filters data. The classifier's idea of "educational" is partially inherited from Llama-3's idea of "educational" — which is itself a function of the corpus the classifier is now filtering. This isn't fatal; it's a real consideration for procurement teams who want to know whether the quality signal is independent of the underlying distribution.
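The circularity can be made concrete as a toy provenance graph and a common-ancestor check. Node names below are illustrative, not identifiers from the LQS tooling; the point is that the quality oracle and the asset it rates share an upstream parent.

```python
# Toy provenance graph. An edge A -> B means "B was derived from /
# trained on A". Node names are illustrative.
EDGES = {
    "common_crawl": ["fineweb", "llama3_training_mix"],
    "fineweb": ["fineweb_edu"],
    "llama3_training_mix": ["llama3"],
    "llama3": ["edu_annotations"],
    "edu_annotations": ["edu_classifier"],
    "edu_classifier": ["fineweb_edu"],  # the filter step closes the loop
}

def ancestors(node: str) -> set[str]:
    """All upstream nodes of `node` in the provenance graph."""
    parents = [src for src, dsts in EDGES.items() if node in dsts]
    out = set(parents)
    for p in parents:
        out |= ancestors(p)
    return out

def share_ancestor(a: str, b: str) -> set[str]:
    """Common upstream nodes of `a` and `b` (non-empty = shared parent)."""
    return ancestors(a) & ancestors(b)

print(sorted(share_ancestor("edu_classifier", "fineweb_edu")))
# → ['common_crawl', 'edu_annotations', 'llama3', 'llama3_training_mix']
```

An independence audit amounts to asking whether this intersection is empty; here it is not, which is exactly the 55-score finding.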
### License clarity — 70
ODC-By 1.0 is well-known and unambiguously open. The deduction is not about the license; it's about the attribution chain downstream of it. ODC-By requires attribution to data providers; in a model trained on FineWeb-Edu, that means attribution to HuggingFace and to Common Crawl. A survey of 40 recent model cards finds 7 with explicit attribution. The license is fine; the practice around it is procurement-relevant because non-compliance is the buyer's exposure, not the seller's.
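An attribution scan of the kind used in such a survey is easy to sketch. The exact survey criteria are not published here, so the patterns below are an assumption about form, not a reproduction of the method; naming both the dataset and its upstream source is treated as the minimum signal of attribution.

```python
import re

# Strings whose presence we treat as evidence of attribution.
# Illustrative patterns only; ODC-By compliance in full requires a
# proper notice, not just a mention.
ATTRIBUTION_PATTERNS = [
    r"fineweb[- ]edu",   # names the dataset
    r"common ?crawl",    # names the upstream source
]

def has_attribution(model_card: str) -> bool:
    """True if the card names both the dataset and its upstream source."""
    text = model_card.lower()
    return all(re.search(p, text) for p in ATTRIBUTION_PATTERNS)

card = "Pretrained on FineWeb-Edu (HuggingFace), derived from Common Crawl."
print(has_attribution(card))  # → True
```

By this weak test, 7 of the 40 surveyed cards pass; a stricter check against the actual ODC-By notice requirements would pass fewer.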
### Copyright surface — 48
FineWeb-Edu is filtered from web pages crawled by Common Crawl. Common Crawl operates under fair-use claims for indexing; that legal posture is not a copyright waiver for derivative training. The fact that FineWeb-Edu filters down to "educational" content does not change the underlying copyright status of the source pages. For commercial pretraining, this is the largest open legal question and an active subject of litigation. The score is low because the surface is real, not because the maintainers did anything wrong.
### PII residual risk — 58
The risk surface is inherited from the Common Crawl base. FineWeb (the parent corpus) applies URL filtering and language-ID filtering but no published PII scrubber. The educational filter further restricts the distribution but does not specifically target PII. No PII audit results are published. For procurement profiles touching healthcare, financial, or EU-jurisdiction data, this surface needs an additional scrub layer.
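The shape of such a scrub layer can be sketched with a minimal regex pass. This is illustrative only: a production scrubber would add NER models, jurisdiction-specific patterns (phone, IBAN, SSN), and measured recall, because regexes alone have a high false-negative rate.

```python
import re

# Minimal regex PII pass -- illustrative only. A production scrub layer
# would add NER, per-jurisdiction patterns, and recall testing.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Map PII category -> matches found in `text` (empty dict = clean)."""
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items()
            if pat.findall(text)}

sample = "Contact jane.doe@example.org from host 192.168.0.1 for the syllabus."
print(find_pii(sample))
# → {'email': ['jane.doe@example.org'], 'ipv4': ['192.168.0.1']}
```

The audit's point is precisely that no pass of this kind, weak or strong, is documented for FineWeb-Edu; the buyer has to supply it.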
### Contamination disclosure — 45
No benchmark-contamination analysis was published at release time. Independent reproductions have found measurable contamination with MMLU, HellaSwag, and ARC test sets, in line with what's expected from any Common-Crawl-derived corpus. This is one of the two dimensions v3.1 buyers will care about most: a model trained on FineWeb-Edu cannot honestly claim a clean evaluation on most common reasoning benchmarks without a separate decontamination pass.
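A decontamination pass typically means n-gram overlap checks against benchmark test sets. The sketch below uses whitespace tokenization and 5-grams to stay readable; real pipelines normalize text, hash the n-grams, and use larger windows (13-grams are a common choice), so every parameter here is a simplifying assumption.

```python
# Minimal n-gram contamination check. Real pipelines normalize text,
# hash the n-grams, and use larger n (e.g. 13); n=5 keeps the toy
# example readable.
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc: str, benchmark_items: list[str], n: int = 5) -> bool:
    """Flag `doc` if it shares any n-gram with any benchmark item."""
    bench: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        bench |= ngrams(item, n)
    return bool(ngrams(doc, n) & bench)

benchmark = ["What is the capital of France? Paris is the capital of France."]
doc = "Quiz answer: Paris is the capital of France, as every student knows."
print(is_contaminated(doc, benchmark))  # → True
```

Running a pass like this over the training shards, against the exact test splits of the benchmarks you intend to report, is the "separate decontamination pass" the score refers to.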
### Subgroup coverage — 42
The educational filter systematically narrows the distribution toward Wikipedia, textbook, and academic-blog-style English. Casual register, conversational text, code-switched bilingual content, non-Western dialects, and AAVE are filtered out at much higher rates than literary or academic English. For a pretraining corpus where downstream coverage of underrepresented dialects matters, this is the largest single concern. The filter was deliberate; the consequence for distributional coverage is real.
### Maintainer reputation — 94
HuggingFace Science is among the most credible open-data maintainers in the field. Consistent publication record, responsive to community findings, transparent about ablations and limitations. The Loubna Ben Allal / Anton Lozhkov authorship line carries weight. No deduction for the team; only for the absence of an external audit chain attached to the release (an opportunity, not a fault).
## Procurement profile — what this means for buyers
- For pretraining a general-purpose foundation model: 78. Good fit. Documentation gives your model-risk team something to attach to the model card. Contamination and copyright surfaces require additional layers (decontamination pass, fair-use posture from counsel).
- For instruction-tuning or SFT: 61. Marginal fit. FineWeb-Edu is a pretraining corpus, not an SFT corpus. Using it for instruction-tuning is a category error that some teams make to chase scale.
- For healthcare or financial domain models under SR 11-7 / 21 CFR 11 / GDPR Art. 6: Do not use as-is. PII residual risk, copyright surface, and contamination disclosure are all below the 70 threshold our procurement-grade lens requires. A scrubbed-and-decontaminated derivative might pass.
- For research / academic / non-commercial: 86. Excellent fit. License is fully compatible, documentation gives you everything you need, the open methodology is itself a teaching artifact.
## Methodology
This audit was scored under LQS v3.1 with the dataset-signals adapter for public corpora. Every dimension above maps to a documented rubric in our methodology paper. Scores are computed deterministically from a signals JSON that we are publishing alongside this report.
```shell
# Reproduce this audit locally:
git clone https://github.com/labelsets/lqs-public
cd lqs-public
node scorer.js signals/fineweb-edu.json
# → { composite: 73, tier: "silver", dims: { ... } }
```
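The shape of that deterministic computation can be sketched in a few lines. The real LQS v3.1 weights and tier cutoffs live in scorer.js and are not reproduced here, so this equal-weight Python sketch with hypothetical tier thresholds will not reproduce the published 73; it illustrates the signals-to-composite step only.

```python
# Equal-weight sketch of a signals -> composite computation. The real
# LQS v3.1 weights and tier cutoffs live in scorer.js; the values here
# are illustrative assumptions.
TIERS = [(85, "gold"), (70, "silver"), (50, "bronze")]  # hypothetical cutoffs

def composite(dims: dict[str, int]) -> tuple[int, str]:
    """Unweighted mean of dimension scores, mapped to a tier."""
    score = round(sum(dims.values()) / len(dims))
    tier = next((name for cutoff, name in TIERS if score >= cutoff), "none")
    return score, tier

dims = {"documentation": 96, "format": 95, "size": 100, "reproducibility": 88,
        "provenance": 62, "independence": 55, "license": 70, "copyright": 48,
        "pii": 58, "contamination": 45, "coverage": 42, "maintainer": 94}
print(composite(dims))
# → (71, 'silver') under equal weights; the published 73 uses the real
#   v3.1 weighting, which this sketch does not attempt to reproduce.
```

The determinism is the point: given the same signals JSON and the same weights, any party can recompute the score bit-for-bit.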
The 7-oracle consensus pass (used for the marketplace cert layer) was not run for this report because we do not have line-item access to the underlying records — we audited the corpus from its public metadata, paper, and the classifier-output distribution. For any downstream model trainer who wants the full v3.1 cert with oracle consensus, the HF team can contact us to run the full pipeline against a snapshot.
Recourse. If you maintain FineWeb-Edu and believe any score here is wrong, the recourse process is documented in the LQS v3.1 paper §7. In short: file an issue at the public-audit repo with a counter-citation, and we will publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable hash; corrections are issued as new versions, not silent edits.
## What this audit doesn't claim
- It does not claim FineWeb-Edu is bad. Silver tier (73) on the LQS scale is "fit for procurement with documented caveats." The mean LQS for open pretraining corpora is around 64. FineWeb-Edu is above the field, not below it.
- It does not claim the HF Science team did anything wrong. Every weakness flagged above is documented openly in their own paper. The audit's job is to translate the existing documentation into a form a procurement team can act on.
- It does not predict downstream model performance. LQS scores procurement fitness, not model quality. A model trained on FineWeb-Edu can be excellent even though the corpus has a 45 contamination-disclosure score; the contamination concern is about benchmark trust, not capability.
- It does not waive copyright or commercial risk. Any commercial use of FineWeb-Edu requires its own legal review. This audit is technical fitness only.
## What's next
This is Report 001 in a public-audit series. Reports planned over the next 90 days:
- Report 002 — RedPajama-V2. 30T tokens. Three-way provenance chain comparison with FineWeb-Edu and The Pile.
- Report 003 — The Pile (EleutherAI). Foundational. Known Books3 / copyright surface. The most-litigated open pretraining corpus.
- Report 004 — Dolma (AI2). Most recent open release at scale with a published license analysis.
- Report 005 — A medical imaging corpus to be selected. The first audit under the FDA 21 CFR 11 procurement lens.
Want the next report when it lands?
One email per audit. No marketing. Methodology updates included.