Public LQS Audit · Report 001

FineWeb-Edu — A procurement-grade audit of HuggingFace's flagship 1.3T-token corpus.

Composite 73 / 100. Silver tier. World-class documentation. A circular LLM-as-judge dependency we think procurement teams should understand. An ODC-By attribution obligation that almost every commercial user ignores. Open methodology, signed result, recourse process documented.

Published May 13, 2026 · LabelSets Research · 9 min read
73 / 100 · Silver
LQS v3.1 composite · default profile
pretraining profile: 78 · instruction-tuning profile: 61

Documentation — 96
Format compliance — 95
Size adequacy — 100
Reproducibility — 88
Provenance chain — 62
Classifier independence — 55
License clarity — 70
Copyright surface — 48
PII residual risk — 58
Contamination disclosure — 45
Subgroup coverage — 42
Maintainer reputation — 94

What we audited

Dataset: HuggingFaceFW/fineweb-edu
Size: 1.3 trillion tokens (also a 5.4T variant, scored separately as fineweb-edu-score-2)
Modality: Text (English), pretraining corpus
License: ODC-By 1.0 — open with attribution requirement
Source: Common Crawl WARC dumps, filtered through an educational-quality classifier
Maintainer: HuggingFace Science (Loubna Ben Allal et al.)
Paper: arXiv:2406.17557
DOI: 10.57967/hf/2497
Distribution format: Parquet with explicit schema (11 columns including score, int_score, language_score)
HF downloads (May 2026): 572,057 · 1,069 likes

The headline finding

FineWeb-Edu is one of the best-documented open pretraining corpora in existence. The HF Science team's datasheet, ablations, and educational-classifier description set a bar that almost no other web-scale corpus meets. Documentation alone earns a 96.

The two issues an enterprise procurement team should know about are not quality issues. They are provenance issues.

  1. The quality classifier is not independent of the data it scores. FineWeb-Edu's "educational value" classifier was trained on annotations generated by Llama-3-70B-Instruct — a model that was itself trained on overlapping Common Crawl content. The audit framework treats this as a circular dependency: the quality oracle and the rated asset share an unobserved common parent. The paper discloses this openly; we don't think it's a flaw in honesty, just in independence. It scores 55 on classifier independence because the chain is real, not because anyone is hiding it.
  2. ODC-By 1.0 is "open" but requires attribution that almost no downstream model card honors. Every commercial model trained on FineWeb-Edu inherits an attribution obligation back to HuggingFace and the Common Crawl Foundation. We surveyed 40 model cards published in the last six months that disclose FineWeb-Edu in their training mix. Seven include an attribution line. The license is open; the practice around it is not. License clarity scores 70 — the document is fine, the ecosystem behavior is the issue.

Why this audit exists. Every model card that says "trained on FineWeb-Edu" inherits the provenance, license, and PII surface of the underlying corpus. Procurement teams and model-risk reviewers cannot evaluate those surfaces from a README. The LQS framework standardizes the questions and the answers so that "FineWeb-Edu = 73 Silver" means the same thing to a buyer at Bank A as to a buyer at Bank B.

Dimension-by-dimension reasoning

Documentation — 96

Dataset card, datasheet, ablation tables, classifier description, sample configurations (10BT / 100BT / 350BT), and a published paper with detailed methodology. The paper documents what was kept, what was filtered, and the educational-classifier label distribution. Comparable only to the Pile v1 paper and the C4 datasheet in completeness.

Sources: HF dataset card · arXiv 2406.17557 · HF blog post on the educational filter

Format compliance — 95

Parquet with explicit schema. 11 columns, every column dtype declared. Multiple sampling configurations exposed (sample-10BT, sample-100BT, sample-350BT, default). MLCroissant metadata published. Loads cleanly via datasets, polars, dask. Deduction is for the absence of a Bloom-filter cross-shard dedup attestation.
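
As a quick illustration of the load test, a minimal sketch using the datasets library, in streaming mode so only the sample config is touched, never the full corpus:

# Minimal load check: stream one row of the sample-10BT config and
# confirm the declared schema is present. Nothing is downloaded in bulk.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

row = next(iter(ds))
print(sorted(row))                                    # the 11 declared columns
print(row["score"], row["int_score"], row["language_score"])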

Sources: HF config blob · MLCroissant manifest · independent load test

Size adequacy — 100

1.3T tokens at the high-quality cutoff (int_score ≥ 3); 5.4T at the broader cutoff. Adequate for end-to-end pretraining of any model up to roughly 30B parameters at Chinchilla-optimal data ratios.
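
The 30B figure follows from the common reading of Hoffmann et al. as roughly 20 training tokens per parameter; a back-of-the-envelope check:

# Chinchilla-style budget check (assumes the usual ~20 tokens/parameter heuristic).
tokens_per_param = 20
params = 30e9                              # a 30B-parameter model
tokens_needed = params * tokens_per_param  # 6.0e11
print(f"need {tokens_needed / 1e12:.1f}T tokens, corpus has 1.3T "
      f"({1.3e12 / tokens_needed:.1f}x headroom)")
# -> need 0.6T tokens, corpus has 1.3T (2.2x headroom)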

Sources: dataset card · Chinchilla scaling reference (Hoffmann et al. 2022)

Reproducibility — 88

The filtering classifier's weights are public. The annotations it was trained on (Llama-3-70B-Instruct outputs) are not redistributable, so the classifier's exact training corpus cannot be reproduced. Everything downstream of the classifier weights is fully reproducible. Above average for a 1T+ token corpus; not best-in-class, because the classifier-training step has a closed-model dependency.
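
A sketch of the step that is reproducible: scoring text with the published weights. We assume the classifier repo id HuggingFaceFW/fineweb-edu-classifier and the single-logit regression-head interface described in the release:

# Sketch: score a passage with the published classifier weights.
# Assumes repo id HuggingFaceFW/fineweb-edu-classifier and a single-logit
# regression head, per the HF release; int_score clamps and rounds to 0-5.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "HuggingFaceFW/fineweb-edu-classifier"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

inputs = tok("Photosynthesis converts light into chemical energy.",
             return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1).item()
print(score, int(round(max(0.0, min(score, 5.0)))))   # raw score, int_score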

Sources: classifier weights repo · paper Section 3

Provenance chain — 62

Common Crawl → FineWeb → educational-classifier filter → FineWeb-Edu. Each hop is documented; together they form a chain whose root (Common Crawl) inherits the entire open web with all its provenance gaps. We score the chain documentation as strong but the upstream surface as inherently limited. This is the dimension where any web-scale corpus is going to land in the 50–65 range; it's not unique to FineWeb-Edu.

Sources: HF dataset card · CC Foundation lineage docs · paper Section 2

Classifier independence — 55

The educational-value classifier was trained on Llama-3-70B-Instruct annotations. Llama-3 was trained on a corpus that overlaps Common Crawl. The dependency is not simply classifier → rates data. It is: (model trained on data) → annotates samples of data → classifier trained on those annotations → classifier filters data. The classifier's idea of "educational" is partially inherited from Llama-3's idea of "educational" — which is itself a function of the corpus the classifier is now filtering. This isn't fatal; it's a real consideration for procurement teams who want to know whether the quality signal is independent of the underlying distribution.

Sources: paper Section 3 (annotation methodology) · LQS classifier-independence rubric v1

License clarity — 70

ODC-By 1.0 is well-known and unambiguously open. The deduction is not about the license; it's about the attribution chain downstream of it. ODC-By requires attribution to data providers; for a model trained on FineWeb-Edu, that means attribution to HuggingFace and to Common Crawl. A single model-card line such as "Trained in part on FineWeb-Edu (HuggingFace), derived from Common Crawl, used under ODC-By 1.0" would typically satisfy it. Our survey of 40 recent model cards finds 7 with explicit attribution. The license is fine; the practice around it is procurement-relevant because non-compliance is the buyer's exposure, not the seller's.

Sources: ODC-By 1.0 text · 40-card model card survey (LabelSets internal, May 2026)

Copyright surface — 48

FineWeb-Edu is filtered from web pages crawled by Common Crawl. Common Crawl operates under fair-use claims for indexing; that legal posture is not a copyright waiver for derivative training. The fact that FineWeb-Edu filters down to "educational" content does not change the underlying copyright status of the source pages. For commercial pretraining, this is the largest open legal question and the active subject of litigation. Score is low because the surface is real, not because the maintainers did anything wrong.

Sources: Common Crawl Foundation policy page · active litigation tracker (NYT v. OpenAI, Authors Guild v. OpenAI, etc.) · paper Section 6 (intended use disclaimer)

PII residual risk — 58

Inherited from the Common Crawl base. FineWeb (the parent dataset) applies URL filtering and language-ID filtering but no published PII scrubber. The educational filter further narrows the distribution but does not specifically target PII, and no PII audit results have been published. For procurement profiles touching healthcare, financial, or EU-jurisdiction data, this surface needs an additional scrub layer.
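
A first-pass scan is cheap to run yourself. A sketch, regex heuristics only, not a production scrubber; the patterns below are illustrative assumptions:

# Illustrative first-pass PII scan over a streamed sample. Regex heuristics
# only -- a real scrub layer needs NER-based detection on top of this.
import re
from datasets import load_dataset

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{8,}\d")          # loose, high-recall

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

hits = sum(
    1
    for _, row in zip(range(1_000), ds)
    if EMAIL.search(row["text"]) or PHONE.search(row["text"])
)
print(f"{hits}/1000 sampled docs contain email- or phone-shaped strings")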

Sources: FineWeb pipeline doc · absence of published PII scrub stats · EU AI Act Art. 10 §2 inventory requirements

Contamination disclosure — 45

No benchmark-contamination analysis was published at release time. Independent reproductions have found measurable contamination with the MMLU, HellaSwag, and ARC test sets, in line with what's expected from any Common-Crawl-derived corpus. This is one of the two dimensions v3.1 buyers will care about most: a model trained on FineWeb-Edu cannot honestly claim a clean evaluation on most common reasoning benchmarks without a separate decontamination pass.
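
At its simplest, a decontamination pass is the standard 13-gram overlap test (GPT-3-style). A self-contained sketch; the strings stand in for a corpus document and benchmark items:

# Sketch of a 13-gram decontamination check (GPT-3-style n-gram overlap).
# The strings below are stand-ins for a corpus document and benchmark items.
def ngrams(text, n=13):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(doc, benchmark_items, n=13):
    index = set().union(*(ngrams(t, n) for t in benchmark_items))
    return not ngrams(doc, n).isdisjoint(index)

doc = ("the citric acid cycle and oxidative phosphorylation both take place "
       "inside the mitochondria of eukaryotic cells during aerobic respiration")
print(contaminated(doc, [doc]))                                # True: verbatim overlap
print(contaminated(doc, ["an unrelated benchmark question"]))  # False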

Sources: independent contamination reports (Hugo Larcher et al. blog) · LQS contamination scanner output

Subgroup coverage — 42

The educational filter systematically narrows the distribution toward Wikipedia, textbook, and academic-blog-style English. Casual register, conversational text, code-switched bilingual content, non-Western dialects, and AAVE are filtered out at much higher rates than literary or academic English. For a pretraining corpus where downstream coverage of underrepresented dialects matters, this is the largest single concern. The filter was deliberate; the consequence for distributional coverage is real.

Sources: paper Section 4 (label distribution) · LQS subgroup-equity rubric · BIG-bench dialect coverage benchmarks

Maintainer reputation — 94

HuggingFace Science is among the most credible open-data maintainers in the field. Consistent publication record, responsive to community findings, transparent about ablations and limitations. The Loubna Ben Allal / Anton Lozhkov authorship line carries weight. No deduction for the team; only for the absence of an external audit chain attached to the release (an opportunity, not a fault).

Sources: HF Science publication history · prior FineWeb v1 transparency

Procurement profile — what this means for buyers

  1. Add an attribution line. ODC-By obligations flow through to your model card; a single line naming HuggingFace and Common Crawl closes a gap most of the ecosystem leaves open.
  2. Budget a decontamination pass. Benchmark claims on MMLU, HellaSwag, or ARC are not clean without one.
  3. Add a PII scrub layer for regulated profiles. Nothing upstream scrubs PII; healthcare, financial, and EU-jurisdiction deployments need their own.
  4. Treat the copyright surface as open legal exposure, not a settled question; this is the same posture any Common-Crawl-derived corpus requires.

Methodology

This audit was scored under LQS v3.1 with the dataset-signals adapter for public corpora. Every dimension above maps to a documented rubric in our methodology paper. Scores are computed deterministically from a signals JSON that we are publishing alongside this report.

# Reproduce this audit locally:
git clone https://github.com/labelsets/lqs-public
cd lqs-public
node scorer.js signals/fineweb-edu.json
# → { composite: 73, tier: "silver", dims: { ... } }

The 7-oracle consensus pass (used for the marketplace cert layer) was not run for this report because we do not have line-item access to the underlying records; we audited the corpus from its public metadata, paper, and the classifier-output distribution. Any downstream model trainer who wants the full v3.1 cert with oracle consensus (or the HF team itself) can contact us to run the full pipeline against a snapshot.

Recourse. If you maintain FineWeb-Edu and believe any score here is wrong, the recourse process is documented in the LQS v3.1 paper §7. In short: file an issue at the public-audit repo with a counter-citation, and we will publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable hash; corrections are issued as new versions, not silent edits.
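
Verifying a report against its hash is a one-liner. A sketch, assuming (our assumption, not a documented spec) that the hash is SHA-256 over the published signals JSON:

# Hedged sketch: verify a published report against its hash. We ASSUME
# SHA-256 over the signals JSON; the actual scheme ships with the report.
import hashlib

with open("signals/fineweb-edu.json", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
# compare against the hash printed in the published report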

What this audit doesn't claim

We are not claiming FineWeb-Edu is low quality; on documentation, format compliance, and size it is best-in-class. We are not claiming the maintainers hid anything; every provenance issue scored here is disclosed in their own paper. And we are not issuing a record-level verdict: this report is scored from public metadata, the paper, and the classifier-output distribution, not from line-item access to the corpus.

What's next

This is Report 001 in a public-audit series. Reports planned over the next 90 days:

Want the next report when it lands?

One email per audit. No marketing. Methodology updates included.
