Public LQS Audit · Report 003

RedPajama-V2 — a procurement-grade audit of the 30T-token open pretraining corpus.

Composite 81 / 100. Gold tier. The largest open pretraining corpus in routine industry use, structured into three quality buckets. Strong technical execution. Two procurement-relevant gaps: an HF-metadata license signal that returns "unknown," and a quality-classifier whose thresholds are public but whose downstream effect on the data distribution isn't easy to audit. Open methodology, signed result.

Published May 19, 2026 · LabelSets Research · 10 min read · Author: Alex Adrion
81 / 100 · Gold · LQS v3.1 composite, default profile
pretraining profile: 84 · instruction-tuning profile: 58 · RAG profile: 80

Size adequacy: 100
Format compliance: 92
Maintainer reputation: 86
Documentation: 80
Reproducibility: 80
Quality-bucket disclosure: 72
Provenance chain: 65
License clarity (metadata): 60
PII residual risk: 55
Copyright surface: 48
Contamination cleanliness: 42
Subgroup coverage: 48

What we audited

Dataset: togethercomputer/RedPajama-Data-V2
Size: ~30 trillion tokens (5 languages; three quality buckets: head, middle, default)
Modality: Text (pretraining corpus)
Languages: English, German, French, Spanish, Italian
License (HF metadata): unknown. The repository's HF API license field returns null; downstream tooling has to read the README to discover terms
Source: Common Crawl (84 snapshots from 2014 to 2023), filtered through the CCNet pipeline plus quality-classifier scores
Maintainer: Together AI
Paper: arXiv:2302.03169 (RedPajama-V1 launch paper; V2 covered in blog + dataset card)
Distribution format: Parquet, sharded by Common Crawl dump + language + quality bucket. ~30B docs at default level.
HF downloads (May 2026): ~8,200 · 402 likes

The headline finding

RedPajama-V2 is the largest open pretraining corpus most teams in industry actually run against. Where FineWeb-Edu is the polished flagship and The Pile is the legally fraught foundational corpus, RedPajama-V2 is the workhorse: five languages, 84 Common Crawl snapshots, and three quality strata so buyers can pick a curation tier matching their compute budget. That structural choice is genuinely useful: a foundation-model team can train on the head bucket only and inherit a much smaller, much higher-quality slice; a research team running ablations can train on default and replicate broad-distribution baselines.

There are two procurement-relevant gaps a model-risk team should know about. Neither is a fatal flaw; both reflect the reality that RedPajama-V2 is a research artifact maintained by a commercial entity, and that dynamic shows in the documentation surface.

  1. License metadata says "unknown." The HuggingFace API license field for this dataset returns null. The actual license terms exist (Apache 2.0 for the released filtering code; data terms inherited from Common Crawl plus Together AI's redistribution statement in the README), but a procurement scanner reading HF metadata alone will report license: unknown. For compliance teams running automated license audits across the open-data supply chain, which is increasingly standard practice, RedPajama-V2 will trip a flag unless the audit pipeline reads the README too. We score license clarity at 60 because the documented terms are acceptable; the machine-readable metadata that procurement tooling relies on is not.
  2. Quality-classifier opacity. RedPajama-V2 publishes the classifier code and the bucket thresholds publicly. What's harder to audit is what the bucket boundaries do to the underlying distribution — which kinds of text end up in head, which in middle, which excluded entirely. The classifier is trained on a mix of public quality signals and Together AI's internal labels; the labels aren't released. For procurement teams downstream of a model trained on head: you inherit a curated distribution whose curation function isn't fully open. We score quality-bucket disclosure at 72.

Why this audit exists. A buyer evaluating a foundation model trained on RedPajama-V2 needs to know which bucket the model trained on, what the classifier excludes, what the license terms actually are (vs. what HF metadata says), and what the contamination surface looks like at 30T-token scale. The LQS framework standardizes those questions across pretraining corpora so that "RedPajama-V2 = 81 Gold" carries the same meaning across procurement audits at different organizations.

Dimension-by-dimension reasoning

Size adequacy — 100

30 trillion tokens at the default cutoff. Adequate for end-to-end pretraining of any model up to roughly 600B parameters at Chinchilla-optimal data ratios. The head bucket alone is still ~3T tokens — comparable in scale to FineWeb-Edu's flagship cut. Effectively unlimited for any team not training a frontier model.
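
As a rough arithmetic check on the sizing claim, the sketch below converts a token budget into a compute-optimal parameter count. The ratio of about 20 training tokens per parameter is the commonly quoted Chinchilla rule of thumb and is an assumption here, not a figure taken from this audit; `max_params` is a hypothetical helper name.

```python
# Rough sizing check. TOKENS_PER_PARAM = 20 is the commonly quoted
# Chinchilla rule of thumb, an assumption rather than a figure from this audit.
TOKENS_PER_PARAM = 20


def max_params(tokens: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Largest parameter count a token budget supports at the given ratio."""
    return tokens / tokens_per_param


default_tokens = 30e12  # ~30T tokens at the default cutoff (from the audit)
head_tokens = 3e12      # ~3T tokens in the head bucket (from the audit)

print(f"default: {max_params(default_tokens) / 1e9:,.0f}B params")
print(f"head:    {max_params(head_tokens) / 1e9:,.0f}B params")

# Tokens-per-parameter ratio implied by the "roughly 600B parameters" figure.
implied_ratio = default_tokens / 600e9
print(f"implied tokens per parameter at 600B: {implied_ratio:.0f}")
```

Under the 20:1 assumption the default cut would support about 1.5T parameters and the head bucket about 150B, so the 600B figure corresponds to a more conservative ratio of about 50 tokens per parameter.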

Sources: Together AI launch blog · dataset card token counts · Chinchilla scaling reference

Format compliance — 92

Parquet, sharded sensibly by (Common Crawl dump × language × quality bucket). Loads cleanly via datasets, polars, dask. Schema includes the quality-classifier score per record, which is useful — buyers wanting a stricter cut can filter downstream. Deduction is for the absence of a published MLCroissant manifest at audit time (FineWeb-Edu publishes one).
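
A minimal sketch of the downstream filtering mentioned above, kept to the standard library so it runs anywhere. The field names `quality_score` and `text`, the threshold, and the helper `stricter_cut` are all hypothetical; the real column names should be taken from the dataset card before adapting this.

```python
from typing import Iterable, Iterator


def stricter_cut(records: Iterable[dict], min_score: float,
                 score_field: str = "quality_score") -> Iterator[dict]:
    """Yield only records whose classifier score meets the threshold.

    Records missing the score field are dropped rather than guessed at.
    """
    for record in records:
        score = record.get(score_field)
        if score is not None and score >= min_score:
            yield record


# Toy records standing in for rows read from a Parquet shard.
sample = [
    {"text": "peer-reviewed abstract", "quality_score": 0.91},
    {"text": "forum reply", "quality_score": 0.42},
    {"text": "row with no score"},
]
kept = list(stricter_cut(sample, min_score=0.8))
print([r["text"] for r in kept])
```

Writing the filter as a generator keeps memory flat when it is applied shard by shard, which matters at this corpus size.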

Sources: HF config blob · independent load test · paper Section 3

Maintainer reputation — 86

Together AI is a well-funded commercial entity with a sustained record of open releases (RedPajama-V1, the RedPajama-3B / 7B model checkpoints, the OpenChatKit project). Strong responsiveness on GitHub. The single deduction reflects the commercial-entity dynamic: a future business decision could deprioritize maintenance, and the contractual commitment to keep V2 hosted is implicit, not explicit. Compare to AI2's Dolma which has an academic affiliation commitment, or HF's FineWeb-Edu which has HF's commercial commitment to dataset hosting at scale.

Sources: Together AI publication history · GitHub activity on togethercomputer/RedPajama-Data

Documentation — 80

HF dataset card describes the construction methodology, the three quality buckets, the CCNet filtering pipeline, and the included Common Crawl snapshots. A launch blog post supplements with rationale. The V1 paper covers methodology that's largely inherited by V2. Where this falls short of FineWeb-Edu's 96: no published datasheet, no ablation tables for the V2 quality-classifier specifically, no explicit data-card subgroup-coverage analysis. Adequate, not exemplary.

Sources: HF dataset card · Together AI launch blog · RedPajama-V1 paper

Reproducibility — 80

CCNet filtering code is public. Quality-classifier weights are public. The training data for the quality classifier is partially public (the released labelled set Together AI used) but not fully. Anyone with sufficient Common Crawl access and compute can reproduce the V2 pipeline end-to-end, with the caveat that the quality-classifier reproduction depends on a labelling step where the full label set isn't released. Above field average; below FineWeb-Edu's transparency level.

Sources: CCNet repo · quality-classifier weights · paper Section 4

Quality-bucket disclosure — 72

The thresholds separating head, middle, and default are published. The classifier weights are published. What isn't published is a per-bucket distributional audit: what fraction of head is academic vs. forum vs. news vs. literary? How does the bucket distribution skew across subject areas? A model trained on head inherits a curation function that's not fully visible from the maintainer's documentation. The information could be derived externally (load head, run topic classifiers, publish the distribution); it just hasn't been.
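
The external derivation described above (load a bucket, label each document, publish the distribution) could be sketched as follows. The keyword-based `label_topic` is a deliberately crude placeholder for a real topic classifier, and every name in the block is hypothetical.

```python
from collections import Counter, defaultdict


def label_topic(text: str) -> str:
    """Placeholder topic classifier: keyword matching, for illustration only."""
    lowered = text.lower()
    if "abstract" in lowered or "et al" in lowered:
        return "academic"
    if "posted by" in lowered or "reply" in lowered:
        return "forum"
    return "other"


def bucket_distribution(records):
    """Return {bucket: {topic: fraction}} for an iterable of (bucket, text) pairs."""
    counts = defaultdict(Counter)
    for bucket, text in records:
        counts[bucket][label_topic(text)] += 1
    return {
        bucket: {topic: n / sum(c.values()) for topic, n in c.items()}
        for bucket, c in counts.items()
    }


# Toy sample; a real audit would stream a random sample from each bucket.
sample = [
    ("head", "Abstract: we study sparse attention."),
    ("head", "Smith et al. report similar results."),
    ("head", "Posted by user42: any reply appreciated."),
    ("middle", "Posted by anon: thanks!"),
]
dist = bucket_distribution(sample)
print(dist)
```

Swapping in a stronger classifier changes only `label_topic`; the per-bucket tally is the part a published distributional audit would report.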

Sources: classifier code and weights · dataset card bucket descriptions · absence of distributional audit

Provenance chain — 65

Common Crawl → CCNet filter → RedPajama-V2 (default) → quality-classifier scoring → RedPajama-V2 (head / middle). Each hop is documented. The root (Common Crawl) inherits the open-web provenance surface with all its gaps. For a 30T-token corpus this is roughly the best score achievable — the upstream surface caps any pretraining corpus derived from web crawls. Comparable to FineWeb-Edu's 62.

Sources: pipeline doc · CC Foundation lineage · dataset card construction section

License clarity (metadata) — 60

The HF metadata license field returns null. The actual terms exist in the README + GitHub repo: Apache 2.0 for code, redistribution inherited from Common Crawl Foundation policy, with Together AI's redistribution statement. The procurement issue isn't the terms themselves — they're permissive — it's that a downstream compliance scanner reading HF metadata gets license: unknown and has to escalate to manual review. For organizations running automated license audits across hundreds of open-data sources (an increasingly standard pattern), RedPajama-V2 trips a flag where it shouldn't.
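
To illustrate the scanner behaviour described above, here is a minimal sketch of a license resolver that prefers the metadata field, falls back to README mentions, and otherwise escalates. It makes no network calls: the metadata value and README text are passed in by the caller. The function name, the return shape, and the two regex patterns are assumptions for illustration, not any real scanner's rules.

```python
import re
from typing import Optional

# Illustrative patterns only; a real scanner would use a fuller SPDX list.
LICENSE_PATTERNS = {
    "apache-2.0": re.compile(r"apache(\s+license)?[\s,]*(version\s*)?2\.0", re.I),
    "mit": re.compile(r"\bMIT license\b", re.I),
}


def resolve_license(metadata_license: Optional[str], readme_text: str) -> dict:
    """Prefer the metadata field; fall back to README mentions; else escalate."""
    if metadata_license:
        return {"license": metadata_license, "source": "metadata", "manual_review": False}
    hits = sorted(name for name, pat in LICENSE_PATTERNS.items() if pat.search(readme_text))
    if hits:
        # README mentions are hints, so a human still confirms what they cover.
        return {"license": hits, "source": "readme", "manual_review": True}
    return {"license": "unknown", "source": None, "manual_review": True}


result = resolve_license(None, "The code is released under the Apache 2.0 license.")
print(result)
```

Note that the README fallback still sets `manual_review`: a match on "Apache 2.0" says nothing about whether it covers the code, the data, or both, which is exactly the distinction this dimension scores.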

Sources: HF API metadata response · README + Together AI redistribution statement · LQS license-clarity rubric for metadata vs. document terms

PII residual risk — 55

Inherited from the Common Crawl base. CCNet applies language ID and URL filtering but no PII scrubber by default. Together AI does not publish a PII audit. The quality classifier doesn't specifically target PII. For procurement profiles touching healthcare, financial, or EU-jurisdiction data, this surface needs a downstream scrubbing layer regardless of which RedPajama-V2 bucket is used. Comparable to FineWeb-Edu (58).
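
As a sketch of what the thinnest possible downstream scrubbing layer looks like, the block below redacts email addresses and simple ten-digit phone numbers with regular expressions. The patterns and placeholder tokens are assumptions chosen for illustration; they are nowhere near sufficient for healthcare, financial, or EU-jurisdiction use, where names, addresses, and identifiers also need handling.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Loose ten-digit phone pattern; illustrative, not exhaustive.
PHONE = re.compile(r"\b\d{3}[\s.-]?\d{3}[\s.-]?\d{4}\b")


def scrub(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


cleaned = scrub("Contact jane.doe@example.com or 415-555-0132 for details.")
print(cleaned)
```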

Sources: CCNet pipeline doc · absence of published PII scrub results · EU AI Act Art. 10 §2 inventory requirements

Copyright surface — 48

RedPajama-V2 is filtered from Common Crawl. Common Crawl operates under fair-use claims for indexing; that legal posture is not a copyright waiver for derivative training. RedPajama-V2 inherits the full open-web copyright surface, which is the active subject of multi-party litigation (NYT v. OpenAI, Authors Guild, multiple class actions). For commercial pretraining this is the largest open legal question. Score is low because the surface is real, not because Together AI did anything wrong. Identical posture to FineWeb-Edu (48).

Sources: Common Crawl Foundation policy · active litigation tracker · dataset card intended-use disclaimer

Contamination cleanliness — 42

No published benchmark-contamination analysis for RedPajama-V2 specifically. Common-Crawl-derived corpora have a documented contamination floor across MMLU, HellaSwag, ARC, and most reasoning benchmarks. The LabelSets Contamination Report 001 (80 post-training datasets scanned) found measurable benchmark overlap in 23 of 80 — those numbers establish the surface for post-training data; pretraining corpora at 30T-token scale have larger surfaces by construction. Report 002 (in progress) will scan pretraining corpora directly including RedPajama-V2. For now, a model trained on RedPajama-V2 evaluated on MMLU should disclose the pretraining-mix contamination caveat in its model card.
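
For readers unfamiliar with how benchmark overlap is typically measured, a minimal n-gram overlap check is sketched below. This is a generic technique, not the method used in Contamination Report 001, which this audit does not describe. The default window of 13 tokens follows a size used in some published decontamination procedures and is an assumption here, as are the whitespace tokenization and the function names.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, document: str, n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(document, n)) / len(item_grams)


# Short strings, so the example uses n=3 instead of the default window.
item = "the capital of france is paris"
leaked = "quiz answer: the capital of france is paris according to the key"
clean = "weather today is mild"

print(overlap_fraction(item, leaked, n=3))
print(overlap_fraction(item, clean, n=3))
```

At 30T-token scale the document side would be indexed (for example with hashed n-grams) rather than recomputed per comparison; the scoring logic stays the same.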

Sources: LabelSets Contamination Report 001 · absence of maintainer-published contamination analysis · MMLU-Pro paper Section 2 (general contamination context)

Subgroup coverage — 48

The quality-classifier filter narrows the distribution toward text that the perplexity-based quality scoring favors: broadly academic, news, and formal writing. Casual register, conversational text, code-switched bilingual content, and dialect English are filtered at much higher rates than literary or academic English. The five-language coverage (EN/DE/FR/ES/IT) is meaningful but covers only European languages: Mandarin, Hindi, Arabic, and Swahili are all absent. For multilingual model evaluation this is the largest dimension gap.

Sources: dataset card language list · quality-classifier filter behaviour · BIG-bench dialect coverage benchmarks

Procurement profile — what this means for buyers

Methodology

This audit was scored under LQS v3.1 with the public-pretraining-corpus adapter. Every dimension above maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981). The procurement profiles are computed by re-weighting the same dimensions; the weights are public in the calibration corpus.
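
To make the re-weighting idea concrete, the sketch below computes a composite as a weighted mean of the twelve dimension scores published in this report. The actual LQS weights live in the calibration corpus and are not reproduced here; the `hypothetical` weights and the weighted-mean form itself are assumptions for illustration.

```python
# Dimension scores as published in this report.
SCORES = {
    "size_adequacy": 100, "format_compliance": 92, "maintainer_reputation": 86,
    "documentation": 80, "reproducibility": 80, "quality_bucket_disclosure": 72,
    "provenance_chain": 65, "license_clarity": 60, "pii_residual_risk": 55,
    "copyright_surface": 48, "contamination_cleanliness": 42, "subgroup_coverage": 48,
}


def composite(scores: dict, weights: dict) -> float:
    """Weighted mean of dimension scores; weights need not sum to 1."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


equal = {name: 1.0 for name in SCORES}
print(composite(SCORES, equal))  # unweighted mean of the published scores

# Hypothetical profile that doubles the weight on two dimensions.
hypothetical = dict(equal, size_adequacy=2.0, format_compliance=2.0)
print(round(composite(SCORES, hypothetical), 1))
```

The unweighted mean of the twelve published scores is 69, so if the composite is a weighted mean, the default profile's 81 implies heavier weights on the stronger dimensions.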

The 7-oracle consensus pass was not run for this report: at 30T tokens, RedPajama-V2 is not a candidate for oracle-based scoring at the level of individual records. The audit is metadata- and structure-based, the same lens used for FineWeb-Edu. For maintainers wanting full oracle-cert results on a representative sample (e.g. a 1B-token slice of head), contact us.

Recourse. If you maintain RedPajama-V2 and believe any score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable cert hash; corrections are issued as new versions.
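
This report does not describe how its cert hashes are computed. As one common way to get the property described above, the sketch below hashes a canonical JSON encoding of a report record with SHA-256, so any change to a score yields a different hash and a correction has to ship as a new version. The record fields, the `cert_hash` name, and the example correction are all hypothetical.

```python
import hashlib
import json


def cert_hash(report: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1_0 = {"report": "003", "version": "1.0", "composite": 81, "tier": "Gold"}
v1_1 = dict(v1_0, version="1.1", composite=82)  # hypothetical correction

print(cert_hash(v1_0))
print(cert_hash(v1_1))
```

Sorting keys before hashing means two records with the same content always produce the same hash, regardless of the order in which fields were written.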

What this audit doesn't claim

What's next

This is Report 003 in the public-audit series. Coming up:
