Research note · Phase 1 · 2026-04-26

49 of 50 finance datasets on HuggingFace fail procurement-grade documentation.

We scored the top 50 finance-related public datasets on the HuggingFace Hub against 7 metadata-derived LQS v3.1 dimensions. 98% failed at least one dimension. 9 had no license declared at all. 90% had no annotation provenance. This is a documentation crisis, not a data-quality finding — yet. Phase 2 (sample-based file analysis) is in progress.

Authors: LabelSets Research  ·  Date: April 26, 2026  ·  Method: public HF API + 7 LQS v3.1 metadata dimensions  ·  Reproducibility: script · raw JSON
Principal author: identity disclosed under NDA — pending counsel review. Public GitHub mirror per Q4 2026 roadmap commitment.
Companion ranking: looking for ranked scores across the broader public-dataset landscape? See 79 named ML datasets ranked by LQS — COCO 94, MIMIC-IV 93, The Pile 77, Common Crawl 65.
TL;DR for procurement / model-risk readers

Of the top 50 finance datasets on HuggingFace by downloads (search-deduped across 8 finance-related queries), 49 failed at least one of 7 metadata-derived LQS v3.1 dimensions. The most common failures: validation_health (94%), completeness (90%), size_adequacy (62%).

The failures most consequential for F500 deployment: 9 datasets had no license declared in their cardData (legally unusable for commercial training without case-by-case legal review), 2 had explicit noncommercial / no-derivatives licenses, and 90% lacked an annotations_creators field — a direct gap against EU AI Act Article 10 documentation requirements.

Only one dataset passed all 7 dimensions: JanosAudran/financial-reports-sec (expert-annotated SEC filings, Apache-2.0, ~50M items).

  • Datasets audited: 50 (top finance HF · search-deduped)
  • Failed ≥1 dimension: 49 (98% · only 1 clean pass)
  • No license declared: 9 (18% · legally unusable commercially)
  • No annotation provenance: 45 (90% · EU AI Act Art. 10 gap)
Method

What we measured (and what we didn't).

This is a Phase 1 audit. We scored each of the 50 finance datasets against 7 dimensions of LQS v3.1 that can be derived from public HuggingFace metadata alone — no file downloads, no model training. The dimensions, mapped from the 19-dim full LQS spec (a derivation sketch follows the list):

  • completeness — does cardData.annotations_creators declare who labeled the data?
  • uniqueness — declared deduplication methodology (challenge_split vs minimal_dedup)
  • validation_health — label-production class (expert / crowdsourced_qc / official_record / crowdsourced_raw / automated_model)
  • size_adequacy — declared size_categories meets task-typical floor (≥10K items)
  • format_compliance — published in industry-standard format (parquet / arrow)
  • label_density — labels-per-item heuristic from cardData
  • license_clarity — license declared, permissive, and not NC/ND
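
To make the derivation concrete, here is a minimal sketch of how three of these dimensions can be read off public metadata. It assumes the public endpoint https://huggingface.co/api/datasets/<id> (the same API the script hits) and Node 18+ for global fetch; the field names follow HF cardData conventions, but the pass/fail logic is a simplified stand-in for the production scorer, not the actual audit code.

// Sketch: simplified stand-in for the production scorer, not the audit code.
const NONCOMMERCIAL = /(^|-)nc(-|$)|(^|-)nd(-|$)|no-?derivatives/i;

async function auditDataset(id) {
  const meta = await (await fetch(`https://huggingface.co/api/datasets/${id}`)).json();
  const card = meta.cardData ?? {};
  return {
    // completeness: does the card declare who labeled the data?
    completeness: Boolean(card.annotations_creators?.length),
    // license_clarity: a license is declared, known, and not NC/ND
    license_clarity:
      Boolean(card.license) &&
      card.license !== 'unknown' &&
      !NONCOMMERCIAL.test(String(card.license)),
    // size_adequacy: a declared size bucket at or above the 10K floor
    // (illustrative check: any bucket other than n<1K / 1K<n<10K passes)
    size_adequacy: (card.size_categories ?? []).some(
      (s) => s !== 'n<1K' && s !== '1K<n<10K'
    ),
  };
}

auditDataset('JanosAudran/financial-reports-sec').then(console.log);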

What this audit cannot detect (yet). Train/test leakage, benchmark contamination (overlap with MMLU/HumanEval/etc.), oracle disagreement (multi-classifier Fleiss κ), label noise robustness, distributional drift, subgroup-equity gaps, adversarial stability — these are all file-content dimensions in the full 19-dim LQS v3.1 spec. They require sample-based file analysis on the actual dataset, not metadata. Phase 2 (10 representative datasets, sample-based) is in progress and will publish separately.

Important caveat on size_adequacy. This dimension fails when cardData.size_categories is missing or declares a bucket below the 10K-item floor. It does not mean the underlying dataset is too small — many of the audited datasets likely contain plenty of rows. In most failing cases the declared size in HF metadata is simply absent, which prevents a procurement reviewer from sizing the dataset without downloading it. The fix is a one-line cardData edit by the publisher (example below); the audit cannot distinguish "small dataset" from "undocumented size."
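
For illustration, that one-line fix is a size_categories declaration in the dataset card's YAML front matter. The bucket below is hypothetical; HF uses range values such as 10K<n<100K, 100K<n<1M, and 1M<n<10M.

size_categories:   # illustrative bucket; use the dataset's real row-count range
  - 1M<n<10M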

Discovery query strategy: union of 8 search queries — finance, financial, stock, trading, banking, sec-edgar, fintech, earnings — sorted by HF download count, deduplicated by dataset id, filtered to those whose name/description/tags match a finance-domain regex, top 50 retained. The exact finance-domain regex and full search pagination are in the script.
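
A hedged sketch of that discovery step, assuming the public HF list endpoint and its search/sort/direction/limit query parameters. The regex and page size here are illustrative stand-ins, not the script's exact values.

const QUERIES = ['finance', 'financial', 'stock', 'trading',
                 'banking', 'sec-edgar', 'fintech', 'earnings'];
// Illustrative stand-in; the script's exact finance-domain regex differs.
const FINANCE_RE = /financ|stock|trading|bank|edgar|fintech|earning/i;

async function discoverTop50() {
  const byId = new Map();
  for (const q of QUERIES) {
    const url = `https://huggingface.co/api/datasets?search=${encodeURIComponent(q)}&sort=downloads&direction=-1&limit=100`;
    for (const d of await (await fetch(url)).json()) byId.set(d.id, d); // dedupe by id
  }
  return [...byId.values()]
    .filter((d) => FINANCE_RE.test([d.id, d.description ?? '', ...(d.tags ?? [])].join(' ')))
    .sort((a, b) => (b.downloads ?? 0) - (a.downloads ?? 0))
    .slice(0, 50);
}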

Findings · failure-mode breakdown

Where the documentation breaks down.

  • validation_health: 47 / 50 (94%)
  • completeness: 45 / 50 (90%)
  • size_adequacy: 31 / 50 (62%)
  • format_compliance: 14 / 50 (28%)
  • license_clarity: 11 / 50 (22%)
  • label_density: 0 / 50 (0%)
  • uniqueness: 0 / 50 (0%)

Why validation_health and completeness dominate

The two top failure modes are linked. completeness fails when cardData.annotations_creators is empty or absent. When that field is missing, our derivation defaults label_production to crowdsourced_raw, which fails validation_health. So a single missing field — "who labeled this?" — fails both dimensions. The fix is a one-line cardData edit by the dataset publisher; the absence of that one line fails 90% of the audited datasets. A sketch of the cascade follows.
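
A minimal sketch of that cascade, with simplified stand-in logic (the class mapping below is illustrative and assumes the array form of annotations_creators; it is not the production derivation):

// Sketch: how one missing cardData field fails two dimensions.
function deriveLabelProduction(card) {
  const creators = card.annotations_creators ?? [];
  if (creators.includes('expert-generated')) return 'expert';
  if (creators.includes('crowdsourced')) return 'crowdsourced_qc'; // illustrative mapping
  return 'crowdsourced_raw'; // the default when the field is absent
}

const card = {}; // no annotations_creators declared
const completeness = Boolean(card.annotations_creators?.length);             // false
const validationHealth = deriveLabelProduction(card) !== 'crowdsourced_raw'; // false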

This isn't pedantry. Under EU AI Act Article 10, providers of high-risk AI systems must document training-data provenance, including how labels were produced. SR 11-7 model-risk frameworks treat undocumented label provenance as a material weakness. A finance dataset with no annotation_creators declaration cannot land in an F500 model package without case-by-case legal review.

Why license_clarity matters more than 22% suggests

  • No license declared (9 / 50): cardData.license absent or "unknown" — legally unusable for commercial deployment without case-by-case legal review
  • Restrictive license (2 / 50): CC-BY-NC-SA-3.0 or CC-BY-NC-ND-4.0 — explicitly blocks commercial deployment of derivative model weights

Examples of license-undeclared datasets in the audit set: AdaptLLM/finance-tasks (12,929 downloads), Zhangqingyue127/Multimodal-Stock-Forecasting-Dataset (5,880 downloads), meloqiao/us-stock-data (1,908 downloads). All three are popular enough that they're likely already inside someone's training pipeline. Without a declared license, a buyer's legal team has no documentation to attach when an audit asks "what governs your use of this data?"

Examples of NC/ND-licensed datasets: takala/financial_phrasebank (CC-BY-NC-SA-3.0, 11,120 downloads — a frequently-cited finance sentiment benchmark) and HYdsl/Open-domain_Financial_QA (CC-BY-NC-ND-4.0). Models trained on these datasets cannot be commercialized as a matter of license — but the licenses are easy to miss in cardData if you don't look.

If you have one of these in your stack

Practical steps for an active model package.

If your model risk register includes a HuggingFace finance dataset that appears in the scorecard above with one or more failed dimensions, the recommended sequence — in order of audit-defensibility risk:

  1. License-undeclared (9 datasets) → escalate to legal immediately. No commercial deployment until the dataset publisher declares a license, or you replace with a permissively-licensed alternative. JanosAudran/financial-reports-sec (Apache-2.0) is one drop-in option for SEC-filing-style use cases.
  2. NC/ND-licensed (2 datasets) → models trained on these cannot be commercialized. If the dataset is in production training, you have a license breach to disclose. Replacement is the only path.
  3. Missing annotations_creators (45 datasets) → this is your EU AI Act Article 10 documentation gap. Two options: (a) request the publisher add the field via a HuggingFace community PR — many are responsive; (b) write your own provenance attestation based on the dataset's published paper / README. Either creates a defensible audit trail. (An example of the cardData addition follows this list.)
  4. Missing size_categories (31 datasets) → low-risk, easily mitigable. Document the actual row count yourself in your model package. The audit failure flags it; it doesn't block deployment.
  5. Missing arXiv paper (41 datasets) → not one of the 7 scored dimensions and not a blocker on its own, but combined with missing provenance it's a weakness an auditor will probe. Reasonable mitigation: cite the dataset's published paper if one exists outside HF tags, or document the data source independently.
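
For step 3, the cardData addition a community PR would make is one field in the dataset's README front matter, alongside the size_categories line shown earlier for step 4. The value below is hypothetical; it must match the dataset's actual provenance.

annotations_creators:   # the Article 10 provenance field; value must match reality
  - expert-generated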

None of the above changes the underlying data. They change what your audit team can defend in writing — which, under SR 11-7 model-risk management and EU AI Act conformity assessment, is the deliverable that matters.

The one that passed

What good documentation looks like.

Of the 50 audited, exactly one dataset passed all 7 metadata-derived dimensions: JanosAudran/financial-reports-sec. This is a corpus of SEC 10-K and 10-Q filings, expert-annotated, ~50M items, Apache-2.0 licensed. The cardData is filled in: annotations_creators declares expert-generated, the license is explicit, size_categories is declared, and the format is parquet.

This dataset is a useful baseline. Whether the data quality is actually high — whether splits leak, whether labels are consistent across filings, whether the corpus has temporal drift — Phase 1 cannot answer. It can only certify that the documentation is procurement-grade. That is necessary but not sufficient. Phase 2 will sample-download a representative subset and produce the full 19-dim scorecard for this and 9 other datasets.

Per-dataset scorecard

All 50, by downloads.

Composite is a metadata-derived 0–100 score (proxy-derived scores are capped at 85; 86–100 is reserved for expert-validated entries). Failed dims are listed verbatim. Full per-dim reasons and raw HF API responses are in the JSON output.

Table columns: # · HF dataset id · Downloads · License · Composite · Result · Failed dims.
Rows load live from /tools/output/hf-finance-50-audit.json.

What's next

Phase 2 — sample-based file audit.

Documentation gaps tell you what cannot be cited in a procurement package. They do not tell you whether the underlying data is good. Phase 2 will sample-download ~10K rows from each of 10 representative datasets (covering the spread of composite scores in this audit) and run file-based dimensions (a toy sketch of the contamination check follows the list):

  • Train/test leakage detection — JS-divergence between split distributions, plus exact-overlap checks across declared splits.
  • Benchmark contamination — substring + n-gram overlap against the 40+ public eval suites LQS v3.1 tracks (MMLU, HumanEval, GSM8K, FinanceBench, etc.).
  • Schema, dedup, class-balance, and PII findings from real row content — the things metadata cannot tell you.
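
To give a flavor of the second item, here is a toy n-gram overlap check. The n value, whitespace tokenization, and flagging threshold are illustrative assumptions, not the production detector's parameters.

// Toy contamination check: fraction of a row's 8-grams found in an eval suite.
function ngrams(text, n = 8) {
  const toks = text.toLowerCase().split(/\s+/).filter(Boolean);
  const grams = new Set();
  for (let i = 0; i + n <= toks.length; i++) grams.add(toks.slice(i, i + n).join(' '));
  return grams;
}

function contaminationScore(row, evalCorpus) {
  const rowGrams = ngrams(row);
  if (rowGrams.size === 0) return 0;
  const evalGrams = ngrams(evalCorpus);
  let hits = 0;
  for (const g of rowGrams) if (evalGrams.has(g)) hits += 1;
  return hits / rowGrams.size; // e.g. flag rows scoring above 0.2 (illustrative threshold)
}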

Phase 2 will publish at labelsets.ai/research-hf-finance-50-phase-2 within ~10 days.

Reproducibility

Run this yourself.

The full audit script is open in the LabelSets repository:

Reproducibility — clone and run
git clone https://github.com/labelsets/lqs-scorer.git
cd lqs-scorer
node tools/audit-hf-finance-50.js

# outputs:
#   tools/output/hf-finance-50-audit.json    all 50 scorecards + raw HF responses
#   tools/output/hf-finance-50-summary.md    human-readable report

The script hits the public HuggingFace API only — no auth required. Total run time ~90 seconds. The audit is deterministic given the HF API state at run time; the JSON output records the timestamp and HF responses for replay.

The 7-dim derivation is a port of the production calibration tool used internally to bootstrap the LabelSets calibration corpus from public HF metadata (generation/quality/calibration/hf-bootstrap.js). The same logic that mints proxy entries for our calibration tuning produced the scorecards above.

Cite

BibTeX
@misc{labelsets2026hffinance50,
  title = {49 of 50 Finance Datasets on HuggingFace Fail Procurement-Grade Documentation},
  author = {{LabelSets Research}},
  year = {2026},
  month = {April},
  url = {https://labelsets.ai/research-hf-finance-50},
  note = {Phase 1: metadata audit. Phase 2: sample-based file audit pending.}
}

Auditing your own training data?

If you're at an F500 in a regulated industry and need to audit a training-data corpus for SR 11-7, EU AI Act, FDA, or §1557, we're picking 5–10 design partners for 6 months of LQS Enterprise — free, in exchange for a logo + short case study. No demo, no sales call.