Research note · Phase 1 · 2026-04-26

49 of 50 finance datasets on HuggingFace fail procurement-grade documentation.

We scored the top 50 finance-related public datasets on the HuggingFace Hub against 7 metadata-derived LQS v3.1 dimensions. 98% failed at least one dimension. 9 had no license declared at all. 90% had no annotation provenance. This is a documentation crisis, not a data-quality finding — yet. Phase 2 (sample-based file analysis) is in progress.

Authors: LabelSets Research  ·  Date: April 26, 2026  ·  Method: public HF API + 7 LQS v3.1 metadata dimensions  ·  Reproducibility: script · raw JSON
Principal author: identity disclosed under NDA — pending counsel review. Public GitHub mirror per Q4 2026 roadmap commitment.
Companion ranking: looking for ranked scores across the broader public-dataset landscape? See 79 named ML datasets ranked by LQS — COCO 94, MIMIC-IV 93, The Pile 77, Common Crawl 65.
TL;DR for procurement / model-risk readers

Of the top 50 finance datasets on HuggingFace by downloads (search-deduped across 8 finance-related queries), 49 failed at least one of 7 metadata-derived LQS v3.1 dimensions. The most common failures: validation_health (94%), completeness (90%), size_adequacy (62%).

The failures most consequential for F500 deployment: 9 datasets had no license declared in their cardData (legally unusable for commercial training without case-by-case legal review), 2 had explicit noncommercial / no-derivatives licenses, and 90% lacked an annotations_creators field — a direct gap against EU AI Act Article 10 documentation requirements.

Only one dataset passed all 7 dimensions: JanosAudran/financial-reports-sec (expert-annotated SEC filings, Apache-2.0, ~50M items).

  • Datasets audited: 50 (top finance HF · search-deduped)
  • Failed ≥1 dimension: 49 (98% · only 1 clean pass)
  • No license declared: 9 (18% · legally unusable commercially)
  • No annotation provenance: 45 (90% · EU AI Act Art. 10 gap)
Method

What we measured (and what we didn't).

This is a Phase 1 audit. We scored each of the 50 finance datasets against 7 dimensions of LQS v3.1 that can be derived from public HuggingFace metadata alone — no file downloads, no model training. The dimensions, mapped from the 19-dim full LQS spec (a derivation sketch follows the list):

  • completeness — does cardData.annotations_creators declare who labeled the data?
  • uniqueness — declared deduplication methodology (challenge_split vs minimal_dedup)
  • validation_health — label-production class (expert / crowdsourced_qc / official_record / crowdsourced_raw / automated_model)
  • size_adequacy — declared size_categories meets task-typical floor (≥10K items)
  • format_compliance — published in industry-standard format (parquet / arrow)
  • label_density — labels-per-item heuristic from cardData
  • license_clarity — license declared, permissive, and not NC/ND
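
To make the derivation concrete, here is a minimal sketch of how three of these dimensions can be read off public metadata. It assumes the public endpoint https://huggingface.co/api/datasets/<id> (the same API the script hits) and Node 18+ for global fetch; the field names follow HF cardData conventions, but the pass/fail logic is a simplified stand-in for the production scorer, not the actual audit code.

// Sketch: simplified stand-in for the production scorer, not the audit code.
const NONCOMMERCIAL = /(^|-)nc(-|$)|(^|-)nd(-|$)|no-?derivatives/i;

async function auditDataset(id) {
  const meta = await (await fetch(`https://huggingface.co/api/datasets/${id}`)).json();
  const card = meta.cardData ?? {};
  return {
    // completeness: does the card declare who labeled the data?
    completeness: Boolean(card.annotations_creators?.length),
    // license_clarity: a license is declared, known, and not NC/ND
    license_clarity:
      Boolean(card.license) &&
      card.license !== 'unknown' &&
      !NONCOMMERCIAL.test(String(card.license)),
    // size_adequacy: a declared size bucket at or above the 10K floor
    // (illustrative check: any bucket other than n<1K / 1K<n<10K passes)
    size_adequacy: (card.size_categories ?? []).some(
      (s) => s !== 'n<1K' && s !== '1K<n<10K'
    ),
  };
}

auditDataset('JanosAudran/financial-reports-sec').then(console.log);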

What this audit cannot detect (yet). Train/test leakage, benchmark contamination (overlap with MMLU/HumanEval/etc.), oracle disagreement (multi-classifier Fleiss κ), label noise robustness, distributional drift, subgroup-equity gaps, adversarial stability — these are all file-content dimensions in the full 19-dim LQS v3.1 spec. They require sample-based file analysis on the actual dataset, not metadata. Phase 2 (10 representative datasets, sample-based) is in progress and will publish separately.

Important caveat on size_adequacy. This dimension fails when cardData.size_categories is missing or declares a bucket below the 10K-item floor. It does not mean the underlying dataset is too small — many of the audited datasets likely contain plenty of rows. In most failing cases the declared size in HF metadata is simply absent, which prevents a procurement reviewer from sizing the dataset without downloading it. The fix is a one-line cardData edit by the publisher (example below); the audit cannot distinguish "small dataset" from "undocumented size."
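
For illustration, that one-line fix is a size_categories declaration in the dataset card's YAML front matter. The bucket below is hypothetical; HF uses range values such as 10K<n<100K, 100K<n<1M, and 1M<n<10M.

size_categories:   # illustrative bucket; use the dataset's real row-count range
  - 1M<n<10M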

Discovery query strategy: union of 8 search queries — finance, financial, stock, trading, banking, sec-edgar, fintech, earnings — sorted by HF download count, deduplicated by dataset id, filtered to those whose name/description/tags match a finance-domain regex, top 50 retained. The exact finance-domain regex and full search pagination are in the script.
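
A hedged sketch of that discovery step, assuming the public HF list endpoint and its search/sort/direction/limit query parameters. The regex and page size here are illustrative stand-ins, not the script's exact values.

const QUERIES = ['finance', 'financial', 'stock', 'trading',
                 'banking', 'sec-edgar', 'fintech', 'earnings'];
// Illustrative stand-in; the script's exact finance-domain regex differs.
const FINANCE_RE = /financ|stock|trading|bank|edgar|fintech|earning/i;

async function discoverTop50() {
  const byId = new Map();
  for (const q of QUERIES) {
    const url = `https://huggingface.co/api/datasets?search=${encodeURIComponent(q)}&sort=downloads&direction=-1&limit=100`;
    for (const d of await (await fetch(url)).json()) byId.set(d.id, d); // dedupe by id
  }
  return [...byId.values()]
    .filter((d) => FINANCE_RE.test([d.id, d.description ?? '', ...(d.tags ?? [])].join(' ')))
    .sort((a, b) => (b.downloads ?? 0) - (a.downloads ?? 0))
    .slice(0, 50);
}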

Findings · failure-mode breakdown

Where the documentation breaks down.

  • validation_health: 47 / 50 (94%)
  • completeness: 45 / 50 (90%)
  • size_adequacy: 31 / 50 (62%)
  • format_compliance: 14 / 50 (28%)
  • license_clarity: 11 / 50 (22%)
  • label_density: 0 / 50 (0%)
  • uniqueness: 0 / 50 (0%)

Why validation_health and completeness dominate

The two top failure modes are linked. completeness fails when cardData.annotations_creators is empty or absent. When that field is missing, our derivation defaults label_production to crowdsourced_raw, which fails validation_health. So a single missing field — "who labeled this?" — fails both dimensions. The fix is a one-line cardData edit by the dataset publisher; the absence of that one line fails 90% of the audited datasets. A sketch of the cascade follows.
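
A minimal sketch of that cascade, with simplified stand-in logic (the class mapping below is illustrative and assumes the array form of annotations_creators; it is not the production derivation):

// Sketch: how one missing cardData field fails two dimensions.
function deriveLabelProduction(card) {
  const creators = card.annotations_creators ?? [];
  if (creators.includes('expert-generated')) return 'expert';
  if (creators.includes('crowdsourced')) return 'crowdsourced_qc'; // illustrative mapping
  return 'crowdsourced_raw'; // the default when the field is absent
}

const card = {}; // no annotations_creators declared
const completeness = Boolean(card.annotations_creators?.length);             // false
const validationHealth = deriveLabelProduction(card) !== 'crowdsourced_raw'; // false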

This isn't pedantry. Under EU AI Act Article 10, providers of high-risk AI systems must document training-data provenance, including how labels were produced. SR 11-7 model-risk frameworks treat undocumented label provenance as a material weakness. A finance dataset with no annotation_creators declaration cannot land in an F500 model package without case-by-case legal review.

Why license_clarity matters more than 22% suggests

  • No license declared (9 / 50): cardData.license absent or "unknown" — legally unusable for commercial deployment without case-by-case legal review
  • Restrictive license (2 / 50): CC-BY-NC-SA-3.0 or CC-BY-NC-ND-4.0 — explicitly blocks commercial deployment of derivative model weights

Examples of license-undeclared datasets in the audit set: AdaptLLM/finance-tasks (12,929 downloads), Zhangqingyue127/Multimodal-Stock-Forecasting-Dataset (5,880 downloads), meloqiao/us-stock-data (1,908 downloads). All three are popular enough that they're likely already inside someone's training pipeline. Without a declared license, a buyer's legal team has no documentation to attach when an audit asks "what governs your use of this data?"

Examples of NC/ND-licensed datasets: takala/financial_phrasebank (CC-BY-NC-SA-3.0, 11,120 downloads — a frequently-cited finance sentiment benchmark) and HYdsl/Open-domain_Financial_QA (CC-BY-NC-ND-4.0). Models trained on these datasets cannot be commercialized as a matter of license — but the licenses are easy to miss in cardData if you don't look.

If you have one of these in your stack

Practical steps for an active model package.

If your model risk register includes a HuggingFace finance dataset that appears in the scorecard above with one or more failed dimensions, the recommended sequence — in order of audit-defensibility risk:

  1. License-undeclared (9 datasets) → escalate to legal immediately. No commercial deployment until the dataset publisher declares a license, or you replace with a permissively-licensed alternative. JanosAudran/financial-reports-sec (Apache-2.0) is one drop-in option for SEC-filing-style use cases.
  2. NC/ND-licensed (2 datasets) → models trained on these cannot be commercialized. If the dataset is in production training, you have a license breach to disclose. Replacement is the only path.
  3. Missing annotations_creators (45 datasets) → this is your EU AI Act Article 10 documentation gap. Two options: (a) request the publisher add the field via a HuggingFace community PR — many are responsive; (b) write your own provenance attestation based on the dataset's published paper / README. Either creates a defensible audit trail. (An example of the cardData addition follows this list.)
  4. Missing size_categories (31 datasets) → low-risk, easily mitigable. Document the actual row count yourself in your model package. The audit failure flags it; it doesn't block deployment.
  5. Missing arXiv paper (41 datasets) → not one of the 7 scored dimensions and not a blocker on its own, but combined with missing provenance it's a weakness an auditor will probe. Reasonable mitigation: cite the dataset's published paper if one exists outside HF tags, or document the data source independently.
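
For step 3, the cardData addition a community PR would make is one field in the dataset's README front matter, alongside the size_categories line shown earlier for step 4. The value below is hypothetical; it must match the dataset's actual provenance.

annotations_creators:   # the Article 10 provenance field; value must match reality
  - expert-generated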

None of the above changes the underlying data. They change what your audit team can defend in writing — which, under SR 11-7 model-risk management and EU AI Act conformity assessment, is the deliverable that matters.

The one that passed

What good documentation looks like.

Of the 50 audited, exactly one dataset passed all 7 metadata-derived dimensions: JanosAudran/financial-reports-sec. This is a corpus of SEC 10-K and 10-Q filings, expert-annotated, ~50M items, Apache-2.0 licensed. The cardData is filled in: annotations_creators declares expert-generated, the license is explicit, size_categories is declared, and the format is parquet.

This dataset is a useful baseline. Whether the data quality is actually high — whether splits leak, whether labels are consistent across filings, whether the corpus has temporal drift — Phase 1 cannot answer. It can only certify that the documentation is procurement-grade. That is necessary but not sufficient. Phase 2 will sample-download a representative subset and produce the full 19-dim scorecard for this and 9 other datasets.

Per-dataset scorecard

All 50, by downloads.

Composite is a metadata-derived 0–100 score (proxy-derived scores are capped at 85; 86–100 is reserved for expert-validated entries). Failed dims are listed verbatim. Full per-dim reasons and raw HF API responses are in the JSON output.

Table columns: # · HF dataset id · Downloads · License · Composite · Result · Failed dims.
Rows load live from /tools/output/hf-finance-50-audit.json.

What's next

Phase 2 — sample-based file audit.

Documentation gaps tell you what cannot be cited in a procurement package. They do not tell you whether the underlying data is good. Phase 2 will sample-download ~10K rows from each of 10 representative datasets (covering the spread of composite scores in this audit) and run file-based dimensions (a toy sketch of the contamination check follows the list):

  • Train/test leakage detection — JS-divergence between split distributions, plus exact-overlap checks across declared splits.
  • Benchmark contamination — substring + n-gram overlap against the 40+ public eval suites LQS v3.1 tracks (MMLU, HumanEval, GSM8K, FinanceBench, etc.).
  • Schema, dedup, class-balance, and PII findings from real row content — the things metadata cannot tell you.
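
To give a flavor of the second item, here is a toy n-gram overlap check. The n value, whitespace tokenization, and flagging threshold are illustrative assumptions, not the production detector's parameters.

// Toy contamination check: fraction of a row's 8-grams found in an eval suite.
function ngrams(text, n = 8) {
  const toks = text.toLowerCase().split(/\s+/).filter(Boolean);
  const grams = new Set();
  for (let i = 0; i + n <= toks.length; i++) grams.add(toks.slice(i, i + n).join(' '));
  return grams;
}

function contaminationScore(row, evalCorpus) {
  const rowGrams = ngrams(row);
  if (rowGrams.size === 0) return 0;
  const evalGrams = ngrams(evalCorpus);
  let hits = 0;
  for (const g of rowGrams) if (evalGrams.has(g)) hits += 1;
  return hits / rowGrams.size; // e.g. flag rows scoring above 0.2 (illustrative threshold)
}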

Phase 2 will publish at labelsets.ai/research-hf-finance-50-phase-2 within ~10 days.

Reproducibility

Run this yourself.

The full audit script is open in the LabelSets repository:

Reproducibility — clone and run
git clone https://github.com/labelsets/lqs-scorer.git
cd lqs-scorer
node tools/audit-hf-finance-50.js

# outputs:
#   tools/output/hf-finance-50-audit.json    all 50 scorecards + raw HF responses
#   tools/output/hf-finance-50-summary.md    human-readable report

The script hits the public HuggingFace API only — no auth required. Total run time ~90 seconds. The audit is deterministic given the HF API state at run time; the JSON output records the timestamp and HF responses for replay.

The 7-dim derivation is a port of the production calibration tool used internally to bootstrap the LabelSets calibration corpus from public HF metadata (generation/quality/calibration/hf-bootstrap.js). The same logic that mints proxy entries for our calibration tuning produced the scorecards above.

Cite

BibTeX
@misc{labelsets2026hffinance50,
  title = {49 of 50 Finance Datasets on HuggingFace Fail Procurement-Grade Documentation},
  author = {{LabelSets Research}},
  year = {2026},
  month = {April},
  url = {https://labelsets.ai/research-hf-finance-50},
  note = {Phase 1: metadata audit. Phase 2: sample-based file audit pending.}
}

Auditing your own training data?

If you're at an F500 in a regulated industry and need to audit a training-data corpus for SR 11-7, EU AI Act, FDA, or §1557, we're picking 5–10 design partners for 6 months of LQS Enterprise — free, in exchange for a logo + short case study. No demo, no sales call.