We scored the top 50 finance-related public datasets on the HuggingFace Hub against 7 metadata-derived LQS v3.1 dimensions. 98% failed at least one dimension. 9 had no license declared at all. 90% had no annotation provenance. This is a documentation crisis, not a data-quality finding — yet. Phase 2 (sample-based file analysis) is in progress.
Of the top 50 finance datasets on HuggingFace by downloads (search-deduped across 8 finance-related queries), 49 failed at least one of 7 metadata-derived LQS v3.1 dimensions. The most common failures: validation_health (94%), completeness (90%), size_adequacy (62%).
The most consequential for F500 deployment: 9 datasets had no license declared in their cardData (legally unusable for commercial training without case-by-case legal review), 2 had explicit noncommercial / no-derivatives licenses, and 90% lacked an annotations_creators field — a direct gap against EU AI Act Article 10 documentation requirements.
Only one dataset passed all 7 dimensions: JanosAudran/financial-reports-sec (expert-annotated SEC filings, Apache-2.0, ~50M items).
This is a Phase 1 audit. We scored each of 50 finance datasets against 7 dimensions of LQS v3.1 that can be derived from public HuggingFace metadata alone — no file downloads, no model training. The dimensions, mapped from the 19-dim full LQS spec:
- completeness — does cardData.annotations_creators declare who labeled the data?
- uniqueness — declared deduplication methodology (challenge_split vs minimal_dedup)
- validation_health — label-production class (expert / crowdsourced_qc / official_record / crowdsourced_raw / automated_model)
- size_adequacy — declared size_categories meets task-typical floor (≥10K items)
- format_compliance — published in industry-standard format (parquet / arrow)
- label_density — labels-per-item heuristic from cardData
- license_clarity — license declared, permissive, and not NC/ND

**What this audit cannot detect (yet).** Train/test leakage, benchmark contamination (overlap with MMLU/HumanEval/etc.), oracle disagreement (multi-classifier Fleiss κ), label noise robustness, distributional drift, subgroup-equity gaps, adversarial stability — these are all file-content dimensions in the full 19-dim LQS v3.1 spec. They require sample-based file analysis on the actual dataset, not metadata. Phase 2 (10 representative datasets, sample-based) is in progress and will publish separately.
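Each of the seven checks reduces to a few lines over cardData. A minimal sketch in JavaScript (the audit script's language), assuming a parsed cardData object from the public HF API; the `dedup_method` field, the class mapping, and the `label_density` stand-in are illustrative assumptions, not the production hf-bootstrap.js logic:

```javascript
// Sketch only: field names beyond annotations_creators / license /
// size_categories, the class mapping, and the label_density stand-in
// are assumptions, not the production hf-bootstrap.js implementation.
const PASSING_CLASSES = new Set(["expert", "crowdsourced_qc", "official_record"]);
const BELOW_FLOOR = new Set(["n<1K", "1K<n<10K"]); // HF size_categories below the 10K floor

function labelProduction(card) {
  const creators = card.annotations_creators || [];
  if (creators.includes("expert-generated")) return "expert";
  if (creators.includes("crowdsourced")) return "crowdsourced_qc";
  // A missing field defaults to the weakest class -- one absent line
  // therefore fails both completeness and validation_health.
  return "crowdsourced_raw";
}

function scoreMetadataDims(card, fileFormats = []) {
  return {
    completeness: (card.annotations_creators || []).length > 0,
    uniqueness: Boolean(card.dedup_method), // assumed field for declared dedup methodology
    validation_health: PASSING_CLASSES.has(labelProduction(card)),
    size_adequacy: (card.size_categories || []).some((s) => !BELOW_FLOOR.has(s)),
    format_compliance: fileFormats.some((f) => /parquet|arrow/i.test(f)),
    label_density: (card.task_categories || []).length > 0, // stand-in for the labels-per-item heuristic
    license_clarity: Boolean(card.license) && !/(^|-)(nc|nd)(-|$)/i.test(String(card.license)),
  };
}
```

Note how an empty cardData fails every dimension at once, which is the shape of most of the scorecard below.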
Important caveat on size_adequacy. This dimension fails when cardData.size_categories is missing or below the 10K-item floor. It does not mean the underlying dataset is too small — many of the audited datasets likely contain plenty of rows. It means the declared size in HF metadata is missing, which prevents a procurement reviewer from sizing the dataset without downloading it. The fix is a one-line cardData edit by the publisher; the audit cannot distinguish "small dataset" from "undocumented size."
Discovery query strategy: union of 8 search queries — finance, financial, stock, trading, banking, sec-edgar, fintech, earnings — sorted by HF download count, deduplicated by dataset id, filtered to those whose name/description/tags match a finance-domain regex, top 50 retained. The exact finance-domain regex and full search pagination are in the script.
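The dedupe-and-filter step can be sketched in a few lines; the regex below is a stand-in, since the real finance-domain regex lives in the script:

```javascript
// Stand-in for the script's finance-domain regex and pagination.
const FINANCE_RE = /financ|stock|trading|bank|sec|earnings|fintech/i;

// results: [{ id, downloads, description? }] from the unioned queries.
function top50(results) {
  const seen = new Set();
  return results
    .sort((a, b) => b.downloads - a.downloads)            // HF download count, descending
    .filter((d) => !seen.has(d.id) && seen.add(d.id))      // dedupe by dataset id
    .filter((d) => FINANCE_RE.test(d.id + " " + (d.description || "")))
    .slice(0, 50);                                         // top 50 retained
}
```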
## validation_health and completeness dominate

The two top failure modes are linked. completeness fails when cardData.annotations_creators is empty or absent. When that field is missing, our derivation defaults label_production to crowdsourced_raw, which fails validation_health. So a single missing field — "who labeled this?" — fails both dimensions. The fix is a one-line cardData edit by the dataset publisher; the absence of that one line affects 90% of the audit.
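For a publisher, that one-line edit lives in the YAML front matter of the dataset card (the README.md on the Hub); the value comes from the standard HF annotations_creators taxonomy:

```yaml
# Dataset card front matter (README.md on the Hub).
# Adding this one field addresses both completeness and validation_health.
annotations_creators:
  - expert-generated
```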
This isn't pedantry. Under EU AI Act Article 10, providers of high-risk AI systems must document training-data provenance, including how labels were produced. SR 11-7 model-risk frameworks treat undocumented label provenance as a material weakness. A finance dataset with no annotation_creators declaration cannot land in an F500 model package without case-by-case legal review.
## license_clarity matters more than 22% suggests

Examples of license-undeclared datasets in the audit set: AdaptLLM/finance-tasks (12,929 downloads), Zhangqingyue127/Multimodal-Stock-Forecasting-Dataset (5,880 downloads), meloqiao/us-stock-data (1,908 downloads). All three are popular enough that they're likely already inside someone's training pipeline. Without a declared license, a buyer's legal team has no documentation to attach when an audit asks "what governs your use of this data?"
Examples of NC/ND-licensed datasets: takala/financial_phrasebank (CC-BY-NC-SA-3.0, 11,120 downloads — a frequently-cited finance sentiment benchmark) and HYdsl/Open-domain_Financial_QA (CC-BY-NC-ND-4.0). Models trained on these datasets cannot be commercialized as a matter of license — but the licenses are easy to miss in cardData if you don't look.
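A buyer-side guard for this is a few lines; a sketch assuming HF's lowercase license slugs (e.g. apache-2.0, cc-by-nc-sa-3.0), not the audit's exact rule:

```javascript
// Returns false for undeclared licenses and for NC / ND license slugs.
// Slug matching is an assumption; real checks should consult legal review.
function commerciallyUsable(license) {
  if (!license) return false; // undeclared: nothing to attach to an audit package
  return !/(^|-)(nc|nd)(-|$)/i.test(license);
}
```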
If your model risk register includes a HuggingFace finance dataset that appears in the scorecard above with one or more failed dimensions, the recommended sequence — in order of audit-defensibility risk:
- License undeclared or NC/ND (11 datasets) → replace the dataset. JanosAudran/financial-reports-sec (Apache-2.0) is one drop-in option for SEC-filing-style use cases.
- annotations_creators missing (45 datasets) → this is your EU AI Act Article 10 documentation gap. Two options: (a) request the publisher add the field via a HuggingFace community PR — many are responsive; (b) write your own provenance attestation based on the dataset's published paper / README. Either creates a defensible audit trail.
- size_categories missing (31 datasets) → low-risk, easily mitigable. Document the actual row count yourself in your model package. The audit failure flags it; it doesn't block deployment.

None of the above changes the underlying data. They change what your audit team can defend in writing — which, under SR 11-7 model-risk management and EU AI Act conformity assessment, is the deliverable that matters.
Of 50 audited, exactly one dataset passed all 7 metadata-derived dimensions: JanosAudran/financial-reports-sec. This is a corpus of SEC 10-K and 10-Q filings, expert-annotated, ~50M items, Apache-2.0 licensed. The cardData is filled in: annotations_creators declares expert-generated, license is explicit, size_categories is declared, the format is parquet.
This dataset is a useful baseline. Whether the data quality is actually high — whether splits leak, whether labels are consistent across filings, whether the corpus has temporal drift — Phase 1 cannot answer. It can only certify that the documentation is procurement-grade. That is necessary but not sufficient. Phase 2 will sample-download a representative subset and produce the full 19-dim scorecard for this and 9 other datasets.
Composite is a metadata-derived 0–100 score, capped at 85 for proxy-derived entries (86–100 is reserved for expert-validated). Failed dims are listed verbatim. Full per-dim reasons and the raw HF API responses are in the JSON output.
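A minimal sketch of the cap logic, assuming equal weights over the 7 booleans (the production weighting may differ):

```javascript
// dims: { completeness: bool, ..., license_clarity: bool }
function composite(dims) {
  const vals = Object.values(dims);
  const raw = Math.round((vals.filter(Boolean).length / vals.length) * 100);
  return Math.min(raw, 85); // proxy-derived cap; 86-100 reserved for expert-validated
}
```

Under this sketch, even a dataset passing all 7 metadata dimensions tops out at 85 until expert validation raises it.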
| # | HF dataset id | Downloads | License | Composite | Result | Failed dims |
|---|---|---|---|---|---|---|
*Table rows are loaded live from `/tools/output/hf-finance-50-audit.json`.*
Documentation gaps tell you what cannot be cited in a procurement package. They do not tell you whether the underlying data is good. Phase 2 will sample-download ~10K rows from each of 10 representative datasets (covering the spread of composite scores in this audit) and run the file-based dimensions Phase 1 cannot reach: train/test leakage, benchmark contamination, oracle disagreement (multi-classifier Fleiss κ), label-noise robustness, distributional drift, subgroup-equity gaps, and adversarial stability.
Phase 2 will publish at labelsets.ai/research-hf-finance-50-phase-2 within ~10 days.
The full audit script is open in the LabelSets repository:
The script hits the public HuggingFace API only — no auth required. Total run time ~90 seconds. The audit is deterministic given the HF API state at run time; the JSON output records the timestamp and HF responses for replay.
The 7-dim derivation is a port of the production calibration tool used internally to bootstrap the LabelSets calibration corpus from public HF metadata (generation/quality/calibration/hf-bootstrap.js). The same logic that mints proxy entries for our calibration tune is what produced the scorecards above.
If you're at an F500 in a regulated industry and need to audit a training-data corpus for SR 11-7, EU AI Act, FDA, or §1557, we're picking 5–10 design partners for 6 months of LQS Enterprise — free, in exchange for a logo + short case study. No demo, no sales call.