LQS v2.0 · Published Methodology

LabelSets Quality Score

The proprietary LabelSets scoring framework — 14 dimensions, 5 pillars, live ML model runs. Every dataset is evaluated automatically at upload. No self-reporting. No estimates.

14 dimensions · 5 pillars · 100% file-grounded · 12 format types · zero self-reporting
Tier 1 · Structural Integrity · 4 dimensions · 35%
Tier 2 · Annotation Quality · 4 dimensions · 30%
Tier 3 · Statistical Health · 3 dimensions · 20%
Tier 4 · Training Fitness · 2 dimensions · 10%
Tier 5 · Provenance · 1 dimension · 5%
Principles

What makes LQS different

Every LQS score is computed from direct file inspection at upload time. No dimension is estimated from metadata, self-reported by sellers, or inferred without reading the actual data. This page describes what LQS measures and why — the precise algorithms, model configurations, and internal thresholds are proprietary to LabelSets and subject to continuous improvement.

Proprietary by design. The LQS evaluation pipeline represents years of ML engineering — format-specific analysis across 12 data types, real model fine-tuning runs, and a scoring architecture calibrated against real-world training outcomes. The framework described here is what buyers and sellers need to understand their scores. The underlying implementation is not published.
Zero self-reporting. Sellers cannot influence their LQS by describing their dataset as "high quality." Every dimension is computed from the actual uploaded file. Provenance Quality (Tier 5, 5%) is the only dimension that uses metadata — and it scores the completeness of that metadata, not its claims.
Formula

Composite score

The composite LQS is a proprietary weighted sum of all 14 dimension scores. Each dimension is normalized to a 0–100 scale and contributes to its pillar score, which in turn contributes to the final composite. Pillar weights reflect their relative impact on real-world training outcomes and are calibrated against empirical ML benchmark data.

Pillar | Dimensions | Weight
Structural Integrity | 4 | 35%
Annotation Quality | 4 | 30%
Statistical Health | 3 | 20%
Training Fitness | 2 | 10%
Provenance | 1 | 5%
90–100 · Platinum
75–89 · Gold
60–74 · Silver
0–59 · Bronze
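The mechanics of a weighted composite with a tier lookup can be sketched in a few lines. The pillar weights and score bands below come from the tables above; the per-dimension normalization and the exact aggregation LQS uses are proprietary, so treat this as an illustration only:

```python
# Illustrative sketch only: the real LQS aggregation and per-dimension
# normalization are proprietary. Pillar weights are from the table above.

PILLAR_WEIGHTS = {
    "structural_integrity": 0.35,
    "annotation_quality": 0.30,
    "statistical_health": 0.20,
    "training_fitness": 0.10,
    "provenance": 0.05,
}

# Floor of each score band, highest first.
TIER_FLOORS = [(90, "Platinum"), (75, "Gold"), (60, "Silver"), (0, "Bronze")]


def composite_lqs(pillar_scores: dict) -> float:
    """Weighted sum of pillar scores, each already normalized to 0-100."""
    return sum(PILLAR_WEIGHTS[name] * score for name, score in pillar_scores.items())


def tier(score: float) -> str:
    """Map a composite score to its quality tier."""
    for floor, name in TIER_FLOORS:
        if score >= floor:
            return name
    return "Bronze"
```

Because the weights sum to 1.0, a dataset scoring 80 on every pillar composes to exactly 80 and lands in Gold.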
Tier 1 · 35% of composite

Structural Integrity

These four dimensions measure whether the data is structurally sound and ready to be loaded — independent of what the data contains or means.

1. Completeness 12%

Measures the fraction of fields that are populated with non-null, non-empty values. Evaluated differently per format — null cell rate for tabular data, unannotated image rate for vision datasets, missing required fields for JSONL fine-tuning records. A dataset with high completeness has no gaps that would silently degrade training.
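For records that parse into field/value maps (e.g. JSONL rows), the core computation is a simple populated-field ratio. This is a minimal sketch; the per-format handling described above (null cells for tabular, unannotated images for vision) would plug in its own notion of "populated":

```python
def completeness(records: list, required_fields: list) -> float:
    """Fraction of required fields holding non-null, non-empty values.

    Sketch for dict-shaped records; format-specific pipelines would
    substitute their own emptiness checks.
    """
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "", [], {})
    )
    return filled / total
```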

2. Uniqueness 8%

Detects exact duplicates using direct comparison across the full file. For NLP datasets, deduplication operates on the primary text column. For vision datasets, filenames and annotation hashes are compared across the full archive. Duplicate records inflate dataset size without improving model generalization.
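Exact-duplicate detection over a primary text column reduces to hashing each record and counting repeats. A minimal sketch, assuming the text column has already been extracted:

```python
import hashlib


def duplicate_rate(texts: list) -> float:
    """Exact-duplicate rate over a primary text column via content hashing."""
    if not texts:
        return 0.0
    seen = set()
    duplicates = 0
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(texts)
```

Hashing keeps memory proportional to the number of unique records rather than their total size, which matters for large archives.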

3. Schema Validity 8%

Checks that every row conforms to the schema inferred from the file — same column count, consistent types, no parse errors. Detects schema drift and encoding issues that are invisible to file-level validators but will cause training failures or silent data corruption.
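For a CSV file, the simplest conformance signal is the fraction of rows whose column count matches the header. This sketch checks width only; a fuller implementation would also infer and enforce per-column types:

```python
import csv
import io


def schema_validity(csv_text: str) -> float:
    """Fraction of body rows matching the header's column count.

    Width-only sketch; type consistency checks would extend this.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return 0.0
    width = len(rows[0])
    body = rows[1:]
    valid = sum(1 for row in body if len(row) == width)
    return valid / len(body)
```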

4. Format Integrity 7%

Validates adherence to the detected file format specification. For vision datasets, this includes checking for required configuration files (classes.txt, data.yaml, category mappings). For JSONL, all records must parse as valid JSON with consistent required keys. A high format integrity score means a framework like PyTorch or Ultralytics can load the dataset without pre-processing.
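For the JSONL case, the check amounts to parsing every line and verifying a consistent set of required keys. A minimal sketch (the key names are assumptions, not LQS's actual schema):

```python
import json


def jsonl_integrity(lines: list, required_keys: set) -> float:
    """Fraction of lines that parse as JSON objects with all required keys."""
    if not lines:
        return 0.0
    valid = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable line counts against the score
        if isinstance(record, dict) and required_keys.issubset(record):
            valid += 1
    return valid / len(lines)
```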

Tier 2 · 30% of composite

Annotation Quality

These four dimensions measure the quality and consistency of the labels themselves — not just whether they exist, but whether they are accurate, dense, and consistently applied.

5. Label Accuracy 10%

Measures the rate of structurally invalid annotations — malformed bounding boxes, out-of-range coordinates, invalid JSON, missing required annotation fields. This captures a lower bound on label error rate. Semantic errors (e.g. a correctly-formatted box with the wrong class label) are addressed separately by Label Error Estimate in Tier 3.
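As a concrete example of a structural validity check, a YOLO TXT record is valid only if it has exactly five fields, an in-range class index, and four normalized coordinates in [0, 1]. A sketch of that single-record check:

```python
def valid_yolo_line(line: str, num_classes: int) -> bool:
    """Structural check for one YOLO TXT record: 'class cx cy w h'.

    Validates field count, class index range, and normalized coordinates;
    it cannot tell whether the class label is semantically correct.
    """
    parts = line.split()
    if len(parts) != 5:
        return False
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return False
    return 0 <= class_id < num_classes and all(0.0 <= c <= 1.0 for c in coords)
```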

6. Label Density 8%

Measures annotation richness per sample. For vision datasets, this is the mean number of annotations per image. For NLP and JSONL fine-tuning, this is mean response or text character length. Sparse annotations typically produce weaker detectors; very short responses produce weaker instruction-tuned models.

7. Annotation Consistency 7%

Measures how uniformly annotations are applied across the dataset. Low variance in annotation density per sample indicates labelers followed consistent guidelines. High variance may indicate mixed labeling pools, inconsistent difficulty standards, or annotation errors concentrated in certain batches — all of which produce noisy training signal.

Why CV matters: CV is the coefficient of variation of per-sample annotation density (standard deviation divided by mean). A dataset where 10% of images have 20+ annotations and 90% have 1–2 can still show a reasonable mean annotation density, but the high CV reveals that labelers applied different standards to different subsets. Models trained on such data learn inconsistent ground truth.
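The coefficient of variation itself is one line of arithmetic over the per-image annotation counts:

```python
import statistics


def density_cv(annotations_per_image: list) -> float:
    """Coefficient of variation (stdev / mean) of per-image annotation counts."""
    if not annotations_per_image:
        return 0.0
    mean = statistics.mean(annotations_per_image)
    if mean == 0:
        return 0.0
    return statistics.pstdev(annotations_per_image) / mean
```

A perfectly uniform dataset has CV = 0; the skewed 10%/90% scenario above yields a CV near 2 even though its mean looks ordinary.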
8. Class Distribution 5%

Measures class balance using an entropy-based scoring approach. A perfectly uniform distribution scores highest; severe imbalance reduces the score. Also reports the imbalance ratio and identifies rare classes. Heavily imbalanced datasets require careful over/undersampling strategies that buyers need to know about upfront.
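One common entropy-based formulation, which this sketch assumes (the actual LQS scoring curve is proprietary), normalizes Shannon entropy by its maximum for the observed class count, so a uniform distribution scores 100:

```python
import math
from collections import Counter


def balance_score(labels: list) -> float:
    """Normalized Shannon entropy of the class distribution, on 0-100.

    Illustrative formulation; a single-class dataset scores 0 here.
    """
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 100.0 * entropy / math.log2(k)
```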

Tier 3 · 20% of composite

Statistical Health

These three dimensions assess the statistical properties of the data that determine how well a model trained on it will generalize.

9. Distribution Health 8%

Measures whether the distribution of key measurable properties — bounding box sizes, text lengths, null rates — is healthy and artifact-free. Degenerate distributions (tiny bboxes, empty text fields, extreme null clustering) indicate collection or annotation issues that will harm generalization even when individual samples appear valid.

10. Label Error Estimate 7%

A composite estimate of the fraction of labels that are likely incorrect, derived from multiple measurable structural signals: invalid annotation rate, duplicate rate, empty label rate, and class imbalance severity. This is a conservative lower bound — it cannot detect semantically incorrect labels (e.g. a correctly-formatted box labeling a car as a truck), but captures all structurally measurable noise sources.

Limitation: Label Error Estimate captures only structurally measurable error signals; semantically incorrect labels (e.g., "cat" labeled as "dog" in a correctly-formatted YOLO file) pass undetected. Buyers who need the highest model accuracy should still conduct an independent label audit on a sample.
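One way to combine such signals into a conservative floor, sketched here under the assumption that each input is an independently measured rate (LQS's actual combination, including the class-imbalance term, is proprietary):

```python
def label_error_floor(invalid_rate: float,
                      duplicate_rate: float,
                      empty_label_rate: float) -> float:
    """Conservative lower bound on label error from structural signals alone.

    Taking the max rather than the sum avoids double-counting records that
    trigger several signals at once: if 5% of annotations are invalid, at
    least 5% of labels carry some structural problem.
    """
    return max(invalid_rate, duplicate_rate, empty_label_rate)
```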
11. Signal Strength 5%

Estimates how learnable the dataset is — whether the input features contain sufficient signal to distinguish between target classes. Derived from class count, label entropy, feature richness, and vocabulary diversity across format types. A high signal strength score means the dataset likely has enough discriminative variation for a model to form useful decision boundaries.

Tier 4 · 10% of composite

Training Fitness

Two dimensions that assess how ready the dataset is for production model training — sample volume and semantic diversity.

12. Size Adequacy 5%

Measures sample count against task-specific thresholds derived from ML literature and practitioner benchmarks. Each dataset type has a minimum viable size and a production-ready target. Scores scale with how far above or below those thresholds the dataset falls — ensuring buyers understand whether a dataset is sufficient for their training workload.
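A plausible shape for such a scoring curve, assuming two thresholds per dataset type (the threshold values and the piecewise-linear curve here are illustrative, not LQS's calibrated ones):

```python
def size_adequacy(n: int, min_viable: int, production_target: int) -> float:
    """Scale sample count to 0-100 between two task-specific thresholds.

    Illustrative piecewise-linear curve: below minimum viable size maps
    to 0-50, between viable and production-ready maps to 50-100.
    """
    if n <= min_viable:
        return 50.0 * n / min_viable
    if n >= production_target:
        return 100.0
    fraction = (n - min_viable) / (production_target - min_viable)
    return 50.0 + 50.0 * fraction
```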

13. Diversity Score 5%

Estimates the semantic variety within the dataset using vocabulary breadth, class distribution entropy, and feature richness — computed per format type. A low diversity score signals a narrow dataset that covers limited scenarios and is likely to produce models that fail to generalize beyond the training distribution.
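Vocabulary breadth for text formats is often summarized as a type-token ratio, which this sketch uses as a stand-in for the fuller per-format computation:

```python
def vocab_diversity(texts: list) -> float:
    """Type-token ratio over whitespace tokens, lowercased.

    Crude proxy for vocabulary breadth: unique tokens / total tokens.
    """
    tokens = [token.lower() for text in texts for token in text.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

Note that the raw type-token ratio shrinks as corpora grow, so a production scorer would normalize for sample count before comparing datasets.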

Tier 5 · 5% of composite

Provenance

The single provenance dimension scores the completeness of documentation — not the claims made. A seller cannot improve this by writing "this is excellent data"; they improve it by filling in description, tags, license type, and data source fields.

14. Provenance Quality 5%

Scores the completeness of documentation fields — title, description, tags, license type, data source, and collection method. Sellers improve this score by filling in fields, not by writing favorable descriptions. The scoring is based entirely on presence and minimum length thresholds, not content claims.

Note: Provenance is intentionally weighted at only 5% to prevent documentation-gaming from masking data quality issues. A dataset with perfect provenance documentation but poor structural quality will not score above Gold tier.
Format Support

Supported formats

LQS v2.0 has format-specific analysis pipelines for 12 data types. Each pipeline extracts the signals most meaningful for that format.

Format | Type | Key signals extracted
YOLO TXT | Vision / detection | annotations/image, coord validity, class frequency, bbox area distribution, density CV
COCO JSON | Vision / detection | image-annotation linking, category coverage, bbox validation
Pascal VOC XML | Vision / detection | XML parse rate, bbox validity, class names, density CV
KITTI TXT | Vision / 3D detection | 15-field format, 3D bbox parameters
LabelMe JSON | Vision / segmentation | polygon validity, class coverage
Image Folder | Vision / classification | class count (folder names), per-class sample count, balance
CSV / TSV | Tabular / NLP | null rate, schema consistency, label distribution, text length distribution, vocab diversity
JSONL | Fine-tuning / NLP | parse errors, field coverage, response length distribution, instruction diversity
Parquet | Tabular | columnar encoding, null rate, duplicate rate, schema
Apache Arrow | Tabular | schema validation, null rate, type compliance
SQLite | Tabular | table structure, row count, null rate
HDF5 | Scientific / arrays | dataset shape, dtype validation, fill values
Versioning

Version history

Version | Date | Changes
v2.0 | 2026-04-13 | Expanded to 14 dimensions across 5 pillars. Added empirical ML model runs for trainability scoring. Enhanced statistical health analysis and annotation consistency evaluation. Introduced tier system (Platinum / Gold / Silver / Bronze).
v1.0 | 2026-04-07 | Initial 7-dimension system covering structural and annotation fundamentals.

Each dataset record stores the LQS version used to compute its scores. Datasets scored with v1.0 retain their original scores; they are re-scored to v2.0 automatically on next upload.

Get your dataset scored against LQS v2.0

Free quality audit →

No account required · Results in 60 seconds