The proprietary LabelSets scoring framework — 14 dimensions, 5 pillars, live ML model runs. Every dataset is evaluated automatically at upload. No self-reporting. No estimates.
Every LQS score is computed from direct file inspection at upload time. No dimension is estimated from metadata, self-reported by sellers, or inferred without reading the actual data. This page describes what LQS measures and why — the precise algorithms, model configurations, and internal thresholds are proprietary to LabelSets and subject to continuous improvement.
The composite LQS is a proprietary weighted sum of all 14 dimension scores. Each dimension is normalized to a 0–100 scale and contributes to its pillar score, which in turn contributes to the final composite. Pillar weights reflect their relative impact on real-world training outcomes and are calibrated against empirical ML benchmark data.
| Pillar | Dimensions | Weight |
|---|---|---|
| Structural Integrity | 4 | 35% |
| Annotation Quality | 4 | 30% |
| Statistical Health | 3 | 20% |
| Training Fitness | 2 | 10% |
| Provenance | 1 | 5% |
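As an illustrative sketch of how the pillar weights above combine, the composite can be modeled as a weighted sum of pillar scores. The per-dimension weights within each pillar are proprietary, so this sketch assumes each pillar score is the plain mean of its dimension scores; that averaging step is an assumption, not the published algorithm.

```python
# Illustrative only: real per-dimension weights are proprietary.
# Assumes dimension scores are already normalized to 0-100 and that a
# pillar score is the mean of its dimensions (an assumption).
# Pillar weights are taken from the table above.

PILLAR_WEIGHTS = {
    "structural_integrity": 0.35,
    "annotation_quality": 0.30,
    "statistical_health": 0.20,
    "training_fitness": 0.10,
    "provenance": 0.05,
}

def composite_lqs(dimension_scores: dict[str, list[float]]) -> float:
    """dimension_scores maps pillar name -> list of 0-100 dimension scores."""
    total = 0.0
    for pillar, weight in PILLAR_WEIGHTS.items():
        scores = dimension_scores[pillar]
        pillar_score = sum(scores) / len(scores)  # assumed mean within pillar
        total += weight * pillar_score
    return round(total, 1)
```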
These four dimensions measure whether the data is structurally sound and ready to be loaded — independent of what the data contains or means.
Measures the fraction of fields that are populated with non-null, non-empty values. Evaluated differently per format — null cell rate for tabular data, unannotated image rate for vision datasets, missing required fields for JSONL fine-tuning records. A dataset with high completeness has no gaps that would silently degrade training.
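A minimal sketch of the tabular case, a null-cell rate over a CSV. The set of tokens treated as null is an assumption; the exact null conventions LQS recognizes are not published.

```python
import csv
import io

def tabular_completeness(csv_text: str) -> float:
    """Fraction of data cells that are non-null/non-empty, scaled to 0-100.
    Treating "", "null", "na", "nan", "none" as missing is an assumed
    convention, not LabelSets' exact rule."""
    null_tokens = {"", "null", "na", "nan", "none"}
    rows = list(csv.reader(io.StringIO(csv_text)))
    cells = [c for row in rows[1:] for c in row]  # skip the header row
    if not cells:
        return 0.0
    filled = sum(1 for c in cells if c.strip().lower() not in null_tokens)
    return 100.0 * filled / len(cells)
```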
Detects exact duplicates using direct comparison across the full file. For NLP datasets, deduplication operates on the primary text column. For vision datasets, filenames and annotation hashes are compared across the full archive. Duplicate records inflate dataset size without improving model generalization.
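A sketch of exact-duplicate detection over an NLP dataset's primary text column. Hashing each record is an implementation convenience here; any exact-match comparison gives the same result.

```python
import hashlib

def duplicate_rate(texts: list[str]) -> float:
    """Share of records whose primary text exactly duplicates an earlier
    record, as a percentage of all records."""
    seen: set[str] = set()
    dups = 0
    for t in texts:
        digest = hashlib.sha256(t.encode("utf-8")).hexdigest()
        if digest in seen:
            dups += 1
        else:
            seen.add(digest)
    return 100.0 * dups / len(texts) if texts else 0.0
```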
Checks that every row conforms to the schema inferred from the file — same column count, consistent types, no parse errors. Detects schema drift and encoding issues that are invisible to file-level validators but will cause training failures or silent data corruption.
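A simplified sketch of row-level schema checking: infer a coarse type per column from the first row, then count rows that match both the column count and those types. Real schema inference is more involved; this only illustrates the principle.

```python
def schema_consistency(rows: list[list[str]]) -> float:
    """Percent of rows matching the schema inferred from the first row:
    same column count and, per column, the same coarse type
    (int-like, float-like, or text). A simplified stand-in."""
    def coarse_type(value: str) -> str:
        try:
            int(value)
            return "int"
        except ValueError:
            pass
        try:
            float(value)
            return "float"
        except ValueError:
            return "text"

    if not rows:
        return 0.0
    schema = [coarse_type(v) for v in rows[0]]
    ok = sum(
        1 for row in rows
        if len(row) == len(schema)
        and all(coarse_type(v) == t for v, t in zip(row, schema))
    )
    return 100.0 * ok / len(rows)
```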
Validates adherence to the detected file format specification. For vision datasets, this includes checking for required configuration files (classes.txt, data.yaml, category mappings). For JSONL, all records must parse as valid JSON with consistent required keys. A high format integrity score means a framework like PyTorch or Ultralytics can load the dataset without pre-processing.
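For the JSONL case, a hedged sketch of format validation: every line must parse as a JSON object carrying a consistent set of required keys. The key names below (`instruction`, `response`) are assumptions that vary by dataset.

```python
import json

REQUIRED_KEYS = {"instruction", "response"}  # assumed schema, varies by dataset

def jsonl_format_integrity(lines: list[str]) -> float:
    """Percent of JSONL records that parse as JSON objects and carry
    all required keys."""
    valid = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and REQUIRED_KEYS <= record.keys():
            valid += 1
    return 100.0 * valid / len(lines) if lines else 0.0
```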
These four dimensions measure the quality and consistency of the labels themselves — not just whether they exist, but whether they are accurate, dense, and consistently applied.
Measures the rate of structurally invalid annotations — malformed bounding boxes, out-of-range coordinates, invalid JSON, missing required annotation fields. This captures a lower bound on label error rate. Semantic errors (e.g. a correctly-formatted box with the wrong class label) are addressed separately by the Label Error Estimate dimension under the Statistical Health pillar.
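A sketch of structural validation for one concrete format, YOLO TXT labels: five whitespace-separated fields, an in-range class id, and normalized coordinates. The exact checks LQS applies per format are proprietary; this illustrates the kind of rule involved.

```python
def yolo_invalid_rate(label_lines: list[str], num_classes: int) -> float:
    """Percent of YOLO-format annotation lines that are structurally
    invalid: wrong field count, non-numeric values, class id out of
    range, or normalized coords/sizes outside their valid range."""
    def is_valid(line: str) -> bool:
        parts = line.split()
        if len(parts) != 5:
            return False
        try:
            cls = int(parts[0])
            cx, cy, w, h = map(float, parts[1:])
        except ValueError:
            return False
        return (
            0 <= cls < num_classes
            and 0.0 <= cx <= 1.0 and 0.0 <= cy <= 1.0
            and 0.0 < w <= 1.0 and 0.0 < h <= 1.0
        )

    if not label_lines:
        return 0.0
    invalid = sum(1 for ln in label_lines if not is_valid(ln))
    return 100.0 * invalid / len(label_lines)
```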
Measures annotation richness per sample. For vision datasets, this is the mean number of annotations per image. For NLP and JSONL fine-tuning, this is mean response or text character length. Sparse annotations typically produce weaker detectors; very short responses produce weaker instruction-tuned models.
Measures how uniformly annotations are applied across the dataset. Low variance in annotation density per sample indicates labelers followed consistent guidelines. High variance may indicate mixed labeling pools, inconsistent difficulty standards, or annotation errors concentrated in certain batches — all of which produce noisy training signal.
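One common way to express uniformity, used here as an assumed stand-in for the internal metric, is the coefficient of variation (stdev / mean) of per-sample annotation counts, mapped linearly onto 0-100.

```python
from statistics import mean, pstdev

def density_consistency(annotations_per_sample: list[int]) -> float:
    """Map the coefficient of variation of per-sample annotation counts
    to a 0-100 score: CV of 0 (perfectly uniform) scores 100. The
    linear mapping and the CV cap of 1.0 are illustrative assumptions."""
    m = mean(annotations_per_sample)
    if m == 0:
        return 0.0
    cv = pstdev(annotations_per_sample) / m
    return max(0.0, 100.0 * (1.0 - min(cv, 1.0)))
```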
Measures class balance using an entropy-based scoring approach. A perfectly uniform distribution scores highest; severe imbalance reduces the score. Also reports the imbalance ratio and identifies rare classes. Heavily imbalanced datasets require careful over/undersampling strategies that buyers need to know about upfront.
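An entropy-based balance score can be sketched as normalized Shannon entropy of the class distribution: a uniform distribution scores 100 and a fully collapsed one scores 0. The normalization and edge-case handling here are assumptions.

```python
import math

def class_balance_score(class_counts: dict[str, int]) -> float:
    """Normalized Shannon entropy of the class distribution, scaled to
    0-100. Datasets with fewer than two classes score 0 here by
    convention (an assumption, not the published rule)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum(
        (c / total) * math.log2(c / total)
        for c in class_counts.values() if c > 0
    )
    return 100.0 * entropy / math.log2(k)  # divide by max possible entropy
```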
These three dimensions assess the statistical properties of the data that determine how well a model trained on it will generalize.
Measures whether the distribution of key measurable properties — bounding box sizes, text lengths, null rates — is healthy and artifact-free. Degenerate distributions (tiny bboxes, empty text fields, extreme null clustering) indicate collection or annotation issues that will harm generalization even when individual samples appear valid.
A composite estimate of the fraction of labels that are likely incorrect, derived from multiple measurable structural signals: invalid annotation rate, duplicate rate, empty label rate, and class imbalance severity. This is a conservative lower bound — it cannot detect semantically incorrect labels (e.g. a correctly-formatted box labeling a car as a truck), but captures all structurally measurable noise sources.
Estimates how learnable the dataset is — whether the input features contain sufficient signal to distinguish between target classes. Derived from class count, label entropy, feature richness, and vocabulary diversity across format types. A high signal strength score means the dataset likely has enough discriminative variation for a model to form useful decision boundaries.
Two dimensions that assess how ready the dataset is for production model training — sample volume and semantic diversity.
Measures sample count against task-specific thresholds derived from ML literature and practitioner benchmarks. Each dataset type has a minimum viable size and a production-ready target. Scores scale with how far above or below those thresholds the dataset falls — ensuring buyers understand whether a dataset is sufficient for their training workload.
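A hedged sketch of threshold-based size scoring with a piecewise-linear curve: 50 at an assumed minimum viable size, 100 at an assumed production target. Both threshold values and the curve shape are illustrative, not the published calibration.

```python
def size_score(n_samples: int, minimum_viable: int, production_target: int) -> float:
    """Scale sample count against assumed task thresholds: 0 at zero
    samples, 50 at the minimum viable size, 100 at the production
    target, linear in between (illustrative shape only)."""
    if n_samples >= production_target:
        return 100.0
    if n_samples >= minimum_viable:
        span = production_target - minimum_viable
        return 50.0 + 50.0 * (n_samples - minimum_viable) / span
    return 50.0 * n_samples / minimum_viable
```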
Estimates the semantic variety within the dataset using vocabulary breadth, class distribution entropy, and feature richness — computed per format type. A low diversity score signals a narrow dataset that covers limited scenarios and is likely to produce models that fail to generalize beyond the training distribution.
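One measurable ingredient of diversity, vocabulary breadth, can be sketched as a type-token ratio. TTR is a stand-in here; the signals actually combined, and their weights, are proprietary.

```python
def vocab_diversity(texts: list[str]) -> float:
    """Type-token ratio across all texts, scaled to 0-100: the number
    of distinct lowercase tokens divided by total tokens. One common
    vocabulary-breadth signal, used here for illustration."""
    tokens = [tok for t in texts for tok in t.lower().split()]
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)
```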
The single provenance dimension scores the completeness of documentation — not the claims made. A seller cannot improve this by writing "this is excellent data"; they improve it by filling in description, tags, license type, and data source fields.
Scores the completeness of documentation fields — title, description, tags, license type, data source, and collection method. The score is based entirely on field presence and minimum length thresholds, never on the content of the claims themselves.
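A sketch of presence-plus-minimum-length scoring over documentation fields. The field list and length thresholds below are assumptions for illustration, not the production rules.

```python
# Assumed field names and minimum lengths -- illustrative only.
METADATA_RULES = {
    "title": 5,
    "description": 50,
    "tags": 1,
    "license_type": 1,
    "data_source": 10,
    "collection_method": 10,
}

def metadata_completeness(metadata: dict[str, str]) -> float:
    """Score presence + minimum length of each documentation field.
    Content is never judged, only whether the field is filled in."""
    filled = sum(
        1 for field, min_len in METADATA_RULES.items()
        if len(metadata.get(field, "").strip()) >= min_len
    )
    return 100.0 * filled / len(METADATA_RULES)
```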
LQS v2.0 has format-specific analysis pipelines for 12 data types. Each pipeline extracts the signals most meaningful for that format.
| Format | Type | Key signals extracted |
|---|---|---|
| YOLO TXT | Vision / detection | annotations/image, coord validity, class frequency, bbox area distribution, density CV |
| COCO JSON | Vision / detection | image-annotation linking, category coverage, bbox validation |
| Pascal VOC XML | Vision / detection | XML parse rate, bbox validity, class names, density CV |
| KITTI TXT | Vision / 3D detection | 15-field format, 3D bbox parameters |
| LabelMe JSON | Vision / segmentation | polygon validity, class coverage |
| Image Folder | Vision / classification | class count (folder names), per-class sample count, balance |
| CSV / TSV | Tabular / NLP | null rate, schema consistency, label distribution, text length distribution, vocab diversity |
| JSONL | Fine-tuning / NLP | parse errors, field coverage, response length distribution, instruction diversity |
| Parquet | Tabular | columnar encoding, null rate, duplicate rate, schema |
| Apache Arrow | Tabular | schema validation, null rate, type compliance |
| SQLite | Tabular | table structure, row count, null rate |
| HDF5 | Scientific / arrays | dataset shape, dtype validation, fill values |
| Version | Date | Changes |
|---|---|---|
| v2.0 | 2026-04-13 | Expanded to 14 dimensions across 5 pillars. Added empirical ML model runs for trainability scoring. Enhanced statistical health analysis and annotation consistency evaluation. Introduced tier system (Platinum / Gold / Silver / Bronze). |
| v1.0 | 2026-04-07 | Initial 7-dimension system covering structural and annotation fundamentals. |
Each dataset record stores the LQS version used to compute its scores. Datasets scored with v1.0 retain their original scores until the dataset is next uploaded, at which point it is automatically re-scored under v2.0.
Get your dataset scored against LQS v2.0
Free quality audit → No account required · Results in 60 seconds