360K document page images with layout annotations — the canonical doc parsing dataset.
PubLayNet from IBM Research is one of the largest publicly available document layout analysis datasets. It contains about 360K page images from PubMed Central open access articles, with automatically generated layout annotations for five region types: text blocks, titles, lists, tables, and figures. It is a standard benchmark for document understanding models such as LayoutLM.
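To make the annotation structure concrete, here is a minimal sketch that counts layout regions per class. It assumes the annotations follow a COCO-style JSON layout (`images`, `categories`, `annotations`, `category_id`, `bbox`); the file name and all values in the inline sample are made up for illustration and are not taken from the dataset.

```python
import json

# Tiny made-up sample in a COCO-style layout. Assumption: the real annotation
# files use this structure; every value here is illustrative, not real data.
SAMPLE = json.loads("""
{
  "images": [{"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792}],
  "categories": [
    {"id": 1, "name": "text"}, {"id": 2, "name": "title"},
    {"id": 3, "name": "list"}, {"id": 4, "name": "table"},
    {"id": 5, "name": "figure"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 200]},
    {"id": 12, "image_id": 1, "category_id": 1, "bbox": [50, 300, 500, 150]},
    {"id": 13, "image_id": 1, "category_id": 4, "bbox": [50, 470, 500, 220]}
  ]
}
""")


def count_regions_by_class(coco):
    """Return {class name: number of annotated regions}, including zero counts."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = {name: 0 for name in id_to_name.values()}
    for ann in coco["annotations"]:
        counts[id_to_name[ann["category_id"]]] += 1
    return counts


counts = count_regions_by_class(SAMPLE)
print(counts)
```

For a real annotation file, the same function could be applied to the result of `json.load` on that file, provided the structure matches the assumption above.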
LQS is our 7-dimension quality score, computed from the dataset's published statistics.
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
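As a rough illustration of how seven dimension scores could be combined, the sketch below takes an unweighted mean. This is an assumption, not the published LQS methodology: the actual weighting is not stated here, the 0 to 100 scale is hypothetical, and the example values are placeholders rather than PubLayNet's real scores.

```python
from statistics import mean

# The seven dimensions named in the text, as dictionary keys.
DIMENSIONS = (
    "completeness",
    "uniqueness",
    "validation_health",
    "size_adequacy",
    "format_compliance",
    "label_density",
    "class_balance",
)


def composite_score(scores):
    """Unweighted mean of the seven dimension scores.

    Assumption: each score lies on a 0-100 scale and all dimensions
    count equally. The real LQS formula may differ.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 100:
            raise ValueError(f"{d} must be between 0 and 100, got {scores[d]}")
    return mean(scores[d] for d in DIMENSIONS)


# Placeholder inputs, chosen only to exercise the function.
example = {d: 90 for d in DIMENSIONS}
example["class_balance"] = 20
result = composite_score(example)
print(result)
```

An unweighted mean lets one weak dimension pull the composite down in proportion; a weighted or minimum-based combination would behave differently.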
Common tasks and benchmarks where PubLayNet is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
PubLayNet is distributed under CDLA-Permissive 1.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid Document / OCR datasets with what public datasets often can't give you:
Other entries in the Document / OCR catalog.