10K web video clips with 200K human-written captions — the standard video captioning benchmark.
MSR-VTT (MSR Video to Text) from Microsoft Research is the standard benchmark for video captioning and retrieval. 10,000 web video clips covering 20 categories, each paired with 20 human-written English captions (200,000 captions total). Widely used for video-language pretraining, video captioning, and cross-modal retrieval research.
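To make the clip/caption structure concrete, here is a minimal sketch of grouping captions by clip from an MSR-VTT-style annotation file. The `"sentences"`, `"video_id"`, and `"caption"` field names follow common MSR-VTT releases but are assumptions here; adjust them to match your copy of the data.

```python
import json
from collections import defaultdict

def captions_per_clip(annotation_path: str) -> dict[str, list[str]]:
    """Group captions by clip id from an MSR-VTT-style annotation JSON.

    Assumes the file has a top-level "sentences" list whose entries
    carry "video_id" and "caption" fields (an assumption based on
    common MSR-VTT releases; verify against the files you download).
    """
    with open(annotation_path, encoding="utf-8") as f:
        data = json.load(f)
    grouped: dict[str, list[str]] = defaultdict(list)
    for sentence in data["sentences"]:
        grouped[sentence["video_id"]].append(sentence["caption"])
    return dict(grouped)
```

On the full dataset this grouping should yield 10,000 clips with 20 captions each, matching the published stats above.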
LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
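As an illustration of how a composite like this can be assembled, here is a minimal sketch. The seven dimension names come from the description above; the equal weighting and 0-100 scale are assumptions for illustration, not the published LQS methodology.

```python
# Dimension names taken from the page; weighting and scale are assumptions.
DIMENSIONS = [
    "completeness", "uniqueness", "validation_health", "size_adequacy",
    "format_compliance", "label_density", "class_balance",
]

def composite_score(scores: dict[str, float]) -> float:
    """Equal-weight mean of the seven dimension scores (each assumed 0-100)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

A dataset scoring 80 on every dimension would score 80 overall under this equal-weight sketch; the real methodology may weight dimensions differently.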
Common tasks and benchmarks where MSR-VTT — Microsoft Video-to-Text is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
MSR-VTT — Microsoft Video-to-Text is distributed under Microsoft Research License (research use). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid multimodal datasets that provide what public datasets often can't:
Other entries in the Multimodal catalog.