💬 Curated Catalog · NLP / Text

HellaSwag — Commonsense NLI

Adversarially filtered commonsense inference — pick the correct sentence ending.

LQS 77 (gold) · ✓ Commercial OK · 70K multiple-choice questions · 50 MB JSONL · Released 2019
Source: rowanzellers.com · maintained by Rowan Zellers et al. (AI2 / UW)

About this dataset

HellaSwag tests commonsense natural language inference by asking models to choose the most plausible ending for a short context passage. Contexts are drawn from ActivityNet captions and WikiHow articles, and distractor endings are selected via Adversarial Filtering (AF) so that humans answer correctly over 95% of the time while BERT-era models struggle. It remains a standard LLM evaluation today despite near-saturation at the frontier.
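The AF loop described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `generate_candidates` and `discriminator_finds_easy` are hypothetical placeholders standing in for the language-model generator and trained classifier used in the real pipeline.

```python
import random

random.seed(0)

def generate_candidates(n):
    # hypothetical generator: each candidate is (text, easiness in [0, 1]);
    # in the real pipeline a language model proposes endings
    return [(f"ending-{random.randrange(10**6)}", random.random())
            for _ in range(n)]

def discriminator_finds_easy(candidate, threshold=0.5):
    # hypothetical classifier: flags machine text it can see through;
    # the real pipeline retrains a discriminator each round
    _, easiness = candidate
    return easiness > threshold

def adversarial_filter(n_distractors=3, max_rounds=20):
    """Repeatedly replace distractors the discriminator classifies easily."""
    distractors = generate_candidates(n_distractors)
    for _ in range(max_rounds):
        easy = [d for d in distractors if discriminator_finds_easy(d)]
        if not easy:
            break  # every remaining distractor fools the classifier
        kept = [d for d in distractors if not discriminator_finds_easy(d)]
        distractors = kept + generate_candidates(len(easy))
    return distractors

print(len(adversarial_filter()))  # always n_distractors
```

The key design point survives the simplification: only endings that a strong discriminator cannot distinguish from human writing make it into the final question.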

License: MIT
Formats: JSONL

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

77
out of 100
gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness (92): No public completeness metric; using the prior for 'research_release' datasets.
Uniqueness (68): Minimal deduplication disclosed.
Validation (68): Crowdsourced labels without a disclosed QC protocol.
Size adequacy (81): 70,000 items — below the 100,000 target for NLP / Text, but usable.
Format compliance (95): Industry-standard format, drop-in compatible with mainstream tooling.
Label density (52): Average 1.0 labels per item (sparse).
Class balance (75): Moderate class skew — realistic production distribution.

What it's used for

Common tasks and benchmark settings where HellaSwag — Commonsense NLI is the default or a competitive choice.
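The standard evaluation protocol is simple: score each of the four candidate endings under a model and predict the argmax. The sketch below uses a toy word-overlap heuristic in place of a real model; in practice `score_ending` would be a language-model log-likelihood of the ending given the context.

```python
def score_ending(context: str, ending: str) -> float:
    # toy heuristic (NOT a real model): count ending words seen in context
    ctx_words = set(context.lower().split())
    return sum(w in ctx_words for w in ending.lower().split())

def predict(context: str, endings: list[str]) -> int:
    # pick the highest-scoring ending
    scores = [score_ending(context, e) for e in endings]
    return scores.index(max(scores))

def accuracy(examples: list[dict]) -> float:
    hits = sum(predict(ex["ctx"], ex["endings"]) == ex["label"]
               for ex in examples)
    return hits / len(examples)

# hypothetical two-choice example for illustration
demo = [{"ctx": "The dog runs across the park and",
         "endings": ["the dog jumps into the park pond",
                     "a violin recites ancient poetry"],
         "label": 0}]
print(accuracy(demo))  # 1.0
```

Swapping the heuristic for per-token log-probabilities (often length-normalized) reproduces the accuracy numbers typically reported on leaderboards.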

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

≈40K train / ≈10K val / ≈10K test, 4 answer choices per question, ~25-word average context.
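Since the distribution format is JSONL (one JSON object per line), loading it is a one-liner per record. The snippet below uses a hypothetical single-record sample; the field names (`ctx`, `label`, `endings`) follow the published files, though real records carry extra metadata (split, source ids) omitted here.

```python
import io
import json

# one fabricated record, shaped like a HellaSwag JSONL line
raw = ('{"ctx": "A man is kneeling on a roof. He",'
       ' "label": 3,'
       ' "endings": ["falls off.", "waves.", "sings.",'
       ' "begins nailing down shingles."]}\n')

def iter_jsonl(fh):
    # yield one parsed object per non-blank line
    for line in fh:
        if line.strip():
            yield json.loads(line)

for ex in iter_jsonl(io.StringIO(raw)):
    gold = ex["endings"][ex["label"]]  # label indexes into the 4 endings
    print(gold)
```

Replace the `io.StringIO` with `open("hellaswag_train.jsonl")` (filename assumed) to stream the real files without loading them into memory.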

License

HellaSwag — Commonsense NLI is distributed under MIT. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with guarantees that public datasets often can't provide.


Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can I use HellaSwag commercially?
HellaSwag — Commonsense NLI is distributed under MIT, which generally permits commercial use. Always verify the current license terms with the maintainer (Rowan Zellers et al., AI2 / UW) before using it in a commercial product.

How large is the dataset?
HellaSwag — Commonsense NLI contains 70,000 multiple-choice questions: ≈40K train / ≈10K val / ≈10K test, 4 answer choices per question, ~25-word average context.

Where can I get it?
HellaSwag — Commonsense NLI is maintained by Rowan Zellers et al. (AI2 / UW) and is available at https://rowanzellers.com/hellaswag/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.