Hype or not? Automatic Promotional Language Detection

Formalizing automatic detection of hype language in biomedical research texts

Bojan Batalo1, Erica K. Shimomoto1, Dipesh Satav2, Neil Millar2
(1AIST Japan, 2University of Tsukuba) — EACL 2026

Hype or Not overview

Overview

We introduce the task of automatic detection of hype language in biomedical research texts. We define hype as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research.

We propose formalized annotation guidelines and apply them to annotate 1,270 sentences from the NIH grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline.


Abstract

In science, promotional language ("hype") is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. We introduce the task of automatic detection of hype, defined as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research.

We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. The annotation process revealed fair inter-annotator agreement initially, which improved substantially after discussion and clarification of guidelines (Cohen's Kappa: 0.30–0.52 before, 0.94–0.96 after).

We evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our annotation guidelines and dataset are available at https://github.com/hype-busters/eacl2026-hype-dataset.

The Problem: Hype in Scientific Publishing

Promotional language in scientific writing has been increasing in biomedical funding applications and journal publications. Terms such as revolutionary, unprecedented, groundbreaking, and transformative are becoming more common, yet they are rarely justified and can:

  • Undermine Objectivity: Words like "groundbreaking" and "transformative" impose value judgments rather than letting the research speak for itself
  • Bias Readers: Promotional language can influence readers' evaluation of research quality beyond the actual evidence presented
  • Erode Trust: When promotional language creates unrealistic expectations or misrepresents findings, public trust in science is eroded
  • Impede Development: Overstatement can misdirect research efforts and policy decisions

While promotional language is common, determining whether a specific term constitutes hype remains problematic. The same word may be promotional in one context but neutral or technical in another. For example, the adjectives essential and meticulous can promote significance or rigor ("meticulous experimental design"), but they may also occur in neutral contexts or as part of technical terms ("essential fatty acid", "meticulous hemostasis").

Proposed Annotation Guidelines

We developed six sequential annotation steps to systematically evaluate whether adjectives are used in a promotional manner:

  1. Value-Judgment: Does the adjective imply positive value judgment?
  2. Hyperbolic: Is it exaggerated? (e.g., revolutionary, unprecedented)
  3. Gratuitous: Can it be removed without loss of meaning?
  4. Amplified: Is it strengthened by modifiers? (e.g., "truly novel")
  5. Coordinated: Is it stacked with other hype adjectives?
  6. Broader Context: Does the overall sentence tone suggest promotion?
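
The six steps above can be sketched as a rule cascade. The exact decision logic lives in the paper's guidelines; the "gate, then any-of" combination below is an assumption made purely for illustration:

```python
def label_adjective(value_judgment: bool, hyperbolic: bool,
                    gratuitous: bool, amplified: bool,
                    coordinated: bool, promotional_context: bool) -> str:
    """Toy checklist classifier for a candidate adjective in context."""
    # Step 1 acts as a gate: without a positive value judgment,
    # the adjective cannot be promotional.
    if not value_judgment:
        return "not hype"
    # Steps 2-6: any single promotional signal marks the use as hype.
    if any([hyperbolic, gratuitous, amplified, coordinated,
            promotional_context]):
        return "hype"
    return "not hype"

# "essential fatty acid": technical term, no value judgment
print(label_adjective(False, False, False, False, False, False))  # not hype
# "truly novel approach": value judgment amplified by "truly"
print(label_adjective(True, False, False, True, False, False))    # hype
```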

Impact of Discussion: Four annotators initially showed fair agreement (Cohen's Kappa: 0.30–0.52). Discussion sessions resolved 545 out of 575 disagreements, dramatically improving agreement to 0.94–0.96.

Annotator disagreements before and after discussion

Figure 2: Initial disagreements largely resolved through guided discussion
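
For reference, Cohen's Kappa, the chance-corrected agreement statistic quoted above, can be computed with scikit-learn. The annotator labels below are invented for illustration, not taken from the dataset:

```python
from sklearn.metrics import cohen_kappa_score

# Invented labels from two annotators over the same eight sentences
ann_a = ["hype", "hype", "not", "hype", "not", "not", "hype", "not"]
ann_b = ["hype", "hype", "not", "hype", "not", "not", "hype", "hype"]

# Kappa corrects raw percent agreement (here 7/8) for chance agreement
print(cohen_kappa_score(ann_a, ann_b))  # 0.75
```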

NIH Grant Application Dataset

We annotated 1,270 sentences from the NIH grant application abstract corpus, focusing on novelty and rigour adjective groups:

Dataset Statistics:

  • Total Sentences: 1,270 annotated sentences
  • Hype Examples: 917 (72.2%)
  • Not Hype Examples: 353 (27.8%)
  • Novelty Adjectives: 11 members (creative, emerging, first, groundbreaking, innovative, latest, novel, revolutionary, unique, unparalleled, unprecedented)
  • Rigour Adjectives: 15 members (accurate, advanced, careful, cohesive, detailed, nuanced, powerful, quality, reproducible, rigorous, robust, scientific, sophisticated, strong, systematic)
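
The class balance above follows directly from the raw counts:

```python
hype, not_hype = 917, 353
total = hype + not_hype
assert total == 1270
print(f"{100 * hype / total:.1f}% hype, {100 * not_hype / total:.1f}% not hype")
# 72.2% hype, 27.8% not hype
```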

Most Promotional Adjectives (100% classified as Hype): groundbreaking, revolutionary, unparalleled, unprecedented, careful

Classification Rationales:

  • Gratuitous (683 samples): Adjectives adding little substance to the proposition
  • Broader Context (220 samples): Hype determined by overall sentence tone
  • Hyperbolic (208 samples): Exaggerated or unambiguous promotional adjectives
  • Coordinated (138 samples): Adjectives stacked with other hype candidates
  • Amplified (22 samples): Adjectives strengthened by modifiers

Experimental Results

Hype Distribution Across Adjectives

Hype percentage distribution across adjectives

Figure 4: Percentage of samples labeled as 'hype' per adjective. Some adjectives (groundbreaking, revolutionary) are nearly always hype, while others vary contextually.

We evaluated four model categories: traditional classifiers, pre-trained language models (PLMs), large language models (LLMs), and humans.

Model Performance Comparison (Accuracy)

  • BERT + Fine-tuning: 86.2% ✓ Best accuracy and F1-score
  • DISTILBERT + Fine-tuning: 85.8%
  • SVM (Bag-of-Words): 77.6%
  • Human Baseline: 76.7%
  • Models without fine-tuning: ~72% or lower
  • GPT-4O-MINI (Strict): 67.5%
  • LLAMA3.1-INST: 45.1%
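
As a rough illustration of a bag-of-words SVM baseline (not the paper's exact configuration or data), a scikit-learn pipeline might look like the following; the toy sentences are invented stand-ins for the annotated NIH corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy sentences standing in for the annotated NIH corpus
sentences = [
    "Our groundbreaking and revolutionary approach will transform the field.",
    "We measured essential fatty acid levels in plasma samples.",
    "This truly novel, unprecedented method outperforms prior work.",
    "Meticulous hemostasis was achieved during the procedure.",
]
labels = ["hype", "not_hype", "hype", "not_hype"]

# Bag-of-words features (unigrams and bigrams) feeding a linear SVM
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(sentences, labels)
print(model.predict(["An unparalleled, transformative framework"]))
```

In practice the baseline would be trained on the full 1,270-sentence dataset and evaluated on a held-out split rather than on its own training data.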

Key Insights:

  • Fine-tuning Works: Fine-tuned BERT models surpass the human baseline (86% vs. 77%)
  • Bidirectionality Matters: BERT outperforms GPT-2 because context both before and after the adjective is crucial
  • Domain Knowledge Doesn't Help: DISTILBERT outperformed BiomedBERT despite lacking biomedical pretraining
  • Zero-shot LLMs Fail: Likely because promotional wording is commonplace in the academic text they were trained on, zero-shot LLMs classify most candidates as "Not Hype"
  • Guidelines Enable Humans: Without guidelines, human accuracy is 77%; with the guidelines, it rises to an estimated ~85% or more

Model Classification Patterns

Confusion matrices reveal how different models classify hype adjectives, showing their strengths and weaknesses.

BERT (Fine-tuned)

BERT Confusion Matrix

Best Performance
86.2% Accuracy

SVM (GLOVE)

SVM GLOVE Confusion Matrix

Traditional Method
77.6% Accuracy

Human Annotators

Human Baseline Confusion Matrix

Human Baseline
76.7% Accuracy

BibTeX

@inproceedings{batalo-etal-2026-hype,
    title = "Hype or not? Formalizing Automatic Promotional Language Detection in Biomedical Research",
    author = "Batalo, Bojan  and
      Shimomoto, Erica K.  and
      Satav, Dipesh  and
      Millar, Neil",
    editor = "Demberg, Vera  and
      Inui, Kentaro  and
      Marquez, Llu{\'i}s",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-long.328/",
    doi = "10.18653/v1/2026.eacl-long.328",
    pages = "6979--6992",
    ISBN = "979-8-89176-380-7",
    abstract = "In science, promotional language ({'}hype') is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. In this paper, we introduce the task of automatic detection of hype, which we define as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research. We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our findings highlight the linguistic complexity of the task and the potential need for domain knowledge. While some linguistic works address hype detection, to the best of our knowledge, we are the first to approach it as a natural language processing task. Our annotation guidelines and dataset are available at https://github.com/hype-busters/eacl2026-hype-dataset."
}