Study Finds Vision‑Language Models Flunk Negation, Risking Medical Missteps
TEHRAN (Tasnim) – A new MIT study reveals that widely used vision‑language models struggle with negation words such as “no” and “not,” potentially causing serious errors when retrieving medical images that must include some features but exclude others.
Researchers at MIT discovered that vision‑language models often treat negation as random noise, performing no better than chance when instructed to find images that do not contain specified objects.
Such failures could mislead clinicians: for example, a radiologist seeking chest X‑rays showing tissue swelling without an enlarged heart might retrieve cases with both conditions, skewing diagnostic insights.
“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, lead author and MIT graduate student.
In experiments, the team first tested existing models on image‑caption retrieval tasks involving negation and found that accuracy dropped by nearly 25 percent when captions included words like “no” or “doesn’t.”
Next, they evaluated multiple‑choice question tasks where models had to select the caption that correctly negated or affirmed an object’s presence; top performers achieved only about 39 percent accuracy, with some falling below random chance.
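The paper’s exact benchmarks are not reproduced here, but a minimal sketch of this kind of probe, assuming a CLIP‑style model loaded through the Hugging Face transformers library and an illustrative COCO test image, shows how a scoring test can compare an affirmative caption against its negated counterpart.

```python
# Minimal sketch (not the paper's benchmark): probe whether a CLIP-style
# model distinguishes an affirmative caption from its negated counterpart.
# The model name, test image, and captions are illustrative assumptions.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any test image works; this COCO validation sample shows cats and no dog.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "a photo that contains a dog",   # affirmative, false for this image
    "a photo that contains no dog",  # negated, true for this image
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns the two
# caption scores into a preference between the affirmative and negated caption.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze().tolist()
for caption, score in zip(captions, scores):
    print(f"{score:.3f}  {caption}")
```

If the two captions receive nearly identical scores, the model is effectively ignoring the word “no,” which is the behavior the MIT team measured at scale.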
The authors attribute these shortcomings to an “affirmation bias,” whereby models simply ignore negation and focus solely on positive labels in training data.
“To date, image‑caption datasets never include examples like ‘a dog jumping over a fence, with no helicopters,’” explains senior author Marzyeh Ghassemi, associate professor of EECS at MIT.
To address the gap, the researchers used a large language model to rewrite captions so that they include negation words, producing a synthetic dataset of 10 million image‑caption pairs, and then fine‑tuned vision‑language models on this augmented data.
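The study’s data‑generation pipeline is not detailed in this article, so the following is only a rough sketch of what negation‑aware caption augmentation can look like; a simple template over object labels stands in for the large language model the researchers used, and all names and sample captions are illustrative assumptions.

```python
# Minimal sketch of negation-style caption augmentation. A template over
# object labels stands in for the LLM rewriting step described in the study,
# purely to show the shape of the augmented data.
import random

def augment_with_negation(caption: str, present: set[str], vocabulary: set[str]) -> str:
    """Append a truthful 'no X' clause naming an object absent from the image."""
    absent = sorted(vocabulary - present)
    if not absent:
        return caption
    distractor = random.choice(absent)
    return f"{caption}, with no {distractor}"

# Toy label vocabulary and (caption, objects-present) pairs standing in for a dataset.
vocabulary = {"dog", "helicopter", "bicycle", "umbrella"}
samples = [
    ("a dog jumping over a fence", {"dog"}),
    ("two people riding bicycles in the rain", {"bicycle", "umbrella"}),
]

for caption, present in samples:
    print(augment_with_negation(caption, present, vocabulary))
```

The key constraint is that the added clause must remain true of the image, so only objects genuinely absent from it are negated, giving the model supervised examples where “no” actually changes the caption’s meaning.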
Post‑training, image retrieval performance improved by roughly 10 percent, while multiple‑choice accuracy rose by about 30 percent, demonstrating that data augmentation can partially mitigate the issue.
“But our solution is not perfect,” cautions Alhamoud, noting that the approach merely augments captions rather than redesigning model architectures.
The team urges practitioners to rigorously test vision‑language models for negation handling before deployment, especially in high‑stakes domains such as healthcare or manufacturing.
Future work may involve decoupling text and image processing or creating specialized negation‑aware datasets for critical applications.
The study, co‑authored by Shaden Alshammari, Yonglong Tian, Guohao Li, Philip H.S. Torr, and Yoon Kim, will be presented at the Conference on Computer Vision and Pattern Recognition.