Show a frontier model the word “red” printed in blue ink and ask it one thing: name the ink color, not the word. It gets it right. Show it five of those. Still fine. Show it forty, and the smartest AI on the planet falls apart like a kid who skipped his nap.

That is the finding from a new study in PNAS Nexus, published June 10. A team led by Suketu Patel ran GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5 through the Stroop task — a color-and-word attention test psychologists have leaned on since 1935. The kind of thing you could hand a third-grader.

If you have never heard of it: the Stroop task shows color words printed in colored ink. Sometimes they match — “green” in green. Sometimes they clash — “green” in red. Your job is to say the ink color and ignore what the word spells. It is deceptively hard, because reading is automatic and naming colors is not. That tension is the entire point.
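If it helps to see the setup as data, here is a minimal sketch of how a Stroop list could be built. The four-color palette, the function names, and the fixed seed are my own assumptions for illustration, not details from the study.

```python
import random

# Assumed palette for illustration; the study's actual colors may differ.
COLORS = ["red", "green", "blue", "yellow"]

def make_trial(congruent, rng):
    """Return (word, ink). They match if congruent, clash otherwise."""
    word = rng.choice(COLORS)
    if congruent:
        ink = word
    else:
        ink = rng.choice([c for c in COLORS if c != word])
    return word, ink

def make_list(n, congruent, rng):
    """Build a list of n trials that all match or all clash."""
    return [make_trial(congruent, rng) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the list is reproducible
trials = make_list(5, congruent=False, rng=rng)

answer_key = [ink for _, ink in trials]  # the correct answers: ink colors
trap = [word for word, _ in trials]      # what you get by reading the words
```

On a clashing list, the answer key and the trap disagree on every single item, which is what makes the test a clean measure of whether the instruction is being followed.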

The numbers are brutal. GPT-4o hit 91% accuracy on a list of five words. Bump it to ten words and it dropped to 57%. At forty words, it scored 15%, far worse than a coin flip. Claude 3.5 Sonnet held steady through twenty words, looked composed, then cratered to 24% at forty.

Then the researchers got mean. They mixed matching and mismatching words in the same list — “red” in red ink sitting next to “red” in green. On the mismatched items, accuracy fell to nearly zero. The models stopped naming colors and started doing the exact thing they had been told not to do: reading the words.
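To make that failure mode concrete, here is a small sketch of how a mixed list could be scored, with matching and mismatching items counted separately. The function name and the toy trials are hypothetical, and the "responder" below is a stand-in that simply reads the words; none of this is the paper's own scoring code.

```python
def score_by_condition(trials, responses):
    """Score responses against ink colors, split by match vs. mismatch.

    trials: list of (word, ink) pairs; responses: list of color names.
    Returns (accuracy per condition, number of word-reading errors).
    """
    stats = {"match": [0, 0], "mismatch": [0, 0]}  # [correct, total]
    word_reads = 0
    for (word, ink), resp in zip(trials, responses):
        key = "match" if word == ink else "mismatch"
        stats[key][1] += 1
        if resp == ink:
            stats[key][0] += 1
        elif resp == word:
            word_reads += 1  # named the word instead of the ink
    accuracy = {k: (c / t if t else None) for k, (c, t) in stats.items()}
    return accuracy, word_reads

# Toy mixed list: two matching items, two mismatching ones.
trials = [("red", "red"), ("red", "green"), ("blue", "blue"), ("green", "blue")]
responses = [word for word, _ in trials]  # a responder that just reads the words

accuracy, word_reads = score_by_condition(trials, responses)
print(accuracy, word_reads)  # {'match': 1.0, 'mismatch': 0.0} 2
```

A pure word-reader looks perfect on the matching items and scores zero on the mismatched ones, which is why splitting the score by condition exposes what an overall average would blur.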

Here is why that matters. The Stroop task measures executive control — the brain’s ability to suppress an automatic habit and stay on a goal. Humans are wired to read words, not name ink colors, and yet most people hold their focus across a long, conflicting list. The machines could not. They defaulted to the behavior they were trained hardest to produce, and the longer the task ran, the harder they fell.

The paper’s title says it without flinching: “Deficient executive control in transformer attention.” Translation: the self-attention mechanism behind every model you have heard of does not actually hold attention. It spreads thinner as the input grows. We named the mechanism after a thing it cannot reliably do.
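One way to picture "spreads thinner" is a toy softmax calculation. This is my own back-of-the-envelope illustration, not the paper's analysis: it assumes a single attention step where one target item scores a fixed amount higher than every other item, and all the other items score the same. Real models have many heads, many layers, and learned scores, so treat this as intuition only.

```python
import math

def target_weight(n_items, advantage=2.0):
    """Softmax weight on one target among n_items.

    Assumes the target's score exceeds each of the n_items - 1
    equally scored distractors by `advantage` (a made-up number).
    """
    target = math.exp(advantage)
    return target / (target + (n_items - 1))

for n in (5, 10, 20, 40):
    print(n, round(target_weight(n), 3))
```

Under these assumptions the target's share of attention shrinks every time the list gets longer, even though its head start over each distractor never changes. Softmax weights have to sum to one, so more items means less weight to go around.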

And this is the uncomfortable part. Companies are filing for trillion-dollar IPOs and announcing “AGI is here” while their flagship models flunk a 90-year-old elementary-school exercise the moment it runs long. The benchmark scores keep climbing. The basic cognitive control keeps not showing up.

My Opinion

I will be blunt: this is the most useful AI paper I have read all month, and it arrived with no press tour and no valuation attached. Just three researchers and a test older than the transistor.

Here is what bugs me. The industry trained us to read a 95% on some benchmark as “almost human.” This study shows the gap is not a few points; it is a different kind of mind. A person keeps the instruction in their head. The model keeps predicting the next likely token, and after the written word “red,” the token “red” is a far likelier continuation than the name of the ink color. That is not focus. That is autocomplete in a lab coat. It also explains why your AI assistant nails the first paragraph of a long job and quietly loses the plot by paragraph nine.

The fix is not more parameters. It is an architecture that can suppress its own strongest instinct — and nobody has shipped one. Until someone does, treat every “it reasons like a person” claim as marketing. Next time a model dazzles you on a short prompt, hand it a long one and watch exactly where its attention goes.


Author: Yahor Kamarou (Mark) / www.humai.blog / 14 Jun 2026