The PDF landed at 3 a.m. East Coast and hit Nature's front page by Monday morning: "Human scientists trounce the best AI agents on complex tasks." By breakfast, it had been forwarded through every AI lab Slack in the Bay.

Underneath the headline is the Stanford AI Index 2026, released April 13 by the Stanford Institute for Human-Centered AI. It runs more than 400 pages of charts, benchmarks, and uncomfortable footnotes. Most people will read the executive summary and move on. But one number inside it shreds the loudest narrative of the past 18 months.

On complex, multi-step scientific workflows, the best frontier AI agents score roughly half as well as humans with PhDs.

The actual numbers

Stanford ran agents across a battery of benchmarks. On narrow tasks (one prompt, one answer), the curves keep going up and to the right. Real-world agent task success rates climbed from 20% in 2025 to 77.3% this year. Cybersecurity triage agents went from 15% to 93%. Those are the numbers that show up on earnings calls.

Then you ask an agent to actually do science. Design the experiment. Chase an anomaly across three papers. Debug a protocol that half-works. The best frontier agents (GPT, Claude, Gemini) land at roughly half the score of a PhD on the same tasks. Not 10% worse. Half.

Across the natural sciences, AI is mentioned in only 6% to 9% of published papers. For all the talk of "AI-driven science," actual adoption in published work is still stuck in the single digits. Researchers are kicking the tires. They are not swapping out their grad students.

Why this matters right now

Every major lab has been telling investors that agents are the next S-curve. Anthropic says agents. OpenAI says agents. Google says agents. The pitch: LLMs got smart, agents will get autonomous, and knowledge work gets automated. Consulting decks have been drawn up. Headcount plans have been cut. This is not a fringe prediction. It is the base case being sold to Fortune 500 boards.

Stanford just called the bluff. On tasks that require judgment, planning, and verification, agents are still playing junior varsity. They cannot reliably chain six steps together. They cannot tell when they are wrong. And when they are wrong, they are confidently wrong in ways that waste a scientist's entire afternoon.

My opinion

I'll be blunt. The agent hype has been lapping the science for a year, and I'm tired of pretending it hasn't. Every demo reel shows an agent booking a flight or filling a spreadsheet. Nobody shows the agent trying to write a grant, review a protein assay, or spot a load-bearing error buried in 40 pages of data. Because when they try, it doesn't work. Stanford finally said the quiet part out loud, it landed in Nature, and now the VCs funding "AI-first research labs" have to explain why their portfolio companies are burning through Series B checks to replicate what one postdoc can do in an afternoon.

Here's what bugs me about the reaction this will get. The industry will read this report and say "wait six months." That is the default response to every uncomfortable benchmark. And sometimes they're right: models do improve. But the gap between "solves a closed math problem" and "runs a wet-lab investigation end-to-end" is not one training run away. It is about taste, memory, and knowing what you don't know. Those are exactly the capabilities where current systems regress the hardest the moment you push them.

The productive version of this news is the one nobody will lead with: AI makes PhDs faster, not obsolete. The people who will get 10x more done this year are the ones pairing an agent with a skilled researcher who can catch its lies. The people who fire half their research staff and hand the clipboard to an agent are about to find out, on the record, exactly why that headline ended up on Nature's front page.


Author: Yahor Kamarou (Mark) / www.humai.blog / 15 Apr 2026