The number that changed everything was 75.0%.
That's GPT-5.4's score on OSWorld-Verified, the benchmark that tests whether an AI can control a real desktop computer — navigating user interfaces, managing files, clicking through applications, submitting forms. The human expert baseline on this same test is 72.4%. OpenAI released GPT-5.4 on March 5, 2026, and it became the first general-purpose AI in history to cross that line.
Your computer now has a more qualified operator than you.
OSWorld-Verified is not a trivia test. It simulates the kinds of tasks your IT department handles every day: opening software, configuring settings, writing short scripts, organizing folders. The AI doesn't get pre-written code paths; it navigates using screenshots and mouse-and-keyboard actions, the same way you would. GPT-5.4 completed 75.0% of those tasks. Its predecessor, GPT-5.2, managed 47.3%. That's not a marginal improvement. It's a 27.7-percentage-point leap in a single model generation.
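For anyone who wants to check the arithmetic, here is a minimal Python sketch. It uses only the three scores quoted above; the function name `score_gap` is my own, not anything from OpenAI or the OSWorld project. It separates the percentage-point gain from the relative gain, since the two are easy to confuse.

```python
def score_gap(new_pct: float, old_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    point_gain = new_pct - old_pct            # absolute difference between two rates
    relative_gain = point_gain / old_pct * 100  # gain as a share of the old score
    return round(point_gain, 1), round(relative_gain, 1)


# Scores as quoted in this article
GPT_5_4 = 75.0
GPT_5_2 = 47.3
HUMAN_BASELINE = 72.4

points_vs_predecessor, relative_vs_predecessor = score_gap(GPT_5_4, GPT_5_2)
points_vs_human, _ = score_gap(GPT_5_4, HUMAN_BASELINE)

print(f"Gain over GPT-5.2: {points_vs_predecessor} points "
      f"({relative_vs_predecessor}% relative)")
print(f"Margin over human baseline: {points_vs_human} points")
```

The gap to the predecessor is large, while the margin over the human baseline is only a few points. Both facts matter when reading the headline number.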
The model also brings a 1 million token context window (double what GPT-5.2 offered), a 33% reduction in hallucination rates, and five levels of reasoning effort control. API access starts at $2.50 per million input tokens — cheap enough for startups to build autonomous agents that can run entire back-office workflows without a human watching. A premium Pro variant is available at $30 per million tokens for more complex tasks.
The immediate applications are obvious: automated software testing, QA pipelines, form-filling at scale, replacement of legacy robotic process automation (RPA) systems. But the implications run deeper. This is the moment when "AI assistant" starts to mean something fundamentally different. Not a thing you talk to — a thing that works for you, inside your actual desktop, more reliably than a junior employee.
Here's what makes this uncomfortable: every business that relies on human-in-the-loop processes for basic computer tasks now has a serious question to answer. Not in five years. Now. If GPT-5.4 can do desktop work at a 75% success rate for $2.50 per million input tokens, keeping humans in the loop for many of those tasks is becoming very difficult to justify on ROI grounds. The companies that figure this out first will look very different from the ones that wait.
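To make that ROI question concrete, here is a rough back-of-the-envelope sketch in Python. Only the $2.50 input price and the 75% success rate come from the figures above. The 200,000 input tokens per attempt is a purely hypothetical number I picked for illustration, and the function `cost_per_success` is my own. The sketch ignores output-token pricing (not quoted here) and the cost of catching failed runs, and it assumes failed attempts are simply retried independently, so treat the result as a floor rather than a forecast.

```python
def cost_per_success(tokens_per_attempt: int,
                     price_per_million: float,
                     success_rate: float) -> float:
    """Expected input-token cost (USD) per successfully completed task.

    Assumes each failed attempt is retried independently, so the expected
    number of attempts per success is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million
    return cost_per_attempt / success_rate


# From the article: input price and benchmark success rate
PRICE_PER_MILLION_INPUT = 2.50
SUCCESS_RATE = 0.75

# HYPOTHETICAL: tokens consumed by one task attempt (not a measured figure)
TOKENS_PER_ATTEMPT = 200_000

estimate = cost_per_success(TOKENS_PER_ATTEMPT, PRICE_PER_MILLION_INPUT, SUCCESS_RATE)
print(f"Estimated input cost per completed task: ${estimate:.2f}")
```

Swap in your own token counts and success rates from real workloads. A benchmark pass rate is not the same as the success rate on your own processes.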
My Opinion
I'll be direct: this is the AI milestone that matters more than any chatbot benchmark released in the last two years. Reasoning leaderboards, coding tests, SAT scores — these measure what an AI knows. OSWorld measures what an AI can do. An AI that can operate software better than a human expert isn't a research curiosity anymore. It's a product decision waiting to be made.
What frustrates me is the framing. OpenAI released GPT-5.4 with the standard playbook — blog posts, demo videos, a pricing page — and most of the tech press treated it like another model drop in a crowded year. It's not. When a general-purpose AI crosses the human baseline on desktop task completion, we've entered qualitatively different territory. The coverage should reflect that.
Here's my actual prediction: 2026 is the year we stop debating whether AI agents can replace knowledge work and start arguing about the pace. GPT-5.4 just set that pace at "faster than most organizations are ready for." The companies quietly deploying this now aren't going to announce it. They're just going to show better margins in Q3.
Author: Yahor Kamarou (Mark) / www.humai.blog / 04 Apr 2026