There's no single "best" AI model in 2026. If anyone tells you otherwise, they're either selling you something or haven't actually used all three.
The AI landscape shifted dramatically on February 5, 2026, when Anthropic released Claude Opus 4.6 with a 1-million-token context window and record-breaking benchmark scores. Twenty minutes later, OpenAI fired back with GPT-5.3 Codex. But even with GPT-5.2 and Gemini 3 Pro still dominating many workflows, the question remains: which model should you actually use?
Claude Opus 4.6 now leads in agentic coding with 65.4% on Terminal-Bench 2.0 and dominates enterprise knowledge work with 1,606 Elo on GDPval-AA, putting it 144 points ahead of GPT-5.2. GPT-5.2 still owns abstract reasoning with 52.9% on ARC-AGI-2 and perfect 100% on AIME 2025 mathematics. Gemini 3 Pro handles massive 2-million-token contexts with state-of-the-art multimodal understanding.
This comparison breaks down the real differences across benchmarks, pricing, coding tasks, reasoning ability, and practical use cases.
The Quick Answer (If You're in a Hurry)
For coding and agentic tasks: Claude Opus 4.6. It achieved 65.4% on Terminal-Bench 2.0 (the highest score ever recorded), 80.8% on SWE-bench Verified, and can now orchestrate multiple AI agents working in parallel on complex codebases.
For mathematical reasoning and abstract problem-solving: GPT-5.2. It achieved a perfect 100% on AIME 2025 mathematics and 52.9% on ARC-AGI-2, which measures genuine reasoning ability on novel challenges. Notably, though, Opus 4.6 has since overtaken it on ARC-AGI-2 with 68.8%, so GPT-5.2's clearest remaining edge is in pure mathematics.
For multimodal tasks and extremely long documents: Gemini 3 Pro. Its 2-million-token context window and native video, audio, and image processing make it unmatched for mixed-media workflows.
For enterprise knowledge work: Claude Opus 4.6. Its 1,606 Elo on GDPval-AA puts it 144 points ahead of GPT-5.2 on economically valuable tasks in finance, legal, and professional domains.
For budget-conscious teams: Gemini 3 Pro at $2/$12 per million tokens offers the best price-to-performance ratio for most general tasks.
Now let's dig into the details.
Benchmark Showdown: What the Numbers Actually Mean
Benchmarks don't tell the whole story, but they're a useful starting point. Here's how the three models stack up on the tests that matter most in February 2026.
On Terminal-Bench 2.0, the leading evaluation for agentic coding systems, Claude Opus 4.6 achieves 65.4%, the highest score ever recorded. GPT-5.2 comes in at 64.7%, just 0.7 points behind. Gemini 3 Pro trails at around 54%. This benchmark tests real command-line coding proficiency where consistency and accuracy under pressure matter.
Coding Benchmarks
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro | Winner |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 65.4% | 64.7% | 54.0% | Claude |
| SWE-bench Verified | 80.8% | 80.0% | 76.2% | Claude |
| Code Review | Self-correction | Good | Basic | Claude |
Reasoning Benchmarks
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro | Winner |
|---|---|---|---|---|
| ARC-AGI-2 | 68.8% | 52.9% | 31.1% | Claude |
| AIME 2025 | ~94% | 100% | ~95% | GPT |
| GPQA Diamond | ~90% | 93.2% | 93.8% | Gemini |
| GDPval-AA (Elo) | 1,606 | 1,462 | N/A | Claude |
Context & Multimodal Benchmarks
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro | Winner |
|---|---|---|---|---|
| MRCR v2 @ 1M tokens | 76% | N/A | 26.3% | Claude |
| MMMU-Pro | ~75% | ~78% | 81.0% | Gemini |
| LongBench v2 | ~65% | 54.5% | 68.2% | Gemini |
On SWE-bench Verified, the gold standard for real-world software engineering tasks, Claude Opus 4.6 scores 80.8%, essentially matching its predecessor's 80.9%. GPT-5.2 follows closely at 80.0%, and Gemini 3 Pro scores 76.2%. These scores measure the ability to understand real GitHub issues, navigate complex codebases, implement fixes, and ensure no existing functionality breaks.
On ARC-AGI-2, a benchmark designed to test genuine reasoning ability while resisting memorization, GPT-5.2 scores 52.9%. But here's where Opus 4.6 made a massive leap: it scores 68.8%, nearly doubling Opus 4.5's 37.6% and taking the lead. This is the most striking improvement in the new model, suggesting significantly enhanced novel problem-solving capabilities.
On GDPval-AA, which measures real-world professional tasks in finance, legal, and other domains, Opus 4.6 reaches 1,606 Elo. That's 144 points ahead of GPT-5.2 and 190 points ahead of Opus 4.5. In chess terms, that gap represents the difference between a grandmaster and a strong international master. Meaningful, not insurmountable, but consistently noticeable in daily use.
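To make that gap concrete, here's a back-of-the-envelope sketch using the standard Elo expectation formula; it assumes GDPval-AA's arena-style Elo behaves like classical Elo, which the benchmark itself doesn't guarantee.

```python
# Expected pairwise win probability implied by an Elo gap (standard Elo formula).
def expected_score(elo_gap: float) -> float:
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{expected_score(144):.2f}")  # ~0.70
```

Under that assumption, a 144-point lead means Opus 4.6 would be preferred in roughly 70% of head-to-head task comparisons.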
On AIME 2025, the American Invitational Mathematics Examination, GPT-5.2 still achieves a perfect 100% without tools. Claude Opus 4.6 has improved to around 93-95% but remains slightly behind in pure mathematical reasoning.
On GPQA Diamond, a graduate-level science benchmark, GPT-5.2 Pro scores 93.2%, essentially tied with Gemini 3 Deep Think's 93.8%. Claude Opus 4.6 comes in at approximately 90%.
On MRCR v2, which tests long-context retrieval with multiple pieces of information buried across vast amounts of text, Opus 4.6 scores 76% on the hardest variant (eight needles hidden across one million tokens). For comparison, Opus 4.5 scored just 18.5% on the same test. Gemini 3 Pro scores 77% at 128K tokens but drops to 26.3% at the actual 1M-token mark according to Google's own evaluation card.
The takeaway: Opus 4.6 has closed significant gaps in reasoning while maintaining its coding dominance. The models are converging in many areas, but each still has distinct strengths.
Coding Performance: Where Claude Extends Its Lead
For developers, coding capability is often the deciding factor. Here's what real-world testing reveals about Opus 4.6.
Claude Opus 4.6 isn't just an incremental upgrade. The model demonstrates stronger planning abilities, improved long-term concentration, and a much higher capacity to navigate large and complex codebases. One notable advance is its ability to detect and correct its own mistakes during code review, a long-standing weakness in previous generations.
The real revolution is Agent Teams in Claude Code. Multiple Claude instances can now coordinate on complex tasks through a tmux-based orchestrator pattern. In one demonstration, multi-agent Claude Code orchestration built a working C compiler from scratch: 100,000 lines of code that boots Linux on three CPU architectures. That's not a parlor trick; it's a preview of autonomous software engineering.
| Feature | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|
| SWE-bench Verified | 80.8% ✅ | 80.0% | 76.2% |
| Terminal-Bench 2.0 | 65.4% ✅ | 64.7% | 54.0% |
| Self-correction | ✅ Excellent | ⚠️ Good | ❌ Basic |
| Multi-file Navigation | ✅ Excellent | ✅ Good | ⚠️ Adequate |
| Agent Teams | ✅ Yes | ❌ No | ❌ No |
| Max Codebase Size | 1M tokens | 400K tokens | 2M tokens (degrades early) |
| Code Style | Sophisticated, architectural | Conventional, readable | Concise, minimal |
Agent Teams Demo Results
| Metric | Result |
|---|---|
| Lines of Code | 100,000 |
| Output | Working C compiler |
| Capability | Boots Linux on 3 CPU architectures |
| Approach | Multiple Claude instances in parallel |
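For readers curious what that orchestrator pattern looks like in practice, here's a deliberately simplified sketch. It is illustrative only: the sub-tasks are invented, and it assumes a `claude` command-line tool that can run a single prompt non-interactively; the real Agent Teams feature coordinates far more than this.

```python
# Toy illustration of the parallel-agent orchestrator pattern, not the actual
# Agent Teams implementation. Assumes a hypothetical non-interactive
# `claude -p` invocation; the sub-tasks and file layout are invented.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SUBTASKS = [
    "Implement the lexer in src/lexer.c according to SPEC.md",
    "Implement the recursive-descent parser in src/parser.c",
    "Write regression tests for the code generator in tests/",
]

def run_agent(task: str) -> str:
    """Run one worker instance on a single sub-task and return its transcript."""
    result = subprocess.run(
        ["claude", "-p", task],  # assumed non-interactive invocation
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Fan the sub-tasks out to parallel workers, then collect output for review.
with ThreadPoolExecutor(max_workers=len(SUBTASKS)) as pool:
    for task, transcript in zip(SUBTASKS, pool.map(run_agent, SUBTASKS)):
        print(f"=== {task}\n{transcript[:200]}\n")
```

The key idea is the division of labor: an orchestrator splits the work, parallel workers execute independently, and results come back for integration and review.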
GPT-5.2 delivers excellent consistency and follows common conventions that make code easier for junior developers to understand and modify. Its 80.0% SWE-bench score essentially matches Claude, and it excels when you need structured thinking across multi-file workflows. For many routine coding tasks, the difference between the two is marginal.
Gemini 3 Pro generates notably concise code, prioritizing efficiency and performance. This brevity can be an asset for experienced developers who appreciate clean, minimal implementations. However, it sometimes comes at the expense of readability. In head-to-head testing, Gemini often delivers the "minimum viable version" while Claude and GPT add polish and depth.
The bottom line: if you're doing serious software engineering, especially on complex codebases requiring sustained attention, Opus 4.6 justifies its premium pricing. For routine development tasks, all three models are production-ready.
The Context Window Revolution: Opus 4.6's Real Breakthrough
The 1-million-token context window in Opus 4.6 isn't just a bigger number. It represents a qualitative shift in what's actually usable.
Previous AI models advertised large context windows but suffered from "context rot," where performance degraded drastically as input grew. A model might advertise a 200K-token window, yet practical performance would drop off a cliff after 50K tokens. Opus 4.6 solves this problem.
Performance Comparison
| Model | Advertised Window | MRCR v2 @ 1M tokens | Usable Performance |
|---|---|---|---|
| Claude Opus 4.6 | 1M tokens | 76% ✅ | Excellent |
| Gemini 3 Pro | 2M tokens | 26.3% | Poor at scale |
| GPT-5.2 | 400K tokens | N/A | Good within limits |
| Claude Opus 4.5 | 200K tokens | 18.5% | Context rot issues |
What 1 Million Tokens Equals
| Format | Approximate Size |
|---|---|
| Words | ~750,000 |
| Pages | ~1,500 |
| Journal Articles | 10-15 full papers |
| Regulatory Filing | 1 complete submission |
| Codebase | Entire repo without chunking |
On the MRCR v2 benchmark, which buries multiple pieces of information across vast amounts of text, Opus 4.6 scores 76% on the hardest variant, with eight needles hidden across one million tokens. Its predecessor, Claude Opus 4.5, scored just 18.5% on the same test. That's not incremental improvement. That's a different capability entirely.
To put one million tokens in perspective: that's roughly 750,000 words or about 1,500 pages of text. You can ingest a 500-page contract alongside an entire industry precedent corpus simultaneously. You can process entire codebases without chunking. Legal discovery databases, patent portfolios, comprehensive research papers, year-long email threads: Opus 4.6 can swallow them whole and actually understand how everything connects.
Gemini 3 Pro advertises a 2-million-token context window, technically larger than Opus 4.6. But advertised capacity and usable performance are increasingly divergent metrics. Google reports Gemini 3 Pro scoring 77% on MRCR v2 at 128,000 tokens, roughly in line with Opus 4.6. But at the actual 1M-token mark, Gemini 3 Pro's score drops to 26.3% according to Google's own model evaluation card. Developer forums have echoed this gap, with users reporting significant performance degradation after using as little as 15-20% of the advertised context window.
For researchers and professionals working with large document sets, patent portfolios, or regulatory submissions, this distinction matters enormously. A million tokens that actually work is more valuable than two million tokens that don't.
Reasoning and Problem-Solving: The Gap Narrows
When it comes to pure reasoning, GPT-5.2 still holds advantages, but Opus 4.6 has closed the gap significantly.
GPT-5.2's ARC-AGI-2 score of 52.9% stood roughly 15 points ahead of Claude Opus 4.5's 37.6%. But Opus 4.6 leaped to 68.8%, now surpassing GPT-5.2 on this benchmark that was designed to resist memorization and test genuine intelligence on novel problems. This is perhaps the most surprising result from the new release.
Mathematical Reasoning (AIME 2025)
| Model | Score | Notes |
|---|---|---|
| GPT-5.2 | 100% ✅ | Perfect score, no tools |
| Gemini 3 Pro | ~95% | With code execution |
| Claude Opus 4.6 | ~94% | Improved from 4.5 |
Novel Problem-Solving (ARC-AGI-2)
| Model | Score | Change from Previous |
|---|---|---|
| Claude Opus 4.6 | 68.8% ✅ | +31.2 pts (nearly 2x) |
| GPT-5.2 | 52.9% | Baseline |
| Claude Opus 4.5 | 37.6% | Previous gen |
| Gemini 3 Pro | 31.1% | Lowest |
Enterprise Knowledge Work (GDPval-AA)
| Model | Elo Score | Gap vs Leader |
|---|---|---|
| Claude Opus 4.6 | 1,606 ✅ | — |
| GPT-5.2 | 1,462 | -144 pts |
| Claude Opus 4.5 | 1,416 | -190 pts |
💡 144 Elo points = difference between chess grandmaster and international master
GPT-5.2 still dominates pure mathematical reasoning. Its perfect 100% on AIME 2025 without tools remains unmatched. Claude Opus 4.6 has improved to approximately 93-95%, impressive but clearly behind in this domain.
OpenAI's GDPval benchmark, which measures performance on "well-specified knowledge work tasks" across 44 occupations, tells a more mixed story. OpenAI claims GPT-5.2 beats or ties industry professionals 70.9% of the time. But on Anthropic's competing enterprise benchmark GDPval-AA, Opus 4.6 leads by 144 Elo points over GPT-5.2 in finance, legal, and professional domains.
OpenAI reports GPT-5.2 shows 65% fewer hallucinations compared to previous versions. For tasks where accuracy matters and you can't afford confident but wrong answers, this improvement is significant. Both companies have made progress on reliability, but different benchmarks favor different models.
The bottom line: the reasoning gap has narrowed dramatically. For pure math, GPT-5.2 remains ahead. For novel problem-solving and enterprise knowledge work, Opus 4.6 now leads.
Multimodal and Native Capabilities: Gemini's Territory
Gemini 3 Pro plays a fundamentally different game with its native multimodal architecture.
While Claude and GPT were built primarily as text models with vision bolted on, Gemini was designed from the ground up to process text, images, video, and audio natively. It achieved 81.0% on MMMU-Pro, demonstrating strong comprehension across all modalities simultaneously.
Native Support Comparison
| Modality | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|
| Text | ✅ | ✅ | ✅ |
| Images | ✅ Vision | ✅ Vision | ✅ Native |
| Video | ❌ | ❌ | ✅ Native |
| Audio | ❌ | ✅ Voice | ✅ Native |
| Excel | ✅ Native | ⚠️ Via API | ✅ Sheets |
| PowerPoint | ✅ Native | ❌ | ✅ Slides |
Multimodal Benchmark (MMMU-Pro)
| Model | Score |
|---|---|
| Gemini 3 Pro | 81.0% ✅ |
| GPT-5.2 | ~78% |
| Claude Opus 4.6 | ~75% |
Best Use Cases by Model
| Model | Best Multimodal Use Cases |
|---|---|
| Gemini 3 Pro | Video analysis, mixed media workflows, brand guidelines, visual research |
| Claude Opus 4.6 | Excel analysis, PowerPoint generation, document + image workflows |
| GPT-5.2 | Voice conversations, image understanding, text-focused multimodal |
If your workflow involves analyzing visual assets alongside text, incorporating reference materials across media types, or working with video content, Gemini handles these tasks natively rather than through workarounds. Marketing teams working with brand guidelines, researchers analyzing multimedia datasets, and creative professionals dealing with mixed media all benefit from this architecture.
The search integration is another strength. Gemini achieved 45.8% on Humanity's Last Exam with search enabled, making it particularly effective for research-heavy tasks that require current information from the web.
For organizations already embedded in Google Workspace and Google Cloud, Gemini integration is seamless. The ability to analyze data, generate reports, and automate workflows entirely within existing toolchains reduces friction and accelerates adoption.
Neither Claude nor GPT match Gemini's native multimodal fluency. If your work is primarily text and code, this doesn't matter. If you're constantly working across media types, it matters a lot.
Pricing: The Math That Actually Matters
Pricing differences between these models remain substantial enough to influence project economics significantly.
Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.5. For a project generating 10 million output tokens monthly, you pay approximately $250 with Claude.
GPT-5.2 costs $1.75 per million input tokens and $14 per million output tokens. The Pro variant increases output costs to $21 per million for maximum reasoning. The same 10-million-token project costs approximately $140 with GPT-5.2.
Gemini 3 Pro costs $2 per million input tokens and $12 per million output tokens for contexts under 200K tokens. Larger contexts cost $4/$18. The same project costs approximately $120 with Gemini at base rates, making it the most cost-effective frontier model.
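As a sanity check, the arithmetic behind those monthly estimates is simple; the sketch below covers output tokens only, since that's what the figures above count, and input costs add on top.

```python
# Output-only monthly cost for a workload generating 10M output tokens,
# using the per-million output rates quoted above.
RATES_PER_M_OUTPUT = {"Claude Opus 4.6": 25.00, "GPT-5.2": 14.00, "Gemini 3 Pro": 12.00}

for model, rate in RATES_PER_M_OUTPUT.items():
    print(f"{model}: ${10 * rate:,.0f}/month")
# Claude Opus 4.6: $250/month
# GPT-5.2: $140/month
# Gemini 3 Pro: $120/month
```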
Per Million Tokens Pricing
| Model | Input | Output | 10M Output/Month |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | ~$250 |
| GPT-5.2 | $1.75 | $14.00 | ~$140 |
| Gemini 3 Pro | $2.00 | $12.00 | ~$120 ✅ |
Extended Context Pricing (Claude Opus 4.6)
| Context Size | Input | Output |
|---|---|---|
| ≤200K tokens | $5.00 | $25.00 |
| >200K tokens | $7.50 | $37.50 |
Annual Enterprise Cost Estimate
| Model | Annual Cost | Best For |
|---|---|---|
| Claude Opus 4.6 | ~$150,000 | Code quality critical |
| Gemini 3 Pro | ~$70,000 | Multimodal + long docs |
| GPT-5.2 | ~$56,500 | Reasoning + general purpose |
Cost Optimization Strategies
| Strategy | Potential Savings |
|---|---|
| Context caching | 50-75% |
| Batch processing | 30-50% |
| Model routing | 70-80% |
| Token efficiency (Claude) | 22% fewer input tokens |
At enterprise scale, these differences compound quickly. One analysis estimated annual costs at roughly $56,500 for GPT-5.2, $150,000 for Claude Opus 4.6, and $70,000 for Gemini 3 Pro for comparable workloads.
However, raw token pricing doesn't tell the whole story. Claude's superior token efficiency (Anthropic reports 22% fewer input tokens and 12% fewer output tokens compared to competitors at similar quality) can offset the higher base rate on certain workloads. Context caching reduces costs by 50-75% when system prompts and reference documents repeat. Batch processing on Gemini saves 50%, and GPT-5.2 batch pricing saves 30% with a 24-hour latency trade-off.
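Context caching in particular is cheap to adopt. Here's a minimal sketch using the Anthropic Python SDK's prompt-caching markers; the model ID is hypothetical, and exact parameter names may vary by SDK version.

```python
# Minimal prompt-caching sketch: mark a large, reusable system prompt as
# cacheable so repeated calls reuse it instead of paying the full input rate.
# The model ID is hypothetical; check your provider's documentation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_reference = open("style_guide_and_codebase_summary.md").read()

response = client.messages.create(
    model="claude-opus-4-6",  # hypothetical model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_reference,
            "cache_control": {"type": "ephemeral"},  # cache this block across calls
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key conventions in the reference document."}],
)
print(response.content[0].text)
```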
The smart money uses model routing: Claude for coding-critical and enterprise tasks, GPT-5.2 for complex mathematical reasoning, and Gemini or cheaper models for high-volume, simpler queries. This blended approach can reduce costs by 70-80% compared to uniform premium model deployment while maintaining quality where it matters.
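A router doesn't need to be elaborate to capture most of those savings. Here's a bare-bones sketch; the task categories, model IDs, and default choice are all illustrative.

```python
# Bare-bones model routing: send each classified task type to the model the
# comparison above favors, with a cheap default for everything else.
# Model IDs and categories are illustrative, not official identifiers.
ROUTES = {
    "coding": "claude-opus-4.6",
    "enterprise_analysis": "claude-opus-4.6",
    "math": "gpt-5.2",
    "multimodal": "gemini-3-pro",
    "bulk_simple": "gemini-flash",
}

def pick_model(task_type: str) -> str:
    """Return the model to call for a classified task, defaulting to the cheap tier."""
    return ROUTES.get(task_type, "gemini-flash")

assert pick_model("coding") == "claude-opus-4.6"
assert pick_model("unclassified") == "gemini-flash"
```

In production this usually sits behind a lightweight classifier, but even a static mapping like this captures most of the cost benefit.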
Real-World Recommendations by Use Case
Rather than declaring a winner, here's how to match models to specific workflows in February 2026.
Recommended Routing Rules
| Task Type | Route To | Why |
|---|---|---|
| Code generation | Claude Opus 4.6 | 65.4% Terminal-Bench |
| Code review | Claude Opus 4.6 | Self-correction capability |
| Multi-file refactor | Claude Opus 4.6 | Agent Teams |
| Math problems | GPT-5.2 | 100% AIME |
| Abstract reasoning | Claude Opus 4.6 | 68.8% ARC-AGI-2 |
| Video analysis | Gemini 3 Pro | Native support |
| Long document QA | Claude Opus 4.6 | 76% @ 1M tokens |
| Simple queries | Gemini Flash | Cost savings |
| High volume tasks | DeepSeek | 90% cheaper |
| Financial analysis | Claude Opus 4.6 | 1,606 Elo GDPval |
| Legal research | Claude Opus 4.6 | 90.2% BigLaw Bench |
Cost Impact of Model Routing
| Approach | Annual Cost | Performance |
|---|---|---|
| Single Model (Claude) | $150,000 | Excellent for coding |
| Smart Routing | $30,000-$45,000 | Better overall |
| Savings | 70-80% | With improved results |
For software development teams where code quality impacts production: Claude Opus 4.6 delivers the highest agentic coding benchmark scores, superior code review capabilities, and can now coordinate multiple agents working in parallel. The premium pricing justifies itself if bugs or technical debt carry real costs.
For research and analysis requiring deep reasoning: The gap has narrowed, but GPT-5.2's perfect AIME score and strong ARC-AGI-2 performance make it the safer choice for pure mathematical and abstract problem-solving. For broader analytical work, Opus 4.6's 68.8% ARC-AGI-2 and 1,606 Elo on GDPval-AA make it highly competitive.
For document-heavy workflows processing legal, academic, or research materials: Claude Opus 4.6's million-token context window that actually maintains performance gives it the edge over Gemini's larger but less reliable context.
For multimodal applications involving images, video, and mixed media: Gemini 3 Pro's native multimodal capabilities outperform competitors, particularly for tasks requiring visual understanding alongside text and audio.
For budget-conscious teams handling high-volume requests: Gemini 3 Pro or Flash offers the best price-to-performance for general tasks. Consider DeepSeek for even lower costs on non-critical workloads.
For enterprise deployments prioritizing security and integration: Claude Opus 4.6's safety profile, Excel/PowerPoint integration, and Compaction API make it the most enterprise-ready option for organizations that need these specific capabilities.
Complete Comparison
| Category | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|
| Coding | 🥇 65.4% T-Bench | 🥈 64.7% | 🥉 54% |
| Math | 🥉 ~94% AIME | 🥇 100% | 🥈 ~95% |
| Novel Reasoning | 🥇 68.8% ARC | 🥈 52.9% | 🥉 31.1% |
| Enterprise | 🥇 1,606 Elo | 🥈 1,462 Elo | — |
| Context (usable) | 🥇 76% @ 1M | — | 🥉 26% @ 1M |
| Multimodal | 🥉 ~75% | 🥈 ~78% | 🥇 81% |
| Price | 🥉 $5/$25 | 🥈 $1.75/$14 | 🥇 $2/$12 |
| Security | 🥇 4.7% injection | 🥉 21.9% | 🥈 12.5% |
Winner Summary
| Model | Wins At | Best For |
|---|---|---|
| Claude Opus 4.6 | Coding, Enterprise, Context, Security, Novel Reasoning | Developers, enterprises, security-focused teams |
| GPT-5.2 | Math, Hallucination reduction, Price-performance | Researchers, analysts, math-heavy workflows |
| Gemini 3 Pro | Multimodal, Price, Google integration | Budget teams, multimodal work, Google users |
The 2026 Recommendation
| Principle | Action |
|---|---|
| Don't choose one model | Build a portfolio |
| Claude Opus 4.6 | Use where quality = money (coding, enterprise) |
| GPT-5.2 | Use where precision = critical (math, analysis) |
| Gemini 3 Pro | Use where cost = constraint (multimodal, budget) |
| Result | Better performance + 70-80% cost savings |
The most sophisticated approach: implement model routing that automatically selects the optimal model based on task type, complexity, and cost constraints. This multi-model strategy yields better results at lower cost than committing to any single provider.
FAQ
Which AI model is best for coding in 2026?
Claude Opus 4.6 leads in agentic coding benchmarks with 65.4% on Terminal-Bench 2.0 (the highest score ever recorded) and 80.8% on SWE-bench Verified. Its Agent Teams feature enables multiple AI instances to coordinate on complex codebases. GPT-5.2 follows closely and excels at structured, conventional code.
Is Claude Opus 4.6 better than GPT-5.2?
It depends on the task. Opus 4.6 leads in agentic coding (65.4% vs 64.7% Terminal-Bench), enterprise knowledge work (144 Elo points ahead on GDPval-AA), and novel problem-solving (68.8% vs 52.9% ARC-AGI-2). GPT-5.2 wins on pure math (100% vs ~94% AIME) and costs less ($1.75/$14 vs $5/$25 per million tokens).
What is the cheapest frontier AI model?
Gemini 3 Pro at $2/$12 per million tokens offers the best price-to-performance among frontier models. For even lower costs, Gemini Flash and DeepSeek provide strong capabilities at 60-90% lower prices with some performance trade-offs.
Which model has the largest usable context window?
Claude Opus 4.6 offers 1 million tokens that actually work, scoring 76% on MRCR v2 long-context retrieval. Gemini 3 Pro advertises 2 million tokens but scores only 26.3% on the same test at 1M tokens. Usable context beats advertised context.
Should I use multiple AI models?
Yes. Modern best practice involves model routing that selects the optimal model per task. Use Claude for coding and enterprise work, GPT-5.2 for mathematical reasoning, and Gemini for multimodal tasks. This approach delivers better results at 70-80% lower cost than single-model deployment.
What is new in Claude Opus 4.6?
Released February 5, 2026, Opus 4.6 introduces: 1M token context window (beta), Agent Teams for multi-agent coding, adaptive thinking with four effort levels, Compaction API for infinite conversations, 128K max output tokens, and native Excel/PowerPoint integration. Pricing remains $5/$25 per million tokens.
How much does Claude Opus 4.6 cost?
Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.5. For contexts exceeding 200K tokens, pricing increases to $7.50/$37.50. A project generating 10 million output tokens monthly costs approximately $250.
Is Gemini 3 Pro good for coding?
Gemini 3 Pro scored 76.2% on SWE-bench Verified and around 54% on Terminal-Bench 2.0, trailing both Claude (80.8%, 65.4%) and GPT-5.2 (80.0%, 64.7%). It generates concise code quickly but lacks the polish and depth of competitors. Best for routine tasks where speed matters more than sophistication.
What are Agent Teams in Claude Code?
Agent Teams allow multiple Claude instances to work in parallel on complex coding tasks through coordinated orchestration. In demonstrations, multi-agent Claude Code built a working C compiler (100,000 lines of code) that boots Linux on three CPU architectures. This enables autonomous software engineering at scale.
Which model has the best reasoning capabilities?
GPT-5.2 leads on pure mathematical reasoning (100% AIME 2025). Claude Opus 4.6 now leads on novel problem-solving (68.8% ARC-AGI-2, exceeding GPT-5.2's 52.9%) and enterprise reasoning (1,606 Elo on GDPval-AA). For most professional reasoning tasks, both models are highly capable with different strengths.