Claude Sonnet 4.6 Released! Evolution from Sonnet 4.5 and the Power of "Best Cost-Performance" from an Engineer's Perspective
Hello. On February 17, 2026, Anthropic released the latest model in the Claude Sonnet line, "Claude Sonnet 4.6."
I just recently wrote an Opus 4.6 release article, and now we already have an update to the Sonnet line. To be honest, when I was writing the Opus 4.6 article, I had a feeling that "Sonnet would probably be coming soon too," and here it is.
In this article, based on the official announcement and the system card, I will organize the features of Sonnet 4.6 and the changes from Sonnet 4.5 from an engineer's perspective.
What kind of model is Claude Sonnet?
Claude has three model lineups: Haiku, Sonnet, and Opus. Sonnet is the "balanced model" positioned in the middle, characterized by its emphasis on balancing performance and cost.
While Sonnet 4.5 was a sufficiently practical model, it did pale in comparison to Opus-class models in some aspects. This new Sonnet 4.6 significantly closes that gap. In some benchmarks, it even surpasses Opus 4.5.
Key Evolution Points of Sonnet 4.6
1. Significant Improvement in Coding Ability
Let's start with what engineers care about most.
To state the conclusion first: Sonnet 4.6 naturally does not reach the level of the recently released top-tier Opus 4.6; as the comparison table later shows, Opus 4.6 remains a step above in benchmarks. The point worth noting, however, is that Sonnet 4.6 surpasses Opus 4.5, the flagship of the previous generation.
According to the official announcement, when early-access developers compared them using Claude Code, about 70% preferred Sonnet 4.6 over Sonnet 4.5. Furthermore, even compared to Opus 4.5—the strongest model as of November 2025—Sonnet 4.6 was preferred 59% of the time. In other words, more than half of developers felt that "Sonnet 4.6 is better than the top-tier Opus of the previous generation."
The fact that Sonnet delivers performance exceeding the previous generation's Opus at its price point ($3/$15) is quite impactful.
Summarizing user feedback on specific improvements:
- It now thoroughly reads the existing context before modifying code.
- It properly modularizes common logic instead of duplicating it.
- Over-engineering and "cutting corners" have decreased.
- Instruction following has improved.
- False reports of "success" and hallucinations have decreased.
- Consistency in multi-step tasks has improved.
In particular, "reading the existing context properly before modifying" addresses something I feel strongly about when using Claude Code or Cursor daily. Anyone who has watched an AI write new code while ignoring the intent of the existing code will welcome this improvement.
2. "Performance Rivaling Opus" in Benchmarks
Looking at specific numbers, it's clear how close Sonnet 4.6 has come to the Opus class.
| Benchmark | Sonnet 4.6 | Opus 4.6 | Sonnet 4.5 | Overview |
|---|---|---|---|---|
| SWE-bench Verified | 79.6% | 80.8% | 77.2% | Real-world software bug fixing tasks |
| Terminal-Bench 2.0 | 59.1% | 65.4% | 51.0% | Coding tasks including terminal operations |
| OSWorld-Verified | 72.5% | 72.7% | 61.4% | PC operation tasks in real environments |
| τ²-bench (Retail) | 91.7% | 91.9% | 86.2% | Customer service handling |
| GDPval-AA | 1633 | 1606 | 1276 | Knowledge work tasks (Elo rating) |
| GPQA Diamond | 89.9% | 91.3% | 83.4% | Graduate-level science problems |
| ARC-AGI-2 | 58.3% | 68.8% | 13.6% | Reasoning ability for novel patterns |
*Note: For ARC-AGI-2, Sonnet 4.6 reportedly reached 60.4% under a higher-effort setting; the 58.3% in the table is the value at the standard setting.
What we can read from this table is that "depending on the area of expertise, there are clear cases where Sonnet is sufficient and cases where you should choose Opus." Let's look at specific examples.
Cases where Sonnet 4.6 is sufficient (or even better suited):
- Daily bug fixes and feature additions (SWE-bench: 79.6% vs 80.8%) — The difference is only 1.2 points. For tasks like "handing over an issue and having a PR created," it's rational to use the lower-cost Sonnet.
- Automation of browser operations and form inputs (OSWorld: 72.5% vs 72.7%) — Virtually the same score. Sonnet is sufficient for computer operation tasks like E2E test automation or internal system automation.
- Document creation and proposal drafting (GDPval-AA: 1633 vs 1606) — Sonnet actually has a higher score. Practical office work is a strength of Sonnet.
- Building Customer Support Bots (τ²-bench: 91.7% vs 91.9%) — Almost identical. For support agents handling large volumes of requests, Sonnet is the clear choice due to the cost difference.
Cases where you should choose Opus 4.6:
- Agents involving complex terminal operations (Terminal-Bench: 59.1% vs 65.4%) — A difference of over 6 points. In scenarios requiring accurate execution of long command chains, such as building CI/CD pipelines or infrastructure automation, the stability of Opus is desirable.
- Advanced reasoning for unknown patterns (ARC-AGI-2: 58.3% vs 68.8%) — A difference of over 10 points. Opus's reasoning power shines when tackling unprecedented architectural designs or difficult problems where existing solutions don't apply.
- Professional judgment in difficult fields (GPQA Diamond: 89.9% vs 91.3%) — While the difference is small, in areas like medicine, law, or science where "mistakes are irreversible," choosing the slightly more accurate Opus provides peace of mind.
Of course, these are just trends based on benchmark numbers. In actual use, impressions can change drastically depending on how prompts are written or the nature of the task. It's best to treat these as a guide for "which one to try first" and then test both for your specific use case.
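As a rough illustration of this "which to try first" guidance, here is a minimal routing sketch. The task category names and the Opus model ID (`claude-opus-4-6`) are my own assumptions for illustration; only `claude-sonnet-4-6` appears in the official announcement.

```python
# Minimal sketch: route tasks to Sonnet or Opus following the benchmark
# trends discussed above. Category names and the Opus model ID are
# illustrative assumptions, not official identifiers.

SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-6"  # assumed ID, following the Sonnet naming pattern

# Tasks where Opus holds a clear lead (Terminal-Bench, ARC-AGI-2, GPQA).
OPUS_FIRST = {"infra_automation", "novel_architecture", "expert_domain"}

def pick_model(task_category: str) -> str:
    """Return the model to try first for a given task category."""
    if task_category in OPUS_FIRST:
        return OPUS
    # Everywhere the gap is negligible, start on the cheaper Sonnet
    # and escalate to Opus only if the result falls short.
    return SONNET

print(pick_model("bug_fix"))          # -> claude-sonnet-4-6
print(pick_model("infra_automation")) # -> claude-opus-4-6
```

The design choice here is simply "default to the cheap model, escalate on the known weak spots," which matches the benchmark table rather than any official routing guidance.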
3. 1-Million Token Context Window (Beta)
Sonnet 4.6 supports a 1-million token context window (beta). While the 1M context was available in beta for Sonnet 4.5, the practical difference in 4.6 is that reasoning quality under long context has been further improved.
It's not just about being able to include an entire codebase, long contracts, or dozens of papers in a single request; it's about being able to reason effectively across that entire context.
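As a sketch of what that looks like in practice, here is one way to assemble a long-context request for the official Python SDK (`anthropic`). The beta header value is an assumption carried over from the Sonnet 4.5-era long-context beta; verify it against the current documentation before relying on it.

```python
# Sketch: packing several large files into one request to use the
# 1M-token context window (beta). The beta header value is an assumption
# carried over from the Sonnet 4.5 era; check the current docs.

def build_long_context_request(files: dict[str, str], question: str) -> dict:
    """Build kwargs for client.messages.create() with the whole corpus inline."""
    corpus = "\n\n".join(
        f"=== {path} ===\n{content}" for path, content in files.items()
    )
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 2048,
        # Opt in to the 1M-token context window (header name assumed).
        "extra_headers": {"anthropic-beta": "context-1m-2025-08-07"},
        "messages": [
            {"role": "user", "content": f"{corpus}\n\nQuestion: {question}"}
        ],
    }

req = build_long_context_request(
    {"src/app.py": "print('hello')"}, "What does this code do?"
)
# To send it: anthropic.Anthropic().messages.create(**req)
```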
An interesting example introduced officially is the Vending-Bench Arena. This is a benchmark where AI models compete for profit by running a (simulated) business over a long period. Sonnet 4.6 reportedly pulled far ahead of other models by devising a unique strategy: spending heavily on capital investment for the first 10 months and then shifting sharply to profitability in the final stages.
Being able to plan and judge with such a long-term perspective is a direct benefit of the long context.
4. Significant Evolution in Computer Use
Another highlight of Sonnet 4.6 is the evolution of Computer Use.
As mentioned earlier, it recorded 72.5% on OSWorld-Verified. When computer use debuted with Claude 3.5 Sonnet in October 2024, the score was in the 10% range. Reaching the 70% range in about 16 months is an incredible rate of improvement.
Developers who used it early have reported "human-level operational capability" in navigating complex spreadsheets and multi-step web form entries. It is also reportedly capable of tasks like aggregating information across multiple browser tabs.
At the same time, resistance to prompt injection attacks has been significantly improved from Sonnet 4.5. Since computer use involves security risks, this improvement is very important for real-world operations.
5. Adaptive Thinking and Effort Control
Adaptive Thinking (the feature where the model adjusts its own depth of thought), which was introduced in Opus 4.6, is also supported in Sonnet 4.6.
It also supports the conventional extended thinking mode, allowing you to choose based on your needs.
Official advice suggests that "Sonnet 4.6 performs strongly even with extended thinking turned OFF, so we want users to experiment with various settings to find the balance between speed and performance."
In other words, you can naturally use it in stages: run it fast with thinking off, raise the effort for difficult tasks, and switch to Opus 4.6 when maximum reasoning power is required.
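That staged approach maps directly onto request settings. The `thinking` parameter below follows the publicly documented extended-thinking API shape; the specific budgets and the Opus model ID are illustrative assumptions.

```python
# Sketch: three request configurations for a staged escalation strategy.
# The `thinking` parameter shape follows the documented extended-thinking
# API; the budgets and the Opus model ID are illustrative assumptions.

def fast_request(prompt: str) -> dict:
    """Speed first: Sonnet 4.6 with extended thinking off."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

def hard_request(prompt: str) -> dict:
    """Harder task: same model, with an extended-thinking budget."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 8192,  # must exceed the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": 4096},
        "messages": [{"role": "user", "content": prompt}],
    }

def max_request(prompt: str) -> dict:
    """Maximum reasoning: escalate to Opus 4.6 (model ID assumed)."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 32000,
        "thinking": {"type": "enabled", "budget_tokens": 16000},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Each dict can be passed straight to `client.messages.create(**...)`; the point is that escalation is a one-line change rather than a rewrite.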
6. Context Compaction (Beta)
Similar to Opus 4.6, Context Compaction—which automatically summarizes old content when the context approaches its limit—is available in beta.
The issue of context overflow during long-running agent tasks is a daily occurrence for developers. This feature should significantly reduce the frequency of needing to reset conversations mid-session.
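While the feature is in beta, the same idea can be approximated by hand. The sketch below keeps the most recent messages verbatim and collapses older ones into a single placeholder; the 4-characters-per-token estimate and the stub summary are deliberately crude stand-ins, not how the official feature works.

```python
# Sketch: a hand-rolled fallback for context compaction. Older messages
# are collapsed into one summary message once the estimated token count
# crosses a threshold. The token estimate is a rough stand-in.

def estimate_tokens(messages: list[dict]) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages: list[dict], limit: int, keep_recent: int = 4) -> list[dict]:
    """Collapse old messages into one summary when the limit is approached."""
    if estimate_tokens(messages) < limit or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # In practice you would ask the model to summarize `old`; here we
    # only note what was dropped, to keep the sketch self-contained.
    summary = {
        "role": "user",
        "content": f"[Summary of {len(old)} earlier messages omitted]",
    }
    return [summary] + recent

history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = compact(history, limit=500)
print(len(compacted))  # -> 5 (one summary + four recent messages)
```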
Sonnet 4.5 vs Sonnet 4.6 Comparison Table
Here is a summary of the specs that matter to engineers.
| Item | Sonnet 4.5 | Sonnet 4.6 |
|---|---|---|
| Context Window | 200K (Standard) + 1M (Beta) | 200K (Standard) + 1M (Beta) |
| Adaptive Thinking | No | Yes |
| Extended Thinking | Yes | Yes |
| Computer Use | Supported (OSWorld 61.4%) | Significantly Improved (OSWorld 72.5%) |
| SWE-bench Verified | 77.2% | 79.6% |
| Pricing (Input/Output) | $3/$15 per 1M tokens | $3/$15 per 1M tokens (Unchanged) |
*Source (Official): Introducing Sonnet 4.6 / Claude Sonnet 4.6 System Card / Models overview
The fact that the price remains unchanged from Sonnet 4.5 at $3/$15 per 1M tokens is an astounding cost-performance ratio considering this performance boost. It is significantly cheaper than Opus 4.6's $5/$25.
The Positioning of Sonnet 4.6 — "More Choices"
Personally, I intend to continue using Opus 4.6 as my main model. Opus still has the edge in depth of reasoning and stability for agent operations, and I can feel the difference in "make-or-break" coding scenarios.
However, the arrival of Sonnet 4.6 is significant because it creates the choice: "Do I really need to use Opus for every task?" For example, in the following cases, it's perfectly viable to delegate to Sonnet 4.6 to save costs:
- Agents running in high volume — Nearly equivalent performance with 40% lower costs for both input and output. The difference adds up the more you run them.
- Frontend development — Its design sense has been particularly praised by partner companies.
- Document understanding — Equivalent score to Opus 4.6 in OfficeQA.
- Computer Use — Virtually the same score as Opus 4.6 in OSWorld.
On the other hand, for tasks requiring the deepest reasoning power, such as refactoring large codebases or coordinating multi-agent workflows, the official view is that Opus 4.6 still wins. In my own development style, these scenarios are more common, so my focus on Opus likely won't change, but I'm happy to be able to optimize total costs by mixing in Sonnet where appropriate.
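The cost side of this trade-off is easy to quantify with the published prices ($3/$15 per 1M tokens for Sonnet 4.6, $5/$25 for Opus 4.6). The Opus model ID in the snippet is an assumption for illustration.

```python
# Cost comparison using the published per-1M-token prices:
# Sonnet 4.6: $3 input / $15 output; Opus 4.6: $5 input / $25 output.

PRICES = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (5.00, 25.00),  # Opus model ID assumed
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: an agent run consuming 200K input and 20K output tokens.
sonnet = cost_usd("claude-sonnet-4-6", 200_000, 20_000)
opus = cost_usd("claude-opus-4-6", 200_000, 20_000)
print(f"Sonnet: ${sonnet:.2f}, Opus: ${opus:.2f}")  # Sonnet: $0.90, Opus: $1.50
print(f"Savings: {1 - sonnet / opus:.0%}")          # Savings: 40%
```

Because both the input and output prices are exactly 40% lower, the saving holds at any input/output ratio, which is what makes the high-volume agent case so clear-cut.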
Product Updates
Along with the release of Sonnet 4.6, several product updates were announced.
Expansion of Free Plans
Sonnet 4.6 is now the default model for Free and Pro plans on claude.ai and Claude Cowork. Furthermore, features like file creation, connectors, skills, and Compaction are now available even on the Free plan.
This is a more significant change than it may appear: even free users can now have a genuinely practical development experience.
API Tools reach General Availability (GA)
The following API tools, which were previously in beta, are now GA:
- Code Execution
- Memory
- Programmatic Tool Calling
- Tool Search
- Tool Use Examples
Additionally, a mechanism has been added where Web Search and Fetch tools automatically execute code to filter search results, improving both response quality and token efficiency.
MCP Connector Support in Claude in Excel
The Excel add-in now supports MCP connectors, allowing users to directly reference data from external tools (S&P Global, LSEG, PitchBook, etc.) from within Excel. MCP connectors already configured in claude.ai can be used as-is.
About Safety
Sonnet 4.6 reportedly maintains safety levels equal to or higher than previous Claude models.
In the system card, safety researchers evaluated Sonnet 4.6 as having a "warm, honest, social, and sometimes humorous character, very strong safety behavior, and no signs of high-risk misalignment."
In some alignment metrics (which measure how well the model's behavior matches human intent, e.g., not lying, not sycophantically endorsing a user's wrong answer, and not acting independently beyond instructions), it reportedly recorded the best scores of any Claude model to date.
Also noteworthy is that prompt injection resistance during computer use has significantly improved from Sonnet 4.5, reaching a level equivalent to Opus 4.6.
Evaluations from Partners
The official announcement included comments from many partner companies. Here are some that caught my eye as an engineer:
> Claude Sonnet 4.6 delivers frontier-level results on complex app builds and bug-fixing. It's becoming our go-to for the kind of deep codebase work that used to require more expensive models.
The comment that Sonnet 4.6 is becoming the first choice for deep codebase work that previously required more expensive models is impressive.
> Claude Sonnet 4.6 produced the best iOS code we've tested for Rakuten AI. Better spec compliance, better architecture, and it reached for modern tooling we didn't ask for, all in one shot. The results genuinely surprised us.
From Rakuten AI: "It produced the best iOS code we've tested," "Better spec compliance and architecture," and "it used modern tools we didn't even ask for." Being able to output high-quality code in one shot directly leads to efficiency in development workflows.
> Claude Sonnet 4.6 has perfect design taste when building frontend pages and data reports, and it requires far less hand-holding to get there than anything we've tested before.
An evaluation of "perfect design taste" for frontend pages and data reports. Improved UI generation quality seems to be something many users have reported independently.
Summary
Claude Sonnet 4.6 was an update that fundamentally overturned the perception that "Sonnet is a budget sub-model."
- Coding ability surpasses Opus 4.5 and approaches Opus 4.6 levels.
- Virtually equivalent to Opus 4.6 in OSWorld and document understanding.
- 1-million token context window (beta).
- Supports Adaptive Thinking and Effort control.
- Available as the default model even on the Free plan.
- Pricing remains unchanged despite all these evolutions.
The dynamic of "expensive and strong Opus" vs "cheap and decent Sonnet" might be a thing of the past. Sonnet 4.6 is a "cheap and very strong" model.
It is available via the API as claude-sonnet-4-6, and you can try it starting from the Free plan on claude.ai. It might be interesting to see the difference from Opus 4.6 for yourself.