All analyses
Field Note

Claude Opus 4.7, what actually changes and what it means for your workflows.

Anthropic shipped Opus 4.7 yesterday, April 16, 2026. It is the direct successor to Opus 4.6, released last September. Available immediately on the API, Claude Code, Cursor and GitHub Copilot, identifier claude-opus-4-7. Pricing unchanged: $5 per million input tokens, $25 output.

We spent the day testing it on the workflows we run for clients. Here is what stands out, and what it actually changes.

The numbers that matter

Four benchmarks, four real gains:

  • SWE-Bench Verified (autonomous resolution of real GitHub bugs): 53.4% → 64.3%, a +10.9 point lift. For reference, GPT 5.4 sits at 57.7% on the same test.
  • CursorBench (agentic coding evaluated inside Cursor): 58% → 70%, +12 points.
  • XBOW visual-acuity (reading screenshots and dense diagrams): 54.5% → 98.5%. +44 points. The most brutal jump in the release.
  • Rakuten SWE-Bench (tasks from real enterprise codebases): on autonomous resolution.

Average gain across coding benchmarks lands around +13%.

The scoop: document understanding

80.6% vs 51.1% for GPT 5.4 on OfficeQA Pro.

The single largest gap in the announcement. +29 points on a benchmark that measures how accurately a model can answer specific questions on dense professional documents: contracts, tax filings, annual reports, legal clauses.

For the firms we work with in accounting, legal, tax and consulting, this is the most directly actionable axis. Most enterprise RAG projects that plateau, plateau exactly here: precision of extraction from dense, badly scanned, multi-column PDFs. This gap explains why Anthropic keeps winning enterprise accounts against OpenAI on these verticals.

Effort × score: the real economic gain

What Anthropic calls "same quality, fewer tokens consumed" is the detail few people read, and probably the most important one for an API budget.

Concretely: Opus 4.7 on low reaches what Opus 4.6 did on medium. For many production pipelines, that means dropping one notch on effort without losing quality. The bill drops, latency too.

And the ceiling rises: on high, 4.7 surpasses the best score 4.6 ever reached.

Three frictions fixed

Beyond the scores, three painful 4.6 behaviors are now resolved:

  1. Infinite loops. 4.6 could get stuck re-reading the same file or repeating the same action, burning budget without progress. 4.7 escapes them in the vast majority of cases.
  2. Hallucinated data. When information is missing from context, the model now flags it instead of filling in a plausible answer. Less downstream correction, less manual eval.
  3. Systematic over-reasoning. 4.6 triggered Extended Thinking even on trivial requests. 4.7 modulates depth on its own.

Add to that a lower score on Anthropic's internal Misaligned Behavior benchmark: less lying, less sycophancy, less cheating. Measurable, in the spec sheet, and a signal we take.

Adaptive Thinking, the new reasoning mode

The most visible architectural change. Where Extended Thinking was a binary toggle with a visible chain of thought, Adaptive Thinking is built into the model and modulated automatically.

  • Simple question → fast answer.
  • Complex task → deep reasoning, internally.
  • No chain of thought to render or load front-side.
  • The arbitration is done by the model, not by the developer.

In practice: one fewer parameter on the API side. You let it run. The right move for agent workflows, where multiplying modes manually costs in complexity.

Mythos, the detail that says a lot

In the official announcement, Anthropic mentions Mythos: an internal model with performance above Opus 4.7, not opened to the public. Access reserved for selected partners.

To our knowledge, this is the first documented case of an AI lab commercializing a model while publicly acknowledging that a superior model already exists internally.

Strategic read: competitive pressure on OpenAI and Google can no longer be measured by public benchmarks alone. It is also measured by the gap between what a lab ships and what it keeps in reserve.

What we take away for clients

Five points actionable today:

  1. Direct migration without re-budgeting. Same price, +13% average gain. Trivial decision for any pipeline already in flight.
  2. Revisit effort levels. Workflows running on medium with 4.6 deserve a test on low with 4.7. Token savings to expect.
  3. Re-test document RAG pipelines. The +29 point gain on OfficeQA Pro changes the build vs specialized parser arbitration.
  4. Simplify agents. Adaptive Thinking absorbs a layer of logic you used to write by hand.
  5. Update evaluation harnesses. If you benchmark models since 4.6, your golden dataset deserves a refresh: 4.7 regressions are not where 4.6 regressions were.

We will publish a more detailed field report within two weeks, based on the pipelines we are migrating this week. If you currently run a Claude system in production and want us to look at what migration would imply, reach out.