When Claude Builds Claude: Inside Anthropic's Recursion

2026-06-07

This week, Anthropic did something unusual for an AI company: it published a detailed, data-rich account of how fast its own AI is taking over AI development - and then, 24 hours later, called for a global pause on exactly that trajectory.

The numbers are staggering. As of May 2026, more than 80% of the code merged into Anthropic's codebase is authored by Claude. Before Claude Code launched in February 2025, that share was in the low single digits. Anthropic engineers now ship 8× as much code per quarter as they did from 2021 to 2025. And the trajectory isn't leveling off - it's accelerating.

The acceleration curve is steepening

Anthropic's internal data shows that the length of tasks Claude can reliably complete autonomously has been doubling every four months, up from an earlier trend of doubling every seven months. In March 2024, Claude Opus 3 could handle tasks that take a human about four minutes. A year later, Claude Sonnet 3.7 managed hour-and-a-half tasks. By early 2026, Claude Opus 4.6 was completing 12-hour software tasks without human intervention.
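To see how a doubling time falls out of data points like these, here is a minimal Python sketch. The task lengths are the ones quoted above; the 12-month gaps between releases are our own rough assumption (the dates above are only approximate), and the helper name doubling_time_months is hypothetical. Two-point estimates like this are noisy, so treat the results as a sanity check rather than a reproduction of Anthropic's trend fit.

```python
import math


def doubling_time_months(start_minutes, end_minutes, elapsed_months):
    """Months per doubling implied by two task-length measurements."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


# Task lengths from the article, in minutes.
# The 12-month gaps are assumptions, not figures from Anthropic.
opus3_to_sonnet37 = doubling_time_months(4, 90, 12)
sonnet37_to_opus46 = doubling_time_months(90, 12 * 60, 12)

print(f"Opus 3 -> Sonnet 3.7: {opus3_to_sonnet37:.1f} months per doubling")
print(f"Sonnet 3.7 -> Opus 4.6: {sonnet37_to_opus46:.1f} months per doubling")
```

Under those assumed gaps, the second stretch (90 minutes to 12 hours) works out to exactly three doublings in twelve months, or one every four months. The first stretch comes out faster, at roughly 2.7 months per doubling, which shows how sensitive a two-point estimate is to the exact dates and task lengths chosen.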

If the current trend holds, tasks that take a skilled person days could come into range this year. In 2027, AI systems could be capable of weeks-long projects.

On public benchmarks, the story is the same. SWE-bench - the standard test of real-world software engineering - went from single-digit scores to saturation in two years. CORE-Bench, which tests whether a model can reproduce published research results, went from 20% success to saturation in fifteen months. METR, the organization that measures long-duration task capability, found that Claude Mythos Preview could work autonomously for "at least" 16 hours - hitting the ceiling of what their benchmark can measure.

The models writing the code are also the models topping the leaderboard

Anthropic models on lmrank.com:

Claude Opus 4.8 - 9.7 (#1 overall)
Claude Opus 4.7 - 9.6 (#2 overall)
Claude Opus 4.5 - 9.5 (#3 overall)
Claude Sonnet 4.6 - 9.0 (#11 overall)
Claude Sonnet 4 - 8.8 (#14 overall)

This isn't theoretical. Anthropic shared concrete examples: in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The human engineer overseeing the work estimated it would have taken four years to complete manually. On the most open-ended tasks, Claude's success rate hit 76% in May 2026 - up 50 percentage points in six months.

The gap that still matters

For all the jaw-dropping capability gains, Anthropic was careful to identify where Claude still falls short: judgment. Claude can execute a well-specified experiment, matching or outperforming skilled humans at carrying out the work. But it struggles with choosing which experiments to run, what goals to set, and which problems are worth solving. As Anthropic put it: "That's the gap between AI today and a future system that could autonomously design its own successor."

This is also why Anthropic's simultaneous call for a global AI development pause is so striking. The company is showing the world that the acceleration is real, that it's measured in months, not years, and that the gap between "Claude helps us code" and "Claude designs the next Claude" is narrowing fast - while also being the one saying we should slow this down.

What this means for the AI model landscape

Anthropic's disclosure changes how we should think about model rankings. When the company that occupies the top three positions on lmrank.com reveals that its models are increasingly responsible for building themselves, the feedback loop becomes visible. Better models → write more code → build better models. The question isn't whether Anthropic holds the lead today. It's whether this self-reinforcing cycle makes the lead unassailable.

At the same time, Anthropic's pause call signals that even the leaders are uncomfortable with the velocity. The company isn't just publishing a capability brag - it's publishing a warning, backed by internal data that few others have access to.

Other labs are surely watching. OpenAI's GPT-5.5 (9.4) and DeepSeek V4 Pro (9.0) occupy the #4 and #8 spots respectively - both frontier models that could benefit from similar self-improvement loops. If Anthropic's data generalizes, the entire frontier tier may be accelerating on a similar curve.

The recursion question

Here's the uncomfortable question Anthropic's post raises but doesn't fully answer: if Claude writes 80% of Anthropic's code, and Claude's own training was managed by infrastructure that Claude helped write, and the next Claude will be designed with even more Claude assistance - at what point does the recursion become genuinely self-sustaining?

Anthropic's honest answer: they don't know exactly, but it's closer than most institutions are prepared for. The task horizon trend suggests that within 12–18 months, AI systems could handle multi-week engineering projects autonomously. Multi-week projects are how you build model training infrastructure. Multi-week projects are how you design architectures. Multi-week projects are how you close the loop.
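The arithmetic behind that 12–18 month window can be sketched in a few lines of Python. This is a back-of-the-envelope projection, not Anthropic's model: it assumes a 12-hour starting point, a steady four-month doubling time, and a 40-hour work week for converting hours to weeks. The function name projected_task_hours and the constant WORK_WEEK_HOURS are ours.

```python
def projected_task_hours(current_hours, months_ahead, doubling_months=4.0):
    """Task length after months_ahead, assuming steady exponential doubling."""
    return current_hours * 2 ** (months_ahead / doubling_months)


WORK_WEEK_HOURS = 40  # assumption: convert task hours to human work weeks

for months in (12, 18):
    hours = projected_task_hours(12, months)
    weeks = hours / WORK_WEEK_HOURS
    print(f"+{months} months: ~{hours:.0f} hours (~{weeks:.1f} work weeks)")
```

Under these assumptions, a 12-hour task horizon grows to about 96 hours (roughly 2.4 work weeks) after 12 months and about 272 hours (roughly 6.8 work weeks) after 18 months. That is where "multi-week" comes from, and it only holds if the doubling rate does.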

At lmrank.com, we track live scores across 50+ AI models. When the models start building themselves, the leaderboard gets very interesting. Follow along at lmrank.com/category/overall/.