
Who Decides What, and How Does AI Execute? — Reading the Division of Labor from Anthropic's Claude Code Analysis

Futoshi Okazaki · 16 min read

Introduction

In debates about AI coding, one question comes up constantly: “How much can we actually hand off to AI?”

But the question that really matters in practice is a little more concrete.

Who decides what, and how does AI execute?

Anthropic’s report “Agentic coding and persistent returns to expertise,” published in June 2026, is an important primary source for thinking this through. It analyzes how Claude Code is actually used, drawing on roughly 400,000 Claude Code sessions from about 235,000 users between October 2025 and April 2026.

Lately the same trend has been discussed under the phrase “Loop Engineering.” As far as I can verify, however, Anthropic has not published an official report under that name. This article sets the buzzword aside and lays out the division of labor between humans and AI that emerges from Anthropic’s official data.


Humans decide “what to do”

The most telling part of Anthropic’s analysis is how decisions are split.

The report divides the decisions made within a Claude Code session into planning decisions and execution decisions. Planning is “what to do,” “which approach to take,” “what counts as done.” Execution is decisions like “which files to change,” “what code to write,” “which commands to run.”

The result: in a typical session, humans made about 70% of the planning decisions and about 20% of the execution decisions. Flip that around, and Claude was making roughly 80% of the execution decisions.

This is not a story of “AI replaced humans.”

It is, rather, a story of roles beginning to separate. Humans decide the goal, the constraints, the definition of done, the priorities, the risk tolerance. AI reads files, writes code, runs commands, and revises as it watches the results — all within that boundary.

What rises in value in AI-era engineering isn’t issuing detailed, step-by-step instructions; it’s designing the right problem and a verifiable definition of done.

AI is taking on a wide span of “how to execute”

The same report also classifies the kinds of work Claude Code does.

About 56% of sessions involved writing, modifying, testing, or orchestrating code. A further 17% were software operations, 14% were planning or exploration, and 13% were work like data analysis or writing — where code is not the deliverable, or only a secondary one.

What this shows is that AI coding has moved well beyond “code completion.”

Claude Code runs an average of about 10 actions per user prompt, and in some cases chains more than 100 actions together. It reads files, produces diffs, runs commands, looks at the results, and decides the next move. Rather than waiting for a human to specify each step, the AI unfolds a single request into many steps on its own.

The job left to the human here is not to say “now read this file next.”

It is to decide the range the AI may explore. To decide the criteria for backing out when something fails. To prepare the tests that must pass, the screens to compare, the design principles to uphold.

Anthropic’s Claude Code best practices likewise stress giving Claude the means to verify its own work — tests, builds, screenshot comparisons. Don’t let the AI stop at “it looks done”; make it read the pass/fail signal. That is the precondition for delegating execution to AI.
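As a minimal sketch of what reading the pass/fail signal can mean in practice, the Python below runs a verification command and judges it by its exit code rather than by how the output looks. The helper name `run_check` and the `CheckResult` type are hypothetical, and the two demo commands are stand-ins for a project's real test or build command; none of this is taken from Anthropic's documentation.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one verification command (hypothetical helper type)."""
    passed: bool
    exit_code: int
    output: str


def run_check(command: list[str], timeout_s: float = 300.0) -> CheckResult:
    """Run a verification command and judge it by exit code, not by its text."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A check that never finishes counts as a failure, not as "probably fine".
        return CheckResult(passed=False, exit_code=-1, output="timed out")
    return CheckResult(
        passed=proc.returncode == 0,
        exit_code=proc.returncode,
        output=proc.stdout + proc.stderr,
    )


# Stand-ins for a real test or build command; replace with the project's own.
ok = run_check([sys.executable, "-c", "print('all tests passed')"])
bad = run_check([sys.executable, "-c", "import sys; print('looks done'); sys.exit(1)"])
print(ok.passed, bad.passed)
```

The second command prints reassuring text and still fails, which is the case a text-only check would miss.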

The more experienced the user, the more work they have AI do

What’s interesting is that experienced users don’t use AI less — they use it more.

In Anthropic’s analysis, the higher a user’s expertise, the more work Claude executes from a single instruction. In sessions classified as beginner, Claude ran about 5 actions per prompt and produced about 600 words. In high-expertise sessions, it was about 12 actions and about 3,200 words.
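To make the gap concrete, here is a small worked calculation on the figures quoted above. The ratios and the words-per-action values are derived here from the rounded numbers in this article, not reported by Anthropic, so treat them as rough.

```python
# Figures quoted above from Anthropic's report (approximate values).
beginner = {"actions": 5, "words": 600}
expert = {"actions": 12, "words": 3200}

action_ratio = expert["actions"] / beginner["actions"]
word_ratio = expert["words"] / beginner["words"]

# Derived in this sketch, not a figure from the report: output volume per action.
beginner_words_per_action = beginner["words"] / beginner["actions"]
expert_words_per_action = expert["words"] / expert["actions"]

print(f"actions per prompt: {action_ratio:.1f}x")
print(f"words per prompt:   {word_ratio:.1f}x")
print(f"words per action:   {beginner_words_per_action:.0f} vs {expert_words_per_action:.0f}")
```

On these numbers, a high-expertise prompt triggers about 2.4 times as many actions and about 5.3 times as much output, so each action also carries more work.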

This seems counterintuitive, but in practice it’s natural.

The more expertise someone has, the more concrete the problem they hand to AI becomes. What to avoid, what to verify, how far to delegate — all of it gets clearer. So the AI can carry a longer chain of execution forward.

The outcomes differ too. Anthropic classifies whether a session succeeded from verifiable signals — tests passing, commits, pull requests, explicit confirmation from the user. By that strict bar, beginner sessions showed a verified success rate of 15%, while intermediate-and-above sessions showed 28–33%.
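The idea of counting success only from verifiable signals can be sketched as follows. This is an illustration under stated assumptions, not Anthropic's classifier: the field names are hypothetical, the rule that any one signal is enough is an assumption of this sketch, and the sample sessions are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionSignals:
    """Hypothetical per-session signals; the field names are assumptions."""
    tests_passed: bool = False
    committed: bool = False
    pull_request_opened: bool = False
    user_confirmed: bool = False


def is_verified_success(s: SessionSignals) -> bool:
    """Count a session as successful only if at least one verifiable signal is present."""
    return any((s.tests_passed, s.committed, s.pull_request_opened, s.user_confirmed))


def verified_success_rate(sessions: list[SessionSignals]) -> float:
    """Share of sessions with a verifiable success signal (0.0 for an empty list)."""
    if not sessions:
        return 0.0
    return sum(is_verified_success(s) for s in sessions) / len(sessions)


# Made-up sessions, only to exercise the functions.
sample = [
    SessionSignals(tests_passed=True),
    SessionSignals(),  # the work may have been fine, but nothing verifies it
    SessionSignals(committed=True, pull_request_opened=True),
    SessionSignals(),
]
print(verified_success_rate(sample))
```

A bar like this undercounts: a session that went well but left no signal is not counted, which is one reason the reported rates look low.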

The important thing is that the conclusion is not “only people who can write code can use AI.”

In the report, users outside software roles show success rates close to those of software professionals on code-generating sessions. What makes the difference is understanding of the domain, more than the job title. Accounting, legal, operations, research — in any field, the person who understands the structure of the problem is more likely to guide AI well.

Not full delegation: supervision and verification remain

Anthropic’s internal study reinforces this point.

The report “How AI is transforming work at Anthropic” analyzes a survey of 132 Anthropic engineers and researchers conducted in August 2025, along with 53 interviews and internal Claude Code usage data.

In it, employees use Claude for about 60% of their work and self-report an average productivity gain of about 50%. At the same time, more than half of employees answered that the share of work they can “fully delegate” to Claude is only 0–20%.

This combination is what matters.

AI has worked its way into a fairly wide range of day-to-day work. But that doesn’t mean human supervision and verification have become unnecessary. If anything, the high productivity depends on human judgment: choosing what to delegate, checking in partway, and verifying the output.

Another official Anthropic report, “Measuring AI agent autonomy in practice,” also shows Claude Code’s autonomy growing. Among the longest turns, the time Claude Code keeps working within a single turn rose from under 25 minutes to over 45 minutes over the three months from October 2025 to January 2026. And full auto-approve usage rose from about 20% among light users (under 50 sessions) to over 40% among users who had run roughly 750 sessions.

But this does not mean “leave everything unattended.” Experienced users lean on auto-approve more, yet they intervene where intervention is warranted. It is less a matter of trusting the AI than of arranging the work so that, even when the AI fails, the failure can be detected.

This is a shift from prompt tricks to designing specs and verification

Here we return to the opening question.

Who decides what, and how does AI execute?

Reading from Anthropic’s data, the human’s job comes down to four things:

  1. Decide the goal: what to build, and why it’s needed
  2. Decide the constraints: the range that may be touched, the design to uphold, the risks to avoid
  3. Decide the definition of done: what has to pass for it to be finished
  4. Decide how to verify: tests, builds, reviews, screen checks, log checks
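The four decisions above can be written down as a small data structure, which makes it easy to see when one of them has been left undecided. This is only a sketch: `TaskSpec`, its field names, and the example values are hypothetical and are not part of any tool mentioned in this article.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """The four human decisions for one task (hypothetical structure)."""
    goal: str = ""
    constraints: list[str] = field(default_factory=list)
    definition_of_done: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)

    def missing_decisions(self) -> list[str]:
        """Return the names of the decisions that are still empty."""
        missing = []
        if not self.goal.strip():
            missing.append("goal")
        if not self.constraints:
            missing.append("constraints")
        if not self.definition_of_done:
            missing.append("definition_of_done")
        if not self.verification:
            missing.append("verification")
        return missing


# Illustrative values only.
spec = TaskSpec(
    goal="Add CSV export to the monthly report screen so finance can reconcile totals",
    constraints=["Touch only the reports module", "Keep the existing API unchanged"],
    definition_of_done=["Exported totals match the on-screen totals"],
)
print(spec.missing_decisions())
```

Here the spec says what to build and when it is done, but not how to check it, so the verification decision is reported as missing before any work is handed off.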

The AI’s job is to expand execution within that design.

Find files. Read code. Make changes. Run commands. If it fails, hunt down the cause. Fix it again. Update the plan if needed. This loop of execution is increasingly something that can move forward without a human operating it by hand the whole time.

That’s exactly why the center of gravity in working with AI shifts from “writing a good prompt” to “preparing a spec the AI can execute against and verification signals the AI can read.”

This is precisely what we have emphasized in AI Spec-Driven Development, and what we have published as our own methodology.

In AI Spec-Driven Development, we decide these four things at the level of a task — concretely, a GitHub Issue. An Issue is not merely a request ticket. It is the boundary of work the AI can carry out autonomously.

Within the Issue, decide the goal: write what to build and why. Decide the constraints: write the range that may be touched, the design to uphold, the risks to avoid. Decide the definition of done: write what has to pass to be finished. Decide how to verify: write whether pass/fail is judged by tests, builds, reviews, screen checks, or log checks.

At the stage of deciding all this, humans are deeply involved. It is, if anything, the place humans should think hardest. But once a task’s goal, constraints, definition of done, and verification method are set, the need for a human to direct each step during execution — “now read this file,” “now run this command” — shrinks. The AI reads the Issue, finds the related files, makes changes, runs the tests, hunts down the cause when it fails, and runs again.
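The execute, verify, and retry cycle described above can be sketched as a bounded loop. This is a simplified illustration, not how Claude Code is implemented: `run_until_verified` and both callbacks are hypothetical, and the toy stand-ins below only simulate a fix that lands on the second attempt.

```python
from typing import Callable


def run_until_verified(
    attempt: Callable[[int, str], None],
    verify: Callable[[], tuple[bool, str]],
    max_attempts: int = 3,
) -> tuple[bool, int]:
    """Alternate execution and verification until the check passes or the budget runs out.

    `attempt` receives the attempt number and the previous failure message;
    `verify` returns (passed, message). Both are hypothetical callbacks.
    """
    feedback = ""
    for n in range(1, max_attempts + 1):
        attempt(n, feedback)         # the AI's turn: edit files, run commands
        passed, feedback = verify()  # the signal the human decided on up front
        if passed:
            return True, n
    # Budget exhausted: stop and hand the decision back to a human.
    return False, max_attempts


# Toy stand-ins: the "work" succeeds on the second attempt.
state = {"fixed": False}


def fake_attempt(n: int, feedback: str) -> None:
    if feedback:  # only "fix" once a failure message has been seen
        state["fixed"] = True


def fake_verify() -> tuple[bool, str]:
    return (True, "ok") if state["fixed"] else (False, "test_export failed")


print(run_until_verified(fake_attempt, fake_verify))
```

The attempt budget is the human's criterion for backing out: the loop runs unattended, but it cannot run forever, and a failure always ends in a visible result.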

Leave the Issue, the spec, the acceptance criteria, the review lenses, the test commands, and the operational rules in a form both humans and AI can read. The AI implements from them, and the human holds the goal and the verification.

A prompt is an instruction for the moment. A spec is an organizational asset the next AI can read too.

A view from Feel Flow

This change is not about making engineers unnecessary.

It is, rather, about the engineer’s role moving upstream.

Traditionally, humans designed, humans implemented, and humans verified. As AI coding enters real work, the human moves from “the person who implements everything by hand” toward “the person who designs a structure in which AI can execute safely.”

Concretely, abilities like these become more important:

  • The ability to break a problem down into units the AI can execute without getting lost
  • The ability to put specs, constraints, and the definition of done into words
  • The ability to prepare verification paths — tests, builds — where failure is visible
  • The ability to read the AI’s output and see through dangerous cleverness or apparent success
  • The ability to make a once-successful procedure reusable as an Issue, an AGENTS.md, a Skill, or a Playbook

The wider the range we hand to AI, the more human judgment moves not into obsolescence but further upstream.

What Anthropic’s data shows is that, in a world where AI has begun to take on execution, human expertise still matters strongly. Human value shifts away from sheer speed of hands at the keyboard, toward the ability to discern what should be built and to design how it should be verified.

At Feel Flow, we put this approach into practice as our own methodology, AI Spec-Driven Development. Rather than one-off prompts, we build AI into a development process that connects spec, Issue, verification, review, and release. The division of labor between humans and AI that Anthropic’s data revealed overlaps with the way of thinking we have been advancing in the field. That, we believe, is where tomorrow’s development organizations will set themselves apart.



This article is also available in Japanese.