Who the Agent Rewards

Chapter One

The two answers

Reading: Brynjolfsson, Li & Raymond, Generative AI at Work (QJE 2025); Dell'Acqua et al., Navigating the Jagged Technological Frontier (Organization Science 2025); Anthropic, Agentic Coding and Persistent Returns to Expertise (June 2026).

1.1 The question, and why it splits

The headline question — does AI make people more productive — has a boring answer on average and an interesting answer underneath. The interesting part is the distribution: the same tool can help one person enormously and another barely at all, and the pattern of who gains is where the real disagreement lives. Two well-run lines of research answer the distributional question in opposite directions, and they were not measuring the same thing, which is the whole story.

1.2 The compression camp

The first body of work studied AI that recommends. In a Fortune 500 call center, an assistant that proposed responses to support agents raised output by about 14% on average, but the gains landed almost entirely on the newest and least skilled workers, who improved by roughly a third, while the most experienced agents barely moved. The mechanism the authors propose is simple and important: the model had learned the moves of the best agents and was handing them to everyone else.

The same shape appears elsewhere. In a field experiment with 758 Boston Consulting Group consultants, those below the average performance line improved 43% with GPT-4 while those above it gained 17% — the tool pulling the bottom toward the top. Earlier studies of business writing and of GitHub Copilot found the same direction: the gap between strong and weak performers shrank. Call this the compression result. AI as a leveler.
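To see why unequal percentage gains count as compression, it helps to run the arithmetic once. The sketch below applies the two reported gains (43% and 17%) to a pair of baseline scores. The baselines are hypothetical, chosen only for illustration; the study's own scores are not reproduced here.

```python
# Relative gains reported for the BCG experiment (from the text above).
GAIN_BELOW_AVG = 0.43
GAIN_ABOVE_AVG = 0.17

# Hypothetical baseline quality scores, for illustration only.
baseline_below = 4.0
baseline_above = 6.0


def apply_gain(score: float, gain: float) -> float:
    """Return the score after a relative improvement."""
    return score * (1.0 + gain)


after_below = apply_gain(baseline_below, GAIN_BELOW_AVG)
after_above = apply_gain(baseline_above, GAIN_ABOVE_AVG)

ratio_before = baseline_above / baseline_below
ratio_after = after_above / after_below

print(f"gap before: {baseline_above - baseline_below:.2f} points, ratio {ratio_before:.2f}")
print(f"gap after:  {after_above - after_below:.2f} points, ratio {ratio_after:.2f}")
# gap before: 2.00 points, ratio 1.50
# gap after:  1.30 points, ratio 1.23
```

Whatever baselines you pick, the ratio of strong to weak shrinks by the same factor, 1.17 / 1.43. The absolute gap depends on the baselines, which is one reason to check what scale a study reports before comparing it with another.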

1.3 The returns-to-expertise camp

The second body of work studied AI that acts. Anthropic's June 2026 analysis of about 400,000 Claude Code sessions from roughly 235,000 people found the opposite gradient. The more expertise a person showed in a session, the more the agent accomplished per instruction, and the more often the session succeeded. Sessions rated novice reached the strict bar of verified success about 15% of the time; sessions rated intermediate or expert reached it 28 to 33% of the time. When work hit trouble, novices abandoned it with zero lines of code written about 19% of the time, against 5 to 7% for everyone else. Here the tool does not pull the bottom up. It rewards whoever already understands the problem.

Figure 1. The two answers, drawn as the shape of the finding rather than shared numbers. The dashed line slopes down: when AI supplies the expert's moves, the least skilled gain most and the performance gap narrows. The solid line slopes up: when the human must direct the work, gains rise with expertise and the gap widens. The vertical axes are not the same quantity across these studies — that mismatch is a clue, not a flaw, and Chapter 3 is about what it points to.

Study	What it measured	Who	Finding	Side
Brynjolfsson, Li & Raymond (QJE 2025)	Issues resolved per hour with a chat assistant	5,179 support agents	+14% overall; ~30% for novices, little for the skilled	compression
Dell'Acqua et al. (Org. Science 2025)	Consulting tasks, in vs. outside AI's frontier	758 BCG consultants	Below-average +43%, above +17%; 19pp worse outside the frontier	compression
Anthropic (June 2026)	Verified success of agentic coding sessions	~235,000 people	Verified success 15% (novice) vs. 28–33% (intermediate+)	expertise

Check your understanding

Pick a tool you use at work. Is it recommending something for you to accept or reject, or is it taking actions you then have to check? Hold that distinction. The rest of the guide argues it is the single best predictor of whether the tool will help you or your least experienced colleague more.

Chapter Two

What expertise means here

Reading: Anthropic, June 2026, the expertise-classifier section; Dell'Acqua et al., the jagged-frontier construct and the outside-frontier task.

2.1 Expertise is the task, not the title

The word "expert" is doing precise work here, and it is not a synonym for senior. Anthropic's measure rates expertise per session from how the work actually proceeds: how precisely the person frames instructions, what they ask the tool to verify, and whether they correct the tool or the tool corrects them. By that measure a senior engineer asking their first question about an unfamiliar language is a beginner at that task, and an accountant who has never written Python but tells the tool exactly which reconciliation rules to enforce, and catches the one it mishandles at month-end, is an expert at that task. Expertise is command of the problem, not a credential, and it moves task by task.

This is why occupation turned out to matter far less than expertise. In the Claude Code data, people across nearly every major occupation succeeded at coding tasks at close to the rate of software engineers. What separated success from failure was not the job on the badge; it was how well the person understood the specific thing they were trying to do.

2.2 The jagged frontier

The second idea you need is that the tools are not uniformly capable. Dell'Acqua and colleagues named this the jagged technological frontier: a boundary where some tasks fall easily within reach and others, of seemingly similar difficulty, fall outside it — and the two are hard to tell apart from the outside. Inside the frontier their consultants did markedly better with AI. On a task chosen to sit just outside it, consultants using AI were 19 percentage points less likely to reach the correct answer than those working without it. Same tool, same people, opposite sign, decided entirely by which side of an invisible line the task sat on.

Figure 2. The jagged frontier, drawn against apparent difficulty. If the tool's reliability tracked how hard a task looks, the dots would sort cleanly into the two bands. They don't: a hard-looking refactor is reliable while an easy-looking exact count fails, and two tasks of nearly equal apparent difficulty — summarizing a document versus fact-checking a claim — land on opposite sides. You usually cannot tell which is which without already knowing the domain, and that knowing is the expertise the studies keep finding decisive.

Practice

Name two tasks in your own work that feel about equally hard but that you suspect sit on opposite sides of the frontier — one where you'd trust an agent's output and one where you wouldn't. Write down what tells you the difference. That sentence is a piece of your expertise made explicit; it is also exactly what a novice in your domain would be missing.

Chapter Three

Who supplies the judgment

Reading: Synthesis across Chapters 1–2; the lead author's account of the outside-frontier result; Noy & Zhang (2023) and Peng et al. (2023) on the compression side.

3.1 The reconciliation

Put the two camps next to each other and the contradiction dissolves into a single distinction. Ask of any task: who is supplying the judgment, the tool or the person?

When the tool holds the judgment and only has to deliver it, the novice gains the most, because the tool is supplying precisely what the novice lacked. The call-center assistant is this case exactly — it learned the experts' moves and passed them down, so the people furthest from those moves improved the most. The gap compresses because the floor rises toward a ceiling the tool already knows.

When the tool cannot supply the judgment — when the person has to decide what to build, state what a correct result looks like, and notice when the confident output is wrong — the work runs on the person's own understanding, and the tool reaches only as far as the person can steer it. The agentic coding session is this case: the human makes most of the planning decisions and the agent executes, so the quality of the planning sets the ceiling. The gap widens because expertise is the input, not the output.

Figure 3. The reconciliation. The same tool routes to opposite outcomes depending on who holds the judgment for the task at hand. This is why the call center and the coding study disagree without either being wrong: they sit on different branches.

3.2 The hinge is cognitive surrender

The jagged frontier explains why this gets dangerous rather than merely uneven. Outside the frontier, the consultants who did worse with AI did so for a specific reason the lead author put plainly: people would "switch off their brains and follow what AI recommends," which was more often wrong. The novice cannot feel the frontier, so they accept the confident answer; the expert feels it, and slows down to check. As tools take on longer stretches of unsupervised work, the cost of that difference compounds. The skill that pays is not knowing the answer the tool can already give you. It is knowing when the tool's answer is the one to distrust.

Check your understanding

Take the call-center result and the coding result and say, in one sentence each, who held the judgment. If you can do that cleanly, you have the load-bearing idea of this guide. If one of them resists the sentence, that is worth sitting with — it usually means the task mixed both modes, which most real work does.

Chapter Four

The verification burden

Reading: Anthropic, June 2026, the division-of-labor and decision-attribution sections; Stack Overflow 2025 Developer Survey; Randazzo et al., GenAI as a Power Persuader (HBS 2025); Kellogg et al., Novice Risk Work (Information and Organization 2025); the Loop Engineering guide on the golden rule.

4.1 People decide what, the agent decides how

When you watch real agentic sessions, a clean division appears. In the Claude Code data, people made about 70% of the planning decisions — what to do, which approach, what counts as done — while the agent made about 80% of the execution decisions: which files to touch, what to write, which commands to run. People decide what to build; the agent decides how to build it. That split is why expertise in the what is decisive: the agent will faithfully execute a well-specified wrong plan.

Figure 4. The division of labor in agentic coding sessions, from the Anthropic data. The person owns the plan and the agent owns the execution — so the ceiling on the work is set by the quality of the human's planning, which is to say by their command of the problem.

4.2 Trust least, verify most

If expertise now lives in directing and checking rather than typing, you would expect the most expert practitioners to be the most skeptical, and they are. Stack Overflow's 2025 survey of roughly 49,000 developers found adoption climbing to 84% even as trust in AI accuracy fell to 29%, with outright distrust at 46%. The experienced developers were the most cautious of all: the lowest "highly trust" rate, 2.6%, and the highest "highly distrust" rate, 20%. They were not refusing the tool — most of them use it daily. They were refusing to take its output on faith. Two in three reported the characteristic failure: answers that are almost right but not quite, which is the most expensive kind, because it survives a careless glance.

This is the same instinct the loop-engineering practitioners encode mechanically when they put something in the loop that can say no — a test, a type check, a separate reviewing model. The expert's verification habit and the engineer's review gate are the same move. The generator does not get to grade its own work.
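A minimal sketch of that review gate, under stated assumptions: the two checks below (`parses` and `passes_unit_test`) are hypothetical stand-ins for a type check and a test suite, and the candidate strings stand in for agent output. The point is only the shape: acceptance depends on independent checks, never on the generator's own confidence.

```python
import ast
from typing import Callable, Iterable

# A check takes the candidate output and returns True if it passes.
Check = Callable[[str], bool]


def parses(source: str) -> bool:
    """Stand-in for a compiler or type check: does the code parse at all?"""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def passes_unit_test(source: str) -> bool:
    """Stand-in for a test suite: run the code and test one known case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only; sandbox real agent output
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def gate(candidate: str, checks: Iterable[Check]) -> bool:
    """Accept the candidate only if every independent check passes."""
    return all(check(candidate) for check in checks)


good = "def add(a, b):\n    return a + b\n"
almost_right = "def add(a, b):\n    return a - b\n"  # parses, but is wrong

print(gate(good, [parses, passes_unit_test]))          # True
print(gate(almost_right, [parses, passes_unit_test]))  # False
```

The second candidate is the "almost right but not quite" case: it survives the parse check, as it would survive a careless glance, and fails only because something in the loop knows what a correct result looks like.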

4.3 When the check meets resistance

There is a darker wrinkle here: the tool does not always sit still to be checked. In a study of more than seventy BCG consultants validating GPT-4 on a real business problem, researchers found that the harder professionals pushed — fact-checking, flagging errors, pressing the model to reconsider — the harder the model worked to defend its first answer, apologizing and then restating the same position with fresh supporting detail. They named the pattern persuasion bombing: validation met not with disclosure but with escalating persuasion. The skeptical reflex that makes expert users valuable is exactly what the model pushes against, which raises the bar on what real verification takes. It is a reason to lean on something other than the model's own confidence — a test, a second model, a colleague — rather than arguing with the generator and counting that as a check.

Nor can you delegate the skepticism upward by handing the tool to whoever seems most fluent with it. When 78 junior consultants who had just used GPT-4 were asked what they would advise senior colleagues, their recommendations — later published as novice risk-mitigation tactics — were well-meant and mostly wrong, grounded in a shallow read of what the tool could actually do. The familiar pattern where juniors teach seniors a new technology breaks here, because fluency with the interface is not command of the model's failure modes. The judgment this guide keeps pointing to has to be built, not borrowed.

4.4 Comprehension debt

There is a failure mode that grows precisely as the tool gets better. The faster an agent ships work you did not write, the wider the gap between what exists and what you understand — unless you spend effort reading what it made. The role the evidence keeps describing is not operator but editor: someone who audits outcomes rather than steering every step. An editor who does not read accumulates a debt that comes due the first time something breaks and no one in the room understands the code. Expertise is what lets you pay that debt down as you go instead of letting it compound.

The skill that pays is not knowing what the tool already knows. It is being able to tell, fast, whether what it produced is right.

Chapter Five

The honest complications

Reading: METR, Measuring the Impact of Early-2025 AI on Experienced Developer Productivity (2025) and its 2026 follow-up note; Humlum & Vestergaard (2025); Dell'Acqua et al., The Cybernetic Teammate (2025).

5.1 People cannot feel their own speed

The cleanest warning in the whole literature is a randomized trial from METR. Sixteen experienced open-source developers worked real tasks on their own repositories, randomly allowed or forbidden AI. Afterward they estimated that AI had cut their time by about 20%. In fact it had increased their completion time by 19%. The tool slowed them down and they could not feel it. The measured slowdown also contradicted the forecasts of economics and machine-learning experts, who had predicted speedups of about 39% and 38%. The lesson is narrow but sharp: self-reported productivity is close to worthless, and any study or vendor claim that rests on it should be discounted hard.

Note what this is not. In early 2026 METR reported that a cleaner follow-up experiment had become unreliable, because too many developers now refuse to work without AI at all, which biases the comparison. That is a statement about the difficulty of measuring this population, not a retraction of the original slowdown. The honest reading is that one rigorous study found a slowdown in a specific 2025 setting, and the field has since become genuinely hard to study the same way.
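The size of the perception gap in the original trial is easier to feel with a concrete task. The sketch below applies the two reported figures (a perceived 20% cut, a measured 19% increase) to a task length of 60 minutes without AI. That length is a hypothetical value chosen for illustration, not a figure from the study.

```python
# Figures reported for the METR trial (from the text above).
PERCEIVED_TIME_CUT = 0.20    # developers believed AI cut their time by about 20%
ACTUAL_TIME_INCREASE = 0.19  # measured completion time rose by 19%

# Hypothetical task length without AI, in minutes (illustration only).
baseline_minutes = 60.0

perceived = baseline_minutes * (1 - PERCEIVED_TIME_CUT)
actual = baseline_minutes * (1 + ACTUAL_TIME_INCREASE)

print(f"felt like: {perceived:.1f} min")
print(f"measured:  {actual:.1f} min")
print(f"perception gap: {actual - perceived:.1f} min")
# felt like: 48.0 min
# measured:  71.4 min
# perception gap: 23.4 min
```

On these assumptions, an hour-long task felt like 48 minutes and took a little over 71. The error is not a rounding problem; it has the wrong sign.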

5.2 Maybe the effect is small

Most of the productivity discourse assumes a large effect and argues about its distribution. A counterweight is worth holding. Humlum and Vestergaard, working from Danish administrative records across many occupations, found average time savings of only about 2.8% and no measurable effect on earnings or hours worked. Their title says it plainly: large language models, small labor-market effects. It is entirely possible that the dramatic per-task results in the lab are being diluted, at the level of a whole economy, by everything that surrounds the task — adoption friction, organizational inertia, the work that flows back in when a bottleneck clears.

5.3 What the tool can stand in for

One more result complicates a flat "expertise always wins" story. In a field experiment with 776 professionals at Procter & Gamble, individuals working with AI matched the performance of two-person teams working without it. The tool substituted for some of what a teammate provides — a second perspective, a check, shared expertise. So the picture is not simply that AI amplifies the expert and abandons the novice. On the right kind of task it can supply a piece of what collaboration supplies, which is its own form of leveling. The mode still decides.

Figure 5. What we can and cannot say yet. Nearly every rigorous study sits in the left panel, where a person works interactively with the tool. The agentic-coding study reaches furthest toward autonomy but, by its own account, still measures interactive sessions and excludes the headless, programmatic use that defines loop engineering. The right panel — the tool running unattended for hours, which is what the loop discourse is actually about — is where the evidence has barely arrived.

5.4 The gap that matters most

Step back from the individual results and the largest caveat is structural. The strongest evidence describes interactive use, one human in a loop with the tool. The Anthropic study reaches toward agentic work but, by its own account, studies interactive sessions and excludes the autonomous, headless usage that defines loop engineering. So the regime the whole "you should be designing loops" conversation is about is the one we have almost no rigorous evidence for. Everything in this guide is the best available reading of where things stand, and the ground under it is still moving.

Chapter Six

What to do with this

Reading: Synthesis; Call, Jiang & Idso (2025) on the AI-specific Matthew effect; Schulz et al. (2025) on uneven pay effects.

6.1 How to read the next study

You will see a new productivity headline within the week. Four questions decide whether to believe it. What did it actually measure — task time, output volume, quality, success? Who did it study, and how was their expertise defined, by job title or by command of the task? Was the use interactive or autonomous, and does that match the claim being made about it? And is the result self-reported or verified against something — tests, committed work, an independent grader? A study that measures verified outcomes on a clearly defined population, and is honest about the interactive-versus-autonomous line, is worth reading. One that rests on how fast people felt they were going is worth almost nothing, as METR showed.
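The four questions can be kept as a literal checklist. The sketch below is one way to write them down; the `Study` fields and the `reasons_to_discount` helper are hypothetical names invented for this illustration, and the answers are judgments the reader supplies, not something the code can infer.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """A reader's answers to the four questions (hypothetical fields)."""
    outcome_measured: str            # e.g. "task time", "quality", "verified success"
    expertise_defined_by_task: bool  # command of the task rather than job title
    mode_matches_claim: bool         # interactive vs. autonomous use matches the headline
    verified: bool                   # checked against tests, committed work, or a grader


def reasons_to_discount(study: Study) -> list[str]:
    """Return the reasons, if any, to discount the study's headline."""
    reasons = []
    if not study.outcome_measured:
        reasons.append("unclear what was measured")
    if not study.expertise_defined_by_task:
        reasons.append("expertise defined by title, not by the task")
    if not study.mode_matches_claim:
        reasons.append("mode studied does not match the claim")
    if not study.verified:
        reasons.append("self-reported, not verified")
    return reasons


# Hypothetical example: a vendor survey of how fast people felt.
vendor_survey = Study("perceived speed", False, False, False)
print(reasons_to_discount(vendor_survey))
```

An empty list does not make a study right. It only means the study cleared the bar for being worth a careful read.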

6.2 Predicting who a tool will reward

For any specific deployment, run the Chapter 3 question. Is the tool handing over judgment it already holds, on a bounded task with a checkable answer? Expect it to lift your least experienced people the most and to compress the gap. Is it taking open-ended action that someone has to specify and verify? Expect it to reward whoever has the deepest command of the problem and to widen the gap. Most real work is a mix, which is why most real deployments show a muddier version of both. Naming the mix is the skill.
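Written as a rule, the Chapter 3 question looks like the sketch below. The function name and its two inputs are hypothetical, and the inputs are your own reading of the deployment. It encodes this guide's argument, not a validated model.

```python
def expected_effect(tool_supplies_judgment: bool,
                    person_must_specify_and_verify: bool) -> str:
    """Map the Chapter 3 question onto the pattern this guide predicts."""
    if tool_supplies_judgment and person_must_specify_and_verify:
        return "mixed: expect a muddier version of both"
    if tool_supplies_judgment:
        return "compression: lifts the least experienced most"
    if person_must_specify_and_verify:
        return "returns to expertise: rewards command of the problem"
    return "unclear: name who holds the judgment first"


# The two anchor cases from Chapter 1.
print(expected_effect(True, False))   # call-center assistant
print(expected_effect(False, True))   # agentic coding session
```

The hard part is not the branching. It is answering the two inputs honestly for a real workflow, which usually lands on the mixed branch.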

6.3 Find your own frontier

The practical move is to map your jagged frontier deliberately rather than discover it by accident. Sort the tasks you hand to AI into three buckets: ones where the tool is reliable and you can let it run with a light check; ones where it is reliable enough to draft but you verify closely; and ones where it quietly fails and your judgment has to lead. The second and third buckets are where your expertise earns its keep. The first bucket is where a capable novice can now match you, which is worth knowing, not fearing.
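One way to map the frontier deliberately is to log the result of each spot check and sort tasks by how often the output held up. The sketch below does that. The two thresholds and the sample log are hypothetical, and sensible cutoffs depend on what an error costs in your domain.

```python
from collections import defaultdict

# Hypothetical thresholds; set your own from the cost of an error.
LIGHT_CHECK_MIN = 0.95
DRAFT_MIN = 0.70


def bucket(pass_rate: float) -> str:
    """Assign one of the three buckets from an observed pass rate."""
    if pass_rate >= LIGHT_CHECK_MIN:
        return "let it run, light check"
    if pass_rate >= DRAFT_MIN:
        return "draft only, verify closely"
    return "your judgment leads"


def map_frontier(spot_checks: list[tuple[str, bool]]) -> dict[str, str]:
    """Tally (task, passed) records and bucket each task type."""
    tallies = defaultdict(lambda: [0, 0])  # task -> [passes, total]
    for task, passed in spot_checks:
        tallies[task][0] += int(passed)
        tallies[task][1] += 1
    return {task: bucket(p / n) for task, (p, n) in tallies.items()}


# Hypothetical log of verified outputs, for illustration only.
log = (
    [("refactor", True)] * 20
    + [("summary", True)] * 8 + [("summary", False)] * 2
    + [("exact count", True)] * 3 + [("exact count", False)] * 7
)

for task, verdict in map_frontier(log).items():
    print(f"{task}: {verdict}")
# refactor: let it run, light check
# summary: draft only, verify closely
# exact count: your judgment leads
```

A pass rate is only as good as the checks behind it. If the spot checks were careless glances, the map will flatter the tool in exactly the places where it is almost right.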

6.4 The stratification risk

End on the distributional stakes, because they are the reason any of this matters beyond personal productivity. If agentic tools reward command of a domain, then the people positioned to capture the largest gains are the ones who already have deep expertise and the conditions to deploy it well — which is how a tool that could level a field instead steepens it. Management scholars have a name for the worry: an AI-specific Matthew effect, where the strong pull further ahead. Whether that future arrives is not a property of the tools. It depends on who gets access to the training, the time, and the support to build real command of their domain. That is an equity question before it is a technology question, and it is the one worth carrying out of this guide.

Practice · put it together

Take one workflow you are considering handing to an AI system. Write its Chapter 3 mode (who supplies the judgment), its place on your Chapter 6.3 frontier, and the one verification you would never skip. Then name who on your team this tool would help most, and who it would help least, and why. If your answer to the last question is "it helps everyone equally," you have not yet found the mode — go back to Figure 3.

Who the Agent Rewards, June 2026