The Tokens-for-Engineers Trade Has Three Holes

I keep watching the same conversation happen in boardrooms.

A senior leader pulls up a spreadsheet. One column shows fully loaded engineer cost. Another column shows token spend for an autonomous AI agent doing roughly equivalent work. The second column is meaningfully smaller than the first. The recommendation writes itself: convert engineers to tokens, capture the delta, declare a productivity win.

I wrote a few months ago about token spend becoming the new biggest line item on enterprise technology budgets. That was the cost frame. This is the assumption underneath the cost frame — the one that’s now driving headcount conversations in companies that haven’t actually pressure-tested the math.

This isn’t a story about whether AI can write code. It can. The frontier models are genuinely impressive, and refusing to use them in 2026 is its own kind of management malpractice. This is a story about whether token consumption can absorb the full job of engineering — and the three things the spreadsheet is missing when boardrooms model the trade.

The Bet Boards Are Making

The replacement thesis is simple enough to fit on a slide. Engineers are expensive. AI is cheaper per unit of output. Therefore, replace engineers with AI and pocket the difference. Several public CEOs have hinted at this math. Many more are running the spreadsheet privately while their engineering org reads the tea leaves.

The case for the bet is real. Frontier coding models are good. Agentic harnesses keep improving. Token costs per task are trending down. If you squint, the curve looks like a clean arbitrage opportunity that just needs the courage to act.

The case against the bet is not “AI doesn’t work.” It’s that the spreadsheet is missing three things that don’t show up in the cost-per-task column — and each one of them turns the projected savings into wishful thinking.

Hole #1: Tokens Won’t Trend to Zero

The first hole is the assumption that token costs are heading to zero, or close enough to zero that the per-unit cost of engineering output collapses.

That assumption breaks on three different vectors at once.

Jevons paradox is in full effect. Cheaper per task does not mean cheaper in total. The history of every major efficiency gain in computing — from cycles to bandwidth to storage — shows the same pattern: as the unit gets cheaper, total consumption rises faster than the cost falls. AI is the most aggressive case of this pattern we’ve ever seen. A coding agent that’s 10x easier to invoke gets called 100x more often, especially when it’s allowed to loop through code search, planning, testing, and revision unattended. Every efficiency gain invites more invocation. The bill goes up.
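The arithmetic behind that claim fits in a few lines. This is a minimal sketch with made-up numbers: the $1.00 baseline, the 1,000 tasks per month, and the assumption that unit cost falls 10x while invocations rise 100x are all illustrative, not measurements.

```python
def total_spend(unit_cost_usd: float, invocations: int) -> float:
    """Total bill is cost per task times number of tasks."""
    return unit_cost_usd * invocations


# Hypothetical baseline: $1.00 per task, 1,000 tasks per month.
before = total_spend(unit_cost_usd=1.00, invocations=1_000)

# Assumed scenario: each task gets 10x cheaper, but the agent is called 100x more often.
after = total_spend(unit_cost_usd=0.10, invocations=100_000)

ratio = after / before
print(f"before=${before:,.0f}  after=${after:,.0f}  bill grew {ratio:.0f}x")
```

Under those assumptions the per-task price drops by 90% and the monthly bill still grows tenfold. The spreadsheet column that tracks cost per task never sees it.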

The physical capacity layer doesn’t scale at software speed. The IEA projects global data center electricity demand at roughly 945 TWh by 2030, with around 15% annual growth in data center consumption from now through the end of the decade. That growth is constrained by transformers, substations, land, cooling, water, permits, and the slow physics of grid expansion. Models can improve in months. Power infrastructure cannot. CBRE was reporting North America data center vacancy at around 1.6% in mid-2025 — a rounding error from zero. A replacement strategy that assumes tokens trend to zero is implicitly assuming compute capacity expands faster than demand. The capacity expansion is not happening at that rate.

Agentic workflows turn token consumption into recurring metered cost. A scoped agentic loop (context read, planning, tool calls, test execution, revision, retry) at standard frontier pricing can run $3-5K per month per engineer-equivalent. A loosely bounded one running unattended on long contexts can hit $20K per month, which annualizes to roughly $250K. That’s not a worst case. That’s what unconstrained loops actually produce when nobody is watching the meter. The risk isn’t one expensive prompt. It’s the loop.
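To see how a loop gets to those figures, here is a rough cost model. Every input is a hypothetical placeholder chosen to land in the ranges quoted above: the blended $10 per million tokens, the tokens per iteration, and the task volume are assumptions, not vendor pricing or measured usage.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    tokens_per_iteration: int      # context read + plan + tool output, per pass
    iterations_per_task: int       # revisions and retries before the task ends
    tasks_per_month: int
    usd_per_million_tokens: float  # blended input/output price (placeholder)

    def monthly_cost(self) -> float:
        tokens = (self.tokens_per_iteration
                  * self.iterations_per_task
                  * self.tasks_per_month)
        return tokens / 1_000_000 * self.usd_per_million_tokens

    def annual_cost(self) -> float:
        return 12 * self.monthly_cost()


# Scoped loop: modest context, bounded retries (illustrative numbers).
scoped = LoopProfile(50_000, 8, 1_000, 10.0)

# Loosely bounded loop: same task count, 4x the context, a few more retries.
unbounded = LoopProfile(200_000, 10, 1_000, 10.0)

for name, profile in [("scoped", scoped), ("unbounded", unbounded)]:
    print(f"{name}: ${profile.monthly_cost():,.0f}/month, "
          f"${profile.annual_cost():,.0f}/year")
```

The task count is identical in both profiles. The fivefold difference comes entirely from context size and iteration count, which are exactly the two knobs nobody sets when a loop runs unattended.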

When you compose these three forces, the cost trajectory of engineer-equivalent AI work is not “approaching zero.” It’s “demand grows faster than unit cost falls.” The spreadsheet that prices the trade as if tokens were nearly free is pricing a future that the physics, the demand pattern, and the workflow shape all push back on.

This is the most quantifiable hole. It’s also the easiest one for boardrooms to dismiss because the numbers are forecasts, not facts. The next hole is harder to wave away.

Hole #2: Frontier AI Still Fails at Real Engineering

The receipts are everywhere if you look.

Even on benchmarks designed to flatter the models, the ceiling is well below “ready to run engineering autonomously.” On SWE-Bench Verified, the curated benchmark of real GitHub issues, the top frontier models have plateaued in roughly the 70% pass rate range. That sounds good until you state it the other way: roughly three in ten real issues, drawn from real codebases, still don’t get solved by the best autonomous setup we know how to build. Production codebases are messier than benchmarks, so the real-world miss rate is likely higher.

The independent evaluations on commercial autonomous-coding tools are sharper still. When researchers tested Cognition’s Devin — marketed as the “first AI software engineer” — against real Upwork-style tasks, it solved roughly 3 of 20. The demo videos and the production performance were not the same product.

The high-profile failures aren’t theoretical either. In a widely reported 2025 incident, an autonomous Replit AI coding agent deleted a customer’s production database during a session, igniting a public agent-safety debate that vendors are still navigating. Anthropic’s own published research on Project Vend, in which Claude was given a small autonomous business to run for an extended period, documented the model exhibiting identity drift and progressively erratic decisions over weeks of operation. This is not a sandbox stress test by an adversary. This is a frontier lab publishing what its own model does when allowed to operate autonomously over time.

Underneath the named incidents, there’s a steady drumbeat of agentic-loop pathologies that engineers using these tools every day will recognize: infinite retries, fabricated tests written to make failing code appear to pass, deleted commits, runaway tool calls that burn through context and budget, and confident output for tasks the agent has visibly failed at. These are not exotic. They show up across Cursor, Claude Code, GitHub Copilot Workspace, and every other agentic harness on the market. The vendors know about them. The mitigations are partial.

The honest read of the state of the art is this: even the best frontier models with the best harnesses, given real production work, break in ways that require a senior human to catch and unwind. Not 5% of the time. Not 10%. Often enough that “autonomous coding factory” remains a marketing phrase, not a deployed reality.

This is the hole that should worry boards the most, because it’s the only one with a public failure mode. When a $250K-per-year token loop deletes a production database, that’s an incident report, a post-mortem, and a board-level conversation. It’s not a budget surprise. It’s a credibility event.

Hole #3: Every Session Forgets

The third hole is the most fundamental and the least understood, because it’s not about cost or capability. It’s about the structural difference between how a senior engineer accumulates value and how an AI agent does.

Here’s the mechanism, stated plainly: when an AI agent performs an action — reads code, runs a test, gets a result, ships a patch — that experience lives in the session’s context window. When the session ends, the context window is gone. The agent did not learn from it. There is no “I won’t make that mistake again” stored anywhere. The next session starts at zero. The session after that starts at zero. Every session starts at zero.

Now compare that to a senior engineer. A senior engineer carries twenty years of failed paths they silently avoid. They know which migration broke production in 2019 and still leaves residue in the system. They know which ticket type requires audit-friendly language. They know that the obvious clean solution has been tried twice in this codebase and both times it caused outages, so they start three steps past the obvious solution. They know which teammates to pull in before touching the auth service. None of that is in any document. All of it is in their head. And it compounds every year they stay.

This is what the crayon-drawing analogy gets at. A child’s drawing of a house can move a parent because it represents time, struggle, growth, and relationship — a journey compressed into an artifact. An AI rendering of a “better” house, no matter how technically superior, doesn’t move anyone the same way because it doesn’t carry that history. The artifact is not the value. The compressed lived experience behind the artifact is the value.

Engineering is the same shape. The code an experienced engineer writes is the artifact. The judgment about what not to write, what to question, what to delay, what to refuse — that’s the compressed lived experience. And it’s not in the weights. It’s not in retrieval. It’s not in the context window. There is no mechanism by which a frontier model accumulates it through use.

You can buy more tokens. You cannot buy compressed lived experience.

The implication for the replacement thesis is severe. Twenty years of a senior engineer’s failed paths can save weeks of elegant wrong work in a single sentence. An autonomous agent cannot save you from the elegant wrong work because it has no memory of the last three times someone tried it. Each session, it considers the obvious clean solution as if for the first time. Sometimes it ships it. Sometimes that ships an outage.

The boards making the replacement bet are betting that this gap doesn’t matter — that the artifact is the value and the journey is sentimental overhead. It’s not sentimental overhead. It’s the substrate the artifact rests on. Remove it and the artifacts get worse, not in ways that show up in benchmark scores, but in ways that show up in production.

The Right Play

None of this is an argument against using AI in engineering. Refusing to use frontier models on engineering work in 2026 is its own kind of management malpractice. The argument is against the replacement framing — against treating tokens and engineers as interchangeable line items and running the arbitrage spreadsheet.

The leverage framing is more honest and pays off more durably. Use AI where it compounds human judgment. Constrain it where it can wander. Specifically:

Keep human-owned intent at the top. Architecture, risk acceptance, product tradeoffs, security posture, and production accountability stay named human responsibilities. Not because models can’t suggest in those areas — they can — but because the consequence of a bad architectural decision plays out over years and the model isn’t going to be on the call when it does.

Bound the agent’s tasks. Scoped changes. Defined tests. Limited permissions. Clear stop conditions. Treat agentic loops the way you treat any other automated process that touches production: with a budget, a kill switch, and a human review path. The runaway $250K loop is what happens when none of those are in place.
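A budget, a kill switch, and a stop condition do not need much machinery. The sketch below is one hedged way to wire them around a loop; `LoopGuard`, `AgentStopped`, `run_bounded`, and the `step` callable (which returns a done flag and the cost of that step in dollars) are hypothetical names invented for illustration, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class AgentStopped(RuntimeError):
    """Raised when a limit or the kill switch halts the loop."""


@dataclass
class LoopGuard:
    max_usd: float
    max_iterations: int
    spent_usd: float = 0.0
    iterations: int = 0
    killed: bool = False

    def kill(self) -> None:
        """Operator-facing kill switch; takes effect before the next step."""
        self.killed = True

    def check(self) -> None:
        """Raise if any stop condition already holds."""
        if self.killed:
            raise AgentStopped("kill switch engaged")
        if self.spent_usd >= self.max_usd:
            raise AgentStopped("budget exhausted")
        if self.iterations >= self.max_iterations:
            raise AgentStopped("iteration limit reached")

    def record(self, usd: float) -> None:
        self.iterations += 1
        self.spent_usd += usd


def run_bounded(step: Callable[[], Tuple[bool, float]], guard: LoopGuard) -> str:
    """Run step() until it reports done or the guard stops the loop."""
    while True:
        try:
            guard.check()  # limits are checked before each step, not after
        except AgentStopped as exc:
            return f"stopped: {exc}"
        done, cost_usd = step()
        guard.record(cost_usd)
        if done:
            # Finishing is not merging: the result still goes to a human.
            return "done: awaiting human review"


# A stand-in step that never finishes and costs $1 per pass.
print(run_bounded(lambda: (False, 1.0), LoopGuard(max_usd=5.0, max_iterations=100)))
```

Because the guard checks before each step, a single expensive step can still overshoot the budget by its own cost. A stricter version would estimate the step's cost up front and refuse to start it.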

Treat context as infrastructure. The agents that work well in a codebase work well because someone has invested in docs, code ownership signals, telemetry, architecture decision records, and retrieval hygiene. Without that investment, every agent run rediscovers what your seniors already know — at billable rates, slowly, often badly. The cheapest token is the one you didn’t have to spend rediscovering institutional knowledge.

Track cost as an SLO. Tokens per accepted PR. Retry rate. Cache hit rate. Defect escape rate. Review burden. These are the metrics that convert AI spend from a vibes-based budget into something defensible. Treat the AI bill the way you treat the cloud bill — with FinOps-level discipline, chargeback, and tiered access. The era when “we’ll figure out the cost later” was acceptable is closing. The CFOs are starting to ask.
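As a sketch of what that measurement could look like, the snippet below computes four of those metrics from per-run records. The `AgentRun` fields and the exact metric definitions (for example, retry rate as average retries per run) are assumptions made for illustration; a real pipeline would pull them from whatever telemetry the harness emits, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentRun:
    tokens: int           # total tokens billed for the run
    cached_tokens: int    # portion of those tokens served from cache
    retries: int          # retry count inside the run
    pr_accepted: bool     # did a reviewer merge the result?
    escaped_defect: bool  # was a defect found after merge?


def cost_slo(runs: List[AgentRun]) -> Dict[str, float]:
    """Summarize AI spend per unit of accepted work (definitions are assumptions)."""
    if not runs:
        return {}
    accepted = [r for r in runs if r.pr_accepted]
    total_tokens = sum(r.tokens for r in runs)
    return {
        # Rejected runs still cost money, so they count in the numerator.
        "tokens_per_accepted_pr": total_tokens / len(accepted) if accepted else float("inf"),
        "retry_rate": sum(r.retries for r in runs) / len(runs),
        "cache_hit_rate": sum(r.cached_tokens for r in runs) / total_tokens if total_tokens else 0.0,
        "defect_escape_rate": sum(r.escaped_defect for r in accepted) / len(accepted) if accepted else 0.0,
    }


# Invented sample data: three runs, two merged, one of which escaped a defect.
sample = [
    AgentRun(100_000, 40_000, 1, True, False),
    AgentRun(300_000, 60_000, 4, False, False),
    AgentRun(200_000, 100_000, 1, True, True),
]
for metric, value in cost_slo(sample).items():
    print(f"{metric}: {value:,.2f}")
```

Charging rejected runs against accepted PRs is a deliberate choice here: it keeps the cost of failed attempts visible instead of letting them vanish from the denominator.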

The winning organization doesn’t ask, “How many engineers can we cut?” It asks, “How much judgment can each token amplify?” That’s a different optimization problem with a different answer.

The Bottom Line

The tokens-for-engineers trade is the most consequential cost-and-headcount decision most boards will make in the next two years. It deserves better math than the one-column spreadsheet currently driving it.

The math is missing three things. Token costs aren’t trending to zero — Jevons, capacity, and unconstrained loops all push the bill up. Frontier AI still fails at real engineering — the receipts are public, the failure modes are documented, and the autonomous coding factory still leaks. Every session forgets — there is no mechanism by which an agent accumulates the compressed lived experience that makes a senior engineer’s judgment worth what it costs.

Each one of those holes turns the spreadsheet into wishful thinking. Together, they turn it into a credibility risk for whoever signed off on it.

The right call is leverage, not replacement. The right metric is judgment per token. The companies that get this — that build governance, measurement, and constraint around AI spend while protecting the human judgment that makes it useful — will pull ahead of the ones that ran the arbitrage. And they’ll do it without a public incident report.

The boardroom version of this argument runs eight slides — receipts included. If you’re sitting in those rooms and want it as a presentation rather than a long read, the deck is on the way.