Tokenmaxxing: The AI-Adoption Metric That Punishes Your Best Engineers

A Familiar Mistake in New Clothing

Every few years, someone in leadership decides they've finally found the number that measures developer productivity. I've watched a lot of these attempts up close, and they all share a structure: a metric that's easy to collect, plausible-sounding in a slide deck, and—within about a sprint—thoroughly gamed.

The hard truth, which good engineering managers eventually internalize, is that there is no clean, easily measurable proxy for how productive a developer is. Not lines of code. Not tickets closed. Not bugs squashed. Not "finished everything in the sprint." The work is too varied, too context-dependent, and too entangled with everyone else's work to compress into a single scalar.

The latest entry in this genre is tokenmaxxing: grading how well a developer has "adopted AI" by measuring how many LLM tokens they consume.

It is one of the most self-defeating metrics I've encountered. Not only is it trivially easy to game; it is actively harmful to a company's actual goals.

It reminds me of the most ill-advised cloud migration projects, where companies prioritize speed of migration and app performance (equal to or better than on-prem... after all, the cloud providers told us how fast everything is going to run!). Developers gladly oblige: they stand up the cloud infrastructure, size everything generously so there are no bottlenecks, and skip refactoring that might save a lot of cost but is time-consuming and brings no visible benefit toward the stated goal. Keeping costs contained ends up near the bottom of the list of priorities. Cue million-dollar cloud bills rolling in shortly after, when 90% of that cost could've been avoided.

Why Every Productivity Metric Eventually Backfires

Before we get to tokens, it's worth being precise about the underlying pattern, because tokenmaxxing is just the newest instance of a very old failure mode.

The economist Charles Goodhart gave us the canonical version, usually quoted in Marilyn Strathern's later phrasing: "When a measure becomes a target, it ceases to be a good measure." The moment you tie a reward to a number, people optimize the number, not the thing the number was supposed to represent. Donald Campbell put it even more bluntly: the more a quantitative metric is used for decision-making, the more it corrupts the process it was meant to monitor.

And here's the part managers consistently underestimate: the developers who care least about the quality of their work are the first to game the system, and the best at it. The conscientious engineer keeps doing careful work and quietly loses ground on the leaderboard. The cynic reads the metric as a set of instructions. You don't just get distorted data—you get an incentive gradient that promotes your least principled people.

The history is full of examples:

Lines of code. Reward volume and you get sprawling, copy-pasted, deliberately verbose code. The famous (possibly apocryphal but perfectly illustrative) line: measuring programming progress by lines of code is like measuring aircraft-building progress by weight.
Tickets closed. Reward throughput and watch engineers race to grab the trivial tickets, split one task into five "tickets," and leave the genuinely hard, ambiguous work—the work that actually matters—untouched because it tanks their count.
Bugs fixed. Pay a bounty per bug and you've just created a market for bugs. The cobra effect, named for the colonial Delhi bounty on dead cobras that ended with people breeding cobras for the payout, is not a metaphor here. It's the literal mechanism.
"Finished all assigned tasks within estimate." Reward hitting estimates and engineers sandbag. They pad every estimate ("that'll take three days"), refuse stretch goals, and accept as few tickets as possible during planning so the denominator stays comfortable. You have just trained your team to be less ambitious, on purpose.

Each of these metrics felt reasonable to whoever introduced it, yet each created a perverse incentive that rewarded gaming over craft. Tokenmaxxing belongs to this lineage, but it manages to be worse, because the behavior it rewards isn't merely unproductive. It's actively, measurably harmful on three separate axes: cost, code quality, and trust.

The Special Stupidity of Tokens-as-Target

Here is what makes tokenmaxxing uniquely backwards: the resource being maximized costs the company real money, and minimizing its use is genuine skilled labor.

Think about what it actually takes to use an LLM well on a codebase:

It takes effort to start a fresh thread at the right moment instead of letting one bloated conversation accumulate 80,000 tokens of stale context.
It takes judgment to give the model exactly the context it needs—the relevant file, the specific function, the precise error—rather than pasting the entire repository and hoping.
It takes discipline to curate your tools, disabling the forty MCP integrations you're not using so their schema definitions stop eating your context budget before you've typed a word.
It takes restraint to do the small, easy task yourself in ninety seconds rather than spinning up an agent to do it in five minutes and 12,000 tokens.

Every one of those behaviors is the mark of an engineer who understands the tool. And every one of them lowers token consumption. So a metric that rewards high token counts is, with almost surgical precision, a metric that rewards not understanding the tool and penalizes the people who do.

You are paying a per-token bill to a model provider. You have introduced a metric that encourages your engineers to inflate that bill. You have, in effect, put a bounty on your own cloud invoice. The cobra farm, but for inference.
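
To put rough numbers on that bill, here is a minimal back-of-envelope sketch in Python. Only the forty integrations and the 80,000-token stale thread come from the text above; the per-integration schema size, the price, the request volume, and the function names `overhead_tokens` and `overhead_cost` are all hypothetical assumptions chosen for round figures, not measurements of any real model or provider.

```python
# Back-of-envelope sketch. Every number below is an illustrative assumption,
# not a measurement of any real model, MCP server, or provider price list.

def overhead_tokens(integrations: int, tokens_per_integration: int,
                    stale_history: int) -> int:
    """Tokens sent with a request before the actual task is even described."""
    return integrations * tokens_per_integration + stale_history


def overhead_cost(tokens_per_request: int, requests: int,
                  price_per_million_tokens: float) -> float:
    """Dollars spent re-sending that overhead, assuming no prompt caching."""
    return tokens_per_request * requests / 1_000_000 * price_per_million_tokens


PRICE = 5.0        # assumed dollars per million input tokens
REQUESTS = 1_000   # assumed requests per engineer per month

# Bloated session: forty tool integrations plus an 80,000-token stale thread.
bloated = overhead_tokens(integrations=40, tokens_per_integration=500,
                          stale_history=80_000)
# Lean session: three curated integrations and a fresh thread.
lean = overhead_tokens(integrations=3, tokens_per_integration=500,
                       stale_history=0)

bloated_cost = overhead_cost(bloated, REQUESTS, PRICE)
lean_cost = overhead_cost(lean, REQUESTS, PRICE)

print(f"bloated: {bloated:,} overhead tokens/request, ${bloated_cost:,.2f}/month")
print(f"lean:    {lean:,} overhead tokens/request, ${lean_cost:,.2f}/month")
```

Under these assumed figures the bloated session carries 100,000 tokens of overhead on every request and the lean one 1,500. A token leaderboard ranks the first engineer higher, and the invoice agrees.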

Tokenmaxxing Wrecks Your Code

If tokenmaxxing only wasted money, it would merely be expensive. What makes it genuinely dangerous is that bloated context produces worse output—and most executives setting these targets have no idea.

The marketing says context windows are now 200K tokens, a million, even more. The engineering reality, which we've written about before in The Context Window Survival Guide, is that model performance degrades well before those limits. Effective working context is a fraction of the advertised ceiling. Cram a model full of irrelevant files and three tangents' worth of stale conversation, and several things happen at once:

Attention dilutes. The signal the model needs is now buried in noise. The "lost in the middle" effect is well documented: information stranded in the center of a long context is the most likely to be ignored. Your token-stuffed prompt isn't giving the model more to work with—it's giving it more to get distracted by.
Hallucination risk rises. When the model is reasoning over a sprawling, contradictory context (the old version of the function and the new one, three different conventions, a half-abandoned refactor), it's far more likely to blend them into something that compiles but is subtly wrong.
Instructions get diluted. Your actual request—"fix this one race condition"—is now one sentence competing with 60,000 tokens of ambient material for the model's focus.

So the tokenmaxxer isn't just spending more. They are shipping lower-quality, higher-risk AI-assisted code, and they're doing it because the metric told them to. The disciplined engineer who starts a tight, well-scoped session gets a cleaner diff, fewer hallucinations, and a model that actually followed instructions—and gets dinged for "low AI adoption."

This is the cruelest part of the whole arrangement. Leadership introduces the metric to improve the output of AI-assisted development. The metric degrades it. The number goes up; the codebase gets worse. Keeping context lean isn't a cost-saving nicety—it's a first-class engineering practice that improves correctness. Tokenmaxxing inverts it into a vice. (This is the same dynamic that quietly ended MCP's honeymoon for a lot of teams—context bloat as silent quality killer—which we covered in Why MCP Stopped Making Sense for Most Apps.)

The Psychology: Why Smart People Go Along With It

Here's the question that bothered me most. The senior engineers who'd be setting up these systems know all of the above. They know token efficiency is a skill. They know lean context produces better code. So why does anyone with technical literacy sign off on tokenmaxxing instead of telling leadership it's the worst idea they've heard all quarter?

A few psychological forces are worth naming, briefly, because they explain a lot.

The overjustification effect. When you attach an extrinsic reward to behavior people were already doing for intrinsic reasons—curiosity, craftsmanship, the satisfaction of using a tool well—you don't amplify the behavior. You often crowd it out. Good engineers used AI thoughtfully because it made their work better. Turn it into a quota and you convert a craft into a chore, and the thoughtfulness is the first thing to go.

Goodhart at the individual level. Once the dashboard exists, people stop asking "is this the right way to use the tool?" and start asking "what makes my number look good?" The metric quietly replaces judgment with compliance. That's not a character flaw; it's how measured humans behave. The Hawthorne lesson is that people change their behavior once they know it is being measured, and with a gameable metric the change is rarely for the better.

Loss aversion and the headcount problem. This is the uncomfortable one. Effective AI adoption, done right, reduces how many people you need to ship the same work. For a senior developer or team lead, that's not an abstract efficiency win—it's a direct threat to something they value: the size of the team reporting to them. Even an engineer who feels personally secure knows that fewer reports is a worse résumé line, less organizational weight, a smaller domain. Team size is itself a status metric (another bad one), and people are far more motivated to avoid a loss than to chase an equivalent gain.

Put those together and you get a quietly rational, quietly corrosive outcome: the people who best understand how to make AI adoption actually work have a personal incentive to make it work slowly and visibly—and a token-consumption metric is the perfect cover. It looks like enthusiastic adoption. It generates a reassuring upward graph for the executive. And it happens to kneecap the one outcome (real efficiency, real headcount leverage) that would threaten the person implementing it. Nobody has to say any of this out loud. The incentive does the talking.

So the technically literate go along with the genius plan of measuring AI adoption by token burn—rather than telling the MBA in the room that it's the dumbest thing they've heard—and everyone gets a graph that points up while the actual goal recedes.

The Externality Almost Nobody Counts

One more cost, and it's the one I find hardest to shrug off: the environmental footprint.

Frivolous token consumption isn't free in some abstract cloud sense. Every wasted inference is real energy, real water for datacenter cooling, real carbon—spent to do nothing, or worse than nothing. It also consumes finite datacenter and accelerator capacity that could be serving legitimate work. When you build an incentive that explicitly rewards burning more tokens, you are subsidizing waste that has a physical bill attached, paid partly by people who never opted in. Of all the reasons tokenmaxxing is indefensible, "we manufactured an incentive to waste scarce compute for a vanity graph" is the one that should make a thoughtful executive wince.

What Good Looks Like Instead

If you genuinely want to know whether your team is getting value from AI-assisted development, don't measure consumption. Measure outcomes, and accept that the good ones are harder to put on a dashboard:

Cycle time and quality together. Is well-tested, reviewable work shipping faster without a rise in defects or revert rate? That's adoption working.
Cost per shipped unit of value, not raw spend. A team that ships more with fewer tokens is your success story, not your laggard. Token efficiency is a feature.
Qualitative signal from people you trust. Ask your strongest engineers where AI genuinely helps and where it gets in the way. The answer is more useful than any counter.
Remove the disincentives, don't add quotas. If the honest obstacle to AI adoption is that your seniors fear it'll shrink their teams and their standing, no metric fixes that. Address the career incentives directly—or the smartest people in the building will keep quietly steering around you.
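
As an illustration of the "cost per shipped unit of value" point, the sketch below ranks the same hypothetical team two ways: by raw token consumption and by tokens per shipped change. The `EngineerStats` type, the `tokens_per_change` helper, and all the figures are invented for this example. It is a way of seeing the inversion, not a proposed replacement target, since any single number turned into a quota invites the same gaming.

```python
from dataclasses import dataclass


@dataclass
class EngineerStats:
    name: str
    tokens_used: int       # total LLM tokens consumed in the period (made up)
    changes_shipped: int   # merged changes not later reverted (made up)


def tokens_per_change(stats: EngineerStats) -> float:
    """Average tokens spent per shipped change; infinite if nothing shipped."""
    if stats.changes_shipped == 0:
        return float("inf")
    return stats.tokens_used / stats.changes_shipped


# Hypothetical team; none of these figures come from real data.
team = [
    EngineerStats("A", tokens_used=9_000_000, changes_shipped=6),
    EngineerStats("B", tokens_used=1_200_000, changes_shipped=12),
    EngineerStats("C", tokens_used=4_000_000, changes_shipped=10),
]

# The tokenmaxxing leaderboard: most tokens burned first.
by_consumption = [s.name for s in
                  sorted(team, key=lambda s: s.tokens_used, reverse=True)]

# The outcome view: fewest tokens per shipped change first.
by_efficiency = [s.name for s in sorted(team, key=tokens_per_change)]

print("ranked by consumption:", by_consumption)
print("ranked by efficiency: ", by_efficiency)
```

With these made-up numbers the two rankings are exact opposites: the engineer at the top of the consumption leaderboard spends the most tokens per shipped change, and the "laggard" spends the least.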

The throughline of effective AI-assisted engineering is the same one we keep coming back to: context is architecture, and less context, used deliberately, beats more context used carelessly—on cost, on carbon, and, most importantly, on the quality of what you ship.

The Real Lesson

Tokenmaxxing is a small story about a bad metric, but it's also a preview of why AI adoption will take longer than the hype curve promises. Not because the technology isn't ready—it largely is—but because organizations are still strikingly bad at telling a good adoption strategy from a destructive one, and because the people best positioned to know the difference are too often the people with the least incentive to say so.

The fix isn't a better number. It's the humility to admit that the most valuable engineering behaviors—restraint, judgment, knowing when not to spend the token—were never going to show up on the leaderboard. They never have. That's exactly why the leaderboard keeps lying to you.