Be careful what you measure

Software and woodworking seem very different at first glance. Being a physical medium, wood is limited in how you can fix early errors. If you cut too much, you are left with a piece that is far too small. Gluing another piece to get it to the right size is possible, but it weakens the structure. There is an old adage in woodworking to measure twice, cut once. Take your time up front & you’ll have a better result.

In software, the always-adjustable nature of the medium allows us to be more flexible. Early software engineering efforts borrowed from physical engineering with front-loaded planning (measure twice, cut once). 25 years ago, the Agile Manifesto was released. It threw out the old, heavy planning approach to software and taught us to take advantage of the medium. Agile has since become the standard way most software projects are managed.

But one downside of agile was the lack of a long-term plan, which made it hard to say when something would be “done”. We couldn’t necessarily tell that our work was moving the effort forward. So, we started adding metrics to have the warm & fuzzy feeling of quantifying our output.

Lines of code were frequently used in the early days. The thinking was that the more code you wrote, the more productive you were. There are numerous downsides to this metric. More efficient or easier-to-read code might use fewer lines than bloated code. Duplicating code instead of moving it into a shared method produces more lines of code, but the result is harder to maintain. Refactoring is often one of the most valuable efforts in software engineering, yet it often results in fewer lines of code, or even a commit with a net-negative line count. When we evaluate engineers based on the number of lines of code they commit, we create a metric that is misaligned with the outcome we want.
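To make the duplication point concrete, here is a minimal sketch in Python. The two snippets and the `order_total` and `cart_total` names are made up for illustration, and `count_loc` is a deliberately naive line counter, not how any particular vendor tool works. Both versions compute the same result, yet the refactored one scores half as well on a lines of code metric.

```python
# Two hypothetical implementations of the same behavior, kept as source
# strings so we can both count their lines and execute them.
DUPLICATED = '''
def order_total(cart_a, cart_b):
    total_a = 0
    for price, qty in cart_a:
        total_a += price * qty
    total_b = 0
    for price, qty in cart_b:
        total_b += price * qty
    return total_a + total_b
'''

REFACTORED = '''
def cart_total(cart):
    return sum(price * qty for price, qty in cart)

def order_total(cart_a, cart_b):
    return cart_total(cart_a) + cart_total(cart_b)
'''


def count_loc(source):
    """Count non-blank lines: the naive 'lines of code' metric."""
    return sum(1 for line in source.splitlines() if line.strip())


def run(source, cart_a, cart_b):
    """Execute a snippet and call the order_total function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["order_total"](cart_a, cart_b)


# Illustrative carts of (price, quantity) pairs.
cart_a = [(10, 2), (5, 1)]
cart_b = [(3, 3)]

print("duplicated:", count_loc(DUPLICATED), "lines, total", run(DUPLICATED, cart_a, cart_b))
print("refactored:", count_loc(REFACTORED), "lines, total", run(REFACTORED, cart_a, cart_b))
print("LOC 'productivity' of the refactor:", count_loc(REFACTORED) - count_loc(DUPLICATED))
```

Judged by line count alone, the engineer who cleaned this up did negative work, even though the behavior is identical and the code is easier to maintain.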

I’ve had a few vendors reach out to me in recent weeks, offering tools to “measure” engineers’ efficacy. In most cases, these still used lines of code, something I thought most leaders had moved on from a decade ago, having realized the fallacy of using it as a metric. Apparently not.

Of course, with AI coding tools, using lines of code as a measure of productivity is even more absurd than it was before. AI can generate a lot of code, much of it neither good nor optimized. If you are still using lines of code as a metric, you are encouraging the use of AI to create bloatware rather than an efficient, maintainable product.

The current state of LLM-based AI tools is that they are massively subsidized. The cost to provide the output and the prices charged are way out of sync. Major LLM vendors are losing billions of dollars each quarter, suggesting they are not charging enough.

But companies using AI tools think they have unlocked a huge, low-cost productivity gain. They are actively encouraging engineers to be “AI first” & use AI for everything they do, under the pretense that it makes them more productive. Some shops are measuring an engineer’s effectiveness by how many tokens they use, treating a higher count as a sign of being more “AI native”. This tokenmaxxing is just as problematic as the lines of code metric, if not worse.

Engineers are wasting tokens on throwaway prompts to game the system and improve their rankings in these token metrics. They are running multiple parallel sessions just to increase their token count. While some may applaud them for maximizing their use of AI, the question remains: to what benefit?

Just like lines of code, measuring token usage doesn’t measure the outcome we want, merely one that is easy to measure. We think we are improving productivity, but we are forgetting about real costs that are sometimes hard to measure. Bloated code with more lines than necessary creates tech debt. It slows future development, creates more bugs & often makes codebases run more slowly. With an AI subscription per engineer, evaluating engineers on token use may not seem like a big deal. They are maximizing their use of the subscription, right?

Remember that LLM AI companies are losing billions per quarter today. The subscriptions don’t come close to reflecting the actual cost of executing those tokens. Firms like Anthropic have recently started restricting high-use subscriptions and moving those users to API pricing instead, which more closely reflects actual costs. If we’ve been encouraging engineers to maximize token use and now we pay per token, what happens to our budgets? The trillions invested in AI infrastructure only make sense if investors get a return on it. The bills for AI usage only have one way to go, & it’s not down.
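A back-of-the-envelope sketch of the budget question, in Python. Every figure here (headcount, seat price, per-token price, token volumes) is an invented placeholder chosen only to show the shape of the two pricing models, not a real vendor price or a measured usage number. Under a flat subscription the bill ignores token volume; under per-token pricing it scales directly with it.

```python
def subscription_cost(engineers, seat_price):
    """Flat pricing: monthly cost depends only on headcount."""
    return engineers * seat_price


def per_token_cost(engineers, millions_of_tokens_each, price_per_million):
    """Usage pricing: monthly cost grows linearly with tokens consumed."""
    return engineers * millions_of_tokens_each * price_per_million


# Hypothetical placeholder values, not real prices or real usage data.
ENGINEERS = 50
SEAT_PRICE = 100          # assumed monthly price per seat
PRICE_PER_MILLION = 10    # assumed price per million tokens

# Compare a modest usage level with a "tokenmaxxed" one (10x the tokens).
for millions in (20, 200):
    flat = subscription_cost(ENGINEERS, SEAT_PRICE)
    metered = per_token_cost(ENGINEERS, millions, PRICE_PER_MILLION)
    print(f"{millions}M tokens each: flat={flat}, metered={metered}")
```

With these made-up inputs, a tenfold increase in token use leaves the flat bill unchanged and multiplies the metered bill by ten. The exact numbers don’t matter; the incentive to burn tokens only looks free while someone else is absorbing the cost.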

We end up with an internal process that is wasteful, costs a fortune in AI tokens, yet may not yield any meaningful results for what matters to our business.

Whatever you measure is what you optimize for. Don’t fall into the trap. Make sure you are measuring something meaningful. AI can be a useful tool when used effectively, but ingraining wasteful approaches in your teams means the bill will eventually come due.

Discover more from Niels Meersschaert