The Economics of Writing Code Changed
The cost of producing a line of working code has dropped by something like an order of magnitude in two years, and it is not going back up. AI-assisted coding is here to stay, and the current generation of models is genuinely good. Not "good enough for a demo" good. Good enough that a senior engineer paired with one can outship a team of five from 2022, on real systems, in production.
That is the shift the rest of this post is reacting to. When the price of writing, translating, and rewriting code falls that far, the patterns that were optimal at the old price are not optimal at the new one. Some of the things we skipped because they were too expensive now pay for themselves. Some of the things we did because they saved keystrokes are now just making the code harder to read. The question worth asking on every architectural decision is: what did I previously avoid because we couldn't afford the time for humans to type it?
METR's Task-Length Benchmark
If you want the shortest possible version of why this matters, it is METR's time-horizon benchmark. They measure the longest software task, in human-expert time, that a frontier model can complete autonomously with 50% success. That length has been doubling roughly every 7 months since 2019, and the doubling rate has actually been accelerating since 2024.
50% time horizon on METR's task suite, by model release date. Log scale. Data: METR, 2025.
Six years ago, frontier models could handle tasks that take a human two seconds. Today they handle tasks that take a human most of a workday, at the 50% mark, and the 80% mark is tracing the same curve a few months behind. If you draw a straight line on this chart, the argument about rewriting architecture around AI goes from speculative to obvious.
Skipping the Old Advice
Some of what gets called "new AI-era advice" is old advice that was already correct. Prefer duplication to the wrong abstraction. Avoid deep inheritance. Don't build a strategy hierarchy for a second implementation that may never exist. Sandi Metz was saying this in 2014. AI lowers the cost of getting it wrong, which makes the advice matter a little less, not more. I'll skip past it.
The interesting question is the other direction: which of the heavyweight patterns we used to skip because they cost too much in human time are suddenly worth reaching for?
Bounded Contexts and Inversion of Control
Domain-driven design has been around for twenty years and most teams still do not practice it rigorously. Drawing bounded contexts, defining aggregates, and enforcing the boundaries between them is a lot of upfront work for a CRUD app. Most teams skipped it for the same reason they skipped event sourcing: the labor cost did not justify the payoff.
That changes when the codebase has a non-human author who can see exactly as much of the system as you show it, and no more. An AI working inside a well-drawn bounded context has a finite set of types, a clear aggregate root, and explicit interfaces at the edges. It cannot accidentally reach across into billing logic while editing the shipping module, because the import does not exist. You know the blast radius of any AI-authored change by looking at the context map, not by reading every line of the diff.
Inversion of control reinforces this. When a module declares its dependencies through constructor injection or a trait boundary rather than importing concrete implementations, you can hand the AI that module in isolation. It does not need to know what the real database client looks like to write business logic against the interface. You scope the AI's working set the same way you scope a unit test: invert the control and hand it only what it needs.
This is not free either. Drawing the wrong context boundaries is worse than drawing none, and refactoring a monolith into bounded contexts is real work even with AI help. The payoff is that once the boundaries exist, every future AI-authored change is cheaper to review and safer to ship.
Example. A monolith has an OrderService that imports InventoryClient, PaymentGateway, NotificationSender, and AuditLogger directly. An AI editing that file needs to understand all four dependencies, their side effects, and their error modes. Refactored: OrderService lives in the orders bounded context, takes four interfaces through its constructor, and the context boundary enforces that nothing outside is directly reachable. The AI modifies order logic, runs the tests with fakes, and produces a working change without ever touching payments or inventory code.
Event Sourcing Earns Its Cost
Event sourcing means storing the sequence of state changes instead of the current state. It is not free. Schema evolution is real work. GDPR deletion against an append-only log is awkward. Projection rebuilds take hours on a big dataset, and you are usually running a second storage system for the read side. None of that has gone away. What has changed is the part that used to dominate the budget: writing the event types, the handlers, the projections, and the tedious migration code. That part is now the cheapest piece of the project.
In return you get an auditable record of what an AI agent did to your system. A row that says a customer is on the Pro plan does not tell you which agent run upgraded them or why. A log entry that says PlanUpgraded(customer=123, from=Free, to=Pro, actor=billing-agent, reason="retry of failed webhook", trace_id=...) does.
Example. Instead of an orders table you mutate in place, emit OrderPlaced, OrderLineAdded, OrderDiscountApplied, OrderPaid, OrderShipped. "Current state" is a projection. When a customer emails support asking why their total changed on Tuesday, the answer is a query against the log instead of a half-day spent correlating logs and database snapshots. The price of event sourcing is still real. It now buys something you actually want.
Workflow Engines Belong in More Places
Durable workflow engines like Temporal, Restate, DBOS, and Inngest used to be a niche choice for teams with a specific long-running-process problem. They should now be a default for any multi-step business logic that involves an AI call, because the shape of AI-involving work is exactly what these engines were built for.
LLM calls fail and time out and need retries with modified prompts. Tool calls against external systems are non-idempotent and partial. A human might need to approve a step in the middle of a run. That is not something you fit into a request handler with a 30-second timeout. It is something you build as a workflow with durable state, per-activity retry policies, and signals.
The refund-with-human-approval example is on Temporal's marketing page, so let me use a different one. A data ingestion pipeline where an AI proposes a schema mapping from a new vendor's CSV, a deterministic validator checks it against historical samples, rejections get re-prompted with the validator's actual complaint, and a human is paged only after five failed attempts. As a request path this is unbuildable. As a workflow, each retry is a replay with a new prompt, and six months later you can open any run and see exactly what the AI tried and which attempt the validator accepted.
Rust Makes More Sense, For Specific Reasons
The old argument against Rust was that it slowed humans down. Borrow checker, lifetimes, exhaustive Result handling, three times the code of the equivalent Python. That cost was real when a human typed every character. It is a lot smaller when an AI does.
The payoff went up, but only for certain kinds of AI mistakes. A strict compiler catches missing enum arms, shape mismatches, and all the "forgot to update the other three call sites" bugs that AI-generated patches love to introduce. It does not catch the most common hallucination, which is calling a real function that means something different from what the AI assumed. No type system saves you from that. You still need tests.
Example. In a Python codebase, ask the AI to add a new payment_method, and it will plausibly forget one of the four places the value is switched on. In Rust with an enum PaymentMethod and an exhaustive match, the missing arm is a compile error the AI fixes on the next turn without you having to find it. The verbosity is the feature.
Quick clarification so this section does not contradict itself: "strict" does not mean "no metaprogramming." Rust has some of the heaviest metaprogramming in mainstream use, and Serde's derive macros are fine. The distinction is whether the metaprogramming produces typed output the compiler checks, or runtime reflection that makes behavior appear from nowhere. The first kind helps AI-written code. The second kind does not.
Trace Everything, And Don't Confuse It With LLM Observability
Logging won because printf was cheap and setting up OpenTelemetry was not. That economic argument is gone. You can get an AI to instrument a service with spans, attributes, and baggage propagation in an afternoon. There is almost no reason to ship a backend you cannot trace end-to-end.
But tracing answers "what happened," not "why the model chose this." These are two different questions with two different tools. OpenTelemetry shows you the request path, the tool calls, the retries, the event writes. For the model's actual decision-making you want prompt and response capture, token-level attribution, eval harnesses, and a tool like Langfuse, Braintrust, or Phoenix hooked into the same trace IDs. Teams that treat OTel as sufficient end up with beautiful traces and no idea why the bot did what it did.
Example. A customer says "your bot refunded the wrong order." The OTel trace shows you the request, the tool calls, the workflow signals, and the RefundIssued event with its reason field. Langfuse shows you the exact prompt, the retrieved context, the model version, and the alternative completions the model scored lower. You need both. Either one on its own leaves you guessing in the post-mortem.
Heavy Rewrites Deserve a Second Look
A lot of modern engineering wisdom is really risk management around the cost of human rewrites. Don't rewrite it. Don't change languages. Don't migrate the persistence model. These were correct when the rewrite was six engineers for nine months. They are less universally correct now.
I do not want to oversell this. AI does not turn a quarter-long migration into a sprint. It turns the mechanical parts of one (translation, boilerplate, test scaffolding) into something that feels more like review than authoring. The data backfill, the cutover plan, and the production surprises still cost what they cost. What shifts is the ratio, not the total.
What that ratio shift unlocks is a different kind of project: one where the payoff is high but the mechanical labor was the thing killing it. Porting a hot-path service from Python to Rust. Adding event sourcing to a module that has been mysteriously losing state. Retrofitting a tangled request path into a workflow engine. Adding tracing across an entire fleet. These used to get planned, scoped, and then quietly deferred. Now they are worth re-costing from scratch, because the line between "too expensive" and "worth doing" has moved.
Closing Thought
If there is a common thread here, it is that cheap ceremony is good ceremony as long as a machine is the one enforcing it. A compiler that rejects an invalid state. A workflow engine that refuses to lose a step. An event log that records every change. These used to be expensive to build. Now they are cheap, and they happen to catch exactly the ways AI-authored code tends to go wrong.
The patterns that should fade are the ones where a human was the enforcement mechanism: code review as the only quality gate, discipline as the only way to keep duplicates in sync, convention as the only way to keep layers separate. Those were always fragile. They get worse when the volume of change goes up and the reviewer does not.
None of this applies universally. Embedded code, games, and throwaway scripts have their own economics, and some of the advice here will be wrong for them. For the long-lived backends I actually work on, where correctness and auditability matter and an AI is increasingly one of the authors, the heavyweight patterns I used to skip are the ones I now reach for first.