The Agentic Cost Cliff

Running a coding agent for a single developer is relatively predictable in terms of cost. Scaling that to agent swarms, continuous PR review, and automations that run around the clock is a different problem entirely. The usage is no longer bounded by how fast a human works or the number of agents they can reasonably manage. Agent workloads run continuously, in parallel, across every open PR and every queued task, and the inference costs compound in ways that individual developer usage never did.

The answer is not to scale back. It is to match the model to the workload. State-of-the-art models earn their cost on high-judgment tasks where the quality of reasoning changes the outcome. For the high-volume automations running in the background, capable open-source models do the job at a fraction of the price. Building that distinction into your stack is the difference between agentic AI that scales sustainably and one that gets expensive faster than it delivers value.

March 22, 2026 · Tomi Stipancik

Developer usage is a fraction of what automations generate

The AI usage a single developer generates over a working day is relatively bounded. They ask questions, work through problems, run an agent on a task. The volume is real, but it scales with the number of people on the team. Automations scale differently. A PR review agent fires on every commit across every developer simultaneously. A health check runs on a continuous loop. A swarm working through a sprint backlog generates inference across every ticket in parallel without stopping. The usage is no longer proportional to headcount. It is proportional to activity, and engineering orgs generate a lot of activity.

As the number of automations grows, the gap between what the org spends on developer-driven AI and what it spends on agent-driven AI widens. The orgs managing this well are the ones that recognized early that these two categories warrant different cost strategies.
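The scaling difference is easy to see in a back-of-the-envelope model. Every number below is a hypothetical assumption for illustration, not a measured figure: the point is only that one term grows with headcount and the other grows with activity.

```python
# Back-of-the-envelope cost model. All token figures are assumptions
# chosen for illustration, not measurements.

def daily_tokens(developers: int, commits_per_dev: int) -> dict:
    """Estimate daily token volume for developer-driven vs automation-driven usage."""
    # Assumption: a developer's interactive sessions consume ~200k tokens/day.
    dev_usage = developers * 200_000
    # Assumption: a PR-review agent consumes ~50k tokens per commit, and a
    # continuous health check burns ~2M tokens/day regardless of headcount.
    automation_usage = developers * commits_per_dev * 50_000 + 2_000_000
    return {"developer": dev_usage, "automation": automation_usage}

usage = daily_tokens(developers=40, commits_per_dev=8)
# The automation term is driven by commits (activity), not people, so each
# new automation multiplies it while the developer term stays flat.
```

Under these assumptions, a 40-person team generates more than twice as many automation tokens as interactive tokens from a single review agent and one health check, before any swarm work is added.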

Most automations do not need the best model

Frontier models earn their cost on work where reasoning quality directly changes the outcome. A complex refactor spanning a large codebase, an agent interpreting ambiguous requirements where a wrong call creates significant rework, tasks where the model needs to hold a lot of context and make good judgments across it. For that category of work, the price difference between the state of the art and the tier below it is worth paying.

High-volume automations are a different category. The quality gap between Opus 4.6 and a capable open-source model on PR review comments, issue triage, CI failure summaries, or changelog generation is small. The cost gap is not. Running these workloads on a frontier model by default is not a quality decision. It is a default that nobody revisited once the stack was in place. Workloads that typically belong in the cheaper tier:

  • PR review comments and inline feedback
  • Issue triage and label assignment
  • CI health monitoring and failure summarization
  • Changelog and release note generation
  • Documentation updates from code changes
  • Code formatting and style validation

Tiering by workload is the same logic that drives every other infrastructure decision: use the right resource for the job, not the best available resource for every job. The implementation in AI is newer than in compute, but the reasoning is identical.
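In code, that tiering logic can be as small as a routing table. This is a minimal sketch; the model names are placeholders, and the workload keys mirror the categories above rather than any particular product's taxonomy.

```python
# Minimal workload-tiered routing sketch. Model names are placeholders;
# substitute whatever frontier and open-source models your stack uses.

FRONTIER = "frontier-model"      # high-judgment work
WORKHORSE = "open-source-model"  # high-volume automations

# Default tier per workload type.
ROUTES = {
    "complex_refactor": FRONTIER,
    "ambiguous_requirements": FRONTIER,
    "pr_review_comment": WORKHORSE,
    "issue_triage": WORKHORSE,
    "ci_failure_summary": WORKHORSE,
    "changelog_generation": WORKHORSE,
}

def model_for(workload: str) -> str:
    """Pick a model tier for a workload type."""
    # Unknown workloads fall back to the frontier tier: overpaying is safer
    # than silently degrading a high-judgment task.
    return ROUTES.get(workload, FRONTIER)
```

The fallback direction is a deliberate choice: a new, unclassified workload costs more until someone audits it, rather than quietly getting worse results.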

The cheaper tier is genuinely capable now

Cursor shipped Composer 2 on Kimi 2.5 from Moonshot AI, with additional reinforcement learning applied on top. On several coding benchmarks it outperforms Claude Opus 4.6 at a fraction of the cost. That is not a marketing position. It is a product decision that reflects where the model actually benchmarks. Cursor built it because the cost advantage at scale is real and the quality holds up for the workload.

MiniMax M2.7 sits at near parity with Opus 4.6 on major benchmarks. Raw Kimi 2.5 matches Claude Sonnet 4.5, which was state of the art three months ago and remains an extremely capable coding model. The gap between the frontier tier and the one below it is narrower than it has ever been, and the set of models in that second tier is growing. For automations that do not require leading-edge reasoning, the options are real and the cost difference is significant.

These models are accessible today through US-based inference providers like Fireworks AI and Together AI. For teams with data residency or compliance requirements, AWS Bedrock provides managed inference for many of these models within your existing AWS account, keeping data off third-party APIs without requiring you to manage infrastructure. For teams that need unrestricted throughput and want to eliminate per-token costs at runtime, self-hosting on dedicated GPU compute is a separate path: more operational overhead, but full control over capacity and pricing.
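The paragraph above describes a three-way deployment decision. A simplified sketch of that decision, under the stated trade-offs (this is a simplification for illustration, not a recommendation engine):

```python
# Sketch of the deployment decision described above. The rules are a
# deliberate simplification of the trade-offs in the text.

def deployment_path(needs_residency: bool, needs_unmetered_throughput: bool) -> str:
    """Choose a serving path for open-source model inference."""
    if needs_unmetered_throughput:
        # Self-hosting on dedicated GPUs: most operational overhead,
        # but no per-token cost and full control over capacity.
        return "self-hosted-gpu"
    if needs_residency:
        # Managed inference inside your own cloud account (e.g. AWS Bedrock):
        # keeps data off third-party APIs without running infrastructure.
        return "managed-in-account"
    # Default: hosted inference APIs (e.g. Fireworks AI, Together AI) --
    # the lowest-friction way to reach these models today.
    return "hosted-api"
```

In practice the decision has more inputs than two booleans, but the ordering captures the trade-off: each step down the list buys control at the price of operational effort.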

Provider lock-in limits your options as the landscape shifts

Kimi 2.5 and MiniMax M2.7 did not exist a year ago. The competitive model landscape is moving fast enough that a commitment made six months ago may already be leaving cost or capability on the table. Teams that have wired their agentic workflows tightly to a single provider find themselves unable to act on better options when they emerge, not because the models are not available but because the switching cost of rebuilding the workflows around them is too high.

Model-agnostic infrastructure solves this at the architectural level. When your team's configuration, workflows, and institutional context are preserved independently of the model running underneath, switching becomes a routing decision rather than a migration. You get the benefit of a tiered strategy today and the flexibility to update it as the landscape continues to change.
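One concrete shape for "switching as a routing decision": keep the model choice in configuration and inject the provider call, so workflow logic never imports a provider SDK directly. The names below are placeholders, and `call_model` stands in for whatever client abstraction your stack uses.

```python
# Sketch of model-agnostic workflow wiring. Model names and config keys are
# placeholders; `call_model` is an injected provider client, not a real SDK.

CONFIG = {
    "pr_review": {"model": "open-model-a", "max_tokens": 1024},
    "refactor_agent": {"model": "frontier-model", "max_tokens": 8192},
}

def run_workflow(name: str, prompt: str, call_model) -> str:
    """Run a workflow against whatever model its config entry names.

    Workflow logic depends only on CONFIG; swapping providers or models
    is an edit to one config value, not a migration.
    """
    cfg = CONFIG[name]
    return call_model(model=cfg["model"], prompt=prompt, max_tokens=cfg["max_tokens"])
```

Because the provider client is passed in rather than imported, a better model landing next quarter changes one line of configuration while prompts, workflows, and institutional context stay put.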

The time to build this in is before you need it

As agentic workflows become a larger share of how engineering orgs operate, model selection becomes a real infrastructure decision with real cost consequences. The orgs that treat it deliberately, matching model to workload and maintaining the flexibility to adjust as options improve, will manage that cost curve better than orgs that let defaults accumulate into a fixed stack.

The practical starting point is an audit of what your automations are running on and whether each workload justifies it. Most high-volume automations do not require frontier reasoning, and routing them to capable open-source models is not a compromise on quality. It is the right tool for the job.
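That audit can start as a spreadsheet-sized calculation. The prices and volumes below are made-up assumptions for illustration; substitute your real usage export and current provider pricing.

```python
# Audit sketch. Prices (USD per million tokens) and monthly volumes are
# assumed values for illustration -- replace with real figures.

PRICE_PER_MTOK = {"frontier-model": 15.00, "open-source-model": 0.60}

# (workload, monthly volume in millions of tokens, current model,
#  whether the workload genuinely needs frontier reasoning)
WORKLOADS = [
    ("pr_review", 400, "frontier-model", False),
    ("complex_refactor", 50, "frontier-model", True),
]

def downgrade_savings(workloads) -> float:
    """Monthly savings from moving non-frontier work to the cheaper tier."""
    delta = PRICE_PER_MTOK["frontier-model"] - PRICE_PER_MTOK["open-source-model"]
    return sum(
        mtok * delta
        for _, mtok, model, needs_frontier in workloads
        if model == "frontier-model" and not needs_frontier
    )
```

Even with only two line items, the shape of the result is the point: the high-volume workload that does not need frontier reasoning dominates the savings, while the genuinely high-judgment workload stays where it is.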