Infrastructure · Systems Thinking

The Constraint Never Disappears — It Migrates

Fix the GPU bottleneck and you'll find the memory bottleneck. Fix that and you'll find the network bottleneck. Welcome to production AI.

There's a pattern that every experienced systems engineer recognizes but few AI roadmaps account for: the migrating constraint.

You profile your inference server. The GPU is the bottleneck — utilization at 98%, everything else waiting. You optimize. You apply FlashAttention, batch requests, fuse kernels. GPU utilization drops to 60%. Victory.

Except now your KV cache is eating all the memory. Requests queue because there's no room to start new sequences. You've moved the constraint from compute to memory.
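The memory pressure is easy to underestimate because KV cache grows linearly with every token of every in-flight sequence. A back-of-the-envelope calculator makes the constraint concrete; the model dimensions below are illustrative, not tied to any specific model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes of KV cache for a batch of sequences.

    Factor of 2 covers both K and V tensors; dtype_bytes=2 assumes fp16/bf16.
    """
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * dtype_bytes

# Illustrative 7B-class config: 32 layers, 8 KV heads, head dim 128.
per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                         seq_len=4096, batch=1)
print(per_seq / 2**30, "GiB")  # 0.5 GiB per 4K-token sequence
```

At half a GiB per 4K-token sequence, a few dozen concurrent long requests exhaust an 80 GB GPU before compute is anywhere near saturated.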

You implement PagedAttention. Fragmentation all but disappears and throughput doubles. But now your network can't stream tokens to clients fast enough. The constraint migrated again, from memory to I/O.
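The core idea behind paged KV management can be sketched in a few lines: sequences draw fixed-size token blocks from a shared pool instead of reserving contiguous worst-case buffers, so waste is bounded by one partial block per sequence. This toy allocator is a simplification of what systems like vLLM do, not their actual implementation:

```python
class BlockAllocator:
    """Toy paged-KV allocator: fixed-size token blocks from a shared pool."""

    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))  # free block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lens = {}                       # seq_id -> tokens written

    def append_token(self, seq_id):
        n = self.lens.get(seq_id, 0)
        if n % self.block_tokens == 0:       # last block full: need a new one
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lens[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: return its blocks to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lens.pop(seq_id, None)
```

Because blocks return to the pool the moment a sequence completes, short requests can start as soon as any block frees up, rather than waiting for a contiguous region.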

A veteran systems engineer once described this as “the ghost in the wait”: fix one bottleneck and it moves one tier up the hierarchy. The constraint never disappears. It migrates.


Why AI makes this worse

Traditional software has relatively stable resource profiles. A database query uses predictable CPU, memory, and I/O. You can capacity-plan with confidence.

AI workloads are different. A reasoning model's memory consumption depends on how long it “thinks” — which varies per request, unpredictably. One request might use 2K tokens of chain-of-thought. The next might use 32K. Your serving infrastructure needs to handle both without the long request starving every short request behind it.

This creates convoy effects that don't exist in traditional web serving. One unexpectedly long inference can block GPU memory for seconds, cascading delays through the entire queue. Every SLO assumption from your web services playbook breaks.
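The convoy effect falls out of simple FIFO arithmetic. Assuming a single-server queue where everything arrives at once (a deliberate simplification), one long request sets the floor for every wait behind it:

```python
def fifo_waits(service_times):
    """Waiting time of each request in a single-server FIFO queue,
    assuming all requests arrive at t=0."""
    waits, clock = [], 0.0
    for s in service_times:
        waits.append(clock)   # this request waits for everything before it
        clock += s
    return waits

# One 8-second reasoning request ahead of nine 50 ms requests:
waits = fifo_waits([8.0] + [0.05] * 9)
# every short request waits at least 8 seconds
```

Nine requests that would each finish in 50 ms now all blow an SLO measured in hundreds of milliseconds, which is why production schedulers preempt or interleave long sequences instead of running strict FIFO.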


What Day Two infrastructure actually looks like

Instrument before you optimize. GPU inference observability is at roughly 2005-era database maturity. Most teams are flying blind — they know the system is slow but can't pinpoint where. Before you throw hardware at the problem, build the visibility to understand which constraint you're actually hitting.
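Stage-level timing is the cheapest place to start. A minimal sketch, using Python's standard library only; the stage names are hypothetical placeholders for wherever your request path actually splits:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# per-stage latency samples, keyed by stage name
STAGE_TIMES = defaultdict(list)

@contextmanager
def timed(stage):
    """Record wall-clock seconds spent in a named serving stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        STAGE_TIMES[stage].append(time.perf_counter() - t0)

# Usage: wrap each stage of the request path.
with timed("tokenize"):
    pass  # tokenizer call goes here
with timed("prefill"):
    pass  # model forward pass goes here
```

Even this crude breakdown tells you whether a slow request spent its time in compute, in queueing, or in the network, which is exactly the question the migrating constraint keeps asking.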

Design for constraint migration. Don't just solve today's bottleneck. Architect your system so that when the constraint moves (and it will), you can address the next one without a rebuild: modular serving stacks that scale compute, memory, and I/O independently.

Plan for the tail, not the mean. Benchmark results report averages. Production lives in percentiles. The 99th percentile request is the one that breaks your SLO, pages your oncall, and creates the customer escalation. Design for the request you don't expect.
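The mean-versus-tail gap is easy to demonstrate with a nearest-rank percentile over synthetic latencies; the numbers below are invented for illustration:

```python
import math

def pctile(samples, q):
    """Nearest-rank percentile: smallest value >= q percent of samples."""
    s = sorted(samples)
    idx = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[idx]

# 98 fast requests and 2 stragglers (hypothetical, in milliseconds):
latencies_ms = [20] * 98 + [900, 1200]
mean = sum(latencies_ms) / len(latencies_ms)  # 40.6 ms: looks healthy
p99 = pctile(latencies_ms, 99)                # 900 ms: the one that pages you
```

A dashboard showing the 40.6 ms mean says everything is fine; the p99 is what your angriest customer is experiencing.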

The gap between a benchmark and a production workload is the gap between a sparring match and a street fight. Day Two is when you stop sparring.
