Why our inference router is mostly a queue
A year of trying to be clever, undone by the surprising sufficiency of priorities, fair shares, and patience.
The router started, as routers often do, with an ambition. We were going to model load as a continuous field. We were going to predict, with a little linear regression behind the scenes, how long each request would take on each replica. We were going to find a near-optimal assignment in under a millisecond, and we were going to feel very smart about it.
Eighteen months later the router is, in practice, a queue with three priorities and a fair-share token bucket per tenant. The clever parts are still there, technically, but they have been quietly demoted to a diagnostic mode that nobody runs.¹ What replaced them is something I want to write down before I forget how it happened.
The first surprise: variance is the work
Inference latency is not a number; it is a distribution with a long tail and a stubborn refusal to be normal. The thing you want to route on — expected time-on-replica — is the least stable property of the system. We spent a quarter building a predictor that was, in aggregate, twelve percent better than a flat average. We deployed it. The p99 latency moved by zero.
The reason, in retrospect, is obvious: at the tail, the predictor and the average are wrong in the same way. The tail belongs to events the predictor never saw — a sudden cache miss, a slow tokenizer path, a noisy neighbor on the same host. Twelve percent better on the mean is not a tail story.
// What we replaced. The clever bit.
fn assign(req: &Request, fleet: &Fleet) -> ReplicaId {
    // Pick the replica with the lowest predicted time for this request.
    fleet
        .replicas
        .iter()
        .min_by_key(|r| predict_time(req, r))
        .unwrap() // panics if the fleet has no replicas
        .id
}

What we use now is shorter, and less clever, and stable in a way the old version never was. A replica pulls from the queue when it has capacity. Priority bands keep small interactive requests from being starved by large batch ones. A token bucket per tenant keeps any single tenant from doing what tenants will, given the chance, do.
// What we have now. The boring bit.
loop {
    // The replica pulls the next request when it has capacity.
    let req = queue.pop(priority).await;
    // Tenant is over its share: put the request back and try the next one.
    if !tenant_bucket.try_take(&req) {
        queue.push_back(req, priority);
        continue;
    }
    replica.run(req).await;
}

Performance, as measured by the only thing that matters (the p99 of end-to-end latency at the contractual load), improved. Operationally, the new version is a system any on-call engineer can reason about at three in the morning, which the old version was not.
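The per-tenant token bucket is the one piece of the loop that hides a little arithmetic, so here is a minimal sketch of how such a bucket might look. Everything in it is an assumption made for illustration, not the router's actual code: the names `TokenBucket` and `TenantBuckets`, the idea of a per-request `cost`, the shared capacity and refill rate for every tenant, and passing the clock in explicitly (which mostly makes the behavior easy to test).

```rust
use std::collections::HashMap;
use std::time::Instant;

/// One tenant's bucket: holds at most `capacity` tokens and
/// refills continuously at `refill_per_sec` tokens per second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// A new bucket starts full.
    fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: now }
    }

    /// Refill for the time elapsed since the last call, then try to take
    /// `cost` tokens. Returns false, taking nothing, if the bucket is short.
    fn try_take(&mut self, cost: f64, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}

/// One bucket per tenant, created lazily on first use.
/// Hypothetical simplification: every tenant gets the same limits.
struct TenantBuckets {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, TokenBucket>,
}

impl TenantBuckets {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Charge `cost` tokens to `tenant`; false means the tenant is over its share.
    fn try_take(&mut self, tenant: &str, cost: f64, now: Instant) -> bool {
        let (capacity, rate) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(tenant.to_string())
            .or_insert_with(|| TokenBucket::new(capacity, rate, now))
            .try_take(cost, now)
    }
}
```

With a capacity of 2 and a refill rate of 1 token per second, a tenant can send two unit-cost requests back to back, is refused on the third, and gets one more through a second later. Other tenants are unaffected, which is the whole point.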
The shape of the queue
A small picture is worth more here than a paragraph. The queue is the only shared piece of state in the path. Everything else — predictors, health probes, capacity planners — is decoration around this central object.
Fig. 1 — Three priority bands, three replicas. The replicas pull when they have capacity.
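For readers without the picture, the shape in Fig. 1 can be approximated in a few lines. This is a sketch under assumptions rather than the production data structure: the band names (`Interactive`, `Standard`, `Batch`), the `BandedQueue` type, and strict highest-band-first ordering are all illustrative choices, and the real queue is asynchronous where this one is not.

```rust
use std::collections::VecDeque;

/// Three priority bands; the names are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Priority {
    Interactive = 0,
    Standard = 1,
    Batch = 2,
}

/// One FIFO per band. Replicas pull; nothing is pushed at them.
struct BandedQueue<T> {
    bands: [VecDeque<T>; 3],
}

impl<T> BandedQueue<T> {
    fn new() -> Self {
        Self { bands: [VecDeque::new(), VecDeque::new(), VecDeque::new()] }
    }

    /// Enqueue at the back of the given band.
    fn push(&mut self, item: T, priority: Priority) {
        self.bands[priority as usize].push_back(item);
    }

    /// Called by a replica when it has capacity: returns the oldest item
    /// from the highest-priority band that is not empty.
    fn pull(&mut self) -> Option<T> {
        self.bands.iter_mut().find_map(|band| band.pop_front())
    }

    /// Total number of queued items across all bands.
    fn len(&self) -> usize {
        self.bands.iter().map(VecDeque::len).sum()
    }
}
```

Strict ordering like this protects interactive requests from batch ones, but it can starve the lowest band under sustained load, so a real deployment would likely want some aging or weighting on top.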
What the queue can't do
The queue is not a free lunch. There are three things it is bad at, and pretending otherwise is the failure mode I have seen most often in similar systems:
It is bad at multi-step coordination. If a request is actually a small graph of sub-requests with dependencies, the queue will happily schedule them all to the same replica at the same time, and you will discover, slowly and at the worst possible moment, that you've reinvented a thundering herd.
It is bad at heterogeneous fleets. If half of your replicas have twice the memory, the queue does not know this, and will not learn. We solved this with the unglamorous mechanism of queue-per-tier, which is uglier on a diagram but easier to reason about than any of the smarter ideas we tried.
It is bad at telling you what is happening. A queue, in production, is a shape you only see at the tail. You will need to build the observability separately, and you will need to commit to looking at it.²
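On that last point, the signal worth recording is how long a request waited before a replica pulled it, kept separately for each priority band. A minimal sketch of that bookkeeping follows. The bucket boundaries, the type names `TimeToPull` and `QueueMetrics`, and the coarse quantile readout are assumptions for illustration; a real deployment would more likely hand this to whatever metrics library is already in use.

```rust
use std::time::{Duration, Instant};

/// Upper bounds of the histogram buckets, in milliseconds (hypothetical).
/// Waits longer than the last bound land in a final overflow bucket.
const BOUNDS_MS: [u64; 5] = [1, 10, 100, 1_000, 10_000];

/// Time-to-pull histogram for one priority band.
#[derive(Default)]
struct TimeToPull {
    counts: [u64; 6], // BOUNDS_MS.len() buckets plus one overflow bucket
}

impl TimeToPull {
    /// Count one request that waited `waited` before being pulled.
    fn record(&mut self, waited: Duration) {
        let ms = waited.as_millis();
        let idx = BOUNDS_MS
            .iter()
            .position(|&bound| ms <= bound as u128)
            .unwrap_or(BOUNDS_MS.len());
        self.counts[idx] += 1;
    }

    /// Upper bound (ms) of the bucket containing quantile `q` (0.0..=1.0).
    /// Returns None if nothing was recorded or the quantile is in overflow.
    fn quantile_upper_bound_ms(&self, q: f64) -> Option<u64> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        let rank = ((q * total as f64).ceil() as u64).max(1);
        let mut seen = 0;
        for (i, &count) in self.counts.iter().enumerate() {
            seen += count;
            if seen >= rank {
                return BOUNDS_MS.get(i).copied();
            }
        }
        None
    }
}

/// One histogram per priority band, indexed 0..3.
#[derive(Default)]
struct QueueMetrics {
    per_band: [TimeToPull; 3],
}

impl QueueMetrics {
    /// Record the wait between enqueue and pull for a request in `band`.
    fn record_pull(&mut self, band: usize, enqueued_at: Instant, pulled_at: Instant) {
        let waited = pulled_at.saturating_duration_since(enqueued_at);
        self.per_band[band].record(waited);
    }
}
```

The readout is deliberately coarse: it reports which bucket a quantile falls into, not an interpolated value. That is usually enough to notice that a band's tail has moved from one bucket to the next.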
A note on cleverness
I'm not against clever systems. I have written a great many of them, and they have, on balance, been the most enjoyable parts of this career. But I now think the test for cleverness in infrastructure is not whether it is interesting on a whiteboard. It is whether the next engineer to be paged at four in the morning can hold the whole thing in their head.
The queue passes this test. The predictor never did.
¹ The diagnostic mode is, in retrospect, the most useful thing the clever code did. It shipped as an observability tool, which is what it always should have been.
² We use histograms over time-to-pull for each priority band as the primary signal. The exact dashboard is, I am told, the most boring graph anyone has ever loved.