Notes & Working Papers
2026.06.28
v1.0

What the Router Learns

AuthorROME THORSTENSON
FiledJUNE 2026
Reading9 min
StatusDRAFT
Proxy-to-replica round-trip time over a 30-minute tuning window, three colored bands for Seoul (~456ms), Frankfurt (~279ms), and Ashburn (~37ms), with faint per-probe traces beneath each smoothed line.
Fig 0.0 Cross-region round-trip time, three replicas, one 30-minute window

A request sent over the internet from a server in Ashburn takes, on average, ~279 ms to reach a replica in Frankfurt and get an answer back; to Seoul, ~456 ms. That's a 12× spread over the ~37 ms round-trip to a local replica, widening to ~23× at Seoul's peak. When you're serving LLM inference from replicas in multiple regions, that network round-trip is embedded in every request's latency—and almost no production routing policy accounts for it.

We built GORGO to fix that. It’s a CPU proxy that routes LLM inference requests across a multi-region fleet, with an online tuning method that learns its routing cost weights from the deployment’s own latency stream—no modifications to the inference engine required. The full method and results are in the paper.

100 200 300 400 round-trip, ms baseline Ashburn us-east · home 37 ms Frankfurt eu-central 279 ms Seoul ap-northeast 456 ms
Fig 1The latency a cross-region fleet hides. Mean proxy→replica round-trip from US-East (Ashburn), with ±1 SD whiskers. Frankfurt is 7.5× the local baseline; Seoul, 12×—widening to ~23× at Seoul's peak. Every request pays this overhead before the model does any work, and no standard routing policy prices it.
§ 01 Ttft

TTFT and why cross-region breaks the usual assumptions

The user-perceived latency for an LLM is dominated by time-to-first-token (TTFT)—the gap between submitting a prompt and seeing the first generated word. Once the model starts producing tokens, decode latency (time per subsequent token) matters less than people expect; the initial stall is what makes inference feel slow.

On a single replica, TTFT is dominated by three costs: prefill time (the GPU work of computing the key-value (KV) cache for the prompt tokens), the round-trip time (RTT) from the proxy to the replica, and queueing delay behind requests already in flight. Modern inference engines like SGLang implement prefix caching (RadixAttention): if a previous request already computed the KV cache for your system prompt, the engine skips recomputing those tokens entirely. This makes cache hit rate a first-class routing concern—route to a replica that already has your prefix and you pay no prefill cost for those tokens.

Standard routing policies try to exploit this in different ways. Consistent hashing and session affinity stick requests to replicas that have seen similar prompts before. Longest-prefix-match routing (as implemented in AIBrix) actively computes how much of each request’s prefix is already cached on each replica and routes to the best match. Load-based policies—least-request, least-load—ignore caches and just spread work evenly.

None of them account for wide-area network latency. In a single-region deployment, RTT is a millisecond and safely ignorable. In a cross-region fleet, RTT is a 37–456 ms overhead on every single request. Routing to Frankfurt because it has a cache hit, when Ashburn is 240 ms closer and already has the same prefix cached, can be wrong—and none of the standard policies have any term in their cost model to notice.

§ 02 Gorgo

The GORGO cost function

GORGO is a single routing policy that puts KV-cache locality, replica load, and wide-area network latency into one cost function, with weights derived from the deployment’s own per-request TTFT stream.

The router maintains three data structures: a radix trie of which prefixes are cached on which replica (inferred from the router’s own routing history), a live proxy-side counter of queued tokens per replica (not queried from the engine—counted locally), and an EWMA (exponential weighted moving average) of ping RTT to each replica. On each incoming request, it scores every candidate replica:

score(replica) = rtt_weight · rtt_ms + uncached_tokens + queue_weight · queued_tokens
click each term
network How far away the replica is. A learned weight times the EWMA round-trip time. This is the term every standard policy omits—and the one that matters most across regions.
Fig 2One score, three pressures. GORGO ranks every replica by a single cost blending network distance, prefill work, and queue depth, then routes to the minimum. Click each term. The weights—not the structure—are what online tuning learns.

where uncached_tokens = input token count minus the prefix already cached on that replica. Route to whichever replica minimizes the score.

REQUEST prompt + prefix GORGO PROXY cpu router · us-east RADIX TRIE which prefix is cached where QUEUE COUNTER queued tokens, live & local RTT EWMA network probe per replica scores every replica → routed · min score Ashburn 2×L40S score Frankfurt 2×L40S Seoul 2×L40S
Fig 3The routing decision. Three proxy-side signals—an inferred prefix-cache map, a live queue counter, and a per-replica RTT estimate—feed one score per replica; the request goes to the cheapest. Here the local replica already holds the prefix and is nearest, so it wins despite the others being free to serve. Score bars are illustrative.

The proxy-side cache trie is approximate. It tracks what the router has sent where, not what the engine actually has in memory—evictions and prefill failures aren’t directly observed. That’s acceptable: the alternative, querying each replica’s cache state per-request, would add its own latency overhead on the path it's trying to shorten.

§ 03 Learning

Learning the weights

Two scalar weights—rtt_weight and queue_weight—control how much network latency and queueing pressure each factor into the routing decision. We built two variants for setting them.

gorgo-static fixes the weights for the entire run. Not the interesting variant—it tests whether the cost model’s structure carries value at all, independent of any tuning.

gorgo-hillclimb runs a (1+1) evolution strategy with Rechenberg’s 1/5 success rule, searching the dimensionless weights in log-space to directly minimize p95 TTFT over a rolling 64-request window. At each step: propose a Gaussian-perturbed candidate; accept it if it beats the incumbent on the current window; widen the step size if more than 20% of recent proposals succeed, narrow it if fewer. No gradient, no look-ahead, no model of the objective function—just a self-correcting local search that adapts to the noise level of the signal.

The search space is two-dimensional. The objective is noisy. Evolution strategies are well-matched to that regime; they’re robust to non-convexity and require no differentiability.

Two-panel plot. Left: rtt_weight and queue_weight trajectories settling to about 0.28 and 0.5. Right: objective score (negative p95 TTFT) climbing from about -5.9 to -1.2 and plateauing.
Fig 4The tuner converging. A live (1+1)-ES run over one 30-minute window. Left: the two weights settle—rtt_weight to ~0.28, queue_weight to 0.5. Right: the objective (negative p95 TTFT) climbs and plateaus as the accepted-step rate falls toward the 1/5 target. These become the frozen weights carried into evaluation. (Figure from the paper.)
§ 04 Why

Why public datasets don’t test this regime

The existing benchmarks for LLM serving are short-prompt, and what reuse they have is the wrong kind. LMSYS-Chat-1M averages ~467 input tokens with ~3% prefix reuse. WildChat-4.8M averages ~2,925 tokens with ~33% reuse—but almost all of that is cross-user, shared system-prompt templates rather than reuse within a single conversation. If prompts are short and the only common prefix is a boilerplate header, there’s little routing locality to exploit.

Real long-context production traffic is different. We built ART-Chat-2.5M from long-context production metadata: ~2.5M requests from ~5K users, averaging ~18,000 input tokens—~38× longer than LMSYS, ~6× longer than WildChat—with ~90% prefix reuse, almost all of it within a single user’s own multi-turn session. The metadata source is production traffic from a GLM-family model endpoint; the served model in our experiments is Qwen3.5-35B-A3B-FP8, a mixture-of-experts model with 35B total and ~3B active parameters. ART-Chat-2.5M is the regime where prefix caching and cross-region routing actually interact—the one public datasets don’t cover. (Dataset and router are open source: code, data.)

Left: stacked bars of prefix-reuse composition—ART-Chat-2.5M is 89% intra-user reuse, LMSYS-Chat-1M is 97% unique, WildChat-4.8M is 68% unique with 28% cross-user reuse. Right: a six-axis radar chart where ART-Chat-2.5M dominates every axis except cross-user reuse, where WildChat leads.
Fig 5Not all reuse is the same reuse. Left: prefix-reuse composition. ART-Chat-2.5M's reuse is overwhelmingly intra-user (89%)—long multi-turn sessions that return to the same conversation. WildChat's is cross-user (28%): shared boilerplate templates, not conversation history. LMSYS is 97% unique. Right: across six routing-relevant axes, ART-Chat-2.5M is the only workload that stresses the ones cache-aware, cross-region routing is built for. (Figure from the paper.)
§ 05 What

What freezing after tuning bought

Experimental setup: three regions (Seoul, Frankfurt, Ashburn), each running a single 2×L40S SGLang server on Modal, isolated per policy so caches can’t cross-contaminate across conditions. Concurrency up to 64. Baselines: random, least-request, least-load, prefix-cache via AIBrix, and simple-session-affinity.

The evaluation protocol matters: tune the weights on a window of requests, freeze them, then evaluate on a held-out window.

Over held-out windows, GORGO improves p95 TTFT by 8–15% and end-to-end (E2E) latency by 15–31% over the strongest baselines (session affinity and prefix-cache). On one held-out window: p95 TTFT improved 15.5% (1,584 ms vs. 1,875 ms) and E2E p95 improved 30.8% (3.29 s vs. 4.75 s). On a saturated in-sample window: E2E p95 fell 34.9% (8.91 s vs. 13.69 s) while trailing session-affinity by only 3.5% on TTFT p95.

GORGO best baseline · session-affinity 5 s 4 3 2 1 0 1,584 ms 1,875 ms −15.5% p95 TTFT 3.29 s 4.75 s −30.8% E2E p95
Fig 6Where the cost model earns its keep. One held-out window, frozen weights, GORGO vs. the strongest baseline. The time-to-first-token gain is modest (−15.5%); the end-to-end gain—what the user actually feels—is more than twice as large (−30.8%).
PolicyTTFT p50TTFT p95TTFT p99E2E p95
Tuning window · Apr 5, full load · in-sample
GORGO6732,5144,4738.91
simple-session-affinity7122,4285,19018.27
least-load7632,4474,34913.69
least-request9683,9707,01216.62
prefix-cache1,4776,7849,61322.63
Held-out · Apr 6, half load
GORGO4911,5842,2223.29
simple-session-affinity6271,8753,0564.75
least-load6161,8182,8854.06
least-request6341,8522,4843.99
prefix-cache5741,7982,7504.95
Held-out · Apr 7, third load
GORGO3861,3771,9733.34
simple-session-affinity4391,4952,3133.90
least-load4801,6372,2944.04
least-request5091,7242,4854.40
prefix-cache5161,8302,80111.98
Table 1Decoded-trace results across three 30-minute windows. GORGO uses the 2D cost model (rtt_weight 0.276, queue_weight 0.5), learned by (1+1)-ES on the Apr 5 window and frozen for both held-out windows. Bold is best in column within each window; lower is better throughout. TTFT in ms, E2E p95 in s. On both held-out days GORGO sweeps every metric.
§ 06 Reward

Reward hacking

We were surprised to discover what happens when gorgo-hillclimb optimizes a TTFT-only objective with unconstrained weight ranges.

It doesn’t learn good routing. It learns that it can drive measured p95 TTFT down by routing nearly 100% of traffic to the single nearest replica and zeroing out the load weight. The metric looks excellent. Everything else collapses.

The mechanism: TTFT doesn’t price queueing or decode contention until a replica saturates. An overloaded replica still returns its first token quickly—the request lands in the KV cache and generation begins—so the proxy’s reward signal looks fine while end-to-end latency quietly blows up. The optimizer found that concentrating load minimizes TTFT without any penalty visible to the objective it was given.

The ablation is clean. Load weight zeroed out: TTFT p95 was 1.101 s, E2E p95 was 12.58 s, load concentration ~100% on one replica. Load weight restored: TTFT p95 was 1.136 s, E2E p95 was 2.46 s, concentration ~60%. That’s a ~3% regression on TTFT for a ~5× improvement on E2E. It’s not a close call.

0.00
TTFT p95 E2E p95 load concentration
Fig 7 · interactiveDrag the load weight. Watch what the optimizer sees—and what it doesn't. Time-to-first-token (lower line) barely moves; end-to-end latency (upper line) falls off a cliff the moment the router is allowed to value load balance at all. Tuning on TTFT alone, the optimizer drives this weight to zero and parks at the left edge, where TTFT looks great and E2E is a catastrophe. The two open circles are the measured operating points; the curve between them is an illustrative model of the saturation dynamic.

The fix is structural: floor the load weight so the optimizer can’t zero it out, cap the RTT weight so it can’t route purely by geography. And freeze the tuned weights for production rather than running the optimizer live. Live tuning produces heavy-tailed latency excursions mid-production as the hillclimber explores the weight space—freezing isn’t just a convenience, it’s a safety property.

The general pattern is reward hacking: let an optimizer loose on a proxy metric and it optimizes the proxy. TTFT was the proxy. What users feel is end-to-end latency. The gap between those two wasn't obvious until the hillclimber found it—and it found it immediately.

§ 07 Where

Where it doesn’t work

On WildChat, GORGO is not competitive. The gorgo variants lose to session affinity on p95 TTFT in those experiments—gorgo-static is the worst, at ~1.93 s vs. ~1.03 s for plain session affinity. Short prompts, low reuse: there's little cache locality to route toward, and the extra machinery adds noise rather than signal. The method's value is conditional on long contexts and high prefix reuse—deploy it on short-prompt, low-reuse traffic and you should expect it to lose to a policy with less to get wrong.

GORGO session-affinity 2.5 s 2.0 1.5 1.0 0.5 0 1.58 s 1.88 s GORGO wins ART-Chat-2.5M long context · high reuse 1.93 s 1.03 s GORGO loses · 1.9× WildChat short prompts · low reuse
Fig 8The same policy, opposite outcomes. p95 TTFT, GORGO vs. session-affinity, on two workloads. Where contexts are long and prefixes are reused, pricing cache locality pays; where they aren't, the machinery routes toward locality that isn't there and loses to the simpler policy. The method has a regime, and it's worth naming it.

Per-window tuning is also noisy in ways that matter. Re-evaluating identical weights across windows produced swings large enough that some apparent optima were noise artifacts that generalized poorly to held-out data. This is part of why frozen weights on held-out windows are the right evaluation criterion—in-sample performance overstates what you actually get.

The results come from a single-tenant fleet of identical GPUs, with no prefill/decode disaggregation and metrics observed only over HTTP. P99 differences sit at the edge of statistical significance given the sample sizes.

Prefill/decode disaggregation—serving prompt-processing and token-generation on separate hardware—may dissolve much of the TTFT-vs-E2E tension that drove the reward hack: with a dedicated prefill cluster, TTFT would be governed by prefill scheduling rather than by a generation replica's queue depth, which is exactly the coupling the optimizer exploited. That's a hypothesis for follow-on work, not a result.

§ 08 Real

The real claim

The router's job is to price what the user feels. The moment you let it optimize a cheaper stand-in, it will—and it will find corners you didn't anticipate. The useful thing about gorgo-hillclimb isn't that it finds good weights; it's that, unconstrained, it goes straight for the degenerate solution, and the degenerate solution is the cleanest proof that TTFT and end-to-end latency are different objectives that happen to agree until a replica saturates.

Pricing the network alongside cache locality and load is the easy half. The hard half is that the weights worth having can be learned from the deployment's own traffic—but only once you've stopped the optimizer from gaming the metric you gave it.