Cache-aware router for OpenAI-compatible LLM servers, in Go.
Routes each request to the worker that already holds its prefix in KV cache, so prefill runs once per shared prefix instead of once per worker. Validated locally with llama.cpp + Qwen2.5-1.5B, then run on 4× A100 80GB SXM with vLLM 0.6.4 + Qwen2.5-{7B,14B}. Cloud reproduction cost: ~$25.
The hero chart above is the cloud sweep. Numbers below are mean ± stddev across three seeds at each concurrency point. Trace shape: 4 to 24 sessions, 8 turns each, 6 KB shared system prompt, max_tokens=64.
> [!NOTE]
> Three seeds is a small sample and a few intermediate points have visibly noisy error bars in the chart. The headline is the slope, not any single point.
| Strategy | 7B p50 TTFT slope (4 to 24 sessions) | 14B p50 TTFT slope (4 to 24 sessions) |
|---|---|---|
| random | +49 ms (98 to 146 ms) | +98 ms (167 to 265 ms) |
| round-robin | +31 ms (84 to 115 ms) | +36 ms (157 to 194 ms) |
| least-loaded | +35 ms (67 to 102 ms) | +50 ms (122 to 172 ms) |
| prefix-aware | +16 ms (83 to 99 ms) | +25 ms (141 to 166 ms) |
Prefix-aware (PA) has the gentlest slope on both model sizes. Slope ratios relative to PA: random 3.0× (7B) / 3.9× (14B); least-loaded (LL) 2.2× / 2.0×; round-robin (RR) 1.9× / 1.4×.
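The next table appears to be the highest-concurrency point of the 14B sweep (24 sessions); its p50 values agree with the 14B endpoints in the slope table to within 1 ms.

| Strategy | KV cached | TTFT p50 | TTFT p95 | TTFT p99 | RPS |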
|---|---|---|---|---|---|
| roundrobin | 91.10% | 193 ms | 2.40 s | 2.72 s | 25.2 |
| random | 90.69% | 265 ms | 2.36 s | 2.74 s | 23.4 |
| leastloaded | 94.22% | 171 ms | 2.12 s | 2.29 s | 27.6 |
| prefixaware | 94.88% | 166 ms | 2.13 s | 2.25 s | 26.8 |
Prefix-aware beats the best baseline (leastloaded) on p50 by 3%,
round-robin by 14%, and random by 37%. p95 is essentially tied with
leastloaded (2.13 s vs 2.12 s), and p99 is the tightest of the four
(2.25 s vs 2.29 s for leastloaded and 2.72 to 2.74 s for round-robin and
random). Prefix-aware's upstream KV cache hit rate stays at 94-95% across
every concurrency point on both model sizes; the baselines drift between
80% and 95% with load.
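One way a "KV cached" percentage could be derived from upstream responses is sketched below. It assumes the hit rate is token-weighted (total `cached_tokens` divided by total `prompt_tokens`) and that the upstream returns the OpenAI-style `usage.prompt_tokens_details.cached_tokens` field; the aggregation rule and the `hitRate` helper are assumptions, not the project's actual bench code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the subset of an OpenAI-compatible "usage" object needed here.
type usage struct {
	PromptTokens        int `json:"prompt_tokens"`
	PromptTokensDetails *struct {
		CachedTokens int `json:"cached_tokens"`
	} `json:"prompt_tokens_details"`
}

type completion struct {
	Usage usage `json:"usage"`
}

// hitRate sums cached and total prompt tokens over the given response bodies
// and returns cached/total. Responses without prompt_tokens_details count as
// zero cached tokens (an assumption of this sketch).
func hitRate(bodies [][]byte) (float64, error) {
	var cached, total int
	for _, b := range bodies {
		var c completion
		if err := json.Unmarshal(b, &c); err != nil {
			return 0, err
		}
		total += c.Usage.PromptTokens
		if d := c.Usage.PromptTokensDetails; d != nil {
			cached += d.CachedTokens
		}
	}
	if total == 0 {
		return 0, nil
	}
	return float64(cached) / float64(total), nil
}

func main() {
	// Made-up response bodies: one cold request, two warm ones.
	bodies := [][]byte{
		[]byte(`{"usage":{"prompt_tokens":1000,"prompt_tokens_details":null}}`),
		[]byte(`{"usage":{"prompt_tokens":1000,"prompt_tokens_details":{"cached_tokens":900}}}`),
		[]byte(`{"usage":{"prompt_tokens":1000,"prompt_tokens_details":{"cached_tokens":600}}}`),
	}
	r, err := hitRate(bodies)
	if err != nil {
		panic(err)
	}
	fmt.Printf("KV cached: %.2f%%\n", 100*r) // prints "KV cached: 50.00%"
}
```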
Per-point p50 comparison between PA and the best baseline (always least-loaded on this trace):
| sessions | 7B PA / LL p50 | winner | 14B PA / LL p50 | winner |
|---|---|---|---|---|
| 4 | 83 / 67 ms | LL +16 ms | 141 / 122 ms | LL +19 ms |
| 8 | 84 / 75 ms | LL +9 ms | 140 / 132 ms | LL +8 ms |
| 12 | 94 / 87 ms | LL +7 ms | 141 / 140 ms | tied |
| 16 | 87 / 88 ms | PA +1 ms | 148 / 152 ms | PA +4 ms |
| 24 | 99 / 102 ms | PA +3 ms | 166 / 172 ms | PA +5 ms |
PA's per-point p50 wins start at sessions=16 (4× the worker count), not
at sessions=N_workers + 1. Below that, leastloaded accidentally
distributes one session per worker and the second turn naturally lands
on the worker that cached turn one, so PA's stricter pinning adds
router overhead without marginal benefit. PA's value across the whole
range is the flat slope (see the headline table); the per-point p50
gap only opens up at high concurrency.
> [!IMPORTANT]
> The benchmark is a single trace pattern (multi-turn conversations with a fixed-length shared system prompt). Real production traffic mixes single-turn, RAG, and branching tool loops. The crossover concurrency, the absolute TTFT numbers, and the cache hit rates will all shift with a different traffic shape; the relative ordering of strategies should hold.
Full per-concurrency tables for both cloud models, with hit rates, RPS,
and caveats: docs/results-cloud.md.
The same Go code was validated on an M1 Pro against three llama-server
workers serving Qwen2.5-1.5B (Q4_K_M), three seeds, 18 requests each:
| Strategy | KV cached | TTFT p50 | TTFT p95 | RPS |
|---|---|---|---|---|
| roundrobin | 57.56% | 2.01 s | 8.21 s | 1.67 |
| random | 62.34% | 1.70 s | 8.06 s | 1.84 |
| leastloaded | 61.94% | 0.82 s | 8.29 s | 1.85 |
| prefixaware | 75.00% | 2.10 s | 3.70 s | 2.36 |
Against leastloaded (the best baseline), prefix-aware shows +13.1 pp
cache hit rate, 55% lower p95 TTFT, and 28% higher throughput. p50 favors
leastloaded (0.82 s vs PA's 2.10 s) because PA pins subsequent requests
onto the same worker serially while leastloaded parallelises cold
prefills across three workers. The win is at the tail and on throughput,
same shape as the cloud result. Three-seed CIs, safety-valve ablation,
smaller-model reference: docs/results.md.
Each backend has its own compressed radix tree of recently-dispatched
prompt prefixes, hashed into 32-byte chunks. On each request the router
asks every tree for the longest leading-chunk match, randomizes ties,
sorts by (match desc, inflight asc), and applies a saturation safety
valve before dispatch. Every routing decision is exposed to the client
via X-Router-Backend and X-Router-Reason response headers.
| Strategy | Decision rule |
|---|---|
| roundrobin | rotation counter |
| random | uniform random |
| leastloaded | minimum in-flight count |
| prefixaware | longest prefix match across per-worker trees, ties broken by random shuffle, with a saturation valve that spills off any worker above inflight ≥ saturation_inflight |
SGLang's RadixAttention and vLLM's --enable-prefix-caching solve the
same problem one layer down: inside a single inference engine, sharing
cached prefixes across requests that the engine already received.
llm-cache-router solves it one layer up: deciding which of N
independent engine instances a request reaches in the first place, so
the request lands on the worker that already holds its prefix. The two
layers compose; run vLLM (or SGLang) behind this router and you get
both intra-engine prefix reuse and inter-worker cache locality, with
neither layer needing to know the other exists.
Package layout, request lifecycle diagram, and the full list of emitted
Prometheus metrics: docs/architecture.md.
Eight ADRs covering language choice, backend abstraction, prefix tree
design, LRU eviction, the safety valve, tokenization, the circuit
breaker, and tie-break randomization: docs/decisions/.
Two paths: a 15-minute local run on a Mac with no GPU, and a 3.5-hour cloud run that produces the chart above.
Local (Apple Silicon, llama.cpp, Qwen2.5-1.5B):
```bash
brew install llama.cpp
mkdir -p models
curl -L -o models/qwen2.5-1.5b.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf
make build
MODEL=models/qwen2.5-1.5b.gguf WORKER_PORTS="8001 8002 8003" \
  RUNS=3 SESSIONS=6 TURNS=3 SYS_LEN=2048 SEED=17 \
  bash scripts/bench/real-llm.sh
cat bench/results/real.md
```

Knobs and ablations: docs/benchmarks.md. Full results: docs/results.md.
Cloud (RunPod 4× A100 80GB SXM, vLLM, Qwen2.5):
```bash
git clone https://github.com/zxuhan/llm-router.git
cd llm-router
bash scripts/install-cloud.sh
tmux new -s bench
bash scripts/bench/full-bench.sh
# detach: Ctrl+b d, reattach: tmux attach -t bench
```

Pod sizing, scp recipe, preempt recovery, terminate-vs-stop billing trap: docs/cloud-bench.md.
```
.
├── cmd/                      CLI binaries: router, bench, gen-traces, replay
├── internal/                 Go packages: config, backend, prefixtree, router, proxy, ...
├── config/                   example yaml
├── scripts/
│   ├── install-cloud.sh      one-shot pod setup (Go + vLLM + pinned deps)
│   ├── bench/                orchestrators: real-llm.sh, cloud-vllm.sh, full-bench.sh, ...
│   └── plot/                 matplotlib renderers: hero-cloud.py, sweep-plot.py, ...
├── docs/
│   ├── architecture.md       package graph, request lifecycle, metrics list
│   ├── benchmarks.md         local reproduction recipes
│   ├── cloud-bench.md        cloud reproduction runbook
│   ├── results.md            local M1 numbers
│   ├── results-cloud.md      cloud A100 numbers, per-concurrency tables
│   ├── decisions/            8 ADRs
│   └── images/               chart and gif assets
├── .github/workflows/ci.yml  vet, lint, race tests, coverage gate, fuzz smoke
├── Makefile
├── go.mod
└── README.md
```
- Bootstrap CIs over the per-request distribution would replace the current ±1 sample-stddev error bars (computed from three seeds). With only three seeds, an outlier seed visibly distorts a few intermediate points.
- Tokenizer-backed chunker would replace the current 32-byte hash chunks (ADR 0006). Hash chunks are model-agnostic and cheap, but chunk boundaries can split tokens; a real tokenizer would tighten the longest-match calculation.
- Auto-tuned `saturation_inflight` based on observed per-worker p95 latency would remove the only routing knob that needs manual setting per workload. The shipped default is eight; the cloud bench overrides it to four, calibrated for that trace shape.
- SGLang as the upstream engine. The Architecture section explains the composition; the cloud bench has not yet measured the layered combination of `llm-cache-router` + SGLang behind it. Doing so would quantify how much extra latency comes off when intra-engine and inter-worker cache locality are stacked.
- Multi-trace bench. The current trace is one shape; real production mixes single-turn, RAG, and branching tool loops.
- Half-open one-probe circuit breaker. Today's breaker is closed and open with a fixed cooldown (ADR 0007); a half-open probe would shorten recovery from a flaky backend.
- SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2023). RadixAttention, the engine-level prefix caching this project demonstrates at the routing layer.
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023). The vLLM paper; the upstream whose prefix caching and `prompt_tokens_details.cached_tokens` we read from.
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (Qin et al., 2024). Production-scale precedent for cache-aware request scheduling across an inference fleet.