• Optimizing retrieval-augmented LLM inference isn’t just about bigger GPUs; it’s about smarter orchestration.

    🚨 The Problem

    A few months ago, our retrieval-augmented generation (RAG) system was hitting its limits.

    We were serving clinical queries from healthcare providers, but:

    • p95 latency: 850 ms
    • GPU utilization: ~35 %
    • Query throughput: barely 300 QPS under load

    The pipeline looked like this:

    User Query
      ↓
    1️⃣ Embed query → FAISS search
    2️⃣ Lexical retrieval → Elasticsearch (BM25)
    3️⃣ Merge + re-rank
    4️⃣ Construct context → LLM inference
    5️⃣ De-identify + postprocess
      ↓
    Response
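
    In code terms, the flow was roughly the following (a simplified sketch with hypothetical helper names, not our production service):

    # embed, faiss_search, bm25_search, merge, rerank, build_context,
    # llm_generate, and deidentify are illustrative placeholders.
    def answer(query: str) -> str:
        q_vec = embed(query)                          # 1. embed the query
        dense_hits = faiss_search(q_vec, k=20)        #    dense retrieval (FAISS)
        lexical_hits = bm25_search(query, size=20)    # 2. lexical retrieval (BM25)
        candidates = merge(dense_hits, lexical_hits)  # 3. merge + re-rank
        ranked = rerank(query, candidates)
        context = build_context(ranked[:5])           # 4. construct context
        draft = llm_generate(context, query)          #    LLM inference
        return deidentify(draft)                      # 5. de-identify + postprocess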
    

    Despite using a strong model, our serving stack wasn’t scaling.

    We needed to bring latency below 500 ms while keeping costs in check.


    🧠 Profiling the Bottleneck

    We profiled the full inference path using:

    • Nsight Systems and PyTorch Profiler for GPU traces
    • Prometheus + Grafana for latency and token-level metrics
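
    For the GPU side, a trimmed-down version of the PyTorch Profiler pass we used to spot idle gaps looks like this (`model` and `inputs` stand in for the actual serving objects):

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Capture CPU + CUDA activity for one generation step (simplified sketch).
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=128)

    # Per-op summary sorted by GPU time, plus a Chrome trace where the
    # inter-kernel idle gaps show up as empty space on the CUDA stream rows.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    prof.export_chrome_trace("inference_trace.json")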

    Key findings:

    • 40 % GPU idle time between kernels (batching inefficiency)
    • Retrieval (FAISS + Elastic) serialized — ~120 ms wasted
    • KV cache fragmentation consuming unnecessary VRAM
    • Static batching delaying low-traffic requests

    ⚙️ The Redesign

    We re-architected inference around three major improvements:

    (1) Tensor-parallel serving via vLLM, (2) Continuous batching, and (3) Semantic caching.


    🚀 1. Multi-GPU Tensor Parallelism

    We deployed our model using vLLM on AWS p5 instances (8× H100 GPUs):

    python -m vllm.entrypoints.api_server \
      --model meta-llama/Llama-2-70b-chat-hf \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.9 \
      --port 8000
    
    • Model weights sharded across 4 GPUs
    • Each GPU stores KV-cache only for its own attention heads
    • Communication handled by NCCL over NVLink
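
    Serving is plain HTTP from there; a minimal client call (assuming the default /generate route of this demo server rather than the OpenAI-compatible endpoint) looks roughly like:

    import requests

    # Minimal client call against the vLLM demo API server started above.
    # The /generate route and JSON fields are the demo server's defaults;
    # adjust if you expose the OpenAI-compatible endpoint instead.
    resp = requests.post(
        "http://localhost:8000/generate",
        json={
            "prompt": "Context: <retrieved passages>\n\nQuestion: <clinical query>",
            "max_tokens": 256,
            "temperature": 0.2,
        },
        timeout=30,
    )
    print(resp.json()["text"])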

    Result:

    • Linear scaling of throughput
    • GPU utilization ↑ to 85 %
    • Per-token latency ↓ 35 %

    ⚡ 2. Continuous Batching

    Static batching was replaced with continuous batching, where new requests join an ongoing batch mid-generation.

    This kept GPUs fully occupied even with variable traffic.

    Before: waiting to fill batches → GPU idle gaps

    After: dynamic merge/split at token granularity
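
    Conceptually, the scheduling loop behaves like the toy simulation below (illustration only, not vLLM's actual scheduler):

    import random
    from collections import deque

    # Toy model of continuous batching: requests join the active batch at token
    # boundaries instead of waiting for a whole batch to drain.
    MAX_BATCH = 4

    class Request:
        def __init__(self, rid, target_len):
            self.rid, self.target_len, self.tokens = rid, target_len, []
        def done(self):
            return len(self.tokens) >= self.target_len

    queue = deque(Request(i, random.randint(3, 8)) for i in range(10))
    active, step = [], 0

    while active or queue:
        # admit new requests at every decode step, up to the batch budget
        while queue and len(active) < MAX_BATCH:
            active.append(queue.popleft())

        # one decode step for every in-flight request (stand-in for a real forward pass)
        for req in active:
            req.tokens.append(f"tok{step}")

        # finished requests leave immediately, freeing their slot for queued work
        active = [r for r in active if not r.done()]
        step += 1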

    Result:

    • Throughput ↑ 2.8×
    • Latency variance (p95 – p99) ↓ 50 %

    🧠 3. Semantic Caching

    We introduced a Redis-backed semantic cache keyed by query embeddings.

    cache_key = f"embed:{hash(embedding)}"
    if redis.exists(cache_key):
        results = redis.get(cache_key)
    else:
        results = run_retrieval(embedding)
        redis.setex(cache_key, 3600, results)
    

    When a new query arrives, we check for high cosine similarity (> 0.9) with cached embeddings.

    If found, we reuse the retrieval + context immediately — skipping FAISS + Elastic calls.
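
    The exact-hash key above only catches byte-identical embeddings; the semantic part is the similarity check, shown here in simplified form (with cached embeddings held in a plain dict rather than a vector index):

    import numpy as np

    SIM_THRESHOLD = 0.9  # matches the cosine-similarity cutoff above

    def find_semantic_hit(query_emb, cached_embs):
        """Return the cache key of the closest cached embedding, or None.

        `cached_embs` maps cache keys to stored query embeddings; in production
        this scan would run against a small vector index instead of a Python loop.
        """
        q = query_emb / np.linalg.norm(query_emb)
        best_key, best_sim = None, -1.0
        for key, emb in cached_embs.items():
            sim = float(np.dot(q, emb / np.linalg.norm(emb)))
            if sim > best_sim:
                best_key, best_sim = key, sim
        return best_key if best_sim >= SIM_THRESHOLD else None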

    Result:

    • Cache hit rate 71 %
    • Vector DB load ↓ 60 %
    • End-to-end latency ↓ ~90 ms for repeated/semantically similar queries

    🔍 Results

    Metric                      | Before | After  | Δ
    GPU Utilization             | 35 %   | 85 %   | +50 pts
    p95 Latency                 | 850 ms | 490 ms | ↓ 42 %
    Cost per Query              |   –    |   –    | ↓ 37 %
    NDCG@10 (Retrieval Quality) | 0.73   | 0.89   | +22 %
    Cache Hit Rate              | 0 %    | 71 %   |   –

    🧩 Lessons Learned

    • Batching is a hidden bottleneck — static batching kills latency under variable load.
    • Tensor parallelism isn’t free — one slow GPU throttles the group; NVLink topology matters.
    • Semantic caching > lexical caching — paraphrased queries benefit enormously.
    • Monitoring is critical — token throughput and cache hit dashboards caught regressions early.

    🧠 Takeaway

    Optimizing LLM inference isn’t about throwing bigger GPUs at the problem — it’s about making each token, cache lookup, and GPU cycle count.

    By re-architecting our serving stack with vLLM, tensor parallelism, and semantic caching, we now sustain 1 K QPS at < 500 ms p95, with better retrieval grounding and 35 % lower cost.


    ✍️ Final Thought

    We plan to open-source part of our semantic caching implementation soon.

    If you’re scaling RAG inference or debugging GPU underutilization, I’d love to compare notes — reach out anytime.