Cutting ML Inference Costs 68% — Quantization, Batching, Right-Sizing

A client came to us with an inference bill growing 22% month over month and a board asking why AI gross margin looked like a hardware business. Twelve weeks later the bill was down 68% with no measurable accuracy loss. None of the techniques were exotic. What made the difference was sequencing, measurement, and the willingness to question defaults. Here's the ledger.

First: measure before touching anything

Week one produced no optimizations — only instrumentation. Per-model, per-endpoint cost attribution; GPU utilization profiles; latency distributions; request-size histograms. The findings that shaped everything after:

Average GPU utilization across the serving fleet: 14%.
One model — a BERT-class classifier — consumed 41% of GPU spend while its p95 latency budget (800ms) was loose enough for CPU.
31% of embedding requests were exact duplicates within a 24-hour window. Nobody had checked.

You cannot optimize a system you haven't profiled, and most teams discover their intuitions about where money goes are wrong by an order of magnitude.
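
To make the duplicate-request finding concrete, here is a minimal sketch of how such a check could be run over request logs. The log shape (a timestamp in seconds plus the raw payload) and the function name `duplicate_rate` are assumptions for illustration, not the client's actual tooling.

```python
import hashlib
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window from the audit


def duplicate_rate(requests: Iterable[Tuple[float, str]],
                   window_s: float = WINDOW_SECONDS) -> float:
    """Fraction of requests whose exact payload was already seen
    within the preceding window_s seconds.

    requests: (unix_timestamp_seconds, payload) pairs (hypothetical log shape).
    """
    last_seen: Dict[str, float] = {}
    total = 0
    dupes = 0
    for ts, payload in sorted(requests):
        # Hash the payload so the lookup table stays small for large inputs.
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        prev = last_seen.get(key)
        if prev is not None and ts - prev <= window_s:
            dupes += 1
        last_seen[key] = ts
        total += 1
    return dupes / total if total else 0.0
```

A rate computed this way is an upper bound on what an exact-match cache with a matching TTL could absorb.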

Move one: evict models that don't need GPUs (−29%)

The classifier was quantized to INT8 and moved to ONNX Runtime on CPU autoscaling groups. Quantization cost 0.3 points of F1, within the noise band of the eval set, and p95 landed at 410ms, comfortably inside the 800ms budget. GPU spend attributable to that model went to zero. As a rule of thumb, any model under ~1B parameters with a latency budget above ~300ms is a CPU candidate, and quantization usually pays its accuracy tax in pennies.
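
A minimal sketch of the rule of thumb and the quantization step follows. It assumes dynamic (weight-only) INT8 quantization through ONNX Runtime's `quantize_dynamic`; the source does not say whether dynamic or static quantization was used, and the helper names and file paths are hypothetical.

```python
def is_cpu_candidate(param_count: int,
                     latency_budget_ms: float,
                     max_params: int = 1_000_000_000,
                     min_budget_ms: float = 300.0) -> bool:
    """Rule of thumb from the text: under ~1B parameters and a latency
    budget above ~300ms. The thresholds are approximate, not hard limits."""
    return param_count < max_params and latency_budget_ms > min_budget_ms


def quantize_to_int8(fp32_model_path: str, int8_model_path: str) -> None:
    """Write an INT8-weight copy of an ONNX model (assumes dynamic quantization)."""
    # Imported lazily so the helper above works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_model_path,
        model_output=int8_model_path,
        weight_type=QuantType.QInt8,
    )
```

Usage would look like `quantize_to_int8("classifier_fp32.onnx", "classifier_int8.onnx")`, followed by re-running the eval set on the INT8 model before any traffic moves.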

Move two: dynamic batching on what remains (−18%)

The remaining GPU models served requests one at a time. Enabling dynamic batching (Triton, max queue delay 25ms) lifted throughput per GPU 3.4× on the embedding workload. p95 rose 19ms — invisible to users, transformative to the bill. Batching is the cheapest optimization in the entire menu and the most commonly skipped, because the default serving examples everyone copies don't include it.
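
The mechanism behind dynamic batching can be sketched in a few lines. This is a simplified simulation of the policy, not Triton's implementation: a batch closes when it is full or when its oldest request has waited longer than the queue delay. In Triton itself the equivalent knob is `max_queue_delay_microseconds` inside the model config's `dynamic_batching` block (25000 for a 25ms delay); the function name and the `max_batch` value below are assumptions.

```python
from typing import List


def form_batches(arrivals_ms: List[float],
                 max_batch: int,
                 max_delay_ms: float) -> List[List[float]]:
    """Group request arrival times (in ms) into batches.

    A new batch starts when the current one is full, or when the next
    request arrives more than max_delay_ms after the batch's first request.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        if current and (len(current) >= max_batch
                        or t - current[0] > max_delay_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Six requests become three GPU calls instead of six.
print(form_batches([0, 5, 10, 30, 32, 100], max_batch=4, max_delay_ms=25))
# -> [[0, 5, 10], [30, 32], [100]]
```

The delay is the tradeoff dial: a longer wait fills batches more fully but adds directly to tail latency, so it should be set against the latency budget rather than copied from an example.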

Move three: cache what you already computed (−12%)

A Redis layer keyed on normalized input hashes eliminated the duplicate embedding traffic, and a semantic cache (similarity > 0.97) caught near-duplicates on the generation endpoint. Combined hit rate after tuning: 38%. Every cache hit is inference at the price of a hash lookup.
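
Both cache layers can be sketched as follows. Everything here is illustrative: an in-memory dict stands in for Redis, the normalization rules (Unicode NFKC, lowercasing, whitespace collapsing) are assumptions that only suit case-insensitive models, and the similarity metric is assumed to be cosine because the source does not name one.

```python
import hashlib
import math
import unicodedata
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def normalize(text: str) -> str:
    # Assumed normalization; lowercasing is unsafe for case-sensitive models.
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def cache_key(text: str, model_id: str) -> str:
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    # Include the model id so a model upgrade never serves stale vectors.
    return f"emb:{model_id}:{digest}"


class EmbeddingCache:
    """Exact-match cache; self.store is a stand-in for a Redis client."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_id: str):
        self.embed_fn = embed_fn
        self.model_id = model_id
        self.store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        key = cache_key(text, self.model_id)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        vec = self.embed_fn(text)
        self.store[key] = vec
        return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_lookup(query_vec: Sequence[float],
                    entries: Sequence[Tuple[Sequence[float], str]],
                    threshold: float = 0.97) -> Optional[str]:
    """Return the cached response most similar to query_vec, if its
    similarity exceeds the threshold; otherwise None (a cache miss)."""
    best: Optional[str] = None
    best_sim = threshold
    for vec, response in entries:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best, best_sim = response, sim
    return best
```

The linear scan in `semantic_lookup` is only for clarity; at production scale the same check would sit behind a vector index.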

Move four: right-size the fleet (−9%)

With utilization now legible, the fleet consolidated from A100s to a mix of L40S and A10G matched to each model's actual memory and throughput profile, with spot instances absorbing batch workloads behind a queue. The premium-GPU-by-default habit is a tax on incuriosity.
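
The matching step can be sketched as a small selection function: for each model, pick the cheapest instance type that fits its memory footprint and meets its throughput requirement. The class, the 80% memory headroom, and all numbers in the example are placeholders, not the client's figures or real prices; per-model throughput has to come from benchmarking on each candidate GPU.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GpuOption:
    name: str
    memory_gb: float
    hourly_cost: float     # placeholder price, not a real quote
    throughput_rps: float  # measured for this model on this GPU


def cheapest_fit(options: List[GpuOption],
                 model_memory_gb: float,
                 required_rps: float,
                 headroom: float = 0.8) -> Optional[Tuple[GpuOption, int, float]]:
    """Return (option, instance_count, hourly_cost) for the cheapest
    configuration that fits the model and meets required_rps, or None."""
    best: Optional[Tuple[GpuOption, int, float]] = None
    for opt in options:
        # Keep some memory free for activations and batching (assumed 20%).
        if opt.memory_gb * headroom < model_memory_gb:
            continue
        instances = math.ceil(required_rps / opt.throughput_rps)
        cost = instances * opt.hourly_cost
        if best is None or cost < best[1] * best[0].hourly_cost:
            best = (opt, instances, cost)
    return best


# Hypothetical fleet and workload.
fleet = [
    GpuOption("large", memory_gb=80, hourly_cost=4.0, throughput_rps=400),
    GpuOption("mid", memory_gb=48, hourly_cost=2.0, throughput_rps=250),
    GpuOption("small", memory_gb=24, hourly_cost=1.0, throughput_rps=100),
]
choice = cheapest_fit(fleet, model_memory_gb=30, required_rps=500)
```

With these placeholder numbers the mid-tier option wins: the small card cannot hold the model, and two large cards cost twice as much as two mid-tier ones for the same load.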

Each percentage is measured against the original baseline, not against the bill left over after the previous move, which is why the four moves sum to 68%. The figure was measured on the monthly invoice, which is the only eval that matters to a CFO.
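
The distinction matters arithmetically. A short check, using a normalized baseline of 100 rather than the client's actual dollar figures (the dictionary keys are just labels):

```python
# Savings per move, each as a percentage of the ORIGINAL bill.
moves = {"cpu_eviction": 29, "dynamic_batching": 18, "caching": 12, "right_sizing": 9}

# Measured against the baseline, the percentages simply add.
additive_total = sum(moves.values())

# If each percentage instead applied to the bill remaining after the
# previous move, the same four numbers would compound to a smaller total.
remaining = 1.0
for pct in moves.values():
    remaining *= 1 - pct / 100
sequential_total = round((1 - remaining) * 100, 1)

print(additive_total)    # 68
print(sequential_total)  # 53.4
```

The same four numbers would describe a 53.4% reduction under sequential accounting, so stating the convention is part of reporting the result.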

What we deliberately didn't do

No distillation (the win didn't justify the project risk yet), no custom kernels, no migration to exotic accelerators. The boring 80% — placement, batching, caching, quantization — was worth 68%. Exhaust the boring options before funding the interesting ones.

The durable artifact

The lasting deliverable wasn't the savings — it was the cost observability that made waste visible within days instead of quarters, plus eval gates ensuring every future optimization proves accuracy-neutrality before rollout. Cost engineering isn't a project. It's a property of a system that can see itself.
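
An eval gate of this kind can be very small. The sketch below assumes a binary classifier scored by F1 and a fixed tolerance for acceptable regression; the function names and tolerance value are hypothetical, and a real gate would derive the tolerance from the eval set's measured noise band.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_gate(baseline_f1: float, candidate_f1: float, tolerance: float) -> bool:
    """Allow rollout only if the candidate loses no more than `tolerance`
    F1 relative to the baseline (both on a 0-1 scale)."""
    return baseline_f1 - candidate_f1 <= tolerance


# A 0.3-point drop (0.003 on a 0-1 scale) passes a 0.5-point tolerance.
ok = passes_gate(baseline_f1=0.912, candidate_f1=0.909, tolerance=0.005)
```

Wired into CI or a deployment pipeline, the gate turns "no measurable accuracy loss" from a claim into a precondition.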
