The Economics of Inference: Napkin Math for Scaling LLMs
Executive Summary
This report lays out a practical framework for estimating large-scale LLM inference costs with “back-of-the-envelope” calculations. Working from hardware specifications such as H100 memory bandwidth, it argues that memory bandwidth, rather than raw compute (TFLOPS), is the primary bottleneck for inference scalability and margins.
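- ▶ Bandwidth is the Bottleneck: During the decoding phase, latency is set by how fast model weights and the KV Cache can be read from GPU memory (HBM) into the compute units. Decoding at small to moderate batch sizes is memory-bound, not compute-bound; prefill, which processes the whole prompt in parallel, is typically compute-bound.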
- ▶ The KV Cache Tax: The memory footprint of the KV Cache grows linearly with both context length and the number of concurrent sequences. As context windows expand, this severely limits batch sizes and drives up the cost-per-token for long-form applications.
- ▶ Optimization as a Business Strategy: Techniques like Grouped Query Attention (GQA), which shrinks the KV Cache, and quantization (FP8/INT4), which shrinks the bytes moved per weight, are no longer optional; they are essential levers for improving unit economics by increasing throughput on fixed hardware.
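The three points above can be put into numbers with a small sketch. It assumes a Llama-3-8B-like shape (32 layers, 8 KV heads, head dimension 128, roughly 8e9 parameters), FP16 weights and KV Cache, and an H100 SXM with about 3.35 TB/s of HBM bandwidth; these figures and all function names are illustrative assumptions, not values taken from the report. The model treats each decode step as one full read of the weights plus the KV Cache, so it is a lower bound on latency, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Minimal shape description; field names are illustrative."""
    n_params: float   # total parameter count
    n_layers: int     # transformer layers
    n_kv_heads: int   # key/value heads (fewer than query heads under GQA)
    head_dim: int     # dimension per attention head


def kv_bytes_per_token(m: ModelShape, bytes_per_elem: int = 2) -> int:
    """KV Cache bytes stored per token: one K and one V vector per layer."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * bytes_per_elem


def decode_step_seconds(m: ModelShape, batch: int, context_len: int,
                        hbm_bytes_per_s: float,
                        weight_bytes_per_param: float = 2.0,
                        kv_bytes_per_elem: int = 2) -> float:
    """Bandwidth-only lower bound for one decode step.

    Assumes every step reads all weights once (shared by the batch) and
    the full KV Cache of every sequence in the batch.
    """
    weight_bytes = m.n_params * weight_bytes_per_param
    kv_bytes = batch * context_len * kv_bytes_per_token(m, kv_bytes_per_elem)
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s


# Assumed shapes: a Llama-3-8B-like model with GQA, and the same model
# with one KV head per query head (plain multi-head attention).
LLAMA3_8B_LIKE = ModelShape(n_params=8.0e9, n_layers=32, n_kv_heads=8, head_dim=128)
SAME_MODEL_MHA = ModelShape(n_params=8.0e9, n_layers=32, n_kv_heads=32, head_dim=128)
H100_HBM_BYTES_PER_S = 3.35e12  # assumed H100 SXM spec-sheet bandwidth

if __name__ == "__main__":
    for name, shape in [("GQA", LLAMA3_8B_LIKE), ("MHA", SAME_MODEL_MHA)]:
        per_tok_kib = kv_bytes_per_token(shape) / 1024
        step = decode_step_seconds(shape, batch=32, context_len=8192,
                                   hbm_bytes_per_s=H100_HBM_BYTES_PER_S)
        print(f"{name}: {per_tok_kib:.0f} KiB/token, "
              f"{step * 1e3:.1f} ms/step at batch 32, 8k context")
```

Under these assumptions the GQA variant stores 128 KiB of KV Cache per token against 512 KiB for the multi-head variant, and a single sequence at 4k context takes about 5 ms per decode step. Switching `weight_bytes_per_param` to 1.0 (FP8) or 0.5 (INT4) shows how quantization shortens the step by moving fewer bytes.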
Bagua Insight
At 「Bagua Intelligence」, we observe a disconnect between the hype surrounding model capabilities and the physical realities of deployment. The “napkin math” presented here highlights a critical truth: even on H100 clusters, Model FLOPs Utilization (MFU) during decoding stays embarrassingly low unless the memory wall is addressed. The industry is shifting from a “parameter arms race” to an “inference efficiency war.” The real winners won’t just have the smartest models; they will have the most efficient inference stacks (using PagedAttention, Speculative Decoding, etc.) that work around the memory bottleneck to deliver sustainable margins.
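The MFU claim can be checked with a roofline-style sketch. It assumes roughly 2 FLOPs per parameter per generated token, FP16 weights read once per decode step, and H100 SXM figures of about 3.35 TB/s HBM bandwidth and about 989 TFLOPS dense BF16; the function names are hypothetical, and KV Cache traffic is deliberately ignored, so real MFU would be lower still.

```python
def decode_mfu_upper_bound(batch: int,
                           hbm_bytes_per_s: float,
                           peak_flops_per_s: float,
                           bytes_per_param: float = 2.0) -> float:
    """Upper bound on MFU for a decode step, weights-only roofline.

    Per step the GPU reads N * bytes_per_param bytes and performs about
    2 * N * batch FLOPs, so the parameter count N cancels out and only
    the arithmetic intensity (FLOPs per byte) matters.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    flops_per_byte = 2.0 * batch / bytes_per_param
    achievable = min(peak_flops_per_s, flops_per_byte * hbm_bytes_per_s)
    return achievable / peak_flops_per_s


def ridge_batch(hbm_bytes_per_s: float,
                peak_flops_per_s: float,
                bytes_per_param: float = 2.0) -> float:
    """Batch size at which decode stops being bandwidth-bound in this model."""
    return peak_flops_per_s * bytes_per_param / (2.0 * hbm_bytes_per_s)


# Assumed H100 SXM spec-sheet figures (approximate).
H100_BW = 3.35e12      # bytes/s
H100_PEAK = 989e12     # dense BF16 FLOPs/s

if __name__ == "__main__":
    for b in (1, 8, 64, 512):
        mfu = decode_mfu_upper_bound(b, H100_BW, H100_PEAK)
        print(f"batch {b:>3}: MFU <= {mfu:.2%}")
    print(f"ridge batch ~ {ridge_batch(H100_BW, H100_PEAK):.0f}")
```

Under these assumptions a single sequence cannot exceed roughly 0.3% MFU, and the compute units only saturate near a batch of about 295 sequences. That gap is why batching, and the KV Cache management that makes large batches fit in memory, dominates inference efficiency.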
Actionable Advice
- Model Selection: Prioritize models that implement GQA (e.g., Llama 3, Mistral) for high-concurrency production environments to minimize KV Cache overhead.
- TCO Recalculation: Move beyond simple API pricing. Engineering leads should use bandwidth-based math to calculate the Total Cost of Ownership (TCO) for self-hosted clusters, factoring in expected concurrency and context length.
- Infrastructure Focus: Invest heavily in inference engines like vLLM or TensorRT-LLM. Optimizing KV Cache management is currently the highest-ROI engineering task for reducing the cost of long-context GenAI features.
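The bandwidth-based TCO calculation recommended above might look like the following sketch. Every input is an assumption to be replaced with real numbers: the hourly GPU cost is a placeholder, the `efficiency` factor stands in for the gap between spec-sheet and achieved bandwidth, and the model ignores prefill cost, activation memory, and the compute ceiling at very large batches.

```python
def max_concurrency(hbm_capacity_bytes: float, weight_bytes: float,
                    kv_bytes_per_token: int, context_len: int) -> int:
    """Sequences that fit once weights are resident (ignores activations)."""
    free = hbm_capacity_bytes - weight_bytes
    if free <= 0:
        return 0
    return int(free // (context_len * kv_bytes_per_token))


def cost_per_million_tokens(gpu_usd_per_hour: float, weight_bytes: float,
                            kv_bytes_per_token: int, batch: int,
                            context_len: int, hbm_bytes_per_s: float,
                            efficiency: float = 0.6) -> float:
    """USD per million generated tokens under a bandwidth-only decode model.

    efficiency is an assumed fraction of peak bandwidth actually achieved.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    bytes_per_step = weight_bytes + batch * context_len * kv_bytes_per_token
    step_seconds = bytes_per_step / (hbm_bytes_per_s * efficiency)
    tokens_per_second = batch / step_seconds
    return gpu_usd_per_hour / (tokens_per_second * 3600.0) * 1e6


# Illustrative inputs only: 80 GB H100, 16 GB of FP16 weights,
# 128 KiB of KV Cache per token, and a placeholder rental price.
HBM_CAPACITY = 80e9
WEIGHT_BYTES = 16e9
KV_PER_TOKEN = 131072
HBM_BW = 3.35e12
GPU_USD_PER_HOUR = 2.50  # hypothetical price, not a quote

if __name__ == "__main__":
    ctx = 8192
    cap = max_concurrency(HBM_CAPACITY, WEIGHT_BYTES, KV_PER_TOKEN, ctx)
    print(f"max concurrent sequences at {ctx} context: {cap}")
    for b in (1, 8, cap):
        usd = cost_per_million_tokens(GPU_USD_PER_HOUR, WEIGHT_BYTES,
                                      KV_PER_TOKEN, b, ctx, HBM_BW)
        print(f"batch {b:>3}: ${usd:.3f} per million tokens")
```

Running it with different `context_len` values shows the trade-off the advice points at: longer contexts cut the number of sequences that fit in memory and raise the bytes read per step, so cost per token rises on both counts.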