Memory Bandwidth

Executive Summary

This report provides a rigorous framework for estimating large-scale LLM inference costs using "back-of-the-envelope" calculations. By analyzing hardware specs like H100 bandwidth, it reveals that memory throughput, rather than raw compute (TFLOPS), is the primary bottleneck for inference scalability and margins.

▶ Bandwidth is the Bottleneck: During the decoding phase, the speed at which model weights and KV Cache are moved into the GPU determines latency. Most inference workloads are strictly memory-bound, not compute-bound.
▶ The KV Cache Tax: As context windows expand, the memory footprint of the KV Cache grows linearly, severely limiting batch sizes and driving up the cost-per-token for long-form applications.
▶ Optimization as a Business Strategy: Techniques like Grouped Query Attention (GQA) and quantization (FP8/INT4) are no longer optional optimizations; they are essential levers for improving unit economics by increasing throughput on fixed hardware.

Bagua Insight

At Bagua Intelligence, we observe a disconnect between the hype surrounding model capabilities and the physical realities of deployment. The "napkin math" presented here highlights a critical truth: even with H100 clusters, Model FLOPs Utilization (MFU) remains embarrassingly low if the memory wall isn't addressed. The industry is shifting from a "parameter arms race" to an "inference efficiency war." The real winners won't just have the smartest models; they will have the most efficient inference stacks (utilizing PagedAttention, Speculative Decoding, etc.) that can bypass the memory bottleneck to deliver sustainable margins.

Actionable Advice

▶ Model Selection: Prioritize models that implement GQA (e.g., Llama 3, Mistral) for high-concurrency production environments to minimize KV Cache overhead.
▶ TCO Recalculation: Move beyond simple API pricing. Engineering leads should use bandwidth-based math to calculate the Total Cost of Ownership (TCO) for self-hosted clusters, factoring in expected concurrency and context length.
▶ Infrastructure Focus: Invest heavily in inference engines like vLLM or TensorRT-LLM. Optimizing KV Cache management is currently the highest-ROI engineering task for reducing the cost of long-context GenAI features.
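
The bandwidth-based math recommended here can be sketched in a few lines of Python. This is a rough illustration, not the report's own calculation: the `ModelSpec` class and the helper functions are hypothetical names, the model shape (8e9 parameters, 32 layers, head dimension 128, 8 KV heads under GQA versus 32 without) is an assumed Llama-3-8B-like configuration, and the hardware figures (80 GB of HBM, 3.35 TB/s) are assumed H100-class values. It ignores activations, framework overhead, and compute time, so it gives a lower bound on step latency and an upper bound on batch size.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical model description; every value is an illustrative assumption."""
    n_params: float         # total parameter count
    n_layers: int           # number of transformer layers
    n_kv_heads: int         # KV heads (smaller than query heads under GQA)
    head_dim: int           # dimension of each attention head
    weight_bytes: float     # bytes per weight: 2 for FP16, 1 for FP8, 0.5 for INT4
    kv_bytes: float = 2.0   # bytes per KV cache element (FP16 assumed)


def kv_cache_bytes_per_token(m: ModelSpec) -> float:
    # Keys and values (the factor 2) are stored for every layer and KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_bytes


def decode_step_seconds(m: ModelSpec, batch: int, context_len: int,
                        bandwidth_bytes_per_s: float) -> float:
    # Memory-bound estimate: one decode step streams all weights once,
    # plus the KV cache of every sequence in the batch.
    weight_traffic = m.n_params * m.weight_bytes
    kv_traffic = batch * context_len * kv_cache_bytes_per_token(m)
    return (weight_traffic + kv_traffic) / bandwidth_bytes_per_s


def max_batch(m: ModelSpec, context_len: int, hbm_bytes: float) -> int:
    # Sequences that fit in the memory left over after loading the weights.
    free = hbm_bytes - m.n_params * m.weight_bytes
    per_sequence = context_len * kv_cache_bytes_per_token(m)
    return max(0, int(free // per_sequence))


# Assumed hardware figures for an H100-class GPU.
HBM_BYTES = 80e9
BANDWIDTH = 3.35e12

# Same assumed model shape with and without GQA, FP16 weights.
gqa = ModelSpec(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128, weight_bytes=2)
mha = ModelSpec(n_params=8e9, n_layers=32, n_kv_heads=32, head_dim=128, weight_bytes=2)

if __name__ == "__main__":
    for name, spec in (("GQA", gqa), ("MHA", mha)):
        step = decode_step_seconds(spec, batch=16, context_len=8192,
                                   bandwidth_bytes_per_s=BANDWIDTH)
        print(name,
              "KV bytes/token:", kv_cache_bytes_per_token(spec),
              "max batch @8k:", max_batch(spec, 8192, HBM_BYTES),
              "tokens/s @batch 16:", round(16 / step))
```

Under these assumed numbers, the GQA variant stores 131072 bytes of KV cache per token, a quarter of the non-GQA variant, and fits 59 sequences of 8192 tokens where the non-GQA variant fits 14. Changing `weight_bytes` to 1 or 0.5 shows the corresponding effect of FP8 or INT4 weight quantization on both step time and free memory.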

The Economics of Inference: Napkin Math for Scaling LLMs


BAGUA AI