Compute Scheduling

Event Core

A developer reports reaching 110 tokens per second on a Qwen 3.6 35B model with an RTX 4070 Super (12GB VRAM) after switching from standard llama.cpp to the ik_llama.cpp fork. The result highlights how much optimized CPU offloading matters in resource-constrained environments.

Bagua Insight

▶ Asymmetric Performance Gains: While standard MTP speculative decoding (multi-token prediction) often struggles with overhead on mid-range hardware, the ik_llama.cpp fork schedules CPU-offloaded work more efficiently, which lets it work around the limits of a small VRAM pool.

▶ Democratizing Large Models: This benchmark suggests that software-level operator optimization can narrow the performance gap for consumer-grade GPUs, allowing 30B+ parameter models to run at production-level speeds without requiring enterprise-grade hardware.

Actionable Advice

▶ Optimize Your Stack: When facing VRAM bottlenecks, consider specialized forks like ik_llama.cpp that prioritize heterogeneous compute efficiency, rather than relying solely on the upstream llama.cpp main branch.

▶ Re-evaluate Hybrid Inference: For edge computing and local workstations, prioritize tuning the balance between CPU and GPU offloading. Strategic layer distribution often yields a higher ROI than simply upgrading to higher-VRAM GPUs.
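As a rough illustration of what "strategic layer distribution" means in practice, the sketch below estimates how many whole model layers fit on the GPU once some VRAM is set aside for the KV cache and runtime overhead. It is a minimal sketch, not the scheduling logic of llama.cpp or ik_llama.cpp: the function name plan_layer_split, the OffloadPlan type, and every number in the example are hypothetical, and it assumes all layers are the same size.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class OffloadPlan:
    gpu_layers: int      # layers kept in VRAM
    cpu_layers: int      # layers left in system RAM
    gpu_bytes_used: int  # VRAM consumed by the GPU-resident layers


def plan_layer_split(n_layers: int, bytes_per_layer: int,
                     vram_bytes: int, reserved_bytes: int) -> OffloadPlan:
    """Greedy split: place as many whole layers on the GPU as the budget allows.

    Hypothetical helper. Assumes uniform layer size and a fixed reservation
    for KV cache, activations, and driver overhead.
    """
    if n_layers < 0 or bytes_per_layer <= 0:
        raise ValueError("n_layers must be >= 0 and bytes_per_layer > 0")
    # VRAM left for weights after the reservation (never negative).
    budget = max(vram_bytes - reserved_bytes, 0)
    gpu_layers = min(n_layers, budget // bytes_per_layer)
    return OffloadPlan(
        gpu_layers=gpu_layers,
        cpu_layers=n_layers - gpu_layers,
        gpu_bytes_used=gpu_layers * bytes_per_layer,
    )


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of any real model.
    plan = plan_layer_split(
        n_layers=40,
        bytes_per_layer=GIB // 2,
        vram_bytes=12 * GIB,
        reserved_bytes=2 * GIB,
    )
    print(plan)
```

A count like gpu_layers would then be passed to the runtime's GPU-layer setting (in upstream llama.cpp, the -ngl / --n-gpu-layers option) and adjusted by measuring throughput, since the best split also depends on CPU memory bandwidth and on which tensors are offloaded.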

Breaking the VRAM Ceiling: How ik_llama.cpp Enables 110 tok/s on Qwen 35B with 12GB VRAM

BAGUA AI