Breaking the VRAM Ceiling: How ik_llama.cpp Enables 110 tok/s on Qwen 35B with 12GB VRAM

● PUBLISHED: 2026-05-21 · SOURCE: Reddit LocalLLaMA

[ DATA_STREAM_START ]

Event Core

A developer reports reaching 110 tokens per second on a Qwen 3.6 35B model with an RTX 4070 Super (12GB VRAM) after switching from standard llama.cpp to the ik_llama.cpp fork. The result highlights how much optimized CPU offloading matters when VRAM is the constraint.

Bagua Insight

▶ Asymmetric Performance Gains: In standard llama.cpp, MTP (multi-token prediction, a form of speculative decoding) often struggles with overhead on mid-range hardware. The ik_llama.cpp fork schedules CPU-offloaded work more efficiently, which reduces the penalty of running a model that does not fit entirely in GPU VRAM.
▶ Democratizing Large Models: This benchmark suggests that software-level operator optimization can narrow the performance gap for consumer-grade GPUs, allowing 30B+ parameter models to run at practical speeds without enterprise-grade hardware.
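
The size of the VRAM gap can be illustrated with some rough arithmetic. The sketch below is only a back-of-envelope estimate: the function name and the bits-per-weight values are assumptions chosen for illustration, not figures from the original post, and the estimate covers weights only (no KV cache or runtime buffers).

```python
def weight_footprint_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Hypothetical helper: ignores KV cache, activations and runtime overhead.
    """
    return n_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    vram_gb = 12.0  # RTX 4070 Super, as stated in the article
    # Bits-per-weight values are illustrative assumptions.
    for bits in (16.0, 8.0, 4.5):
        size = weight_footprint_gb(35, bits)
        verdict = "fits" if size <= vram_gb else "does not fit"
        print(f"35B at {bits} bits/weight: about {size:.1f} GB, {verdict} in {vram_gb:.0f} GB VRAM")
```

Under these assumptions, even an aggressively quantized 35B model exceeds 12GB, so part of the model has to live in system RAM and the efficiency of the CPU side becomes decisive.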

Actionable Advice

▶ Optimize Your Stack: When facing VRAM bottlenecks, pivot to specialized forks like ik_llama.cpp that prioritize heterogeneous compute efficiency rather than relying solely on the upstream llama.cpp main branch.
▶ Re-evaluate Hybrid Inference: For edge computing and local workstations, prioritize tuning the balance between CPU and GPU offloading. Strategic layer distribution often yields a higher ROI than simply upgrading to higher-VRAM GPUs.
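
As a starting point for tuning the CPU/GPU split, the number of layers to place on the GPU can be estimated from the model size and the available VRAM. This is a minimal sketch under simplifying assumptions: all layers are treated as equally sized, the 2 GB reserve for KV cache and buffers is a guess, and the function name and example numbers are hypothetical rather than taken from the original post.

```python
def gpu_layers_that_fit(n_layers: int, model_gb: float, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, leaving headroom for the KV cache.

    Assumes every layer has the same size, which real models only approximate.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Illustrative numbers only: a 40-layer model whose weights take 20 GB.
    layers = gpu_layers_that_fit(n_layers=40, model_gb=20.0, vram_gb=12.0)
    print(f"Try offloading about {layers} layers to the GPU, then adjust by measurement")
```

The result is only an initial value for a setting such as llama.cpp's `--n-gpu-layers` (`-ngl`) option. Measured tokens per second at a few nearby values should decide the final split, and option names in a given fork should be checked against its own documentation.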

[ DATA_STREAM_END ]
