Ornith-1.0-35B Breakthrough: Native MTP Grafting Achieves 1.35x Speedup in Local Inference
The Ornith-1.0-35B update grafts a native Multi-Token Prediction (MTP) draft head onto its IQ4_XS-quantized body, raising single-stream decode throughput for local inference within the llama.cpp ecosystem.
- ▶ Native MTP Grafting: Successfully integrated a native draft head (quantized at Q6) directly onto the model body, enabling self-speculative decoding on a single GPU without the overhead of a separate draft model.
- ▶ Performance & Fidelity Gains: Single-stream decoding throughput rose from 172.6 to 233.8 tokens/sec (233.8 / 172.6 ≈ 1.35x), while the next-token distribution stayed identical to that of the target-only model (KLD 0.0).
- ▶ Deterministic Long-Context Stability: Achieved a 93.4% token match rate in long-context generation, with KLD measured against the BF16 reference coming in lower than that of a standard Q4_K_M quantization.
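The three figures quoted above (speedup, KLD, token match rate) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the position-by-position definition of "token match rate" is an assumption, since the source does not say how the 93.4% figure was computed.

```python
import math
from typing import Sequence


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of accelerated to baseline decode throughput (tokens/sec)."""
    return accelerated_tps / baseline_tps


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def token_match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where two token sequences agree (assumed definition)."""
    n = min(len(reference), len(candidate))
    if n == 0:
        return 0.0
    return sum(r == c for r, c in zip(reference, candidate)) / n


# Throughput figures taken from the summary above.
print(f"{speedup(172.6, 233.8):.2f}x")  # 1.35x

# Identical distributions give a KLD of exactly 0.0.
p = [0.7, 0.2, 0.1]
print(kl_divergence(p, p))  # 0.0
```

A KLD of 0.0 only says that the two distributions being compared are the same; it says nothing about how far either one is from an unquantized reference, which is why the BF16 comparison is reported separately.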
Bagua Insight
The Ornith-1.0 update signals a shift in the Local LLM optimization paradigm toward “intra-architectural surgery.” Traditionally, speculative decoding requires a secondary, smaller draft model, which complicates VRAM management and inference scheduling. Ornith’s MTP grafting shows that within the GGUF/IQ quantization framework, leveraging native architectural components for self-acceleration is both viable and efficient. This is a “space-for-time” trade-off: the draft head adds a small amount of weight, and in return a 35B-class model gains roughly 35% more decode throughput. In single-GPU deployments, this approach directly addresses the throughput bottleneck. Because the target model verifies every drafted token before it is emitted, the output is unchanged, which avoids the accuracy degradation typically associated with model distillation.
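The draft-then-verify idea behind self-speculative decoding can be shown with a toy greedy loop. This is a minimal sketch under stated assumptions, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the model body and the grafted MTP head, and verification is done one token at a time here, whereas a real engine checks all drafted positions in a single batched forward pass (which is where the speedup comes from).

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]


def greedy_decode(target_next: NextTokenFn, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: the target model alone, one token per step."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def self_speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 3,
) -> Tuple[List[int], float]:
    """Draft up to draft_len tokens, then keep only what the target agrees with."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    accepted = proposed = 0
    while len(out) < limit:
        # 1) The draft head proposes a short continuation.
        ctx = list(out)
        draft = []
        for _ in range(min(draft_len, limit - len(out))):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) The target verifies each proposal; the first mismatch is
        #    replaced by the target's own token and the rest is discarded.
        for tok in draft:
            proposed += 1
            expected = target_next(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                out.append(expected)
                break
    rate = accepted / proposed if proposed else 0.0
    return out, rate


def toy_target(ctx: List[int]) -> int:
    """Hypothetical deterministic 'model' over a 7-token vocabulary."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft head: agrees with the target except after token 2."""
    return 6 if ctx[-1] == 2 else toy_target(ctx)


spec_out, acceptance = self_speculative_decode(toy_target, toy_draft, [1], 12)
assert spec_out == greedy_decode(toy_target, [1], 12)  # same tokens as target-only
print(f"draft acceptance rate: {acceptance:.2f}")
```

Every token appended to the output is the one the target would have chosen, so the result matches target-only greedy decoding regardless of how good the draft head is; draft quality only affects how many tokens are accepted per verification step, and therefore the speed. With sampling instead of greedy decoding, engines use a rejection-sampling acceptance rule to preserve the target distribution, which this sketch does not cover.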
Actionable Advice
Developers optimizing local inference services should prioritize MTP-compatible architectures within the llama.cpp stack. The Ornith case study suggests that for 30B-70B models, combining IQ quantization with MTP speculative decoding is a strong default for balancing VRAM footprint and generation speed, although the reported numbers cover a single 35B model. Furthermore, when benchmarking, teams should look beyond TTFT (Time to First Token), which speculative decoding does little to change, and measure decode throughput together with the output consistency that MTP preserves, which is critical for logic-heavy applications like RAG and automated coding.
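One way to keep TTFT and decode speed apart in a benchmark is to record three timestamps per request and derive both metrics from them. The sketch below assumes a hypothetical `RunTiming` record filled in by your own harness; it is not tied to any llama.cpp API.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps (seconds) and token count for one generation request."""
    request_start: float
    first_token_time: float
    end_time: float
    generated_tokens: int


def ttft(run: RunTiming) -> float:
    """Time to first token: dominated by prompt processing."""
    return run.first_token_time - run.request_start


def decode_tps(run: RunTiming) -> float:
    """Decode throughput, excluding the first token and the time before it."""
    decode_seconds = run.end_time - run.first_token_time
    if run.generated_tokens <= 1 or decode_seconds <= 0:
        return 0.0
    return (run.generated_tokens - 1) / decode_seconds


# Illustrative numbers only, not measurements from the Ornith release.
run = RunTiming(request_start=0.0, first_token_time=0.5, end_time=2.5, generated_tokens=201)
print(ttft(run))        # 0.5
print(decode_tps(run))  # 100.0
```

Comparing `decode_tps` with and without the draft head enabled, on the same prompts and with the generated tokens checked for equality, isolates the effect that speculative decoding is meant to deliver.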