Ornith-1.0-35B Breakthrough: Native MTP Grafting Achieves 1.35x Speedup in Local Inference

● PUBLISHED: 2026 6 29 · SOURCE: Reddit LocalLLaMA →

[ DATA_STREAM_START ]

The Ornith-1.0-35B update introduces a sophisticated native Multi-Token Prediction (MTP) draft head graft onto its IQ4_XS quantized body, delivering a substantial performance leap for local inference within the llama.cpp ecosystem.

▶ Native MTP Grafting: Successfully integrated a native draft head (quantized at Q6) directly onto the model body, enabling self-speculative decoding on a single GPU without the overhead of a separate draft model.
▶ Performance & Fidelity Gains: Single-stream decoding throughput jumped from 172.6 to 233.8 tokens/sec—a 1.35x acceleration—while maintaining byte-identical next-token distribution (KLD 0.0) compared to the target-only model.
▶ Deterministic Long-Context Stability: Achieved a 93.4% token match rate in long-context generation, with BF16 KLD metrics outperforming standard Q4_K_M quantization schemes.

Bagua Insight

The Ornith-1.0 update signals a shift in the Local LLM optimization paradigm toward “intra-architectural surgery.” Traditionally, speculative decoding requires a secondary, smaller draft model, which complicates VRAM management and inference scheduling. Ornith’s MTP grafting proves that within the GGUF/IQ quantization framework, leveraging native architectural components for self-acceleration is not only viable but highly efficient. This “space-for-time” trade-off—adding minimal weight for the draft head—offers a massive ROI for 35B-class models. In single-GPU deployments, this approach directly addresses the throughput bottleneck while bypassing the typical accuracy degradation associated with model distillation.

Actionable Advice

Developers optimizing local inference services should prioritize MTP-compatible architectures within the llama.cpp stack. The Ornith case study demonstrates that for 30B-70B models, combining IQ quantization with MTP speculative decoding is currently the “gold standard” for balancing VRAM footprint and generation speed. Furthermore, when benchmarking, teams should look beyond TTFT (Time to First Token) and scrutinize the decoding consistency enabled by MTP, which is critical for logic-heavy applications like RAG and automated coding.

[ DATA_STREAM_END ]

[ ORIGINAL_SOURCE ]

READ_ORIGINAL →

[ 02 ] RELATED_INTEL

2026 6 10

Inside Siri’s Architecture: WaveRNN and FastSpeech2 Powering On-Device Voice Synthesis

Core Summary Recent teardowns of iOS system files reveal that Siri’s Text-to-Speech (TTS) pipeline has transitioned to a WaveRNN and…

2026 5 14

The Great Data Enclosure: Google and Cloudflare Choke the Open Web for AI

Google has signaled the end of the open-web era for AI by restricting its free Search API to a mere…

2026 6 5

BeeLlama v0.3.1 Released: Redefining Local Inference with 5x Throughput Gains on RTX 3090