Crushing the 100 t/s Barrier: RTX 5090 + 3090 Ti Synergy via Tensor Parallelism for Qwen3.6-27B
By switching from traditional layer-based splitting to tensor-split mode, a developer reports more than 100 tokens per second (t/s) on Qwen3.6-27B (Q8_0) using a heterogeneous RTX 5090 and RTX 3090 Ti setup, a throughput gain of roughly 43% over previous configurations.
- ▶ Breaking the Heterogeneous Bottleneck: Tensor splitting removes the sequential “waiting game” inherent in layer-wise distribution, where each GPU sits idle while the other works through its share of the layers, so the RTX 5090 spends far less time waiting on the slower 3090 Ti.
- ▶ 27B Models Hit Instant-Response Territory: Reaching 100+ t/s at Q8 precision on consumer-grade hardware signals that local LLMs can now compete with, and in some cases outpace, premium cloud APIs for high-throughput reasoning tasks.
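The headline figures can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the nominal 27 billion parameters, assumes every weight is stored as Q8_0 (32 int8 values plus one fp16 scale per block, i.e. 34 bytes per 32 weights), and ignores KV cache and runtime overhead. The function names are hypothetical.

```python
def implied_baseline(new_tps: float, gain: float) -> float:
    """Throughput before the change, given the new rate and the relative gain."""
    return new_tps / (1.0 + gain)


def q8_0_weight_gb(n_params: float) -> float:
    """Approximate weight size in GB if all weights use Q8_0 blocks."""
    bytes_per_weight = 34 / 32  # 32 int8 weights + one 2-byte fp16 scale
    return n_params * bytes_per_weight / 1e9


baseline_tps = implied_baseline(100.0, 0.43)  # previous rate implied by a 43% gain
ms_per_token = 1000.0 / 100.0                 # latency per token at 100 t/s
weights_gb = q8_0_weight_gb(27e9)             # assumption: exactly 27e9 parameters

print(f"implied baseline: {baseline_tps:.1f} t/s")
print(f"per-token latency at 100 t/s: {ms_per_token:.0f} ms")
print(f"approx. Q8_0 weights: {weights_gb:.1f} GB")
```

Under these assumptions the earlier configuration would have run at about 70 t/s, and the weights alone come to roughly 28.7 GB, which leaves little room for KV cache on a single 32 GB card and helps explain why the second GPU is in the picture at all.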
Bagua Insight
This result highlights a shift in the local LLM community: from “VRAM capacity anxiety” to “TFLOPS saturation optimization.” In multi-GPU rigs, especially mismatched ones, naive layer splitting creates pipeline stalls: for a single request, the flagship card (5090) sits idle while the legacy card (3090 Ti) works through its layers, and vice versa. Tensor Parallelism (TP) addresses this by distributing the compute load of each individual layer across both GPUs simultaneously, at the cost of exchanging partial results between the cards at every layer. It suggests that, as we enter the Blackwell era, software-level orchestration is the “secret sauce” that determines whether a hardware investment translates into actual inference speed.
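To make the idea concrete, here is a toy sketch of tensor parallelism in plain Python. It is not llama.cpp's implementation: it simply splits the output rows of one weight matrix unevenly between two hypothetical devices, lets each compute its slice of the matrix-vector product, and gathers the pieces. The 0.64/0.36 split and all function names are assumptions chosen for illustration.

```python
def matvec(weight_rows, x):
    """Reference matrix-vector product on a single device."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def split_rows(weight_rows, fractions):
    """Partition output rows between devices in proportion to `fractions`."""
    total = sum(fractions)
    n = len(weight_rows)
    shards, start, acc = [], 0, 0.0
    for i, frac in enumerate(fractions):
        acc += frac
        # The last device takes whatever remains so that no row is dropped.
        end = n if i == len(fractions) - 1 else round(n * acc / total)
        shards.append(weight_rows[start:end])
        start = end
    return shards


def tensor_parallel_matvec(weight_rows, x, fractions):
    """Each device computes its slice of the same layer; results are gathered."""
    shards = split_rows(weight_rows, fractions)
    # On real hardware these partial products would run concurrently, one per GPU.
    partials = [matvec(shard, x) for shard in shards]
    # Gather step: concatenate the per-device outputs in row order.
    return [y for part in partials for y in part]


W = [[r * 3 + c for c in range(3)] for r in range(8)]  # toy 8x3 weight matrix
x = [1, -2, 3]

sharded = tensor_parallel_matvec(W, x, fractions=(0.64, 0.36))
print(sharded == matvec(W, x))  # the split computation matches the single-device result
```

The gather step is where the two cards must synchronize, once per layer. The uneven split is what lets a faster card take a larger slice so that both finish at about the same time.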
Actionable Advice
For users running multi-GPU setups, especially those mixing different generations of NVIDIA hardware, it is time to move beyond default layer splitting. Prioritize backends like llama.cpp that support --split-mode tensor, so that both GPUs work on every layer at once instead of taking turns. When configuring heterogeneous rigs, balance the split by each card’s compute density rather than by VRAM alone, so the slower card does not become the per-layer bottleneck. For models in the 20B-30B range, the combination of Q8 quantization and tensor splitting looks like the current “sweet spot” for achieving enterprise-grade performance on a prosumer budget.
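As a rough sketch of what such a configuration might look like, the helper below derives a split ratio from a per-GPU throughput score and assembles a llama.cpp server command line. Several things here are assumptions: the `--split-mode tensor` value is taken from the text above and should be checked against your llama.cpp build, it is not confirmed that this mode honors `--tensor-split`, the model filename is hypothetical, and the scores are approximate published memory bandwidths in GB/s, used as a stand-in for measured per-card performance.

```python
def split_ratio(scores):
    """Normalize per-GPU throughput scores into fractions that sum to about 1."""
    total = sum(scores)
    return [round(s / total, 2) for s in scores]


def build_command(model_path, scores):
    """Assemble an argv list; nothing is executed here."""
    ratio = ",".join(f"{r:.2f}" for r in split_ratio(scores))
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",               # offload all layers to the GPUs
        "--split-mode", "tensor",   # assumption: value as named in the article
        "--tensor-split", ratio,    # assumption: ratio is respected in this mode
    ]


# Assumed scores: approximate memory bandwidth in GB/s (RTX 5090, RTX 3090 Ti).
gpu_scores = [1792, 1008]
command = build_command("qwen3.6-27b-q8_0.gguf", gpu_scores)  # hypothetical filename
print(" ".join(command))
```

Weighting by a throughput score rather than by VRAM gives the faster card the larger share of each layer. In practice the ratio is best tuned by benchmarking a few values around this starting point, while confirming that each card's share of the weights plus KV cache still fits in its memory.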