Command A+ (218B MoE) Hits Apple Silicon: A New Frontier for Local Ultra-Large Scale Inference
Event Core
Cohere’s Command A+ model, featuring a massive 218B total parameter count with 25B active parameters, is officially being ported to Apple Silicon via the MLX framework. The architecture utilizes a 128-expert MoE (Mixture of Experts) setup with top-8 routing. A pull request (PR) has been opened for mlx-lm, introducing specific support for Cohere’s unique implementation of shared experts and Sigmoid-based routing.
- ▶ Architectural Innovation: Unlike standard MoE models, Command A+ employs a single shared expert (intermediate size 16,384) and uses normalized Sigmoid routing instead of Softmax to stabilize expert selection.
- ▶ Hardware Milestone: This port enables high-end Mac Studio and Mac Pro users to run one of the most sophisticated open-weights models locally, leveraging Apple’s Unified Memory.
- ▶ Strategic Licensing: Under the Apache 2.0 license, Cohere is positioning Command A+ as the go-to alternative for enterprise-grade, privacy-centric RAG applications.
Bagua Insight
The arrival of Command A+ on MLX is a watershed moment for the local LLM community. From a technical standpoint, the shift to Sigmoid routing and the inclusion of a “Shared Expert” layer address the “knowledge fragmentation” issues found in traditional MoE architectures like Mixtral, where every token's output comes only from its routed experts. By merging routed outputs with a shared backbone that processes every token, Cohere aims for a balance between specialized depth and generalist stability.
From a market perspective, this is a direct challenge to Meta’s dominance. By optimizing for MLX, Cohere is courting the “Prosumer” and “Enterprise Dev” demographic who require massive context windows (128k) and high parameter counts without the latency or privacy risks of cloud APIs. Apple Silicon is no longer just for creative work; it is becoming a serious workstation platform for local AI orchestration.
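The idea of merging routed outputs with a shared expert can be sketched as a toy forward pass. This is an illustration under stated assumptions: the shared expert's output is simply added to the weighted sum of the routed experts' outputs, the "experts" are stand-in scaling functions rather than real feed-forward blocks, and `moe_layer` is a hypothetical name, not the mlx-lm implementation.

```python
def moe_layer(x, routed_experts, shared_expert, weights):
    """Add the always-on shared expert to the weighted routed-expert outputs.

    Assumption: outputs are combined by plain summation.
    """
    out = shared_expert(x)  # every token passes through the shared expert
    for idx, w in weights.items():
        y = routed_experts[idx](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts: each one scales its input by a different factor
routed_experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda v: [0.5 * vi for vi in v]

x = [1.0, -2.0]
weights = {1: 0.25, 3: 0.75}  # hypothetical routing weights that sum to 1
y = moe_layer(x, routed_experts, shared_expert, weights)
print(y)  # [4.0, -8.0]
```

Only the experts named in `weights` run for a given token, which is why the active parameter count (25B) is far below the total (218B), while the shared expert contributes to every token regardless of routing.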
Actionable Advice
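- Infrastructure Planning: For organizations running local RAG, evaluate the 218B model as a replacement for smaller 70B models. With 25B active parameters it does less compute per token than a dense 70B model, but all 218B parameters must stay resident in memory, so benchmark retrieval-augmented quality and throughput on your own workload before switching.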
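- Quantization Strategy: Monitor the MLX PR for 4-bit and 6-bit quantization updates. A 4-bit MLX quantization will likely be the “sweet spot”, though at roughly 109GB for weights alone (218B parameters × 0.5 bytes), before quantization metadata and KV cache, it leaves very little headroom on 128GB RAM machines.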
- Architecture Benchmarking: Developers should analyze the Sigmoid routing mechanism. Because each expert is scored independently rather than competing through a Softmax, it may offer a blueprint for more stable fine-tuning than traditional Softmax-based MoE models.
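For the memory planning above, a rough back-of-the-envelope estimate of quantized weight size helps. The sketch below assumes group-wise quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias per group (32 metadata bits per group), applied uniformly to all 218B parameters. Real checkpoints usually keep some tensors at higher precision and also need room for the KV cache, so treat these numbers as lower bounds. The helper name `quantized_weight_gb` is illustrative.

```python
def quantized_weight_gb(n_params, bits, group_size=64, meta_bits=32):
    """Approximate weight storage in GB (1e9 bytes) for group-wise quantization.

    Assumption: every parameter is quantized, with `meta_bits` of scale/bias
    metadata shared by each group of `group_size` weights.
    """
    bits_per_weight = bits + meta_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 218e9  # total parameter count quoted for Command A+

for bits in (4, 6, 8):
    size = quantized_weight_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit: ~{size:.1f} GB of weights")
```

Under these assumptions a 4-bit build comes to about 122.6 GB and a 6-bit build to about 177.1 GB, which is why machine memory, not compute, is the first constraint to check.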