[ DATA_STREAM: RANK-COLLAPSE ]

Rank Collapse

SCORE
9.2

The Golden Ratio of Transformer Stability: Balancing MLP and Attention Spectral Norms

TIMESTAMP // May.12
#Geometric Stability #LLM Training #Rank Collapse #Spectral Analysis #Transformer

New research using Lyapunov spectrum analysis has identified a critical geometric law in decoder-only Transformers: the ratio of spectral norms between the MLP and Attention layers is a strong early predictor of "Rank-1 collapse." The study demonstrates that keeping this spectral ratio within the 0.5–2 range is essential for preserving geometric stability through the model's final layers.

▶ Predicting Rank-1 Collapse: Before a model loses representational diversity in its deep layers (where token representations converge toward a single vector), the spectral ratio between the MLP and Attention components exhibits significant imbalance.

▶ The 0.5–2 "Safe Zone": Empirically, when the ratio drifts outside this window, the model's energy biases heavily toward one component, causing rapid geometric degradation during the forward pass.

▶ Advanced Diagnostic Capability: Spectral ratio analysis offers a more granular diagnostic than traditional loss curves or gradient norms, enabling the detection of "silent failures" in representational learning.

Bagua Insight

As the industry scales LLMs to unprecedented depths, this discovery addresses a critical yet overlooked bottleneck: the geometric health of the architecture. For years, the balance between MLP and Attention has been set by empirical heuristics (e.g., the standard 4:1 hidden-dimension expansion), but these static rules fail to account for "energy drift" during training. By applying Lyapunov spectrum analysis, the study bridges dynamical systems theory and Transformer stability. It suggests that future architecture design will shift from simple parameter scaling to precise geometric alignment, ensuring feature spaces do not collapse in high-dimensional transitions. For labs pushing ultra-deep models or long-context stability, this ratio provides a vital new telemetry metric.

Actionable Advice

1. Implement Spectral Telemetry: Integrate MLP-to-Attention spectral ratio tracking into your pre-training observability stack as an early-warning system for model health (a minimal sketch follows this list).
2. Dynamic Initialization Tuning: If the ratio consistently drifts outside the 0.5–2 range during early iterations, adjust initialization gains or apply layer-wise scaling factors to restore geometric equilibrium.
3. Refine Residual Architectures: When iterating on Transformer variants, evaluate how residual-branch designs affect the spectral ratio to ensure balanced energy distribution between token mixing and feature refinement.
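Since the paper's exact measurement procedure is not reproduced here, the following is a minimal Python/PyTorch sketch of item 1. It assumes GPT-2-style parameter naming (attention weights under ".attn.", MLP weights under ".mlp.", layers indexed as "h.<idx>") and approximates each block's spectral norm by the largest singular value among its 2-D weight matrices; treat those names and that norm definition as illustrative assumptions, not the paper's method.

import torch
from collections import defaultdict

def spectral_ratio_report(model, low=0.5, high=2.0):
    """Print the per-layer MLP-to-Attention spectral-norm ratio and flag drift.

    Assumption: parameter paths look like "transformer.h.<idx>.attn...." and
    "transformer.h.<idx>.mlp...." (GPT-2 style); adapt the matching rules
    for other architectures.
    """
    norms = defaultdict(dict)  # layer index -> {"attn": sigma, "mlp": sigma}
    for name, param in model.named_parameters():
        if param.ndim != 2:
            continue  # skip biases, LayerNorm gains, etc.
        if ".attn." not in name and ".mlp." not in name:
            continue  # skip embeddings and anything outside the two branches
        layer = next((tok for tok in name.split(".") if tok.isdigit()), None)
        if layer is None:
            continue
        key = "attn" if ".attn." in name else "mlp"
        # ord=2 is the spectral norm (largest singular value). An exact SVD is
        # expensive, so in practice run this every N steps, or substitute a few
        # power-iteration steps for a cheap approximation.
        sigma = torch.linalg.matrix_norm(param.detach(), ord=2).item()
        # Summarize a branch by the largest spectral norm among its matrices
        # (an assumption; the paper may define the branch norm differently).
        norms[layer][key] = max(norms[layer].get(key, 0.0), sigma)
    for layer in sorted(norms, key=int):
        d = norms[layer]
        if "attn" in d and "mlp" in d:
            ratio = d["mlp"] / d["attn"]
            flag = "" if low <= ratio <= high else "  <-- outside 0.5-2 safe zone"
            print(f"layer {layer}: mlp/attn spectral ratio = {ratio:.2f}{flag}")

# Example usage (any decoder-only checkpoint with this naming scheme):
# from transformers import AutoModelForCausalLM
# spectral_ratio_report(AutoModelForCausalLM.from_pretrained("gpt2"))

Logged over training steps, this ratio supplies the early-warning signal the study describes; if it trends outside the window in early iterations, item 2's remedy is to rescale the offending branch (for instance, multiplying its initialization gain by a constant) rather than waiting for the loss curve to react.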

SOURCE: REDDIT MACHINELEARNING // UPLINK_STABLE