1. Introduction
Spectral gradient methods, such as the Muon optimizer, represent a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers. While traditional optimizers like Adam and RMSprop rely on diagonal preconditioning of the Euclidean gradient, spectral methods modify the geometry of the update by operating on the spectrum of layerwise gradient matrices.
The core question: In which regimes should one expect spectral updates to outperform standard Euclidean gradient methods?
This article synthesizes the theoretical framework of Davis & Drusvyatskiy (2025)[^1], which provides a simple layerwise condition predicting when spectral updates yield larger loss decreases than Euclidean steps. The answer lies in the interplay between the nuclear-to-Frobenius ratio of the gradients and the stable rank of the incoming activations.
2. Background: From Euclidean to Spectral Gradient Descent
2.1 Euclidean Gradient Descent
Standard gradient descent updates parameters by moving opposite to the gradient direction in Euclidean space:

$$W_{t+1} = W_t - \eta \, \nabla L(W_t),$$

where $\eta > 0$ is the learning rate and $\nabla L(W_t)$ is the gradient of the loss with respect to the weight matrix $W_t$.
2.2 Spectral Gradient Descent (SpecGD)
Spectral Gradient Descent (SpecGD)[^2] replaces the raw gradient with its polar factor, effectively moving in a direction with unit spectral norm:

$$W_{t+1} = W_t - \frac{\|G_t\|_*}{L_{\mathrm{op}}} \, \mathrm{polar}(G_t), \qquad G_t = \nabla L(W_t),$$

where:
- $\|G\|_* = \sum_i \sigma_i(G)$ is the nuclear norm (sum of singular values)
- $\mathrm{polar}(G) = UV^\top$ if $G = U \Sigma V^\top$ is the SVD
- $L_{\mathrm{op}}$ is the operator-norm-dependent Lipschitz constant of the gradient
2.3 The Muon Optimizer
Muon (Momentum Orthogonalized by Newton-Schulz)[^3] implements a momentum-based variant of SpecGD:

$$M_t = \beta M_{t-1} + G_t, \qquad W_{t+1} = W_t - \eta \, \mathrm{NS}(M_t),$$

where $\mathrm{NS}(\cdot)$ denotes orthogonalization via the Newton-Schulz iteration.
The Newton-Schulz iteration provides an efficient approximation to the polar factor:

$$X_0 = \frac{M}{\|M\|_F}, \qquad X_{k+1} = \tfrac{3}{2} X_k - \tfrac{1}{2} X_k X_k^\top X_k.$$

This converges double-exponentially and requires only matrix multiplications, avoiding an explicit SVD.
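As a sanity check, the following minimal sketch (our construction, not from the paper) compares the cubic Newton-Schulz output against the exact polar factor computed from an SVD:

```python
import torch

def newton_schulz(M, steps=25):
    # Normalize so all singular values lie in (0, 1]; required for convergence
    X = M / torch.linalg.matrix_norm(M, ord=2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

torch.manual_seed(0)
M = torch.randn(64, 64, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(M)
polar_exact = U @ Vh  # exact polar factor U V^T
err = torch.linalg.matrix_norm(polar_exact - newton_schulz(M), ord=2)
print(f"operator-norm error: {err:.2e}")  # shrinks rapidly with more steps
```

Production Muon implementations typically use a tuned higher-order polynomial iteration for faster convergence; the cubic form shown here matches the sketch in Section 9.2.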
3. Layerwise Condition for Spectral Update Advantage
3.1 One-Step Descent Comparison
Consider the random feature regression model

$$L(W) = \tfrac{1}{2} \, \| WA - Y \|_F^2,$$

where $A \in \mathbb{R}^{d \times n}$ is the incoming activation matrix and $Y$ is the target.

The loss admits the exact Taylor expansion (with $G = \nabla L(W) = (WA - Y)A^\top$):

$$L(W + \Delta) = L(W) + \langle G, \Delta \rangle + \tfrac{1}{2} \, \| \Delta A \|_F^2.$$

The quadratic term can be bounded in two ways:

$$\|\Delta A\|_F^2 \le \|A\|_{\mathrm{op}}^2 \, \|\Delta\|_F^2 \qquad \text{and} \qquad \|\Delta A\|_F^2 \le \|A\|_F^2 \, \|\Delta\|_{\mathrm{op}}^2.$$
3.2 The Key Condition
Minimizing each bound over the step size yields the guaranteed descent for each method:

$$\text{GD:} \quad \frac{\|G\|_F^2}{2\,\|A\|_{\mathrm{op}}^2}, \qquad \text{SpecGD:} \quad \frac{\|G\|_*^2}{2\,\|A\|_F^2}.$$

Spectral descent is favored whenever:

$$\frac{\|G\|_*^2}{2\,\|A\|_F^2} \;\ge\; \frac{\|G\|_F^2}{2\,\|A\|_{\mathrm{op}}^2}.$$
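To make this concrete, here is a small sketch (the shapes and the low-rank construction are ours) that evaluates both guarantees on a synthetic instance with degenerate activations:

```python
import torch

torch.manual_seed(0)
d, n, out = 256, 512, 64
# Low-stable-rank activations: rank-4 structure plus small noise
A = torch.randn(d, 4) @ torch.randn(4, n) + 0.01 * torch.randn(d, n)
G = torch.randn(out, d)  # a generic (high-nuclear-rank) gradient

nuc_G = torch.linalg.svdvals(G).sum()                # ||G||_*
fro_G_sq = (G ** 2).sum()                            # ||G||_F^2
op_A_sq = torch.linalg.matrix_norm(A, ord=2) ** 2    # ||A||_op^2
fro_A_sq = (A ** 2).sum()                            # ||A||_F^2

gd_decrease = fro_G_sq / (2 * op_A_sq)               # Euclidean guarantee
spec_decrease = nuc_G ** 2 / (2 * fro_A_sq)          # spectral guarantee
print(f"GD: {gd_decrease:.3f}, SpecGD: {spec_decrease:.3f}, "
      f"ratio: {spec_decrease / gd_decrease:.1f}")   # ratio = rank_*(G) / srank(A)
```

The printed ratio is exactly the nuclear-rank-to-stable-rank quotient introduced next.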
3.3 Nuclear Rank and Stable Rank
Define the key quantities:
| Quantity | Definition | Meaning |
|---|---|---|
| Nuclear rank $\mathrm{rank}_*(G)$ | $\|G\|_*^2 / \|G\|_F^2$ | How spread out the gradient's singular values are |
| Stable rank $\mathrm{srank}(A)$ | $\|A\|_F^2 / \|A\|_{\mathrm{op}}^2$ | Degeneracy of the activation matrix |
The Layerwise Condition[^1]:

$$\mathrm{rank}_*(G) \;=\; \frac{\|G\|_*^2}{\|G\|_F^2} \;\ge\; \frac{\|A\|_F^2}{\|A\|_{\mathrm{op}}^2} \;=\; \mathrm{srank}(A).$$
When this condition holds, spectral updates yield larger loss decreases than Euclidean updates.
Interpretation:
- Low stable rank $\mathrm{srank}(A)$: the activations are highly degenerate
- High nuclear rank $\mathrm{rank}_*(G)$: the gradient has many significant singular values
- When the gradient’s structure is much “richer” than the activation’s, spectral methods shine
4. Nuclear-to-Frobenius Ratio vs Stable Rank
4.1 Mathematical Properties
Nuclear norm: $\|G\|_* = \sum_i \sigma_i(G)$, the sum of singular values.

Frobenius norm: $\|G\|_F = \sqrt{\sum_i \sigma_i(G)^2}$, the Euclidean norm of the flattened matrix.

Nuclear-to-Frobenius ratio:

$$\frac{\|G\|_*}{\|G\|_F} = \frac{\sum_i \sigma_i}{\sqrt{\sum_i \sigma_i^2}} \in \left[1, \sqrt{\mathrm{rank}(G)}\right].$$

This ratio measures how uniform the singular value distribution is:
- Uniform spectrum (all $\sigma_i$ equal): ratio $= \sqrt{r}$ for rank $r$ (large)
- Spiky spectrum (one dominant $\sigma_1$): ratio $\approx 1$ (small)
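A short numerical check (synthetic spectra, our construction) confirms the two extremes:

```python
import torch

r = 100
spectra = {
    "uniform": torch.ones(r),                          # all sigma_i equal
    "spiky": torch.tensor([10.0] + [0.01] * (r - 1)),  # one dominant sigma_1
}
for name, s in spectra.items():
    ratio = s.sum() / torch.sqrt((s ** 2).sum())
    print(f"{name}: ||G||_* / ||G||_F = {ratio:.2f}")  # ~10 = sqrt(r) vs ~1
```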
4.2 Stable Rank Bounds
The stable rank satisfies:

$$1 \;\le\; \mathrm{srank}(A) = \frac{\|A\|_F^2}{\|A\|_{\mathrm{op}}^2} \;\le\; \mathrm{rank}(A).$$

Low stable rank means most of the "energy" is concentrated in the top singular value:

$$\mathrm{srank}(A) \approx 1 \iff \sigma_1(A)^2 \approx \sum_i \sigma_i(A)^2.$$
4.3 Dimensional Scaling
The ratio of descent guarantees scales as:

$$\frac{\text{SpecGD decrease}}{\text{GD decrease}} \;=\; \frac{\|G\|_*^2 / \|A\|_F^2}{\|G\|_F^2 / \|A\|_{\mathrm{op}}^2} \;=\; \frac{\mathrm{rank}_*(G)}{\mathrm{srank}(A)}.$$

When $\mathrm{srank}(A) = O(1)$ (constant) and $\mathrm{rank}_*(G) = \Theta(d)$ (grows with dimension), the spectral advantage scales linearly with the dimension $d$.
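The scaling is easy to see numerically. The sketch below (a toy construction of ours) pairs a full-spectrum Gaussian gradient with rank-one activations and prints the advantage ratio as the dimension grows:

```python
import torch

torch.manual_seed(0)
for d in [64, 128, 256, 512]:
    G = torch.randn(d, d)                          # full spectrum: rank_* ~ 0.7 d
    A = torch.randn(d, 1) @ torch.randn(1, 4 * d)  # rank-1 activations: srank = 1
    nr = torch.linalg.svdvals(G).sum() ** 2 / (G ** 2).sum()
    sr = (A ** 2).sum() / torch.linalg.matrix_norm(A, ord=2) ** 2
    print(f"d={d:4d}  nuclear_rank={nr:6.1f}  stable_rank={sr:.2f}  "
          f"advantage={nr / sr:6.1f}")             # grows roughly linearly in d
```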
5. Low Stable Rank at Gaussian Initialization
5.1 Feedforward Networks
Consider a feedforward neural network with weight matrices $W_1, \dots, W_L$ and activation function $\phi$. The post-activation matrices are:

$$A_\ell = \phi(W_\ell A_{\ell-1}), \qquad A_0 = X.$$
Theorem (Davis & Drusvyatskiy, 2025[^1]):
With high probability over Gaussian initialization:
- Second layer: for any fixed activation $\phi$, $\mathrm{srank}(A_2)$ is bounded by a constant independent of the width
- Any layer with random weights: for random Gaussian $W_1, \dots, W_\ell$, $\mathrm{srank}(A_\ell) = O(1)$
- Quadratic activations: for $\phi(t) = t^2$, the bound on $\mathrm{srank}(A_\ell)$ depends only on the depth $\ell$
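The following sketch (a toy ReLU network; widths and depth are our choices, not the paper's setup) measures the stable rank of post-activations at Gaussian initialization and shows it staying near a small constant, far below the width:

```python
import torch

def stable_rank(M):
    return ((M ** 2).sum() / torch.linalg.matrix_norm(M, ord=2) ** 2).item()

torch.manual_seed(0)
width, n_samples, depth = 1024, 2048, 4
A = torch.randn(width, n_samples)  # input batch
for layer in range(1, depth + 1):
    W = torch.randn(width, width) / width ** 0.5  # Gaussian init, variance 1/width
    A = torch.relu(W @ A)
    print(f"layer {layer}: stable_rank = {stable_rank(A):.1f} "
          f"(max possible {min(A.shape)})")
```

The effect here is driven by the positive mean that ReLU induces, which creates a single dominant singular direction shared across samples.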
5.2 Transformer Blocks
For decoder-only transformers with RMS-normalized attention/MLP blocks:
Corollary: At Gaussian initialization, the hidden representations entering attention and MLP projections have stable rank bounded by a depth-dependent constant that is independent of width and sequence length.
5.3 Why Low Stable Rank Emerges
The low stable rank arises from:
- Activation functions: ReLU, GELU, SwiGLU create sparse/structured outputs
- Initialization: Gaussian weights produce peaked singular value distributions
- Propagation: Layer compositions amplify the degeneracy
This is not just an initialization artifact—empirical studies show stable rank remains low throughout training in NanoGPT-scale models.
6. Spiked Random Feature Model Analysis
6.1 Model Setup
The spiked random feature model captures essential aspects of neural network training:
$$Y = W_\star \, \phi(X),$$

where $\phi$ is a feature map and $W_\star$ are the ground-truth parameters; the model is fit by regressing $W$ against the random features $A = \phi(X)$.
6.2 Nuclear Rank Evolution
Theorem (Multi-step Gradient Descent[^1]):
After a short burn-in period:
- $\mathrm{rank}_*(G_t) = \Omega(d)$ holds for a long window of iterations
- For any fixed $\epsilon > 0$, $\mathrm{rank}_*(G_t) \ge (1 - \epsilon)\, d$ holds over a (shorter) window of iterations

Meanwhile, the number of steps Euclidean GD needs to reach a fixed relative error $\epsilon$ is large enough that the spectral advantage window represents a constant fraction of training time.
6.3 Key Insight
The nuclear rank of the gradients grows with the dimension while the stable rank of the activations remains bounded, making the advantage dimension-dependent:

$$\frac{\mathrm{rank}_*(G)}{\mathrm{srank}(A)} = \Theta(d).$$
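A minimal simulation (a plain random-feature regression of our own, not the paper's exact spiked model) shows the gradient's nuclear rank staying a sizeable fraction of the dimension along a GD trajectory:

```python
import torch

torch.manual_seed(0)
d, n, k = 128, 512, 128
A = torch.randn(d, n) / d ** 0.5   # random features
Y = torch.randn(k, d) @ A          # targets from ground-truth parameters
W = torch.zeros(k, d)

lr = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2  # stable step size
for t in range(201):
    G = (W @ A - Y) @ A.T          # gradient of 0.5 * ||W A - Y||_F^2
    if t % 50 == 0:
        s = torch.linalg.svdvals(G)
        print(f"t={t:3d}  nuclear_rank(G) = {(s.sum() ** 2 / (s ** 2).sum()):.1f} of d={d}")
    W = W - lr * G
```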
7. Dimension Scaling Advantage
7.1 Theoretical Scaling
For high-dimensional problems ($d \gg 1$):
| Quantity | Behavior | Implication |
|---|---|---|
| $\mathrm{srank}(A)$ | Constant | Independent of dimension |
| $\mathrm{rank}_*(G)$ | Grows | Increases with dimension $d$ |
| Spectral advantage $\mathrm{rank}_*(G)/\mathrm{srank}(A)$ | $\Theta(d)$ | Linear speedup possible |
7.2 Empirical Validation
Experiments on random feature regression show:
- At a fixed dimension, SpecGD reaches the target loss in substantially fewer steps than GD
- Doubling the dimension approximately doubles the speedup ratio
- The nuclear rank of gradients remains high throughout training
7.3 NanoGPT-Scale Experiments
Training experiments on NanoGPT-scale language models[^1]:
- Intermediate activations have low stable rank throughout training
- Corresponding gradients maintain large nuclear-to-Frobenius ratios
- This validates the theoretical predictions at realistic scale
8. Empirical Validation in NanoGPT Training
8.1 Experimental Setup
The paper validates predictions using the modded-NanoGPT repository:
- Architecture: Standard transformer with attention and MLP blocks
- Training: Full-batch and stochastic gradient descent variants
- Monitoring: Stable rank of post-activations and nuclear rank of gradients
8.2 Key Findings
- MLP post-activations: stable rank remains far below its maximal possible value (the layer width)
- Nuclear rank persistence: unlike initialization-only effects, high nuclear rank persists throughout training
- Layerwise variation: different layers show varying degrees of spectral advantage, matching the layerwise condition
8.3 Practical Implications
- Spectral updates are most beneficial for intermediate layers where activations are most degenerate
- First and last layers may not benefit as much from spectral methods
- The advantage compounds when spectral methods are applied consistently across multiple layers
9. Practical Guidelines for Using Spectral Methods
9.1 When to Use Spectral Gradient Methods
Favorable conditions:
- Large matrix-shaped parameters (linear layers, embeddings)
- High-dimensional inputs (large $d$)
- Training deep networks or transformers
- Situations where nuclear rank $\gg$ stable rank, i.e. $\mathrm{rank}_*(G) \gg \mathrm{srank}(A)$
Less favorable conditions:
- Very small models with few parameters
- Settings with gated activations that may increase stable rank
- When computational overhead of SVD/orthogonalization is prohibitive
9.2 Muon Implementation Guidelines
```python
import torch

# Muon optimizer update sketch
def muon_update(W, grad, momentum, lr, beta=0.9, num_ns_steps=5):
    # Update momentum buffer (exponential moving average of gradients)
    momentum = beta * momentum + (1 - beta) * grad
    # Normalize so singular values lie in [0, 1] before Newton-Schulz
    X = momentum / (torch.norm(momentum) + 1e-8)
    # Newton-Schulz orthogonalization toward the polar factor
    for _ in range(num_ns_steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    # Apply the orthogonalized update
    W = W - lr * X
    return W, momentum
```

9.3 Hyperparameter Recommendations
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning rate | $10^{-4}$ to $10^{-3}$ | Often lower than Adam |
| Momentum ($\beta$) | $0.9$ to $0.99$ | Standard momentum values |
| Newton-Schulz steps | $3$ to $10$ | More steps = more accurate, but slower |
| Weight decay | $0.01$ to $0.1$ | Similar to AdamW |
9.4 Hybrid Approaches
In practice, Muon is often combined with Adam for non-matrix parameters:
- Matrix parameters (linear layers): Muon
- Scalar parameters (biases, layernorms): Adam/AdamW
This hybrid approach leverages the strengths of each method.
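A sketch of the grouping logic (module sizes are illustrative; `MuonOptimizer` is the class defined in Section 10.3 below):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 2048), nn.GELU(),
    nn.Linear(2048, 512), nn.LayerNorm(512),
)

# Matrix-shaped weights go to Muon; biases and norm scales go to AdamW
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
other_params = [p for p in model.parameters() if p.ndim < 2]

muon = MuonOptimizer(matrix_params, lr=1e-3)  # class from Section 10.3 below
adamw = torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.01)

# In the training loop, step both optimizers after each backward pass:
#   loss.backward(); muon.step(); adamw.step()
#   muon.zero_grad(); adamw.zero_grad()
```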
10. Code Examples
10.1 SpecGD Implementation
```python
import torch

def spectral_polar_and_nuclear(grad):
    """Compute the polar factor and nuclear norm of a gradient matrix."""
    g = grad.to(torch.float64)
    # Singular values via eigendecomposition of the Gram matrix G @ G^T
    gram = g @ g.T
    evals, evecs = torch.linalg.eigh(gram)
    evals = torch.clamp(evals, min=0.0)
    singulars = torch.sqrt(evals)
    nuclear = singulars.sum().item()
    # Polar factor UV^T = U Sigma^{-1} U^T G, dropping near-zero singular values
    mask = singulars > 1e-12 * singulars.max()
    if mask.any():
        vectors = evecs[:, mask]
        inv_scaled = vectors / singulars[mask]
        polar = (inv_scaled @ vectors.T) @ g
    else:
        polar = torch.zeros_like(g)
    return polar.to(dtype=grad.dtype), nuclear

def specgd_step(W, grad, A, lr):
    """
    SpecGD update: W <- W - lr * (||G||_* / ||A||_F^2) * polar(G)
    """
    fro_norm_A_sq = torch.sum(A * A).item()
    polar, nuclear = spectral_polar_and_nuclear(grad)
    scale = nuclear / fro_norm_A_sq
    W = W - lr * scale * polar
    return W
```

10.2 Computing Nuclear and Stable Rank
```python
def stable_rank(matrix):
    """Compute the stable rank: ||A||_F^2 / ||A||_op^2."""
    mat = matrix.to(torch.float64)
    fro_sq = torch.sum(mat * mat)
    op_norm = torch.linalg.matrix_norm(mat, ord=2)
    return (fro_sq / (op_norm * op_norm)).item()

def nuclear_rank(gradient):
    """
    Compute the nuclear rank: ||G||_*^2 / ||G||_F^2.
    Measures how uniform the singular value distribution is.
    """
    grad = gradient.to(torch.float64)
    fro_sq = torch.sum(grad * grad).item()
    # Nuclear norm via the singular values of G
    gram = grad @ grad.T
    evals = torch.linalg.eigvalsh(gram)
    evals = torch.clamp(evals, min=0.0)
    nuclear = torch.sqrt(evals).sum().item()
    if fro_sq == 0:
        return 0.0  # zero gradient: both norms vanish
    return (nuclear ** 2) / fro_sq

def check_spectral_advantage(grad, activation):
    """
    Check whether a spectral update would be advantageous.
    Returns: (advantage_ratio, recommendation)
    """
    nr = nuclear_rank(grad)
    st = stable_rank(activation)
    advantage_ratio = nr / st
    if advantage_ratio >= 2:
        return advantage_ratio, "Use spectral (SpecGD/Muon)"
    elif advantage_ratio >= 1:
        return advantage_ratio, "Spectral may help"
    else:
        return advantage_ratio, "Use Euclidean (SGD/Adam)"
```

10.3 Complete Muon Optimizer Class
```python
import torch
import torch.nn as nn

class MuonOptimizer:
    """
    Muon: Momentum Orthogonalized by Newton-Schulz.
    For matrix-shaped parameters, replaces the gradient with its polar factor.
    For other parameters, falls back to Adam.
    """
    def __init__(self, params, lr=1e-3, momentum=0.9,
                 weight_decay=0.01, ns_steps=5):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.ns_steps = ns_steps
        # Separate matrix-shaped and other parameters
        self.matrix_params = []
        self.other_params = []
        for p in params:
            if p.requires_grad:
                if p.dim() >= 2 and p.shape[0] * p.shape[1] > 100:
                    self.matrix_params.append(p)
                else:
                    self.other_params.append(p)
        # Momentum buffers for matrix parameters
        self.momentum_buffers = [
            torch.zeros_like(p) for p in self.matrix_params
        ]
        # Adam state for the remaining parameters
        self.adam_state = {
            'exp_avg': [torch.zeros_like(p) for p in self.other_params],
            'exp_avg_sq': [torch.zeros_like(p) for p in self.other_params],
        }
        self.step_count = 0

    def newton_schulz_iteration(self, X):
        """Orthogonalize a matrix using the Newton-Schulz iteration."""
        for _ in range(self.ns_steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X
        return X

    def step(self):
        self.step_count += 1
        # Matrix parameters: Muon update
        for i, p in enumerate(self.matrix_params):
            if p.grad is None:
                continue
            grad = p.grad.data
            # Heavy-ball momentum
            self.momentum_buffers[i].mul_(self.momentum).add_(grad)
            # Normalize so Newton-Schulz converges
            normed = self.momentum_buffers[i] / (torch.norm(self.momentum_buffers[i]) + 1e-8)
            # Orthogonalize toward the polar factor
            orthogonalized = self.newton_schulz_iteration(normed)
            # Decoupled weight decay, then the spectral step
            p.data.mul_(1 - self.lr * self.weight_decay)
            p.data.add_(orthogonalized, alpha=-self.lr)
        # Other parameters: Adam update
        beta1, beta2, eps = 0.9, 0.999, 1e-8
        for i, p in enumerate(self.other_params):
            if p.grad is None:
                continue
            grad = p.grad.data
            self.adam_state['exp_avg'][i].mul_(beta1).add_(grad, alpha=1 - beta1)
            self.adam_state['exp_avg_sq'][i].mul_(beta2).add_(grad * grad, alpha=1 - beta2)
            # Bias correction
            bias_correct1 = 1 - beta1 ** self.step_count
            bias_correct2 = 1 - beta2 ** self.step_count
            step_size = self.lr / bias_correct1
            denom = (self.adam_state['exp_avg_sq'][i] / bias_correct2).sqrt().add_(eps)
            p.data.mul_(1 - self.lr * self.weight_decay)
            p.data.addcdiv_(self.adam_state['exp_avg'][i], denom, value=-step_size)

    def zero_grad(self):
        for p in self.matrix_params + self.other_params:
            p.grad = None
```

10.4 Training Loop Comparison
```python
def train_comparison(model, train_loader, num_epochs=10, use_spectral=True):
    """
    Compare spectral vs Euclidean gradient descent.
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    if use_spectral:
        optimizer = MuonOptimizer(model.parameters(), lr=1e-3)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}: Loss = {avg_loss:.4f}")
    return model
```

11. Relationship to Other Concepts
11.1 Connection to Natural Gradient
Spectral gradient methods are closely related to natural gradient descent on the Stiefel manifold. The polar factor is the natural gradient direction when the parameter space is constrained to orthogonal matrices.
11.2 Connection to Shampoo
Shampoo[^4] uses Kronecker-factored preconditioning based on the left and right singular vectors of gradients. While Shampoo is more computationally efficient, Muon's orthogonalization can be seen as a "harder" version that fully respects the spectral structure.
11.3 Connection to SignSGD
Both SignSGD and SpecGD can be viewed as steepest descent under non-Euclidean norms:
- SignSGD: steepest descent under the elementwise $\ell_\infty$ norm
- SpecGD/Muon: steepest descent under the spectral (operator) norm

The Spec-Sign Advantage Index[^5] provides a unified criterion for choosing between them based on the nuclear-norm vs $\ell_1$-norm signal-to-noise ratios.
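The common structure is easy to verify numerically: a steepest-descent step of unit norm achieves a first-order decrease equal to the corresponding dual norm of the gradient. The sketch below (our construction; the advantage index itself is defined in the reference) compares the two gains:

```python
import torch

torch.manual_seed(0)
G = torch.randn(64, 64)

# SignSGD: unit l_inf step; first-order gain <G, sign(G)> = ||G||_1 (entrywise)
sign_gain = G.abs().sum()

# SpecGD: unit spectral-norm step; first-order gain <G, polar(G)> = ||G||_*
U, S, Vh = torch.linalg.svd(G)
spec_gain = S.sum()
assert torch.allclose((G * (U @ Vh)).sum(), spec_gain, rtol=1e-3)  # <G, UV^T> = tr(Sigma)

print(f"||G||_1 = {sign_gain:.1f}  vs  ||G||_* = {spec_gain:.1f}")
```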
12. Summary
The theory of spectral gradient updates provides a principled understanding of when methods like Muon outperform traditional optimizers:
| Key Insight | Implication |
|---|---|
| $\mathrm{rank}_*(G) \ge \mathrm{srank}(A)$ | Condition for spectral advantage |
| Low stable rank of activations | Ubiquitous in deep networks |
| Dimension scaling $\Theta(d)$ | Advantage grows with $d$ |
| Persistence throughout training | Not just an initialization effect |
Practical Takeaways:
- Use spectral methods (Muon) for matrix-shaped parameters in deep networks
- The advantage is strongest in intermediate layers with low stable rank activations
- Dimension matters: larger models/datasets benefit more from spectral methods
- Consider hybrid approaches: Muon for linear layers + Adam for others
Further Reading
- Muon Optimizer Theory — Detailed convergence analysis and variance reduction techniques
- Adaptive Optimizer Theory — Theory of Adam and related methods
- Deep Learning Optimizers — Practical guide to SGD, Adam, and variants
- Neural Tangent Kernel Theory — Connection to kernel methods
- Gradient Flow Theory — Continuous-time perspective on optimization
Footnotes

[^1]: Davis, D., & Drusvyatskiy, D. (2025). When do spectral gradient updates help in deep learning? arXiv:2512.04299. https://arxiv.org/abs/2512.04299
[^2]: Carlson, D., et al. (2015). Preconditioned Spectral Gradient Descent for Matrix Optimization. arXiv.
[^3]: Jordan, K., et al. (2024). Muon: Momentum Orthogonalized by Newton-Schulz.
[^4]: Gupta, V., et al. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. ICML 2018.
[^5]: Davis, D., et al. (2025). The Geometry of Spectral Gradient Descent: Layerwise Criteria for SignSGD vs SpecSGD. OpenReview.