1. Introduction

Spectral gradient methods, such as the Muon optimizer, represent a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers. While traditional optimizers like Adam and RMSprop rely on diagonal preconditioning of the Euclidean gradient, spectral methods modify the geometry of the update by operating on the spectrum of layerwise gradient matrices.

The core question: In which regimes should one expect spectral updates to outperform standard Euclidean gradient methods?

This article synthesizes the theoretical framework from Davis & Drusvyatskiy (2025) [1], which provides a simple layerwise condition that predicts when spectral updates yield larger loss decreases than Euclidean steps. The answer lies in the interplay between the nuclear-to-Frobenius ratio of gradients and the stable rank of incoming activations.


2. Background: From Euclidean to Spectral Gradient Descent

2.1 Euclidean Gradient Descent

Standard gradient descent updates parameters by moving opposite to the gradient direction in Euclidean space:

$$W_{t+1} = W_t - \eta \, \nabla L(W_t)$$

where $\eta$ is the learning rate and $\nabla L(W_t)$ is the gradient of the loss with respect to the weight matrix $W_t$.

2.2 Spectral Gradient Descent (SpecGD)

Spectral Gradient Descent (SpecGD) [2] replaces the raw gradient with its polar factor, effectively moving in a direction with unit spectral norm:

$$W_{t+1} = W_t - \frac{\eta}{L}\,\|\nabla L(W_t)\|_* \,\mathrm{polar}(\nabla L(W_t))$$

where:

  • $\|G\|_*$ is the nuclear norm (sum of singular values)
  • $\mathrm{polar}(G) = UV^\top$ if $G = U\Sigma V^\top$ is the SVD
  • $L$ is the Lipschitz constant of the gradient measured in the operator norm
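The nuclear-norm scaling in the update is not ad hoc: it falls out of steepest descent under the spectral norm. Minimizing the linearized loss plus a spectral-norm trust penalty (a short derivation consistent with the bullets above) gives:

```latex
\Delta^\star
  \;=\; \arg\min_{\Delta}\;\Big\{\langle \nabla L(W), \Delta\rangle
        \;+\; \tfrac{L}{2}\,\|\Delta\|_{\mathrm{op}}^2\Big\}
  \;=\; -\tfrac{1}{L}\,\|\nabla L(W)\|_*\;\mathrm{polar}(\nabla L(W))
```

using the duality $\langle G, \Delta \rangle \ge -\|\Delta\|_{\mathrm{op}}\,\|G\|_*$ between the operator and nuclear norms; the minimum value achieved is $-\|\nabla L(W)\|_*^2/(2L)$, which is exactly the spectral descent guarantee that appears in Section 3.2.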

2.3 The Muon Optimizer

Muon (Momentum Orthogonalized by Newton-Schulz) [3] implements a momentum-based variant of SpecGD:

$$M_t = \beta M_{t-1} + \nabla L(W_t), \qquad W_{t+1} = W_t - \eta \,\mathrm{NS}(M_t)$$

where $\mathrm{NS}(\cdot)$ denotes orthogonalization via Newton-Schulz iteration.

Newton-Schulz Iteration provides an efficient approximation to the polar factor:

$$X_0 = \frac{M}{\|M\|_F}, \qquad X_{k+1} = \tfrac{3}{2}\, X_k - \tfrac{1}{2}\, X_k X_k^\top X_k$$

This converges double-exponentially (the approximation error is roughly squared at each iteration), requiring only matrix multiplications.
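To make the iteration concrete, here is a minimal sketch (normalizing by the exact spectral norm for clarity; a practical implementation would use a cheap upper bound such as the Frobenius norm) comparing the Newton-Schulz output against the exact polar factor:

```python
import torch

def newton_schulz_polar(M, steps=20):
    """Approximate polar(M) = U V^T via the cubic Newton-Schulz iteration."""
    # Normalize so all singular values lie in (0, 1] -- required for convergence
    X = M / torch.linalg.matrix_norm(M, ord=2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

torch.manual_seed(0)
G = torch.randn(8, 5, dtype=torch.float64)

approx = newton_schulz_polar(G)

# Exact polar factor from the SVD: U V^T
U, S, Vh = torch.linalg.svd(G, full_matrices=False)
exact = U @ Vh

print(torch.norm(approx - exact).item())  # approximation error, near machine precision
```

Only matrix multiplications appear in the loop, which is why the iteration is GPU-friendly compared with an explicit SVD.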


3. Layerwise Condition for Spectral Update Advantage

3.1 One-Step Descent Comparison

Consider the random feature regression model:

$$L(W) = \tfrac{1}{2}\, \|W A - Y\|_F^2$$

where $A$ is the incoming activation matrix and $Y$ is the target.

The loss admits the Taylor expansion:

$$L(W + \Delta) = L(W) + \langle \nabla L(W), \Delta \rangle + \tfrac{1}{2}\, \|\Delta A\|_F^2$$

The quadratic term can be bounded in two ways:

$$\|\Delta A\|_F \le \|\Delta\|_F\, \|A\|_{\mathrm{op}} \qquad \text{and} \qquad \|\Delta A\|_F \le \|\Delta\|_{\mathrm{op}}\, \|A\|_F$$

3.2 The Key Condition

The guaranteed descent for each method (writing $G = \nabla L(W)$) is:

$$\text{Euclidean:}\ \ \frac{\|G\|_F^2}{2\,\|A\|_{\mathrm{op}}^2} \qquad\qquad \text{Spectral:}\ \ \frac{\|G\|_*^2}{2\,\|A\|_F^2}$$

Spectral descent is favored whenever:

$$\frac{\|G\|_*^2}{\|A\|_F^2} \ \ge\ \frac{\|G\|_F^2}{\|A\|_{\mathrm{op}}^2}$$

3.3 Nuclear Rank and Stable Rank

Define the key quantities:

| Quantity | Definition | Meaning |
| --- | --- | --- |
| Nuclear rank | $\mathrm{rank}_*(G) = \lVert G\rVert_*^2 / \lVert G\rVert_F^2$ | How spread out the gradient’s singular values are |
| Stable rank | $\mathrm{sr}(A) = \lVert A\rVert_F^2 / \lVert A\rVert_{\mathrm{op}}^2$ | Degeneracy of the activation matrix |

The Layerwise Condition [1]:

$$\mathrm{rank}_*(G) \ \ge\ \mathrm{sr}(A)$$

When this condition holds, spectral updates yield larger loss decreases than Euclidean updates.
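The condition can be checked directly, since the ratio of the two one-step guarantees from Section 3.2 is exactly $\mathrm{rank}_*(G)/\mathrm{sr}(A)$. A small numerical sketch (synthetic $G$ and $A$, chosen purely for illustration):

```python
import torch

torch.manual_seed(1)

G = torch.randn(64, 64, dtype=torch.float64)   # dense spectrum -> high nuclear rank
A = torch.randn(64, 256, dtype=torch.float64)
A[0] *= 50.0                                   # plant a spike -> low stable rank

sv_G = torch.linalg.svdvals(G)
sv_A = torch.linalg.svdvals(A)

nuclear_rank = (sv_G.sum() ** 2 / (sv_G ** 2).sum()).item()   # ||G||_*^2 / ||G||_F^2
stable_rank = ((sv_A ** 2).sum() / sv_A[0] ** 2).item()       # ||A||_F^2 / ||A||_op^2

# Guaranteed one-step decreases (Section 3.2)
euclid = ((sv_G ** 2).sum() / (2 * sv_A[0] ** 2)).item()      # ||G||_F^2 / (2 ||A||_op^2)
spectral = (sv_G.sum() ** 2 / (2 * (sv_A ** 2).sum())).item() # ||G||_*^2 / (2 ||A||_F^2)

print(spectral / euclid)           # ratio of the two guarantees
print(nuclear_rank / stable_rank)  # identical, by algebra
```

With a dense Gaussian gradient and a spiked activation matrix, the spectral guarantee exceeds the Euclidean one by a large factor.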

Interpretation:

  • Low stable rank $\mathrm{sr}(A)$: the activations are highly degenerate
  • High nuclear rank $\mathrm{rank}_*(G)$: the gradient has many significant singular values
  • When the gradient’s spectrum is much “richer” than the activations’, spectral methods shine

4. Nuclear-to-Frobenius Ratio vs Stable Rank

4.1 Mathematical Properties

Nuclear Norm: the sum of singular values,

$$\|G\|_* = \sum_i \sigma_i(G)$$

Frobenius Norm: the Euclidean norm of the flattened matrix,

$$\|G\|_F = \Big(\sum_i \sigma_i(G)^2\Big)^{1/2}$$

Nuclear-to-Frobenius Ratio:

$$1 \ \le\ \frac{\|G\|_*}{\|G\|_F} \ \le\ \sqrt{\mathrm{rank}(G)}$$

This ratio measures how uniform the singular value distribution is:

  • Uniform spectrum (all $\sigma_i$ equal): ratio $= \sqrt{r}$ for rank $r$ (large)
  • Spiky spectrum (one dominant $\sigma_1$): ratio $\approx 1$ (small)
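These two extremes are easy to verify numerically (toy diagonal matrices, purely illustrative):

```python
import torch

r = 100
uniform = torch.eye(r, dtype=torch.float64)   # all singular values equal
spiky = torch.zeros(r, r, dtype=torch.float64)
spiky[0, 0] = 100.0                           # one dominant singular value
spiky[1, 1] = 1e-3                            # negligible remaining mass

def nuc_to_fro(M):
    """Nuclear-to-Frobenius ratio ||M||_* / ||M||_F."""
    sv = torch.linalg.svdvals(M)
    return (sv.sum() / torch.linalg.norm(sv)).item()

print(nuc_to_fro(uniform))  # sqrt(100) = 10.0 for the uniform spectrum
print(nuc_to_fro(spiky))    # ~1.0 for the spiky spectrum
```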

4.2 Stable Rank Bounds

The stable rank satisfies:

$$1 \ \le\ \mathrm{sr}(A) = \frac{\|A\|_F^2}{\|A\|_{\mathrm{op}}^2} \ \le\ \mathrm{rank}(A)$$

Low stable rank means most “energy” is concentrated in the top singular value:

$$\sigma_1(A)^2 \ \approx\ \sum_i \sigma_i(A)^2$$

4.3 Dimensional Scaling

The ratio of descent guarantees scales as:

$$\frac{\text{spectral guarantee}}{\text{Euclidean guarantee}} \ =\ \frac{\mathrm{rank}_*(G)}{\mathrm{sr}(A)}$$

When $\mathrm{sr}(A) = O(1)$ (constant) and $\mathrm{rank}_*(G) = \Theta(d)$ (grows with dimension), the spectral advantage scales linearly with the dimension $d$.
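The $\Theta(d)$ growth of the nuclear rank is already visible for dense random matrices: by the quarter-circle law, a $d \times d$ matrix with i.i.d. Gaussian entries has $\mathrm{rank}_*(G) \approx (8/3\pi)^2\, d \approx 0.72\, d$. A quick illustrative check:

```python
import torch

torch.manual_seed(0)

def nuclear_rank(M):
    """rank_*(M) = ||M||_*^2 / ||M||_F^2."""
    sv = torch.linalg.svdvals(M)
    return (sv.sum() ** 2 / (sv ** 2).sum()).item()

for d in [64, 256, 1024]:
    G = torch.randn(d, d, dtype=torch.float64)
    print(d, nuclear_rank(G) / d)   # hovers near (8 / (3*pi))^2 ~ 0.72
```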


5. Low Stable Rank at Gaussian Initialization

5.1 Feedforward Networks

Consider a feedforward neural network with weight matrices $W_1, \dots, W_L$ and activation function $\varphi$. The post-activation matrices are:

$$A_\ell = \varphi(W_\ell A_{\ell-1}), \qquad A_0 = X$$

Theorem (Davis & Drusvyatskiy, 2025 [1]):

With high probability over Gaussian initialization:

  1. Second layer: for any fixed activation $\varphi$, $\mathrm{sr}(A_2)$ is bounded by a constant independent of the width
  2. Any layer with random weights: for random $W_\ell$, $\mathrm{sr}(A_\ell)$ remains $O(1)$
  3. Quadratic activations: for $\varphi(t) = t^2$, the bound on the stable rank depends only on the depth
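The width-independence in item 2 can be observed directly. For ReLU post-activations of Gaussian data under Gaussian weights, the positive mean of the activations plants a dominant rank-one component, so the measured stable rank stays small and roughly flat as the width grows (a minimal illustration with arbitrary sizes, not the paper’s exact setup):

```python
import torch

torch.manual_seed(0)

def stable_rank(M):
    """sr(M) = ||M||_F^2 / ||M||_op^2."""
    sv = torch.linalg.svdvals(M)
    return ((sv ** 2).sum() / sv[0] ** 2).item()

d, n = 512, 512                      # input dimension, number of samples
X = torch.randn(d, n, dtype=torch.float64)

for width in [128, 512, 2048]:
    W = torch.randn(width, d, dtype=torch.float64) / d ** 0.5   # Gaussian init
    A = torch.relu(W @ X)                                       # post-activations
    print(width, stable_rank(A))     # stays small, roughly flat in width
```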

5.2 Transformer Blocks

For decoder-only transformers with RMS-normalized attention/MLP blocks:

Corollary: At Gaussian initialization, the hidden representations entering attention and MLP projections have stable rank bounded by a depth-dependent constant that is independent of width and sequence length.

5.3 Why Low Stable Rank Emerges

The low stable rank arises from:

  1. Activation functions: ReLU, GELU, SwiGLU create sparse/structured outputs
  2. Initialization: Gaussian weights produce peaked singular value distributions
  3. Propagation: Layer compositions amplify the degeneracy

This is not just an initialization artifact—empirical studies show stable rank remains low throughout training in NanoGPT-scale models.


6. Spiked Random Feature Model Analysis

6.1 Model Setup

The spiked random feature model captures essential aspects of neural network training:

$$y \ =\ \langle \theta^\star, \varphi(x) \rangle$$

where $\varphi$ is a feature map and $\theta^\star$ are the ground-truth parameters.

6.2 Nuclear Rank Evolution

Theorem (Multi-step Gradient Descent [1]):

After a short burn-in period of gradient iterations:

  • $\mathrm{rank}_*(G_t) = \Omega(d)$ holds over a sustained window of iterations
  • For any fixed tolerance $\varepsilon$, the condition persists for the iterations needed to reach relative error $\varepsilon$

Meanwhile, Euclidean GD needs correspondingly many steps to reach relative error $\varepsilon$, so the spectral-advantage window represents a constant fraction of training time.

6.3 Key Insight

The nuclear rank of the gradients grows with the dimension $d$ while the stable rank of the activations remains bounded, making the advantage dimension-dependent:

$$\frac{\mathrm{rank}_*(G)}{\mathrm{sr}(A)} \ =\ \Theta(d)$$


7. Dimension Scaling Advantage

7.1 Theoretical Scaling

For high-dimensional problems ($d \gg 1$):

| Quantity | Behavior | Implication |
| --- | --- | --- |
| $\mathrm{sr}(A)$ | constant | Independent of dimension |
| $\mathrm{rank}_*(G)$ | grows as $\Theta(d)$ | Increases with dimension |
| Spectral advantage | $\Theta(d)$ | Linear speedup possible |

7.2 Empirical Validation

Experiments on random feature regression show:

  • SpecGD reaches the target loss in substantially fewer steps than GD, with the gap widening in higher dimensions
  • Doubling the dimension approximately doubles the speedup ratio
  • The nuclear rank of the gradients remains high throughout training

7.3 NanoGPT-Scale Experiments

Training experiments on NanoGPT-scale language models [1]:

  • Intermediate activations have low stable rank throughout training
  • Corresponding gradients maintain large nuclear-to-Frobenius ratios
  • This validates the theoretical predictions at realistic scale

8. Empirical Validation in NanoGPT Training

8.1 Experimental Setup

The paper validates predictions using the modded-NanoGPT repository:

  • Architecture: Standard transformer with attention and MLP blocks
  • Training: Full-batch and stochastic gradient descent variants
  • Monitoring: Stable rank of post-activations and nuclear rank of gradients

8.2 Key Findings

  1. MLP Post-Activations: stable rank remains far below its maximal possible value

  2. Nuclear Rank Persistence: Unlike initialization-only effects, high nuclear rank persists throughout training

  3. Layerwise Variation: Different layers show varying degrees of spectral advantage, matching the layerwise condition

8.3 Practical Implications

  • Spectral updates are most beneficial for intermediate layers where activations are most degenerate
  • First and last layers may not benefit as much from spectral methods
  • The advantage compounds when spectral methods are applied consistently across multiple layers

9. Practical Guidelines for Using Spectral Methods

9.1 When to Use Spectral Gradient Methods

Favorable conditions:

  • Large matrix-shaped parameters (linear layers, embeddings)
  • High-dimensional inputs (large $d$)
  • Training deep networks or transformers
  • Situations where the gradient’s nuclear rank exceeds the activations’ stable rank

Less favorable conditions:

  • Very small models with few parameters
  • Settings with gated activations that may increase stable rank
  • When computational overhead of SVD/orthogonalization is prohibitive

9.2 Muon Implementation Guidelines

# Muon optimizer implementation sketch
import torch

def muon_update(W, grad, momentum, lr, beta=0.9, num_ns_steps=5):
    # Update momentum (exponential moving average of gradients)
    momentum = beta * momentum + (1 - beta) * grad
    
    # Normalize so singular values lie in (0, 1], as Newton-Schulz requires
    X = momentum / (torch.norm(momentum) + 1e-8)
    
    # Newton-Schulz orthogonalization (approximates the polar factor)
    for _ in range(num_ns_steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    
    # Apply update
    W = W - lr * X
    
    return W, momentum

9.3 Hyperparameter Recommendations

| Parameter | Typical Value | Notes |
| --- | --- | --- |
| Learning rate | 1e-3 | Often lower than Adam’s |
| Momentum ($\beta$) | 0.9 | Standard momentum values |
| Newton-Schulz steps | 5 | More steps = more accurate, but slower |
| Weight decay | 0.01 | Similar to AdamW |

(These match the defaults used in the `MuonOptimizer` sketch of Section 10.3.)

9.4 Hybrid Approaches

In practice, Muon is often combined with Adam for non-matrix parameters:

  • Matrix parameters (linear layers): Muon
  • Scalar parameters (biases, layernorms): Adam/AdamW

This hybrid approach leverages the strengths of each method.


10. Code Examples

10.1 SpecGD Implementation

import torch
import torch.nn.functional as F
 
def spectral_polar_and_nuclear(grad):
    """Compute polar factor and nuclear norm of gradient."""
    g = grad.to(torch.float64)
    # Compute singular values via eigendecomposition of G @ G^T
    gram = g @ g.T
    evals, evecs = torch.linalg.eigh(gram)
    evals = torch.clamp(evals, min=0.0)
    singulars = torch.sqrt(evals)
    nuclear = singulars.sum().item()
    
    # Compute polar factor: UV^T
    mask = singulars > 0
    if mask.any():
        vectors = evecs[:, mask]
        inv_sqrt = vectors / singulars[mask]
        polar = (inv_sqrt @ vectors.T) @ g
    else:
        polar = torch.zeros_like(g)
    
    return polar.to(dtype=grad.dtype), nuclear
 
 
def specgd_step(W, grad, A, lr):
    """
    SpecGD update: W <- W - lr * (||G||_* / ||A||_F^2) * polar(G)
    """
    fro_norm_A_sq = torch.sum(A * A).item()
    
    polar, nuclear = spectral_polar_and_nuclear(grad)
    scale = nuclear / fro_norm_A_sq
    
    W = W - lr * scale * polar
    return W

10.2 Computing Nuclear and Stable Rank

def stable_rank(matrix):
    """Compute stable rank: ||A||_F^2 / ||A||_op^2"""
    mat = matrix.to(torch.float64)
    fro_sq = torch.sum(mat * mat)
    op_norm = torch.linalg.matrix_norm(mat, ord=2)
    return (fro_sq / (op_norm * op_norm)).item()
 
 
def nuclear_rank(gradient):
    """
    Compute nuclear rank: ||G||_*^2 / ||G||_F^2
    Measures how uniform the singular value distribution is.
    """
    grad = gradient.to(torch.float64)
    # Frobenius norm squared
    fro_sq = torch.sum(grad * grad).item()
    
    # Nuclear norm via singular values
    gram = grad @ grad.T
    evals = torch.linalg.eigvalsh(gram)
    evals = torch.clamp(evals, min=0.0)
    singulars = torch.sqrt(evals)
    nuclear = singulars.sum().item()
    
    if fro_sq == 0:
        return float('inf') if nuclear > 0 else 0.0
    
    return (nuclear ** 2) / fro_sq
 
 
def check_spectral_advantage(grad, activation):
    """
    Check if spectral update would be advantageous.
    Returns: (advantage_ratio, recommendation)
    """
    nr = nuclear_rank(grad)
    st = stable_rank(activation)
    
    advantage_ratio = nr / st
    
    if advantage_ratio >= 2:
        return advantage_ratio, "Use spectral (SpecGD/Muon)"
    elif advantage_ratio >= 1:
        return advantage_ratio, "Spectral may help"
    else:
        return advantage_ratio, "Use Euclidean (SGD/Adam)"

10.3 Complete Muon Optimizer Class

import torch
import torch.nn as nn
 
class MuonOptimizer:
    """
    Muon: Momentum Orthogonalized by Newton-Schulz
    
    For matrix-shaped parameters, replaces gradient with its polar factor.
    For other parameters, falls back to Adam.
    """
    
    def __init__(self, params, lr=1e-3, momentum=0.9, 
                 weight_decay=0.01, ns_steps=5):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.ns_steps = ns_steps
        
        # Separate matrix and non-matrix parameters
        self.matrix_params = []
        self.other_params = []
        
        for p in params:
            if p.requires_grad:
                if p.dim() >= 2 and p.shape[0] * p.shape[1] > 100:
                    self.matrix_params.append(p)
                else:
                    self.other_params.append(p)
        
        # Initialize momentum buffers for matrix params
        self.momentum_buffers = [
            torch.zeros_like(p) for p in self.matrix_params
        ]
        
        # Adam state for other params
        self.adam_state = {
            'exp_avg': [torch.zeros_like(p) for p in self.other_params],
            'exp_avg_sq': [torch.zeros_like(p) for p in self.other_params],
        }
        self.step_count = 0
    
    def newton_schulz_iteration(self, X):
        """Orthogonalize matrix using Newton-Schulz iteration."""
        for _ in range(self.ns_steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X
        return X
    
    def step(self):
        self.step_count += 1
        
        # Handle matrix parameters with Muon
        for i, p in enumerate(self.matrix_params):
            if p.grad is None:
                continue
            
            grad = p.grad.data
            
            # Update momentum
            self.momentum_buffers[i].mul_(self.momentum).add_(grad)
            
            # Normalize for Newton-Schulz
            normed = self.momentum_buffers[i] / (torch.norm(self.momentum_buffers[i]) + 1e-8)
            
            # Orthogonalize
            orthogonalized = self.newton_schulz_iteration(normed)
            
            # Update weights
            p.data.mul_(1 - self.lr * self.weight_decay)
            p.data.add_(orthogonalized, alpha=-self.lr)
        
        # Handle other parameters with Adam
        beta1, beta2, eps = 0.9, 0.999, 1e-8
        
        for i, p in enumerate(self.other_params):
            if p.grad is None:
                continue
            
            grad = p.grad.data
            
            # Adam update
            self.adam_state['exp_avg'][i].mul_(beta1).add_(grad, alpha=1-beta1)
            self.adam_state['exp_avg_sq'][i].mul_(beta2).add_(grad * grad, alpha=1-beta2)
            
            # Bias correction
            bias_correct1 = 1 - beta1 ** self.step_count
            bias_correct2 = 1 - beta2 ** self.step_count
            
            # Compute update
            step_size = self.lr / bias_correct1
            denom = (self.adam_state['exp_avg_sq'][i] / bias_correct2).sqrt().add_(eps)
            
            p.data.mul_(1 - self.lr * self.weight_decay)
            p.data.addcdiv_(self.adam_state['exp_avg'][i], denom, value=-step_size)
    
    def zero_grad(self):
        for p in self.matrix_params + self.other_params:
            p.grad = None

10.4 Training Loop Comparison

def train_comparison(model, train_loader, num_epochs=10, use_spectral=True):
    """
    Compare spectral vs Euclidean gradient descent.
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    if use_spectral:
        optimizer = MuonOptimizer(model.parameters(), lr=1e-3)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}: Loss = {avg_loss:.4f}")
    
    return model

11. Relationship to Other Concepts

11.1 Connection to Natural Gradient

Spectral gradient methods are closely related to natural gradient descent on the Stiefel manifold. The polar factor is the natural gradient direction when the parameter space is constrained to orthogonal matrices.

11.2 Connection to Shampoo

Shampoo [4] uses Kronecker-factored preconditioning built from the left and right second-moment factors of the gradients. While the two methods differ in cost and statefulness, Muon’s orthogonalization can be seen as a memoryless, “harder” version of this preconditioning that fully flattens the spectral structure of each update.

11.3 Connection to SignSGD

Both SignSGD and SpecGD can be viewed as steepest descent under non-Euclidean norms:

  • SignSGD: steepest descent under the $\ell_\infty$ norm
  • SpecGD/Muon: steepest descent under the spectral (operator) norm

The Spec-Sign Advantage Index [5] provides a unified criterion for choosing between them based on the nuclear-norm vs. $\ell_1$-norm signal-to-noise ratios.


12. Summary

The theory of spectral gradient updates provides a principled understanding of when methods like Muon outperform traditional optimizers:

| Key Insight | Implication |
| --- | --- |
| $\mathrm{rank}_*(G) \ge \mathrm{sr}(A)$ | Condition for spectral advantage |
| Low stable rank of activations | Ubiquitous in deep networks |
| Dimension scaling | Advantage grows with $d$ |
| Persistence throughout training | Not just an initialization effect |

Practical Takeaways:

  1. Use spectral methods (Muon) for matrix-shaped parameters in deep networks
  2. The advantage is strongest in intermediate layers with low stable rank activations
  3. Dimension matters: larger models/datasets benefit more from spectral methods
  4. Consider hybrid approaches: Muon for linear layers + Adam for others

Footnotes

  1. Davis, D., & Drusvyatskiy, D. (2025). When do spectral gradient updates help in deep learning? arXiv:2512.04299. https://arxiv.org/abs/2512.04299

  2. Carlson, D., et al. (2015). Preconditioned Spectral Gradient Descent for Matrix Optimization. arXiv.

  3. Jordan, K., et al. (2024). Muon: Momentum Orthogonalized by Newton-Schulz.

  4. Gupta, V., et al. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. ICML 2018.

  5. Davis, D., et al. (2025). The Geometry of Spectral Gradient Descent: Layerwise Criteria for SignSGD vs SpecSGD. OpenReview.