1. Introduction
Spectral gradient methods, such as the Muon optimizer, represent a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers. While traditional optimizers like Adam and RMSprop rely on diagonal preconditioning of the Euclidean gradient, spectral methods modify the geometry of the update by operating on the spectrum of layerwise gradient matrices.
The core question: In which regimes should one expect spectral updates to outperform standard Euclidean gradient methods?
This article synthesizes the theoretical framework of Davis & Drusvyatskiy (2025)[^1], which provides a simple layerwise condition predicting when spectral updates yield larger loss decreases than Euclidean steps. The answer lies in the interplay between the nuclear-to-Frobenius ratio of the gradients and the stable rank of the incoming activations.
2. Background: From Euclidean to Spectral Gradient Descent
2.1 Euclidean Gradient Descent
Standard gradient descent updates parameters by moving opposite to the gradient direction in Euclidean space:

$$W_{t+1} = W_t - \eta \, \nabla L(W_t),$$

where $\eta > 0$ is the learning rate and $\nabla L(W_t)$ is the gradient of the loss with respect to the weight matrix $W_t$.
2.2 Spectral Gradient Descent (SpecGD)
Spectral Gradient Descent (SpecGD)[^2] replaces the raw gradient with its polar factor, effectively moving in a direction with unit spectral norm:

$$W_{t+1} = W_t - \frac{\|G_t\|_*}{L_{\mathrm{op}}} \, \mathrm{polar}(G_t), \qquad G_t = \nabla L(W_t),$$

where:
- $\|G\|_* = \sum_i \sigma_i(G)$ is the nuclear norm (sum of singular values)
- $\mathrm{polar}(G) = UV^\top$ if $G = U \Sigma V^\top$ is the SVD
- $L_{\mathrm{op}}$ is the operator-norm-dependent Lipschitz constant of the gradient
2.3 The Muon Optimizer
Muon (Momentum Orthogonalized by Newton-Schulz)[^3] implements a momentum-based variant of SpecGD:

$$M_t = \beta M_{t-1} + G_t, \qquad W_{t+1} = W_t - \eta \, \mathrm{NS}(M_t),$$

where $\mathrm{NS}(\cdot)$ denotes orthogonalization via the Newton-Schulz iteration.
The Newton-Schulz iteration provides an efficient approximation to the polar factor:

$$X_0 = \frac{M}{\|M\|_F}, \qquad X_{k+1} = \tfrac{3}{2} X_k - \tfrac{1}{2} X_k X_k^\top X_k.$$

This converges double-exponentially and requires only matrix multiplications, avoiding an explicit SVD.
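As a sanity check, the following minimal sketch (our construction, not from the paper) compares the cubic Newton-Schulz output against the exact polar factor computed from an SVD:

```python
import torch

def newton_schulz(M, steps=25):
    # Normalize so all singular values lie in (0, 1]; required for convergence
    X = M / torch.linalg.matrix_norm(M, ord=2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

torch.manual_seed(0)
M = torch.randn(64, 64, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(M)
polar_exact = U @ Vh  # exact polar factor U V^T
err = torch.linalg.matrix_norm(polar_exact - newton_schulz(M), ord=2)
print(f"operator-norm error: {err:.2e}")  # shrinks rapidly with more steps
```

Production Muon implementations typically use a tuned higher-order polynomial iteration for faster convergence; the cubic form shown here matches the sketch in Section 9.2.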
3. Layerwise Condition for Spectral Update Advantage
3.1 One-Step Descent Comparison
Consider the random feature regression model

$$L(W) = \tfrac{1}{2} \, \| WA - Y \|_F^2,$$

where $A \in \mathbb{R}^{d \times n}$ is the incoming activation matrix and $Y$ is the target.

The loss admits the exact Taylor expansion (with $G = \nabla L(W) = (WA - Y)A^\top$):

$$L(W + \Delta) = L(W) + \langle G, \Delta \rangle + \tfrac{1}{2} \, \| \Delta A \|_F^2.$$

The quadratic term can be bounded in two ways:

$$\|\Delta A\|_F^2 \le \|A\|_{\mathrm{op}}^2 \, \|\Delta\|_F^2 \qquad \text{and} \qquad \|\Delta A\|_F^2 \le \|A\|_F^2 \, \|\Delta\|_{\mathrm{op}}^2.$$
3.2 The Key Condition
Minimizing each bound over the step size yields the guaranteed descent for each method:

$$\text{GD:} \quad \frac{\|G\|_F^2}{2\,\|A\|_{\mathrm{op}}^2}, \qquad \text{SpecGD:} \quad \frac{\|G\|_*^2}{2\,\|A\|_F^2}.$$

Spectral descent is favored whenever:

$$\frac{\|G\|_*^2}{2\,\|A\|_F^2} \;\ge\; \frac{\|G\|_F^2}{2\,\|A\|_{\mathrm{op}}^2}.$$
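To make this concrete, here is a small sketch (the shapes and the low-rank construction are ours) that evaluates both guarantees on a synthetic instance with degenerate activations:

```python
import torch

torch.manual_seed(0)
d, n, out = 256, 512, 64
# Low-stable-rank activations: rank-4 structure plus small noise
A = torch.randn(d, 4) @ torch.randn(4, n) + 0.01 * torch.randn(d, n)
G = torch.randn(out, d)  # a generic (high-nuclear-rank) gradient

nuc_G = torch.linalg.svdvals(G).sum()                # ||G||_*
fro_G_sq = (G ** 2).sum()                            # ||G||_F^2
op_A_sq = torch.linalg.matrix_norm(A, ord=2) ** 2    # ||A||_op^2
fro_A_sq = (A ** 2).sum()                            # ||A||_F^2

gd_decrease = fro_G_sq / (2 * op_A_sq)               # Euclidean guarantee
spec_decrease = nuc_G ** 2 / (2 * fro_A_sq)          # spectral guarantee
print(f"GD: {gd_decrease:.3f}, SpecGD: {spec_decrease:.3f}, "
      f"ratio: {spec_decrease / gd_decrease:.1f}")   # ratio = rank_*(G) / srank(A)
```

The printed ratio is exactly the nuclear-rank-to-stable-rank quotient introduced next.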
3.3 Nuclear Rank and Stable Rank
Define the key quantities:
| Quantity | Definition | Meaning |
|---|---|---|
| Nuclear rank $\mathrm{rank}_*(G)$ | $\|G\|_*^2 / \|G\|_F^2$ | How spread out the gradient's singular values are |
| Stable rank $\mathrm{srank}(A)$ | $\|A\|_F^2 / \|A\|_{\mathrm{op}}^2$ | Degeneracy of the activation matrix |
The Layerwise Condition[^1]:

$$\mathrm{rank}_*(G) \;=\; \frac{\|G\|_*^2}{\|G\|_F^2} \;\ge\; \frac{\|A\|_F^2}{\|A\|_{\mathrm{op}}^2} \;=\; \mathrm{srank}(A).$$
When this condition holds, spectral updates yield larger loss decreases than Euclidean updates.
Interpretation:
- Low stable rank $\mathrm{srank}(A)$: the activations are highly degenerate
- High nuclear rank $\mathrm{rank}_*(G)$: the gradient has many significant singular values
- When the gradient’s structure is much “richer” than the activation’s, spectral methods shine
4. Nuclear-to-Frobenius Ratio vs Stable Rank
4.1 Mathematical Properties
Nuclear norm: $\|G\|_* = \sum_i \sigma_i(G)$, the sum of singular values.

Frobenius norm: $\|G\|_F = \sqrt{\sum_i \sigma_i(G)^2}$, the Euclidean norm of the flattened matrix.

Nuclear-to-Frobenius ratio:

$$\frac{\|G\|_*}{\|G\|_F} = \frac{\sum_i \sigma_i}{\sqrt{\sum_i \sigma_i^2}} \in \left[1, \sqrt{\mathrm{rank}(G)}\right].$$

This ratio measures how uniform the singular value distribution is:
- Uniform spectrum (all $\sigma_i$ equal): ratio $= \sqrt{r}$ for rank $r$ (large)
- Spiky spectrum (one dominant $\sigma_1$): ratio $\approx 1$ (small)
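A short numerical check (synthetic spectra, our construction) confirms the two extremes:

```python
import torch

r = 100
spectra = {
    "uniform": torch.ones(r),                          # all sigma_i equal
    "spiky": torch.tensor([10.0] + [0.01] * (r - 1)),  # one dominant sigma_1
}
for name, s in spectra.items():
    ratio = s.sum() / torch.sqrt((s ** 2).sum())
    print(f"{name}: ||G||_* / ||G||_F = {ratio:.2f}")  # ~10 = sqrt(r) vs ~1
```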
4.2 Stable Rank Bounds
The stable rank satisfies:

$$1 \;\le\; \mathrm{srank}(A) = \frac{\|A\|_F^2}{\|A\|_{\mathrm{op}}^2} \;\le\; \mathrm{rank}(A).$$

Low stable rank means most of the "energy" is concentrated in the top singular value:

$$\mathrm{srank}(A) \approx 1 \iff \sigma_1(A)^2 \approx \sum_i \sigma_i(A)^2.$$
4.3 Dimensional Scaling
The ratio of descent guarantees scales as:

$$\frac{\text{SpecGD decrease}}{\text{GD decrease}} \;=\; \frac{\|G\|_*^2 / \|A\|_F^2}{\|G\|_F^2 / \|A\|_{\mathrm{op}}^2} \;=\; \frac{\mathrm{rank}_*(G)}{\mathrm{srank}(A)}.$$

When $\mathrm{srank}(A) = O(1)$ (constant) and $\mathrm{rank}_*(G) = \Theta(d)$ (grows with dimension), the spectral advantage scales linearly with the dimension $d$.
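The scaling is easy to see numerically. The sketch below (a toy construction of ours) pairs a full-spectrum Gaussian gradient with rank-one activations and prints the advantage ratio as the dimension grows:

```python
import torch

torch.manual_seed(0)
for d in [64, 128, 256, 512]:
    G = torch.randn(d, d)                          # full spectrum: rank_* ~ 0.7 d
    A = torch.randn(d, 1) @ torch.randn(1, 4 * d)  # rank-1 activations: srank = 1
    nr = torch.linalg.svdvals(G).sum() ** 2 / (G ** 2).sum()
    sr = (A ** 2).sum() / torch.linalg.matrix_norm(A, ord=2) ** 2
    print(f"d={d:4d}  nuclear_rank={nr:6.1f}  stable_rank={sr:.2f}  "
          f"advantage={nr / sr:6.1f}")             # grows roughly linearly in d
```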
5. Low Stable Rank at Gaussian Initialization
5.1 Feedforward Networks
Consider a feedforward neural network with weight matrices $W_1, \dots, W_L$ and activation function $\phi$. The post-activation matrices are:

$$A_\ell = \phi(W_\ell A_{\ell-1}), \qquad A_0 = X.$$
Theorem (Davis & Drusvyatskiy, 2025[^1]):
With high probability over Gaussian initialization:
- Second layer: for any fixed activation $\phi$, $\mathrm{srank}(A_2)$ is bounded by a constant independent of the width
- Any layer with random weights: for random Gaussian $W_1, \dots, W_\ell$, $\mathrm{srank}(A_\ell) = O(1)$
- Quadratic activations: for $\phi(t) = t^2$, the bound on $\mathrm{srank}(A_\ell)$ depends only on the depth $\ell$
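The following sketch (a toy ReLU network; widths and depth are our choices, not the paper's setup) measures the stable rank of post-activations at Gaussian initialization and shows it staying near a small constant, far below the width:

```python
import torch

def stable_rank(M):
    return ((M ** 2).sum() / torch.linalg.matrix_norm(M, ord=2) ** 2).item()

torch.manual_seed(0)
width, n_samples, depth = 1024, 2048, 4
A = torch.randn(width, n_samples)  # input batch
for layer in range(1, depth + 1):
    W = torch.randn(width, width) / width ** 0.5  # Gaussian init, variance 1/width
    A = torch.relu(W @ A)
    print(f"layer {layer}: stable_rank = {stable_rank(A):.1f} "
          f"(max possible {min(A.shape)})")
```

The effect here is driven by the positive mean that ReLU induces, which creates a single dominant singular direction shared across samples.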
5.2 Transformer Blocks
For decoder-only transformers with RMS-normalized attention/MLP blocks:
Corollary: At Gaussian initialization, the hidden representations entering attention and MLP projections have stable rank bounded by a depth-dependent constant that is independent of width and sequence length.
5.3 Why Low Stable Rank Emerges
The low stable rank arises from:
- Activation functions: ReLU, GELU, SwiGLU create sparse/structured outputs
- Initialization: Gaussian weights produce peaked singular value distributions
- Propagation: Layer compositions amplify the degeneracy
This is not just an initialization artifact—empirical studies show stable rank remains low throughout training in NanoGPT-scale models.
6. Spiked Random Feature Model Analysis
6.1 Model Setup
The spiked random feature model captures essential aspects of neural network training:
$$Y = W_\star \, \phi(X),$$

where $\phi$ is a feature map and $W_\star$ are the ground-truth parameters; the model is fit by regressing $W$ against the random features $A = \phi(X)$.
6.2 Nuclear Rank Evolution
Theorem (Multi-step Gradient Descent[^1]):
After a short burn-in period:
- $\mathrm{rank}_*(G_t) = \Omega(d)$ holds for a long window of iterations
- For any fixed $\epsilon > 0$, $\mathrm{rank}_*(G_t) \ge (1 - \epsilon)\, d$ holds over a (shorter) window of iterations

Meanwhile, the number of steps Euclidean GD needs to reach a fixed relative error $\epsilon$ is large enough that the spectral advantage window represents a constant fraction of training time.
6.3 Key Insight
The nuclear rank of the gradients grows with the dimension while the stable rank of the activations remains bounded, making the advantage dimension-dependent:

$$\frac{\mathrm{rank}_*(G)}{\mathrm{srank}(A)} = \Theta(d).$$
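A minimal simulation (a plain random-feature regression of our own, not the paper's exact spiked model) shows the gradient's nuclear rank staying a sizeable fraction of the dimension along a GD trajectory:

```python
import torch

torch.manual_seed(0)
d, n, k = 128, 512, 128
A = torch.randn(d, n) / d ** 0.5   # random features
Y = torch.randn(k, d) @ A          # targets from ground-truth parameters
W = torch.zeros(k, d)

lr = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2  # stable step size
for t in range(201):
    G = (W @ A - Y) @ A.T          # gradient of 0.5 * ||W A - Y||_F^2
    if t % 50 == 0:
        s = torch.linalg.svdvals(G)
        print(f"t={t:3d}  nuclear_rank(G) = {(s.sum() ** 2 / (s ** 2).sum()):.1f} of d={d}")
    W = W - lr * G
```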
7. Dimension Scaling Advantage
7.1 Theoretical Scaling
For high-dimensional problems ($d \gg 1$):
| Quantity | Behavior | Implication |
|---|---|---|
| $\mathrm{srank}(A)$ | Constant | Independent of dimension |
| $\mathrm{rank}_*(G)$ | Grows | Increases with dimension $d$ |
| Spectral advantage $\mathrm{rank}_*(G)/\mathrm{srank}(A)$ | $\Theta(d)$ | Linear speedup possible |
7.2 Empirical Validation
Experiments on random feature regression show:
- At a fixed dimension, SpecGD reaches the target loss in substantially fewer steps than GD
- Doubling the dimension approximately doubles the speedup ratio
- The nuclear rank of gradients remains high throughout training
7.3 NanoGPT-Scale Experiments
Training experiments on NanoGPT-scale language models[^1]:
- Intermediate activations have low stable rank throughout training
- Corresponding gradients maintain large nuclear-to-Frobenius ratios
- This validates the theoretical predictions at realistic scale
8. Empirical Validation in NanoGPT Training
8.1 Experimental Setup
The paper validates predictions using the modded-NanoGPT repository:
- Architecture: Standard transformer with attention and MLP blocks
- Training: Full-batch and stochastic gradient descent variants
- Monitoring: Stable rank of post-activations and nuclear rank of gradients
8.2 Key Findings
- MLP post-activations: stable rank remains far below its maximal possible value (the layer width)
- Nuclear rank persistence: unlike initialization-only effects, high nuclear rank persists throughout training
- Layerwise variation: different layers show varying degrees of spectral advantage, matching the layerwise condition
8.3 Practical Implications
- Spectral updates are most beneficial for intermediate layers where activations are most degenerate
- First and last layers may not benefit as much from spectral methods
- The advantage compounds when spectral methods are applied consistently across multiple layers
9. Practical Guidelines for Using Spectral Methods
9.1 When to Use Spectral Gradient Methods
Favorable conditions:
- Large matrix-shaped parameters (linear layers, embeddings)
- High-dimensional inputs (large $d$)
- Training deep networks or transformers
- Situations where nuclear rank $\gg$ stable rank, i.e. $\mathrm{rank}_*(G) \gg \mathrm{srank}(A)$
Less favorable conditions:
- Very small models with few parameters
- Settings with gated activations that may increase stable rank
- When computational overhead of SVD/orthogonalization is prohibitive
9.2 Muon Implementation Guidelines
```python
import torch

# Muon optimizer update sketch
def muon_update(W, grad, momentum, lr, beta=0.9, num_ns_steps=5):
    # Update momentum buffer (exponential moving average of gradients)
    momentum = beta * momentum + (1 - beta) * grad
    # Normalize so singular values lie in [0, 1] before Newton-Schulz
    X = momentum / (torch.norm(momentum) + 1e-8)
    # Newton-Schulz orthogonalization toward the polar factor
    for _ in range(num_ns_steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    # Apply the orthogonalized update
    W = W - lr * X
    return W, momentum
```

9.3 Hyperparameter Recommendations
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning rate | $10^{-4}$ to $10^{-3}$ | Often lower than Adam |
| Momentum ($\beta$) | $0.9$ to $0.99$ | Standard momentum values |
| Newton-Schulz steps | $3$ to $10$ | More steps = more accurate, but slower |
| Weight decay | $0.01$ to $0.1$ | Similar to AdamW |
9.4 Hybrid Approaches
In practice, Muon is often combined with Adam for non-matrix parameters:
- Matrix parameters (linear layers): Muon
- Scalar parameters (biases, layernorms): Adam/AdamW
This hybrid approach leverages the strengths of each method.
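A sketch of the grouping logic (module sizes are illustrative; `MuonOptimizer` is the class defined in Section 10.3 below):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 2048), nn.GELU(),
    nn.Linear(2048, 512), nn.LayerNorm(512),
)

# Matrix-shaped weights go to Muon; biases and norm scales go to AdamW
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
other_params = [p for p in model.parameters() if p.ndim < 2]

muon = MuonOptimizer(matrix_params, lr=1e-3)  # class from Section 10.3 below
adamw = torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.01)

# In the training loop, step both optimizers after each backward pass:
#   loss.backward(); muon.step(); adamw.step()
#   muon.zero_grad(); adamw.zero_grad()
```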
10. Code Examples
10.1 SpecGD Implementation
```python
import torch

def spectral_polar_and_nuclear(grad):
    """Compute the polar factor and nuclear norm of a gradient matrix."""
    g = grad.to(torch.float64)
    # Singular values via eigendecomposition of the Gram matrix G @ G^T
    gram = g @ g.T
    evals, evecs = torch.linalg.eigh(gram)
    evals = torch.clamp(evals, min=0.0)
    singulars = torch.sqrt(evals)
    nuclear = singulars.sum().item()
    # Polar factor UV^T = U Sigma^{-1} U^T G, dropping near-zero singular values
    mask = singulars > 1e-12 * singulars.max()
    if mask.any():
        vectors = evecs[:, mask]
        inv_scaled = vectors / singulars[mask]
        polar = (inv_scaled @ vectors.T) @ g
    else:
        polar = torch.zeros_like(g)
    return polar.to(dtype=grad.dtype), nuclear

def specgd_step(W, grad, A, lr):
    """
    SpecGD update: W <- W - lr * (||G||_* / ||A||_F^2) * polar(G)
    """
    fro_norm_A_sq = torch.sum(A * A).item()
    polar, nuclear = spectral_polar_and_nuclear(grad)
    scale = nuclear / fro_norm_A_sq
    W = W - lr * scale * polar
    return W
```

10.2 Computing Nuclear and Stable Rank
```python
def stable_rank(matrix):
    """Compute the stable rank: ||A||_F^2 / ||A||_op^2."""
    mat = matrix.to(torch.float64)
    fro_sq = torch.sum(mat * mat)
    op_norm = torch.linalg.matrix_norm(mat, ord=2)
    return (fro_sq / (op_norm * op_norm)).item()

def nuclear_rank(gradient):
    """
    Compute the nuclear rank: ||G||_*^2 / ||G||_F^2.
    Measures how uniform the singular value distribution is.
    """
    grad = gradient.to(torch.float64)
    fro_sq = torch.sum(grad * grad).item()
    # Nuclear norm via the singular values of G
    gram = grad @ grad.T
    evals = torch.linalg.eigvalsh(gram)
    evals = torch.clamp(evals, min=0.0)
    nuclear = torch.sqrt(evals).sum().item()
    if fro_sq == 0:
        return 0.0  # zero gradient: both norms vanish
    return (nuclear ** 2) / fro_sq

def check_spectral_advantage(grad, activation):
    """
    Check whether a spectral update would be advantageous.
    Returns: (advantage_ratio, recommendation)
    """
    nr = nuclear_rank(grad)
    st = stable_rank(activation)
    advantage_ratio = nr / st
    if advantage_ratio >= 2:
        return advantage_ratio, "Use spectral (SpecGD/Muon)"
    elif advantage_ratio >= 1:
        return advantage_ratio, "Spectral may help"
    else:
        return advantage_ratio, "Use Euclidean (SGD/Adam)"
```

10.3 Complete Muon Optimizer Class
```python
import torch
import torch.nn as nn

class MuonOptimizer:
    """
    Muon: Momentum Orthogonalized by Newton-Schulz.
    For matrix-shaped parameters, replaces the gradient with its polar factor.
    For other parameters, falls back to Adam.
    """
    def __init__(self, params, lr=1e-3, momentum=0.9,
                 weight_decay=0.01, ns_steps=5):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.ns_steps = ns_steps
        # Separate matrix-shaped and other parameters
        self.matrix_params = []
        self.other_params = []
        for p in params:
            if p.requires_grad:
                if p.dim() >= 2 and p.shape[0] * p.shape[1] > 100:
                    self.matrix_params.append(p)
                else:
                    self.other_params.append(p)
        # Momentum buffers for matrix parameters
        self.momentum_buffers = [
            torch.zeros_like(p) for p in self.matrix_params
        ]
        # Adam state for the remaining parameters
        self.adam_state = {
            'exp_avg': [torch.zeros_like(p) for p in self.other_params],
            'exp_avg_sq': [torch.zeros_like(p) for p in self.other_params],
        }
        self.step_count = 0

    def newton_schulz_iteration(self, X):
        """Orthogonalize a matrix using the Newton-Schulz iteration."""
        for _ in range(self.ns_steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X
        return X

    def step(self):
        self.step_count += 1
        # Matrix parameters: Muon update
        for i, p in enumerate(self.matrix_params):
            if p.grad is None:
                continue
            grad = p.grad.data
            # Heavy-ball momentum
            self.momentum_buffers[i].mul_(self.momentum).add_(grad)
            # Normalize so Newton-Schulz converges
            normed = self.momentum_buffers[i] / (torch.norm(self.momentum_buffers[i]) + 1e-8)
            # Orthogonalize toward the polar factor
            orthogonalized = self.newton_schulz_iteration(normed)
            # Decoupled weight decay, then the spectral step
            p.data.mul_(1 - self.lr * self.weight_decay)
            p.data.add_(orthogonalized, alpha=-self.lr)
        # Other parameters: Adam update
        beta1, beta2, eps = 0.9, 0.999, 1e-8
        for i, p in enumerate(self.other_params):
            if p.grad is None:
                continue
            grad = p.grad.data
            self.adam_state['exp_avg'][i].mul_(beta1).add_(grad, alpha=1 - beta1)
            self.adam_state['exp_avg_sq'][i].mul_(beta2).add_(grad * grad, alpha=1 - beta2)
            # Bias correction
            bias_correct1 = 1 - beta1 ** self.step_count
            bias_correct2 = 1 - beta2 ** self.step_count
            step_size = self.lr / bias_correct1
            denom = (self.adam_state['exp_avg_sq'][i] / bias_correct2).sqrt().add_(eps)
            p.data.mul_(1 - self.lr * self.weight_decay)
            p.data.addcdiv_(self.adam_state['exp_avg'][i], denom, value=-step_size)

    def zero_grad(self):
        for p in self.matrix_params + self.other_params:
            p.grad = None
```

10.4 Training Loop Comparison
```python
def train_comparison(model, train_loader, num_epochs=10, use_spectral=True):
    """
    Compare spectral vs Euclidean gradient descent.
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    if use_spectral:
        optimizer = MuonOptimizer(model.parameters(), lr=1e-3)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}: Loss = {avg_loss:.4f}")
    return model
```

11. Relationship to Other Concepts
11.1 Connection to Natural Gradient
Spectral gradient methods are closely related to natural gradient descent on the Stiefel manifold. The polar factor is the natural gradient direction when the parameter space is constrained to orthogonal matrices.
11.2 Connection to Shampoo
Shampoo[^4] uses Kronecker-factored preconditioning based on the left and right singular vectors of gradients. While Shampoo is more computationally efficient, Muon's orthogonalization can be seen as a "harder" version that fully respects the spectral structure.
11.3 Connection to SignSGD
Both SignSGD and SpecGD can be viewed as steepest descent under non-Euclidean norms:
- SignSGD: steepest descent under the elementwise $\ell_\infty$ norm
- SpecGD/Muon: steepest descent under the spectral (operator) norm

The Spec-Sign Advantage Index[^5] provides a unified criterion for choosing between them based on the nuclear-norm vs $\ell_1$-norm signal-to-noise ratios.
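The common structure is easy to verify numerically: a steepest-descent step of unit norm achieves a first-order decrease equal to the corresponding dual norm of the gradient. The sketch below (our construction; the advantage index itself is defined in the reference) compares the two gains:

```python
import torch

torch.manual_seed(0)
G = torch.randn(64, 64)

# SignSGD: unit l_inf step; first-order gain <G, sign(G)> = ||G||_1 (entrywise)
sign_gain = G.abs().sum()

# SpecGD: unit spectral-norm step; first-order gain <G, polar(G)> = ||G||_*
U, S, Vh = torch.linalg.svd(G)
spec_gain = S.sum()
assert torch.allclose((G * (U @ Vh)).sum(), spec_gain, rtol=1e-3)  # <G, UV^T> = tr(Sigma)

print(f"||G||_1 = {sign_gain:.1f}  vs  ||G||_* = {spec_gain:.1f}")
```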
12. Summary
The theory of spectral gradient updates provides a principled understanding of when methods like Muon outperform traditional optimizers:
| Key Insight | Implication |
|---|---|
| $\mathrm{rank}_*(G) \ge \mathrm{srank}(A)$ | Condition for spectral advantage |
| Low stable rank of activations | Ubiquitous in deep networks |
| Dimension scaling $\Theta(d)$ | Advantage grows with $d$ |
| Persistence throughout training | Not just an initialization effect |
Practical Takeaways:
- Use spectral methods (Muon) for matrix-shaped parameters in deep networks
- The advantage is strongest in intermediate layers with low stable rank activations
- Dimension matters: larger models/datasets benefit more from spectral methods
- Consider hybrid approaches: Muon for linear layers + Adam for others
Further Reading
- Muon Optimizer Theory — Detailed convergence analysis and variance reduction techniques
- Adaptive Optimizer Theory — Theory of Adam and related methods
- Deep Learning Optimizers — Practical guide to SGD, Adam, and variants
- Neural Tangent Kernel Theory — Connection to kernel methods
- Gradient Flow Theory — Continuous-time perspective on optimization
Footnotes

[^1]: Davis, D., & Drusvyatskiy, D. (2025). When do spectral gradient updates help in deep learning? arXiv:2512.04299. https://arxiv.org/abs/2512.04299
[^2]: Carlson, D., et al. (2015). Preconditioned Spectral Gradient Descent for Matrix Optimization. arXiv.
[^3]: Jordan, K., et al. (2024). Muon: Momentum Orthogonalized by Newton-Schulz.
[^4]: Gupta, V., et al. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. ICML 2018.
[^5]: Davis, D., et al. (2025). The Geometry of Spectral Gradient Descent: Layerwise Criteria for SignSGD vs SpecSGD. OpenReview.