Transformer Global Convergence with Mean-Field Theory

1. Introduction: The Optimization Mystery of Large Transformers

Despite the widespread success of Transformer models across various domains—including natural language processing and computer vision—their optimization guarantees in large-scale settings remain poorly understood.1

The Core Mystery: Why do gradient-based methods consistently succeed in training Transformers, despite the highly non-convex landscape of the training objective?

This phenomenon becomes particularly intriguing as model size increases: training algorithms typically converge globally, even when the loss landscape contains numerous local minima and saddle points. Understanding this theoretical puzzle is essential for:

  • Algorithmic improvements: Designing better optimizers and learning rate schedules
  • Architecture design: Understanding which components contribute to trainability
  • Generalization theory: Connecting optimization dynamics to out-of-sample performance

1.1 Prior Work and Limitations

Previous theoretical analyses of deep network optimization have established global convergence guarantees for:

| Model | Key Technique | Limitation |
|---|---|---|
| Two-layer NNs | Mean-field analysis | Limited to shallow architectures |
| Deep ResNets | Neural ODE / skip connections | Requires homogeneity assumptions |
| NTK regime | Infinite-width scaling | Excludes feature learning |

Critical Gap: Existing tools for deep networks (particularly Lu et al., 2020) demand:

  • Full homogeneity of the network function
  • Global Lipschitz smoothness of gradients

These conditions are not satisfied by Transformer architectures, particularly due to:

  • Softmax attention mechanism (not homogeneous)
  • Layer normalization
  • Complex interactions between attention and FFN blocks

1.2 Our Approach: Mean-Field Theory for Transformers

This paper bridges the gap between Transformer theory and practice by demonstrating global convergence of Transformer training via gradient flow in a large-scale model regime.1

Key Innovation: Shift optimization analysis from parameter space to distributional dynamics in the Wasserstein metric, enabling:

  1. Construction of the mean-field limit for Transformers
  2. Rigorous approximation guarantees between discrete and continuous dynamics
  3. Proof of global minimum convergence under mild assumptions

2. Mean-Field Limit Construction for Transformers

2.1 Transformer Model Architecture

Following common Transformer configurations, each block consists of two distinct layers:1

Self-Attention Layer (with residual connection):

$$Z_{\ell+1/2} \;=\; Z_\ell \;+\; \frac{\tau}{M}\sum_{i=1}^{M} f_{\mathrm{Attn}}\big(Z_\ell;\ \theta^{\mathrm{Attn}}_{\ell,i}\big)$$

Feed-Forward Layer (with residual connection):

$$Z_{\ell+1} \;=\; Z_{\ell+1/2} \;+\; \frac{\tau}{M}\sum_{i=1}^{M} f_{\mathrm{FF}}\big(Z_{\ell+1/2};\ \theta^{\mathrm{FF}}_{\ell,i}\big)$$

where:

  • $X$ is the input sequence (matrix form), and $Z_\ell$ is the representation entering block $\ell$, with $Z_0 = X$
  • $M$ is the model width (number of attention heads/FFN units per block)
  • $\tau$ is the residual step size
  • $f_{\mathrm{Attn}}$ and $f_{\mathrm{FF}}$ are the attention and FFN encoders, with parameters $\theta^{\mathrm{Attn}}_{\ell,i}$ and $\theta^{\mathrm{FF}}_{\ell,i}$
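To make the averaged residual structure above concrete, here is a minimal NumPy sketch of one width-$M$ block in this style. The specific encoder forms `f_attn` and `f_ffn`, and all shapes and names, are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def f_attn(Z, theta):
    """Single-head self-attention encoder; theta = (Wq, Wk, Wv)."""
    Wq, Wk, Wv = theta
    scores = (Z @ Wq) @ (Z @ Wk).T / np.sqrt(Z.shape[1])
    return softmax(scores) @ (Z @ Wv)

def f_ffn(Z, theta):
    """One-hidden-unit ReLU feed-forward encoder; theta = (w_in, w_out)."""
    w_in, w_out = theta
    return np.maximum(Z @ w_in, 0.0)[:, None] * w_out[None, :]

def transformer_block(Z, attn_params, ffn_params, tau):
    """One block: residual self-attention layer, then residual FFN layer, each averaging M encoders."""
    M = len(attn_params)
    Z = Z + (tau / M) * sum(f_attn(Z, th) for th in attn_params)   # self-attention layer
    Z = Z + (tau / M) * sum(f_ffn(Z, th) for th in ffn_params)     # feed-forward layer
    return Z

# toy usage: n = 5 tokens with d = 4 features, width M = 8, residual step size tau = 0.5
rng = np.random.default_rng(0)
n, d, M, tau = 5, 4, 8, 0.5
Z = rng.normal(size=(n, d))
attn_params = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(M)]
ffn_params = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(M)]
print(transformer_block(Z, attn_params, ffn_params, tau).shape)    # (5, 4)
```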

2.2 Deep Transformer Structure

For a Transformer with $L$ blocks (depth $L$), the structure evolves as

$$Z_0 = X \;\longrightarrow\; Z_{1} \;\longrightarrow\; \cdots \;\longrightarrow\; Z_{L},$$

where the block updates above are applied for $\ell = 0, 1, \dots, L-1$ and the residual step size $\tau$ is taken on the order of $1/L$, so that depth plays the role of a time variable.

2.3 From Discrete to Continuous: Mean-Field Limit

Key Insight: As both width $M \to \infty$ and depth $L \to \infty$, we can interpret the Transformer as a continuous dynamical system whose parameters follow a probability distribution.

The continuous Transformer satisfies the ODE:

$$\frac{\mathrm{d}Z(t)}{\mathrm{d}t} \;=\; \mathbb{E}_{\theta \sim \rho_t}\big[f\big(Z(t);\ \theta\big)\big], \qquad Z(0) = X, \quad t \in [0, 1],$$

where $\rho_t$ is the probability distribution of parameters at "time" (normalized depth) $t$, and $f$ stands for the encoders (attention and FFN, unified through their average in the limit).

Interpretation: Each encoder $f_{\mathrm{Attn}}(\,\cdot\,;\theta)$ or $f_{\mathrm{FF}}(\,\cdot\,;\theta)$ is conceptualized as a particle, and $\rho_t$ describes the distribution of these particles in parameter space.
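As a quick sanity check of this depth-to-time limit, the toy script below (one-dimensional features and an assumed bounded "encoder", nothing Transformer-specific) compares $L$ residual steps of size $1/L$ against a very deep reference network standing in for the ODE solution; the gap shrinks as $L$ grows.

```python
import numpy as np

def f(z, theta):
    # toy one-dimensional "encoder": a bounded nonlinearity with scalar parameter theta
    return np.tanh(theta * z)

rng = np.random.default_rng(1)
thetas = rng.normal(loc=1.0, size=2000)   # fixed particle pool standing in for the parameter distribution rho
z0 = 1.5

def forward(z0, L):
    """L residual layers with step size 1/L, each averaging the encoder over all particles."""
    z = z0
    for _ in range(L):
        z = z + (1.0 / L) * np.mean(f(z, thetas))
    return z

z_ref = forward(z0, 100_000)                 # very deep network ~ solution of dz/dt = E_theta[f(z, theta)]
for L in (2, 8, 32, 128):
    print(L, abs(forward(z0, L) - z_ref))    # depth-discretization error shrinks roughly like 1/L
```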

2.4 Gradient Flow on Parameter Distribution

The training objective with $\ell_2$ regularization is:

$$Q(\rho) \;=\; R(\rho) \;+\; \lambda \int \|\theta\|_2^2 \,\mathrm{d}\rho(\theta, t),$$

where the population risk is:

$$R(\rho) \;=\; \mathbb{E}_{(X, y)}\Big[\ell\big(\hat{y}_\rho(X),\, y\big)\Big],$$

with $\hat{y}_\rho(X)$ the output of the continuous Transformer driven by $\rho$. The Wasserstein gradient flow of this functional is given by the McKean-Vlasov PDE:

$$\partial_s \rho_s \;=\; \nabla_\theta \cdot \Big(\rho_s\, \nabla_\theta\, \tfrac{\delta Q(\rho_s)}{\delta \rho}\Big), \qquad s \ge 0,$$

where $s$ denotes training time.

This PDE governs how the parameter distribution evolves during training.
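The gradient flow has a natural finite-particle counterpart: plain gradient descent on all particle parameters, which is a forward-Euler discretization of the PDE above. The sketch below illustrates this correspondence on a toy two-layer mean-field regression model (an assumption made for brevity), not on the paper's Transformer objective.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 512                                      # number of particles (mean-field width)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel()                        # toy regression target

a = rng.normal(size=M) / np.sqrt(M)          # particle parameters theta_i = (a_i, w_i, b_i)
w = rng.normal(size=M)
b = rng.normal(size=M)
lam, lr, n = 1e-4, 0.5, len(y)

def predict(X):
    h = np.maximum(X * w + b, 0.0)           # (n, M) ReLU features, one column per particle
    return h @ a / M                         # mean-field average over particles

for step in range(2000):
    h = np.maximum(X * w + b, 0.0)
    r = h @ a / M - y                        # residuals
    act = (X * w + b > 0).astype(float)
    # gradients of the regularized objective: mean squared residual / 2 + lam * average ||theta_i||^2
    g_a = h.T @ r / (n * M) + 2 * lam * a / M
    g_w = (act * X).T @ r * a / (n * M) + 2 * lam * w / M
    g_b = act.T @ r * a / (n * M) + 2 * lam * b / M
    # gradient descent on every particle = one Euler step of the discretized Wasserstein gradient flow
    a -= lr * M * g_a                        # the factor M is the usual mean-field learning-rate scaling
    w -= lr * M * g_w
    b -= lr * M * g_b

print("train MSE:", np.mean((predict(X) - y) ** 2))
```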


3. Wasserstein Gradient Flow Representation

3.1 Mathematical Framework

The Wasserstein space $\mathcal{P}_2$ is the space of probability measures with finite second moment, equipped with the Wasserstein-2 distance:

$$W_2(\mu, \nu) \;=\; \Big(\inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|_2^2 \,\mathrm{d}\gamma(x, y)\Big)^{1/2},$$

where $\Gamma(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$.
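For intuition, the 2-Wasserstein distance between two equal-size empirical measures reduces to an optimal matching problem, which can be solved exactly with `scipy.optimize.linear_sum_assignment`; the point clouds below are toy data chosen only for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_empirical(X, Y):
    """Exact W2 distance between uniform empirical measures on two equal-size point clouds."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared Euclidean distances
    rows, cols = linear_sum_assignment(C)                # optimal coupling is a permutation here
    return np.sqrt(C[rows, cols].mean())

rng = np.random.default_rng(3)
mu = rng.normal(loc=0.0, size=(300, 2))
nu = rng.normal(loc=1.0, size=(300, 2))
print(w2_empirical(mu, nu))   # roughly the mean shift sqrt(2), plus finite-sample error
```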

Why Wasserstein geometry? Unlike linear ($L^2$-type) geometry on function spaces, Wasserstein geometry:

  • Respects the nonlinear structure of probability distributions
  • Enables analysis of propagation of chaos (convergence of particle systems)
  • Provides a natural metric for gradient flows on probability measures

3.2 Functional Gradient Derivation

The functional derivative of $Q$ with respect to $\rho$, evaluated at a parameter $\theta$ placed at depth $t$, pairs the encoder output $f(Z(t);\theta)$ with an adjoint variable $p(t)$ and adds the regularization contribution $\lambda\|\theta\|_2^2$, where $p(t)$ is the adjoint variable obtained by back-propagating the loss through the continuous Transformer.

3.3 The Gradient Flow PDE

Explicitly, the Wasserstein gradient flow satisfies the continuity equation

$$\partial_s \rho_s(\theta, t) \;=\; \nabla_\theta \cdot \Big(\rho_s(\theta, t)\; \nabla_\theta \tfrac{\delta Q(\rho_s)}{\delta \rho}(\theta, t)\Big),$$

with the velocity field given by the gradient of the functional derivative above, computed separately for the attention parameters and the FFN parameters.

Interpretation: Particles $\theta^{\mathrm{Attn}}$ and $\theta^{\mathrm{FF}}$ flow "downhill" along the gradient of the training objective, and the distribution $\rho_s$ transports mass through this vector field.

3.4 Well-Posedness of the Gradient Flow

Proposition (Existence and Uniqueness): Under mild assumptions, there exists a unique solution to the gradient flow equation, satisfying:

  1. Bounded support: $\rho_s$ concentrates on a bounded region of parameter space at every finite training time $s$
  2. Normalized mass: $\rho_s$ remains a probability measure for all $s \ge 0$
  3. Regularization effect: The regularization parameter $\lambda$ controls the growth of the parameter norms

Key insight: The $\ell_2$ regularization is essential for well-posedness: it stabilizes the optimization by controlling both the maximum and the average parameter norms.


4. Partial Homogeneity and Local Lipschitz Smoothness

4.1 Beyond Full Homogeneity

Previous mean-field analyses of deep networks required full homogeneity:

$$f\big(Z;\ c\,\theta\big) \;=\; c\, f\big(Z;\ \theta\big) \qquad \text{for all } c \ge 0 \text{ and all parameters } \theta.$$
This property enabled certain technical arguments but excludes:

  • Softmax attention (exponential in parameters)
  • Sigmoid activations
  • Complex interactions in multi-head attention

4.2 Partial Homogeneity Assumption

This paper introduces partial homogeneity that applies to only a subset of parameters:1

Assumption 4 (Partial 1-Homogeneity): There exists a partition of the parameters $\theta = (\theta^{(1)}, \theta^{(2)})$ such that:

$$f\big(Z;\ (c\,\theta^{(1)},\, \theta^{(2)})\big) \;=\; c\, f\big(Z;\ (\theta^{(1)},\, \theta^{(2)})\big) \qquad \text{for all } c \ge 0.$$

Interpretation: Only the subset $\theta^{(1)}$ scales the output homogeneously, while the remaining parameters $\theta^{(2)}$ can have arbitrary (e.g., nonlinear) dependence.

Example: In self-attention, the value/output projection enters the output linearly and is therefore 1-homogeneous, while the query and key projections sit inside the softmax and are not; see the numerical check below.
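The following small numerical check (toy single-head attention with assumed shapes) confirms which parameters enter 1-homogeneously: scaling the value projection scales the output linearly, whereas scaling the query or key projection does not.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def attn(Z, Wq, Wk, Wv):
    scores = (Z @ Wq) @ (Z @ Wk).T / np.sqrt(Z.shape[1])
    return softmax(scores) @ (Z @ Wv)

rng = np.random.default_rng(4)
Z = rng.normal(size=(6, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
c = 3.0

# value projection: the output is linear in Wv, so scaling Wv by c scales the output by c
print(np.allclose(attn(Z, Wq, Wk, c * Wv), c * attn(Z, Wq, Wk, Wv)))   # True
# query projection: the softmax breaks homogeneity
print(np.allclose(attn(Z, c * Wq, Wk, Wv), c * attn(Z, Wq, Wk, Wv)))   # False
```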

4.3 Local Lipschitz Smoothness

Instead of global Lipschitz continuity of gradients (too restrictive for Transformers), we assume local Lipschitz smoothness in expectation:

Assumption 3 (Locally Lipschitz Continuous Gradient in Expectation): For any radius $R > 0$ and any pair of $R$-Lipschitz continuous functions $Z$ and $Z'$, the expected difference of the encoder gradients evaluated at $Z$ and $Z'$ is bounded by $K(R)\,\|Z - Z'\|$, where $K(\cdot)$ is a continuous, monotonically increasing function.
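A schematic way to write a condition of this type (the symbols and norms here are illustrative assumptions, not the paper's exact statement) is:

$$\mathbb{E}_{\theta \sim \rho}\Big[\big\|\nabla f(Z;\,\theta) - \nabla f(Z';\,\theta)\big\|\Big] \;\le\; K(R)\,\|Z - Z'\| \qquad \text{for all } R\text{-Lipschitz } Z,\, Z',$$

with $K(\cdot)$ continuous and monotonically increasing in $R$.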

Key relaxation: This assumption:

  • Accommodates ReLU activations (where derivatives exist almost everywhere)
  • Is satisfied by softmax attention with standard initialization
  • Holds “in expectation” rather than pointwise

4.4 Comparison with Prior Work

| Property | Prior Work (ResNets) | This Work (Transformers) |
|---|---|---|
| Homogeneity | Full homogeneity required | Partial homogeneity sufficient |
| Lipschitz constant | Global, uniform | Local, expectation-based |
| Activation functions | Smooth required | ReLU, softmax, sigmoid allowed |
| Analysis scope | Single encoder per block | Two distinct encoders (Attn + FFN) |

5. Global Minimum Convergence Proof

5.1 Main Theorem: Gradient Flow Approximation

Theorem 3.1 (Gradient flow approximation of discretization): Define the empirical distribution of the $M \times L$ encoder parameters, with each particle tagged by the normalized depth $\ell/L$ of its block.

Under Assumptions 1-3, this empirical distribution weakly converges to the Wasserstein gradient flow solution almost surely as $M \to \infty$ and $L \to \infty$.

Moreover, for any fixed training horizon and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ the discrete dynamics remain uniformly close to the mean-field limit over that horizon.

Interpretation: Large-scale discrete Transformers can be arbitrarily well approximated by their mean-field limit, with the approximation error vanishing as the width $M$ and depth $L$ grow.
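For concreteness, one natural way to write an empirical distribution of this kind (notation assumed here, not taken verbatim from the paper) is a uniform mixture of Dirac masses, one per encoder parameter, tagged by the normalized depth of its block:

$$\hat{\rho}^{\,M,L} \;=\; \frac{1}{L}\sum_{\ell=0}^{L-1} \delta_{\ell/L} \otimes \Big(\frac{1}{M}\sum_{i=1}^{M} \delta_{\theta_{\ell,i}}\Big),$$

so that weak convergence of $\hat{\rho}^{\,M,L}$ captures both the averaging over width and the depth-to-time limit.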

5.2 Global Convergence Theorem

Theorem 4.1 (Global convergence up to the regularization term): Suppose the Wasserstein gradient flow weakly converges to a limiting distribution $\rho_\infty$, and the following conditions hold:

  1. Bounded support: $\rho_s$ concentrates on a bounded region for all large training times $s$
  2. Separation property: The support of $\rho_\infty$ at some depth spans a set that separates inner and outer parameter regions

Then, for any $\epsilon > 0$, there exists a finite training time after which the risk is bounded by $\epsilon$ plus a term controlled by the regularization strength $\lambda$.

Key implications:

  1. As the training time grows and $\lambda \to 0$, the risk approaches zero
  2. The additional term in the bound arises from the regularization
  3. By choosing a sufficiently small regularization parameter $\lambda$, we achieve arbitrarily small training risk

5.3 Proof Sketch

The proof proceeds in three main steps:

Step 1: Establishing Continuity of Functional Gradient

Show that the functional gradient remains constant whenever the functional derivative is constant over a region. This requires careful analysis of the Transformer dynamics.

Step 2: Bounding the Energy at Fixed Points

Derive the key bound on the risk at stationary points of the flow by analyzing the landscape of the functional energy through its derivatives.

Step 3: Finite-Time Risk Approximation

Show that the finite-time risk can approach this bound. Using Theorem 3.1, demonstrate that the objective value becomes sufficiently small at some finite training time; since the objective is non-increasing along the gradient flow, it remains small for all later times.

5.4 Practical Corollary

Corollary 4.1: For any fixed target accuracy $\epsilon > 0$ and failure probability $\delta \in (0, 1)$, there exist constants $M_0$ and $L_0$ such that the training risk of the discrete Transformer falls below $\epsilon$ with probability at least $1 - \delta$, whenever the width satisfies $M \ge M_0$ and the depth satisfies $L \ge L_0$.

Practical meaning: With sufficiently large width and depth, gradient flow training of Transformers guarantees convergence to near-zero training loss.


6. Novel Mean-Field Techniques for Transformers

6.1 Technical Contributions

This paper develops several novel techniques that extend mean-field theory to Transformer architectures:1

6.1.1 Uniform Error Control

Previous works analyzed the error at specific time points. This work achieves uniform error control over any finite training-time interval: the gap between the discrete Transformer dynamics and the mean-field limit is bounded simultaneously for all times in the interval.

This enables continuous monitoring of the maximum error across the entire training trajectory.

6.1.2 Two-Encoder Analysis

Unlike ResNet models with a single encoder per block, Transformers use two distinct encoders ($f_{\mathrm{Attn}}$ for attention, $f_{\mathrm{FF}}$ for the FFN) that alternate. The analysis:

  • Treats each encoder separately with appropriate regularity conditions
  • Unifies them through the average in the continuous limit
  • Provides rigorous validation of the “ensemble of paths” concept

6.1.3 Propagation of Chaos Framework

Extended the classical propagation of chaos theory to the Transformer setting:

  • Particle systems with non-i.i.d. initializations at different depths
  • Uniform bounds on particle differences using Grönwall’s inequality
  • Concentration estimates for empirical distributions
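As a toy illustration of propagation of chaos (not the paper's setting), the script below simulates a mean-field interacting particle system for increasing particle counts and checks that a tagged particle's trajectory approaches the solution of the limiting McKean-Vlasov dynamics.

```python
import numpy as np

rng = np.random.default_rng(5)
T, steps = 3.0, 3000
dt = T / steps

def tagged_trajectory(M):
    """Tagged particle x[0] in a system of M particles attracted to the empirical mean."""
    x = rng.normal(size=M)                   # i.i.d. initialization for the background particles
    x[0] = 2.0                               # fix the tagged particle's initial condition
    traj = [x[0]]
    for _ in range(steps):
        x = x + dt * (x.mean() - x)          # dx_i/dt = (empirical mean) - x_i
        traj.append(x[0])
    return np.array(traj)

def limit_trajectory():
    """Mean-field limit: dx/dt = m(t) - x, with m(t) = E[x(t)] = 0 preserved by the dynamics."""
    x, traj = 2.0, [2.0]
    for _ in range(steps):
        x = x + dt * (0.0 - x)
        traj.append(x)
    return np.array(traj)

limit = limit_trajectory()
for M in (10, 100, 1000, 10000):
    gap = np.abs(tagged_trajectory(M) - limit).max()
    print(M, gap)                            # sup-norm gap decays roughly like 1/sqrt(M)
```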

6.2 Novel Assumptions and Their Justification

| Assumption | Novel Element | Justification |
|---|---|---|
| 2 | Column-wise norms for sequential data | Natural for token-based inputs |
| 3 | Local Lipschitz in expectation | Accommodates ReLU/softmax |
| 4 | Partial homogeneity | Enables softmax/sigmoid activations |

6.3 Verification for Concrete Architectures

The paper provides explicit verification of assumptions for specific Transformer configurations, demonstrating that:

  • Feed-forward layers with ReLU satisfy the universal kernel property
  • Self-attention layers can serve as universal approximators under certain conditions
  • The partial homogeneity condition holds for standard architectural choices

7. Connections to Prior Work

7.1 Neural Network Mean-Field Theory

The foundation builds on the seminal work of Mei, Montanari, and Nguyen (2018) on two-layer networks:1

Key developments:

  • Chizat & Bach (2018): Established global convergence for overparameterized models using optimal transport
  • Lu et al. (2020): Extended to deep ResNets with skip connections
  • Ding et al. (2021, 2022): Proved overparameterization guarantees for deep ResNets

7.2 ResNet vs Transformer Analysis

| Aspect | ResNet Analysis | Transformer Analysis |
|---|---|---|
| Structure | Single encoder per block | Two distinct encoders |
| Homogeneity | Full homogeneity required | Partial homogeneity sufficient |
| Analysis method | ODE discretization | PDE/ODE hybrid |
| Skip connections | Identity shortcuts | Learned attention patterns |

7.3 Neural ODE Connection

The Transformer ODE perspective connects to Neural ODE theory:

  • ResNets: $\frac{\mathrm{d}Z}{\mathrm{d}t} = f\big(Z;\ \theta(t)\big)$ with identity-like skip connections
  • Transformers: $\frac{\mathrm{d}Z}{\mathrm{d}t} = \mathbb{E}_{\theta \sim \rho_t}\big[f\big(Z;\ \theta\big)\big]$ with stochastic averaging over the parameter distribution

This provides a unifying framework for understanding deep network training dynamics.

7.4 In-Context Learning Connections

Recent work on in-context learning (ICL) provides complementary perspectives:

  • Ahn et al., 2023: Showed Transformers can perform ICL via linear regression approximation
  • Kim & Suzuki, 2024: Analyzed mean-field dynamics for in-context feature learning
  • This work: Provides optimization-theoretic foundation for these phenomena

8. Future Directions and Open Problems

8.1 Theoretical Extensions

  1. Direct gradient descent analysis: Current results use continuous gradient flow; a discrete-time analysis with finite step sizes is needed
  2. Generalization bounds: Connect optimization convergence to finite-sample generalization
  3. Self-attention as universal kernel: Rigorous conditions for attention’s approximation capacity

8.2 Practical Implications

  1. Initialization schemes: Theory suggests optimal scaling for width/depth tradeoffs
  2. Regularization tuning: Guidance on choosing an appropriate $\lambda$ to balance convergence speed and generalization
  3. Architecture design: Principles for choosing attention/FFN ratios

8.3 Open Questions

  • Can we remove the partial homogeneity assumption entirely?
  • What are the necessary and sufficient conditions for global convergence?
  • How does the theory extend to mixture of experts and sparse Transformers?

9. Summary

This paper establishes the first rigorous global convergence theory for large-scale Transformer training using mean-field methods.1

Key Contributions:

  1. Mean-field limit construction: Showed that as width and depth go to infinity, Transformers converge to a Wasserstein gradient flow described by a PDE

  2. Novel technical assumptions: Introduced partial homogeneity and local Lipschitz smoothness—weaker conditions that accommodate real Transformer architectures

  3. Two main theorems:

    • Theorem 3.1: Close approximation between discrete gradient flow and continuous limit
    • Theorem 4.1: Global convergence to near-zero training loss
  4. Practical implications: Demonstrated that basic gradient flow can successfully navigate the complex non-convex landscape to find optimal solutions

Significance: These results provide a theoretical foundation for understanding why Transformers train so successfully in practice, and open new avenues for optimization theory of modern deep learning architectures.


Footnotes

  1. Gao, C., Cao, Y., Li, Z., He, Y., Wang, M., Liu, H., Klusowski, J. M., & Fan, J. (2024). Global Convergence in Training Large-Scale Transformers. NeurIPS 2024. arXiv:2410.23610