Introduction: The Problem of HP Transfer Across Model Sizes
A fundamental challenge in training Large Language Models (LLMs) is determining how to scale hyperparameters (HPs) when changing model architecture. Modern deep learning's core paradigm—that larger models yield better performance [1]—creates enormous pressure to scale model width and depth efficiently.
The Scaling Challenge: When practitioners scale models from small experiments to production deployments, they face two critical problems:
- HP Retuning Overhead: Traditional parameterizations require re-tuning learning rates and other HPs for each model size, which is computationally expensive
- Sub-optimal Training: When re-tuning is prohibitive, models are trained with sub-optimal HPs, wasting valuable compute resources
The maximal update parameterization (µP) [2] was developed to address HP transfer when scaling width. However, extending this to simultaneous width and depth scaling remained an open problem.
This article introduces CompleteP [3], the parameterization that achieves both:
- Depth-wise HP transfer: Optimal base HPs remain approximately constant when scaling model depth
- Non-lazy learning in all layers: Every layer learns genuinely non-linear features, not just linearizations
CompleteP enables 12-34% compute efficiency improvements over prior state-of-the-art parameterizations.
The Lazy Learning Regime in Deep Networks
What is Lazy Learning?
In deep networks, lazy learning refers to a regime where layers learn features that remain very close to their linearization around initialization [4]. This phenomenon occurs when the network's non-linear components contribute minimally to the learned representations.
Mathematical Definition: A layer is in the lazy learning regime if its behavior is well approximated by a first-order Taylor expansion around initialization:

$$f(\theta) \approx f(\theta_0) + \nabla_\theta f(\theta_0)^\top (\theta - \theta_0)$$

where $f(\theta)$ is the network output with parameters $\theta$, and $\theta_0$ is the initialization.
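To make the regime concrete, here is a toy NumPy sketch (illustrative only; the single-layer network, sizes, and step scales are hypothetical, not from the paper) comparing the true network output against its linearization for small versus O(1) parameter updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network: f(W) = v . tanh(W x)
d = 32
x = rng.normal(size=d) / np.sqrt(d)
W0 = rng.normal(size=(d, d))
v = rng.normal(size=d) / np.sqrt(d)

def f(W):
    return v @ np.tanh(W @ x)

# Gradient of f with respect to W at initialization W0
grad = np.outer(v * (1 - np.tanh(W0 @ x) ** 2), x)

def f_lin(W):
    # First-order Taylor expansion of f around W0
    return f(W0) + np.sum(grad * (W - W0))

dW = rng.normal(size=(d, d))
# Lazy regime: tiny update, linearization is nearly exact
err_small = abs(f(W0 + 1e-3 * dW) - f_lin(W0 + 1e-3 * dW))
# Non-lazy regime: O(1) update, non-linear terms matter
err_large = abs(f(W0 + 1.0 * dW) - f_lin(W0 + 1.0 * dW))
print(err_small, err_large)
```

The linearization error grows quadratically with the update size, so a layer whose updates stay tiny behaves as a linear model around initialization.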
Why Lazy Learning is Problematic
When networks operate in the lazy regime:
| Aspect | Impact |
|---|---|
| Feature Diversity | Layers learn similar (linear) features, wasting depth |
| Non-linearity Unused | Activation functions contribute little to learning |
| Depth Inefficiency | Adding more layers provides diminishing returns |
| Compute Waste | Training compute is spent without proportional benefit |
Previous work [5] argued that non-lazy learning could not be achieved at any depth scaling exponent while preserving HP transfer, and that α = 1/2 was optimal in practice. This suggested that lazy learning was an unavoidable trade-off.
CompleteP Parameterization Design
Core Design Principle
CompleteP sets the depth scaling exponent α = 1, which gives the parameterization its name. This is the unique value of α that ensures complete feature learning (non-lazy learning in all layers) while maintaining HP transfer.
The Residual Connection Formula
The key difference between parameterizations lies in how the transformer's residual block outputs are scaled:

$$x^\ell = x^{\ell-1} + m_L^{-\alpha}\,\mathcal{F}^\ell\!\left(x^{\ell-1}\right)$$

where:
- $x^\ell$ is the residual stream at layer $\ell$
- $\mathcal{F}^\ell$ is the $\ell$-th residual block (MLP or attention)
- $m_L = L / L_{\text{base}}$ is the depth multiplier, with $L$ the total number of layers
- $\alpha$ controls the depth-dependent re-scaling
Parameterization Comparison
| Parameterization | $\alpha$ Value | Key Properties |
|---|---|---|
| SP (Standard) | N/A | No depth scaling; unstable at depth |
| µP | N/A | Width scaling only; no depth prescription |
| Mean-Field | $1/2$ | Partial depth scaling; still lazy |
| CompleteP | $1$ | Full depth scaling; non-lazy learning |
CompleteP Scaling Rules
CompleteP extends µP with principled depth scaling. The complete scaling rules for a Pre-LN transformer [3]:
Width Multipliers
| Component | SP | µP | CompleteP ($\alpha = 1$) |
|---|---|---|---|
| Embedding Init Var | $\sigma_{\text{base}}^2$ | $\sigma_{\text{base}}^2$ | $\sigma_{\text{base}}^2$ |
| Embedding LR | $\eta_{\text{base}}$ | $\eta_{\text{base}}$ | $\eta_{\text{base}}$ |
| Hidden Init Var | $\sigma_{\text{base}}^2 / m_d$ | $\sigma_{\text{base}}^2 / m_d$ | $\sigma_{\text{base}}^2 / m_d$ |
| Hidden LR (Adam) | $\eta_{\text{base}}$ | $\eta_{\text{base}} / m_d$ | $\eta_{\text{base}} / m_d$ |

where:
- $m_d = d / d_{\text{base}}$ (width multiplier)
- $m_L = L / L_{\text{base}}$ (depth multiplier)
Depth-Specific Corrections (CompleteP)
CompleteP requires additional corrections for stable training:

- Residual Scaling Factor: block outputs are multiplied by $1/m_L$
- LayerNorm Learning Rate: $\eta_{\text{base}} / m_L$ for LN gains and biases
- Weight Decay: $\lambda_{\text{base}} \cdot m_d$ for hidden weights under decoupled AdamW
- AdamW $\epsilon$:
  - Embedding layers: $\epsilon_{\text{base}} / m_d$
  - Hidden layers: $\epsilon_{\text{base}} / (m_d \cdot m_L)$
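As a worked example, the scaling rules above can be collected into one helper. This is a sketch under the rules as stated here; the `scale_hps` function and the base values are illustrative, not from the paper:

```python
def scale_hps(d, L, d_base=256, L_base=24,
              sigma2=0.02, eta=6e-3, eps=1e-8, wd=0.1, alpha=1.0):
    """Apply CompleteP width/depth scaling to base HPs (illustrative sketch)."""
    m_d = d / d_base  # width multiplier
    m_L = L / L_base  # depth multiplier
    return {
        'embedding_init_var': sigma2,
        'embedding_lr': eta,
        'hidden_init_var': sigma2 / m_d,
        'hidden_lr': eta / m_d,            # Adam, matrix-like hidden params
        'ln_bias_lr': eta / m_L,           # LayerNorm gains and biases
        'residual_scale': 1.0 / m_L ** alpha,
        'hidden_weight_decay': wd * m_d,   # decoupled AdamW
        'emb_adam_eps': eps / m_d,
        'hidden_adam_eps': eps / (m_d * m_L),
    }

# Scale a 256-wide, 24-layer base model up to 2048 wide and 96 layers deep
hps = scale_hps(d=2048, L=96)
print(hps)
```

With $m_d = 8$ and $m_L = 4$, the hidden LR shrinks by 8x, the residual branches by 4x, and the hidden AdamW $\epsilon$ by 32x.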
Depth-wise HP Transfer Guarantees
Empirical Evidence
The paper demonstrates that only CompleteP (α = 1) achieves reliable depth-wise HP transfer [3]:

Finding 1: With SP, µP, and α = 0.5, the optimal learning rate and initialization standard deviation do not remain stable as depth varies.

Finding 2: Only CompleteP (α = 1) achieves reliable depth-wise HP transfer.
Coordinate Check Results
The paper validates HP transfer using coordinate checks—training with different depth multipliers and verifying that:
- Loss curves align when using transferred HPs
- Optimal HPs remain constant across depths
- CompleteP (α = 1): concentric loss curves with stable minima
- α = 0.5: optimal HPs diverge with depth
- µP/SP: unstable training at depth
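The stabilizing effect of the $1/m_L$ branch multiplier can already be seen at initialization. The following toy NumPy sketch (illustrative, not from the paper; the block is a simplified stand-in for attention/MLP) tracks the residual-stream RMS through a stack of random residual blocks with and without CompleteP's scaling:

```python
import numpy as np

def stream_rms(L, L_base=8, d=128, scaled=True, seed=0):
    """RMS of the residual stream after L random residual blocks."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    branch_scale = L_base / L if scaled else 1.0  # 1/m_L for alpha = 1
    for _ in range(L):
        W = rng.normal(size=(d, d)) / np.sqrt(d)  # variance ~ 1/fan_in
        x = x + branch_scale * np.tanh(W @ x)     # simplified residual block
    return float(np.sqrt(np.mean(x ** 2)))

for L in (8, 32, 128):
    print(L, stream_rms(L, scaled=True), stream_rms(L, scaled=False))
```

Without the multiplier the stream RMS grows with depth, so activation scales (and hence optimal HPs) shift as layers are added; with the $1/m_L$ multiplier the stream stays at a depth-independent scale.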
Compute-Optimal Setting
Even under the compute-optimal setup (20 tokens per parameter), CompleteP shows superior HP transfer:
- Reduced sensitivity to learning rate choice
- Consistent loss improvements across depths
- No additional HP tuning required when scaling depth
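For scale, the 20 tokens-per-parameter rule fixes the token budget directly. A quick back-of-the-envelope sketch, using the standard $6ND$ approximation for transformer training FLOPs:

```python
# Compute-optimal token budget at 20 tokens per parameter
params = 1.9e9               # e.g., a 1.9B-parameter model
tokens = 20 * params         # 20 tokens per parameter
flops = 6 * params * tokens  # standard 6*N*D training-FLOP approximation
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

A 1.9B model thus needs roughly 38B tokens, i.e. about $4.3 \times 10^{20}$ training FLOPs.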
Non-Lazy Learning in All Layers: Proof Sketch
The Lazy vs Non-Lazy Spectrum
Yang et al. (2023) [5] argued that α = 1/2 was optimal based on feature diversity considerations. CompleteP's key insight is that this argument only holds in the lazy learning regime.
Complete Feature Learning Requirement
CompleteP introduces a refined desideratum for HP transfer that includes complete feature learning [3]:
Complete Feature Learning: The learned representation in every layer of the model must remain non-lazy (i.e., non-linear) with respect to the parameters in both that layer and all earlier layers.
Mathematical Framework
Consider the network's representation at layer $\ell$:

$$x^\ell = x^{\ell-1} + m_L^{-\alpha}\,\mathcal{F}^\ell\!\left(x^{\ell-1}\right)$$

Critical Insight: For $\alpha = 1$, the contribution from each residual block is scaled by $1/m_L$, ensuring:
- First-order (linear) terms contribute at a scale that sums to $\Theta(1)$ across depth
- Higher-order (non-linear) terms remain significant rather than vanishing as depth grows
- Lazy regime avoided: non-linear features cannot be absorbed into linear approximations
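The bookkeeping behind this insight can be sketched as follows (an informal consistency check, not the paper's proof). With block outputs scaled by $m_L^{-1}$ and Adam-style per-block updates of size $\Theta(\eta)$, the total change in the residual stream over one step is

$$\Delta x^L \;\approx\; \sum_{\ell=1}^{L} m_L^{-1}\,\Delta \mathcal{F}^\ell \;=\; \Theta\!\left(L \cdot m_L^{-1} \cdot \eta\right) \;=\; \Theta\!\left(L_{\text{base}}\,\eta\right),$$

which is independent of the depth multiplier $m_L$: the stream update stays $\Theta(1)$ while each block's internal update $\Delta \mathcal{F}^\ell$ remains $\Theta(\eta)$, large enough to move that block off its linearization.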
Theorem Sketch (Informal)
Theorem: Under CompleteP parameterization (α = 1), as width $d$ and depth $L \to \infty$ with finite ratio $d/L$:
- Every layer’s learned features are non-lazy with respect to all earlier layers
- The training dynamics remain in the feature learning regime rather than the NTK/lazy regime
- HP transfer holds: optimal base HPs are constant across model sizes
This theorem distinguishes CompleteP from all other parameterizations.
Hardware-Aware Shape Optimization
The N:L Ratio Question
Kaplan et al. (2020) [6] established the compute-optimal width-to-depth (N:L) ratio using SP. However, SP does not admit stable width and depth scaling, undermining this conclusion.
CompleteP Enables Wider Design Space
CompleteP’s stable depth scaling enables revisiting the compute-optimal width-to-depth ratio:
| N:L Ratio | SP/µP Viability | CompleteP Viability |
|---|---|---|
| 100:1 | Stable | Stable |
| 50:1 | Moderate | Stable |
| 20:1 | Challenging | Stable |
| 10:1 | Poor | Stable |
| 1:1 | Unstable | Viable |
Hardware Considerations
Different hardware configurations favor different N:L ratios:
| Hardware Type | Preferred N:L | Rationale |
|---|---|---|
| GPU (memory-bound) | Higher N:L | Better memory utilization |
| Wafer-scale (Cerebras) | Variable | Massive memory bandwidth enables deeper models |
| Edge devices | Lower N:L | Memory constraints favor depth efficiency |
CompleteP’s key advantage: Practitioners can now choose shapes based on hardware characteristics rather than training stability constraints.
Compute Efficiency Improvements
FLOP Savings
The paper reports substantial compute efficiency improvements with CompleteP [3]:
| Model Configuration | FLOP Savings vs µP |
|---|---|
| Optimally-shaped 1.9B model | 11.8% |
| 179-layer deep model | 34.4% |
Training Curves
Models trained with CompleteP achieve:
- Faster convergence (lower loss at same token count)
- Better final performance (lower asymptotic loss)
- Improved scaling with depth
Efficiency Breakdown
The 12-34% improvement comes from three sources:
- HP Transfer Efficiency: Eliminating re-tuning costs
- Training Efficiency: Non-lazy learning extracts more value from each layer
- Shape Efficiency: Deeper models become viable, enabling hardware-optimized shapes
Practical Implementation Guidelines
Minimal Implementation
The following shows the key modifications needed to implement CompleteP [3]:

```python
# In the Block forward pass: scale each residual branch by 1 / m_L^alpha
def forward(self, x):
    residual_scaling = 1 / (self.depth_multiplier ** self.depth_alpha_exp)
    x = x + residual_scaling * self.attn(self.ln_1(x))
    x = x + residual_scaling * self.mlp(self.ln_2(x))
    return x
```

Optimizer Configuration
CompleteP requires separate optimizer parameter groups with depth-dependent learning rate scaling:

```python
# Learning-rate scalings
width_lr_scaling = 1 / width_multiplier                       # matrix-like params: eta / m_d
depth_lr_scaling = depth_multiplier ** (depth_alpha_exp - 1)  # m_L^(alpha - 1); = 1 for alpha = 1
ln_bias_lr_scaling = 1 / depth_multiplier                     # LN gains and biases: eta / m_L

# AdamW epsilon scaling
emb_unemb_adam_eps = adam_eps / width_multiplier
hidden_adam_eps = adam_eps / (width_multiplier * depth_multiplier)

optim_groups = [
    {'params': emb_params, 'weight_decay': weight_decay,
     'lr_scale': 1.0, 'eps': emb_unemb_adam_eps},
    {'params': hidden_ln_params, 'weight_decay': 0.0,
     'lr_scale': ln_bias_lr_scaling, 'eps': hidden_adam_eps},
    {'params': hidden_weight_params, 'weight_decay': weight_decay * width_multiplier,
     'lr_scale': width_lr_scaling * depth_lr_scaling, 'eps': hidden_adam_eps},
    {'params': hidden_bias_params, 'weight_decay': 0.0,
     'lr_scale': ln_bias_lr_scaling, 'eps': hidden_adam_eps},
]
```

Configuration Checklist
When implementing CompleteP:

- Set `depth_alpha_exp = 1.0` (the key CompleteP parameter)
- Implement residual scaling: `residual_scaling = 1 / depth_multiplier`
- Use separate optimizer groups for LN gains, hidden weights, and biases
- Scale AdamW $\epsilon$ by the width and depth multipliers
- Scale weight decay for hidden weights by the width multiplier
Reference Implementation
A minimal implementation is available at:
Summary and Key Takeaways
What CompleteP Achieves
- Depth-wise HP Transfer: Optimal learning rates and initialization scales transfer across model depths without re-tuning
- Non-Lazy Learning: Every layer learns genuinely non-linear features, maximizing the value of model depth
- Hardware Flexibility: Enables a wider range of width-to-depth ratios to be compute-efficient
- Empirical Gains: 12-34% compute efficiency improvements over prior state-of-the-art
When to Use CompleteP
| Scenario | Recommendation |
|---|---|
| Scaling depth significantly | Essential |
| Multi-model family training | Highly recommended |
| Hardware-specific shape optimization | Recommended |
| Single small model only | µP sufficient |
Key Parameters
| Parameter | Value for CompleteP | Purpose |
|---|---|---|
| $\alpha$ (`depth_alpha_exp`) | 1.0 | Core CompleteP parameter |
| Residual scaling | $1/m_L$ | Depth-dependent residual scaling |
| Hidden LR scaling | $\eta_{\text{base}}/m_d$ | Learning rate adjustment |
| AdamW $\epsilon$ scaling | $\epsilon_{\text{base}}/(m_d \cdot m_L)$ | Numerical stability |
References
Related Papers
- Depthwise hyperparameter transfer in residual networks — Bordelon et al., 2023
- Infinite limits of multi-head transformer dynamics — Bordelon et al., 2024
- The practitioner’s guide to the maximal update parameterization — Dey et al., 2024