Introduction: The Problem of HP Transfer Across Model Sizes
A fundamental challenge in training Large Language Models (LLMs) is determining how to scale hyperparameters (HPs) when changing model architecture. Modern deep learning's core paradigm—that larger models yield better performance [1]—creates enormous pressure to scale model width and depth efficiently.
The Scaling Challenge: When practitioners scale models from small experiments to production deployments, they face two critical problems:
- HP Retuning Overhead: Traditional parameterizations require re-tuning learning rates and other HPs for each model size, which is computationally expensive
- Sub-optimal Training: When re-tuning is prohibitive, models are trained with sub-optimal HPs, wasting valuable compute resources
The maximal update parameterization (µP) [2] was developed to address HP transfer when scaling width. However, extending this to simultaneous width and depth scaling remained an open problem.
This article introduces CompleteP [3], the parameterization that achieves both:
- Depth-wise HP transfer: Optimal base HPs remain approximately constant when scaling model depth
- Non-lazy learning in all layers: Every layer learns genuinely non-linear features, not just linearizations
CompleteP enables 12-34% compute efficiency improvements over prior state-of-the-art parameterizations.
The Lazy Learning Regime in Deep Networks
What is Lazy Learning?
In deep networks, lazy learning refers to a regime where layers learn features that remain very close to their linearization around initialization [4]. This phenomenon occurs when the network's non-linear components contribute minimally to the learned representations.
Mathematical Definition: A layer is in the lazy learning regime if its behavior is well approximated by a first-order Taylor expansion around initialization:

$$f(\theta) \approx f(\theta_0) + \nabla_\theta f(\theta_0)^\top (\theta - \theta_0)$$

where $f(\theta)$ is the network output with parameters $\theta$, and $\theta_0$ is the initialization.
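To make the regime concrete, here is a toy NumPy sketch (illustrative only; the single-layer network, sizes, and step scales are hypothetical, not from the paper) comparing the true network output against its linearization for small versus O(1) parameter updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network: f(W) = v . tanh(W x)
d = 32
x = rng.normal(size=d) / np.sqrt(d)
W0 = rng.normal(size=(d, d))
v = rng.normal(size=d) / np.sqrt(d)

def f(W):
    return v @ np.tanh(W @ x)

# Gradient of f with respect to W at initialization W0
grad = np.outer(v * (1 - np.tanh(W0 @ x) ** 2), x)

def f_lin(W):
    # First-order Taylor expansion of f around W0
    return f(W0) + np.sum(grad * (W - W0))

dW = rng.normal(size=(d, d))
# Lazy regime: tiny update, linearization is nearly exact
err_small = abs(f(W0 + 1e-3 * dW) - f_lin(W0 + 1e-3 * dW))
# Non-lazy regime: O(1) update, non-linear terms matter
err_large = abs(f(W0 + 1.0 * dW) - f_lin(W0 + 1.0 * dW))
print(err_small, err_large)
```

The linearization error grows quadratically with the update size, so a layer whose updates stay tiny behaves as a linear model around initialization.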
Why Lazy Learning is Problematic
When networks operate in the lazy regime:
| Aspect | Impact |
|---|---|
| Feature Diversity | Layers learn similar (linear) features, wasting depth |
| Non-linearity Unused | Activation functions contribute little to learning |
| Depth Inefficiency | Adding more layers provides diminishing returns |
| Compute Waste | Training compute is spent without proportional benefit |
Previous work [5] argued that non-lazy learning could not be achieved at any depth scaling exponent while preserving HP transfer, and that α = 1/2 was optimal in practice. This suggested that lazy learning was an unavoidable trade-off.
CompleteP Parameterization Design
Core Design Principle
CompleteP sets the depth scaling exponent α = 1, which gives the parameterization its name. This is the unique value of α that ensures complete feature learning (non-lazy learning in all layers) while maintaining HP transfer.
The Residual Connection Formula
The key difference between parameterizations lies in how the transformer's residual block outputs are scaled:

$$x^\ell = x^{\ell-1} + m_L^{-\alpha}\,\mathcal{F}^\ell\!\left(x^{\ell-1}\right)$$

where:
- $x^\ell$ is the residual stream at layer $\ell$
- $\mathcal{F}^\ell$ is the $\ell$-th residual block (MLP or attention)
- $m_L = L / L_{\text{base}}$ is the depth multiplier, with $L$ the total number of layers
- $\alpha$ controls the depth-dependent re-scaling
Parameterization Comparison
| Parameterization | $\alpha$ Value | Key Properties |
|---|---|---|
| SP (Standard) | N/A | No depth scaling; unstable at depth |
| µP | N/A | Width scaling only; no depth prescription |
| Mean-Field | $1/2$ | Partial depth scaling; still lazy |
| CompleteP | $1$ | Full depth scaling; non-lazy learning |
CompleteP Scaling Rules
CompleteP extends µP with principled depth scaling. The complete scaling rules for a Pre-LN transformer [3]:
Width Multipliers
| Component | SP | µP | CompleteP ($\alpha = 1$) |
|---|---|---|---|
| Embedding Init Var | $\sigma_{\text{base}}^2$ | $\sigma_{\text{base}}^2$ | $\sigma_{\text{base}}^2$ |
| Embedding LR | $\eta_{\text{base}}$ | $\eta_{\text{base}}$ | $\eta_{\text{base}}$ |
| Hidden Init Var | $\sigma_{\text{base}}^2 / m_d$ | $\sigma_{\text{base}}^2 / m_d$ | $\sigma_{\text{base}}^2 / m_d$ |
| Hidden LR (Adam) | $\eta_{\text{base}}$ | $\eta_{\text{base}} / m_d$ | $\eta_{\text{base}} / m_d$ |

where:
- $m_d = d / d_{\text{base}}$ (width multiplier)
- $m_L = L / L_{\text{base}}$ (depth multiplier)
Depth-Specific Corrections (CompleteP)
CompleteP requires additional corrections for stable training:

- Residual Scaling Factor: block outputs are multiplied by $1/m_L$
- LayerNorm Learning Rate: $\eta_{\text{base}} / m_L$ for LN gains and biases
- Weight Decay: $\lambda_{\text{base}} \cdot m_d$ for hidden weights under decoupled AdamW
- AdamW $\epsilon$:
  - Embedding layers: $\epsilon_{\text{base}} / m_d$
  - Hidden layers: $\epsilon_{\text{base}} / (m_d \cdot m_L)$
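As a worked example, the scaling rules above can be collected into one helper. This is a sketch under the rules as stated here; the `scale_hps` function and the base values are illustrative, not from the paper:

```python
def scale_hps(d, L, d_base=256, L_base=24,
              sigma2=0.02, eta=6e-3, eps=1e-8, wd=0.1, alpha=1.0):
    """Apply CompleteP width/depth scaling to base HPs (illustrative sketch)."""
    m_d = d / d_base  # width multiplier
    m_L = L / L_base  # depth multiplier
    return {
        'embedding_init_var': sigma2,
        'embedding_lr': eta,
        'hidden_init_var': sigma2 / m_d,
        'hidden_lr': eta / m_d,            # Adam, matrix-like hidden params
        'ln_bias_lr': eta / m_L,           # LayerNorm gains and biases
        'residual_scale': 1.0 / m_L ** alpha,
        'hidden_weight_decay': wd * m_d,   # decoupled AdamW
        'emb_adam_eps': eps / m_d,
        'hidden_adam_eps': eps / (m_d * m_L),
    }

# Scale a 256-wide, 24-layer base model up to 2048 wide and 96 layers deep
hps = scale_hps(d=2048, L=96)
print(hps)
```

With $m_d = 8$ and $m_L = 4$, the hidden LR shrinks by 8x, the residual branches by 4x, and the hidden AdamW $\epsilon$ by 32x.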
Depth-wise HP Transfer Guarantees
Empirical Evidence
The paper demonstrates that only CompleteP (α = 1) achieves reliable depth-wise HP transfer [3]:

Finding 1: With SP, µP, and α = 0.5, the optimal learning rate and initialization standard deviation do not remain stable as depth varies.

Finding 2: Only CompleteP (α = 1) achieves reliable depth-wise HP transfer.
Coordinate Check Results
The paper validates HP transfer using coordinate checks—training with different depth multipliers and verifying that:
- Loss curves align when using transferred HPs
- Optimal HPs remain constant across depths
- CompleteP (α = 1): concentric loss curves with stable minima
- α = 0.5: optimal HPs diverge with depth
- µP/SP: unstable training at depth
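The stabilizing effect of the $1/m_L$ branch multiplier can already be seen at initialization. The following toy NumPy sketch (illustrative, not from the paper; the block is a simplified stand-in for attention/MLP) tracks the residual-stream RMS through a stack of random residual blocks with and without CompleteP's scaling:

```python
import numpy as np

def stream_rms(L, L_base=8, d=128, scaled=True, seed=0):
    """RMS of the residual stream after L random residual blocks."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    branch_scale = L_base / L if scaled else 1.0  # 1/m_L for alpha = 1
    for _ in range(L):
        W = rng.normal(size=(d, d)) / np.sqrt(d)  # variance ~ 1/fan_in
        x = x + branch_scale * np.tanh(W @ x)     # simplified residual block
    return float(np.sqrt(np.mean(x ** 2)))

for L in (8, 32, 128):
    print(L, stream_rms(L, scaled=True), stream_rms(L, scaled=False))
```

Without the multiplier the stream RMS grows with depth, so activation scales (and hence optimal HPs) shift as layers are added; with the $1/m_L$ multiplier the stream stays at a depth-independent scale.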
Compute-Optimal Setting
Even under the compute-optimal setup (20 tokens per parameter), CompleteP shows superior HP transfer:
- Reduced sensitivity to learning rate choice
- Consistent loss improvements across depths
- No additional HP tuning required when scaling depth
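For scale, the 20 tokens-per-parameter rule fixes the token budget directly. A quick back-of-the-envelope sketch, using the standard $6ND$ approximation for transformer training FLOPs:

```python
# Compute-optimal token budget at 20 tokens per parameter
params = 1.9e9               # e.g., a 1.9B-parameter model
tokens = 20 * params         # 20 tokens per parameter
flops = 6 * params * tokens  # standard 6*N*D training-FLOP approximation
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

A 1.9B model thus needs roughly 38B tokens, i.e. about $4.3 \times 10^{20}$ training FLOPs.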
Non-Lazy Learning in All Layers: Proof Sketch
The Lazy vs Non-Lazy Spectrum
Yang et al. (2023) [5] argued that α = 1/2 was optimal based on feature diversity considerations. CompleteP's key insight is that this argument only holds in the lazy learning regime.
Complete Feature Learning Requirement
CompleteP introduces a refined desideratum for HP transfer that includes complete feature learning [3]:
Complete Feature Learning: The learned representation in every layer of the model must remain non-lazy (i.e., non-linear) with respect to the parameters in both that layer and all earlier layers.
Mathematical Framework
Consider the network's representation at layer $\ell$:

$$x^\ell = x^{\ell-1} + m_L^{-\alpha}\,\mathcal{F}^\ell\!\left(x^{\ell-1}\right)$$

Critical Insight: For $\alpha = 1$, the contribution from each residual block is scaled by $1/m_L$, ensuring:
- First-order (linear) terms contribute at a scale that sums to $\Theta(1)$ across depth
- Higher-order (non-linear) terms remain significant rather than vanishing as depth grows
- Lazy regime avoided: non-linear features cannot be absorbed into linear approximations
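The bookkeeping behind this insight can be sketched as follows (an informal consistency check, not the paper's proof). With block outputs scaled by $m_L^{-1}$ and Adam-style per-block updates of size $\Theta(\eta)$, the total change in the residual stream over one step is

$$\Delta x^L \;\approx\; \sum_{\ell=1}^{L} m_L^{-1}\,\Delta \mathcal{F}^\ell \;=\; \Theta\!\left(L \cdot m_L^{-1} \cdot \eta\right) \;=\; \Theta\!\left(L_{\text{base}}\,\eta\right),$$

which is independent of the depth multiplier $m_L$: the stream update stays $\Theta(1)$ while each block's internal update $\Delta \mathcal{F}^\ell$ remains $\Theta(\eta)$, large enough to move that block off its linearization.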
Theorem Sketch (Informal)
Theorem: Under CompleteP parameterization (α = 1), as width $d$ and depth $L \to \infty$ with finite ratio $d/L$:
- Every layer’s learned features are non-lazy with respect to all earlier layers
- The training dynamics remain in the feature learning regime rather than the NTK/lazy regime
- HP transfer holds: optimal base HPs are constant across model sizes
This theorem distinguishes CompleteP from all other parameterizations.
Hardware-Aware Shape Optimization
The N:L Ratio Question
Kaplan et al. (2020) [6] established the compute-optimal width-to-depth (N:L) ratio using SP. However, SP does not admit stable width and depth scaling, undermining this conclusion.
CompleteP Enables Wider Design Space
CompleteP’s stable depth scaling enables revisiting the compute-optimal width-to-depth ratio:
| N:L Ratio | SP/µP Viability | CompleteP Viability |
|---|---|---|
| 100:1 | Stable | Stable |
| 50:1 | Moderate | Stable |
| 20:1 | Challenging | Stable |
| 10:1 | Poor | Stable |
| 1:1 | Unstable | Viable |
Hardware Considerations
Different hardware configurations favor different N:L ratios:
| Hardware Type | Preferred N:L | Rationale |
|---|---|---|
| GPU (memory-bound) | Higher N:L | Better memory utilization |
| Wafer-scale (Cerebras) | Variable | Massive memory bandwidth enables deeper models |
| Edge devices | Lower N:L | Memory constraints favor depth efficiency |
CompleteP’s key advantage: Practitioners can now choose shapes based on hardware characteristics rather than training stability constraints.
Compute Efficiency Improvements
FLOP Savings
The paper reports substantial compute efficiency improvements with CompleteP [3]:
| Model Configuration | FLOP Savings vs µP |
|---|---|
| Optimally-shaped 1.9B model | 11.8% |
| 179-layer deep model | 34.4% |
Training Curves
Models trained with CompleteP achieve:
- Faster convergence (lower loss at same token count)
- Better final performance (lower asymptotic loss)
- Improved scaling with depth
Efficiency Breakdown
The 12-34% improvement comes from three sources:
- HP Transfer Efficiency: Eliminating re-tuning costs
- Training Efficiency: Non-lazy learning extracts more value from each layer
- Shape Efficiency: Deeper models become viable, enabling hardware-optimized shapes
Practical Implementation Guidelines
Minimal Implementation
The following shows the key modifications needed to implement CompleteP [3]:

```python
# In the Block forward pass: scale each residual branch by 1 / m_L^alpha
def forward(self, x):
    residual_scaling = 1 / (self.depth_multiplier ** self.depth_alpha_exp)
    x = x + residual_scaling * self.attn(self.ln_1(x))
    x = x + residual_scaling * self.mlp(self.ln_2(x))
    return x
```

Optimizer Configuration
CompleteP requires separate optimizer parameter groups with depth-dependent learning rate scaling:

```python
# Learning-rate scalings
width_lr_scaling = 1 / width_multiplier                       # matrix-like params: eta / m_d
depth_lr_scaling = depth_multiplier ** (depth_alpha_exp - 1)  # m_L^(alpha - 1); = 1 for alpha = 1
ln_bias_lr_scaling = 1 / depth_multiplier                     # LN gains and biases: eta / m_L

# AdamW epsilon scaling
emb_unemb_adam_eps = adam_eps / width_multiplier
hidden_adam_eps = adam_eps / (width_multiplier * depth_multiplier)

optim_groups = [
    {'params': emb_params, 'weight_decay': weight_decay,
     'lr_scale': 1.0, 'eps': emb_unemb_adam_eps},
    {'params': hidden_ln_params, 'weight_decay': 0.0,
     'lr_scale': ln_bias_lr_scaling, 'eps': hidden_adam_eps},
    {'params': hidden_weight_params, 'weight_decay': weight_decay * width_multiplier,
     'lr_scale': width_lr_scaling * depth_lr_scaling, 'eps': hidden_adam_eps},
    {'params': hidden_bias_params, 'weight_decay': 0.0,
     'lr_scale': ln_bias_lr_scaling, 'eps': hidden_adam_eps},
]
```

Configuration Checklist
When implementing CompleteP:

- Set `depth_alpha_exp = 1.0` (the key CompleteP parameter)
- Implement residual scaling: `residual_scaling = 1 / depth_multiplier`
- Use separate optimizer groups for LN gains, hidden weights, and biases
- Scale AdamW $\epsilon$ by the width and depth multipliers
- Scale weight decay for hidden weights by the width multiplier
Reference Implementation
A minimal implementation is available at:
Summary and Key Takeaways
What CompleteP Achieves
- Depth-wise HP Transfer: Optimal learning rates and initialization scales transfer across model depths without re-tuning
- Non-Lazy Learning: Every layer learns genuinely non-linear features, maximizing the value of model depth
- Hardware Flexibility: Enables a wider range of width-to-depth ratios to be compute-efficient
- Empirical Gains: 12-34% compute efficiency improvements over prior state-of-the-art
When to Use CompleteP
| Scenario | Recommendation |
|---|---|
| Scaling depth significantly | Essential |
| Multi-model family training | Highly recommended |
| Hardware-specific shape optimization | Recommended |
| Single small model only | µP sufficient |
Key Parameters
| Parameter | Value for CompleteP | Purpose |
|---|---|---|
| $\alpha$ (`depth_alpha_exp`) | 1.0 | Core CompleteP parameter |
| Residual scaling | $1/m_L$ | Depth-dependent residual scaling |
| Hidden LR scaling | $\eta_{\text{base}}/m_d$ | Learning rate adjustment |
| AdamW $\epsilon$ scaling | $\epsilon_{\text{base}}/(m_d \cdot m_L)$ | Numerical stability |
References
Related Papers
- Depthwise hyperparameter transfer in residual networks — Bordelon et al., 2023
- Infinite limits of multi-head transformer dynamics — Bordelon et al., 2024
- The practitioner’s guide to the maximal update parameterization — Dey et al., 2024