Introduction: The Problem of HP Transfer Across Model Sizes

A fundamental challenge in training Large Language Models (LLMs) is determining how to scale hyperparameters (HPs) when changing model architecture. Modern deep learning’s core paradigm—that larger models yield better performance [1]—creates enormous pressure to scale model width and depth efficiently.

The Scaling Challenge: When practitioners scale models from small experiments to production deployments, they face two critical problems:

  1. HP Retuning Overhead: Traditional parameterizations require re-tuning learning rates and other HPs for each model size, which is computationally expensive
  2. Sub-optimal Training: When re-tuning is prohibitive, models are trained with sub-optimal HPs, wasting valuable compute resources

The maximal update parameterization (µP) [2] was developed to address HP transfer when scaling width. However, extending this to simultaneous width and depth scaling remained an open problem.

This article introduces CompleteP [3], the parameterization that achieves both:

  • Depth-wise HP transfer: Optimal base HPs remain approximately constant when scaling model depth
  • Non-lazy learning in all layers: Every layer learns genuinely non-linear features, not just linearizations

CompleteP enables 12-34% compute efficiency improvements over prior state-of-the-art parameterizations.


The Lazy Learning Regime in Deep Networks

What is Lazy Learning?

In deep networks, lazy learning refers to a regime where layers learn features that remain very close to their linearization around initialization [4]. This phenomenon occurs when the network’s non-linear components contribute minimally to the learned representations.

Mathematical Definition: A layer is in the lazy learning regime if, throughout training, the network output stays close to its first-order Taylor expansion around initialization:

$$f(\theta) \approx f(\theta_0) + \nabla_\theta f(\theta_0)^\top (\theta - \theta_0)$$

where f(θ) is the network output with parameters θ, and θ_0 is the initialization.
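
To make the definition concrete, here is a small self-contained sketch (ours, not from the paper) that compares a network’s actual output after a parameter change with its first-order (lazy) prediction using torch.func; a small relative deviation indicates lazy behaviour for that parameter change:

import torch
from torch.func import functional_call, jvp

torch.manual_seed(0)

# A toy MLP standing in for a single layer/block (illustrative only)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
params0 = {k: v.detach().clone() for k, v in model.named_parameters()}
x = torch.randn(8, 16)

def f(params, inputs):
    return functional_call(model, params, (inputs,))

# Pretend training moved the parameters by delta
delta = {k: 0.1 * torch.randn_like(v) for k, v in params0.items()}
params1 = {k: params0[k] + delta[k] for k in params0}

# Lazy (first-order Taylor) prediction: f(theta_0) + grad f(theta_0) . delta
f0, first_order = jvp(lambda p: f(p, x), (params0,), (delta,))
lazy_prediction = f0 + first_order
actual = f(params1, x)

gap = (actual - lazy_prediction).norm() / actual.norm()
print(f"relative deviation from linearization: {gap.item():.4f}")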

Why Lazy Learning is Problematic

When networks operate in the lazy regime:

| Aspect | Impact |
| --- | --- |
| Feature Diversity | Layers learn similar (linear) features, wasting depth |
| Non-linearity Unused | Activation functions contribute little to learning |
| Depth Inefficiency | Adding more layers provides diminishing returns |
| Compute Waste | Training compute is spent without proportional benefit |

Previous work [5] argued that exact HP transfer was not truly possible for any depth scaling exponent, and that α = 0.5 was optimal in practice. This suggested that lazy learning was an unavoidable trade-off.


CompleteP Parameterization Design

Core Design Principle

CompleteP sets the depth scaling exponent α = 1, which gives the parameterization its name: it is the unique value of α that ensures complete feature learning (non-lazy learning in all layers) while maintaining HP transfer.

The Residual Connection Formula

The key difference between parameterizations lies in how the transformer’s residual block outputs are scaled:

$$x^{\ell} = x^{\ell-1} + L^{-\alpha}\, F_{\ell}\left(x^{\ell-1}\right)$$

where:

  • x^ℓ is the residual stream at layer ℓ
  • F_ℓ is the ℓ-th residual block (MLP or attention)
  • L is the total number of layers
  • α controls the depth-dependent re-scaling

In code, this scaling is applied relative to a base depth, i.e. as m_L^(−α) with m_L = L / L_base.

Parameterization Comparison

| Parameterization | α Value | Key Properties |
| --- | --- | --- |
| SP (Standard) | N/A | No depth scaling; unstable at depth |
| µP | N/A | Width scaling only; depth not addressed |
| Mean-Field | α = 0.5 | Partial depth scaling; still lazy |
| CompleteP | α = 1 | Full depth scaling; non-lazy learning |

CompleteP Scaling Rules

CompleteP extends µP with principled depth scaling. The complete scaling rules for a Pre-LN transformer [3]:

Width Multipliers

| Component | SP | µP | CompleteP (α = 1) |
| --- | --- | --- | --- |
| Embedding init var | σ_base² | σ_base² | σ_base² |
| Embedding LR | η_base | η_base | η_base |
| Hidden init var | σ_base² / m_d | σ_base² / m_d | σ_base² / m_d |
| Hidden LR (Adam) | η_base | η_base / m_d | η_base / m_d |

where:

  • m_d = d / d_base (width multiplier)
  • m_L = L / L_base (depth multiplier)

Additional Corrections (CompleteP)

CompleteP requires additional corrections for stable training (a worked example follows the list):

  1. Residual scaling factor: multiply each residual block output by m_L^(−α) = 1 / m_L
  2. LayerNorm and bias learning rate: η_LN = η_base / m_L
  3. Weight decay: λ = λ_base · m_d for hidden weights, keeping the product η · λ fixed as the hidden LR shrinks with width
  4. AdamW ε:

    • Embedding layers: ε_base / m_d
    • Hidden layers: ε_base / (m_d · m_L)
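
As a worked illustration of these rules (our own numbers, not a configuration from the paper), the following sketch scales base HPs tuned on a small proxy model up to a wider and deeper target model:

# Illustrative HP scaling under the rules above; all concrete values are ours
base_width, base_depth = 256, 8
target_width, target_depth = 2048, 64

m_d = target_width / base_width   # width multiplier = 8
m_L = target_depth / base_depth   # depth multiplier = 8

# HPs tuned on the small proxy model
base_lr, base_init_var, base_eps, base_wd = 3e-3, 0.02 ** 2, 1e-8, 0.1

scaled = {
    "embedding_lr":        base_lr,                 # unchanged
    "hidden_lr":           base_lr / m_d,           # width correction only (alpha = 1)
    "ln_bias_lr":          base_lr / m_L,           # depth correction
    "hidden_init_var":     base_init_var / m_d,
    "residual_scaling":    1.0 / m_L,               # alpha = 1 branch multiplier
    "hidden_weight_decay": base_wd * m_d,           # keep lr * weight_decay fixed
    "emb_adam_eps":        base_eps / m_d,
    "hidden_adam_eps":     base_eps / (m_d * m_L),
}
for name, value in scaled.items():
    print(f"{name:>20}: {value:.3g}")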

Depth-wise HP Transfer Guarantees

Empirical Evidence

The paper demonstrates that only CompleteP (α = 1) achieves reliable depth-wise HP transfer [3]:

Finding 1: With SP, µP, and α = 0.5, optimal learning rate and initialization standard deviation do not remain stable as depth varies.

Finding 2: Only CompleteP (α = 1) achieves reliable depth-wise HP transfer.

Coordinate Check Results

The paper validates HP transfer using coordinate checks—training with different depth multipliers and verifying that:

  • Loss curves align when using transferred HPs
  • Optimal HPs remain constant across depths
The qualitative picture across parameterizations:

  • CompleteP (α = 1): concentric loss curves with stable minima
  • α = 0.5: optimal HPs diverge as depth grows
  • µP/SP: unstable training at depth

Compute-Optimal Setting

Even under the compute-optimal setup (20 tokens per parameter; see the back-of-the-envelope estimate after the list below), CompleteP shows superior HP transfer:

  • Reduced sensitivity to learning rate choice
  • Consistent loss improvements across depths
  • No additional HP tuning required when scaling depth
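
For scale, a rough sketch of what a 20 tokens-per-parameter budget implies (our arithmetic, using the standard ~6 · N · D estimate for training FLOPs, not figures from the paper):

# Chinchilla-style budget: ~20 tokens per parameter
n_params = 1.9e9                      # e.g. a 1.9B-parameter model
tokens = 20 * n_params                # ~38B tokens
train_flops = 6 * n_params * tokens   # ~6 * N * D rule of thumb
print(f"{tokens / 1e9:.0f}B tokens, ~{train_flops:.2e} training FLOPs")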

Non-Lazy Learning in All Layers: Proof Sketch

The Lazy vs Non-Lazy Spectrum

Yang et al. (2023) [5] argued that α = 0.5 was optimal based on feature diversity considerations. CompleteP’s key insight is that this argument only holds in the lazy learning regime.

Complete Feature Learning Requirement

CompleteP introduces a refined desideratum for HP transfer that includes complete feature learning [3]:

Complete Feature Learning: The learned representation in every layer of the model must remain non-lazy (i.e., non-linear) with respect to the parameters in both that layer and all earlier layers.

Mathematical Framework

Consider the network’s representation at layer ℓ:

$$x^{\ell} = x^{0} + L^{-\alpha} \sum_{k=1}^{\ell} F_k\left(x^{k-1}\right)$$

Critical Insight: For α = 1, the contribution from each residual block is scaled by 1/L (a rough scaling argument is sketched after this list), ensuring that:

  1. Order-1 (linear) terms still contribute at order one once summed over all L layers
  2. Order-2 and higher (non-linear) terms remain significant as the model scales
  3. Lazy regime avoided: non-linear features cannot be absorbed into linear approximations
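
A rough sketch of the underlying scaling argument (our notation, not the paper’s derivation). Expanding one residual branch around its initial weights W_0 with update ΔW:

$$L^{-\alpha} F_\ell(x;\, W_0 + \Delta W) = L^{-\alpha} F_\ell(x;\, W_0) + L^{-\alpha}\, \partial_W F_\ell(x;\, W_0)[\Delta W] + L^{-\alpha}\, O\!\left(\lVert \Delta W \rVert^2\right)$$

With α = 0.5, keeping training stable forces per-block updates of size ‖ΔW‖ ~ L^(−1/2), so the quadratic term is O(L^(−3/2)) and vanishes relative to the O(1/L) linear term as depth grows: each block is effectively linearized, i.e. lazy. With α = 1, per-block updates can remain order one, so the linear and higher-order terms are the same size and every block stays genuinely non-linear.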

Theorem Sketch (Informal)

Theorem: Under the CompleteP parameterization (α = 1), as width and depth tend to infinity with their ratio held finite:

  • Every layer’s learned features are non-lazy with respect to all earlier layers
  • The training dynamics remain in the feature learning regime rather than the NTK/lazy regime
  • HP transfer holds: optimal base HPs are constant across model sizes

This theorem distinguishes CompleteP from all other parameterizations.


Hardware-Aware Shape Optimization

The N:L Ratio Question

Kaplan et al. (2020) [6] established that width-to-depth ratios on the order of 100:1 are compute-optimal using SP. However, SP does not admit stable width and depth scaling, so this comparison was never fair and the conclusion is called into question.

CompleteP Enables Wider Design Space

CompleteP’s stable depth scaling enables revisiting the compute-optimal width-to-depth ratio:

| N:L Ratio | SP/µP Viability | CompleteP Viability |
| --- | --- | --- |
| 100:1 | Stable | Stable |
| 50:1 | Moderate | Stable |
| 20:1 | Challenging | Stable |
| 10:1 | Poor | Stable |
| 1:1 | Unstable | Viable |

Hardware Considerations

Different hardware configurations favor different N:L ratios:

| Hardware Type | Preferred N:L | Rationale |
| --- | --- | --- |
| GPU (memory-bound) | Higher N:L | Better memory utilization |
| Wafer-scale (Cerebras) | Variable | Massive memory bandwidth enables deeper models |
| Edge devices | Lower N:L | Memory constraints favor depth efficiency |

CompleteP’s key advantage: Practitioners can now choose shapes based on hardware characteristics rather than training stability constraints.
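
To make this concrete, here is a small sketch (our arithmetic, using the common ≈ 12 · L · d² estimate for non-embedding parameters in a standard transformer block) comparing two shapes with roughly equal parameter counts but very different width-to-depth ratios:

# Two shapes with roughly the same non-embedding parameter count
def approx_params(d_model: int, n_layers: int) -> int:
    # ~12 * L * d^2: attention (4 d^2) + MLP with 4x expansion (8 d^2) per layer
    return 12 * n_layers * d_model ** 2

wide_shallow = approx_params(d_model=4096, n_layers=32)    # N:L = 128:1
narrow_deep = approx_params(d_model=1448, n_layers=256)    # N:L ~ 5.7:1

print(f"wide/shallow: {wide_shallow / 1e9:.2f}B params")
print(f"narrow/deep:  {narrow_deep / 1e9:.2f}B params")
# Under SP/µP the deep shape is hard to train stably; under CompleteP both are
# viable, so the choice can be driven by hardware characteristics instead.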


Compute Efficiency Improvements

FLOP Savings

The paper reports substantial compute efficiency improvements with CompleteP [3]:

| Model Configuration | FLOP Savings vs µP |
| --- | --- |
| Optimally-shaped 1.9B model | 11.8% |
| 179-layer deep model | 34.4% |

Training Curves

Models trained with CompleteP achieve:

  • Faster convergence (lower loss at same token count)
  • Better final performance (lower asymptotic loss)
  • Improved scaling with depth

Efficiency Breakdown

The 12-34% improvement comes from three sources:

  1. HP Transfer Efficiency: Eliminating re-tuning costs
  2. Training Efficiency: Non-lazy learning extracts more value from each layer
  3. Shape Efficiency: Deeper models become viable, enabling hardware-optimized shapes

Practical Implementation Guidelines

Minimal Implementation

The following shows the key modifications needed to implement CompleteP [3]:

# In the Block forward pass:
def forward(self, x):
    residual_scaling = 1 / (self.depth_multiplier ** self.depth_alpha_exp)
    x = x + residual_scaling * self.attn(self.ln_1(x))
    x = x + residual_scaling * self.mlp(self.ln_2(x))
    return x
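
For context, a minimal self-contained sketch of how this scaling might be wired into a block; the attention and MLP here are toy stand-ins, and the attribute names follow the snippet above rather than any particular codebase:

import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN block with CompleteP residual scaling (toy sub-modules)."""
    def __init__(self, d_model: int, n_layer: int, base_n_layer: int, alpha: float = 1.0):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.Linear(d_model, d_model)  # stand-in for attention
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.depth_multiplier = n_layer / base_n_layer  # m_L
        self.depth_alpha_exp = alpha                    # 1.0 for CompleteP

    def forward(self, x):
        residual_scaling = 1 / (self.depth_multiplier ** self.depth_alpha_exp)
        x = x + residual_scaling * self.attn(self.ln_1(x))
        x = x + residual_scaling * self.mlp(self.ln_2(x))
        return x

blk = Block(d_model=128, n_layer=64, base_n_layer=8)  # m_L = 8
print(blk(torch.randn(2, 16, 128)).shape)             # torch.Size([2, 16, 128])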

Optimizer Configuration

CompleteP requires separate optimizer groups with width- and depth-dependent learning rate scaling:

# Depth-dependent learning rate scaling
# Hidden weight LR scales as m_L^(alpha - 1), which is 1 for alpha = 1:
# the 1/m_L residual scaling already accounts for depth.
hidden_depth_lr_scaling = depth_multiplier ** (depth_alpha_exp - 1)
# LayerNorm gain and bias LR are scaled by 1/m_L for alpha = 1
ln_bias_lr_scaling = 1 / depth_multiplier
width_lr_scaling = 1 / width_multiplier

# AdamW epsilon scaling
emb_unemb_adam_eps = adam_eps / width_multiplier
hidden_adam_eps = adam_eps / (width_multiplier * depth_multiplier)

optim_groups = [
    {'params': emb_params, 'weight_decay': weight_decay, 'lr_scale': 1.0, 'eps': emb_unemb_adam_eps},
    {'params': hidden_ln_params, 'weight_decay': 0.0, 'lr_scale': ln_bias_lr_scaling, 'eps': hidden_adam_eps},
    {'params': hidden_weight_params, 'weight_decay': weight_decay * width_multiplier,  # keep lr * wd fixed
     'lr_scale': width_lr_scaling * hidden_depth_lr_scaling, 'eps': hidden_adam_eps},
    {'params': hidden_bias_params, 'weight_decay': 0.0, 'lr_scale': ln_bias_lr_scaling, 'eps': hidden_adam_eps},
]
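
One way to consume these groups, as an illustrative sketch: torch.optim.AdamW reads per-group 'lr', 'eps', and 'weight_decay', while 'lr_scale' above is just bookkeeping, so fold it into each group's learning rate before constructing the optimizer.

import torch

base_lr = 3e-3  # tuned on the small base model; illustrative value
for group in optim_groups:
    group['lr'] = base_lr * group.pop('lr_scale')

optimizer = torch.optim.AdamW(optim_groups, betas=(0.9, 0.95))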

Configuration Checklist

When implementing CompleteP:

  • Set depth_alpha_exp = 1.0 (the key CompleteP parameter)
  • Implement residual scaling: residual_scaling = 1/depth_multiplier
  • Separate optimizer groups for LN, hidden weights, biases
  • Scale AdamW ε by the width multiplier (embeddings) and by width × depth (hidden layers)
  • Scale weight decay for hidden weights with the width multiplier, keeping η · λ fixed

Reference Implementation

A minimal implementation is available at:

github.com/EleutherAI/nanoGPT-mup/tree/completep


Summary and Key Takeaways

What CompleteP Achieves

  1. Depth-wise HP Transfer: Optimal learning rates and initialization scales transfer across model depths without re-tuning

  2. Non-Lazy Learning: Every layer learns genuinely non-linear features, maximizing the value of model depth

  3. Hardware Flexibility: Enables a wider range of width-to-depth ratios to be compute-efficient

  4. Empirical Gains: 12-34% compute efficiency improvements over prior state-of-the-art

When to Use CompleteP

| Scenario | Recommendation |
| --- | --- |
| Scaling depth significantly | Essential |
| Multi-model family training | Highly recommended |
| Hardware-specific shape optimization | Recommended |
| Single small model only | µP sufficient |

Key Parameters

| Parameter | Value for CompleteP | Purpose |
| --- | --- | --- |
| α (depth_alpha_exp) | 1.0 | Core CompleteP parameter |
| Residual scaling | 1 / m_L | Depth-dependent residual scaling |
| Hidden LR scaling | η_base / m_d | Learning rate adjustment for width |
| AdamW ε scaling | ε_base / m_d (embeddings), ε_base / (m_d · m_L) (hidden) | Numerical stability |

References

Footnotes

  1. Deep learning scaling is predictable, empirically

  2. Tuning large neural networks via zero-shot hyperparameter transfer

  3. Don’t be lazy: CompleteP enables compute-efficient deep transformers

  4. On lazy training in differentiable programming

  5. Feature learning in infinite-depth neural networks

  6. Scaling Laws for Neural Language Models