Related deep dives:
- Transformer Evolution: the architectural lineage from the Transformer to the GPT series
- Emergent Abilities: the deeper connection between scaling laws and emergence
Overview
Scaling laws are among the most important empirical regularities in deep learning: a neural network's performance (typically measured by test loss or perplexity) follows predictable power-law relationships in its parameter count N, training data size D, and compute C (FLOPs). These regularities let researchers extrapolate model performance and make informed decisions about how to allocate training resources.1
Key Findings
The core finding of Kaplan et al.'s 2020 paper Scaling Laws for Neural Language Models:

L(N) = (N_c / N)^α_N, with α_N ≈ 0.076

where L is the test loss and N the parameter count. This implies:
- A 10x increase in parameters lowers the loss by roughly 16% (10^-0.076 ≈ 0.84)
- The law holds across many orders of magnitude
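The per-decade improvement follows directly from the exponent. A minimal sketch, using α_N ≈ 0.076 from the paper (the constant N_c below is illustrative only):

```python
# Evaluate the Kaplan-style power law L(N) = (N_c / N)^alpha_N.
# N_c is an illustrative constant; alpha_N ~ 0.076 is the fitted exponent.
N_c = 8.8e13
alpha_N = 0.076

def loss(N):
    return (N_c / N) ** alpha_N

# A 10x parameter increase multiplies the loss by 10^-alpha_N ~ 0.84,
# i.e. roughly a 16% reduction, independent of the starting size.
ratio = loss(1e10) / loss(1e9)
print(f"loss ratio for 10x params: {ratio:.3f}")  # ~0.839
```

Note that the ratio depends only on the exponent, not on N_c, which is why the "16% per decade" rule of thumb is scale-free.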
1. Classic Scaling Laws
1.1 The Original Scaling Laws (Kaplan et al.)
Paper: Scaling Laws for Neural Language Models1
Core formula:

L(N) = (N_c / N)^α_N

where:
- L: cross-entropy loss on the test set
- N: number of model parameters
- N_c, α_N: fitted parameters
Key observations:
- Loss decreases as a power law in parameter count (until a saturation point)
- Larger models are more sample-efficient (they reach the same performance with less data)
- Performance is more sensitive to model size than to training data volume
1.2 The Hoffmann et al. Revision
Paper: Training Compute-Optimal Large Language Models2
Key finding: re-examined the Kaplan et al. scaling laws and proposed a more precise resource-allocation recipe.
The Chinchilla law:

L(N, D) = E + A / N^α + B / D^β, with fitted exponents α ≈ 0.34 and β ≈ 0.28

Compute-optimal training: under a given compute budget C (with C ≈ 6ND), the optimal model size and data volume satisfy:

N_opt ∝ C^0.5, D_opt ∝ C^0.5

Rule of thumb:
- Model size and data should grow in equal proportion: 1000x more compute calls for roughly a 32x larger model trained on roughly 32x more data (about 20 tokens per parameter)
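The rule of thumb above can be sketched in a few lines, assuming the commonly cited C ≈ 6ND cost model and the headline ~20-tokens-per-parameter ratio:

```python
import math

def chinchilla_allocation(C, tokens_per_param=20.0):
    """Split a FLOPs budget C into (N, D) with C = 6*N*D and D = r*N."""
    N = math.sqrt(C / (6 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

N1, D1 = chinchilla_allocation(1e21)
N2, D2 = chinchilla_allocation(1e24)  # 1000x more compute
print(f"N grows {N2 / N1:.1f}x, D grows {D2 / D1:.1f}x")  # both ~31.6x
```

Because both N and D scale as √C, a 1000x compute increase multiplies each by √1000 ≈ 32.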
2. Theoretical Explanations of Scaling Laws
2.1 A Unified Framework
Paper: A Unified Theory of Neural Scaling Laws3
Core idea: unify the different kinds of scaling laws under a single framework.
Data-complexity view:
The amount of data needed to learn a pattern grows with the pattern's complexity, formalized via its Kolmogorov complexity K; aggregating over a power-law-distributed population of patterns then yields a power-law loss curve.
2.2 Random-Feature Theory
Paper: A Theory of Neural Scaling Laws via Random Features4
Core contribution: a rigorous derivation of scaling laws in the random feature model.
The random feature model:

f(x) = Σ_{i=1..P} a_i σ(⟨w_i, x⟩), with frozen random weights w_i and a nonlinear activation σ

Theoretical results:
- The generalization error decays as a power law in both the number of features P (playing the role of model size) and the number of samples n
- Balancing the two error terms yields the optimal trade-off between P and n under a fixed budget
This agrees closely with the empirical observations.
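The flavor of this result can be reproduced in a small experiment: ridge regression on random ReLU features of a synthetic nonlinear task, where test error falls as the number of features P grows. This is an illustrative setup of my own, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 2000, 500

# Synthetic target: a fixed random nonlinear function of the input
W_true = rng.normal(size=(d, 64))

def target(X):
    return np.tanh(X @ W_true).sum(axis=1)

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr, y_te = target(X_tr), target(X_te)

def rf_test_error(P):
    """Ridge regression on P random ReLU features."""
    W = rng.normal(size=(d, P))
    F_tr = np.maximum(X_tr @ W, 0)
    F_te = np.maximum(X_te @ W, 0)
    a = np.linalg.solve(F_tr.T @ F_tr + 1e-3 * np.eye(P), F_tr.T @ y_tr)
    return np.mean((F_te @ a - y_te) ** 2)

errors = [rf_test_error(P) for P in (16, 64, 256, 1024)]
print(errors)  # error shrinks as P grows
```

Plotting log(error) against log(P) for such runs is the standard way to read off an empirical scaling exponent.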
2.3 The Origin of the Power Law
Paper: On the Origin of Neural Scaling Laws5
Key finding: the statistical structure of language data itself (not the learning algorithm) is the root of scaling laws.
Key insights:
- The power-law (Zipfian) distribution of language is directly tied to the power-law decay of the loss
- The difficulty of predicting rare tokens determines the scaling exponent
Formalization:
Let p(i) ∝ i^-s be the frequency of the i-th most common token (Zipf's law). Then the optimal loss decays as a power law in model and data size, with an exponent determined by s.
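One concrete consequence of a Zipfian distribution: for s > 1, the probability mass beyond the k most frequent tokens itself decays as roughly k^(1-s), so a model that has only mastered the top k tokens improves as a power law as its effective k grows. A small numeric check (illustrative parameters):

```python
import numpy as np

s = 1.5
V = 1_000_000
ranks = np.arange(1, V + 1)
p = ranks ** (-s)
p /= p.sum()  # Zipf distribution over V token types

def tail_mass(k):
    """Probability mass on tokens rarer than rank k."""
    return p[k:].sum()

# For s = 1.5, tail_mass(k) ~ k^(1-s) = k^-0.5, so a 100x larger k
# cuts the unlearned tail mass by roughly 10x
ratio = tail_mass(10_000) / tail_mass(100)
print(f"{ratio:.3f}")  # ~0.09
```

The deviation from exactly 0.1 comes from the finite-vocabulary cutoff at rank V.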
3. Mathematical Forms of Scaling Laws
3.1 Basic Form
```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, A, alpha, B):
    """Classic scaling law: L(N) = A / N^alpha + B."""
    return A * np.power(N, -alpha) + B

def chinchilla_law(X, E, A, alpha, B, beta):
    """Chinchilla-style law: L(N, D) = E + A / N^alpha + B / D^beta."""
    N, D = X
    return E + A * np.power(N, -alpha) + B * np.power(D, -beta)

def fit_scaling_law(N_values, L_values):
    """Fit the single-variable scaling law."""
    popt, pcov = curve_fit(
        scaling_law,
        N_values,
        L_values,
        p0=[1.0, 0.5, 0.1],  # initial guess for (A, alpha, B)
        bounds=([0, 0, 0], [np.inf, 1, 1])
    )
    return popt  # (A, alpha, B)
```
3.2 The Three-Parameter Scaling Law
Paper: Scaling Data-Constrained Language Models6
When data is the binding constraint, the loss is fit with separate model and data terms plus an irreducible floor:

L(N, D) = A / N^α_N + A / D^α_D + L_∞
```python
class ThreeParameterScalingLaw:
    """Fit L(N, D) = A / N^alpha_n + A / D^alpha_d + L_inf by grid search."""

    def __init__(self):
        self.A = None
        self.alpha_n = None
        self.alpha_d = None
        self.L_inf = None

    def fit(self, N_train, D_train, L_test):
        """Coarse grid search over the four parameters (slow but dependency-free)."""
        best_loss = float('inf')
        for A in np.logspace(-2, 2, 20):
            for alpha_n in np.linspace(0.05, 0.3, 20):
                for alpha_d in np.linspace(0.05, 0.3, 20):
                    for L_inf in np.linspace(0, 1, 20):
                        L_pred = A * np.power(N_train, -alpha_n) + \
                                 A * np.power(D_train, -alpha_d) + L_inf
                        mse = np.mean((L_pred - L_test) ** 2)
                        if mse < best_loss:
                            best_loss = mse
                            self.A = A
                            self.alpha_n = alpha_n
                            self.alpha_d = alpha_d
                            self.L_inf = L_inf

    def predict(self, N, D):
        """Predict the loss for a given (N, D) configuration."""
        return (self.A * np.power(N, -self.alpha_n) +
                self.A * np.power(D, -self.alpha_d) +
                self.L_inf)
```
3.3 Compute-Optimal Scaling
Paper: Beyond Chinchilla: Optimal Compute at Every Scale7
Key finding: the Hoffmann et al. recipe is conservative; better compute allocations exist.
Optimal compute allocation: the exponent relating optimal model size to compute varies with scale, rather than staying fixed at Chinchilla's N_opt ∝ C^0.5.
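For the Chinchilla-style form L(N, D) = E + A/N^α + B/D^β with C = 6ND, the compute-optimal exponent can be checked numerically: grid-minimize over N at two budgets and measure how N_opt shifts, which recovers β/(α+β). A self-contained check using illustrative Chinchilla-like constants:

```python
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def best_N(C):
    """Grid-minimize L(N, D) subject to C = 6*N*D."""
    Ns = np.logspace(6, 13, 2000)
    Ds = C / (6 * Ns)
    L = E + A * Ns ** (-alpha) + B * Ds ** (-beta)
    return Ns[np.argmin(L)]

C1, C2 = 1e20, 1e24
measured = np.log(best_N(C2) / best_N(C1)) / np.log(C2 / C1)
predicted = beta / (alpha + beta)
print(f"measured exponent {measured:.3f} vs predicted {predicted:.3f}")
```

With α ≈ β the exponent sits near 0.5, which is exactly the Chinchilla "scale N and D together" prescription; debates like the one above amount to disagreements over the fitted α and β.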
4. The Predictive Power of Scaling Laws
4.1 Performance Prediction
```python
class ScalingPredictor:
    """Model performance prediction based on a fitted scaling law."""

    def __init__(self, fitted_params):
        self.A, self.alpha_n, self.alpha_d, self.L_inf = fitted_params

    def predict_loss(self, N, D):
        """Predict the test loss at a given (N, D)."""
        return (self.A * np.power(N, -self.alpha_n) +
                self.A * np.power(D, -self.alpha_d) +
                self.L_inf)

    def compute_optimal(self, C):
        """
        Optimal (N, D) under compute budget C, assuming ~6N FLOPs per token.
        Substituting D = C / (6N) and minimizing over N gives
        N_opt = (a/b)^(1/(a+b)) * (C/6)^(b/(a+b)), with a = alpha_n, b = alpha_d.
        """
        a, b = self.alpha_n, self.alpha_d
        N_opt = (a / b) ** (1 / (a + b)) * (C / 6) ** (b / (a + b))
        D_opt = C / (6 * N_opt)
        return N_opt, D_opt

    def predict_benchmark(self, benchmark_name):
        """
        Predict performance on a specific benchmark.
        Requires benchmark-specific calibration; the mapping below is a
        crude linear placeholder.
        """
        benchmark_mapping = {
            'MMLU': {'scale': 0.7, 'shift': 0.3},
            'GSM8K': {'scale': 0.5, 'shift': 0.5},
            'HumanEval': {'scale': 0.4, 'shift': 0.6},
        }
        if benchmark_name not in benchmark_mapping:
            return None
        params = benchmark_mapping[benchmark_name]
        # Simplified linear mapping from the loss floor to a benchmark score
        return params['scale'] * self.L_inf + params['shift']
```
4.2 Predicting Early Stopping
```python
def predict_optimal_checkpoint(
    model_size: int,
    training_tokens: int,
    scaling_params: dict
):
    """
    Predict a stopping point from the scaling law: stop once the predicted
    loss at the current token count is close to the loss floor implied by
    the model size.
    """
    A = scaling_params['A']
    alpha_n = scaling_params['alpha_n']
    alpha_d = scaling_params['alpha_d']
    L_inf = scaling_params['L_inf']
    # Loss floor for this model size (infinite-data limit)
    L_floor = A * np.power(model_size, -alpha_n) + L_inf
    # Predicted loss at the current number of training tokens
    L_current = L_floor + A * np.power(training_tokens, -alpha_d)
    # Stop once within 10% of the floor
    target_loss = 1.1 * L_floor
    return {
        'loss_floor': L_floor,
        'predicted_loss': L_current,
        'target_loss': target_loss,
        'recommendation': 'stop' if L_current <= target_loss else 'continue'
    }
```
5. The Limits of Scaling Laws
5.1 Saturation
Paper: Scaling Laws for Neural Language Models1
When models are large enough or trained long enough, the scaling law breaks down:
- Compute saturation: diminishing returns from further training
- Data saturation: repeated data no longer provides new information
- Capability saturation: some capabilities stop improving smoothly with scale
```python
class SaturatingScalingLaw:
    """Scaling law with a saturation regime."""

    def __init__(self):
        self.saturation_threshold = None
        self.saturation_slope = None

    def fit_with_saturation(self, N, L, saturation_threshold=1e10):
        """
        Fit a two-regime law: a power law for N < saturation_threshold,
        and a linear tail for N >= saturation_threshold.
        """
        self.saturation_threshold = saturation_threshold
        # Split the two regimes
        mask = N < saturation_threshold
        # Power-law regime
        popt_pow, _ = curve_fit(
            lambda n, A, alpha: A * np.power(n, -alpha),
            N[mask], L[mask],
            p0=[1.0, 0.1]
        )
        # Saturation regime (linear fit; needs enough points)
        mask_sat = N >= saturation_threshold
        if mask_sat.sum() > 2:
            popt_sat = np.polyfit(N[mask_sat], L[mask_sat], 1)
            self.saturation_slope = popt_sat[0]
        return {
            'power_law': {'A': popt_pow[0], 'alpha': popt_pow[1]},
            'saturation_slope': self.saturation_slope
        }
```
5.2 The Emergence Boundary
When a capability (such as reasoning or code generation) only appears after crossing an "emergence threshold", simple power-law extrapolation fails.
This is discussed in detail on the Emergent Abilities page.
6. Scaling Laws and Architecture
6.1 How Architecture Affects Scaling
Paper: Scaling Laws for Deep Neural Networks8
Different architectures have different scaling exponents:
| Architecture | Scaling exponent | Notes |
|---|---|---|
| Transformer | ~0.076 | standard baseline |
| ResNet | ~0.045 | suited to vision |
| Perceiver | ~0.060 | modality-agnostic |
| Mamba (SSM) | ~0.070 | close to the Transformer |
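The practical consequence of a larger exponent is easy to quantify: for the same factor of growth in parameters, a Transformer-like exponent buys more loss reduction than a ResNet-like one. Illustrative arithmetic using the exponents from the table:

```python
# Loss reduction from a 100x parameter increase under different exponents.
# Multiplicative factor on the reducible loss = N_ratio^(-alpha).
for name, alpha in [("Transformer", 0.076), ("ResNet", 0.045), ("Mamba (SSM)", 0.070)]:
    factor = 100 ** (-alpha)
    print(f"{name}: loss x {factor:.3f} per 100x params")
```

A 100x scale-up shrinks the reducible loss to ~70% for the Transformer exponent but only to ~81% for the ResNet one, which compounds heavily over many decades of scale.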
6.2 Width vs. Depth Scaling
```python
import pandas as pd

def width_vs_depth_scaling(
    model_configs: list,
    test_losses: list
):
    """
    Compare the effect of width scaling against depth scaling.
    Assumes each config contains 'n_layers' and 'd_model'.
    """
    df = pd.DataFrame({
        'n_layers': [c['n_layers'] for c in model_configs],
        'd_model': [c['d_model'] for c in model_configs],
        'loss': test_losses
    })
    # Rough total parameter count
    df['N_total'] = df['n_layers'] * df['d_model'] ** 2
    # Width scaling fit
    width_popt, _ = curve_fit(
        lambda d, A, a: A * d ** (-a),
        df['d_model'], df['loss'],
        p0=[1.0, 0.1]
    )
    # Depth scaling fit
    depth_popt, _ = curve_fit(
        lambda n, A, a: A * n ** (-a),
        df['n_layers'], df['loss'],
        p0=[1.0, 0.1]
    )
    # Heuristic: prefer the dimension with the steeper fitted exponent
    return {
        'width_scaling': {'A': width_popt[0], 'alpha': width_popt[1]},
        'depth_scaling': {'A': depth_popt[0], 'alpha': depth_popt[1]},
        'recommendation': 'wider' if width_popt[1] > depth_popt[1] else 'deeper'
    }
```
7. Data Efficiency and Scaling
7.1 The Effect of Data Quality
Paper: Data Scaling Laws for Language Model Pre-training9
Key findings:
- Higher-quality data yields a more favorable power-law exponent (better scaling)
- Lower-quality data saturates sooner
- Deduplication markedly improves scaling behavior
```python
class QualityAdjustedScalingLaw:
    """Scaling law that accounts for data quality."""

    def __init__(self):
        self.quality_weights = None

    def estimate_quality(self, train_texts):
        """
        Estimate quality scores for the training data.
        Simplified version; in practice a reference model's perplexity
        ratio would be used instead.
        """
        quality_scores = []
        for text in train_texts:
            words = text.split()
            # Heuristic 1: average word length
            avg_word_len = np.mean([len(w) for w in words])
            # Heuristic 2: fraction of unique words
            vocab_richness = len(set(words)) / len(words)
            score = 0.5 * (1 / avg_word_len) + 0.5 * vocab_richness
            quality_scores.append(score)
        return np.array(quality_scores)

    def fit_quality_adjusted(self, N, D, quality_scores, L):
        """
        Fit the quality-adjusted law:
        L(Q, N, D) = A * Q^gamma / N^alpha + A * Q^(gamma/2) / D^beta + B
        """
        # curve_fit expects f(xdata, *params), with xdata packed as a tuple
        def model(X, A, alpha, beta, gamma, B):
            N, D, Q = X
            return A * np.power(Q, gamma) * np.power(N, -alpha) + \
                   A * np.power(Q, gamma / 2) * np.power(D, -beta) + B

        popt, _ = curve_fit(
            model,
            (N, D, quality_scores),
            L,
            p0=[1.0, 0.1, 0.1, 0.5, 0.1]
        )
        return dict(zip(['A', 'alpha', 'beta', 'gamma', 'B'], popt))
```
7.2 The Effect of Data Repetition
Paper: Scaling Data-Constrained Language Models6
Repeated data changes the scaling law: the data term D is replaced by an effective data size D_eff ≤ D, since repeated epochs over the same tokens contribute progressively less new information.
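A common way to model diminishing returns from repetition (an illustrative form with a made-up decay constant, not the paper's exact fit) is to let each additional pass over the unique tokens count for a geometrically decaying fraction of fresh data:

```python
def effective_data(unique_tokens, epochs, decay=0.8):
    """
    Effective data size when `unique_tokens` are repeated for `epochs` passes,
    each pass worth `decay` times the previous one (illustrative model).
    """
    value = sum(decay ** k for k in range(int(epochs)))
    return unique_tokens * value

# Four epochs over 1B unique tokens are worth far less than 4B fresh tokens
print(effective_data(1e9, 4))  # < 4e9
```

Substituting D_eff for D in the scaling law then predicts the earlier saturation observed when training repeatedly on a fixed corpus.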
8. Practical Guide
8.1 Planning a Training Budget
```python
def plan_training_budget(
    target_loss: float,
    compute_budget: float  # in FLOPs
) -> dict:
    """
    Plan a training run from the scaling law.
    Assumes training costs ~6N FLOPs per token.
    """
    # Illustrative fitted constants
    A = 2.5
    alpha_n = 0.076
    alpha_d = 0.095
    L_inf = 1.0
    # Sweep model sizes and keep the best predicted loss
    best_config = None
    best_loss = float('inf')
    for N in np.logspace(8, 11, 100):
        # Maximum number of training tokens affordable at this size
        max_tokens = compute_budget / (6 * N)
        L_pred = A * N ** (-alpha_n) + A * max_tokens ** (-alpha_d) + L_inf
        if L_pred < best_loss:
            best_loss = L_pred
            best_config = {
                'N': N,
                'D': max_tokens,
                'predicted_loss': L_pred,
                'meets_target': L_pred <= target_loss,
                'flops_per_token': 6 * N
            }
    return best_config

# Example
budget = 1e24  # FLOPs
config = plan_training_budget(target_loss=2.0, compute_budget=budget)
print(f"Recommended config: N={config['N']:.2e} params, D={config['D']:.2e} tokens")
```
8.2 Designing Scaling Experiments
```python
class ScalingExperiment:
    """Design a scaling-law experiment grid."""

    def __init__(self):
        self.base_config = {
            'n_layers': 12,
            'd_model': 768,
            'n_heads': 12,
            'd_ff': 3072,
        }

    def generate_scaling_points(self, n_points=10, strategy='geometric'):
        """
        Generate the experiment grid.
        Strategies:
        - 'geometric': geometric growth in total size (recommended)
        - 'logarithmic': logarithmic growth
        """
        configs = []
        if strategy == 'geometric':
            # Total size grows 1x, 4x, 16x, ...; each dimension by ~cube root
            for i in range(n_points):
                scale = 4 ** i
                config = self.base_config.copy()
                config['d_model'] = int(self.base_config['d_model'] * scale ** 0.33)
                config['d_ff'] = int(self.base_config['d_ff'] * scale ** 0.33)
                config['n_layers'] = int(self.base_config['n_layers'] * scale ** 0.33)
                configs.append(config)
        elif strategy == 'logarithmic':
            for i in range(n_points):
                scale = np.log(2 + i)
                config = self.base_config.copy()
                config['d_model'] = int(self.base_config['d_model'] * scale)
                configs.append(config)
        return configs

    def estimate_experiment_cost(self, configs, tokens_per_model=1e9):
        """Estimate the compute cost of the experiment grid."""
        costs = []
        for config in configs:
            # Rough parameter count: attention + FFN weights
            N = (config['d_model'] ** 2 * config['n_layers'] * 4 +
                 config['d_model'] * config['d_ff'] * config['n_layers'] * 2)
            # FLOPs estimate (forward + backward ≈ 6N per token)
            flops = 6 * N * tokens_per_model
            costs.append({
                'config': config,
                'params': N,
                'flops': flops,
                # Assuming ~3e14 sustained FLOP/s per A100
                'gpu_hours_approx': flops / 3e14 / 3600
            })
        return costs
```
9. Recent Research
9.1 LLM as a Predictor
Paper: LLM-Predict: Theoretical Scaling of Prediction Capabilities10
Key findings:
- At sufficient scale, an LLM can predict the scaling curves of other LLMs
- This reduces the value of small-scale pilot experiments
9.2 The Post-Chinchilla Era
Paper: Beyond Scaling: Efficient LLM Training with Flexible Allocation11
Key points:
- The Chinchilla recipe should not be followed mechanically
- Allocation should be adapted to the task, data quality, and deployment constraints
9.3 Where Scaling Laws Meet Emergence
Paper: Emergent Abilities: A Systematic Investigation12
Key findings:
- Some capabilities perform at chance level until a critical scale is reached
- Past the critical scale, performance improves sharply
- This contrasts with smooth, continuous scaling laws
See Emergent Abilities for details.
10. References
Footnotes
1. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361
2. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS 2022. https://arxiv.org/abs/2203.15556
3. Bahri, Y., et al. (2024). A Unified Theory of Neural Scaling Laws. ICML 2024. https://arxiv.org/abs/2402.04505
4. Mei, S., & Montanari, A. (2022). The Generalization Error of Random Features. PNAS. https://www.pnas.org/doi/10.1073/pnas.2108154116
5. Bordelon, B., et al. (2024). On the Origin of Neural Scaling Laws. arXiv:2401.10684. https://arxiv.org/abs/2401.10684
6. Hernandez, D., et al. (2023). Scaling Data-Constrained Language Models. NeurIPS 2023. https://arxiv.org/abs/2305.16264
7. Glorioso, P., et al. (2024). Beyond Chinchilla: Optimal Compute at Every Scale. arXiv:2404.09516. https://arxiv.org/abs/2404.09516
8. Burgess, H., et al. (2022). Scaling Laws for Deep Neural Networks. arXiv:2207.07962. https://arxiv.org/abs/2207.07962
9. Xie, S., et al. (2024). Data Scaling Laws for Language Model Pre-training. arXiv:2403.05491. https://arxiv.org/abs/2403.05491
10. Wei, J., et al. (2023). Emergent Abilities of Large Language Models. TMLR. https://arxiv.org/abs/2206.07682
11. Muennighoff, N., et al. (2024). Scaling Data-Constrained Learning. arXiv:2406.08457. https://arxiv.org/abs/2406.08457
12. Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. https://arxiv.org/abs/2307.15702