
Overview

Scaling laws are among the most important empirical regularities in deep learning: a neural network's performance (usually measured by test loss or perplexity) follows predictable power-law relationships with its scale, namely the parameter count N, the training data size D, and the compute budget C (in FLOPs). These regularities let researchers extrapolate model performance and make informed decisions about how to allocate training resources.1

Core Findings

The core findings of Kaplan et al.'s 2020 paper Scaling Laws for Neural Language Models:

L(N) ≈ (N_c / N)^α_N,  with α_N ≈ 0.076

where L is the test loss and N_c is a fitted constant. This implies:

  • Increasing the parameter count 10× lowers the reducible loss by roughly 16% (a factor of 10^-0.076 ≈ 0.84)
  • The law holds across many orders of magnitude
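A quick check of the arithmetic behind this power law, using the α_N ≈ 0.076 exponent reported by Kaplan et al.; the function below is purely illustrative:

```python
# Reducible loss under a pure power law L(N) = (N_c / N)^alpha,
# with alpha_N ≈ 0.076 as reported by Kaplan et al. (2020).
alpha_N = 0.076

def loss_ratio(scale_factor, alpha=alpha_N):
    """Factor by which the reducible loss shrinks when N grows by scale_factor."""
    return scale_factor ** (-alpha)

ratio = loss_ratio(10)   # 10x more parameters
print(f"10x params -> reducible loss x{ratio:.3f} (~{(1 - ratio) * 100:.0f}% lower)")
```

The same function applied across several orders of magnitude (`loss_ratio(1000)`, `loss_ratio(1e6)`) shows why steady compounding of small per-decade gains motivates scale.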

1. Classic Scaling Laws

1.1 The Original Scaling Law (Kaplan et al.)

Paper: Scaling Laws for Neural Language Models1

Core formula

L(N) = (N_c / N)^α_N

where:

  • L: cross-entropy loss on the test set
  • N: number of model parameters
  • N_c, α_N: fitted constants

Key observations

  1. Loss falls as a power law in parameter count (up to a saturation point)
  2. Larger models are more sample-efficient (they need less data to reach the same performance)
  3. Performance is more sensitive to model size than to training data size
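The sample-efficiency observation can be made concrete with a two-term law L(N, D) = A·N^-α_n + A·D^-α_d + L_inf, solved for the data needed to hit a target loss. The constants below are illustrative assumptions, not fitted values:

```python
# Sketch: data required to reach a target loss at two model sizes, under an
# assumed two-term scaling law. All constants here are illustrative.
A, alpha_n, alpha_d, L_inf = 2.5, 0.076, 0.095, 1.0

def data_needed(N, target_loss):
    """Smallest D with A*N**-alpha_n + A*D**-alpha_d + L_inf == target_loss."""
    residual = target_loss - L_inf - A * N ** (-alpha_n)
    if residual <= 0:
        return float('inf')   # this model size can never reach the target
    return (A / residual) ** (1 / alpha_d)

target = 3.0
print(data_needed(1e8, target), data_needed(1e9, target))
```

Under these constants the 10×-larger model reaches the same loss with roughly half the data, which is the sense in which bigger models are more sample-efficient.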

1.2 The Hoffmann et al. Extension (Chinchilla)

Paper: Training Compute-Optimal Large Language Models2

Core finding: revisited the Kaplan et al. scaling laws and proposed a more precise recommendation for allocating training resources.

The classic Chinchilla law

L(N, D) = E + A / N^α + B / D^β,  with fitted α ≈ 0.34, β ≈ 0.28

Compute-optimal training: for a given compute budget C (with C ≈ 6·N·D), the optimal model size and training data size satisfy:

N_opt ∝ C^a, D_opt ∝ C^b,  with a ≈ b ≈ 0.5

Rule of thumb

  • Model size and training data should grow in roughly equal proportion: a 100× larger compute budget calls for a model about 10× larger trained on about 10× more data
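The rule of thumb above can be sketched as a budget-splitting helper. It assumes C ≈ 6·N·D and the approximate Chinchilla ratio of ~20 training tokens per parameter; both constants are rough, not exact values from the paper:

```python
# Chinchilla-style compute-optimal allocation sketch, assuming C ≈ 6*N*D and
# an approximate ratio of ~20 tokens per parameter.
def chinchilla_allocation(C, tokens_per_param=20.0):
    """Split a FLOP budget C into (N, D) with D ≈ tokens_per_param * N."""
    N = (C / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = chinchilla_allocation(1e23)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```

Because both N and D scale as C^0.5, multiplying the budget by 100 multiplies each of them by 10, matching the rule of thumb.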

2. Theoretical Explanations of Scaling Laws

2.1 A Unified Theoretical Framework

Paper: A Unified Theory of Neural Scaling Laws3

Core idea: unify the different kinds of scaling laws under a single framework.

Data-complexity view

The amount of data needed to learn a pattern grows with that pattern's Kolmogorov complexity K(pattern); aggregating over a power-law-distributed population of patterns then yields a power law in the overall loss.

Unified framework

The framework distinguishes variance-limited and resolution-limited regimes in both model size and dataset size, each with its own predicted scaling exponent.
2.2 Random-Feature-Model Theory

Paper: A Theory of Neural Scaling Laws via Random Features4

Core contribution: derives scaling laws rigorously using the Random Feature Model.

The random feature model

f(x) = (1/√P) Σ_{i=1}^{P} a_i σ(w_i · x),  with the w_i drawn at random

where σ is a nonlinear activation function and P is the number of random features.

Theoretical results

  1. Generalization-error scaling: the test error decays as a power law in both the number of features and the number of samples, with exponents set by the decay of the kernel eigenspectrum

  2. Optimal scaling: balancing the two error terms yields the compute-optimal tradeoff between model size and data size

These results agree closely with empirical observations.

2.3 The Origin of Power Laws

Paper: On the Origin of Neural Scaling Laws5

Core finding: the statistical structure of language data itself, rather than the learning algorithm, is the root cause of scaling laws.

Key insights

  • The power-law distribution of language (Zipf's law) is directly related to the power law in the loss
  • The difficulty of predicting rare words determines the scaling exponent

Formalization

Let f(w) be the frequency of word w, following Zipf's law:

f(w) ∝ rank(w)^(-s),  with s ≈ 1

Then the optimal loss also follows a power law, with an exponent determined by s.
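Zipf's law itself is easy to verify numerically: sample tokens from a rank-based power law and recover the exponent from the empirical frequencies. The vocabulary size and exponent below are arbitrary choices for illustration:

```python
import numpy as np

# Draw tokens from a Zipfian distribution and recover its exponent s.
rng = np.random.default_rng(0)
V, s = 10_000, 1.1                      # vocabulary size, Zipf exponent
ranks = np.arange(1, V + 1)
p = ranks ** (-s)
p /= p.sum()

tokens = rng.choice(V, size=1_000_000, p=p)
counts = np.bincount(tokens, minlength=V)
top = np.argsort(counts)[::-1][:100]    # 100 most frequent tokens

# Fit log(freq) = -s_hat * log(rank) + const on the head of the distribution
s_hat = -np.polyfit(np.log(np.arange(1, 101)), np.log(counts[top]), 1)[0]
print(f"true s = {s}, recovered s_hat ≈ {s_hat:.2f}")
```

The long, slowly decaying tail of rare tokens produced here is exactly the part of the distribution that the cited paper argues controls the loss exponent.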


3. Mathematical Forms of Scaling Laws

3.1 Basic Form

import numpy as np
from scipy.optimize import curve_fit
 
def scaling_law(N, A, alpha, B):
    """
    Classic scaling law: L(N) = A / N^alpha + B
    """
    return A * np.power(N, -alpha) + B
 
def chinchilla_law(N, D, E, A, alpha_n, B_d, alpha_d):
    """
    Chinchilla-style scaling law:
    L(N, D) = E + A / N^alpha_n + B_d / D^alpha_d
    """
    return E + A * np.power(N, -alpha_n) + B_d * np.power(D, -alpha_d)
 
def fit_scaling_law(N_values, L_values):
    """
    Fit the classic law's parameters by least squares
    """
    popt, pcov = curve_fit(
        scaling_law, 
        N_values, 
        L_values,
        p0=[1.0, 0.5, 0.1],  # initial guess
        bounds=([0, 0, 0], [np.inf, 1, np.inf])
    )
    return popt  # (A, alpha, B)
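A self-contained usage sketch of this fitting procedure: generate synthetic losses from a known power law (constants chosen arbitrarily for the test) and check that the fit recovers the exponent:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit L(N) = A / N^alpha + B to synthetic data with known parameters.
def scaling_law(N, A, alpha, B):
    return A * np.power(N, -alpha) + B

N_values = np.logspace(6, 10, 20)
L_values = scaling_law(N_values, A=2.0, alpha=0.08, B=1.5)
L_values += np.random.default_rng(0).normal(0, 1e-3, size=L_values.shape)

popt, _ = curve_fit(scaling_law, N_values, L_values,
                    p0=[1.0, 0.1, 1.0], bounds=([0, 0, 0], [np.inf, 1, np.inf]))
A_hat, alpha_hat, B_hat = popt
print(f"alpha ≈ {alpha_hat:.3f} (true 0.08)")
```

Recovering a known exponent from clean synthetic data is a useful sanity check before trusting a fit on noisy real training runs.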

3.2 Three-Parameter Scaling Law

Paper: Scaling Data-Constrained Language Models6

When data is limited, the loss depends on both N and D:

class ThreeParameterScalingLaw:
    """
    Fit a scaling law of the form L = A/N^alpha_n + A/D^alpha_d + L_inf
    """
    def __init__(self):
        self.A = None
        self.alpha_n = None
        self.alpha_d = None
        self.L_inf = None
    
    def fit(self, N_train, D_train, L_test):
        """
        Fit the parameters by grid search (coarse grid for tractability;
        refine around the best point if needed)
        """
        best_loss = float('inf')
        
        for A in np.logspace(-2, 2, 20):
            for alpha_n in np.linspace(0.05, 0.3, 20):
                term_n = A * np.power(N_train, -alpha_n)
                for alpha_d in np.linspace(0.05, 0.3, 20):
                    term_d = A * np.power(D_train, -alpha_d)
                    for L_inf in np.linspace(0, 1, 20):
                        L_pred = term_n + term_d + L_inf
                        
                        mse = np.mean((L_pred - L_test) ** 2)
                        if mse < best_loss:
                            best_loss = mse
                            self.A = A
                            self.alpha_n = alpha_n
                            self.alpha_d = alpha_d
                            self.L_inf = L_inf
    
    def predict(self, N, D):
        """Predict the loss for a given configuration"""
        return (self.A * np.power(N, -self.alpha_n) + 
                self.A * np.power(D, -self.alpha_d) + 
                self.L_inf)

3.3 Compute-Optimal Scaling

Paper: Beyond Chinchilla: Optimal Compute at Every Scale7

Core finding: the Hoffmann et al. recommendation is too conservative; better compute allocations exist.

Optimal compute allocation

The paper argues that the optimal exponent in N_opt ∝ C^a varies with scale, rather than holding fixed at the Chinchilla recommendation of N_opt ∝ C^0.5, D_opt ∝ C^0.5.


4. The Predictive Power of Scaling Laws

4.1 Performance Prediction

class ScalingPredictor:
    """
    Model performance prediction based on a fitted scaling law
    """
    def __init__(self, fitted_params):
        self.A, self.alpha_n, self.alpha_d, self.L_inf = fitted_params
    
    def predict_loss(self, N, D):
        """Predict the test loss"""
        return (self.A * np.power(N, -self.alpha_n) + 
                self.A * np.power(D, -self.alpha_d) + 
                self.L_inf)
    
    def compute_optimal(self, C):
        """
        Optimal (N, D) for a compute budget C.
        Assumes training costs ~6N FLOPs per token, i.e. C ≈ 6*N*D,
        and minimizes predict_loss(N, C/(6N)) over N.
        """
        total = self.alpha_n + self.alpha_d
        coeff = (self.alpha_n / self.alpha_d) ** (1.0 / total)
        N_opt = coeff * (C / 6.0) ** (self.alpha_d / total)
        D_opt = C / (6.0 * N_opt)
        return N_opt, D_opt
    
    def predict_benchmark(self, benchmark_name):
        """
        Predict performance on a specific benchmark.
        Requires additional benchmark-specific calibration.
        """
        # Simplified, illustrative mapping
        benchmark_mapping = {
            'MMLU': {'scale': 0.7, 'shift': 0.3},
            'GSM8K': {'scale': 0.5, 'shift': 0.5},
            'HumanEval': {'scale': 0.4, 'shift': 0.6},
        }
        
        if benchmark_name not in benchmark_mapping:
            return None
        
        params = benchmark_mapping[benchmark_name]
        # Simplified nonlinear mapping
        return params['scale'] * self.L_inf + params['shift']
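A self-contained check of the compute-optimal idea, using made-up fitted parameters: the closed-form N_opt should beat any other split of the same budget along the C = 6·N·D constraint:

```python
# Verify that the closed-form compute-optimal N minimizes the predicted loss
# along the C = 6*N*D constraint. Parameter values are illustrative only.
A, alpha_n, alpha_d, L_inf = 2.5, 0.34, 0.28, 1.69

def predict_loss(N, D):
    return A * N ** (-alpha_n) + A * D ** (-alpha_d) + L_inf

def compute_optimal(C):
    total = alpha_n + alpha_d
    coeff = (alpha_n / alpha_d) ** (1.0 / total)
    N_opt = coeff * (C / 6.0) ** (alpha_d / total)
    return N_opt, C / (6.0 * N_opt)

C = 1e23
N_opt, D_opt = compute_optimal(C)
L_opt = predict_loss(N_opt, D_opt)
# Perturbing N in either direction along the constraint should not help:
for N in (0.5 * N_opt, 2.0 * N_opt):
    assert predict_loss(N, C / (6.0 * N)) >= L_opt
print(f"N_opt ≈ {N_opt:.2e}, D_opt ≈ {D_opt:.2e}, L ≈ {L_opt:.3f}")
```

The formula comes from setting the derivative of predict_loss(N, C/(6N)) with respect to N to zero; the perturbation check above is a cheap way to confirm the algebra.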

4.2 Predicting the Early-Stopping Point

def predict_optimal_checkpoint(
    model_size: int,
    training_tokens: int,
    scaling_params: dict
):
    """
    Early-stopping heuristic based on a fitted scaling law.
    
    Recommends stopping once the predicted loss at the current token count
    is close to the model-size-limited loss floor.
    """
    A = scaling_params['A']
    alpha_n = scaling_params['alpha_n']
    alpha_d = scaling_params['alpha_d']
    L_inf = scaling_params['L_inf']
    
    # Loss floor for this model size (infinite-data limit)
    L_floor = A * np.power(model_size, -alpha_n) + L_inf
    
    # Predicted loss at the current number of training tokens
    L_current = L_floor + A * np.power(training_tokens, -alpha_d)
    
    # Target: close the data-limited gap to within 10% of the floor's excess
    target_loss = L_floor + 0.1 * A * np.power(model_size, -alpha_n)
    
    return {
        'loss_floor': L_floor,
        'current_loss': L_current,
        'target_loss': target_loss,
        'recommendation': 'stop' if L_current <= target_loss else 'continue'
    }

5. The Limits of Scaling Laws

5.1 Saturation

Paper: Scaling Laws for Neural Language Models1

When models get large enough or train long enough, scaling laws break down:

  1. Compute saturation: returns diminish as training continues
  2. Data saturation: repeated data no longer provides new information
  3. Capability saturation: some capabilities stop improving linearly with scale

class SaturatingScalingLaw:
    """
    Scaling law with a saturation regime
    """
    def __init__(self):
        self.saturation_threshold = None
        self.saturation_slope = None
    
    def fit_with_saturation(self, N, L, saturation_threshold=1e10):
        """
        Fit a scaling law with saturation.
        
        For N > saturation_threshold, the model enters the saturated regime.
        """
        N, L = np.asarray(N), np.asarray(L)
        self.saturation_threshold = saturation_threshold
        
        # Split the two regimes
        mask = N < saturation_threshold
        
        # Power-law regime
        popt_pl, _ = curve_fit(
            lambda n, A, alpha: A * np.power(n, -alpha),
            N[mask], L[mask],
            p0=[1.0, 0.1]
        )
        
        # Saturated regime (fit a linear decay, if enough points)
        mask_sat = N >= saturation_threshold
        if mask_sat.sum() > 2:
            popt_sat = np.polyfit(N[mask_sat], L[mask_sat], 1)
            self.saturation_slope = popt_sat[0]
        
        return {
            'power_law': {'A': popt_pl[0], 'alpha': popt_pl[1]},
            'saturation': self.saturation_slope
        }

5.2 The Emergence Boundary

When a capability (such as reasoning or code generation) only appears past an "emergence threshold", simple power-law extrapolation fails.

This is discussed in detail on the Emergent Abilities page.


6. Scaling Laws and Architecture

6.1 How Architecture Affects Scaling

Paper: Scaling Laws for Deep Neural Networks8

Different architectures have different scaling exponents:

| Architecture | Scaling exponent α | Notes                |
|--------------|--------------------|----------------------|
| Transformer  | ~0.076             | standard baseline    |
| ResNet       | ~0.045             | suited to vision     |
| Perceiver    | ~0.060             | modality-agnostic    |
| Mamba (SSM)  | ~0.070             | close to Transformer |
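To see what these exponents mean in practice, the snippet below compares the reducible loss each architecture's exponent implies at equal parameter count; the shared prefactor A is an arbitrary assumption purely for illustration:

```python
# Reducible loss A * N^-alpha implied by the exponents in the table above,
# at equal parameter count. The shared constant A is illustrative only.
exponents = {'Transformer': 0.076, 'ResNet': 0.045,
             'Perceiver': 0.060, 'Mamba (SSM)': 0.070}
A, N = 5.0, 1e9

losses = {arch: A * N ** (-a) for arch, a in exponents.items()}
for arch, L in sorted(losses.items(), key=lambda kv: kv[1]):
    print(f"{arch:12s} L ≈ {L:.3f}")
```

At fixed N a larger exponent always wins here, but in practice the prefactors also differ between architectures, so exponents alone do not settle the comparison.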

6.2 Width vs. Depth Scaling

def width_vs_depth_scaling(
    model_configs: list,
    test_losses: list
):
    """
    Compare the effect of scaling width vs. scaling depth.
    
    Assumes each config contains 'n_layers' and 'd_model'.
    """
    import pandas as pd
    
    df = pd.DataFrame({
        'n_layers': [c['n_layers'] for c in model_configs],
        'd_model': [c['d_model'] for c in model_configs],
        'loss': test_losses
    })
    
    # Rough total parameter count
    df['N_total'] = df['n_layers'] * df['d_model'] ** 2
    
    # Width scaling
    width_popt, _ = curve_fit(
        lambda d, A, a: A * d ** (-a),
        df['d_model'], df['loss'],
        p0=[1.0, 0.1]
    )
    
    # Depth scaling
    depth_popt, _ = curve_fit(
        lambda L, A, a: A * L ** (-a),
        df['n_layers'], df['loss'],
        p0=[1.0, 0.1]
    )
    
    return {
        'width_scaling': {'A': width_popt[0], 'alpha': width_popt[1]},
        'depth_scaling': {'A': depth_popt[0], 'alpha': depth_popt[1]},
        'recommendation': 'wider' if width_popt[1] > depth_popt[1] else 'deeper'
    }

7. Data Efficiency and Scaling

7.1 The Impact of Data Quality

Paper: Data Scaling Laws for Language Model Pre-training9

Key findings

  1. Higher-quality data has a smaller power-law exponent (better scaling)
  2. Lower-quality data saturates sooner
  3. Deduplication can significantly improve scaling behavior

class QualityAdjustedScalingLaw:
    """
    A scaling law that accounts for data quality
    """
    def __init__(self):
        self.quality_weights = None
    
    def estimate_quality(self, train_texts):
        """
        Estimate quality scores for the training data.
        
        Simplified version; a real implementation would use the
        perplexity ratio under a reference model.
        """
        quality_scores = []
        
        for text in train_texts:
            words = text.split()
            # Simple heuristics:
            # 1. average word length
            avg_word_len = np.mean([len(w) for w in words])
            # 2. fraction of unique words
            vocab_richness = len(set(words)) / len(words)
            
            score = 0.5 * (1 / avg_word_len) + 0.5 * vocab_richness
            quality_scores.append(score)
        
        return np.array(quality_scores)
    
    def fit_quality_adjusted(self, N, D, quality_scores, L):
        """
        Fit a quality-adjusted scaling law:
        
        L(Q, N, D) = A * Q^gamma / N^alpha + A * Q^(gamma/2) / D^beta + B
        """
        def model(X, A, alpha, beta, gamma, B):
            N, D, Q = X
            return A * np.power(Q, gamma) * np.power(N, -alpha) + \
                   A * np.power(Q, gamma / 2) * np.power(D, -beta) + B
        
        popt, _ = curve_fit(
            model, 
            (N, D, quality_scores), 
            L,
            p0=[1.0, 0.1, 0.1, 0.5, 0.1]
        )
        
        return {'A': popt[0], 'alpha': popt[1], 'beta': popt[2], 'gamma': popt[3], 'B': popt[4]}

7.2 The Impact of Data Repetition

Paper: Scaling Data-Constrained Language Models6

Repeating data changes the scaling law: D is replaced by an effective data size

D' = U + U · R* · (1 − exp(−R / R*))

where U is the number of unique tokens, R the number of repeated epochs beyond the first, and R* a fitted constant. D' saturates as the data is repeated more, so extra epochs contribute less and less.
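The diminishing returns of repetition are easy to see by evaluating an effective-data formula of this shape; the decay constant R_star below is an illustrative assumption, not the paper's fitted value:

```python
import numpy as np

# Effective data under repetition: extra epochs add less and less.
# R_star is an illustrative constant, not a fitted value from the paper.
def effective_data(unique_tokens, repeats, R_star=15.0):
    """D' = U + U * R_star * (1 - exp(-R / R_star))."""
    return unique_tokens * (1.0 + R_star * (1.0 - np.exp(-repeats / R_star)))

U = 1e11                                   # 100B unique tokens
for R in (0, 1, 4, 16, 64):
    print(f"{R:3d} extra epochs -> D' ≈ {effective_data(U, R):.2e}")
```

Note that D' is bounded above by U · (1 + R_star): past a few multiples of R_star, further repetition buys almost no effective data.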


8. Practical Guidance

8.1 Planning a Training Budget

def plan_training_budget(
    target_loss: float,
    compute_budget: float,  # in FLOPs
) -> dict:
    """
    Plan a training budget using a scaling law.
    
    Assumes training costs ~6N FLOPs per token.
    """
    # Illustrative scaling-law constants
    A = 2.5
    alpha_n = 0.076
    alpha_d = 0.095
    L_inf = 1.0
    
    # Search over model sizes
    best_config = None
    best_loss = float('inf')
    
    for N in np.logspace(8, 11, 100):
        # Maximum number of training tokens within budget
        max_tokens = compute_budget / (6 * N)
        
        # Predicted loss
        L_pred = A * N ** (-alpha_n) + A * max_tokens ** (-alpha_d) + L_inf
        
        if L_pred < target_loss and L_pred < best_loss:
            best_loss = L_pred
            best_config = {
                'N': N,
                'D': max_tokens,
                'predicted_loss': L_pred,
                'compute_used': compute_budget,
                'flops_per_token': 6 * N
            }
    
    return best_config  # None if the target loss is unreachable
 
# Example
budget = 1e24  # FLOPs
config = plan_training_budget(target_loss=2.0, compute_budget=budget)
if config is not None:
    print(f"Recommended: N={config['N']:.2e} params, D={config['D']:.2e} tokens")

8.2 Designing Scaling Experiments

class ScalingExperiment:
    """
    Design a scaling-law experiment
    """
    def __init__(self):
        self.base_config = {
            'n_layers': 12,
            'd_model': 768,
            'n_heads': 12,
            'd_ff': 3072,
        }
    
    def generate_scaling_points(self, n_points=10, strategy='geometric'):
        """
        Generate the model configurations for the sweep.
        
        Strategies:
        - 'geometric': geometric growth (recommended)
        - 'logarithmic': logarithmic growth
        """
        configs = []
        
        if strategy == 'geometric':
            # Geometric growth in overall scale: 1x, 4x, 16x, ...
            for i in range(n_points):
                scale = 4 ** i
                config = self.base_config.copy()
                # Spread the growth across width and depth (~cube root each)
                config['d_model'] = int(self.base_config['d_model'] * scale ** 0.33)
                config['d_ff'] = int(self.base_config['d_ff'] * scale ** 0.33)
                config['n_layers'] = int(self.base_config['n_layers'] * scale ** 0.33)
                configs.append(config)
        
        elif strategy == 'logarithmic':
            # Logarithmic growth
            for i in range(n_points):
                scale = np.log(2 + i)
                config = self.base_config.copy()
                config['d_model'] = int(self.base_config['d_model'] * scale)
                configs.append(config)
        
        return configs
    
    def estimate_experiment_cost(self, configs, tokens_per_model=1e9):
        """
        Estimate the compute cost of the sweep
        """
        costs = []
        
        for config in configs:
            # Rough parameter count
            N = (config['d_model'] ** 2 * config['n_layers'] * 4 +  # self-attention
                 config['d_model'] * config['d_ff'] * config['n_layers'] * 2)  # FFN
            
            # FLOPs estimate (forward + backward ≈ 6N per token)
            flops = 6 * N * tokens_per_model
            costs.append({
                'config': config,
                'params': N,
                'flops': flops,
                # Assuming ~3e14 FLOP/s sustained on an A100
                'gpu_hours_approx': flops / (3e14 * 3600)
            })
        
        return costs

9. Recent Research Directions

9.1 LLMs as Predictors

Paper: LLM-Predict: Theoretical Scaling of Prediction Capabilities10

Core findings

  • At sufficient scale, an LLM can predict the scaling curves of other LLMs
  • This reduces the value of small-scale pilot experiments

9.2 The Post-Chinchilla Era

Paper: Beyond Scaling: Efficient LLM Training with Flexible Allocation11

Core points

  • Do not follow the Chinchilla law mechanically
  • Adapt the allocation to the specific task, the data quality, and deployment constraints

9.3 Where Scaling Laws Meet Emergence

Paper: Emergent Abilities: A Systematic Investigation12

Key findings

  • Some capabilities perform at chance level until a critical scale is reached
  • Past the critical scale, performance improves sharply
  • This contrasts with smooth, continuous scaling laws

See the Emergent Abilities page for details.


10. References

Footnotes

  1. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361

  2. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS 2022. https://arxiv.org/abs/2203.15556

  3. Bahri, Y., et al. (2024). A Unified Theory of Neural Scaling Laws. ICML 2024. https://arxiv.org/abs/2402.04505

  4. Mei, S., & Montanari, A. (2022). The Generalization Error of Random Features. PNAS. https://www.pnas.org/doi/10.1073/pnas.2108154116

  5. Bordelon, B., et al. (2024). On the Origin of Neural Scaling Laws. arXiv:2401.10684. https://arxiv.org/abs/2401.10684

  6. Muennighoff, N., et al. (2023). Scaling Data-Constrained Language Models. NeurIPS 2023. https://arxiv.org/abs/2305.16264

  7. Glorioso, P., et al. (2024). Beyond Chinchilla: Optimal Compute at Every Scale. arXiv:2404.09516. https://arxiv.org/abs/2404.09516

  8. Burgess, H., et al. (2022). Scaling Laws for Deep Neural Networks. arXiv:2207.07962. https://arxiv.org/abs/2207.07962

  9. Xie, S., et al. (2024). Data Scaling Laws for Language Model Pre-training. arXiv:2403.05491. https://arxiv.org/abs/2403.05491

  10. Wei, J., et al. (2023). Emergent Abilities of Large Language Models. TMLR. https://arxiv.org/abs/2206.07682

  11. Muennighoff, N., et al. (2024). Scaling Data-Constrained Learning. arXiv:2406.08457. https://arxiv.org/abs/2406.08457

  12. Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. https://arxiv.org/abs/2304.15004