Related deep dives:
- Transformer Evolution: the architectural lineage from the Transformer to the GPT series
- Emergent Abilities: the deeper connection between scaling laws and emergence
Overview
Scaling laws are among the most important empirical regularities in deep learning: a neural network's performance (typically measured by test loss or perplexity) follows predictable power-law relationships in its parameter count N, training data size D, and compute C (FLOPs). These regularities let researchers extrapolate model performance and make informed decisions about how to allocate training resources.1
Key Findings
The core finding of Kaplan et al.'s 2020 paper Scaling Laws for Neural Language Models:

L(N) = (N_c / N)^α_N, with α_N ≈ 0.076

where L is the test loss and N the parameter count. This implies:
- A 10x increase in parameters lowers the loss by roughly 16% (10^-0.076 ≈ 0.84)
- The law holds across many orders of magnitude
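The per-decade improvement follows directly from the exponent. A minimal sketch, using α_N ≈ 0.076 from the paper (the constant N_c below is illustrative only):

```python
# Evaluate the Kaplan-style power law L(N) = (N_c / N)^alpha_N.
# N_c is an illustrative constant; alpha_N ~ 0.076 is the fitted exponent.
N_c = 8.8e13
alpha_N = 0.076

def loss(N):
    return (N_c / N) ** alpha_N

# A 10x parameter increase multiplies the loss by 10^-alpha_N ~ 0.84,
# i.e. roughly a 16% reduction, independent of the starting size.
ratio = loss(1e10) / loss(1e9)
print(f"loss ratio for 10x params: {ratio:.3f}")  # ~0.839
```

Note that the ratio depends only on the exponent, not on N_c, which is why the "16% per decade" rule of thumb is scale-free.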
1. Classic Scaling Laws
1.1 The Original Scaling Laws (Kaplan et al.)
Paper: Scaling Laws for Neural Language Models1
Core formula:

L(N) = (N_c / N)^α_N

where:
- L: cross-entropy loss on the test set
- N: number of model parameters
- N_c, α_N: fitted parameters
Key observations:
- Loss decreases as a power law in parameter count (until a saturation point)
- Larger models are more sample-efficient (they reach the same performance with less data)
- Performance is more sensitive to model size than to training data volume
1.2 The Hoffmann et al. Revision
Paper: Training Compute-Optimal Large Language Models2
Key finding: re-examined the Kaplan et al. scaling laws and proposed a more precise resource-allocation recipe.
The Chinchilla law:

L(N, D) = E + A / N^α + B / D^β, with fitted exponents α ≈ 0.34 and β ≈ 0.28

Compute-optimal training: under a given compute budget C (with C ≈ 6ND), the optimal model size and data volume satisfy:

N_opt ∝ C^0.5, D_opt ∝ C^0.5

Rule of thumb:
- Model size and data should grow in equal proportion: 1000x more compute calls for roughly a 32x larger model trained on roughly 32x more data (about 20 tokens per parameter)
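The rule of thumb above can be sketched in a few lines, assuming the commonly cited C ≈ 6ND cost model and the headline ~20-tokens-per-parameter ratio:

```python
import math

def chinchilla_allocation(C, tokens_per_param=20.0):
    """Split a FLOPs budget C into (N, D) with C = 6*N*D and D = r*N."""
    N = math.sqrt(C / (6 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

N1, D1 = chinchilla_allocation(1e21)
N2, D2 = chinchilla_allocation(1e24)  # 1000x more compute
print(f"N grows {N2 / N1:.1f}x, D grows {D2 / D1:.1f}x")  # both ~31.6x
```

Because both N and D scale as √C, a 1000x compute increase multiplies each by √1000 ≈ 32.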
2. Theoretical Explanations of Scaling Laws
2.1 A Unified Framework
Paper: A Unified Theory of Neural Scaling Laws3
Core idea: unify the different kinds of scaling laws under a single framework.
Data-complexity view:
The amount of data needed to learn a pattern grows with the pattern's complexity, formalized via its Kolmogorov complexity K; aggregating over a power-law-distributed population of patterns then yields a power-law loss curve.
2.2 Random-Feature Theory
Paper: A Theory of Neural Scaling Laws via Random Features4
Core contribution: a rigorous derivation of scaling laws in the random feature model.
The random feature model:

f(x) = Σ_{i=1..P} a_i σ(⟨w_i, x⟩), with frozen random weights w_i and a nonlinear activation σ

Theoretical results:
- The generalization error decays as a power law in both the number of features P (playing the role of model size) and the number of samples n
- Balancing the two error terms yields the optimal trade-off between P and n under a fixed budget
This agrees closely with the empirical observations.
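The flavor of this result can be reproduced in a small experiment: ridge regression on random ReLU features of a synthetic nonlinear task, where test error falls as the number of features P grows. This is an illustrative setup of my own, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 2000, 500

# Synthetic target: a fixed random nonlinear function of the input
W_true = rng.normal(size=(d, 64))

def target(X):
    return np.tanh(X @ W_true).sum(axis=1)

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr, y_te = target(X_tr), target(X_te)

def rf_test_error(P):
    """Ridge regression on P random ReLU features."""
    W = rng.normal(size=(d, P))
    F_tr = np.maximum(X_tr @ W, 0)
    F_te = np.maximum(X_te @ W, 0)
    a = np.linalg.solve(F_tr.T @ F_tr + 1e-3 * np.eye(P), F_tr.T @ y_tr)
    return np.mean((F_te @ a - y_te) ** 2)

errors = [rf_test_error(P) for P in (16, 64, 256, 1024)]
print(errors)  # error shrinks as P grows
```

Plotting log(error) against log(P) for such runs is the standard way to read off an empirical scaling exponent.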
2.3 The Origin of the Power Law
Paper: On the Origin of Neural Scaling Laws5
Key finding: the statistical structure of language data itself (not the learning algorithm) is the root of scaling laws.
Key insights:
- The power-law (Zipfian) distribution of language is directly tied to the power-law decay of the loss
- The difficulty of predicting rare tokens determines the scaling exponent
Formalization:
Let p(i) ∝ i^-s be the frequency of the i-th most common token (Zipf's law). Then the optimal loss decays as a power law in model and data size, with an exponent determined by s.
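One concrete consequence of a Zipfian distribution: for s > 1, the probability mass beyond the k most frequent tokens itself decays as roughly k^(1-s), so a model that has only mastered the top k tokens improves as a power law as its effective k grows. A small numeric check (illustrative parameters):

```python
import numpy as np

s = 1.5
V = 1_000_000
ranks = np.arange(1, V + 1)
p = ranks ** (-s)
p /= p.sum()  # Zipf distribution over V token types

def tail_mass(k):
    """Probability mass on tokens rarer than rank k."""
    return p[k:].sum()

# For s = 1.5, tail_mass(k) ~ k^(1-s) = k^-0.5, so a 100x larger k
# cuts the unlearned tail mass by roughly 10x
ratio = tail_mass(10_000) / tail_mass(100)
print(f"{ratio:.3f}")  # ~0.09
```

The deviation from exactly 0.1 comes from the finite-vocabulary cutoff at rank V.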
3. Mathematical Forms of Scaling Laws
3.1 Basic Form
```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, A, alpha, B):
    """Classic scaling law: L(N) = A / N^alpha + B."""
    return A * np.power(N, -alpha) + B

def chinchilla_law(X, E, A, alpha, B, beta):
    """Chinchilla-style law: L(N, D) = E + A / N^alpha + B / D^beta."""
    N, D = X
    return E + A * np.power(N, -alpha) + B * np.power(D, -beta)

def fit_scaling_law(N_values, L_values):
    """Fit the single-variable scaling law."""
    popt, pcov = curve_fit(
        scaling_law,
        N_values,
        L_values,
        p0=[1.0, 0.5, 0.1],  # initial guess for (A, alpha, B)
        bounds=([0, 0, 0], [np.inf, 1, 1])
    )
    return popt  # (A, alpha, B)
```
3.2 The Three-Parameter Scaling Law
Paper: Scaling Data-Constrained Language Models6
When data is the binding constraint, the loss is fit with separate model and data terms plus an irreducible floor:

L(N, D) = A / N^α_N + A / D^α_D + L_∞
```python
class ThreeParameterScalingLaw:
    """Fit L(N, D) = A / N^alpha_n + A / D^alpha_d + L_inf by grid search."""

    def __init__(self):
        self.A = None
        self.alpha_n = None
        self.alpha_d = None
        self.L_inf = None

    def fit(self, N_train, D_train, L_test):
        """Coarse grid search over the four parameters (slow but dependency-free)."""
        best_loss = float('inf')
        for A in np.logspace(-2, 2, 20):
            for alpha_n in np.linspace(0.05, 0.3, 20):
                for alpha_d in np.linspace(0.05, 0.3, 20):
                    for L_inf in np.linspace(0, 1, 20):
                        L_pred = A * np.power(N_train, -alpha_n) + \
                                 A * np.power(D_train, -alpha_d) + L_inf
                        mse = np.mean((L_pred - L_test) ** 2)
                        if mse < best_loss:
                            best_loss = mse
                            self.A = A
                            self.alpha_n = alpha_n
                            self.alpha_d = alpha_d
                            self.L_inf = L_inf

    def predict(self, N, D):
        """Predict the loss for a given (N, D) configuration."""
        return (self.A * np.power(N, -self.alpha_n) +
                self.A * np.power(D, -self.alpha_d) +
                self.L_inf)
```
3.3 Compute-Optimal Scaling
Paper: Beyond Chinchilla: Optimal Compute at Every Scale7
Key finding: the Hoffmann et al. recipe is conservative; better compute allocations exist.
Optimal compute allocation: the exponent relating optimal model size to compute varies with scale, rather than staying fixed at Chinchilla's N_opt ∝ C^0.5.
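For the Chinchilla-style form L(N, D) = E + A/N^α + B/D^β with C = 6ND, the compute-optimal exponent can be checked numerically: grid-minimize over N at two budgets and measure how N_opt shifts, which recovers β/(α+β). A self-contained check using illustrative Chinchilla-like constants:

```python
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def best_N(C):
    """Grid-minimize L(N, D) subject to C = 6*N*D."""
    Ns = np.logspace(6, 13, 2000)
    Ds = C / (6 * Ns)
    L = E + A * Ns ** (-alpha) + B * Ds ** (-beta)
    return Ns[np.argmin(L)]

C1, C2 = 1e20, 1e24
measured = np.log(best_N(C2) / best_N(C1)) / np.log(C2 / C1)
predicted = beta / (alpha + beta)
print(f"measured exponent {measured:.3f} vs predicted {predicted:.3f}")
```

With α ≈ β the exponent sits near 0.5, which is exactly the Chinchilla "scale N and D together" prescription; debates like the one above amount to disagreements over the fitted α and β.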
4. The Predictive Power of Scaling Laws
4.1 Performance Prediction
```python
class ScalingPredictor:
    """Model performance prediction based on a fitted scaling law."""

    def __init__(self, fitted_params):
        self.A, self.alpha_n, self.alpha_d, self.L_inf = fitted_params

    def predict_loss(self, N, D):
        """Predict the test loss at a given (N, D)."""
        return (self.A * np.power(N, -self.alpha_n) +
                self.A * np.power(D, -self.alpha_d) +
                self.L_inf)

    def compute_optimal(self, C):
        """
        Optimal (N, D) under compute budget C, assuming ~6N FLOPs per token.
        Substituting D = C / (6N) and minimizing over N gives
        N_opt = (a/b)^(1/(a+b)) * (C/6)^(b/(a+b)), with a = alpha_n, b = alpha_d.
        """
        a, b = self.alpha_n, self.alpha_d
        N_opt = (a / b) ** (1 / (a + b)) * (C / 6) ** (b / (a + b))
        D_opt = C / (6 * N_opt)
        return N_opt, D_opt

    def predict_benchmark(self, benchmark_name):
        """
        Predict performance on a specific benchmark.
        Requires benchmark-specific calibration; the mapping below is a
        crude linear placeholder.
        """
        benchmark_mapping = {
            'MMLU': {'scale': 0.7, 'shift': 0.3},
            'GSM8K': {'scale': 0.5, 'shift': 0.5},
            'HumanEval': {'scale': 0.4, 'shift': 0.6},
        }
        if benchmark_name not in benchmark_mapping:
            return None
        params = benchmark_mapping[benchmark_name]
        # Simplified linear mapping from the loss floor to a benchmark score
        return params['scale'] * self.L_inf + params['shift']
```
4.2 Predicting Early Stopping
```python
def predict_optimal_checkpoint(
    model_size: int,
    training_tokens: int,
    scaling_params: dict
):
    """
    Predict a stopping point from the scaling law: stop once the predicted
    loss at the current token count is close to the loss floor implied by
    the model size.
    """
    A = scaling_params['A']
    alpha_n = scaling_params['alpha_n']
    alpha_d = scaling_params['alpha_d']
    L_inf = scaling_params['L_inf']
    # Loss floor for this model size (infinite-data limit)
    L_floor = A * np.power(model_size, -alpha_n) + L_inf
    # Predicted loss at the current number of training tokens
    L_current = L_floor + A * np.power(training_tokens, -alpha_d)
    # Stop once within 10% of the floor
    target_loss = 1.1 * L_floor
    return {
        'loss_floor': L_floor,
        'predicted_loss': L_current,
        'target_loss': target_loss,
        'recommendation': 'stop' if L_current <= target_loss else 'continue'
    }
```
5. The Limits of Scaling Laws
5.1 Saturation
Paper: Scaling Laws for Neural Language Models1
When models are large enough or trained long enough, the scaling law breaks down:
- Compute saturation: diminishing returns from further training
- Data saturation: repeated data no longer provides new information
- Capability saturation: some capabilities stop improving smoothly with scale
```python
class SaturatingScalingLaw:
    """Scaling law with a saturation regime."""

    def __init__(self):
        self.saturation_threshold = None
        self.saturation_slope = None

    def fit_with_saturation(self, N, L, saturation_threshold=1e10):
        """
        Fit a two-regime law: a power law for N < saturation_threshold,
        and a linear tail for N >= saturation_threshold.
        """
        self.saturation_threshold = saturation_threshold
        # Split the two regimes
        mask = N < saturation_threshold
        # Power-law regime
        popt_pow, _ = curve_fit(
            lambda n, A, alpha: A * np.power(n, -alpha),
            N[mask], L[mask],
            p0=[1.0, 0.1]
        )
        # Saturation regime (linear fit; needs enough points)
        mask_sat = N >= saturation_threshold
        if mask_sat.sum() > 2:
            popt_sat = np.polyfit(N[mask_sat], L[mask_sat], 1)
            self.saturation_slope = popt_sat[0]
        return {
            'power_law': {'A': popt_pow[0], 'alpha': popt_pow[1]},
            'saturation_slope': self.saturation_slope
        }
```
5.2 The Emergence Boundary
When a capability (such as reasoning or code generation) only appears after crossing an "emergence threshold", simple power-law extrapolation fails.
This is discussed in detail on the Emergent Abilities page.
6. Scaling Laws and Architecture
6.1 How Architecture Affects Scaling
Paper: Scaling Laws for Deep Neural Networks8
Different architectures have different scaling exponents:
| Architecture | Scaling exponent | Notes |
|---|---|---|
| Transformer | ~0.076 | standard baseline |
| ResNet | ~0.045 | suited to vision |
| Perceiver | ~0.060 | modality-agnostic |
| Mamba (SSM) | ~0.070 | close to the Transformer |
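The practical consequence of a larger exponent is easy to quantify: for the same factor of growth in parameters, a Transformer-like exponent buys more loss reduction than a ResNet-like one. Illustrative arithmetic using the exponents from the table:

```python
# Loss reduction from a 100x parameter increase under different exponents.
# Multiplicative factor on the reducible loss = N_ratio^(-alpha).
for name, alpha in [("Transformer", 0.076), ("ResNet", 0.045), ("Mamba (SSM)", 0.070)]:
    factor = 100 ** (-alpha)
    print(f"{name}: loss x {factor:.3f} per 100x params")
```

A 100x scale-up shrinks the reducible loss to ~70% for the Transformer exponent but only to ~81% for the ResNet one, which compounds heavily over many decades of scale.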
6.2 Width vs. Depth Scaling
```python
import pandas as pd

def width_vs_depth_scaling(
    model_configs: list,
    test_losses: list
):
    """
    Compare the effect of width scaling against depth scaling.
    Assumes each config contains 'n_layers' and 'd_model'.
    """
    df = pd.DataFrame({
        'n_layers': [c['n_layers'] for c in model_configs],
        'd_model': [c['d_model'] for c in model_configs],
        'loss': test_losses
    })
    # Rough total parameter count
    df['N_total'] = df['n_layers'] * df['d_model'] ** 2
    # Width scaling fit
    width_popt, _ = curve_fit(
        lambda d, A, a: A * d ** (-a),
        df['d_model'], df['loss'],
        p0=[1.0, 0.1]
    )
    # Depth scaling fit
    depth_popt, _ = curve_fit(
        lambda n, A, a: A * n ** (-a),
        df['n_layers'], df['loss'],
        p0=[1.0, 0.1]
    )
    # Heuristic: prefer the dimension with the steeper fitted exponent
    return {
        'width_scaling': {'A': width_popt[0], 'alpha': width_popt[1]},
        'depth_scaling': {'A': depth_popt[0], 'alpha': depth_popt[1]},
        'recommendation': 'wider' if width_popt[1] > depth_popt[1] else 'deeper'
    }
```
7. Data Efficiency and Scaling
7.1 The Effect of Data Quality
Paper: Data Scaling Laws for Language Model Pre-training9
Key findings:
- Higher-quality data yields a more favorable power-law exponent (better scaling)
- Lower-quality data saturates sooner
- Deduplication markedly improves scaling behavior
```python
class QualityAdjustedScalingLaw:
    """Scaling law that accounts for data quality."""

    def __init__(self):
        self.quality_weights = None

    def estimate_quality(self, train_texts):
        """
        Estimate quality scores for the training data.
        Simplified version; in practice a reference model's perplexity
        ratio would be used instead.
        """
        quality_scores = []
        for text in train_texts:
            words = text.split()
            # Heuristic 1: average word length
            avg_word_len = np.mean([len(w) for w in words])
            # Heuristic 2: fraction of unique words
            vocab_richness = len(set(words)) / len(words)
            score = 0.5 * (1 / avg_word_len) + 0.5 * vocab_richness
            quality_scores.append(score)
        return np.array(quality_scores)

    def fit_quality_adjusted(self, N, D, quality_scores, L):
        """
        Fit the quality-adjusted law:
        L(Q, N, D) = A * Q^gamma / N^alpha + A * Q^(gamma/2) / D^beta + B
        """
        # curve_fit expects f(xdata, *params), with xdata packed as a tuple
        def model(X, A, alpha, beta, gamma, B):
            N, D, Q = X
            return A * np.power(Q, gamma) * np.power(N, -alpha) + \
                   A * np.power(Q, gamma / 2) * np.power(D, -beta) + B

        popt, _ = curve_fit(
            model,
            (N, D, quality_scores),
            L,
            p0=[1.0, 0.1, 0.1, 0.5, 0.1]
        )
        return dict(zip(['A', 'alpha', 'beta', 'gamma', 'B'], popt))
```
7.2 The Effect of Data Repetition
Paper: Scaling Data-Constrained Language Models6
Repeated data changes the scaling law: the data term D is replaced by an effective data size D_eff ≤ D, since repeated epochs over the same tokens contribute progressively less new information.
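A common way to model diminishing returns from repetition (an illustrative form with a made-up decay constant, not the paper's exact fit) is to let each additional pass over the unique tokens count for a geometrically decaying fraction of fresh data:

```python
def effective_data(unique_tokens, epochs, decay=0.8):
    """
    Effective data size when `unique_tokens` are repeated for `epochs` passes,
    each pass worth `decay` times the previous one (illustrative model).
    """
    value = sum(decay ** k for k in range(int(epochs)))
    return unique_tokens * value

# Four epochs over 1B unique tokens are worth far less than 4B fresh tokens
print(effective_data(1e9, 4))  # < 4e9
```

Substituting D_eff for D in the scaling law then predicts the earlier saturation observed when training repeatedly on a fixed corpus.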
8. Practical Guide
8.1 Planning a Training Budget
```python
def plan_training_budget(
    target_loss: float,
    compute_budget: float  # in FLOPs
) -> dict:
    """
    Plan a training run from the scaling law.
    Assumes training costs ~6N FLOPs per token.
    """
    # Illustrative fitted constants
    A = 2.5
    alpha_n = 0.076
    alpha_d = 0.095
    L_inf = 1.0
    # Sweep model sizes and keep the best predicted loss
    best_config = None
    best_loss = float('inf')
    for N in np.logspace(8, 11, 100):
        # Maximum number of training tokens affordable at this size
        max_tokens = compute_budget / (6 * N)
        L_pred = A * N ** (-alpha_n) + A * max_tokens ** (-alpha_d) + L_inf
        if L_pred < best_loss:
            best_loss = L_pred
            best_config = {
                'N': N,
                'D': max_tokens,
                'predicted_loss': L_pred,
                'meets_target': L_pred <= target_loss,
                'flops_per_token': 6 * N
            }
    return best_config

# Example
budget = 1e24  # FLOPs
config = plan_training_budget(target_loss=2.0, compute_budget=budget)
print(f"Recommended config: N={config['N']:.2e} params, D={config['D']:.2e} tokens")
```
8.2 Designing Scaling Experiments
```python
class ScalingExperiment:
    """Design a scaling-law experiment grid."""

    def __init__(self):
        self.base_config = {
            'n_layers': 12,
            'd_model': 768,
            'n_heads': 12,
            'd_ff': 3072,
        }

    def generate_scaling_points(self, n_points=10, strategy='geometric'):
        """
        Generate the experiment grid.
        Strategies:
        - 'geometric': geometric growth in total size (recommended)
        - 'logarithmic': logarithmic growth
        """
        configs = []
        if strategy == 'geometric':
            # Total size grows 1x, 4x, 16x, ...; each dimension by ~cube root
            for i in range(n_points):
                scale = 4 ** i
                config = self.base_config.copy()
                config['d_model'] = int(self.base_config['d_model'] * scale ** 0.33)
                config['d_ff'] = int(self.base_config['d_ff'] * scale ** 0.33)
                config['n_layers'] = int(self.base_config['n_layers'] * scale ** 0.33)
                configs.append(config)
        elif strategy == 'logarithmic':
            for i in range(n_points):
                scale = np.log(2 + i)
                config = self.base_config.copy()
                config['d_model'] = int(self.base_config['d_model'] * scale)
                configs.append(config)
        return configs

    def estimate_experiment_cost(self, configs, tokens_per_model=1e9):
        """Estimate the compute cost of the experiment grid."""
        costs = []
        for config in configs:
            # Rough parameter count: attention + FFN weights
            N = (config['d_model'] ** 2 * config['n_layers'] * 4 +
                 config['d_model'] * config['d_ff'] * config['n_layers'] * 2)
            # FLOPs estimate (forward + backward ≈ 6N per token)
            flops = 6 * N * tokens_per_model
            costs.append({
                'config': config,
                'params': N,
                'flops': flops,
                # Assuming ~3e14 sustained FLOP/s per A100
                'gpu_hours_approx': flops / 3e14 / 3600
            })
        return costs
```
9. Recent Research
9.1 LLM as a Predictor
Paper: LLM-Predict: Theoretical Scaling of Prediction Capabilities10
Key findings:
- At sufficient scale, an LLM can predict the scaling curves of other LLMs
- This reduces the value of small-scale pilot experiments
9.2 The Post-Chinchilla Era
Paper: Beyond Scaling: Efficient LLM Training with Flexible Allocation11
Key points:
- The Chinchilla recipe should not be followed mechanically
- Allocation should be adapted to the task, data quality, and deployment constraints
9.3 Where Scaling Laws Meet Emergence
Paper: Emergent Abilities: A Systematic Investigation12
Key findings:
- Some capabilities perform at chance level until a critical scale is reached
- Past the critical scale, performance improves sharply
- This contrasts with smooth, continuous scaling laws
See Emergent Abilities for details.
10. References
Footnotes
1. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361
2. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS 2022. https://arxiv.org/abs/2203.15556
3. Bahri, Y., et al. (2024). A Unified Theory of Neural Scaling Laws. ICML 2024. https://arxiv.org/abs/2402.04505
4. Mei, S., & Montanari, A. (2022). The Generalization Error of Random Features. PNAS. https://www.pnas.org/doi/10.1073/pnas.2108154116
5. Bordelon, B., et al. (2024). On the Origin of Neural Scaling Laws. arXiv:2401.10684. https://arxiv.org/abs/2401.10684
6. Hernandez, D., et al. (2023). Scaling Data-Constrained Language Models. NeurIPS 2023. https://arxiv.org/abs/2305.16264
7. Glorioso, P., et al. (2024). Beyond Chinchilla: Optimal Compute at Every Scale. arXiv:2404.09516. https://arxiv.org/abs/2404.09516
8. Burgess, H., et al. (2022). Scaling Laws for Deep Neural Networks. arXiv:2207.07962. https://arxiv.org/abs/2207.07962
9. Xie, S., et al. (2024). Data Scaling Laws for Language Model Pre-training. arXiv:2403.05491. https://arxiv.org/abs/2403.05491
10. Wei, J., et al. (2023). Emergent Abilities of Large Language Models. TMLR. https://arxiv.org/abs/2206.07682
11. Muennighoff, N., et al. (2024). Scaling Data-Constrained Learning. arXiv:2406.08457. https://arxiv.org/abs/2406.08457
12. Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. https://arxiv.org/abs/2307.15702