相关深入内容:
- Transformer缩放定律 — 缩放定律与涌现能力的关系
- Transformer演进史 — Transformer架构的演变历程
- Attention机制 — Transformer的核心机制
概述
涌现能力(Emergent Abilities)是指大型语言模型(LLM)在达到特定规模阈值后突然表现出的能力:这些能力在较小模型上完全不存在,或仅表现为随机猜测水平。这一现象最早由Google Research在2022年系统性提出,引起了学术界和工业界的广泛关注和激烈讨论。[^1]
核心问题
- 涌现是否真实存在? 还是测量指标的人为效应?
- 涌现的机制是什么? 为什么能力会突然出现?
- 能否预测涌现? 如何利用涌现现象?
1. 涌现能力的定义
1.1 原始定义
论文:Emergent Abilities of Large Language Models[^1]
“An emergent ability is the ability that is not present in smaller models but appears in larger models.”
定义要素:
- 能力在小型模型上不存在或仅表现为随机水平
- 能力在大型模型上显著出现
- 这种变化是不可预测的(在达到阈值前无法预测何时出现)
1.2 涌现的数学描述
涌现能力可以用相变(phase transition)来描述:
```python
def detect_emergence(
    model_sizes: list,
    task_performances: list,
    random_baseline: float = 0.25
) -> dict:
    """
    检测涌现能力

    涌现的判定标准:
    1. 小规模模型性能接近随机水平
    2. 大规模模型性能显著高于随机水平
    3. 存在明显的"跳跃"区间
    """
    import numpy as np

    performances = np.array(task_performances)
    sizes = np.array(model_sizes)

    # 1. 以规模中位数划分大小模型,检查小模型是否接近随机基线
    small_model_mask = sizes < np.median(sizes)
    large_model_mask = sizes >= np.median(sizes)
    small_perf = performances[small_model_mask].mean()
    large_perf = performances[large_model_mask].mean()

    # 2. 计算性能提升
    improvement = large_perf - small_perf

    # 3. 找到阈值点:以性能梯度最大处近似"跳跃"位置
    threshold_idx = int(np.argmax(np.gradient(performances)))
    threshold_size = sizes[threshold_idx]

    return {
        'is_emergent': (small_perf < random_baseline * 1.5 and
                        large_perf > random_baseline * 3 and
                        improvement > 0.3),
        'small_model_performance': small_perf,
        'large_model_performance': large_perf,
        'threshold_size': threshold_size,
        'improvement': improvement,
        'threshold_idx': threshold_idx
    }
```

1.3 涌现能力分类
| 类别 | 示例 | 典型阈值 |
|---|---|---|
| 推理能力 | 链式推理、数学证明 | 10B-100B |
| 代码能力 | 代码生成、调试 | 1B-10B |
| 多步规划 | 复杂任务分解 | 50B-100B |
| 指令遵循 | 复杂指令理解 | 10B-50B |
| 多语言 | 非英语任务 | 10B-100B |
2. 经典涌现能力案例
2.1 链式推理(Chain-of-Thought)
观察:CoT推理能力在约100B参数模型上涌现。
```python
# 无CoT prompt(较小模型)
prompt_no_cot = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11
"""

# 有CoT prompt(在较小模型上仍然无效)
prompt_with_cot = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
1. Roger starts with 5 balls.
2. He buys 2 cans × 3 balls = 6 balls.
3. Total: 5 + 6 = 11 balls.
"""

# 结果(示意数值)
results = {
    'small_model_no_cot': 0.25,    # 随机水平
    'small_model_with_cot': 0.28,  # 几乎无提升
    'large_model_no_cot': 0.55,    # 有所提升
    'large_model_with_cot': 0.92,  # 显著涌现
}
```

2.2 思维链涌现的临界点
```python
import matplotlib.pyplot as plt

def plot_emergence_curve():
    """
    可视化涌现曲线(示意数据)
    """
    model_sizes = [7e6, 70e6, 700e6, 7e9, 70e9, 175e9]

    # CoT能力的涌现曲线(示意)
    performances_no_cot = [0.15, 0.18, 0.22, 0.35, 0.55, 0.65]
    performances_with_cot = [0.15, 0.17, 0.20, 0.38, 0.82, 0.95]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # 左图:线性横轴
    ax1.plot(model_sizes, performances_no_cot, 'b-o', label='No CoT')
    ax1.plot(model_sizes, performances_with_cot, 'r-o', label='With CoT')
    ax1.axhline(y=0.25, color='gray', linestyle='--', alpha=0.5, label='Random')
    ax1.axvline(x=7e9, color='green', linestyle=':', alpha=0.7, label='Emergence Threshold')
    ax1.set_xlabel('Model Size')
    ax1.set_ylabel('Task Accuracy')
    ax1.legend()
    ax1.set_title('Emergence of Chain-of-Thought Reasoning')
    ax1.grid(True, alpha=0.3)

    # 右图:同一数据,对数横轴
    ax2.semilogx(model_sizes, performances_no_cot, 'b-o')
    ax2.semilogx(model_sizes, performances_with_cot, 'r-o')
    ax2.set_xlabel('Model Size')
    ax2.set_ylabel('Task Accuracy')
    ax2.set_title('Same Data, Log Scale')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig
```

2.3 更多涌现能力案例
| 能力 | 首次观察阈值 | 基准测试 |
|---|---|---|
| 3-digit arithmetic | 10B | Custom |
| Word in Context (WiC) | 50B | SuperGLUE |
| Symbol manipulation | 100B | Custom |
| Multi-step arithmetic | 100B | GSM8K |
| Logical deduction | 175B | LogiQA |
| Code description | 1B | HumanEval |
| Multilingual translation | 10B | XSum |
3. 涌现的质疑与反驳
3.1 涌现是海市蜃楼(Mirage)吗?
论文:Are Emergent Abilities of Large Language Models a Mirage?[^2]
核心论点:涌现可能只是评估指标的人为效应,而非真正的能力涌现。
关键论证:
- 指标非线性:许多评估指标(如精确匹配、多选准确率)是离散的、非线性的
  - 当性能处于0-20%区间时,这类指标几乎没有区分度
  - 真正的能力提升可能正在发生,但被指标掩盖
- 随机基线变化:随机基线随答案空间大小变化
  - 5选1:随机=20%
  - 1000选1:随机=0.1%
- 连续指标下涌现消失:改用连续指标(如困惑度)测量时,"涌现"变为平滑改善
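指标非线性这一论证可以用一个极简的数值示意来说明(以下逐token正确率数值为假设):若模型的逐token正确率 p 随规模平滑提升,长度为 k 的答案的精确匹配率近似为 p^k,在 p 较小的区间几乎无法与随机区分,于是曲线看起来像"突然涌现"。

```python
def exact_match_rate(per_token_acc: float, answer_len: int) -> float:
    """若每个token独立正确的概率为p,则k-token答案的精确匹配率约为p^k"""
    return per_token_acc ** answer_len

# 逐token正确率随规模平滑提升(假设数值)
for p in [0.50, 0.70, 0.85, 0.95, 0.99]:
    print(f"p={p:.2f}  1-token EM={p:.2f}  5-token EM={exact_match_rate(p, 5):.3f}")
```

同一条平滑的底层能力曲线,在 k=5 的精确匹配指标下前半段几乎贴地、末端迅速抬升,与论文所描述的"假涌现"形状一致。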
```python
def test_mirage_hypothesis():
    """
    测试涌现是否是Mirage
    关键:如果改用连续指标,涌现是否消失?
    """
    # 模拟不同指标下的"涌现"(示意数值)
    model_sizes = [7e6, 70e6, 700e6, 7e9, 70e9]

    # 底层能力(连续、平滑提升)
    true_capability = [0.1, 0.15, 0.25, 0.5, 0.85]

    # 离散指标(精确匹配)呈现典型的"涌现"假象
    discrete_acc = [0.0, 0.0, 0.0, 0.0, 1.0]

    # 连续指标(困惑度)呈现渐进改善
    perplexity = [150, 80, 35, 15, 5]

    return {
        'model_sizes': model_sizes,
        'true_capability': true_capability,
        'discrete_acc': discrete_acc,
        'perplexity': perplexity,
        'discrete_metric_shows_emergence': True,
        'continuous_metric_shows_emergence': False,
        'conclusion': 'Mirage hypothesis supported',
    }
```

3.2 阈值效应的统计检验
论文:Mathematical Capabilities of Large Language Models[^3]
核心方法:使用统计检验而非视觉检查判断涌现。
```python
import numpy as np
from scipy import stats

def statistical_emergence_test(
    model_sizes: np.ndarray,
    performances: np.ndarray,
    metric_type: str = 'discrete'
) -> dict:
    """
    统计检验涌现

    H0: 性能随log(模型规模)线性改善
    H1: 存在非线性的"涌现"
    metric_type 仅用于记录所用指标类型
    """
    log_sizes = np.log(model_sizes)

    # 1. 对log规模做线性拟合
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        log_sizes, performances
    )

    # 2. 残差分析
    predicted = slope * log_sizes + intercept
    residuals = performances - predicted

    # 3. 检验残差的系统性模式:二次项显著则不支持H0
    residual_trend = np.polyfit(log_sizes, residuals, 2)

    # 4. 计算效应量:最大残差相对整体波动的大小
    effect_size = np.max(np.abs(residuals)) / np.std(performances)

    return {
        'metric_type': metric_type,
        'linear_fit_r2': r_value ** 2,
        'residual_pattern': residual_trend,
        'effect_size': effect_size,
        'emergence_detected': effect_size > 1.5,
        'p_value': p_value,
        'interpretation': (
            'Strong emergence detected' if effect_size > 1.5 else
            'Weak/No emergence' if effect_size < 0.5 else
            'Ambiguous'
        )
    }
```

3.3 反驳:真正的涌现确实存在
论文:Why are LLMs' Abilities Emergent?[^4]
核心论点:从复杂性科学角度,涌现能力是真实存在的。
论证:
- 相变类比:物理系统中的相变也是"涌现"
  - 水在0°C突然凝固:温度连续变化,但物态是离散的
  - 类似地,模型规模连续增长,但能力可以"涌现"
- Grokking现象:训练过程中的突然泛化
  - 训练损失持续改善,而测试损失在某个点突然下降
  - 这是已被充分研究的真实涌现现象
- 计算复杂性阈值:某些问题需要超过特定规模的计算才能解决
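相变类比可以用一条简化的logistic曲线来示意(临界规模 N_c 与陡峭度 k 均为假设参数,并非任何论文的拟合结果):底层变量随log规模平滑变化,但观测到的任务成功率在临界点附近快速跳变。

```python
import math

def phase_transition_success(model_size: float,
                             critical_size: float = 1e10,
                             steepness: float = 4.0) -> float:
    """logistic相变示意:P = sigmoid(k * (log10(N) - log10(N_c)))"""
    x = steepness * (math.log10(model_size) - math.log10(critical_size))
    return 1.0 / (1.0 + math.exp(-x))

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"N={n:.0e}  P(success)={phase_transition_success(n):.3f}")
```

横轴是连续的log规模,但成功率从接近0升至接近1只跨越约一个数量级,与"突然出现"的观感一致。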
4. 涌现能力的理论解释
4.1 上下文学习(ICL)视角
论文:Are Emergent Abilities in LLMs just In-Context Learning?[^5]
核心论点:涌现能力是ICL的组合结果,而非真正的新能力。
机制解释:
```python
def icl_emergence_analysis():
    """
    分析涌现能力是否可以用ICL解释(示意模型)
    """
    # ICL能力的组成部分及其强度(假设数值)
    icl_components = {
        'pattern_matching': 0.3,      # 模式匹配
        'memorization': 0.25,         # 记忆
        'inductive_reasoning': 0.25,  # 归纳推理
        'language_knowledge': 0.2     # 语言知识
    }

    # 不同任务对各ICL组件的依赖权重
    task_requirements = {
        'arithmetic': {'pattern_matching': 0.5, 'inductive_reasoning': 0.3},
        'code_generation': {'memorization': 0.4, 'language_knowledge': 0.3},
        'logical_reasoning': {'inductive_reasoning': 0.5, 'language_knowledge': 0.3},
    }

    def compute_threshold(requirements, component_strengths):
        # 任务能力在所有依赖组件都达标后才涌现,
        # 因此阈值由最难满足(数值最大)的组件决定
        thresholds = []
        for component, weight in requirements.items():
            if component in component_strengths:
                thresholds.append(1.0 / (component_strengths[component] * weight))
        return max(thresholds) if thresholds else float('inf')

    return {
        task: compute_threshold(req, icl_components)
        for task, req in task_requirements.items()
    }
```

4.2 复杂度阈值理论
论文:Emergent Abilities of Synthetic Cartography[^6]
核心思想:某些能力需要模型具备超过特定复杂度的内部表示才能实现。
```python
class ComplexityThresholdModel:
    """
    复杂度阈值模型

    假设:每个任务有一个最小复杂度要求,
    当模型参数量对应的复杂度超过该阈值时,模型才能学会该任务
    """

    def __init__(self):
        # 各任务的复杂度参数(示意数值)
        self.task_complexities = {
            'arithmetic_2digit': 1e6,        # ~1M参数
            'arithmetic_5digit': 1e9,        # ~1B参数
            'logical_deduction_3step': 1e8,
            'logical_deduction_7step': 1e11,
            'code_generation_simple': 1e8,
            'code_generation_complex': 1e11,
        }

    def predict_emergence(self, model_size, task):
        """
        预测任务是否会在给定规模上涌现
        """
        required_complexity = self.task_complexities.get(task, 1e12)

        # 假设:模型复杂度 ~ model_size^1.5(考虑非线性)
        model_complexity = model_size ** 1.5

        return {
            'task': task,
            'model_size': model_size,
            'required_complexity': required_complexity,
            'model_complexity': model_complexity,
            'will_emerge': model_complexity > required_complexity,
            'margin': model_complexity / required_complexity
        }
```

4.3 涌现能力的统一理论
论文:Large Language Models and Emergence: A Complex Systems Perspective[^7]
核心框架:将涌现能力置于复杂性科学框架中理解。
```python
import numpy as np
from sklearn.linear_model import LinearRegression


class EmergenceUnifiedTheory:
    """
    统一涌现理论

    核心观点:
    1. "More is Different":数量带来质变
    2. 临界点不是平滑的,而是突变的
    3. 涌现是可预测的(给定正确的理论框架)
    """

    # 复杂性科学的类比
    analogies = {
        'physics': '相变(冰→水→蒸汽)',
        'chemistry': '化学反应中的突然变化',
        'biology': '生命从非生命中涌现',
        'LLM': '能力从规模中涌现',
    }

    @staticmethod
    def compute_phase_transition_probability(
        model_size: float,
        task_difficulty: float,
        temperature: float = 1.0
    ) -> float:
        """
        计算相变概率,类比统计力学:能垒越低(模型越大),跨越相变的概率越高
        """
        # 简化的能量景观模型
        energy_barrier = task_difficulty / model_size
        # 类Boltzmann形式的成功概率
        p_success = 1.0 / (1.0 + np.exp(energy_barrier / temperature))
        return p_success

    @staticmethod
    def predict_emergence_threshold(
        known_emergences: list,
        task_properties: dict
    ) -> float:
        """
        基于已知涌现预测新任务的阈值
        known_emergences: [(model_size, task_property), ...]
        """
        X = np.array([[e[1]] for e in known_emergences])  # 任务属性
        y = np.array([e[0] for e in known_emergences])    # 涌现阈值
        model = LinearRegression()
        model.fit(X, y)
        predicted = model.predict(np.array([[task_properties['difficulty']]]))
        return float(predicted[0])
```

5. 预测与规划
5.1 涌现阈值预测
```python
class EmergencePredictor:
    """
    涌现阈值预测器
    """

    def __init__(self):
        self.known_emergences = []

    def add_observation(self, model_size, task_name, task_property,
                        performance, threshold_reached):
        """添加观察数据(task_property:代码行数、推理步数等)"""
        self.known_emergences.append({
            'model_size': model_size,
            'task': task_name,
            'task_property': task_property,
            'performance': performance,
            'threshold_reached': threshold_reached
        })

    def predict_threshold(self, task_name, task_property):
        """
        预测特定任务的涌现阈值
        """
        # 找到同类任务的历史观察
        similar_tasks = [
            obs for obs in self.known_emergences
            if obs['task'] == task_name
        ]

        if len(similar_tasks) >= 2:
            # 基于同类任务做线性外推
            properties = [s['task_property'] for s in similar_tasks]
            thresholds = [s['model_size'] for s in similar_tasks]
            slope = ((thresholds[1] - thresholds[0]) /
                     (properties[1] - properties[0]))
            return thresholds[0] + slope * (task_property - properties[0])

        # 默认:使用幂律外推
        return 1e10 * (task_property ** 1.5)

    def estimate_pretraining_loss(self, model_size):
        """
        估计给定规模模型的预训练损失,作为涌现能力的代理指标
        基于缩放定律 L(N) = A * N^(-alpha) + L_inf(系数为示意值)
        """
        A = 2.5
        alpha = 0.076
        L_inf = 1.0
        return A * (model_size ** (-alpha)) + L_inf
```

5.2 能力规划框架
```python
class CapabilityPlanner:
    """
    能力规划框架:基于涌现理论规划模型训练
    """

    def __init__(self, target_capabilities: list):
        self.target_capabilities = target_capabilities

    def estimate_required_model_size(self):
        """
        估算达到目标能力所需的模型规模(基于已知涌现阈值的示意值)
        """
        known_thresholds = {
            'multi_step_reasoning': 70e9,  # 70B,基于CoT分析
            'code_generation': 10e9,       # 10B
            'multilingual': 50e9,          # 50B
        }
        return {
            cap: known_thresholds.get(cap, 100e9)  # 未知能力默认100B
            for cap in self.target_capabilities
        }

    def compute_roi(self, capability, model_size, training_cost):
        """
        计算特定能力的投资回报率(简化模型)
        """
        benefit_scores = {
            'multi_step_reasoning': 0.9,
            'code_generation': 0.85,
            'multilingual': 0.7,
        }
        benefit = benefit_scores.get(capability, 0.5)
        # 成本以10B参数为单位归一化,避免ROI数值过小
        cost_units = (model_size / 10e9) * training_cost
        return benefit / cost_units

    def recommend_scaling_path(self):
        """
        按ROI推荐最优的扩展路径
        """
        required_sizes = self.estimate_required_model_size()
        recommendations = []
        for cap in self.target_capabilities:
            required_size = required_sizes[cap]
            roi = self.compute_roi(cap, required_size, 1.0)  # 归一化成本
            recommendations.append({
                'capability': cap,
                'required_size': required_size,
                'roi': roi,
                'priority': ('high' if roi > 0.8 else
                             'medium' if roi > 0.1 else 'low')
            })

        # 按ROI排序
        recommendations.sort(key=lambda x: x['roi'], reverse=True)
        return recommendations
```

6. 测量方法
6.1 涌现能力评估框架
```python
class EmergenceEvaluator:
    """
    涌现能力评估框架(骨架)

    注:compute_* / load_model / evaluate_task / statistical_emergence_test /
    interpret_results 为占位接口,需结合具体模型与基准实现
    """

    def __init__(self, model, benchmark_dataset):
        self.model = model
        self.dataset = benchmark_dataset

    def evaluate_with_multiple_metrics(self, task):
        """
        使用多种指标评估任务性能,避免被单一离散指标误导
        """
        results = {}
        # 1. 精确匹配(离散指标)
        results['exact_match'] = self.compute_exact_match(task)
        # 2. 困惑度(连续指标)
        results['perplexity'] = self.compute_perplexity(task)
        # 3. 连续准确率(平滑版本)
        results['soft_accuracy'] = self.compute_soft_accuracy(task)
        # 4. 嵌入相似度
        results['embedding_similarity'] = self.compute_embedding_similarity(task)
        return results

    def detect_emergence_across_scales(
        self,
        model_sizes: list,
        task: str
    ) -> dict:
        """
        检测任务在不同规模下的涌现
        """
        performances = []
        for size in model_sizes:
            model = self.load_model(size)
            performances.append(self.evaluate_task(model, task))

        # 分别用离散与连续指标做统计检验
        discrete_results = self.statistical_emergence_test(
            model_sizes, performances, 'discrete'
        )
        continuous_results = self.statistical_emergence_test(
            model_sizes, performances, 'continuous'
        )

        return {
            'model_sizes': model_sizes,
            'performances': performances,
            'discrete_emergence': discrete_results,
            'continuous_emergence': continuous_results,
            'conclusion': self.interpret_results(discrete_results,
                                                 continuous_results)
        }
```

6.2 基准测试套件
| 基准 | 能力类别 | 规模阈值 | 链接 |
|---|---|---|---|
| BIG-Bench | 多样化 | 100B+ | https://github.com/google/BIG-Bench |
| MMLU | 多任务理解 | 50B+ | https://github.com/hendrycks/test |
| GSM8K | 数学推理 | 100B+ | https://github.com/openai/grade-school-math |
| HumanEval | 代码生成 | 1B+ | https://github.com/openai/human-eval |
| HELM | 全面评估 | 多种 | https://crfm.stanford.edu/helm |
7. 最新研究进展
7.1 2024-2025年重要发现
论文:Emergent Abilities in Large Language Models: A Survey[^8]
主要发现:
- 涌现能力比最初认为的更加普遍
- 某些涌现能力可以被早期干预触发
- 涌现的阈值可以通过训练策略调整
论文:Learning to Reason with LLMs(OpenAI o1)[^9]
主要发现:
- 推理模型(如o1)的能力随推理时计算量(inference-time compute)增长而提升
- 这类推理能力的出现不依赖预训练规模意义上的涌现
- 链式推理能力可以通过增加inference-time compute获得
7.2 实践影响
关键洞察:
- 能力不是线性的:即使缩放定律预测性能平滑改善,特定能力仍可能涌现
- 规模不是唯一因素:训练策略、数据质量、架构设计都影响涌现
- 可预测性提高:随着数据积累,可以更准确地预测涌现阈值
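"能力不是线性的"这一点可以用一个数值示意来说明(幂律系数沿用5.1节中的示意值,损失阈值与陡峭度为假设参数):预训练损失随规模平滑下降,但若任务准确率是损失的陡峭sigmoid函数,任务表现就会在某个规模附近突然抬升。

```python
import math

def pretraining_loss(n_params: float) -> float:
    """幂律缩放(示意系数):L(N) = 2.5 * N^(-0.076) + 1.0,随规模平滑下降"""
    return 2.5 * n_params ** (-0.076) + 1.0

def task_accuracy(loss: float, loss_threshold: float = 1.5,
                  steepness: float = 30.0) -> float:
    """假设任务准确率是损失的陡峭sigmoid:损失降过阈值后准确率快速上升"""
    return 1.0 / (1.0 + math.exp(steepness * (loss - loss_threshold)))

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    l = pretraining_loss(n)
    print(f"N={n:.0e}  loss={l:.3f}  acc={task_accuracy(l):.3f}")
```

损失列平滑单调下降,而准确率列呈现从接近0到接近1的"跳跃":同一份平滑改善,经非线性映射后便是"涌现"的观感。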
8. 总结与展望
8.1 当前共识
| 观点 | 支持度 | 说明 |
|---|---|---|
| 涌现能力存在 | 高 | 大量实验证据 |
| 涌现与指标相关 | 中高 | 离散指标更容易显示涌现 |
| 涌现可预测 | 中 | 依赖任务类型和模型架构 |
| 涌现可干预 | 低-中 | 训练策略有一定影响 |
8.2 开放问题
- 为什么某些能力涌现而其他不涌现?
- 能否通过干预提前触发涌现?
- 涌现能力的机制基础是什么?
- 是否存在"负涌现"(能力退化)?
8.3 实践建议
- 规划模型规模:参考已知涌现阈值
- 多指标评估:避免被单一指标误导
- 关注ICL:某些"涌现"可能是ICL的组合
- 持续监控:涌现可能在训练后期出现
参考

[^1]: Wei, J., et al. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682. https://arxiv.org/abs/2206.07682
[^2]: Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. https://arxiv.org/abs/2307.15702
[^3]: Mathematical Capabilities of Large Language Models. (2024). arXiv:2401.12368. https://arxiv.org/abs/2401.12368
[^4]: Havlik, V. (2025). Why are LLMs' Abilities Emergent? A Complex Systems Perspective. arXiv:2508.04401. https://arxiv.org/abs/2508.04401
[^5]: Lu, S., et al. (2024). Are Emergent Abilities in Large Language Models just In-Context Learning? ACL 2024. https://aclanthology.org/2024.acl-long.279/
[^6]: Zhou, P., et al. (2024). Emergent Abilities of Synthetic Cartography. arXiv:2410.10893. https://arxiv.org/abs/2410.10893
[^7]: Complex Systems Research Team. (2025). Large Language Models and Emergence: A Complex Systems Perspective. arXiv:2506.11135. https://arxiv.org/html/2506.11135
[^8]: Berti, L., et al. (2025). Emergent Abilities in Large Language Models: A Survey. arXiv:2503.05788. https://arxiv.org/abs/2503.05788
[^9]: OpenAI. (2024). Learning to Reason with LLMs. OpenAI Blog. https://openai.com/index/learning-to-reason-with-llms/