
Overview

Emergent abilities are capabilities that large language models (LLMs) suddenly display once they cross a particular scale threshold, while smaller models either lack them entirely or perform no better than random guessing. The phenomenon was first described systematically by Google Research in 2022 and has since drawn wide attention, and heated debate, from both academia and industry. [1]

Core Questions

  1. Is emergence real, or just an artifact of the evaluation metrics?
  2. What is the mechanism behind emergence? Why do abilities appear suddenly?
  3. Can emergence be predicted? And how can it be exploited?

1. Defining Emergent Abilities

1.1 The Original Definition

Paper: Emergent Abilities of Large Language Models [1]

“An ability is emergent if it is not present in smaller models but is present in larger models.”

Elements of the definition

  1. The ability is absent in smaller models, or no better than random chance
  2. The ability appears clearly in larger models
  3. The transition is unpredictable: before the threshold is reached, one cannot foresee when the ability will appear

1.2 A Mathematical Description of Emergence

Emergent abilities can be described as a phase transition:

import numpy as np

def detect_emergence(
    model_sizes: list,
    task_performances: list,
    random_baseline: float = 0.25
) -> dict:
    """
    Detect an emergent ability.

    Criteria:
    1. Small models perform near the random baseline
    2. Large models perform well above the random baseline
    3. There is a clear "jump" between the two regimes
    """
    performances = np.array(task_performances)
    sizes = np.array(model_sizes)
    
    # 1. Split models into small and large halves at the median size
    small_model_mask = sizes < np.median(sizes)
    large_model_mask = sizes >= np.median(sizes)
    
    small_perf = performances[small_model_mask].mean()
    large_perf = performances[large_model_mask].mean()
    
    # 2. Overall gain from the small to the large regime
    improvement = large_perf - small_perf
    
    # 3. Locate the threshold at the steepest point of the curve
    threshold_idx = np.argmax(np.gradient(performances))
    threshold_size = sizes[threshold_idx]
    
    return {
        'is_emergent': (small_perf < random_baseline * 1.5 and 
                       large_perf > random_baseline * 3 and
                       improvement > 0.3),
        'small_model_performance': small_perf,
        'large_model_performance': large_perf,
        'threshold_size': threshold_size,
        'improvement': improvement,
        'threshold_idx': threshold_idx
    }

1.3 A Taxonomy of Emergent Abilities

| Category | Examples | Typical threshold |
| --- | --- | --- |
| Reasoning | chain-of-thought, mathematical proofs | 10B-100B |
| Code | code generation, debugging | 1B-10B |
| Multi-step planning | complex task decomposition | 50B-100B |
| Instruction following | understanding complex instructions | 10B-50B |
| Multilingual | non-English tasks | 10B-100B |
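As a quick sanity-check helper, the ranges in this table can be encoded as a lookup. The numbers below are the illustrative literature-reported ranges from the table, and `likely_emerged` is a hypothetical helper, not a standard API:

```python
# Illustrative threshold ranges (parameters) from the table above.
# These are rough literature-reported values, not exact constants.
EMERGENCE_THRESHOLDS = {
    'reasoning': (10e9, 100e9),           # chain-of-thought, math proofs
    'code': (1e9, 10e9),                  # code generation, debugging
    'multi_step_planning': (50e9, 100e9),
    'instruction_following': (10e9, 50e9),
    'multilingual': (10e9, 100e9),
}

def likely_emerged(category: str, model_size: float) -> str:
    """Classify a model size against a category's reported threshold range."""
    low, high = EMERGENCE_THRESHOLDS[category]
    if model_size < low:
        return 'unlikely'
    if model_size < high:
        return 'possible'
    return 'likely'
```

For example, `likely_emerged('code', 5e9)` returns `'possible'`, reflecting the 1B-10B range in the table.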

2. Classic Examples of Emergent Abilities

2.1 Chain-of-Thought Reasoning

Observation: CoT reasoning emerges at roughly 100B parameters.

# Prompt without CoT (smaller models)
prompt_no_cot = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11
"""
 
# Prompt with CoT (still ineffective for smaller models)
prompt_with_cot = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
1. Roger starts with 5 balls.
2. He buys 2 cans × 3 balls = 6 balls.
3. Total: 5 + 6 = 11 balls.
"""
 
# Results (illustrative accuracies)
results = {
    'small_model_no_cot': 0.25,    # random level
    'small_model_with_cot': 0.28,   # almost no gain
    'large_model_no_cot': 0.55,     # some improvement
    'large_model_with_cot': 0.92,   # clear emergence
}
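One way to read these numbers is as a difference-in-differences: CoT barely helps at small scale but helps enormously at large scale, and the gap between those two gains is the emergent interaction. A minimal sketch, reusing the illustrative accuracies above:

```python
# Illustrative accuracies (same values as the example above).
results = {
    'small_model_no_cot': 0.25,
    'small_model_with_cot': 0.28,
    'large_model_no_cot': 0.55,
    'large_model_with_cot': 0.92,
}

# CoT gain at each scale
small_gain = results['small_model_with_cot'] - results['small_model_no_cot']
large_gain = results['large_model_with_cot'] - results['large_model_no_cot']

# Difference-in-differences: how much *more* CoT helps at large scale.
# A large positive value means the prompting technique itself "emerges".
interaction = large_gain - small_gain
print(round(interaction, 2))  # → 0.34
```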

2.2 The Critical Point of CoT Emergence

import matplotlib.pyplot as plt
 
def plot_emergence_curve():
    """
    Visualize an emergence curve (illustrative data).
    """
    model_sizes = [7e6, 70e6, 700e6, 7e9, 70e9, 175e9]
    model_names = ['7M', '70M', '700M', '7B', '70B', '175B']
    
    # Illustrative CoT emergence curves
    performances_no_cot = [0.15, 0.18, 0.22, 0.35, 0.55, 0.65]
    performances_with_cot = [0.15, 0.17, 0.20, 0.38, 0.82, 0.95]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left panel: linear x-axis
    ax1.plot(model_sizes, performances_no_cot, 'b-o', label='No CoT')
    ax1.plot(model_sizes, performances_with_cot, 'r-o', label='With CoT')
    ax1.axhline(y=0.25, color='gray', linestyle='--', alpha=0.5, label='Random')
    ax1.axvline(x=7e9, color='green', linestyle=':', alpha=0.7, label='Emergence Threshold')
    ax1.set_xlabel('Model Size')
    ax1.set_ylabel('Task Accuracy')
    ax1.legend()
    ax1.set_title('Emergence of Chain-of-Thought Reasoning')
    ax1.grid(True, alpha=0.3)
    
    # Right panel: log x-axis, where the "jump" looks far sharper
    ax2.semilogx(model_sizes, performances_no_cot, 'b-o')
    ax2.semilogx(model_sizes, performances_with_cot, 'r-o')
    ax2.set_xticks(model_sizes)
    ax2.set_xticklabels(model_names)
    ax2.set_xlabel('Model Size')
    ax2.set_ylabel('Task Accuracy')
    ax2.set_title('Same Data, Log Scale')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    return fig

2.3 More Examples of Emergent Abilities

| Ability | First observed threshold | Benchmark |
| --- | --- | --- |
| 3-digit arithmetic | 10B | Custom |
| Word in Context (WiC) | 50B | SuperGLUE |
| Symbol manipulation | 100B | Custom |
| Multi-step arithmetic | 100B | GSM8K |
| Logical deduction | 175B | LogiQA |
| Code description | 1B | HumanEval |
| Multilingual translation | 10B | XSum |

3. Challenges to Emergence, and Rebuttals

3.1 Is Emergence a Mirage?

Paper: Are Emergent Abilities of Large Language Models a Mirage? [2]

Core claim: emergence may be an artifact of the evaluation metric rather than a genuine jump in capability.

Key arguments

  1. Metric nonlinearity: many evaluation metrics (e.g. exact-match or multiple-choice accuracy) are discrete and bounded

    • In the 0-20% range they provide almost no resolution
    • Genuine capability gains can occur but be hidden by the metric
  2. Shifting random baselines: the random baseline depends on the size of the answer space

    • 1 of 5 choices: random = 20%
    • 1 of 1000 choices: random = 0.1%
  3. Metric choice matters: measured with a continuous metric such as perplexity, the apparent emergence disappears

import numpy as np

def test_mirage_hypothesis(answer_length: int = 5) -> dict:
    """
    Test the mirage hypothesis on synthetic data.
    
    Key question: does "emergence" disappear under a continuous metric?
    """
    # Underlying per-token capability (continuous, improves smoothly)
    per_token_accuracy = np.array([0.10, 0.30, 0.55, 0.80, 0.95])
    
    # Discrete metric: exact match on an answer of `answer_length` tokens.
    # Every token must be right, so accuracy ~ p^k - a sharp, "emergent" curve.
    exact_match = per_token_accuracy ** answer_length
    
    # Continuous metric: per-token log-likelihood improves gradually
    log_likelihood = np.log(per_token_accuracy)
    
    return {
        'per_token_accuracy': per_token_accuracy,
        'exact_match': exact_match,       # sharp jump at the largest scales
        'log_likelihood': log_likelihood, # smooth improvement throughout
        'conclusion': 'the same smooth capability looks emergent under a discrete metric'
    }

3.2 Statistical Tests for Threshold Effects

Paper: Mathematical Capabilities of Large Language Models [3]

Core method: judge emergence with statistical tests rather than visual inspection.

from scipy import stats
import numpy as np
 
def statistical_emergence_test(
    model_sizes: np.ndarray,
    performances: np.ndarray,
    metric_type: str = 'discrete'
) -> dict:
    """
    A statistical test for emergence.
    
    H0: performance improves linearly in log model size
    H1: there is a nonlinear "emergent" jump
    """
    # 1. Linear fit in log model size
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        np.log(model_sizes), performances
    )
    
    # 2. Residual analysis
    predicted = slope * np.log(model_sizes) + intercept
    residuals = performances - predicted
    
    # 3. Check the residuals for systematic nonlinearity;
    #    a clear pattern argues against H0
    residual_trend = np.polyfit(np.log(model_sizes), residuals, 2)
    
    # 4. Effect size: largest residual relative to the spread of performances
    effect_size = np.max(np.abs(residuals)) / np.std(performances)
    
    return {
        'metric_type': metric_type,
        'linear_fit_r2': r_value ** 2,
        'residual_pattern': residual_trend,
        'effect_size': effect_size,
        'emergence_detected': effect_size > 1.5,
        'p_value': p_value,
        'interpretation': (
            'Strong emergence detected' if effect_size > 1.5 else
            'Weak/No emergence' if effect_size < 0.5 else
            'Ambiguous'
        )
    }

3.3 The Rebuttal: Genuine Emergence Does Exist

Paper: Why are LLMs' Abilities Emergent? [4]

Core claim: from the standpoint of complexity science, emergent abilities are real.

Arguments

  1. The phase-transition analogy: phase transitions in physical systems are also "emergent"

    • Water freezes abruptly at 0°C
    • Temperature changes continuously, yet the state change is discrete
    • Likewise, model scale grows continuously, but abilities can still "emerge"
  2. Grokking: sudden generalization during training

    • Training loss improves steadily
    • Test loss drops abruptly at some point
    • This is a well-studied, genuinely emergent phenomenon
  3. Computational-complexity thresholds: some problems cannot be solved below a certain amount of computation
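The third argument can be caricatured in a few lines: if a task requires a fixed number of sequential composition steps, and a model can only execute as many steps as its depth allows (a deliberately crude assumption; `solve_rate` and the depth proxy are hypothetical), then success is a step function of scale even though scale itself grows smoothly:

```python
# Toy model of a computational-complexity threshold. Assumption (hypothetical):
# a task needing `task_steps` sequential steps is solved only by a model whose
# depth covers them, so accuracy is all-or-nothing in "scale".
def solve_rate(model_depth: int, task_steps: int) -> float:
    """All-or-nothing success once depth covers the required steps."""
    return 1.0 if model_depth >= task_steps else 0.0

depths = [2, 4, 8, 16, 32, 64]          # smoothly growing "scale"
curve = [solve_rate(d, task_steps=16) for d in depths]
print(curve)  # → [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Unlike the p^k mirage above, no change of metric can smooth this curve out: below the threshold the task is simply unsolvable.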


4. Theoretical Explanations of Emergence

4.1 The In-Context Learning (ICL) Perspective

Paper: Are Emergent Abilities in LLMs just In-Context Learning? [5]

Core claim: emergent abilities are combinations of ICL components, not genuinely new capabilities.

A mechanistic sketch

def icl_emergence_analysis():
    """
    Analyze whether emergent abilities can be explained by ICL.
    """
    # Component strengths of ICL (illustrative weights)
    icl_components = {
        'pattern_matching': 0.3,
        'memorization': 0.25,
        'inductive_reasoning': 0.25,
        'language_knowledge': 0.2
    }
    
    # How strongly each task depends on each ICL component
    task_requirements = {
        'arithmetic': {'pattern_matching': 0.5, 'inductive_reasoning': 0.3},
        'code_generation': {'memorization': 0.4, 'language_knowledge': 0.3},
        'logical_reasoning': {'inductive_reasoning': 0.5, 'language_knowledge': 0.3},
    }
    
    # Emergence threshold per task
    def compute_threshold(requirements, component_strengths):
        # The task emerges only once *every* required component is strong
        # enough, so the binding constraint is the largest threshold.
        thresholds = []
        for capability, weight in requirements.items():
            if capability in component_strengths:
                thresholds.append(1.0 / (component_strengths[capability] * weight))
        return max(thresholds) if thresholds else float('inf')
    
    thresholds = {
        task: compute_threshold(req, icl_components)
        for task, req in task_requirements.items()
    }
    
    return thresholds

4.2 Complexity-Threshold Theory

Paper: Emergent Abilities of Synthetic Cartography [6]

Core idea: some abilities require internal representations above a certain complexity, which only sufficiently large models can form.

class ComplexityThresholdModel:
    """
    Complexity-threshold model.
    
    Hypothesis: every task has a minimum complexity requirement, and a
    model can only acquire the task once its capacity clears that bar.
    """
    def __init__(self):
        # Required complexity per task (illustrative values)
        self.task_complexities = {
            'arithmetic_2digit': 1e6,     # ~1M parameters
            'arithmetic_5digit': 1e9,     # ~1B parameters
            'logical_deduction_3step': 1e8,
            'logical_deduction_7step': 1e11,
            'code_generation_simple': 1e8,
            'code_generation_complex': 1e11,
        }
    
    def predict_emergence(self, model_size, task):
        """
        Predict whether a task will emerge at a given model size.
        """
        required_complexity = self.task_complexities.get(task, 1e12)
        
        # Assumption: effective model complexity ~ model_size^1.5 (superlinear)
        model_complexity = model_size ** 1.5
        
        return {
            'task': task,
            'model_size': model_size,
            'required_complexity': required_complexity,
            'model_complexity': model_complexity,
            'will_emerge': model_complexity > required_complexity,
            'margin': model_complexity / required_complexity
        }

4.3 Toward a Unified Theory of Emergence

Paper: Large Language Models and Emergence: A Complex Systems Perspective [7]

Core framework: understand emergent abilities within complexity science.

import numpy as np
from sklearn.linear_model import LinearRegression

class EmergenceUnifiedTheory:
    """
    A unified theory of emergence.
    
    Core claims:
    1. "More is Different" - quantity produces qualitative change
    2. The critical point is a sharp transition, not a smooth one
    3. Emergence is predictable, given the right theoretical framework
    """
    
    # Analogies from complexity science
    analogies = {
        'physics': 'phase transitions (ice -> water -> steam)',
        'chemistry': 'abrupt changes in chemical reactions',
        'biology': 'life emerging from non-life',
        'LLM': 'abilities emerging from scale',
    }
    
    @staticmethod
    def compute_phase_transition_probability(
        model_size: float,
        task_difficulty: float,
        temperature: float = 1.0
    ) -> float:
        """
        Probability of crossing the phase transition,
        by analogy with statistical mechanics.
        """
        # Simplified energy-landscape model
        energy_barrier = task_difficulty / model_size
        
        # Sigmoid in the temperature-scaled barrier height
        P_success = 1 / (1 + np.exp(energy_barrier / temperature))
        
        return P_success
    
    @staticmethod
    def predict_emergence_threshold(
        known_emergences: list,
        task_properties: dict
    ) -> float:
        """
        Predict a new task's threshold from known emergences.
        
        known_emergences: [(model_size, task_property), ...]
        """
        X = np.array([[e[1]] for e in known_emergences])   # task property
        y = np.array([e[0] for e in known_emergences])     # observed threshold
        
        model = LinearRegression()
        model.fit(X, y)
        
        predicted_threshold = model.predict([[task_properties['difficulty']]])
        
        return predicted_threshold[0]

5. Prediction and Planning

5.1 Predicting Emergence Thresholds

class EmergencePredictor:
    """
    Emergence-threshold predictor.
    """
    def __init__(self):
        self.known_emergences = []
    
    def add_observation(self, model_size, task_name, task_property,
                        performance, threshold_reached):
        """Record one observation."""
        self.known_emergences.append({
            'model_size': model_size,
            'task': task_name,
            'task_property': task_property,
            'performance': performance,
            'threshold_reached': threshold_reached
        })
    
    def predict_threshold(self, task_name, task_property):
        """
        Predict the emergence threshold for a task.
        
        task_property: e.g. lines of code, number of reasoning steps
        """
        # Find observations of the same task
        similar_tasks = [
            obs for obs in self.known_emergences 
            if obs['task'] == task_name
        ]
        
        if len(similar_tasks) >= 2:
            # Linear extrapolation from two observations
            properties = [s['task_property'] for s in similar_tasks]
            thresholds = [s['model_size'] for s in similar_tasks]
            
            slope = (thresholds[1] - thresholds[0]) / (properties[1] - properties[0])
            predicted = thresholds[0] + slope * (task_property - properties[0])
            
            return predicted
        
        # Default: power-law extrapolation
        return 1e10 * (task_property ** 1.5)
    
    def estimate_pretraining_loss(self, model_size):
        """
        Estimate pretraining loss for a given model size,
        as a proxy signal for emergent abilities.
        """
        # Scaling-law form L(N) = A * N^(-alpha) + L_inf (illustrative constants)
        A = 2.5
        alpha = 0.076
        L_inf = 1.0
        
        return A * (model_size ** (-alpha)) + L_inf

5.2 A Capability-Planning Framework

class CapabilityPlanner:
    """
    Capability-planning framework.
    
    Plans model training based on emergence theory.
    """
    
    def __init__(self, target_capabilities: list):
        self.target_capabilities = target_capabilities
    
    def estimate_required_model_size(self):
        """
        Estimate the model size needed for each target capability.
        """
        estimates = {}
        
        for capability in self.target_capabilities:
            if capability == 'multi_step_reasoning':
                # Based on the CoT analysis above
                estimates[capability] = 70e9   # 70B
            elif capability == 'code_generation':
                estimates[capability] = 10e9   # 10B
            elif capability == 'multilingual':
                estimates[capability] = 50e9   # 50B
            else:
                estimates[capability] = 100e9  # conservative default
        
        return estimates
    
    def compute_roi(self, capability, model_size, training_cost):
        """
        Return on investment for a given capability.
        """
        # Simplified benefit model
        benefit_scores = {
            'multi_step_reasoning': 0.9,
            'code_generation': 0.85,
            'multilingual': 0.7,
        }
        
        benefit = benefit_scores.get(capability, 0.5)
        # Express model size in units of 10B so ROI lands on a ~0-1 scale,
        # matching the priority thresholds below.
        roi = benefit / ((model_size / 10e9) * training_cost)
        
        return roi
    
    def recommend_scaling_path(self):
        """
        Recommend a scaling path, ordered by ROI.
        """
        required_sizes = self.estimate_required_model_size()
        recommendations = []
        
        for cap in self.target_capabilities:
            required_size = required_sizes[cap]
            roi = self.compute_roi(cap, required_size, 1.0)  # normalized cost
            
            recommendations.append({
                'capability': cap,
                'required_size': required_size,
                'roi': roi,
                'priority': 'high' if roi > 0.8 else 'medium' if roi > 0.5 else 'low'
            })
        
        # Sort by ROI, highest first
        recommendations.sort(key=lambda x: x['roi'], reverse=True)
        
        return recommendations

6. Measurement Methodology

6.1 An Evaluation Framework for Emergent Abilities

class EmergenceEvaluator:
    """
    Evaluation framework for emergent abilities.
    
    The compute_* / load_model / evaluate_task / interpret_results methods
    are hooks to be implemented for a concrete model and benchmark.
    """
    
    def __init__(self, model, benchmark_dataset):
        self.model = model
        self.dataset = benchmark_dataset
    
    def evaluate_with_multiple_metrics(self, task):
        """
        Evaluate a task with several metrics.
        """
        results = {}
        
        # 1. Exact match
        results['exact_match'] = self.compute_exact_match(task)
        
        # 2. Perplexity
        results['perplexity'] = self.compute_perplexity(task)
        
        # 3. Soft (smoothed) accuracy
        results['soft_accuracy'] = self.compute_soft_accuracy(task)
        
        # 4. Embedding similarity
        results['embedding_similarity'] = self.compute_embedding_similarity(task)
        
        return results
    
    def detect_emergence_across_scales(
        self,
        model_sizes: list,
        task: str
    ) -> dict:
        """
        Detect emergence for a task across model scales.
        """
        performances = []
        
        for size in model_sizes:
            model = self.load_model(size)
            perf = self.evaluate_task(model, task)
            performances.append(perf)
        
        # Analyze under both metric families
        discrete_results = self.statistical_emergence_test(
            model_sizes, performances, 'discrete'
        )
        continuous_results = self.statistical_emergence_test(
            model_sizes, performances, 'continuous'
        )
        
        return {
            'model_sizes': model_sizes,
            'performances': performances,
            'discrete_emergence': discrete_results,
            'continuous_emergence': continuous_results,
            'conclusion': self.interpret_results(discrete_results, continuous_results)
        }

6.2 Benchmark Suites

| Benchmark | Capability | Scale threshold | Link |
| --- | --- | --- | --- |
| BIG-Bench | diverse tasks | 100B+ | https://github.com/google/BIG-Bench |
| MMLU | multi-task understanding | 50B+ | https://github.com/hendrycks/test |
| GSM8K | mathematical reasoning | 100B+ | https://github.com/openai/grade-school-math |
| HumanEval | code generation | 1B+ | https://github.com/openai/human-eval |
| HELM | holistic evaluation | varies | https://crfm.stanford.edu/helm |

7. Recent Research

7.1 Key Findings from 2024-2025

Paper: Emergent Abilities in Large Language Models: A Survey [8]

Main findings

  1. Emergent abilities are more widespread than initially believed
  2. Some emergent abilities can be triggered earlier through targeted interventions
  3. Emergence thresholds can be shifted by the training strategy

Paper: Emergent Abilities in Reasoning Models [9]

Main findings

  1. Reasoning models (o1 and its successors) show emergence at inference time
  2. Reasoning ability does not have to emerge from pretraining scale alone
  3. Chain-of-thought capability can be obtained through inference-time compute
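The third finding can be illustrated with a toy self-consistency model: sample several independent reasoning chains and take a majority vote. Under the simplifying assumption that each chain is independently correct with probability p > 0.5, accuracy climbs toward 1 as inference-time compute grows; `majority_vote_accuracy` is a hypothetical helper for this toy model, not OpenAI's actual method:

```python
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """P(majority of k independent samples is correct), for odd k."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# With per-chain accuracy 0.6, more sampled chains -> higher final accuracy.
for k in (1, 9, 33):
    print(k, round(majority_vote_accuracy(0.6, k), 3))
```

The point of the toy model is that the accuracy curve improves with compute spent at inference, without any change to the underlying model.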

7.2 Practical Implications

Key insights

  1. Capabilities are not linear: even where scaling laws predict smooth improvement, specific abilities can still appear abruptly
  2. Scale is not the only factor: training strategy, data quality, and architecture all shape emergence
  3. Predictability is improving: as observations accumulate, emergence thresholds can be forecast more accurately

8. Summary and Outlook

8.1 Current Consensus

| Claim | Support | Notes |
| --- | --- | --- |
| Emergent abilities exist | - | extensive experimental evidence |
| Emergence depends on the metric | medium-high | discrete metrics show emergence more readily |
| Emergence is predictable | - | depends on task type and model architecture |
| Emergence can be steered | low-medium | training strategy has some influence |

8.2 Open Questions

  1. Why do some abilities emerge while others never do?
  2. Can emergence be triggered earlier through targeted interventions?
  3. What is the mechanistic basis of emergent abilities?
  4. Does "negative emergence" (capability degradation with scale) exist?

8.3 Practical Recommendations

  1. Plan model scale against known emergence thresholds
  2. Evaluate with multiple metrics to avoid being misled by any single one
  3. Watch for ICL: some "emergent" abilities may just be compositions of in-context learning
  4. Monitor continuously: emergence can appear late in training

References

Footnotes

  1. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682. https://arxiv.org/abs/2206.07682

  2. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. https://arxiv.org/abs/2307.15702

  3. Mathematical Capabilities of Large Language Models. (2024). arXiv:2401.12368. https://arxiv.org/abs/2401.12368

  4. Havlik, V. (2025). Why are LLMs’ Abilities Emergent? A Complex Systems Perspective. arXiv:2508.04401. https://arxiv.org/abs/2508.04401

  5. Lu, S., et al. (2024). Are Emergent Abilities in Large Language Models just In-Context Learning? ACL 2024. https://aclanthology.org/2024.acl-long.279/

  6. Zhou, P., et al. (2024). Emergent Abilities of Synthetic Cartography. arXiv:2410.10893. https://arxiv.org/abs/2410.10893

  7. Krakauer, D. C., Krakauer, J. W., & Mitchell, M. (2025). Large Language Models and Emergence: A Complex Systems Perspective. arXiv:2506.11135. https://arxiv.org/html/2506.11135

  8. Berti, L., et al. (2025). Emergent Abilities in Large Language Models: A Survey. arXiv:2503.05788. https://arxiv.org/abs/2503.05788

  9. OpenAI. (2024). Learning to Reason with LLMs. OpenAI Blog. https://openai.com/index/learning-to-reason-with-llms/