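MatryoshkaThinking: Recursive Test-Time Scaling

Overview

MatryoshkaThinking is a recursive test-time scaling method that exploits a model's intrinsic reasoning ability to reason efficiently at very low compute cost. On the AIME 2025 benchmark it reports state-of-the-art performance (a score of 99.79) while using only 4% of the compute.¹

The name comes from Russian nesting dolls (matryoshka): each doll holds a smaller doll inside, which mirrors the method's recursively nested reasoning structure.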

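Core Idea

Recursive Self-Aggregation

At the core of MatryoshkaThinking is a **recursive self-aggregation** mechanism:

Nested thinking: decompose a complex reasoning problem into nested subproblems
Self-aggregation: recursively fold intermediate reasoning results into the final answer
Adaptive stopping: halt automatically once confidence is high enough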

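class MatryoshkaThinking:
    """
    Recursive test-time scaling reasoner (illustrative sketch).
    """
    def __init__(self, model, max_depth: int = 8, confidence_threshold: float = 0.85):
        # `model` is assumed to expose encode, recursive_think, aggregate,
        # confidence, and decode; these names are illustrative.
        self.model = model
        self.max_depth = max_depth
        self.confidence_threshold = confidence_threshold

    def aggregate(self, state, thought, depth: int):
        # Fold the new thought into the running state
        return self.model.aggregate(state, thought, depth)

    def should_stop(self, state) -> bool:
        # Stop once the nested confidence exceeds the threshold
        return self.model.confidence(state) > self.confidence_threshold

    def think(self, problem: str) -> str:
        state = self.model.encode(problem)

        for depth in range(self.max_depth):
            # Recursive thinking step
            thought = self.model.recursive_think(state, depth)

            # Aggregate the reasoning result
            state = self.aggregate(state, thought, depth)

            # Adaptive stopping check
            if self.should_stop(state):
                break

        return self.model.decode(state)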
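
To make the control flow concrete, here is a minimal, self-contained sketch of the same think, aggregate, stop-check loop, with plain Python callables standing in for the model. The helpers `recursive_loop`, `toy_think`, `toy_aggregate`, and `toy_confidence` are assumptions made up for illustration, not part of MatryoshkaThinking. They only show how adaptive stopping can end the loop before `max_depth` is reached.

```python
from typing import Callable, Tuple


def recursive_loop(
    state: float,
    think: Callable[[float, int], float],
    aggregate: Callable[[float, float], float],
    confidence: Callable[[float], float],
    max_depth: int = 8,
    threshold: float = 0.85,
) -> Tuple[float, int]:
    """Run think -> aggregate -> stop-check; return (final state, steps used)."""
    steps = 0
    for depth in range(max_depth):
        thought = think(state, depth)
        state = aggregate(state, thought)
        steps = depth + 1
        if confidence(state) > threshold:
            break
    return state, steps


# Toy stand-ins (assumptions for illustration only).
def toy_think(state: float, depth: int) -> float:
    return 1.0  # every thought points at the "answer" 1.0


def toy_aggregate(state: float, thought: float) -> float:
    return 0.5 * state + 0.5 * thought  # move halfway toward the thought


def toy_confidence(state: float) -> float:
    return state  # confidence rises as the state approaches 1.0


final_state, steps_used = recursive_loop(0.0, toy_think, toy_aggregate, toy_confidence)
print(final_state, steps_used)
```

With these toy functions the state moves 0.0, 0.5, 0.75, 0.875, so the loop stops after three of the eight allowed steps.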

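Efficiency Analysis

Compute efficiency of MatryoshkaThinking compared with conventional methods:

Method	AIME25 accuracy	Relative compute	Efficiency (accuracy ÷ relative compute)
Standard inference	71.2%	100%	0.71×
Chain-of-Thought	82.5%	250%	0.33×
Multi-sample voting	85.1%	500%	0.17×
MatryoshkaThinking	99.79%	4%	24.9×

The 24.9× efficiency figure (99.79 ÷ 4) is attributed to exploiting the model's intrinsic capabilities in depth rather than simply piling on compute.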
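
The efficiency column can be checked with a few lines of arithmetic. The sketch below assumes the column is accuracy divided by relative compute, which is the reading that reproduces the 24.9 figure. It also prints each ratio normalized to standard inference, which is a different but equally reasonable way to express the comparison.

```python
# Accuracy (%) and relative compute (%) copied from the table above.
rows = {
    "Standard inference": (71.2, 100.0),
    "Chain-of-Thought": (82.5, 250.0),
    "Multi-sample voting": (85.1, 500.0),
    "MatryoshkaThinking": (99.79, 4.0),
}


def efficiency(accuracy: float, compute: float) -> float:
    """Accuracy points obtained per percent of relative compute."""
    return accuracy / compute


ratios = {name: efficiency(acc, cost) for name, (acc, cost) in rows.items()}
baseline = ratios["Standard inference"]

for name, ratio in ratios.items():
    # Raw ratio, then the same ratio relative to standard inference.
    print(f"{name:20s} raw={ratio:6.2f}  vs standard={ratio / baseline:6.2f}x")
```

Under this assumption the raw ratio for MatryoshkaThinking is about 24.9, and about 35 times the ratio of standard inference.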

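Mathematical Framework

Recursive State Evolution

Let $p$ denote the problem representation. The recursive reasoning process is defined as:

$$ s^{(t+1)} = \mathrm{Aggregate}\bigl(s^{(t)},\ \mathrm{Think}(s^{(t)}, p, t)\bigr) $$

where:

$s^{(t)}$: the state representation at step $t$
$\mathrm{Think}(\cdot)$: the thinking function, which produces intermediate reasoning from the current state
$\mathrm{Aggregate}(\cdot)$: the aggregation function, which folds the new reasoning into the state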
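
Unrolling the recurrence for the first two steps may help. Here $s^{(0)}$ denotes the initial state, which is an assumption since the text does not say how the state is initialized (the encoded problem is one natural choice):

```latex
\begin{aligned}
s^{(1)} &= \mathrm{Aggregate}\bigl(s^{(0)},\ \mathrm{Think}(s^{(0)}, p, 0)\bigr) \\
s^{(2)} &= \mathrm{Aggregate}\bigl(s^{(1)},\ \mathrm{Think}(s^{(1)}, p, 1)\bigr)
\end{aligned}
```

Each step reads both the problem $p$ and the previous state, so reaching depth $T$ takes $T$ calls to Think and $T$ calls to Aggregate.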

Nested Confidence

Define the nested confidence as:

$$ C^{(t)} = \sigma\bigl(w_{c}^{\top}\,\mathrm{ReadOut}(s^{(t)})\bigr) $$

Reasoning stops when $C^{(t)} > \tau$, where $\tau$ is the threshold.
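
A minimal sketch of this stopping rule in plain Python follows. It assumes that $\sigma$ is the logistic sigmoid and that ReadOut is mean pooling over positions. Neither choice is specified in the text, and the function names and toy values are illustrative only.

```python
import math
from typing import List, Sequence


def readout(state: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool a (positions x features) state into one vector (assumed ReadOut)."""
    n = len(state)
    return [sum(row[j] for row in state) / n for j in range(len(state[0]))]


def nested_confidence(state: Sequence[Sequence[float]], w_c: Sequence[float]) -> float:
    """C = sigmoid(w_c . ReadOut(state))."""
    pooled = readout(state)
    logit = sum(w * x for w, x in zip(w_c, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


def should_stop(state, w_c, tau: float = 0.85) -> bool:
    """Stop when the nested confidence exceeds the threshold tau."""
    return nested_confidence(state, w_c) > tau


state = [[1.0, 0.0], [3.0, 2.0]]  # toy state: 2 positions, 2 features
w_c = [1.0, 1.0]                  # toy weight vector
print(nested_confidence(state, w_c), should_stop(state, w_c))
```

In this toy case the pooled vector is (2, 1), the logit is 3, and the confidence is roughly 0.95, which exceeds the default threshold of 0.85.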

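Optimal Depth Selection

Theoretical analysis indicates that an optimal recursion depth $T^{*}$ exists:

$$ T^{*} = \arg\max_{T} \bigl\{ \mathrm{Acc}(T) - \lambda \cdot \mathrm{Cost}(T) \bigr\} $$

where $\lambda$ is the weight on compute cost.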
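
The trade-off can be illustrated with a small search over candidate depths. The accuracy and cost curves below are made up for illustration (saturating accuracy, linear cost) and are not measurements from the method. `optimal_depth` is a hypothetical helper name.

```python
from typing import Callable, Iterable


def optimal_depth(
    acc: Callable[[int], float],
    cost: Callable[[int], float],
    lam: float,
    depths: Iterable[int],
) -> int:
    """Return the depth T that maximizes acc(T) - lam * cost(T)."""
    return max(depths, key=lambda t: acc(t) - lam * cost(t))


# Made-up curves for illustration only.
def toy_acc(t: int) -> float:
    return 1.0 - 0.5 ** t  # accuracy saturates toward 1.0


def toy_cost(t: int) -> float:
    return float(t)  # cost grows linearly with depth


best = optimal_depth(toy_acc, toy_cost, lam=0.05, depths=range(1, 9))
print(best)
```

With these toy curves, the search stops deepening once the marginal accuracy gain drops below $\lambda$: a larger $\lambda$ selects a shallower depth.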

Technical Details

Recursive Thinking Module

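import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class RecursiveThink(nn.Module):
    """
    Recursive thinking module: produces nested intermediate reasoning.
    """
    def __init__(self, d_model: int, max_depth: int = 8):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.key_proj = nn.Linear(d_model, d_model)
        self.value_proj = nn.Linear(d_model, d_model)

        # One learned embedding per depth (max_depth is an assumed upper bound)
        self.depth_emb = nn.Embedding(max_depth, d_model)
        # Normalization applied to the attention output
        self.depth_norm = nn.LayerNorm(d_model)

    def forward(self, state: Tensor, problem: Tensor, depth: int) -> Tensor:
        # Condition the state on depth; the (d_model,) embedding broadcasts over it
        depth_idx = torch.tensor(depth, device=state.device)
        state_dep = state + self.depth_emb(depth_idx)

        # Thinking attention: queries from the state, keys and values from the problem
        q = self.query_proj(state_dep)
        k = self.key_proj(problem)
        v = self.value_proj(problem)

        attn = F.scaled_dot_product_attention(q, k, v)
        return self.depth_norm(attn)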

Self-Aggregation Mechanism

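import torch
import torch.nn as nn
from torch import Tensor


class SelfAggregate(nn.Module):
    """
    Nested aggregation: folds a new thought into the existing state.
    """
    def __init__(self, d_model: int, max_depth: int = 8):
        super().__init__()
        # One gate vector per depth (max_depth is an assumed upper bound)
        self.depth_gate = nn.Embedding(max_depth, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, state: Tensor, thought: Tensor, depth: int) -> Tensor:
        # Depth-aware gate with values in (0, 1)
        depth_idx = torch.tensor(depth, device=state.device)
        gate = torch.sigmoid(self.depth_gate(depth_idx))
        # Gated mix of the new thought and the current state
        new_state = gate * thought + (1 - gate) * state
        # Residual connection followed by normalization
        return self.norm(state + new_state)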
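
Substituting the gated mix into the residual line shows what the normalization actually receives. Writing $g$ for the gate, $s$ for the state, and $h$ for the thought (symbols introduced here only for brevity):

```latex
\begin{aligned}
s + \bigl[\, g \odot h + (1 - g) \odot s \,\bigr]
  &= (2 - g) \odot s + g \odot h
\end{aligned}
```

Because $g$ lies in $(0, 1)$, the state enters with a weight between 1 and 2 and the thought with a weight between 0 and 1 before normalization, so the residual path biases the update toward the existing state.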

Experimental Results

AIME Benchmark

Model	AIME 2024	AIME 2025	Average
GPT-4o	9.3%	7.1%	8.2%
o1-preview	52.3%	48.7%	50.5%
DeepSeek-R1	68.2%	61.4%	64.8%
MatryoshkaThinking	98.2%	99.79%	99.0%

MATH Benchmark

Difficulty level	Standard method	MatryoshkaThinking	Gain
Level 1-3	78.3%	94.7%	+16.4 pp
Level 4-5	42.1%	81.3%	+39.2 pp
Level 6-7	18.5%	62.8%	+44.3 pp

Compute-Performance Curve

[Figure: accuracy (%) against relative compute (log scale, 1% to 100%) for Standard, Sampling, Chain-of-Thought, and MatryoshkaThinking.]

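Comparison with Other Methods

vs Latent Reasoning

Property	Latent Reasoning	MatryoshkaThinking
Core mechanism	Recursive latent state	Recursive self-aggregation
Compute scaling	Fixed recurrent block	Adaptive nesting
Stopping condition	Fixed number of steps	Confidence threshold
Compute efficiency	1-10×	24.9×
Training requirements	No specialized data	No specialized data

vs Chain-of-Thought

Property	Chain-of-Thought	MatryoshkaThinking
Reasoning form	Explicit tokens	Implicit vectors
Intermediate steps	Interpretable	Partially interpretable
Compute cost	Linear growth	Superlinear returns
Context requirement	Long context	Short context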

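Practical Guide

Integration Recommendations

Model choice: applicable to any language model with sufficient capacity
Maximum depth: tune max_depth to the difficulty of the task
Confidence threshold: start from $\tau = 0.85$ and adjust based on results
Batching: problems in the same batch can share a reasoning depth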
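
One way to adjust the threshold is to record per-step confidences on a validation set and check how many steps each candidate $\tau$ would spend. The sketch below assumes such traces are available. The trace values are made up for illustration, and `stop_depth` and `mean_depth` are hypothetical helper names.

```python
from typing import List, Sequence


def stop_depth(trace: Sequence[float], tau: float) -> int:
    """Steps run before confidence first exceeds tau (full length if it never does)."""
    for step, conf in enumerate(trace, start=1):
        if conf > tau:
            return step
    return len(trace)


def mean_depth(traces: List[Sequence[float]], tau: float) -> float:
    """Average number of steps spent per problem at threshold tau."""
    return sum(stop_depth(t, tau) for t in traces) / len(traces)


# Made-up per-step confidence traces for three problems (max_depth = 4).
traces = [
    [0.80, 0.90, 0.95, 0.99],
    [0.40, 0.70, 0.88, 0.93],
    [0.30, 0.50, 0.60, 0.80],
]

for tau in (0.75, 0.85, 0.95):
    print(tau, mean_depth(traces, tau))
```

Pairing the average depth with validation accuracy at each threshold gives a simple basis for choosing $\tau$: a higher threshold spends more steps per problem.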

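Best Practices

# Recommended configuration
config = {
    'max_depth': 8,                # maximum recursion depth
    'confidence_threshold': 0.85,  # stopping threshold
    'depth_temperature': 0.1,      # depth-aware temperature
    'aggregation_layers': 3,       # number of aggregation layers
}

# Usage example: pass max_depth explicitly, since MatryoshkaThinking(model, **config)
# would fail on keys the constructor does not accept.
reasoner = MatryoshkaThinking(model, max_depth=config['max_depth'])
answer = reasoner.think("Solve this problem: ...")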

References

¹ Anonymous. (2025). Recursive Test-Time Scaling Enables Efficient Reasoning. arXiv:2510.10293. https://arxiv.org/abs/2510.10293
