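Overview

LLM inference strategies are techniques applied at inference time to improve the reasoning ability of a large language model. Rather than relying on training alone, they raise capability after deployment, either by spending more compute at inference time or by using better inference algorithms, and the gains can be substantial. [1]

Core question: without retraining the model, how can inference strategies make it perform better on complex reasoning tasks?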

Taxonomy of inference strategies

Inference strategies
├── Sampling strategies
│   ├── Best-of-N sampling
│   ├── Majority voting
│   └── Temperature-controlled sampling
│
├── Search strategies
│   ├── Tree search
│   ├── Beam search
│   └── Monte Carlo tree search (MCTS)
│
├── Scaling strategies
│   ├── Sequential scaling
│   ├── Parallel scaling
│   └── Hybrid scaling
│
├── Acceleration strategies
│   ├── Speculative decoding
│   └── Batching optimizations
│
└── Evolution of chain-of-thought reasoning
    ├── Standard CoT
    ├── Fractured CoT
    ├── Elastic Reasoning
    └── Think Deep, Think Fast

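Best-of-N and Majority Voting

Basic concepts

Best-of-N and majority voting are the two most basic inference strategies. Both sample the model several times; Best-of-N then keeps the answer a verifier scores highest, while majority voting keeps the answer that appears most often.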

from collections import Counter


class BestOfNSampler:
    """
    Best-of-N sampler.
    Generates N candidate answers and uses a verifier to pick the best one.
    """
    def __init__(self, model, verifier, n: int = 16):
        self.model = model
        self.verifier = verifier
        self.n = n
    
    def forward(self, question: str) -> str:
        candidates = []
        scores = []
        
        for _ in range(self.n):
            # Sample repeatedly to obtain different answers
            response = self.model.generate(question, temperature=0.8)
            candidates.append(response)
            
            # Score the candidate with the verifier
            score = self.verifier.score(question, response)
            scores.append(score)
        
        # Return the highest-scoring answer
        best_idx = max(range(len(scores)), key=lambda i: scores[i])
        return candidates[best_idx]
 
 
class MajorityVoter:
    """
    Majority voter.
    Generates several answers and returns the most frequent one.
    """
    def __init__(self, model, n: int = 40):
        self.model = model
        self.n = n
    
    def forward(self, question: str) -> str:
        answers = []
        
        for _ in range(self.n):
            response = self.model.generate(question, temperature=0.7)
            # extract_final_answer is assumed to be defined elsewhere
            answer = extract_final_answer(response)
            answers.append(answer)
        
        # Majority vote
        return Counter(answers).most_common(1)[0][0]
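As a minimal, self-contained sketch of the two selection rules above, the snippet below applies them to a fixed list of already-sampled answers. The answer list and the toy scoring function are hypothetical stand-ins for real model samples and a real verifier.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer (ties go to the one seen first)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates, score):
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=score)


# Hypothetical final answers extracted from six samples of one question
samples = ["41", "42", "41", "24", "41", "42"]


def toy_score(answer):
    """Hypothetical verifier: prefers answers numerically close to 42."""
    return -abs(int(answer) - 42)


voted = majority_vote(samples)           # most common answer
picked = best_of_n(samples, toy_score)   # highest-scoring answer
```

On this toy input the two rules disagree: voting returns the most common answer, while Best-of-N returns whatever the verifier scores highest, so its quality depends on how reliable the verifier is.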

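Inference-Aware Fine-Tuning

Plain Best-of-N has a weakness: the model is trained without regard to the exploration-exploitation trade-off it will face at inference time. Inference-aware fine-tuning addresses this by simulating the inference procedure during training. [2]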

class InferenceAwareTraining:
    """
    Inference-aware fine-tuning.
    Takes the inference-time sampling strategy into account during training.
    """
    def __init__(self, model, n_samples: int = 16, beta: float = 0.04):
        self.model = model
        self.n_samples = n_samples
        self.beta = beta  # weight of the KL-divergence term
    
    def train_step(self, batch):
        """
        One inference-aware training step.
        """
        # 1. Generate N samples
        candidates = []
        for _ in range(self.n_samples):
            response = self.model.generate(batch.prompt, temperature=0.8)
            candidates.append(response)
        
        # 2. Compute a reward for every sample
        rewards = []
        for candidate in candidates:
            reward = self.compute_reward(batch.prompt, candidate, batch.answer)
            rewards.append(reward)
        
        # 3. Take the logits of the best sample as the supervision target
        #    (note: best_logits is not used by the loss computed below)
        best_idx = max(range(len(rewards)), key=lambda i: rewards[i])
        best_candidate = candidates[best_idx]
        best_logits = self.model.get_logits(best_candidate)
        
        # 4. Compute the loss: answer correctness plus reasoning diversity
        policy_loss = self.compute_policy_loss(batch, candidates, rewards)
        kl_loss = self.compute_kl_divergence(batch)
        
        total_loss = policy_loss + self.beta * kl_loss
        total_loss.backward()
        
        return total_loss.item()

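Core idea: during training the model implicitly learns how much to "explore" (sample diverse answers) and how much to "exploit" (concentrate on its best answer) at inference time.

Experimental results

A comparison on the MATH dataset indicates that inference-aware fine-tuning improves Best-of-N from 28.2% to 30.8%:

| Method | MATH accuracy |
|---|---|
| Standard sampling | 26.8% |
| Best-of-N (N=16) | 28.2% |
| Inference-aware + Best-of-N | 30.8% |

This suggests that a suitable training strategy lets the model adapt better to the sampling strategy used at inference time.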

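Tree Search and Process Reward Models

Beam Search

Beam search keeps the K most likely candidate paths at every step (K is the beam width), instead of committing to a single path as greedy decoding does: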

import heapq
import math


class BeamSearchDecoder:
    """
    Beam search decoder.
    Maintains the K best paths while reasoning.
    """
    def __init__(self, model, beam_width: int = 4, max_depth: int = 20):
        self.model = model
        self.beam_width = beam_width
        self.max_depth = max_depth
    
    def decode(self, question: str) -> str:
        # Initialization: each beam is a (content, log_prob) tuple
        beams = [(question, 0.0)]
        
        for step in range(self.max_depth):
            # Candidates for the next step
            candidates = []
            
            for content, score in beams:
                # Propose next tokens as (token, probability) pairs
                next_tokens = self.model.generate_next_tokens(content)
                
                for token, token_prob in next_tokens:
                    new_content = content + token
                    new_score = score + math.log(token_prob)
                    candidates.append((new_content, new_score))
            
            # Keep the top K candidates
            beams = heapq.nlargest(self.beam_width, candidates, key=lambda x: x[1])
            
            # Stop as soon as a beam is complete (beams are sorted best-first)
            for content, score in beams:
                if self.is_complete(content):
                    return self.extract_answer(content)
        
        # Otherwise return the answer of the best beam
        return self.extract_answer(max(beams, key=lambda x: x[1])[0])
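To make the difference from greedy decoding concrete, here is a small runnable sketch of the same top-K loop on a toy problem. The `NEXT` table is a hypothetical hand-written "model" that maps the last token to next-token probabilities; it is an assumption for illustration, not part of any real API.

```python
import heapq
import math

# Hypothetical toy model: next-token probabilities keyed by the last token
NEXT = {
    "<s>": [("a", 0.6), ("b", 0.4)],
    "a": [("x", 0.5), ("y", 0.5)],
    "b": [("x", 0.9), ("y", 0.1)],
}


def beam_search(beam_width: int, depth: int = 2):
    """Return the best token sequence found with the given beam width."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(depth):
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT.get(tokens[-1], []):
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        # nlargest returns the candidates sorted best-first
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams[0][0][1:]  # drop the start token


greedy = beam_search(beam_width=1)  # commits to "a" at the first step
wide = beam_search(beam_width=2)    # keeps "b" alive and finds "b x"
```

With a width of 1 the search commits to "a" (probability 0.6) and can reach at most 0.6 × 0.5 = 0.30, whereas a width of 2 keeps "b" and finds "b x" with probability 0.4 × 0.9 = 0.36.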

Monte Carlo Tree Search (MCTS)

MCTS is a more powerful search algorithm that estimates the value of each action by simulating many "rollouts":

import math


class MCTSNode:
    """MCTS tree node."""
    def __init__(self, state: str, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action
        self.children = {}
        self.visits = 0
        self.value = 0.0
 
class MCTSDecoder:
    """
    Monte Carlo tree search inference.
    is_terminal and extract_answer are assumed to be defined elsewhere.
    """
    def __init__(self, model, prm: "ProcessRewardModel",
                 n_simulations: int = 64, c_puct: float = 1.4,
                 n_expansions: int = 4, max_rollout_steps: int = 32):
        self.model = model
        self.prm = prm
        self.n_simulations = n_simulations
        self.c_puct = c_puct
        # The two defaults below are assumed values; the original left them undefined
        self.n_expansions = n_expansions            # children added per expansion
        self.max_rollout_steps = max_rollout_steps  # cap on rollout length
    
    def search(self, question: str) -> str:
        root = MCTSNode(state=question)
        
        for _ in range(self.n_simulations):
            # 1. Selection
            node = self._select(root)
            
            # 2. Expansion
            if not self.is_terminal(node.state):
                node = self._expand(node, question)
            
            # 3. Simulation
            value = self._simulate(node, question)
            
            # 4. Backpropagation
            self._backpropagate(node, value)
        
        # Return the answer of the most-visited child of the root
        best_child = max(root.children.values(), key=lambda n: n.visits)
        return self.extract_answer(best_child.state)
    
    def _select(self, node: MCTSNode) -> MCTSNode:
        """Descend to a leaf, picking the child with the best UCB1 score."""
        while node.children:
            node = max(node.children.values(),
                      key=lambda n: self._ucb_score(n, node))
        return node
    
    def _ucb_score(self, child: MCTSNode, parent: MCTSNode) -> float:
        """UCB1 formula."""
        exploitation = child.value / max(child.visits, 1)
        exploration = self.c_puct * math.sqrt(
            math.log(max(parent.visits, 1)) / max(child.visits, 1)
        )
        return exploitation + exploration
    
    def _expand(self, node: MCTSNode, question: str) -> MCTSNode:
        """Add child nodes and return the first one."""
        # generate_next_tokens yields (token, probability) pairs
        next_tokens = self.model.generate_next_tokens(node.state)
        
        for token, _prob in next_tokens[:self.n_expansions]:
            child_state = node.state + token
            child = MCTSNode(state=child_state, parent=node, action=token)
            node.children[token] = child
        
        return list(node.children.values())[0]
    
    def _simulate(self, node: MCTSNode, question: str) -> float:
        """Greedy rollout from the node, scored by the PRM at the end."""
        state = node.state
        
        for _ in range(self.max_rollout_steps):
            # Check the rollout state, not the starting node
            if self.is_terminal(state):
                break
            
            # Continue generating with the most likely token
            next_tokens = self.model.generate_next_tokens(state)
            best_token = max(next_tokens, key=lambda x: x[1])[0]
            state = state + best_token
        
        # Return the final reward
        return self.prm.score(question, state)
    
    def _backpropagate(self, node: MCTSNode, value: float):
        """Propagate the rollout value back up to the root."""
        while node:
            node.visits += 1
            node.value += value
            node = node.parent
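Written out, the selection score computed by `_ucb_score` above is the following, where $W_c$ is the child's accumulated value, $n_c$ its visit count, and $N_p$ the parent's visit count. These symbol names are introduced here for illustration only. The textbook UCB1 rule uses an exploration constant of $\sqrt{2} \approx 1.41$, which is close to the default `c_puct = 1.4`.

```latex
\mathrm{score}(c) \;=\; \frac{W_c}{\max(n_c, 1)}
  \;+\; c_{\mathrm{puct}} \sqrt{\frac{\ln N_p}{\max(n_c, 1)}}
```

The first term favors children whose rollouts have scored well so far; the second shrinks as a child is visited more often, so rarely visited children keep being tried.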

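Particle Filtering for Inference Scaling

Particle filtering is a sequential Monte Carlo method that maintains a set of weighted particles to represent the posterior distribution over reasoning trajectories: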

import numpy as np


class ParticleFilterInference:
    """
    Particle filtering inference.
    Scales inference by maintaining weighted particles.
    """
    def __init__(self, model, prm: "ProcessRewardModel",
                 n_particles: int = 32, resampling_threshold: float = 0.5,
                 max_steps: int = 64):
        self.model = model
        self.prm = prm
        self.n_particles = n_particles
        self.resampling_threshold = resampling_threshold
        self.max_steps = max_steps  # assumed default; undefined in the original
    
    def infer(self, question: str) -> str:
        # Initialize the particles with uniform weights
        particles = [
            {"state": question, "weight": 1.0 / self.n_particles}
            for _ in range(self.n_particles)
        ]
        
        for step in range(self.max_steps):
            # 1. Importance sampling: extend every particle by one step
            for p in particles:
                next_tokens = self.model.generate_next_tokens(p["state"])
                # Sample the next token according to its probability
                token = self._sample_token(next_tokens)
                p["state"] = p["state"] + token
            
            # 2. Update the weights with the PRM score
            for p in particles:
                step_reward = self.prm.score(question, p["state"])
                p["weight"] *= step_reward
            
            # 3. Normalize the weights
            total_weight = sum(p["weight"] for p in particles)
            for p in particles:
                p["weight"] /= total_weight
            
            # 4. Resample when the effective sample size drops too low
            effective_n = 1.0 / sum(p["weight"]**2 for p in particles)
            if effective_n < self.n_particles * self.resampling_threshold:
                particles = self._resample(particles)
        
        # Return the answer of the highest-weight particle
        best_particle = max(particles, key=lambda p: p["weight"])
        return self.extract_answer(best_particle["state"])
    
    def _resample(self, particles):
        """Multinomial resampling: draw indices in proportion to weight."""
        weights = [p["weight"] for p in particles]
        indices = np.random.choice(
            self.n_particles, size=self.n_particles, p=weights
        )
        new_particles = [
            {"state": particles[i]["state"], "weight": 1.0 / self.n_particles}
            for i in indices
        ]
        return new_particles
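The resampling step above draws indices independently in proportion to weight. A common lower-variance alternative is systematic resampling, which uses a single random offset and evenly spaced pointers. The sketch below shows it next to the effective-sample-size test; the function names are illustrative, and it works on plain weight lists rather than on the particle dictionaries used above.

```python
import random


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for weights that sum to 1."""
    return 1.0 / sum(w * w for w in weights)


def systematic_resample(weights, rng):
    """Return particle indices chosen by systematic resampling."""
    n = len(weights)
    start = rng.random() / n  # one random offset in [0, 1/n)
    indices = []
    i = 0
    cumulative = weights[0]
    for k in range(n):
        pointer = start + k / n  # evenly spaced pointers
        # Advance to the particle whose cumulative weight covers the pointer
        while pointer > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


weights = [0.5, 0.5, 0.0, 0.0]
ess = effective_sample_size(weights)  # 2.0: only two particles carry weight
chosen = systematic_resample(weights, random.Random(0))
```

Particles with zero weight are never selected, and each surviving particle is copied a number of times close to `n * weight`, which is what keeps the variance lower than with independent draws.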

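Inference scaling efficiency comparison

Experiments indicate that particle-filtering methods scale inference more efficiently than deterministic search:

| Method | Scaling factor | Relative efficiency |
|---|---|---|
| Greedy decoding | | baseline |
| Beam search (K=4) | | 1.2× |
| MCTS (N=16) | | 2.1× |
| Particle filtering (N=16) | | 4.1× |
| MCTS (N=64) | 16× | 5.3× |
| Particle filtering (N=64) | 16× | 8.4× |

Key finding: in this comparison particle filtering reaches a relative efficiency of 4.1× at N=16 and 8.4× at N=64, against 2.1× and 5.3× for MCTS at the same N, a clear advantage over deterministic search.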

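Qwen-7B with inference strategies

With a process reward model and tree search, even a small model can approach the results of top reasoning models:

| Configuration | MATH accuracy | Relative compute |
|---|---|---|
| Qwen-7B baseline | 42.5% | |
| Qwen-7B + Best-of-N (N=16) | 52.3% | 16× |
| Qwen-7B + MCTS (32 rollouts) | 58.1% | 32× |
| Qwen-7B + particle filtering (32 rollouts) | 64.7% | 32× |
| o1-preview | 66.2% | - |
| o3-mini | 68.8% | - |

With a suitable inference strategy, Qwen-7B at 32 rollouts comes within 1.5 points of o1-preview (64.7% vs 66.2%).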

Sequential Scaling vs Parallel Scaling

Core findings of the "Let Me Think!" paper

The "Let Me Think!" paper systematically studies how sequential scaling and parallel scaling differ in performance across tasks. [3]

from collections import Counter


class ScalingExperiment:
    """
    Scaling experiment framework.
    extract_answer is assumed to be defined elsewhere.
    """
    def __init__(self, model):
        self.model = model
    
    def sequential_scaling(self, question: str, n_steps: int) -> str:
        """
        Sequential scaling: increase the number of thinking steps in one run.
        """
        state = question
        for _ in range(n_steps):
            response = self.model.generate(state, temperature=0.0)
            state = state + response
        return self.extract_answer(state)
    
    def parallel_scaling(self, question: str, n_samples: int) -> str:
        """
        Parallel scaling: sample several independent reasoning trajectories.
        """
        answers = []
        for _ in range(n_samples):
            response = self.model.generate(question, temperature=0.8)
            answers.append(self.extract_answer(response))
        
        # Majority vote
        return Counter(answers).most_common(1)[0][0]

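Exponential advantage on graph problems

For certain kinds of graph problems (mazes, shortest paths, and so on), sequential scaling has an exponential advantage:

Problem complexity
    ▲
    │                          ╭─ Sequential scaling (exponential advantage)
    │                        ╱
    │                      ╱
    │                    ╱
    │                  ╱ ─ ─ ─ Parallel scaling
    │                ╱
    │              ╱
    │──────────────▶ Problem size
    │

Theoretical analysis, in terms of the problem's branching factor and depth:

  • Parallel scaling (N samples): the nodes covered grow with the number of samples, but every sample must still carry out reasoning over the full depth on its own
  • Sequential scaling (one sample, greater depth): a single longer trajectory can cover many nodes in turn

For graph problems that require "backtracking" and "planning", sequential scaling explores the solution space more effectively.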

The A*-Decoding algorithm

A*-Decoding combines heuristic search with LLM decoding and can substantially reduce token consumption while keeping the same accuracy. [4]

import heapq


class AStarDecoder:
    """
    A* decoder.
    Uses a heuristic function to guide the search and avoid unnecessary reasoning.
    is_terminal, evaluate and extract_answer are assumed to be defined elsewhere.
    """
    def __init__(self, model, heuristic_fn, beam_width: int = 4):
        self.model = model
        self.heuristic_fn = heuristic_fn  # heuristic function
        self.beam_width = beam_width      # assumed default; undefined in the original
    
    def decode(self, question: str, target_score: float) -> str:
        # Priority queue of (f_score, g_score, state, path)
        # f_score = g_score + h_score
        # g_score: tokens consumed so far
        # h_score: heuristic estimate of the distance to the goal
        heap = [(0, 0, question, [])]
        best_answer = None
        best_answer_score = float('-inf')
        
        while heap:
            f, g, state, path = heapq.heappop(heap)
            
            # Check whether this answer is already good enough
            if self.is_terminal(state):
                score = self.evaluate(state)
                if score >= target_score:
                    return self.extract_answer(state)
                if score > best_answer_score:
                    best_answer = state
                    best_answer_score = score
                continue
            
            # Expand the node
            next_tokens = self.model.generate_next_tokens(state)
            
            for token, prob in next_tokens[:self.beam_width]:
                new_state = state + token
                new_g = g + 1  # token count
                
                # Heuristic estimate
                h = self.heuristic_fn(new_state, question)
                new_f = new_g + h
                
                heapq.heappush(heap, (new_f, new_g, new_state, path + [token]))
        
        return self.extract_answer(best_answer) if best_answer else None

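Efficiency comparison: A*-Decoding vs Best-of-N

| Method | Accuracy | Average tokens | Efficiency ratio |
|---|---|---|---|
| Best-of-N (N=16) | 85.2% | 2048 | 1.0× |
| A*-Decoding | 86.1% | 682 | 3.0× |

Key finding: at comparable accuracy, A*-Decoding uses about one third of the tokens (682 vs 2048), which could make inference roughly 3× faster to the extent that latency scales with the number of generated tokens.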
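The efficiency ratio follows directly from the two average token counts in the table; the short check below spells out the arithmetic. The variable names are illustrative.

```python
# Average tokens per problem, taken from the table above
best_of_n_tokens = 2048
astar_tokens = 682

ratio = best_of_n_tokens / astar_tokens      # how many times fewer tokens
saved = 1 - astar_tokens / best_of_n_tokens  # fraction of tokens saved

summary = f"{ratio:.2f}x fewer tokens, {saved:.1%} saved"
print(summary)  # prints "3.00x fewer tokens, 66.7% saved"
```

A 3× reduction therefore means about two thirds of the tokens are saved, not that consumption drops by 300%.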

Hybrid strategies

The best strategy is often a combination of sequential and parallel scaling:

class HybridScaling:
    """
    Hybrid scaling strategy.
    Chooses the scaling mode adaptively from the problem difficulty.
    Helper methods (estimate_difficulty, parallel_scaling, ...) are assumed
    to be defined elsewhere.
    """
    def __init__(self, model, prm: "ProcessRewardModel"):
        self.model = model
        self.prm = prm
    
    def solve(self, question: str, budget: int) -> str:
        # 1. Estimate the problem difficulty
        difficulty = self.estimate_difficulty(question)
        
        if difficulty == "low":
            # Easy problem: parallel scaling is enough
            return self.parallel_scaling(question, n=budget)
        
        elif difficulty == "medium":
            # Medium problem: light sequential scaling plus parallel
            return self.hybrid_medium(question, budget)
        
        else:  # high
            # Hard problem: deep sequential scaling plus tree search
            return self.tree_search(question, n_rollouts=budget)
    
    def hybrid_medium(self, question: str, budget: int):
        # Sequential scaling produces an initial solution
        solution = self.sequential_scaling(question, n_steps=16)
        
        # Iteratively verify and repair the latest candidate
        candidates = [solution]
        for _ in range(budget // 8):
            verified = self.verify_and_fix(candidates[-1], question)
            candidates.append(verified)
        
        return self.select_best(candidates)

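Advances in Speculative Decoding

Limitations of traditional speculative decoding

Traditional speculative decoding uses a small "draft" model to speed up inference of a large "verifier" model. This approach has two main problems:

  1. Weak draft model: a small model produces low-quality candidates, so many drafted tokens are rejected
  2. Sequential processing: draft generation and verification must run one after the other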
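The rejections mentioned in the first point come from the accept/reject test applied to each drafted token. As a minimal sketch of the standard speculative-sampling rule (not of any method described later in this section), a drafted token is accepted with probability min(1, p/q), where p and q are its probabilities under the verifier and the draft model; on rejection a replacement is drawn from the renormalized residual max(0, p - q). The function name and the dictionary-based distributions are assumptions made for illustration.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Accept or replace one drafted token.

    p_target / q_draft map token -> probability under the verifier / draft model.
    Returns (accepted, emitted_token).
    """
    accept_prob = min(1.0, p_target.get(token, 0.0) / q_draft[token])
    if rng.random() < accept_prob:
        return True, token

    # Rejected: sample from the residual distribution max(0, p - q), renormalized
    residual = {t: max(0.0, p - q_draft.get(t, 0.0)) for t, p in p_target.items()}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights)[0]


rng = random.Random(0)

# Draft and verifier agree exactly: the drafted token is always accepted
same = {"a": 0.5, "b": 0.5}
agree = verify_draft_token("a", same, same, rng)

# Verifier gives the drafted token zero probability: it is always replaced
disagree = verify_draft_token("b", {"a": 1.0}, {"a": 0.5, "b": 0.5}, rng)
```

The further the draft distribution is from the verifier's, the lower the acceptance probability, which is why a weak draft model wastes much of its work.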

Saguaro: parallelized speculative decoding

Saguaro is a newer speculative decoding framework that parallelizes draft generation and verification:

from typing import List


class SaguaroDecoder:
    """
    Saguaro: parallelized speculative decoding.
    Generates several draft sequences and verifies them in one batch.
    """
    def __init__(self, draft_model, verifier_model, n_specs: int = 4):
        self.draft = draft_model
        self.verifier = verifier_model
        self.n_specs = n_specs
    
    def generate(self, prompt: str) -> str:
        # Stage 1: the draft model produces several candidates
        # (written as a loop here; a real implementation would batch this)
        draft_tokens = []
        
        for _ in range(self.n_specs):
            # Each draft is generated independently
            draft = self._generate_draft(prompt, max_len=8)
            draft_tokens.append(draft)
        
        # Stage 2: verify all drafts in one batch
        verified_tokens = self._parallel_verify(prompt, draft_tokens)
        
        # Stage 3: append the verified tokens
        return prompt + "".join(verified_tokens)
    
    def _generate_draft(self, prompt: str, max_len: int) -> str:
        """Generate a single draft."""
        state = prompt
        for _ in range(max_len):
            token = self.draft.sample_next(state)
            if token == self.draft.eos_token:
                break
            state += token
        return state[len(prompt):]
    
    def _parallel_verify(self, prompt: str, drafts: List[str]) -> List[str]:
        """Verify several drafts in parallel."""
        # Build the batched verification input
        batch_inputs = [prompt + draft for draft in drafts]
        
        # Batched verification (parallel on the GPU)
        batch_logits = self.verifier.forward_batch(batch_inputs)
        
        # Drafts are alternative continuations of the same prompt, so keep
        # only the longest verified prefix instead of concatenating them all
        best: List[str] = []
        for draft, logits in zip(drafts, batch_logits):
            tokens = self._extract_verified_tokens(draft, logits)
            if len(tokens) > len(best):
                best = tokens
        
        return best

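Performance: Saguaro is reported to achieve roughly a 5× speedup on several benchmarks.

SemanticSpec: semantics-aware verification

SemanticSpec verifies drafts at the semantic level using hidden states, which is more efficient and more accurate than traditional token-level verification: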

import torch
import torch.nn as nn


class SemanticSpecVerifier:
    """
    SemanticSpec: semantics-aware verifier.
    Judges semantic consistency from hidden states.
    """
    def __init__(self, model, hidden_dim: int):
        self.model = model
        self.hidden_dim = hidden_dim
        self.semantic_proj = nn.Linear(hidden_dim, hidden_dim)
        self.semantic_head = nn.Linear(hidden_dim, 1)
    
    def verify(self, prompt: str, draft_tokens: str, 
               target_len: int) -> tuple[bool, float]:
        """
        Semantic-level verification.
        Returns: (accepted, confidence)
        """
        # Hidden states of the full draft and of the target-length prefix
        draft_hidden = self.model.get_hidden_states(prompt + draft_tokens)
        target_hidden = self.model.get_hidden_states(
            prompt + draft_tokens[:target_len]
        )
        
        # Semantic consistency score from the last hidden state
        draft_semantic = self.semantic_proj(draft_hidden[-1])
        target_semantic = target_hidden[-1]
        
        similarity = torch.cosine_similarity(
            draft_semantic.unsqueeze(0),
            target_semantic.unsqueeze(0)
        ).item()
        
        confidence = torch.sigmoid(self.semantic_head(draft_hidden[-1])).item()
        
        # Accept when the semantics agree and the confidence is high
        accept = (similarity > 0.9) and (confidence > 0.8)
        
        return accept, confidence

Calibrated Speculative Decoding (CSD)

Calibrated Speculative Decoding (CSD) raises the acceptance rate by calibrating the confidence of the draft model:

import numpy as np
import torch


class CalibratedSpeculativeDecoder:
    """
    Calibrated speculative decoding.
    Adjusts the acceptance threshold according to verification difficulty.
    """
    def __init__(self, draft_model, verifier_model, calibration_data):
        self.draft = draft_model
        self.verifier = verifier_model
        self.calibrator = self._calibrate(calibration_data)
    
    def _calibrate(self, data):
        """Learn the acceptance threshold from calibration data."""
        # Collect draft/verifier agreement statistics
        calibration_scores = []
        
        for prompt, response in data:
            draft_hidden = self.draft.get_hidden_states(prompt)
            verifier_hidden = self.verifier.get_hidden_states(prompt)
            
            # Distance between hidden states (assumes matching shapes)
            diff = torch.norm(draft_hidden - verifier_hidden, p=2)
            calibration_scores.append(diff.item())
        
        # Learn a threshold that matches acceptance rate to accuracy
        return self._fit_threshold(calibration_scores)
    
    def _fit_threshold(self, scores):
        """Fit the threshold."""
        # Goal: acceptance rate ≈ verification accuracy
        # Use a percentile as the threshold
        return np.percentile(scores, 70)  # accept the closest 70% of drafts
    
    def generate(self, prompt: str) -> str:
        # Generate a draft
        draft = self.draft.generate(prompt, max_len=16)
        
        # Compute the calibration score of the draft
        draft_hidden = self.draft.get_hidden_states(prompt + draft)
        verifier_hidden = self.verifier.get_hidden_states(prompt + draft)
        score = torch.norm(draft_hidden - verifier_hidden, p=2).item()
        
        # Accept or reject according to the calibrated threshold
        if score < self.calibrator:
            return draft  # accept the whole draft
        else:
            # Reject and fall back to the verifier
            return self.verifier.generate(prompt)

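Performance: CSD achieves roughly a 2.33× speedup on several benchmarks while keeping verifier-level accuracy.

Evolution of Chain-of-Thought Reasoning

Standard Chain-of-Thought

Standard CoT improves complex reasoning by having the model generate its reasoning steps explicitly: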

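# Standard CoT example
"""
Problem: Chickens and rabbits share a cage. There are 8 heads and 26 legs in total. How many chickens and how many rabbits are there?
 
Standard reasoning:
Let x be the number of chickens and y the number of rabbits.
x + y = 8        (heads)
2x + 4y = 26     (legs)
 
Solve the equations:
From the first equation: x = 8 - y
Substitute into the second: 2(8 - y) + 4y = 26
16 - 2y + 4y = 26
2y = 10
y = 5
 
x = 8 - 5 = 3
 
Answer: 3 chickens and 5 rabbits.
"""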

Fractured CoT

Fractured CoT builds on an interesting observation: in many cases a truncated reasoning chain (Truncated CoT) reaches nearly the same accuracy as the full chain:

from typing import List


class FracturedCoT:
    """
    Fractured CoT: answer from a truncated reasoning chain.
    _reconstruct and _is_key_step are assumed to be defined elsewhere.
    """
    def __init__(self, model, truncation_ratio: float = 0.6):
        self.model = model
        self.truncation_ratio = truncation_ratio
    
    def solve(self, problem: str) -> str:
        # Generate the full reasoning
        full_reasoning = self.model.generate(problem)
        
        # Find the key nodes of the reasoning
        key_nodes = self._find_key_nodes(full_reasoning)
        
        # Keep a fraction of the nodes given by the truncation ratio
        n_keep = int(len(key_nodes) * self.truncation_ratio)
        truncated_reasoning = self._reconstruct(key_nodes[:n_keep])
        
        # Generate the answer
        answer = self.model.generate(
            problem + "\n" + truncated_reasoning,
            max_tokens=50
        )
        
        return answer
    
    def _find_key_nodes(self, reasoning: str) -> List[str]:
        """Identify the key nodes in the reasoning chain."""
        # Use some heuristic to recognize key steps
        nodes = []
        for step in reasoning.split("\n"):
            if self._is_key_step(step):
                nodes.append(step)
        return nodes

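Experimental findings

| Dataset | Full CoT accuracy | Fractured CoT (60%) accuracy | Compute saved |
|---|---|---|---|
| GSM8K | 94.1% | 93.8% | 40% |
| MATH | 68.2% | 67.9% | 40% |
| ARC-Challenge | 86.5% | 85.2% | 40% |

Key insight: reasoning traces contain many "redundant" steps that contribute little to the final answer; dropping 40% of the chain costs between 0.3 and 1.3 points of accuracy here.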

Elastic Reasoning

Elastic Reasoning manages the "thinking budget" and the "solution budget" separately:

class ElasticReasoning:
    """
    Elastic Reasoning: separate thinking and solution budgets.
    extract_features and classifier are assumed to be defined elsewhere.
    """
    def __init__(self, model, base_think_budget: int = 2048,
                 base_solve_budget: int = 512):
        self.model = model
        self.base_think_budget = base_think_budget
        self.base_solve_budget = base_solve_budget
    
    def solve(self, problem: str, total_budget: int) -> str:
        # Split the total budget dynamically:
        # easy problems get more solution budget,
        # hard problems get more thinking budget
        
        difficulty = self.estimate_difficulty(problem)
        
        if difficulty == "easy":
            think_budget = total_budget * 0.2
            solve_budget = total_budget * 0.8
        elif difficulty == "medium":
            think_budget = total_budget * 0.5
            solve_budget = total_budget * 0.5
        else:  # hard
            think_budget = total_budget * 0.8
            solve_budget = total_budget * 0.2
        
        # Phase 1: deep thinking
        thinking = self.model.generate(
            f"{problem}\nPlease analyze in detail:",
            max_tokens=int(think_budget),
            stop_at_solution=True
        )
        
        # Phase 2: solution generation
        solution = self.model.generate(
            f"{problem}\n{thinking}\nTherefore, the answer is:",
            max_tokens=int(solve_budget)
        )
        
        return solution
    
    def estimate_difficulty(self, problem: str) -> str:
        """Estimate the problem difficulty."""
        # Use features such as lexical complexity and length
        features = self.extract_features(problem)
        return self.classifier.predict(features)

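Core idea: different kinds of problems call for different splits between thinking and answering, and Elastic Reasoning adjusts that split dynamically to reason efficiently.

Think Deep, Think Fast

The "Think Deep, Think Fast" paper studies an interesting property of reasoning models: majority voting is particularly effective for them.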

from collections import Counter


class ThinkDeepThinkFast:
    """
    Deep thinking plus fast sampling.
    An inference strategy designed for reasoning models.
    extract_answer and estimate_confidence are assumed to be defined elsewhere.
    """
    def __init__(self, reasoning_model, n_samples: int = 16):
        self.model = reasoning_model
        self.n_samples = n_samples
    
    def solve(self, problem: str) -> str:
        answers = []
        confidences = []
        
        for _ in range(self.n_samples):
            # The reasoning model produces several solutions
            response = self.model.generate(
                problem,
                temperature=0.7,
                max_thinking_tokens=4096
            )
            
            answer = self.extract_answer(response)
            confidence = self.estimate_confidence(response)
            
            answers.append(answer)
            confidences.append(confidence)
        
        # Method 1: standard majority vote
        standard_vote = Counter(answers).most_common(1)[0][0]
        
        # Method 2: confidence-weighted vote (computed for comparison only)
        weighted_votes = {}
        for ans, conf in zip(answers, confidences):
            weighted_votes[ans] = weighted_votes.get(ans, 0) + conf
        weighted_vote = max(weighted_votes, key=weighted_votes.get)
        
        # Method 3: vote among high-confidence answers only (comparison only)
        high_conf_indices = [i for i, c in enumerate(confidences) if c > 0.8]
        if high_conf_indices:
            high_conf_answers = [answers[i] for i in high_conf_indices]
            high_conf_vote = Counter(high_conf_answers).most_common(1)[0][0]
        else:
            high_conf_vote = standard_vote
        
        return standard_vote  # the standard vote is the most robust for reasoning models

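Experimental results

| Model | Single-sample accuracy | + Majority voting (N=16) | Gain |
|---|---|---|---|
| GPT-4 (standard) | 67.2% | 78.4% | +11.2 pp |
| o1-mini | 71.8% | 85.3% | +13.5 pp |
| o1-preview | 75.6% | 88.1% | +12.5 pp |

Key finding: reasoning models such as the o1 series benefit more from majority voting, gaining 12.5 to 13.5 percentage points against 11.2 for a standard LLM.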
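One way to build intuition for why voting helps at all is an idealized calculation: on a task with only two possible answers, if every sample were independently correct with probability p, the chance that a majority of n samples is correct is a binomial tail. The sketch below computes it; the function name is illustrative, and the independence and two-answer assumptions do not hold for real model samples, which are correlated and usually have many possible answers.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """P(more than half of n independent samples are correct), for odd n."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


single = majority_correct_prob(0.7, 1)   # one sample: just p
three = majority_correct_prob(0.7, 3)    # 3 * 0.7^2 * 0.3 + 0.7^3 = 0.784
fifteen = majority_correct_prob(0.7, 15)
```

Under these assumptions accuracy rises with n whenever p is above 0.5 and falls when p is below 0.5, so voting amplifies a model that is already right more often than not.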

Practical Guide

Strategy selection framework

def select_inference_strategy(task: str, budget: str, 
                               has_verifier: bool = False) -> str:
    """
    Choose an inference strategy from the task type and available resources.
    
    Args:
        task: task type ("math", "code", "logic", "fact")
        budget: resource budget ("low", "medium", "high")
        has_verifier: whether a verifier is available
    
    Returns:
        Name of the recommended strategy
    """
    if budget == "low":
        if task in ["math", "logic"]:
            return "Standard CoT"
        else:
            return "Greedy decoding"
    
    elif budget == "medium":
        if has_verifier:
            return "Best-of-N + verifier"
        elif task == "math":
            return "CoT + majority voting"
        else:
            return "Parallel sampling"
    
    else:  # high
        if has_verifier:
            return "Tree search + PRM"
        else:
            return "Reasoning model (o1/R1)"

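Cost-benefit analysis

| Strategy | Latency increase | Accuracy gain | Suitable for |
|---|---|---|---|
| Standard CoT | 2-3× | 10-20% | All tasks |
| Majority voting (N=16) | 16× | 5-15% | Tasks with a well-defined answer |
| Best-of-N (N=16) | 16× | 10-25% | Tasks with a verifier |
| MCTS (64 rollouts) | 64× | 20-35% | Complex reasoning tasks |
| Reasoning model (o1/R1) | 5-10× | 30-50% | Complex reasoning tasks |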

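Best-practice recommendations

  1. Start simple: try standard CoT first and measure the gain
  2. Evaluate ROI: check whether the accuracy gain is worth the extra latency
  3. Use a verifier: if a reliable verifier is available, prefer Best-of-N
  4. Select adaptively: adjust the strategy dynamically to the problem difficulty
  5. Combine strategies: methods can be stacked, for example CoT + majority voting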
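The ROI check in point 2 can be made concrete with a crude ratio: accuracy gain divided by latency multiple. The sketch below applies it to the cost-benefit table above. Using the midpoint of each range is an assumption made here for illustration, and the metric ignores whether a task actually needs the higher absolute accuracy.

```python
# (latency multiple, accuracy gain in %) as midpoints of the table's ranges
STRATEGIES = {
    "Standard CoT": (2.5, 15.0),
    "Majority voting (N=16)": (16.0, 10.0),
    "Best-of-N (N=16)": (16.0, 17.5),
    "MCTS (64 rollouts)": (64.0, 27.5),
    "Reasoning model (o1/R1)": (7.5, 40.0),
}


def gain_per_latency(latency: float, gain: float) -> float:
    """Accuracy points gained per unit of latency multiple."""
    return gain / latency


# Rank strategies from best to worst ratio
ranked = sorted(
    STRATEGIES.items(),
    key=lambda item: gain_per_latency(*item[1]),
    reverse=True,
)

for name, (latency, gain) in ranked:
    print(f"{name}: {gain_per_latency(latency, gain):.2f} points per 1x latency")
```

By this measure standard CoT comes first and 64-rollout MCTS last, which matches the advice to start simple and reserve heavy search for problems where the extra accuracy justifies the cost.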

Footnotes

  1. OpenAI. “Learning to Reason with LLMs”. 2024.

  2. Inference-Aware Fine-Tuning for Best-of-N Sampling. 2024.

  3. Let Me Think! On the Optimality of Sequential Scaling for Language Model Reasoning. 2024.

  4. A*-Decoding: Token-efficient Reasoning with Learnable Heuristic Search. 2024.