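Overview
LLM inference strategies are techniques applied at inference time to improve a large language model's reasoning ability. Rather than relying only on training, they work on an already deployed model, either by spending more compute at inference time or by using a better inference algorithm.[1]
Core question: without retraining the model, how can an inference strategy make it perform better on complex reasoning tasks?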
Taxonomy of inference strategies
Inference strategies
├── Sampling strategies
│   ├── Best-of-N sampling
│   ├── Majority voting
│   └── Temperature-adjusted sampling
│
├── Search strategies
│   ├── Tree search
│   ├── Beam search
│   └── Monte Carlo tree search (MCTS)
│
├── Scaling strategies
│   ├── Sequential scaling
│   ├── Parallel scaling
│   └── Hybrid scaling
│
├── Acceleration strategies
│   ├── Speculative decoding
│   └── Batching optimization
│
└── Evolution of chain-of-thought reasoning
    ├── Standard CoT
    ├── Fractured CoT
    ├── Elastic Reasoning
    └── Think Deep, Think Fast
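Best-of-N and majority voting
Basic concepts
Best-of-N and majority voting are the two most basic inference strategies. Both draw multiple samples: Best-of-N keeps the candidate that a verifier scores highest, while majority voting keeps the final answer that appears most often.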
from collections import Counter


class BestOfNSampler:
    """
    Best-of-N sampler.
    Generates N candidate answers and uses a verifier to pick the best one.
    """

    def __init__(self, model, verifier, n: int = 16):
        self.model = model
        self.verifier = verifier
        self.n = n

    def forward(self, question: str) -> str:
        candidates = []
        scores = []
        for _ in range(self.n):
            # Sample repeatedly to obtain different answers
            response = self.model.generate(question, temperature=0.8)
            candidates.append(response)
            # Score the candidate with the verifier
            score = self.verifier.score(question, response)
            scores.append(score)
        # Return the highest-scoring answer
        best_idx = max(range(len(scores)), key=lambda i: scores[i])
        return candidates[best_idx]


class MajorityVoter:
    """
    Majority voter.
    Generates several answers and returns the most frequent one.
    """

    def __init__(self, model, n: int = 40):
        self.model = model
        self.n = n

    def forward(self, question: str) -> str:
        answers = []
        for _ in range(self.n):
            response = self.model.generate(question, temperature=0.7)
            # extract_final_answer is a helper assumed to be defined elsewhere
            answer = extract_final_answer(response)
            answers.append(answer)
        # Majority vote
        return Counter(answers).most_common(1)[0][0]

Inference-Aware Fine-Tuning
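Conventional Best-of-N has a weakness: the model is trained without accounting for the exploration-exploitation trade-off it faces at inference time. Inference-aware fine-tuning addresses this by simulating the inference procedure during training.[2]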
class InferenceAwareTraining:
    """
    Inference-aware fine-tuning.
    Takes the inference-time sampling strategy into account during training.
    The helpers compute_reward, compute_policy_loss and compute_kl_divergence
    are assumed to be defined elsewhere.
    """

    def __init__(self, model, n_samples: int = 16, beta: float = 0.04):
        self.model = model
        self.n_samples = n_samples
        self.beta = beta  # weight of the KL-divergence term

    def train_step(self, batch):
        """One inference-aware training step."""
        # 1. Generate N samples
        candidates = []
        for _ in range(self.n_samples):
            response = self.model.generate(batch.prompt, temperature=0.8)
            candidates.append(response)
        # 2. Compute a reward for each sample
        rewards = []
        for candidate in candidates:
            reward = self.compute_reward(batch.prompt, candidate, batch.answer)
            rewards.append(reward)
        # 3. Take the logits of the best sample as the supervision target
        best_idx = max(range(len(rewards)), key=lambda i: rewards[i])
        best_candidate = candidates[best_idx]
        best_logits = self.model.get_logits(best_candidate)
        # 4. Compute the loss, accounting for both answer correctness
        #    and reasoning diversity
        policy_loss = self.compute_policy_loss(batch, candidates, rewards)
        kl_loss = self.compute_kl_divergence(batch)
        total_loss = policy_loss + self.beta * kl_loss
        total_loss.backward()
        return total_loss.item()

Core idea: during training, the model implicitly learns how much to explore (sample multiple answers) and how to exploit (select the best answer) at inference time.
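Experimental results
In a comparison on the MATH dataset, inference-aware fine-tuning raises Best-of-N accuracy from 28.2% to 30.8%:
| Method | MATH accuracy |
|---|---|
| Standard sampling | 26.8% |
| Best-of-N (N=16) | 28.2% |
| Inference-aware + Best-of-N | 30.8% |
This suggests that a suitable training strategy lets the model adapt better to the sampling strategy used at inference time.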
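The two selection rules from the start of this section can disagree on the same set of samples. The sketch below is a minimal illustration under stated assumptions: the sampled answers and the verifier scores are made up, and `majority_vote` and `best_of_n` are illustrative names rather than part of any library.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the answer with the highest verifier score."""
    return max(answers, key=score_fn)


# Hypothetical final answers extracted from six sampled responses.
samples = ["42", "41", "42", "40", "41", "41"]

# Hypothetical verifier scores; a real verifier would score full responses.
toy_scores = {"42": 0.9, "41": 0.4, "40": 0.1}

voted = majority_vote(samples)                          # most frequent answer
picked = best_of_n(samples, lambda a: toy_scores[a])    # highest-scoring answer
print(voted, picked)
```

Here majority voting returns the most common answer while Best-of-N follows the verifier, so the quality of the verifier decides which rule is preferable.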
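Tree search and process reward models
Beam search
At each step, beam search keeps the K most probable candidate paths, rather than committing to a single one as greedy decoding does: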
import heapq
import math


class BeamSearchDecoder:
    """
    Beam search decoder.
    Keeps the K best paths during inference.
    The helpers is_complete and extract_answer are assumed to be defined elsewhere.
    """

    def __init__(self, model, beam_width: int = 4, max_depth: int = 20):
        self.model = model
        self.beam_width = beam_width
        self.max_depth = max_depth

    def decode(self, question: str) -> str:
        # Initialization: each beam is a (content, log_prob) tuple
        beams = [(question, 0.0)]
        for step in range(self.max_depth):
            # Candidates for the next step
            candidates = []
            for content, score in beams:
                # Generate next-step candidates as (token, probability) pairs
                next_tokens = self.model.generate_next_tokens(content)
                for token, token_prob in next_tokens:
                    new_content = content + token
                    new_score = score + math.log(token_prob)
                    candidates.append((new_content, new_score))
            # Keep the top K
            beams = heapq.nlargest(self.beam_width, candidates, key=lambda x: x[1])
            # Check for completion
            for content, score in beams:
                if self.is_complete(content):
                    return self.extract_answer(content)
        # Return the answer of the best beam
        return self.extract_answer(max(beams, key=lambda x: x[1])[0])

Monte Carlo tree search (MCTS)
MCTS is a more powerful search algorithm that estimates the value of each action by simulating many rollouts.
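For reference, the selection step scores each child with the standard UCB1 rule. In the notation assumed here, $W_c$ is the value accumulated at child $c$, $n_c$ its visit count, $n_p$ the parent's visit count, and $c_{\text{puct}}$ the exploration constant (the `c_puct` parameter in the code):

```latex
\mathrm{UCB1}(c) = \frac{W_c}{n_c} + c_{\text{puct}} \sqrt{\frac{\ln n_p}{n_c}}
```

The first term favors children that have returned high values so far (exploitation); the second grows for children that have been visited rarely relative to their parent (exploration). The `_ucb_score` method in the code computes this quantity.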
import math


class MCTSNode:
    """MCTS tree node."""

    def __init__(self, state: str, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action
        self.children = {}
        self.visits = 0
        self.value = 0.0


class MCTSDecoder:
    """
    Monte Carlo tree search for inference.
    The helpers is_terminal(state) and extract_answer(state) are assumed to be
    defined elsewhere; the n_expansions and max_rollout_steps defaults are illustrative.
    """

    def __init__(self, model, prm: "ProcessRewardModel",
                 n_simulations: int = 64, c_puct: float = 1.4,
                 n_expansions: int = 4, max_rollout_steps: int = 32):
        self.model = model
        self.prm = prm
        self.n_simulations = n_simulations
        self.c_puct = c_puct
        self.n_expansions = n_expansions
        self.max_rollout_steps = max_rollout_steps

    def search(self, question: str) -> str:
        root = MCTSNode(state=question)
        for _ in range(self.n_simulations):
            # 1. Selection
            node = self._select(root)
            # 2. Expansion
            if not self.is_terminal(node.state):
                node = self._expand(node, question)
            # 3. Simulation
            value = self._simulate(node, question)
            # 4. Backpropagation
            self._backpropagate(node, value)
        # Return the answer of the most-visited child
        best_child = max(root.children.values(), key=lambda n: n.visits)
        return self.extract_answer(best_child.state)

    def _select(self, node: MCTSNode) -> MCTSNode:
        """Descend the tree by UCB1 until reaching a leaf."""
        while node.children:
            node = max(node.children.values(),
                       key=lambda n: self._ucb_score(n, node))
        return node

    def _ucb_score(self, child: MCTSNode, parent: MCTSNode) -> float:
        """UCB1 formula; unvisited children are tried first."""
        if child.visits == 0:
            return float("inf")
        exploitation = child.value / child.visits
        exploration = self.c_puct * math.sqrt(
            math.log(max(parent.visits, 1)) / child.visits
        )
        return exploitation + exploration

    def _expand(self, node: MCTSNode, question: str) -> MCTSNode:
        """Add child nodes for the top candidate tokens."""
        # generate_next_tokens returns (token, probability) pairs
        next_tokens = self.model.generate_next_tokens(node.state)
        for token, _prob in next_tokens[:self.n_expansions]:
            child_state = node.state + token
            child = MCTSNode(state=child_state, parent=node, action=token)
            node.children[token] = child
        return list(node.children.values())[0]

    def _simulate(self, node: MCTSNode, question: str) -> float:
        """Greedy rollout from the node; the PRM scores the final state."""
        state = node.state
        for _ in range(self.max_rollout_steps):
            if self.is_terminal(state):
                break
            # Continue generating with the most probable token
            next_tokens = self.model.generate_next_tokens(state)
            best_token = max(next_tokens, key=lambda x: x[1])[0]
            state = state + best_token
        # Return the final reward
        return self.prm.score(question, state)

    def _backpropagate(self, node: MCTSNode, value: float):
        """Propagate the rollout value back to the root."""
        while node:
            node.visits += 1
            node.value += value
            node = node.parent

Particle Filtering for Inference Scaling
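Particle filtering is a sequential Monte Carlo method that maintains a set of weighted particles to represent the posterior distribution over reasoning trajectories: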
import numpy as np


class ParticleFilterInference:
    """
    Particle-filter inference.
    Scales inference by maintaining weighted particles.
    The helpers _sample_token and extract_answer are assumed to be defined
    elsewhere; the max_steps default is illustrative.
    """

    def __init__(self, model, prm: "ProcessRewardModel",
                 n_particles: int = 32, resampling_threshold: float = 0.5,
                 max_steps: int = 64):
        self.model = model
        self.prm = prm
        self.n_particles = n_particles
        self.resampling_threshold = resampling_threshold
        self.max_steps = max_steps

    def infer(self, question: str) -> str:
        # Initialize the particles
        particles = [
            {"state": question, "weight": 1.0 / self.n_particles}
            for _ in range(self.n_particles)
        ]
        for step in range(self.max_steps):
            # 1. Importance sampling: extend each particle by one step
            for p in particles:
                next_tokens = self.model.generate_next_tokens(p["state"])
                # Sample the next token according to its probability
                token = self._sample_token(next_tokens)
                p["state"] = p["state"] + token
            # 2. Update the weights using the PRM
            for p in particles:
                step_reward = self.prm.score(question, p["state"])
                p["weight"] *= step_reward
            # 3. Normalize the weights (fall back to uniform if all are zero)
            total_weight = sum(p["weight"] for p in particles)
            for p in particles:
                if total_weight > 0:
                    p["weight"] /= total_weight
                else:
                    p["weight"] = 1.0 / self.n_particles
            # 4. Resample when the effective sample size drops too low
            effective_n = 1.0 / sum(p["weight"] ** 2 for p in particles)
            if effective_n < self.n_particles * self.resampling_threshold:
                particles = self._resample(particles)
        # Return the answer of the highest-weight particle
        best_particle = max(particles, key=lambda p: p["weight"])
        return self.extract_answer(best_particle["state"])

    def _resample(self, particles):
        """Multinomial resampling in proportion to the weights."""
        weights = [p["weight"] for p in particles]
        indices = np.random.choice(
            self.n_particles, size=self.n_particles, p=weights
        )
        new_particles = [
            {"state": particles[i]["state"], "weight": 1.0 / self.n_particles}
            for i in indices
        ]
        return new_particles

Inference scaling efficiency comparison
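Experiments indicate that particle-filtering-based methods scale inference compute more efficiently than deterministic search:
| Method | Scaling factor | Relative efficiency |
|---|---|---|
| Greedy decoding | 1× | Baseline |
| Beam search (K=4) | 2× | 1.2× |
| MCTS (N=16) | 4× | 2.1× |
| Particle filtering (N=16) | 4× | 4.1× |
| MCTS (N=64) | 16× | 5.3× |
| Particle filtering (N=64) | 16× | 8.4× |
Key finding: at scaling factors of 4× and 16×, particle filtering reaches relative efficiencies of 4.1× and 8.4×, compared with 2.1× and 5.3× for MCTS at the same N.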
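The resampling check in the particle filter above depends on the effective sample size, ESS = 1 / Σ wᵢ². The following is a minimal sketch of that quantity together with systematic resampling, a common low-variance scheme. The function names are illustrative and the weights are made up; only standard NumPy calls are used.

```python
import numpy as np


def effective_sample_size(weights) -> float:
    """ESS = 1 / sum(w_i^2), computed on normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(1.0 / np.sum(w ** 2))


def systematic_resample(weights, rng: np.random.Generator) -> np.ndarray:
    """Return particle indices chosen by systematic (low-variance) resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    # One random offset, then n evenly spaced positions in [0, 1).
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)


rng = np.random.default_rng(0)
uniform = np.full(4, 0.25)                    # every particle equally useful
skewed = np.array([0.97, 0.01, 0.01, 0.01])   # one particle dominates

ess_uniform = effective_sample_size(uniform)
ess_skewed = effective_sample_size(skewed)
indices = systematic_resample(skewed, rng)
print(ess_uniform, ess_skewed, indices)
```

With uniform weights the ESS equals the particle count, while with one dominant particle it collapses toward 1, which is the situation the threshold test is meant to detect. Systematic resampling uses a single random draw, so the number of copies of each particle deviates less from its expected value than with independent draws.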
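Qwen-7B with inference strategies
With a process reward model guiding tree search, even a small model can approach the results of top reasoning models:
| Configuration | MATH accuracy | Relative compute |
|---|---|---|
| Qwen-7B baseline | 42.5% | 1× |
| Qwen-7B + Best-of-N (N=16) | 52.3% | 16× |
| Qwen-7B + MCTS (32 rollouts) | 58.1% | 32× |
| Qwen-7B + particle filtering (32 rollouts) | 64.7% | 32× |
| o1-preview | 66.2% | - |
| o3-mini | 68.8% | - |
As the table shows, with a suitable inference strategy Qwen-7B reaches 64.7% at 32 rollouts, within 1.5 points of o1-preview (66.2%).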
Sequential scaling vs. parallel scaling
Core findings of the "Let Me Think!" paper
The "Let Me Think!" paper systematically studies how sequential scaling and parallel scaling differ in performance across tasks.[3]
from collections import Counter


class ScalingExperiment:
    """
    Scaling experiment framework.
    The helper extract_answer is assumed to be defined elsewhere.
    """

    def __init__(self, model):
        self.model = model

    def sequential_scaling(self, question: str, n_steps: int) -> str:
        """
        Sequential scaling: increase the number of thinking steps in a single run.
        """
        state = question
        for _ in range(n_steps):
            response = self.model.generate(state, temperature=0.0)
            state = state + response
        return self.extract_answer(state)

    def parallel_scaling(self, question: str, n_samples: int) -> str:
        """
        Parallel scaling: sample several independent reasoning trajectories.
        """
        answers = []
        for _ in range(n_samples):
            response = self.model.generate(question, temperature=0.8)
            answers.append(self.extract_answer(response))
        # Majority vote
        return Counter(answers).most_common(1)[0][0]

Exponential advantage on graph problems
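For certain kinds of graph problems (such as mazes and shortest paths), sequential scaling has an exponential advantage:
Problem complexity
▲
│ ╭─ Sequential scaling (exponential advantage)
│ ╱
│ ╱
│ ╱
│ ╱ ─ ─ ─ Parallel scaling
│ ╱
│ ╱
│──────────────▶ Problem size
│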
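Theoretical analysis: let the problem have branching factor b and depth d:
- Parallel scaling (N samples): covers N nodes, but each node still requires reasoning of depth d
- Sequential scaling (single sample, depth d): can cover b^d nodes
For graph problems that require backtracking and planning, sequential scaling explores the solution space more effectively.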
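A toy calculation can make the gap concrete. The model below is an assumption introduced for illustration, not the analysis from the paper: a corridor of d steps with b doors per step and exactly one correct door at each step. A sequential solver that notices a wrong door immediately and backtracks needs at most b × d checks, whereas an independent one-shot guess is right at every step with probability b^(-d).

```python
def parallel_success_prob(b: int, d: int, n_samples: int) -> float:
    """P(at least one of n independent one-shot guesses is right at all d steps)."""
    p_single = float(b) ** (-d)
    return 1.0 - (1.0 - p_single) ** n_samples


def sequential_worst_case_checks(b: int, d: int) -> int:
    """Worst-case door checks when each wrong choice is detected and undone."""
    return b * d


b, d = 3, 10
checks = sequential_worst_case_checks(b, d)
p_30_samples = parallel_success_prob(b, d, n_samples=30)
print(checks, p_30_samples)
```

Under these assumptions the sequential solver is guaranteed to finish within 30 checks, while 30 independent samples succeed with probability well below 0.1%. Matching the sequential solver would take on the order of b^d samples.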
A*-Decoding algorithm
A*-Decoding combines heuristic search with LLM decoding and can substantially reduce token consumption while maintaining the same accuracy.[4]
import heapq


class AStarDecoder:
    """
    A* decoder.
    Uses a heuristic function to guide the search and avoid unnecessary reasoning.
    The helpers is_terminal, evaluate and extract_answer are assumed to be
    defined elsewhere; the beam_width default is illustrative.
    """

    def __init__(self, model, heuristic_fn, beam_width: int = 4):
        self.model = model
        self.heuristic_fn = heuristic_fn  # heuristic function
        self.beam_width = beam_width

    def decode(self, question: str, target_score: float) -> str:
        # Priority queue of (f_score, g_score, state, path)
        # f_score = g_score + h_score
        # g_score: number of tokens consumed so far
        # h_score: heuristic estimate of the distance to the goal
        heap = [(0, 0, question, [])]
        best_answer = None
        best_answer_score = float('-inf')
        while heap:
            f, g, state, path = heapq.heappop(heap)
            # Check whether the answer is already good enough
            if self.is_terminal(state):
                score = self.evaluate(state)
                if score >= target_score:
                    return self.extract_answer(state)
                if score > best_answer_score:
                    best_answer = state
                    best_answer_score = score
                continue
            # Expand the node
            next_tokens = self.model.generate_next_tokens(state)
            for token, prob in next_tokens[:self.beam_width]:
                new_state = state + token
                new_g = g + 1  # token count
                # Heuristic estimate
                h = self.heuristic_fn(new_state, question)
                new_f = new_g + h
                heapq.heappush(heap, (new_f, new_g, new_state, path + [token]))
        return self.extract_answer(best_answer) if best_answer else None

Efficiency comparison: A*-Decoding vs. Best-of-N
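| Method | Accuracy | Average tokens | Efficiency ratio |
|---|---|---|---|
| Best-of-N (N=16) | 85.2% | 2048 | 1.0× |
| A*-Decoding | 86.1% | 682 | 3.0× |
Key finding: at comparable accuracy, A*-Decoding uses about one third of the tokens (682 vs. 2048), which can translate into roughly 3× faster inference.
Hybrid strategies
The best strategy is often a combination of sequential and parallel scaling: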
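class HybridScaling:
    """
    Hybrid scaling strategy.
    Chooses the scaling mode adaptively based on problem difficulty.
    The helpers estimate_difficulty, parallel_scaling, sequential_scaling,
    tree_search, verify_and_fix and select_best are assumed to be defined elsewhere.
    """

    def __init__(self, model, prm: "ProcessRewardModel"):
        self.model = model
        self.prm = prm

    def solve(self, question: str, budget: int) -> str:
        # 1. Estimate the problem difficulty
        difficulty = self.estimate_difficulty(question)
        if difficulty == "low":
            # Easy problem: parallel scaling is enough
            return self.parallel_scaling(question, n_samples=budget)
        elif difficulty == "medium":
            # Medium problem: light sequential scaling plus parallel scaling
            return self.hybrid_medium(question, budget)
        else:  # high
            # Hard problem: deep sequential scaling plus tree search
            return self.tree_search(question, n_rollouts=budget)

    def hybrid_medium(self, question: str, budget: int):
        # Use sequential scaling to produce an initial solution
        solution = self.sequential_scaling(question, n_steps=16)
        # Repeatedly verify and repair the latest candidate
        candidates = [solution]
        for _ in range(budget // 8):
            verified = self.verify_and_fix(candidates[-1], question)
            candidates.append(verified)
        return self.select_best(candidates)

Recent advances in speculative decoding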
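Limitations of conventional speculative decoding
Conventional speculative decoding uses a small draft model to speed up inference with a large verifier model. This approach has two main problems:
- Weak draft model: the small model produces low-quality candidates, so many of them are rejected
- Sequential processing: draft generation and verification must run one after the other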
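The rejections mentioned in the first limitation come from the accept/reject rule of standard speculative sampling: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the verifier's distribution and q is the draft's; on rejection, a replacement is drawn from the normalized residual max(p − q, 0). The sketch below applies that rule to a single token with made-up distributions; `speculative_accept` is an illustrative name.

```python
import numpy as np


def speculative_accept(p_target, q_draft, token: int, rng: np.random.Generator):
    """One accept/reject step. Returns (emitted_token, accepted)."""
    p = np.asarray(p_target, dtype=float)
    q = np.asarray(q_draft, dtype=float)
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # Otherwise resample from the residual max(p - q, 0), renormalized.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False


rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # hypothetical verifier distribution
q = np.array([0.3, 0.3, 0.4])   # hypothetical draft distribution

counts = np.zeros(3)
n_accepted = 0
for _ in range(20000):
    drafted = int(rng.choice(3, p=q))           # the draft model proposes a token
    emitted, accepted = speculative_accept(p, q, drafted, rng)
    counts[emitted] += 1
    n_accepted += accepted
empirical = counts / counts.sum()
print(empirical, n_accepted / 20000)
```

The emitted tokens follow the verifier's distribution p even though they were proposed by q, which is why the method preserves output quality. The acceptance rate falls as q drifts away from p, so a weak draft model wastes more of its proposals.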
Saguaro: parallelized speculative decoding
Saguaro is a newer speculative decoding framework that parallelizes draft generation and verification:
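from typing import List


class SaguaroDecoder:
    """
    Saguaro: parallelized speculative decoding.
    Generates several draft sequences and verifies them as one batch.
    The helper _extract_verified_tokens is assumed to be defined elsewhere.
    """

    def __init__(self, draft_model, verifier_model, n_specs: int = 4):
        self.draft = draft_model
        self.verifier = verifier_model
        self.n_specs = n_specs

    def generate(self, prompt: str) -> str:
        # Stage 1: the draft model proposes several candidate continuations
        # (written as a loop here; a real system would batch these calls)
        drafts = []
        for _ in range(self.n_specs):
            # Each draft is generated independently
            draft = self._generate_draft(prompt, max_len=8)
            drafts.append(draft)
        # Stage 2: verify all drafts in one batch
        verified = self._parallel_verify(prompt, drafts)
        # Stage 3: append the verified continuation
        return prompt + verified

    def _generate_draft(self, prompt: str, max_len: int) -> str:
        """Generate a single draft."""
        state = prompt
        for _ in range(max_len):
            token = self.draft.sample_next(state)
            if token == self.draft.eos_token:
                break
            state += token
        return state[len(prompt):]

    def _parallel_verify(self, prompt: str, drafts: List[str]) -> str:
        """Verify several drafts in one batch and keep the best one."""
        # Build the batched verifier input
        batch_inputs = [prompt + draft for draft in drafts]
        # Batched verification (parallel on the GPU)
        batch_logits = self.verifier.forward_batch(batch_inputs)
        # The drafts are alternative continuations of the same prompt, so keep
        # the longest verified prefix rather than concatenating all of them
        best = ""
        for draft, logits in zip(drafts, batch_logits):
            accepted = "".join(self._extract_verified_tokens(draft, logits))
            if len(accepted) > len(best):
                best = accepted
        return best

Performance: Saguaro achieves roughly a 5× speedup on multiple benchmarks.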
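SemanticSpec: semantics-aware verification
SemanticSpec verifies drafts at the semantic level using hidden states, an approach described as more efficient and accurate than conventional token-level verification: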
import torch
import torch.nn as nn


class SemanticSpecVerifier:
    """
    SemanticSpec: semantics-aware verifier.
    Judges semantic consistency from hidden states.
    """

    def __init__(self, model, hidden_dim: int):
        self.model = model
        self.hidden_dim = hidden_dim
        self.semantic_proj = nn.Linear(hidden_dim, hidden_dim)
        self.semantic_head = nn.Linear(hidden_dim, 1)

    def verify(self, prompt: str, draft_tokens: str,
               target_len: int) -> tuple[bool, float]:
        """
        Semantic-level verification.
        Returns (accepted, confidence).
        """
        # Hidden states of the draft and of the target
        # (get_hidden_states is assumed to return one vector per position)
        draft_hidden = self.model.get_hidden_states(prompt + draft_tokens)
        target_hidden = self.model.get_hidden_states(
            prompt + draft_tokens[:target_len]
        )
        # Semantic consistency score
        draft_semantic = self.semantic_proj(draft_hidden[-1])
        target_semantic = target_hidden[-1]
        similarity = torch.cosine_similarity(
            draft_semantic.unsqueeze(0),
            target_semantic.unsqueeze(0)
        )
        confidence = torch.sigmoid(self.semantic_head(draft_hidden[-1]))
        # Accept when the semantics agree and the confidence is high
        accept = bool((similarity > 0.9) & (confidence > 0.8))
        return accept, confidence.item()

Calibrated Speculative Decoding (CSD)
Calibrated Speculative Decoding (CSD) raises the acceptance rate by calibrating the draft model's confidence:
import numpy as np
import torch


class CalibratedSpeculativeDecoder:
    """
    Calibrated speculative decoding.
    Adjusts the acceptance threshold based on verification difficulty.
    Assumes both models return hidden states of the same shape.
    """

    def __init__(self, draft_model, verifier_model, calibration_data):
        self.draft = draft_model
        self.verifier = verifier_model
        self.calibrator = self._calibrate(calibration_data)

    def _calibrate(self, data):
        """Learn the acceptance threshold from calibration data."""
        # Collect draft/verifier agreement statistics
        calibration_scores = []
        for prompt, response in data:
            draft_hidden = self.draft.get_hidden_states(prompt)
            verifier_hidden = self.verifier.get_hidden_states(prompt)
            # Difference between the hidden states
            diff = torch.norm(draft_hidden - verifier_hidden, p=2)
            calibration_scores.append(diff.item())
        # Learn a threshold that matches the acceptance rate to the accuracy
        return self._fit_threshold(calibration_scores)

    def _fit_threshold(self, scores):
        """Fit the threshold."""
        # Goal: acceptance rate ≈ verification accuracy
        # Use a percentile as the threshold
        return np.percentile(scores, 70)  # accept the closest 70% of drafts

    def generate(self, prompt: str) -> str:
        # Generate a draft
        draft = self.draft.generate(prompt, max_len=16)
        # Compute the calibration score of the draft
        draft_hidden = self.draft.get_hidden_states(prompt + draft)
        verifier_hidden = self.verifier.get_hidden_states(prompt + draft)
        score = torch.norm(draft_hidden - verifier_hidden, p=2).item()
        # Accept or reject according to the calibrated threshold
        if score < self.calibrator:
            return draft  # accept the whole draft
        else:
            # Reject and fall back to the verifier
            return self.verifier.generate(prompt)

Performance: CSD achieves roughly a 2.33× speedup on multiple benchmarks while maintaining verifier-level accuracy.
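The evolution of chain-of-thought reasoning
Standard chain-of-thought (CoT)
Standard CoT improves complex reasoning by having the model generate its reasoning steps explicitly: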
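# Standard CoT example
"""
Problem: Chickens and rabbits share a cage. There are 8 heads and 26 feet in total. How many chickens and how many rabbits are there?
Standard reasoning:
Let x be the number of chickens and y the number of rabbits.
x + y = 8 (heads)
2x + 4y = 26 (feet)
Solve the equations:
From the first equation: x = 8 - y
Substitute into the second equation: 2(8 - y) + 4y = 26
16 - 2y + 4y = 26
2y = 10
y = 5
x = 8 - 5 = 3
Answer: 3 chickens and 5 rabbits.
"""

Fractured CoT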
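Fractured CoT builds on an interesting observation: in many cases a truncated reasoning chain (truncated CoT) reaches nearly the same accuracy as the full chain: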
from typing import List


class FracturedCoT:
    """
    Fractured CoT: use a truncated reasoning chain.
    The helpers _reconstruct and _is_key_step are assumed to be defined elsewhere.
    """

    def __init__(self, model, truncation_ratio: float = 0.6):
        self.model = model
        self.truncation_ratio = truncation_ratio

    def solve(self, problem: str) -> str:
        # Generate the full reasoning
        full_reasoning = self.model.generate(problem)
        # Find the key nodes of the reasoning
        key_nodes = self._find_key_nodes(full_reasoning)
        # Keep a fraction of the nodes according to the truncation ratio
        n_keep = int(len(key_nodes) * self.truncation_ratio)
        truncated_reasoning = self._reconstruct(key_nodes[:n_keep])
        # Generate the answer
        answer = self.model.generate(
            problem + "\n" + truncated_reasoning,
            max_tokens=50
        )
        return answer

    def _find_key_nodes(self, reasoning: str) -> List[str]:
        """Identify the key nodes in the reasoning chain."""
        # Use some heuristic to identify key steps
        nodes = []
        for step in reasoning.split("\n"):
            if self._is_key_step(step):
                nodes.append(step)
        return nodes

Experimental findings:
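| Dataset | Full CoT accuracy | Fractured CoT (60%) accuracy | Compute saved |
|---|---|---|---|
| GSM8K | 94.1% | 93.8% | 40% |
| MATH | 68.2% | 67.9% | 40% |
| ARC-Challenge | 86.5% | 85.2% | 40% |
Key insight: a reasoning trace contains many redundant steps that contribute little to the final answer.
Elastic Reasoning
Elastic Reasoning manages the "thinking budget" and the "solution budget" separately: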
class ElasticReasoning:
    """
    Elastic Reasoning: separate thinking and solution budgets.
    The helpers extract_features and classifier, and the stop_at_solution
    argument, are assumed to be provided elsewhere.
    """

    def __init__(self, model, base_think_budget: int = 2048,
                 base_solve_budget: int = 512):
        self.model = model
        self.base_think_budget = base_think_budget
        self.base_solve_budget = base_solve_budget

    def solve(self, problem: str, total_budget: int) -> str:
        # Split the total budget dynamically:
        # easy problems get more solution budget,
        # hard problems get more thinking budget
        difficulty = self.estimate_difficulty(problem)
        if difficulty == "easy":
            think_budget = total_budget * 0.2
            solve_budget = total_budget * 0.8
        elif difficulty == "medium":
            think_budget = total_budget * 0.5
            solve_budget = total_budget * 0.5
        else:  # hard
            think_budget = total_budget * 0.8
            solve_budget = total_budget * 0.2
        # Phase 1: deep thinking
        thinking = self.model.generate(
            f"{problem}\nPlease analyze in detail:",
            max_tokens=int(think_budget),
            stop_at_solution=True
        )
        # Phase 2: solution generation
        solution = self.model.generate(
            f"{problem}\n{thinking}\nTherefore, the answer is:",
            max_tokens=int(solve_budget)
        )
        return solution

    def estimate_difficulty(self, problem: str) -> str:
        """Estimate the problem difficulty."""
        # Use features such as lexical complexity and length
        features = self.extract_features(problem)
        return self.classifier.predict(features)

Core idea: different kinds of problems need different splits between thinking and solving, and Elastic Reasoning adjusts this split dynamically to make inference efficient.
Think Deep, Think Fast
The "Think Deep, Think Fast" paper examines an interesting property of reasoning models: majority voting is especially effective for them:
from collections import Counter


class ThinkDeepThinkFast:
    """
    Deep thinking plus fast sampling.
    An inference strategy designed specifically for reasoning models.
    The helpers extract_answer and estimate_confidence are assumed to be
    defined elsewhere.
    """

    def __init__(self, reasoning_model, n_samples: int = 16):
        self.model = reasoning_model
        self.n_samples = n_samples

    def solve(self, problem: str) -> str:
        answers = []
        confidences = []
        for _ in range(self.n_samples):
            # The reasoning model generates several solutions
            response = self.model.generate(
                problem,
                temperature=0.7,
                max_thinking_tokens=4096
            )
            answer = self.extract_answer(response)
            confidence = self.estimate_confidence(response)
            answers.append(answer)
            confidences.append(confidence)
        # Method 1: standard majority voting
        standard_vote = Counter(answers).most_common(1)[0][0]
        # Method 2: confidence-weighted voting (computed for comparison)
        weighted_votes = {}
        for ans, conf in zip(answers, confidences):
            weighted_votes[ans] = weighted_votes.get(ans, 0) + conf
        weighted_vote = max(weighted_votes, key=weighted_votes.get)
        # Method 3: vote using only high-confidence answers (computed for comparison)
        high_conf_indices = [i for i, c in enumerate(confidences) if c > 0.8]
        if high_conf_indices:
            high_conf_answers = [answers[i] for i in high_conf_indices]
            high_conf_vote = Counter(high_conf_answers).most_common(1)[0][0]
        else:
            high_conf_vote = standard_vote
        return standard_vote  # standard voting is the most robust for reasoning models

Experimental results:
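| Model | Single-sample accuracy | + Majority voting (N=16) | Gain |
|---|---|---|---|
| GPT-4 (standard) | 67.2% | 78.4% | +11.2 pts |
| o1-mini | 71.8% | 85.3% | +13.5 pts |
| o1-preview | 75.6% | 88.1% | +12.5 pts |
Key finding: reasoning models (such as the o1 series) benefit more from majority voting, with larger gains than the standard LLM.
Practical guide
Strategy selection framework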
def select_inference_strategy(task: str, budget: str,
                              has_verifier: bool = False) -> str:
    """
    Choose an inference strategy based on task type and resources.

    Args:
        task: task type ("math", "code", "logic", "fact")
        budget: resource budget ("low", "medium", "high")
        has_verifier: whether a verifier is available

    Returns:
        Name of the recommended strategy
    """
    if budget == "low":
        if task in ["math", "logic"]:
            return "Standard CoT"
        else:
            return "Greedy decoding"
    elif budget == "medium":
        if has_verifier:
            return "Best-of-N + verifier"
        elif task == "math":
            return "CoT + majority voting"
        else:
            return "Parallel sampling"
    else:  # high
        if has_verifier:
            return "Tree search + PRM"
        else:
            return "Reasoning model (o1/R1)"

Cost-benefit analysis
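| Strategy | Latency increase | Accuracy gain | Suitable for |
|---|---|---|---|
| Standard CoT | 2-3× | 10-20% | All tasks |
| Majority voting (N=16) | 16× | 5-15% | Tasks with a well-defined answer |
| Best-of-N (N=16) | 16× | 10-25% | Tasks with a verifier |
| MCTS (64 rollouts) | 64× | 20-35% | Complex reasoning tasks |
| Reasoning model (o1/R1) | 5-10× | 30-50% | Complex reasoning tasks |
Best-practice recommendations
- Start simple: try standard CoT first and measure the gain
- Evaluate ROI: check whether the accuracy gain justifies the extra latency
- Use a verifier: if a reliable verifier is available, prefer Best-of-N
- Choose adaptively: adjust the strategy to the difficulty of the problem
- Combine strategies: several methods can be combined, such as CoT + majority voting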
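The ROI check recommended above can be sketched as accuracy gain per unit of added latency. The numbers below are the midpoints of the ranges in the cost-benefit table, used purely for illustration. The table does not say whether the gains are percentage points or relative improvements; percentage points are assumed here. `Strategy` and `gain_per_latency` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    latency_multiplier: float   # relative to a single greedy pass
    accuracy_gain_pts: float    # assumed to be percentage points


def gain_per_latency(s: Strategy) -> float:
    """Accuracy points gained per 1x of latency spent."""
    return s.accuracy_gain_pts / s.latency_multiplier


# Midpoints of the table ranges; illustrative, not measured values.
strategies = [
    Strategy("Standard CoT", 2.5, 15.0),
    Strategy("Majority voting (N=16)", 16.0, 10.0),
    Strategy("Best-of-N (N=16)", 16.0, 17.5),
    Strategy("MCTS (64 rollouts)", 64.0, 27.5),
    Strategy("Reasoning model (o1/R1)", 7.5, 40.0),
]

ranked = sorted(strategies, key=gain_per_latency, reverse=True)
for s in ranked:
    print(f"{s.name}: {gain_per_latency(s):.2f} pts per 1x latency")
```

On these midpoints standard CoT gives the most accuracy per unit of latency and MCTS the least, which is consistent with starting simple and reserving heavy search for problems where the absolute gain matters more than the cost.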
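References
Related topics
- Chain-of-thought reasoning: basic principles and variants of CoT
- Test-time compute scaling theory: theoretical foundations of inference strategies
- Reasoning models: architectures of reasoning models such as o1/o3/R1
- Process reward models: applying PRMs in tree search
- MCTS and LLM reasoning: Monte Carlo tree search for enhanced reasoning
- Survey of test-time reasoning: a panorama of reasoning techniques
- Speculative reasoning: the SpecReason method in detail
- Latent reasoning: Latent Reasoning architectures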