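Overview
Prompt engineering is the practice of designing, refining, and iterating on input prompts to steer a large language model (LLM) toward a desired output. Unlike traditional machine learning, which typically needs large amounts of labeled data, prompt engineering relies on the LLM's in-context learning (ICL) ability and adapts the model to a task through carefully designed prompts.
Core principle: a good prompt is the bridge between the task, the model, and the user's intent.
Prompt engineering matters for the following reasons:
| Aspect | Description |
|---|---|
| Parameter-free adaptation | Adapts to new tasks without fine-tuning |
| Cost-effectiveness | Far cheaper in compute than training |
| Fast iteration | Quick to experiment with and optimize |
| Capability elicitation | Draws out the model's latent reasoning and generation abilities |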
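Few-shot Learning and Example Design
Few-shot vs Zero-shot vs One-shot
Prompts fall into three modes according to the number of examples they contain:
| Mode | Number of examples | Characteristics | Typical use |
|---|---|---|---|
| Zero-shot | 0 | Task description only | Simple, direct tasks |
| One-shot | 1 | A single example | Demonstrating a format |
| Few-shot | k (k ≥ 2) | Several examples | Complex tasks that require generalization |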
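The three modes in the table differ only in how many demonstrations precede the query, so a single builder can produce all of them. The following is a minimal sketch under that assumption; `build_k_shot_prompt` and the `Chinese:` / `English:` labels are hypothetical names chosen for illustration, not part of any library.

```python
from typing import List, Tuple

def build_k_shot_prompt(instruction: str,
                        examples: List[Tuple[str, str]],
                        query: str,
                        k: int) -> str:
    """Assemble a prompt that uses the first k (source, target) pairs as demonstrations."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and len(examples)")
    lines = [instruction]
    for source, target in examples[:k]:
        lines.append(f"Chinese: {source}")
        lines.append(f"English: {target}")
    # The query reuses the demonstration format and leaves the target empty
    lines.append(f"Chinese: {query}")
    lines.append("English:")
    return "\n".join(lines)

pairs = [("你好,世界。", "Hello, world."), ("我喜欢读书。", "I like reading books.")]
for k in (0, 1, 2):  # zero-shot, one-shot, few-shot
    print(f"--- k={k} ---")
    print(build_k_shot_prompt("Translate the following sentence into English:",
                              pairs, "今天天气很好。", k))
```

Keeping the query in the same layout as the demonstrations means the model only has to continue an established pattern.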
# Zero-shot Prompt
zero_shot = """
将以下句子翻译成英文:
今天天气很好。
"""
# One-shot Prompt
one_shot = """
将以下句子翻译成英文:
示例:
中文:你好,世界。
英文:Hello, world.
中文:今天天气很好。
英文:
"""
# Few-shot Prompt
few_shot = """
将以下句子翻译成英文:
示例1:
中文:你好,世界。
英文:Hello, world.
示例2:
中文:我喜欢读书。
英文:I like reading books.
中文:今天天气很好。
英文:
"""
Example selection strategies
Diversity principle
Examples should cover different aspects of the task, including its edge cases:
# 好的多样性示例选择
good_examples = [
# 正面情感
{"text": "这个产品太棒了!", "label": "正面"},
# 负面情感
{"text": "质量很差,非常失望。", "label": "负面"},
# 中性情感
{"text": "这是一个普通的杯子。", "label": "中性"},
# 复杂情感(带讽刺)
{"text": "哇,这个手机只能用两小时呢!", "label": "负面"}
]
# 差的多样性示例(全部相似)
bad_examples = [
{"text": "很好", "label": "正面"},
{"text": "不错", "label": "正面"},
{"text": "挺好", "label": "正面"}
]
Representativeness principle
Examples should reflect the data distribution the model will meet in real use:
def select_representative_examples(dataset, k=5):
"""
选择代表性示例
策略:基于聚类的分层采样
"""
# 1. 对数据进行聚类
embeddings = [get_embedding(x) for x in dataset]
clusters = kmeans(embeddings, n_clusters=k)
# 2. 从每个簇中选择最接近质心的示例
selected = []
for cluster_id in range(k):
cluster_points = [i for i, c in enumerate(clusters) if c == cluster_id]
cluster_embs = [embeddings[i] for i in cluster_points]
centroid = mean(cluster_embs)
# 选择最接近质心的点
best_idx = min(cluster_points,
key=lambda i: cosine_distance(embeddings[i], centroid))
selected.append(dataset[best_idx])
    return selected
Difficulty balance principle
Include examples of varying difficulty: easy, medium, and hard:
# 按难度分层的示例选择
examples_by_difficulty = {
"easy": [
{"text": "好", "label": "正面"},
{"text": "差", "label": "负面"}
],
"medium": [
{"text": "这部电影还可以,但不够精彩。", "label": "中性"},
{"text": "产品还行,就是发货有点慢。", "label": "中性"}
],
"hard": [
{"text": "虽然包装很漂亮,但内容物完全不符合描述。", "label": "负面"},
{"text": "说是升级版,结果还不如老款。", "label": "负面"}
]
}
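# Balanced selection: 1 easy + 2 medium + 1 hard
import random

def balanced_selection(examples):
    """Pick 1 easy, 2 medium, and 1 hard example from a difficulty-keyed dict."""
    return (
        [examples["easy"][0]] +
        random.sample(examples["medium"], 2) +
        [examples["hard"][0]]
    )
Example format and ordering effects
Format consistency
Format consistency has a marked effect on ICL performance: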
# 一致的格式
consistent_format = """
任务:情感分类
示例1:
输入:我今天很开心
输出:正面
示例2:
输入:这个电影太无聊了
输出:负面
现在请预测:
输入:我考试得了满分
输出:
"""
# 不一致的格式(性能下降)
inconsistent_format = """
情感分类示例:
- "很好" -> 正面
- 坏 -> 负面
分类:今天心情不错 -> ???
"""
Ordering effects
The order in which examples appear also affects how well the model generalizes:
# 常见的有效排序策略
ordering_strategies = {
# 1. 从简单到复杂
"easy_to_hard": [
"好", # 最简单
"一般", # 中等
"虽然贵但质量很好", # 复杂
],
# 2. 从常见到罕见
"common_to_rare": [
"我喜欢这个产品", # 常见表达
"物超所值", # 中等
"性价比无出其右", # 罕见表达
],
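    # 3. By semantic similarity (demos closest to the query go last)
    "similar_last": lambda query, demos: sorted(
        demos,
        # Ascending similarity, so the most similar demo ends up next to the query
        key=lambda d: semantic_similarity(query, d)
    )
}
Diminishing marginal returns of k-shot
As the number of examples k grows, the improvement shows diminishing marginal returns: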
[Figure: ICL performance (vertical axis, 0 to 80) against the number of examples k (horizontal axis, 0 to 64). The curve rises steeply for small k and then flattens, illustrating diminishing marginal returns.]
# 边际效益分析
def analyze_k_shot_returns(model, task, max_k=32):
"""分析不同k值的收益"""
results = []
for k in [0, 1, 2, 4, 8, 16, 32]:
examples = select_examples(task, k)
accuracy = evaluate(model, task, examples)
results.append((k, accuracy))
# 计算边际收益
marginal_gains = []
for i in range(1, len(results)):
gain = results[i][1] - results[i-1][1]
marginal_gains.append((results[i][0], gain))
return results, marginal_gains
# 典型观察:
# k=0→1: +5~10%
# k=1→4: +3~5%
# k=4→8: +1~3%
# k=8→16: +0.5~1%
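# k>16: gains level off
Eliciting Chain-of-Thought Reasoning
Discovery and mechanism of CoT
Chain-of-Thought (CoT) prompting was introduced by Wei et al. in 2022. The core finding is that asking the model to show its reasoning process can substantially improve performance on complex tasks.
See: 链式推理与思维链
One way to understand why CoT works is from a computational perspective: the intermediate reasoning steps give the model an extra "compute budget". Each generated token is fed back as input, so the model can carry out more sequential computation at inference time. No parameters are updated in the process.
Zero-shot CoT: trigger-phrase strategy
Zero-shot CoT activates the model's reasoning ability with a simple trigger phrase: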
def zero_shot_cot(model, question):
"""
Zero-shot CoT 的两步法
"""
# Step 1: 触发推理
reasoning_prompt = f"{question}\n让我们逐步思考。"
reasoning = model.generate(reasoning_prompt)
    # Step 2: extract the answer (keep the question and trigger phrase in the context)
    answer_prompt = f"{reasoning_prompt}\n{reasoning}\n根据以上推理,答案是:"
answer = model.generate(answer_prompt)
return reasoning, answer
# 更简洁的版本(一步法)
def zero_shot_cot_v2(model, question):
"""一步到位的Zero-shot CoT"""
prompt = f"""问题:{question}
请详细分析这个问题,逐步推理,最后给出答案。
"""
    return model.generate(prompt)
Few-shot CoT: example design principles
def design_cot_demonstrations(task_examples, task_type="math"):
"""
设计CoT示例的核心原则
"""
demonstrations = []
for ex in task_examples:
if task_type == "math":
demo = f"""
问题:{ex['question']}
分析:
{ex['reasoning']}
答案:{ex['answer']}
"""
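        elif task_type == "logic":
            # Number the steps outside the f-string, one per line
            steps = "\n".join(f"  {i+1}. {s}" for i, s in enumerate(ex['steps']))
            demo = f"""
命题:{ex['premise']}
推理过程:
{steps}
结论:{ex['conclusion']}
"""
        else:
            raise ValueError(f"unsupported task_type: {task_type}")
        demonstrations.append(demo)
    return "\n".join(demonstrations)
Key points for designing examples: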
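| Element | Description | Example |
|---|---|---|
| Step completeness | Every derivation step is shown | First compute ..., then compute ..., finally ... |
| Clear goal | Every step serves the final answer | → yields ... |
| Consistent format | Uniform reasoning style | Step 1: ... → ... |
| Error demonstration | Include common pitfalls where appropriate | Mark as "common pitfall" |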
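The "step completeness" and "consistent format" rows can be enforced mechanically by rendering every demonstration through one formatter. The sketch below assumes each demonstration is a dict with `question`, `steps`, and `answer` keys; the function names and the `Step i:` layout are illustrative choices, not a fixed convention.

```python
from typing import Dict, List

def format_cot_demo(question: str, steps: List[str], answer: str) -> str:
    """Render one demonstration with numbered steps and a fixed answer line."""
    step_lines = [f"Step {i}: {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join([f"Question: {question}", *step_lines, f"Answer: {answer}"])

def build_cot_prompt(demos: List[Dict], question: str) -> str:
    """Join the demonstrations and end with the new question, leaving step 1 open."""
    blocks = [format_cot_demo(d["question"], d["steps"], d["answer"]) for d in demos]
    blocks.append(f"Question: {question}\nStep 1:")
    return "\n\n".join(blocks)

demos = [{
    "question": "Compute 23 × 47",
    "steps": ["23 × 40 = 920", "23 × 7 = 161", "920 + 161 = 1081"],
    "answer": "1081",
}]
print(build_cot_prompt(demos, "Compute 12 × 15"))
```

Ending the prompt with an open `Step 1:` nudges the model to continue in the same step-by-step style as the demonstrations.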
Limitations of Chain-of-Thought
# CoT的常见局限性
class CoTLimitations:
"""思维链的局限性分析"""
# 1. 错误传播
error_propagation = """
问题:计算 23 × 47
推理:
1. 23 × 47 = 23 × 40 + 23 × 7 ← 分解正确
2. = 920 + 23 × 7 ← 920 = 23 × 40,正确
3. = 920 + 161 ← 23 × 7 = 161,正确
4. = 1081 ← 最终答案
答案:1081 ✓
但如果第2步出错:23 × 40 = 800(错误)
后续全部错误
"""
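    # 2. Spurious reasoning
    spurious_reasoning = """
    Question: All crows are birds, and all black things are crows, so?
    Reasoning:
    1. All crows are birds (premise 1) ✓
    2. All black things are crows (premise 2, false but accepted unchecked)
    3. All black things are birds (follows from 1 and 2)
    Answer: All black things are birds ✗ (valid form, false premise)
    """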
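    # 3. Compute cost
    compute_cost = """
    Token comparison:
    - No CoT: ~50 tokens
    - CoT: ~500 tokens (10x)
    - Long CoT: ~2000 tokens (40x)
    """
Tree-of-Thought and Search Strategies
The ToT framework: an explore-evaluate-expand loop
Tree of Thoughts (ToT) models problem solving as a tree search: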
from dataclasses import dataclass
from typing import List, Optional, Callable
import heapq
@dataclass
class ThoughtNode:
"""思维树节点"""
content: str
value: float
depth: int
    parent: Optional['ThoughtNode'] = None
    visit_count: int = 0  # number of backpropagation updates; used by MCTS-style selection
def get_path(self) -> List[str]:
"""获取从根到当前节点的路径"""
path = []
node = self
while node:
path.append(node.content)
node = node.parent
return path[::-1]
class TreeOfThoughts:
"""
树状思考框架
核心循环:探索 → 评估 → 扩展 → 选择
"""
def __init__(
self,
model,
num_branches: int = 5,
max_depth: int = 10,
value_fn: Optional[Callable] = None
):
self.model = model
self.num_branches = num_branches
self.max_depth = max_depth
self.value_fn = value_fn or self._default_value_fn
def _default_value_fn(self, thought: str) -> float:
"""默认评估函数(可替换为更复杂的价值模型)"""
# 简单的启发式评估
score = 0.0
if "错误" in thought or "不对" in thought:
score -= 0.5
if "正确" in thought or "验证" in thought:
score += 0.3
return score
def expand(self, node: ThoughtNode) -> List[ThoughtNode]:
"""扩展节点:生成多个候选下一步"""
prompt = f"""当前推理:
{node.content}
请提出3-5个可能的下一步推理方向:
"""
suggestions = self.model.generate(prompt, n=self.num_branches)
children = []
for i, suggestion in enumerate(suggestions):
child = ThoughtNode(
content=suggestion,
value=self.value_fn(suggestion),
depth=node.depth + 1,
parent=node
)
children.append(child)
return children
def evaluate(self, node: ThoughtNode) -> float:
"""评估节点价值"""
# 结合当前价值和深度惩罚
depth_penalty = 0.95 ** node.depth
return node.value * depth_penalty
def search(
self,
initial_thought: str,
strategy: str = "best"
) -> ThoughtNode:
"""
搜索最佳解决方案
Args:
initial_thought: 初始思考
strategy: 搜索策略 ("best", "bfs", "dfs")
"""
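        if strategy == "best":
            return self._beam_search(initial_thought)
        elif strategy == "bfs":
            # bfs_search and dfs_search are the module-level functions defined later
            return bfs_search(self, initial_thought)
        elif strategy == "dfs":
            return dfs_search(self, initial_thought)
        raise ValueError(f"unknown strategy: {strategy}")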
    def _beam_search(self, initial: str) -> ThoughtNode:
        """Beam search: keep the top-k candidates at each depth."""
        root = ThoughtNode(content=initial, value=0.0, depth=0)
        frontier = [root]
        while frontier[0].depth < self.max_depth:
            candidates = []
            for node in frontier:
                for child in self.expand(node):
                    child.value = self.evaluate(child)
                    candidates.append(child)
            if not candidates:
                break  # nothing to expand; keep the current frontier
            # Prune: keep the top-k
            candidates.sort(key=lambda x: x.value, reverse=True)
            frontier = candidates[:self.num_branches]
        return max(frontier, key=lambda x: x.value)
Breadth-first vs depth-first search
def compare_search_strategies():
"""
搜索策略对比
"""
strategies = {
"BFS (广度优先)": {
"优点": ["全面探索", "不易陷入局部最优"],
"缺点": ["内存消耗大", "可能探索无用分支"],
"适用": ["问题空间大、分支多"],
},
"DFS (深度优先)": {
"优点": ["内存效率高", "找到解快(运气好时)"],
"缺点": ["可能陷深渊", "错过更优解"],
"适用": ["解路径短、分支明确"],
},
"Beam Search (束搜索)": {
"优点": ["平衡探索与利用", "内存可控"],
"缺点": ["可能错过远处的好解"],
"适用": ["大多数场景"],
}
}
return strategies
# BFS example
def bfs_search(tree: TreeOfThoughts, initial: str, max_nodes=1000):
    """Breadth-first search that returns the best-valued node seen."""
    from collections import deque
    root = ThoughtNode(content=initial, value=0.0, depth=0)
    queue = deque([root])
    best = root  # tracked separately so an empty queue cannot crash the return
    nodes_visited = 0
    while queue and nodes_visited < max_nodes:
        node = queue.popleft()  # FIFO, O(1)
        nodes_visited += 1
        if node.depth >= tree.max_depth:
            continue
        for child in tree.expand(node):
            child.value = tree.evaluate(child)
            if child.value > best.value:
                best = child
            queue.append(child)
    return best
# DFS示例(带剪枝)
def dfs_search(tree: TreeOfThoughts, initial: str, visited=None):
"""深度优先搜索实现"""
if visited is None:
visited = set()
root = ThoughtNode(content=initial, value=0.0, depth=0)
def dfs_recursive(node: ThoughtNode) -> Optional[ThoughtNode]:
if node.depth >= tree.max_depth:
return node
node_key = node.content[:50] # 简化去重
if node_key in visited:
return None
visited.add(node_key)
children = tree.expand(node)
best_child = None
best_value = float('-inf')
for child in children:
child.value = tree.evaluate(child)
result = dfs_recursive(child)
if result and result.value > best_value:
best_value = result.value
best_child = result
return best_child if best_child else node
    return dfs_recursive(root)
Pruning strategies and value estimation
class PruningStrategies:
"""剪枝策略"""
@staticmethod
def confidence_pruning(node: ThoughtNode, threshold: float = 0.3) -> bool:
"""基于置信度的剪枝"""
return node.value < threshold
@staticmethod
def diversity_pruning(nodes: List[ThoughtNode], threshold: float = 0.9) -> List[ThoughtNode]:
"""基于多样性的剪枝:移除过于相似的节点"""
selected = []
for node in nodes:
is_diverse = True
for sel in selected:
sim = cosine_similarity(node.content, sel.content)
if sim > threshold:
is_diverse = False
break
if is_diverse:
selected.append(node)
return selected
@staticmethod
def expected_value_pruning(node: ThoughtNode, max_children: int = 3) -> bool:
"""基于期望价值的剪枝"""
# 估计子节点的期望价值
expected = node.value * 0.8 # 简化估计
return expected < node.value / max_children
# 价值估计器
class ValueEstimator:
"""多维度价值估计"""
def __init__(self, model):
self.model = model
def estimate(self, thought: str) -> float:
"""综合评估思考节点的价值"""
scores = {
"correctness": self._estimate_correctness(thought),
"completeness": self._estimate_completeness(thought),
"novelty": self._estimate_novelty(thought),
"simplicity": self._estimate_simplicity(thought)
}
# 加权组合
weights = {"correctness": 0.4, "completeness": 0.3,
"novelty": 0.15, "simplicity": 0.15}
return sum(scores[k] * weights[k] for k in weights)
def _estimate_correctness(self, thought: str) -> float:
"""估计正确性"""
# 可以使用验证模型或启发式规则
return 0.5 # 占位
def _estimate_completeness(self, thought: str) -> float:
"""估计完整性"""
# 检查是否包含所有必要步骤
return 0.5
def _estimate_novelty(self, thought: str) -> float:
"""估计新颖性"""
return 0.5
def _estimate_simplicity(self, thought: str) -> float:
"""估计简洁性"""
# 越简洁越好(但不能太简单)
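        return 0.5
Relation to MCTS
ToT is closely related to Monte Carlo Tree Search as applied to LLM reasoning (see: MCTS与LLM推理):
| Aspect | ToT | MCTS |
|---|---|---|
| Selection | Value-based ranking (BFS/DFS, beam pruning) | UCB-based |
| Expansion | Model generation | Rule-based expansion |
| Simulation | Model sampling | Random rollout |
| Backup | Value backup | Reward backup |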
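For reference, the UCB-based selection in the MCTS column is usually the UCB1 rule. The symbols below are notation assumed for this note rather than taken from the original: $\bar{v}_i$ is the mean value of child $i$, $n_i$ its visit count, $N$ the visit count of its parent, and $c$ an exploration constant.

$$
\mathrm{UCB1}(i) = \bar{v}_i + c\,\sqrt{\frac{\ln N}{n_i}}
$$

The first term favors children that have scored well so far, and the second grows for children that have rarely been visited. Implementations often add 1 inside the logarithm and the denominator so that unvisited nodes do not cause a division by zero.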
class MCTSLikeToT(TreeOfThoughts):
"""MCTS风格的ToT"""
    def ucb_select(self, nodes: List[ThoughtNode], parent_visits: int, c=1.4) -> ThoughtNode:
        """UCB1 selection among sibling nodes; parent_visits is the parent's visit count."""
        import math
        best_node = None
        best_ucb = float('-inf')
        for node in nodes:
            if node.depth == 0:
                ucb = node.value
            else:
                # UCB = mean value + exploration bonus; the +1 terms avoid log(0) and division by zero
                exploration = c * math.sqrt(math.log(parent_visits + 1) / (node.visit_count + 1))
                ucb = node.value + exploration
            if ucb > best_ucb:
                best_ucb = ucb
                best_node = node
        return best_node
def backpropagate(self, node: ThoughtNode, reward: float):
"""回传更新"""
while node:
node.value = (node.value * node.visit_count + reward) / (node.visit_count + 1)
node.visit_count += 1
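            node = node.parent
Self-Consistency and Vote-Based Decoding
Multi-path sampling strategy
Self-Consistency improves reliability by sampling several reasoning paths and aggregating their answers: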
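Before the fuller implementation, the aggregation step on its own can be sketched in a few lines. This is an illustration under the assumption that final answers have already been extracted as strings; `majority_vote` and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical final answers extracted from five sampled reasoning paths
sampled = ["1081", "1081", "1071", "1081", "981"]
print(majority_vote(sampled))  # ('1081', 0.6)
```

The agreement share is a cheap confidence signal: a low value suggests the sampled paths disagree and the answer deserves a closer look.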
import numpy as np
from collections import Counter
from typing import List, Tuple
def self_consistency_sampling(
model,
question: str,
n_samples: int = 40,
temperature: float = 0.7,
prompt_template: str = "{question}\n让我们逐步思考。"
) -> Tuple[str, Counter]:
"""
Self-Consistency: 多路径采样 + 多数投票
Args:
model: 语言模型
question: 问题
n_samples: 采样数量
temperature: 采样温度
prompt_template: prompt模板
Returns:
final_answer: 投票得出的最终答案
answer_counts: 各答案的投票统计
"""
answers = []
reasoning_paths = []
for _ in range(n_samples):
# 生成推理路径
prompt = prompt_template.format(question=question)
response = model.generate(prompt, temperature=temperature)
# 提取答案
answer = extract_final_answer(response)
answers.append(answer)
reasoning_paths.append(response)
# 多数投票
answer_counts = Counter(answers)
final_answer = answer_counts.most_common(1)[0][0]
return final_answer, answer_counts
def extract_final_answer(text: str) -> str:
    """Extract the final answer from a reasoning transcript."""
    import re
    # Patterns ordered from most to least specific
    patterns = [
        r'最终答案[::]\s*(.+?)(?:\n|$)',
        r'答案是[::]\s*(.+?)(?:\n|$)',
        r'因此[,,]?\s*(?:最终[答案结果为是]*[::]*\s*)?(.+?)(?:\n|$)',
        r'=\s*(\d+(?:\.\d+)?)',
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            # Take the last match: earlier ones are usually intermediate steps
            return matches[-1].strip()
    # Fallback: use the last line
    return text.strip().split('\n')[-1]
Majority voting vs weighted voting
class VotingStrategies:
"""投票策略"""
@staticmethod
def majority_vote(answers: List[str]) -> str:
"""简单多数投票"""
counts = Counter(answers)
return counts.most_common(1)[0][0]
@staticmethod
def weighted_vote(
answers: List[str],
weights: List[float],
answer_scores: List[float]
) -> str:
"""
加权投票
Args:
answers: 答案列表
weights: 采样权重(如置信度)
answer_scores: 每个答案的模型评分
"""
scores = Counter()
for ans, w, score in zip(answers, weights, answer_scores):
scores[ans] += w * score
return scores.most_common(1)[0][0]
@staticmethod
def soft_voting(probabilities: np.ndarray, answer_mapping: List[str]) -> str:
"""
软投票:基于概率分布
"""
# 聚合概率分布
avg_probs = probabilities.mean(axis=0)
best_idx = np.argmax(avg_probs)
return answer_mapping[best_idx]
@staticmethod
def bayesian_vote(answers: List[str], prior: dict = None) -> str:
"""
贝叶斯投票:结合先验知识
"""
if prior is None:
prior = {}
counts = Counter(answers)
# 加入先验
for ans, prior_count in prior.items():
counts[ans] += prior_count
        return counts.most_common(1)[0][0]
Path selection and confidence estimation
class PathSelector:
"""推理路径选择器"""
def __init__(self, model):
self.model = model
def estimate_confidence(self, reasoning: str) -> float:
"""估计推理路径的置信度"""
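        # Method 1: mean token probability (already in [0, 1])
        avg_prob = self._avg_token_probability(reasoning)
        # Method 2: reasoning length, capped at 1.0
        length_score = min(len(reasoning) / 500, 1.0)
        # Method 3: keyword detection
        confidence_keywords = ["确定", "肯定", "无疑", "显然"]
        doubt_keywords = ["可能", "大概", "也许", "不确定"]
        doubt = sum(reasoning.count(w) for w in doubt_keywords)
        # "不确定" contains "确定", so strip doubt phrases before counting confidence words
        stripped = reasoning
        for w in doubt_keywords:
            stripped = stripped.replace(w, "")
        confidence = sum(stripped.count(w) for w in confidence_keywords)
        keyword_score = (confidence - doubt) / (confidence + doubt + 1)
        # Weighted combination
        return 0.5 * avg_prob + 0.2 * length_score + 0.3 * keyword_score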
def select_paths(
self,
paths: List[str],
k: int = 5,
threshold: float = 0.5
) -> List[Tuple[str, float]]:
"""
选择最佳路径
Returns:
[(path, confidence), ...] 按置信度排序
"""
scored_paths = []
for path in paths:
conf = self.estimate_confidence(path)
scored_paths.append((path, conf))
# 按置信度排序并截取
scored_paths.sort(key=lambda x: x[1], reverse=True)
# 应用阈值
return [(p, c) for p, c in scored_paths if c >= threshold][:k]
def _avg_token_probability(self, text: str) -> float:
"""计算平均token概率"""
# 使用模型的logprobs
tokens = self.model.tokenize(text)
logprobs = self.model.get_logprobs(tokens)
        return np.exp(np.mean(logprobs))
Compute-accuracy trade-off
def compute_effectiveness_tradeoff():
"""
Self-Consistency的计算-效果权衡分析
"""
data = {
"samples": [1, 5, 10, 20, 40, 80],
"accuracy": [46.9, 60.2, 68.5, 73.8, 78.7, 80.1],
"compute_ratio": [1.0, 5.0, 10.0, 20.0, 40.0, 80.0],
"marginal_gain": [0, 13.3, 8.3, 5.3, 4.9, 1.4]
}
# 边际收益递减明显
"""
样本数 | 准确率 | 计算量 | 边际提升
--------|--------|--------|----------
1 | 46.9% | 1x | -
5 | 60.2% | 5x | +13.3%
10 | 68.5% | 10x | +8.3%
20 | 73.8% | 20x | +5.3%
40 | 78.7% | 40x | +4.9%
80 | 80.1% | 80x | +1.4%
建议:n=20~40是大多数场景的最佳平衡点
"""
    return data
Role-playing and Constraint Expression
Designing role-play prompts
def design_role_prompt(role: str, task: str, constraints: List[str] = None) -> str:
"""
设计角色扮演提示词
"""
role_templates = {
"expert": """
你是一位专业的{role},拥有丰富的专业知识和实践经验。
你的回答应该体现专业性和权威性。
""",
"mentor": """
你是一位耐心的导师,善于用通俗易懂的语言解释复杂的概念。
你会循序渐进地引导思考,而不是直接给出答案。
""",
"critic": """
你是一位严格的评审,对观点持批判性思维。
你会指出逻辑漏洞、证据不足和潜在的偏见。
""",
"creative": """
你是一位富有创意的头脑风暴伙伴。
你会提出新颖的想法,打破常规思维。
"""
}
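    # `role` may be a template key ("expert", "mentor", ...) or a free-form role name;
    # free-form names fall back to the "expert" template, which interpolates {role}
    template = role_templates.get(role, role_templates["expert"])
    prompt = template.format(role=role)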
prompt += f"\n\n任务:{task}"
if constraints:
prompt += "\n\n约束条件:"
for c in constraints:
prompt += f"\n- {c}"
return prompt
# 示例
expert_prompt = design_role_prompt(
role="数学教授",
task="解释微积分中的极限概念",
constraints=["用直观的例子说明", "避免过于形式化的定义"]
)
Expressing constraints in natural language
class ConstraintExpressions:
"""约束条件的多种表达方式"""
# 格式约束
format_constraints = [
"输出格式必须是JSON",
"答案用【】包裹",
"每行一个要点",
"使用 bullet points 列出",
"结构化输出:1. ... 2. ... 3. ..."
]
# 内容约束
content_constraints = [
"只输出答案,不要解释",
"首先分析问题,然后给出解答",
"考虑以下几点:",
"忽略任何与政治相关的讨论",
"用中文回答"
]
# 风格约束
style_constraints = [
"语气要专业、严谨",
"保持简洁明了",
"使用友好的口吻",
"语言要正式",
"可以适当幽默"
]
# 组合示例
@staticmethod
def make_constraint_prompt(constraints: dict) -> str:
"""组合多种约束"""
sections = []
if "format" in constraints:
sections.append("【输出格式】\n" + "\n".join(constraints["format"]))
if "content" in constraints:
sections.append("【内容要求】\n" + "\n".join(constraints["content"]))
if "style" in constraints:
sections.append("【风格要求】\n" + "\n".join(constraints["style"]))
return "\n\n".join(sections)
# 实际应用
constraint_prompt = ConstraintExpressions.make_constraint_prompt({
"format": [
"首先给出结论",
"然后列出2-3个支持理由",
"最后提供具体例子"
],
"content": [
"只讨论技术方面",
"避免价值判断"
],
"style": [
"语气客观中立",
"使用专业术语"
]
})
System prompt engineering
class SystemPromptEngineering:
"""系统提示词工程"""
@staticmethod
def build_system_prompt(
persona: str,
capabilities: List[str],
limitations: List[str],
values: List[str]
) -> str:
"""
构建完整的系统提示词
"""
sections = []
# 身份定义
sections.append(f"# 身份\n你是一个{persona}。")
# 能力描述
if capabilities:
sections.append(f"# 核心能力\n" + "\n".join(f"- {c}" for c in capabilities))
# 限制说明
if limitations:
sections.append(f"# 已知限制\n" + "\n".join(f"- {l}" for l in limitations))
# 价值观
if values:
sections.append(f"# 行为准则\n" + "\n".join(f"- {v}" for v in values))
return "\n\n".join(sections)
@staticmethod
def iterative_refinement(
initial_prompt: str,
test_cases: List[dict],
feedback_fn: callable
) -> str:
"""
迭代优化系统提示词
"""
prompt = initial_prompt
best_prompt = prompt
best_score = 0
for iteration in range(10):
# 测试当前prompt
results = []
for case in test_cases:
response = generate(prompt, case["input"])
score = feedback_fn(response, case["expected"])
results.append(score)
avg_score = np.mean(results)
# 更新最佳
if avg_score > best_score:
best_score = avg_score
best_prompt = prompt
# 尝试改进
prompt = refine_prompt(prompt, results)
return best_prompt
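def refine_prompt(prompt: str, results: List[float]) -> str:
    """Refine the prompt based on test results."""
    # Simplified: a fuller version would add constraints targeted at the low-scoring cases
    low_score_indices = [i for i, s in enumerate(results) if s < 0.7]
    additions = """
补充说明:
- 当遇到模糊问题时,主动询问澄清
- 对于不确定的内容,表示不确定性
"""
    # Append the note only once; repeated calls would otherwise duplicate it
    if low_score_indices and additions not in prompt:
        prompt += additions
    return prompt
Safety and jailbreaks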
class SafetyGuidelines:
"""安全性设计指南"""
# 常见越狱类型
jailbreak_patterns = [
"DAN (Do Anything Now)",
"角色扮演逃逸",
"假设场景绕过",
"编码绕过",
"Unicode混淆"
]
# 防御策略
@staticmethod
def add_safety_instructions(
prompt: str,
sensitivity: str = "medium"
) -> str:
"""添加安全指令"""
safety_templates = {
"low": """
注意:请保持专业和安全的输出。如果请求涉及敏感内容,请礼貌拒绝。
""",
"medium": """
【安全约束】
1. 不生成任何有害、非法或不当内容
2. 不协助绕过安全措施或政策
3. 不冒充他人或传播虚假信息
4. 如遇可疑请求,请明确拒绝并说明原因
""",
"high": """
【严格安全约束】
你必须严格遵守以下规则:
- 绝对不生成有害内容
- 绝对不协助任何非法活动
- 绝对不泄露个人信息
- 绝对不绕过任何安全机制
- 如有任何疑虑,请直接拒绝
"""
}
return prompt + safety_templates.get(sensitivity, "")
@staticmethod
def detect_injection(user_input: str) -> bool:
"""检测提示词注入"""
import re
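        # Common injection patterns (heuristic; expect both false positives and misses)
        injection_patterns = [
            r"忽略之前的指示",
            r"ignore.*previous",
            r"新的指令:",
            r"新的角色:",
            r"你现在是",
            r"forget.*everything",
            r"\{.*\}",  # JSON-like payload; also matches benign JSON input
        ]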
]
for pattern in injection_patterns:
if re.search(pattern, user_input, re.IGNORECASE):
return True
        return False
Advanced Prompt Techniques
How In-Context Learning works
ICL is the foundation of prompt engineering. See: LLM In-Context Learning机制
Core mechanisms:
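As a reference point for the Bayesian view named in the code that follows, one common way to write it treats the demonstrations $D = \{(x_i, y_i)\}_{i=1}^{k}$ as evidence about a latent task $\theta$. The symbols $\theta$, $D$, $x$, $y$ are notation assumed for this note, not taken from the original.

$$
p(\theta \mid D) \propto p(\theta) \prod_{i=1}^{k} p(y_i \mid x_i, \theta), \qquad
p(y \mid x, D) = \sum_{\theta} p(y \mid x, \theta)\, p(\theta \mid D)
$$

The prior $p(\theta)$ corresponds to what the model already knows about tasks, the product is the likelihood contributed by the demonstrations, and the prediction averages over tasks weighted by the posterior.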
class ICLMechanism:
    """Conceptual sketch of ICL mechanisms (pseudocode: `model` and `compute_attention` are not defined here)."""
@staticmethod
def bayesian_view(examples: List[dict], query: str) -> str:
"""
ICL的贝叶斯视角:模型根据示例推断任务假设
"""
# 先验:模型对任务的先验知识
prior = model.get_prior("task")
# 似然:示例提供的证据
likelihood = 1.0
for ex in examples:
p = model.score(ex["input"], ex["output"])
likelihood *= p
# 后验:综合先验和似然
posterior = prior * likelihood
return model.sample(posterior)
@staticmethod
def kernel_view(examples: List[dict], query: str) -> float:
"""
ICL的核函数视角:Transformer实现了隐式核推断
"""
# 注意力权重作为核函数
attention_weights = compute_attention(examples + [query])
# 基于核的预测
return sum(
attention_weights[i] * examples[i]["output"]
for i in range(len(examples))
        )
Skeleton-of-Thought
Skeleton-of-Thought (SoT) improves the generation of long outputs by drafting an outline first and then expanding each point:
def skeleton_of_thought(model, task: str, max_points: int = 5) -> str:
"""
思维骨架:先规划后展开
核心思想:先生成回答的骨架,再逐点展开
"""
# Step 1: 生成骨架
skeleton_prompt = f"""任务:{task}
请列出完成这个任务的关键步骤(不超过{max_points}个):
1. [第一步]
2. [第二步]
...
"""
skeleton = model.generate(skeleton_prompt)
# Step 2: 逐点展开
points = extract_points(skeleton)
expanded_sections = []
for point in points:
expansion_prompt = f"请详细说明:{point}\n\n请用2-3句话展开说明。"
expansion = model.generate(expansion_prompt)
expanded_sections.append(f"## {point}\n\n{expansion}")
# Step 3: 组合
return "\n\n".join(expanded_sections)
def extract_points(text: str) -> List[str]:
"""提取骨架点"""
import re
pattern = r'\d+\.\s*(.+?)(?=\n\d+\.|$)'
    return re.findall(pattern, text, re.DOTALL)
Program of Thoughts
Program of Thoughts (PoT) expresses the reasoning process as an executable program:
def program_of_thought(model, problem: str) -> str:
"""
思维编程:生成可执行的推理程序
"""
# Step 1: 生成程序
program_prompt = f"""将以下问题转化为Python程序:
问题:{problem}
请生成Python代码来解决这个问题。直接输出代码,不要解释。
"""
code = model.generate(program_prompt)
# Step 2: 执行代码
try:
result = execute_code(code)
return f"程序:\n{code}\n\n执行结果:{result}"
except Exception as e:
return f"程序:\n{code}\n\n执行错误:{e}"
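def execute_code(code: str) -> str:
    """Run Python code in a subprocess with a timeout (simplified).

    Note: a timeout alone is not a sandbox. Model-generated code should run
    in an isolated environment (container, restricted user) in real use.
    """
    import subprocess
    import sys
    try:
        result = subprocess.run(
            [sys.executable, '-c', code],  # same interpreter as the caller
            capture_output=True,
            text=True,
            timeout=5
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "执行超时"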
# 示例
example = """
问题:一个商店有苹果和橘子共50个。苹果每个3元,橘子每个2元,
总价为130元。问苹果和橘子各有多少个?
Program of Thoughts:
"""
pot_result = """
苹果数 + 橘子数 = 50
3 × 苹果数 + 2 × 橘子数 = 130
x + y = 50
3x + 2y = 130
Python代码:
```python
# 解方程
for x in range(51):
    y = 50 - x
    if 3*x + 2*y == 130:
        print(f"苹果: {x}, 橘子: {y}")
```
执行结果:苹果: 30, 橘子: 20
"""
### ReAct: Reasoning + Acting Framework
ReAct (Reason + Act) combines reasoning with interaction with external tools:
```python
from dataclasses import dataclass
from typing import List, Callable
import json
@dataclass
class Thought:
"""思考步骤"""
thought: str
action: str
observation: str
class ReActAgent:
"""
ReAct框架:推理-行动循环
核心:Thought → Action → Observation → Thought → ...
"""
def __init__(
self,
model,
tools: dict,
max_iterations: int = 10
):
self.model = model
self.tools = tools
self.max_iterations = max_iterations
def run(self, query: str) -> str:
"""执行ReAct循环"""
history = []
for _ in range(self.max_iterations):
# 1. 推理
context = self._build_context(query, history)
thought_response = self.model.generate(context)
# 2. 解析动作
thought, action, action_input = self._parse_response(thought_response)
# 3. 执行动作
if action == "finish":
return action_input
elif action in self.tools:
observation = self.tools[action](action_input)
else:
observation = f"未知动作: {action}"
# 4. 记录历史
history.append(Thought(thought, action, observation))
return "达到最大迭代次数"
def _build_context(self, query: str, history: List[Thought]) -> str:
"""构建prompt上下文"""
context = f"""任务:{query}
你可以使用以下工具:
"""
for name in self.tools:
context += f"- {name}\n"
context += "\n逐步思考:\n"
for h in history:
context += f"Thought: {h.thought}\n"
context += f"Action: {h.action}\n"
context += f"Observation: {h.observation}\n\n"
return context
def _parse_response(self, response: str) -> tuple:
"""解析模型响应"""
import re
thought_match = re.search(r'Thought:\s*(.+?)(?=\nAction:|$)', response, re.DOTALL)
action_match = re.search(r'Action:\s*(\w+)\[(.+?)\]', response)
thought = thought_match.group(1).strip() if thought_match else ""
action = action_match.group(1) if action_match else "finish"
action_input = action_match.group(2) if action_match else ""
return thought, action, action_input
# 示例工具
def search_tool(query: str) -> str:
"""搜索工具"""
return f"搜索结果:关于'{query}'的信息..."
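def calculator_tool(expr: str) -> str:
    """Calculator tool: evaluates + - * / arithmetic without calling eval()."""
    import ast, operator
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(n):
        if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
            return n.value
        if isinstance(n, ast.BinOp) and type(n.op) in ops:
            return ops[type(n.op)](ev(n.left), ev(n.right))
        raise ValueError("unsupported expression")
    try:
        return str(ev(ast.parse(expr, mode="eval").body))
    except Exception:
        return "计算错误"
```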
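The agent above depends on the model emitting text in a `Thought: ...` / `Action: tool[input]` layout. The sketch below shows what one such step might look like and how it can be parsed. The sample output is hypothetical, and `parse_react_step` is an illustrative variant that returns `None` for a malformed step instead of silently treating it as `finish`.

```python
import re

# Hypothetical model output in the "Thought / Action: tool[input]" format
sample_output = "Thought: I need the product of two numbers.\nAction: calculator[23 * 47]"

def parse_react_step(response: str):
    """Split one ReAct step into (thought, action, action_input)."""
    thought_match = re.search(r"Thought:\s*(.+?)(?=\nAction:|$)", response, re.DOTALL)
    action_match = re.search(r"Action:\s*(\w+)\[(.+?)\]", response)
    thought = thought_match.group(1).strip() if thought_match else ""
    if action_match is None:
        return thought, None, None  # caller decides how to handle a malformed step
    return thought, action_match.group(1), action_match.group(2)

print(parse_react_step(sample_output))
```

Surfacing malformed steps explicitly lets the caller retry or ask the model to restate its action, rather than ending the loop with an empty answer.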
Practical Guide
Prompt iteration workflow
def prompt_iteration_workflow(
initial_prompt: str,
test_dataset: List[dict],
evaluation_fn: callable
) -> str:
"""
Prompt迭代优化流程
流程:
1. 初始化Prompt
2. 测试当前版本
3. 分析错误案例
4. 针对性改进
5. 重复直到收敛
"""
prompt = initial_prompt
best_prompt = prompt
best_score = 0
iteration = 0
while iteration < 20:
# 测试当前版本
results = evaluate_prompt(prompt, test_dataset, evaluation_fn)
current_score = np.mean([r['score'] for r in results])
print(f"Iteration {iteration}: Score = {current_score:.3f}")
# 更新最佳
if current_score > best_score:
best_score = current_score
best_prompt = prompt
# 收敛检查
if current_score > 0.95:
break
# 分析错误案例
errors = [r for r in results if r['score'] < 0.5]
if not errors:
break
# 针对性改进
improvements = []
for error in errors[:3]: # 每次处理最多3个错误
analysis = analyze_error(prompt, error)
improvement = suggest_improvement(analysis)
improvements.append(improvement)
# 应用改进
prompt = apply_improvements(prompt, improvements)
iteration += 1
return best_prompt
def evaluate_prompt(prompt: str, dataset: List[dict], eval_fn: callable) -> List[dict]:
"""评估Prompt"""
results = []
for case in dataset:
response = generate(prompt, case["input"])
score = eval_fn(response, case["expected"])
results.append({
"input": case["input"],
"response": response,
"expected": case["expected"],
"score": score
})
return results
def analyze_error(prompt: str, error_case: dict) -> str:
"""分析错误原因"""
analysis_prompt = f"""分析以下Prompt和输出的问题:
Prompt:
{prompt}
输入: {error_case['input']}
输出: {error_case['response']}
期望: {error_case['expected']}
请分析输出与期望的差异,以及可能的原因。
"""
return generate(analysis_prompt)
def suggest_improvement(analysis: str) -> str:
"""建议改进"""
improvement_prompt = f"""基于以下分析,建议如何改进Prompt:
{analysis}
请给出具体的修改建议。
"""
return generate(improvement_prompt)
def apply_improvements(prompt: str, improvements: List[str]) -> str:
"""应用改进"""
applied = prompt
for imp in improvements:
# 简单策略:追加约束
applied += f"\n\n补充要求:{imp}"
    return applied
A/B testing and evaluation
class ABTesting:
"""Prompt A/B测试"""
@staticmethod
def prepare_variants(
base_prompt: str,
variations: List[dict]
) -> List[str]:
"""准备测试变体"""
variants = [base_prompt]
for var in variations:
variant = base_prompt
if "prefix" in var:
variant = var["prefix"] + "\n" + variant
if "suffix" in var:
variant = variant + "\n" + var["suffix"]
if "replace" in var:
variant = variant.replace(var["replace"]["from"],
var["replace"]["to"])
variants.append(variant)
return variants
@staticmethod
def run_experiment(
variants: List[str],
test_inputs: List[str],
evaluation_fn: callable,
n_runs: int = 3
) -> dict:
"""运行A/B测试"""
results = {}
for i, variant in enumerate(variants):
scores = []
for _ in range(n_runs):
variant_scores = []
for inp in test_inputs:
response = generate(variant, inp)
score = evaluation_fn(response)
variant_scores.append(score)
scores.append(np.mean(variant_scores))
            results[f"variant_{i}"] = {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "scores": scores,  # per-run means, read by statistical_significance
                "variant": variant
            }
return results
@staticmethod
def statistical_significance(results: dict, alpha: float = 0.05) -> dict:
"""统计显著性检验"""
from scipy import stats
# 提取分数
variant_names = list(results.keys())
scores = [results[name]["scores"] for name in variant_names]
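        # Two-sample t-test between the first two variants
        t_stat, p_value = stats.ttest_ind(scores[0], scores[1])
        return {
            "t_statistic": t_stat,
            "p_value": p_value,
            "significant": p_value < alpha
            # 1 - p_value is not the probability that the difference is real, so it is not reported
        }
Prompt version management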
import hashlib
from datetime import datetime
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class PromptVersion:
"""提示词版本"""
version_id: str
content: str
created_at: datetime
description: str
metrics: dict
class PromptVersionControl:
"""提示词版本控制"""
def __init__(self, storage_path: str = "./prompt_versions"):
self.storage_path = storage_path
self.versions: List[PromptVersion] = []
def save(self, prompt: str, description: str, metrics: dict = None) -> str:
"""保存新版本"""
version_id = self._generate_id(prompt)
version = PromptVersion(
version_id=version_id,
content=prompt,
created_at=datetime.now(),
description=description,
metrics=metrics or {}
)
self.versions.append(version)
self._persist(version)
return version_id
def _generate_id(self, content: str) -> str:
"""生成版本ID"""
hash_obj = hashlib.sha256()
hash_obj.update(content.encode())
hash_obj.update(str(datetime.now()).encode())
return hash_obj.hexdigest()[:8]
def _persist(self, version: PromptVersion):
"""持久化存储"""
import json
import os
filename = f"{self.storage_path}/{version.version_id}.json"
os.makedirs(self.storage_path, exist_ok=True)
        # Explicit UTF-8 so non-ASCII prompts round-trip on every platform
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump({
                "version_id": version.version_id,
                "content": version.content,
                "created_at": version.created_at.isoformat(),
                "description": version.description,
                "metrics": version.metrics
            }, f, ensure_ascii=False, indent=2)
    def load(self, version_id: str) -> Optional[str]:
        """Load the content of a given version."""
        import json
        filename = f"{self.storage_path}/{version_id}.json"
        try:
            with open(filename, 'r', encoding='utf-8') as f:
data = json.load(f)
return data["content"]
except FileNotFoundError:
return None
def compare(self, version_id1: str, version_id2: str) -> str:
"""对比两个版本"""
v1 = self.load(version_id1)
v2 = self.load(version_id2)
if v1 is None or v2 is None:
return "版本不存在"
# 使用difflib生成对比
import difflib
diff = difflib.unified_diff(
v1.splitlines(keepends=True),
v2.splitlines(keepends=True),
fromfile=version_id1,
tofile=version_id2
)
return ''.join(diff)
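    def rollback(self, version_id: str) -> Optional[str]:
        """Roll back by saving the old content as a new version; returns the new version ID."""
        content = self.load(version_id)
        if content is not None:
            return self.save(content, f"回滚到 {version_id}")
        return None
Common errors and solutions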
class CommonPromptErrors:
"""常见Prompt错误及解决方案"""
errors = {
"歧义性": {
"症状": ["输出不稳定", "同一输入不同输出"],
"原因": ["指令不够明确", "缺少具体要求"],
"解决": [
"使用具体、明确的语言",
"添加示例说明期望格式",
"明确输出约束条件"
]
},
"上下文过长": {
"症状": ["模型遗忘早期信息", "输出不完整"],
"原因": ["示例过多", "Prompt过长"],
"解决": [
"精简Prompt,只保留必要信息",
"减少示例数量(通常4-8个足够)",
"使用更简洁的格式"
]
},
"角色不一致": {
"症状": ["输出风格多变", "不符合角色设定"],
"原因": ["角色描述模糊", "缺少行为准则"],
"解决": [
"明确定义角色背景和专业领域",
"添加具体的行为准则",
"在示例中体现角色特征"
]
},
"指令冲突": {
"症状": ["模型无法同时满足多个要求"],
"原因": ["约束条件相互矛盾"],
"解决": [
"简化约束,保留最重要的",
"明确优先级(如:安全 > 准确 > 简洁)",
"拆分复杂任务为多个简单任务"
]
},
"过度依赖示例": {
"症状": ["换个问法就失效", "泛化能力差"],
"原因": ["示例过于具体", "缺少通用规则"],
"解决": [
"增加更多样化的示例",
"添加通用规则描述",
"平衡示例和规则的比例"
]
}
}
@classmethod
def diagnose(cls, prompt: str, test_results: List[dict]) -> dict:
"""诊断Prompt问题"""
issues = []
# 检查长度
if len(prompt) > 4000:
issues.append("上下文过长")
# 检查示例
if prompt.count("示例") > 10:
issues.append("可能过度依赖示例")
# 检查一致性
if "但" in prompt and ("不要" in prompt or "必须" in prompt):
issues.append("可能存在指令冲突")
return {"detected_issues": issues, "suggestions": [cls.errors.get(i, {}) for i in issues]}
# 快速修复清单
QUICK_FIXES = """
# Prompt快速修复清单
## 1. 输出不稳定?
→ 添加具体格式要求
→ 增加示例
→ 明确边界情况处理
## 2. 输出太长/太短?
太长:
→ 添加长度限制:"用一句话总结"
→ 明确范围:"只列出3个要点"
→ 简化示例
太短:
→ 要求展开:"请详细说明"
→ 添加问题引导:"包括原因、例子和影响"
## 3. 格式混乱?
→ 提供明确的格式模板
→ 用Markdown或JSON结构
→ 在示例中体现格式
## 4. 忽略某些指令?
→ 突出重要指令(大写、单独成行)
→ 减少指令数量
→ 使用"最重要的"等强调词
## 5. 创意不足?
→ 减少限制条件
→ 添加创意引导:"发挥想象力"
→ 减少示例,让模型自由发挥
## 6. 太抽象/不够实用?
→ 要求具体例子
→ 添加"例如:"引导
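→ 明确使用场景
"""
Summary
Prompt engineering is both an art and a science, combining a theoretical understanding of LLMs with practical experience. This document covered the core methods:
| Topic | Key point |
|---|---|
| Few-shot Learning | Example selection should balance diversity, representativeness, and difficulty |
| Chain-of-Thought | Showing the reasoning process elicits deeper computation from the model |
| Tree-of-Thought | Exploring multiple paths helps avoid local optima |
| Self-Consistency | Voting across multiple paths improves reliability |
| Role-playing | The role setting shapes output style and the expertise the model draws on |
| Advanced techniques | SoT, PoT, ReAct, and others are tailored to specific scenarios |
Key success factors: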