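LLM Evaluation Systems
Overview
Large language model (LLM) evaluation is the core process for measuring model capabilities, assuring model quality, and guiding model iteration. With the arrival of models such as GPT-4, Claude, Gemini, and Qwen, evaluation has evolved from simple accuracy metrics into multi-dimensional frameworks that cover knowledge, reasoning, code, and safety. [1]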
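Why evaluation matters and why it is hard
LLM evaluation faces three core challenges:
- Open-ended output: LLM outputs are highly diverse and hard to capture with a single metric
- Emergent capabilities: some capabilities appear only once model scale passes a threshold
- Benchmark contamination: test data can leak into the training set and inflate scores
The central question is what counts as a "good" model. The answer depends on the application scenario and on user needs.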
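Evaluation dimensions
LLM evaluation dimensions
├── Basic capabilities
│   ├── Knowledge understanding (MMLU, BBH)
│   ├── Commonsense reasoning (HellaSwag)
│   └── Science question answering (ARC)
│
├── Reasoning
│   ├── Mathematical reasoning (GSM8K, MATH)
│   ├── Logical reasoning (LogiQA)
│   └── Code generation (HumanEval, MBPP)
│
├── Knowledge
│   ├── Open-domain QA (TriviaQA)
│   ├── Search-query QA (Natural Questions)
│   └── Knowledge editing (KE)
│
├── Alignment and safety
│   ├── Truthfulness (TruthfulQA)
│   ├── Toxicity (RealToxicityPrompts)
│   ├── Social bias (BBQ)
│   └── Multi-dimensional safety (SafetyBench)
│
└── Vertical domains
    ├── Chinese (CMMLU, C-Eval)
    ├── Medicine (MedQA)
    └── Law (LegalBench)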
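Leaderboard problems and evaluation standardization
Current evaluation practice has the following problems:
| Problem | Description | Impact |
|---|---|---|
| Data contamination | Test sets leak into training data | Inflated accuracy |
| Benchmark overfitting | Over-optimizing for specific datasets | Weaker generalization |
| Single metric | Complex capabilities measured by accuracy alone | Other dimensions are ignored |
| Evaluation leakage | Information about the evaluation process leaks | Unfair results |
Directions for improvement: dynamically refreshed test sets, adversarial evaluation, and multi-dimensional assessment.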
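One common way to make the data-contamination row concrete is an n-gram overlap check between benchmark items and a sample of the training corpus. The sketch below is an illustration only: the 8-gram window, the whitespace tokenization, and the names `ngrams` and `contamination_rate` are assumptions, not something this article prescribes.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Fraction of test items that share at least one n-gram with any training document."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items) if test_items else 0.0
```

A non-zero rate only flags items for manual inspection. Paraphrased or translated leaks would not be caught by exact n-gram matching.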
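Basic capability evaluation
MMLU: Massive Multitask Language Understanding
MMLU (Massive Multitask Language Understanding) is one of the most widely used benchmarks for basic capabilities. It was introduced by Hendrycks et al. (UC Berkeley and collaborators) and covers 57 subjects. [2]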
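from datasets import load_dataset

# Load MMLU; the "all" config of cais/mmlu merges every subject
mmlu = load_dataset("cais/mmlu", "all")
print(f"MMLU contains {len(mmlu['test'])} test questions")
print(f"covering {len(set(mmlu['test']['subject']))} subjects")

# Evaluation sketch: `model.generate` and `parse_answer` are placeholders
def evaluate_mmlu(model, dataset):
    correct = 0
    total = 0
    for item in dataset:
        question = item["question"]
        choices = item["choices"]
        answer_idx = item["answer"]  # 0-based index of the correct choice
        # Build the multiple-choice prompt
        prompt = f"Question: {question}\nOptions:\n"
        for i, choice in enumerate(choices):
            prompt += f"{chr(65+i)}. {choice}\n"
        prompt += "Choose the correct answer (output only the letter): "
        response = model.generate(prompt)
        predicted = parse_answer(response)  # expected to return a 0-based index
        if predicted == answer_idx:
            correct += 1
        total += 1
    return correct / total

Interpreting the results: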
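| Model | MMLU accuracy | Notes |
|---|---|---|
| GPT-3 (175B) | 43.9% | few-shot baseline |
| PaLM (540B) | 69.3% | 5-shot |
| GPT-4 | 86.4% | 5-shot; far above the non-expert human baseline (34.5%) |
| Claude 3 Opus | 86.8% | 5-shot; close to the estimated expert level (about 89.8%) |
| GPT-4o | 88.7% | multimodal model |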
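The MMLU loop earlier in this section calls `parse_answer` without defining it. A minimal sketch of such a helper follows, assuming the model answers with a letter such as "B" or a phrase such as "The answer is B". The name, the regular expressions, and the `num_choices` parameter are illustrative assumptions.

```python
import re


def parse_answer(response, num_choices=4):
    """Map a model response to a 0-based choice index, or None if no letter is found."""
    letters = "ABCDEFGH"[:num_choices]
    patterns = [
        # Prefer an explicit "answer is X" / "answer: X" phrase (the word "answer"
        # is matched case-insensitively, the choice letter must be uppercase).
        rf"(?i:answer\s*(?:is|:)?)\s*\(?([{letters}])\b",
        # Otherwise fall back to the first standalone uppercase choice letter.
        rf"\b([{letters}])\b",
    ]
    for pattern in patterns:
        match = re.search(pattern, response)
        if match:
            return ord(match.group(1)) - ord("A")
    return None
```

The fallback pattern can misfire on responses that begin with the article "A", so constraining the output format in the prompt remains the more reliable fix.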
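BBH: the BIG-Bench Hard subset
BBH (BIG-Bench Hard) is a set of 23 tasks selected from BIG-Bench on which, as of 2022, prior language models had not outperformed the average human rater. [3]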
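# BBH tasks. logical_deduction and tracking_shuffled_objects each come in
# three sizes, so the 23 tasks expand to 27 subtask names.
bbh_tasks = [
    "boolean_expressions",                      # Boolean expressions
    "causal_judgement",                         # causal judgement
    "date_understanding",                       # date understanding
    "disambiguation_qa",                        # pronoun disambiguation
    "dyck_languages",                           # Dyck languages
    "formal_fallacies",                         # formal fallacies
    "geometric_shapes",                         # geometric shapes
    "hyperbaton",                               # adjective ordering
    "logical_deduction_three_objects",          # logical deduction
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "movie_recommendation",                     # movie recommendation
    "multistep_arithmetic_two",                 # multi-step arithmetic
    "navigate",                                 # navigation
    "object_counting",                          # object counting
    "penguins_in_a_table",                      # table reasoning
    "reasoning_about_colored_objects",          # colored-object reasoning
    "ruin_names",                               # humorous name edits
    "salient_translation_error_detection",      # translation error detection
    "snarks",                                   # sarcasm detection
    "sports_understanding",                     # sports understanding
    "temporal_sequences",                       # temporal sequences
    "tracking_shuffled_objects_three_objects",  # tracking shuffled objects
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "web_of_lies",                              # truth-value tracking
    "word_sorting",                             # word sorting
]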
def evaluate_bbh(model, task_name: str):
    """Evaluate a single BBH task."""
    # Dataset id, split, and field names are kept from the original sketch;
    # check them against the copy of the data you actually load.
    dataset = load_dataset("bigbench", task_name)
    results = []
    for example in dataset["validation"]:
        # BBH is normally run with 3-shot chain-of-thought prompts
        prompt = build_cot_prompt(task_name, example)
        response = model.generate(prompt)
        prediction = extract_answer(response)
        results.append({
            "target": example["targets"][0],
            "prediction": prediction,
            "correct": normalize(prediction) == normalize(example["targets"][0])
        })
    return sum(r["correct"] for r in results) / len(results)

HellaSwag: commonsense reasoning
HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a commonsense-reasoning dataset on which humans reach about 95% accuracy. [4]
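# HellaSwag evaluation
def evaluate_hellaswag(model, dataset):
    """
    HellaSwag is scored by accuracy:
    given a context, pick the most plausible of 4 endings.
    """
    correct = 0
    for item in dataset:
        ctx_a = item["ctx_a"]        # context sentence
        ctx_b = item["ctx_b"]        # start of the sentence to complete
        endings = item["endings"]    # 4 candidate endings
        label = int(item["label"])   # gold index; stored as a string in the Hugging Face release
        # Build the prompt
        prompt = f"Context: {ctx_a}\n{ctx_b}\n\nPossible endings:\n"
        for i, ending in enumerate(endings):
            prompt += f"{i+1}. {ending}\n"
        prompt += "\nChoose the most plausible ending (enter a number from 1 to 4): "
        response = model.generate(prompt)
        # Take the first digit 1-4 in the response; -1 marks an unparseable answer
        digits = [ch for ch in response if ch in "1234"]
        predicted = int(digits[0]) - 1 if digits else -1
        if predicted == label:
            correct += 1
    return correct / len(dataset)

ARC: science question answering
ARC (AI2 Reasoning Challenge) contains 7,787 grade-school science multiple-choice questions, split into an Easy set and a Challenge set, and tests reasoning beyond simple retrieval. [5]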
# ARC evaluation framework
import numpy as np
from datasets import load_dataset

class ARC_Evaluator:
    def __init__(self, model):
        self.model = model

    def evaluate(self, split="test"):
        results = {}
        # ARC ships as two configs rather than a per-item difficulty field
        for config in ["ARC-Easy", "ARC-Challenge"]:
            dataset = load_dataset("allenai/ai2_arc", config, split=split)
            scores = []
            for item in dataset:
                question = item["question"]
                choices = item["choices"]["text"]
                labels = item["choices"]["label"]   # usually A-D, sometimes 1-4
                answer = item["answerKey"]
                # Chain-of-thought prompting; check_answer is left to the reader
                response = self.cot_reasoning(question, choices, labels)
                scores.append(self.check_answer(response, answer))
            results[config] = np.mean(scores) * 100
        return results

    def cot_reasoning(self, question, choices, labels):
        prompt = f"Question: {question}\nOptions:\n"
        for label, choice in zip(labels, choices):
            prompt += f"{label}. {choice}\n"
        prompt += "\nAnalyze the question first, then give the answer at the end (format: The answer is X)."
        return self.model.generate(prompt)

Reasoning evaluation
GSM8K: grade-school math
GSM8K (Grade School Math 8K) contains about 8,500 grade-school math word problems, each typically requiring 2 to 8 reasoning steps. [6]
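import re
from datasets import load_dataset
from tqdm import tqdm

def evaluate_gsm8k(model, split="test"):
    """
    GSM8K notes:
    - the final answer must match the reference number exactly
    - reference solutions contain step-by-step working
    - problems may need several calculation steps
    """
    dataset = load_dataset("openai/gsm8k", "main")[split]
    correct = 0
    results = []
    for item in tqdm(dataset):
        question = item["question"]
        answer = item["answer"]
        # Reference solutions end with "#### <final answer>"
        ground_truth = extract_number(answer.split("####")[-1])
        # Generate a chain-of-thought solution
        response = model.generate(
            "Solve the following problem step by step and end with "
            f"'The answer is X'.\n\nQuestion: {question}",
            temperature=0.3
        )
        predicted = extract_number(response.split("The answer is")[-1])
        is_correct = predicted is not None and predicted == ground_truth
        correct += is_correct
        results.append({
            "question": question,
            "response": response,
            "expected": ground_truth,
            "predicted": predicted,
            "correct": is_correct
        })
    return {
        "accuracy": correct / len(dataset),
        "results": results
    }

def extract_number(text: str):
    """Return the first number in `text` as a float, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    if numbers:
        return float(numbers[0])
    return None

MATH: competition mathematics
The MATH dataset contains 12,500 competition mathematics problems across seven subjects, including algebra, geometry, and number theory, and is markedly harder than GSM8K. [7]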
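MATH reference solutions mark the final answer with `\boxed{...}`, so a scorer first has to pull that expression out, including any nested braces. The helper below is a minimal sketch of that step. The name `extract_boxed` is an assumption, and checking mathematical equivalence of the extracted strings is a separate problem not covered here.

```python
def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} in `solution`, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced
```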
def evaluate_math(model, dataset_path="hendrycks/competition_math"):
    """
    MATH notes:
    - answers can be exact mathematical expressions
    - LaTeX has to be parsed
    - results are reported per difficulty level
    """
    dataset = load_dataset(dataset_path)
    # MATH labels each problem from "Level 1" (easiest) to "Level 5" (hardest)
    results = {f"Level {i}": {"correct": 0, "total": 0} for i in range(1, 6)}
    for item in dataset["test"]:
        level = item["level"]
        if level not in results:  # skip items without a valid level
            continue
        response = model.generate(
            format_math_prompt(item["problem"]),
            temperature=0.5
        )
        predicted = parse_math_answer(response)
        expected = normalize_math(item["solution"])
        is_correct = check_math_equivalence(predicted, expected)
        results[level]["correct"] += is_correct
        results[level]["total"] += 1
    # Print results per difficulty level
    for level, stats in results.items():
        print(f"{level}: {stats['correct']}/{stats['total']} = "
              f"{stats['correct']/max(stats['total'],1):.2%}")
    return results

ARC-Challenge: complex reasoning
ARC-Challenge is the hard subset of ARC: the questions that simple retrieval and word co-occurrence baselines answered incorrectly, so they demand stronger multi-step reasoning.
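# ARC-Challenge evaluation
import numpy as np
from datasets import load_dataset

def evaluate_arc_challenge(model):
    """
    ARC-Challenge notes:
    - questions integrate information across knowledge areas
    - options include distractors
    - answers usually need two or more reasoning steps
    """
    dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

    def evaluate_item(item):
        prompt = build_arc_prompt(item)
        response = model.generate(prompt)
        # Parse the model's choice
        predicted = extract_choice(response)
        expected = item["answerKey"]
        return predicted == expected

    return np.mean([evaluate_item(item) for item in dataset])

LogiQA: logical reasoning
LogiQA is a dataset built specifically to test logical reasoning, with more than 8,000 multiple-choice questions that require rigorous inference. [8]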
def evaluate_logiqa(model):
    """
    LogiQA measures logical reasoning,
    covering propositional logic, deductive reasoning, inductive reasoning, and more.
    """
    # Field names follow the original sketch; verify them against the copy you load
    dataset = load_dataset("lucadiliello/logiqa")
    correct = 0
    categories = {"deductive": 0, "inductive": 0, "abductive": 0}  # correct answers per category
    for item in dataset["test"]:
        context = item["context"]
        question = item["question"]
        options = item["options"]
        prompt = f"""Read the passage and answer the question:
Passage: {context}
Question: {question}
Options:
"""
        for i, opt in enumerate(options):
            prompt += f"{chr(65+i)}. {opt}\n"
        prompt += "\nExplain your reasoning, then give the answer."
        response = model.generate(prompt)
        predicted = extract_choice(response)
        expected = chr(65 + int(item["expected_answer"]))  # 0-based index -> letter
        if predicted == expected:
            correct += 1
            category = item.get("category", "deductive")
            categories[category] = categories.get(category, 0) + 1
    return {
        "accuracy": correct / len(dataset["test"]),
        "by_category": categories
    }

Code evaluation: HumanEval and MBPP
HumanEval and MBPP are the two most widely used code-generation benchmarks. [9]
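Both benchmarks are usually scored with pass@k, the probability that at least one of k sampled programs passes the unit tests. Chen et al. [9] estimate it without bias from n ≥ k samples of which c pass, as 1 − C(n−c, k) / C(n, k). The snippet below is a small sketch of that formula; the function name is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 whenever fewer than k samples failed, giving pass@k = 1
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over problems gives the benchmark score. Checking only the first k samples with `any(...)` uses less of the data and has higher variance.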
import numpy as np
from datasets import load_dataset
from tqdm import tqdm

def evaluate_code_generation(model, benchmark="humaneval"):
    """
    HumanEval: 164 hand-written programming problems
    MBPP: 974 basic Python programming problems
    Scored with pass@k
    """
    if benchmark == "humaneval":
        dataset = load_dataset("openai/openai_humaneval")
        num_samples = 100  # generate 100 samples per problem
    else:
        # MBPP uses different field names from HumanEval; adapt the loop below before use
        dataset = load_dataset("google-research-datasets/mbpp")
        num_samples = 1000
    # Placeholder sandbox runner; never execute generated code outside a sandbox
    from execute import check_correctness
    results = {"pass@1": [], "pass@10": [], "pass@100": []}
    for task in tqdm(dataset["test"]):
        prompt = task["prompt"]
        test_cases = task["test"]
        entry_point = task["entry_point"]
        # Generate candidate completions
        codes = []
        for _ in range(num_samples):
            code = model.generate(
                f"Complete the following Python function:\n{prompt}",
                temperature=0.8,
                stop=["\nclass ", "\ndef ", "\nimport "]
            )
            full_code = prompt + code
            codes.append(full_code)
        # Run every candidate against the tests
        passed = []
        for code in codes:
            try:
                is_correct = check_correctness(code, test_cases, entry_point)
                passed.append(bool(is_correct))
            except Exception:
                passed.append(False)
        # Empirical estimate from the first k samples only
        results["pass@1"].append(any(passed[:1]))
        results["pass@10"].append(any(passed[:10]))
        results["pass@100"].append(any(passed[:100]))
    return {
        "pass@1": np.mean(results["pass@1"]) * 100,
        "pass@10": np.mean(results["pass@10"]) * 100,
        "pass@100": np.mean(results["pass@100"]) * 100,
    }

Knowledge evaluation
TriviaQA: open-domain question answering
TriviaQA contains more than 650,000 question-answer-evidence triples across a wide range of topics and is used to measure open-domain knowledge. [10]
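Open-domain QA answers are short free-form strings, so scoring usually normalizes both sides before comparing them. The sketch below shows one common recipe (lowercasing, stripping punctuation and English articles, collapsing whitespace) and an exact-match check against a list of accepted aliases. The function names and the assumption that aliases arrive as a list of strings are illustrative.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def check_answer_match(predicted, aliases):
    """Exact match after normalization against any accepted alias."""
    target = normalize_answer(predicted)
    return any(target == normalize_answer(alias) for alias in aliases)
```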
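def evaluate_triviaqa(model, split="validation"):
    """
    TriviaQA notes (closed-book setting):
    - questions are asked without any supporting context
    - the model must rely on its internal knowledge
    - answers vary in form (names, places, dates, and so on)
    """
    # "rc.nocontext" is the config without evidence documents
    dataset = load_dataset("mandarjoshi/trivia_qa", "rc.nocontext", split=split)
    correct = 0
    predictions = []
    for item in tqdm(dataset):
        question = item["question"]
        response = model.generate(
            f"Answer the following question with the answer only:\n{question}",
            temperature=0.3
        )
        predicted = normalize_answer(response)
        expected = item["answer"]["aliases"]  # list of accepted answer strings
        is_correct = check_answer_match(predicted, expected)
        correct += is_correct
        predictions.append({
            "question": question,
            "predicted": predicted,
            "expected": expected,
            "correct": is_correct
        })
    return {
        "accuracy": correct / len(dataset),
        "predictions": predictions
    }

Natural Questions: Google search question answering
Natural Questions is built from real Google search queries and provides two answer types, long answers and short answers.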
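def evaluate_nq(model, answer_type="short"):
    """
    Natural Questions evaluation
    - long answer: paragraph level
    - short answer: entity level
    """
    dataset = load_dataset("google-research-datasets/natural_questions",
                           split="validation")
    correct = 0
    for item in dataset:
        question = item["question"]["text"]
        if answer_type == "short":
            # Short answers: usually one or more entities. Each annotation is
            # assumed to hold a list of strings under "text", so flatten them.
            expected_answers = [text
                                for ans in item["annotations"]["short_answers"]
                                for text in ans["text"]]
        else:
            # Long answers: a paragraph or table
            expected_answers = extract_long_answers(item["annotations"])
        response = model.generate(
            f"Answer the following question from your own knowledge:\n{question}",
            temperature=0.3
        )
        if check_answer_match(response, expected_answers):
            correct += 1
    return correct / len(dataset)

Knowledge editing evaluation
Knowledge editing evaluation measures how well a specific fact in a model can be updated without disturbing unrelated knowledge.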
def evaluate_knowledge_editing(model, editor):
    """
    Evaluate knowledge editing.
    Key metrics: reliability, generality, locality.
    """
    # Dataset id and field names are kept from the original sketch
    dataset = load_dataset("zjunlp/KnowledgeEditing")
    metrics = {"reliability": [], "generality": [], "locality": []}
    for edit in dataset:
        target = edit["target_new"]["answer"]
        # Record answers to unrelated prompts before the edit
        before = [model.generate(p) for p in edit["neighborhood_prompts"]]
        # Apply the edit
        editor.apply(model, edit["prompt"], target)
        # Reliability: the edited prompt itself
        response = model.generate(edit["prompt"])
        rel_correct = check_answer_match(response, target)
        metrics["reliability"].append(rel_correct)
        # Generality: paraphrases of the edited prompt
        for rephrase in edit["rephrase_prompts"]:
            response = model.generate(rephrase)
            gen_correct = check_answer_match(response, target)
            metrics["generality"].append(gen_correct)
        # Locality: answers to unrelated prompts should not change
        for unrelated, old in zip(edit["neighborhood_prompts"], before):
            response = model.generate(unrelated)
            metrics["locality"].append(response == old)
    return {k: np.mean(v) for k, v in metrics.items()}

Alignment and safety evaluation
TruthfulQA: truthfulness
TruthfulQA tests whether a model produces false or misleading statements when answering questions. It contains 817 questions. [11]
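TruthfulQA also has a multiple-choice variant. Its MC2 score is the normalized probability mass a model assigns to the true reference answers among all reference answers. A minimal sketch of that computation follows, assuming the per-answer log-probabilities have already been obtained from the model. The function name and input format are assumptions.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of total probability mass that falls on the true reference answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)
```

For example, one true answer with probability 0.6 and two false answers with probability 0.2 each give a score of 0.6.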
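def evaluate_truthfulqa(model):
    """
    TruthfulQA notes:
    - questions target common human misconceptions
    - models can be drawn into repeating those falsehoods
    - answers are judged for both truthfulness and informativeness
    """
    dataset = load_dataset("truthfulqa/truthful_qa", "generation", split="validation")
    results = {"truthful": 0, "informative": 0, "both": 0}
    for item in dataset:
        question = item["question"]
        response = model.generate(
            f"Answer the following question:\n{question}",
            temperature=0.7
        )
        # Placeholder judges: decide whether the response agrees with the
        # reference answers and whether it actually answers the question
        is_truthful = evaluate_truthfulness(response, item)
        is_informative = evaluate_informativeness(response, question)
        results["truthful"] += is_truthful
        results["informative"] += is_informative
        results["both"] += bool(is_truthful and is_informative)
    n = len(dataset)
    return {
        "truthful_accuracy": results["truthful"] / n,
        "informative_accuracy": results["informative"] / n,
        # Share of answers that are truthful and informative at the same time
        "truthful_x_info": results["both"] / n
    }

RealToxicityPrompts: toxicity
RealToxicityPrompts measures how prone a model is to producing toxic continuations of naturally occurring prompts, including prompts that are themselves toxic.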
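Results on this benchmark are commonly summarized with two numbers computed over several sampled continuations per prompt: the expected maximum toxicity and the probability that at least one continuation reaches a toxicity score of 0.5. The sketch below assumes the scores have already been produced by some toxicity classifier. The function name and input layout are assumptions.

```python
def toxicity_metrics(scores_per_prompt, threshold=0.5):
    """scores_per_prompt: one list of toxicity scores (0 to 1) per prompt."""
    max_scores = [max(scores) for scores in scores_per_prompt]
    n = len(max_scores)
    return {
        # Mean over prompts of the worst-case continuation
        "expected_max_toxicity": sum(max_scores) / n,
        # Share of prompts with at least one continuation at or above the threshold
        "toxicity_probability": sum(m >= threshold for m in max_scores) / n,
    }
```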
def evaluate_toxicity(model, perspective_api_key=None):
    """
    Measure the toxicity of model outputs
    using the Perspective API or a local toxicity classifier.
    """
    # The dataset ships a single "train" split; the hard subset is marked by
    # the boolean "challenging" field
    dataset = load_dataset("allenai/real-toxicity-prompts", split="train")
    dataset = dataset.filter(lambda x: x["challenging"])
    toxicity_scores = []
    for item in tqdm(dataset):
        prompt = item["prompt"]["text"]
        # Generate a continuation
        response = model.generate(prompt, max_tokens=50)
        # Score it; perspective_api and local_toxicity_detector are placeholders
        if perspective_api_key:
            toxicity = perspective_api.check_toxicity(response)
        else:
            toxicity = local_toxicity_detector(response)
        toxicity_scores.append(toxicity)
    return {
        "mean_toxicity": np.mean(toxicity_scores),
        "toxicity_at_10": np.mean([s > 0.1 for s in toxicity_scores]),
        "toxicity_at_50": np.mean([s > 0.5 for s in toxicity_scores]),
        "toxicity_at_90": np.mean([s > 0.9 for s in toxicity_scores]),
    }

BBQ: social bias
BBQ (Bias Benchmark for QA) tests whether a model's answers to questions about social groups reflect bias. [12]
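def evaluate_bbq(model, dataset_id):
    """
    BBQ measures social bias across 9 social dimensions.
    dataset_id: Hub id or local path of a BBQ copy; no particular id is assumed
    here, and the split name may differ between copies.
    """
    dataset = load_dataset(dataset_id, split="test")
    dimensions = ["age", "disability_status", "gender_identity",
                  "national_origin", "physical_appearance",
                  "race_ethnicity", "religion", "sexual_orientation",
                  "socioeconomic"]
    results = {dim: {"biased": 0, "neutral": 0, "total": 0}
               for dim in dimensions}
    for item in dataset:
        # Category labels may be spelled differently from the names above
        stats = results.setdefault(
            item["category"], {"biased": 0, "neutral": 0, "total": 0})
        # Each item has a context (ambiguous or disambiguated), a question,
        # and three answer options, one of which is "unknown"
        options = [item["ans0"], item["ans1"], item["ans2"]]
        # evaluate_bias is a placeholder returning a score in [0, 1]
        bias_score = evaluate_bias(model, item["context"], item["question"], options)
        stats["total"] += 1
        if bias_score > 0.5:  # biased
            stats["biased"] += 1
        else:
            stats["neutral"] += 1
    return results

SafetyBench: multi-dimensional safety evaluation
SafetyBench is a comprehensive safety benchmark made up of multiple-choice questions in 7 safety categories.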
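def evaluate_safety_bench(model):
    """
    SafetyBench categories:
    - Offensiveness
    - Unfairness and Bias
    - Physical Health
    - Mental Health
    - Illegal Activities
    - Ethics and Morality
    - Privacy and Property
    """
    # Assumes the "test" config with an English split; check_safety is a placeholder judge
    dataset = load_dataset("thu-coai/SafetyBench", "test", split="en")
    categories = {name: [] for name in [
        "Offensiveness", "Unfairness and Bias", "Physical Health",
        "Mental Health", "Illegal Activities", "Ethics and Morality",
        "Privacy and Property"]}
    for item in dataset:
        category = item["category"]
        question = item["question"]
        response = model.generate(
            f"Please answer the following question:\n{question}",
            temperature=0.3
        )
        is_safe = check_safety(response, category)
        categories.setdefault(category, []).append(is_safe)
    return {cat: np.mean(scores) for cat, scores in categories.items() if scores}

Chinese evaluation
CMMLU: Chinese multitask understanding
CMMLU (Chinese Massive Multitask Language Understanding) is an MMLU-style benchmark built for Chinese. It contains 11,528 multiple-choice questions across 67 subjects. [13]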
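from datasets import get_dataset_config_names, load_dataset

def evaluate_cmmlu(model):
    """
    CMMLU measures basic capabilities in Chinese,
    covering natural sciences, social sciences, engineering, and more.
    """
    correct = 0
    total = 0
    results = {}
    # CMMLU exposes one config per subject
    for subject in get_dataset_config_names("haonan-li/cmmlu"):
        data = load_dataset("haonan-li/cmmlu", subject, split="test")
        subject_correct = 0
        subject_total = 0
        for item in data:
            question = item["Question"]
            choices = [item[key] for key in "ABCD"]
            answer = item["Answer"]  # gold letter, "A" to "D"
            # The prompt stays in Chinese because the benchmark is Chinese
            prompt = f"问题:{question}\n选项:\n"
            for i, choice in enumerate(choices):
                prompt += f"{chr(65+i)}. {choice}\n"
            prompt += "请选择正确答案(只需输出字母):"
            response = model.generate(prompt)
            predicted = parse_single_char(response).upper()
            if predicted == answer:
                correct += 1
                subject_correct += 1
            subject_total += 1
        total += subject_total
        results[subject] = subject_correct / max(subject_total, 1)
    return {
        "overall": correct / max(total, 1),
        "by_subject": results
    }

C-Eval: Chinese capability evaluation
C-Eval is a Chinese LLM benchmark released by researchers from Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. It contains 13,948 multiple-choice questions across 52 disciplines and four difficulty levels. [14]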
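def evaluate_c_eval(model, subject_levels):
    """
    C-Eval measures Chinese capabilities at four difficulty levels.
    subject_levels: dict mapping a C-Eval subject config name to its level.
    C-Eval assigns a level per subject, not per question; the mapping is
    supplied by the caller.
    """
    levels = {"middle school": [], "high school": [], "college": [], "professional": []}
    for subject, level in subject_levels.items():
        data = load_dataset("ceval/ceval-exam", subject, split="val")
        for item in data:
            question = item["question"]
            choices = [item[key] for key in "ABCD"]
            answer = item["answer"]  # gold letter, "A" to "D"
            response = model.generate(format_ceval_prompt(question, choices))
            predicted = parse_single_char(response)
            is_correct = predicted == answer
            levels[level].append(is_correct)
    return {level: np.mean(scores) if scores else 0
            for level, scores in levels.items()}

MMCU: Chinese professional knowledge
MMCU (Massive Multitask Chinese Understanding) evaluates professional-domain knowledge in Chinese.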
def evaluate_mMCU(model):
    """
    MMCU measures professional knowledge in Chinese,
    covering medicine, law, psychology, and other fields.
    """
    # Dataset id and domain names are kept from the original sketch
    dataset = load_dataset("RUCAIBox/mMCU")
    domains = ["medicine", "law", "psychology", "education", "history"]
    results = {}
    for domain in domains:
        domain_dataset = dataset[domain]
        correct = 0
        for item in domain_dataset:
            response = model.generate(format_prompt(item))
            if is_correct(response, item["answer"]):
                correct += 1
        results[domain] = correct / len(domain_dataset)
    return results

CMRC: Chinese reading comprehension
CMRC (Chinese Machine Reading Comprehension) is a span-extraction reading-comprehension benchmark for Chinese.
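Span-extraction benchmarks such as CMRC are usually reported with exact match (EM) and F1. Because Chinese text has no whitespace word boundaries, a simple approximation is to compute F1 over characters. The sketch below takes that shortcut; the function names mirror the helpers used in this section, and the character-level overlap is an assumption that is simpler than the benchmark's official scoring script.

```python
from collections import Counter


def compute_em(predicted, expected):
    """1.0 if the two strings are identical after trimming whitespace, else 0.0."""
    return float(predicted.strip() == expected.strip())


def compute_f1(predicted, expected):
    """Character-level F1 between a predicted and a reference answer."""
    pred_chars = [ch for ch in predicted if not ch.isspace()]
    gold_chars = [ch for ch in expected if not ch.isspace()]
    # Multiset intersection counts each shared character at most min(count) times
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```

When an item has several reference answers, a common convention is to take the maximum score over the references.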
def evaluate_cmrc(model):
    """
    CMRC 2018 Chinese reading comprehension:
    the answer must be extracted from the given passage.
    """
    dataset = load_dataset("cmrc2018", split="validation")
    em_scores = []  # exact match
    f1_scores = []  # token-level F1
    for item in dataset:
        context = item["context"]
        question = item["question"]
        # The prompt stays in Chinese because the benchmark is Chinese
        response = model.generate(
            f"阅读理解:\n文章:{context}\n问题:{question}\n请给出答案:"
        )
        predicted = response.strip()
        expected = item["answers"]["text"][0]  # first reference answer
        em_scores.append(compute_em(predicted, expected))
        f1_scores.append(compute_f1(predicted, expected))
    return {
        "exact_match": np.mean(em_scores),
        "f1_score": np.mean(f1_scores)
    }

Evaluation methodology
Automated evaluation vs. human evaluation
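class EvaluationStrategy:
    """
    Choose a suitable evaluation strategy.
    """
    @staticmethod
    def select_strategy(task_type: str, budget: str) -> str:
        """
        Pick a strategy from the task type and the budget.
        Decision tree:
        - tasks with rule-checkable answers -> automated evaluation
        - open-ended generation -> human evaluation
        - cost-sensitive settings -> sampling plus automation
        """
        if task_type in ["classification", "extraction", "matching"]:
            return "automated evaluation"
        elif task_type in ["summarization", "creative", "open_qa"]:
            if budget == "low":
                return "LLM-assisted evaluation (G-Eval)"
            elif budget == "medium":
                return "sampled human evaluation"
            else:
                return "full human evaluation"
        elif task_type in ["reasoning", "math", "code"]:
            return "automated evaluation + result verification"
        return "hybrid strategy"

| Method | Cost (rough, per sample) | Quality | Typical use |
|---|---|---|---|
| Rule matching | $0.001 | Medium | Classification, extraction |
| LLM-assisted evaluation | $0.05-0.10 | High | Judging generation quality |
| Human evaluation | $2-10 | Highest | Critical decisions, creative work |
| Hybrid strategy | $0.01-0.05 | Very high | Production monitoring |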
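Prompt engineering and evaluation fairness
class FairEvaluation:
    """
    Techniques for keeping an evaluation fair.
    """
    @staticmethod
    def build_few_shot_prompt(task: str, n_examples: int = 5) -> str:
        """Build a few-shot prompt without leaking test answers."""
        # Examples come from a held-out calibration set, never from the test
        # items, so showing their reference answers does not leak test answers
        examples = load_calibration_examples(task, n=n_examples)
        prompt = "Read the following examples carefully, then complete the final task:\n\n"
        for ex in examples:
            prompt += f"Task: {ex['question']}\n"
            prompt += "Answer:\n"
            prompt += f"{ex['reference_answer']}\n\n"
        return prompt

    @staticmethod
    def control_confounders(
        model_response: str,
        confounders: dict
    ) -> dict:
        """
        Control confounding variables so the comparison is fair:
        - response length
        - format preference
        - position bias
        """
        return {
            "length_normalized": normalize_length(model_response),
            "format_agnostic": remove_formatting(model_response),
            "order_balanced": balance_order_effects(model_response)
        }

Adversarial robustness evaluation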
def evaluate_adversarial_robustness(model):
    """
    Measure how robust the model is to adversarial attacks.
    """
    attacks = {
        "prompt_injection": load_prompt_injection_tests(),
        "jailbreak": load_jailbreak_prompts(),
        "data_poisoning": load_poisoning_tests(),
        "model_extraction": load_extraction_tests()
    }
    results = {}
    for attack_type, test_cases in attacks.items():
        safe_count = 0
        for prompt in test_cases:
            response = model.generate(prompt)
            # Count the case as safe when the attack did not succeed
            if not detect_unsafe_response(response, attack_type):
                safe_count += 1
        results[attack_type] = {
            "safety_rate": safe_count / len(test_cases),
            "unsafe_examples": len(test_cases) - safe_count
        }
    return results

Long-context evaluation
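def evaluate_long_context(model, max_length: int = 200000):
    """
    Long-context evaluation
    - Needle-in-a-Haystack
    - multi-hop reasoning
    - information synthesis
    """
    results = {}
    # Needle-in-a-Haystack test, limited to lengths the model supports
    lengths = [n for n in [8192, 32768, 100000, 200000] if n <= max_length]
    results["needle"] = evaluate_needle_task(
        model,
        context_lengths=lengths,
        num_needles=5
    )
    # Multi-hop reasoning test
    results["multi_hop"] = evaluate_multi_hop_retrieval(
        model,
        context_length=min(100000, max_length),
        hops=[2, 3, 4, 5]
    )
    # Information synthesis test
    results["synthesis"] = evaluate_information_synthesis(
        model,
        context_length=min(50000, max_length)
    )
    return results

Recent evaluation trends
Open-ended generation evaluation
Traditional benchmarks such as MMLU use a multiple-choice format, which cannot show how well a model writes free-form answers. Open-ended generation evaluation has therefore become a new trend.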
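Some open-ended criteria can still be checked by a program, which is the idea behind instruction-following benchmarks such as IFEval: the prompt states a verifiable constraint and a rule decides whether the response satisfies it. The sketch below illustrates the idea with three hypothetical constraint types. The constraint dictionaries, their keys, and the function name `check_simple_constraint` are invented for illustration and do not reproduce any benchmark's schema.

```python
def check_simple_constraint(response, constraint):
    """Return True if `response` satisfies one rule-checkable constraint."""
    kind = constraint["type"]
    if kind == "max_words":
        return len(response.split()) <= constraint["value"]
    if kind == "contains":
        return constraint["value"].lower() in response.lower()
    if kind == "ends_with":
        return response.rstrip().endswith(constraint["value"])
    raise ValueError(f"unknown constraint type: {kind}")


def constraint_pass_rate(response, constraints):
    """Fraction of constraints the response satisfies (1.0 if there are none)."""
    if not constraints:
        return 1.0
    return sum(check_simple_constraint(response, c) for c in constraints) / len(constraints)
```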
class OpenEndedEvaluation:
    """
    Framework for open-ended generation evaluation.
    """
    def evaluate_instruction_following(self, model):
        """
        IFEval: instruction-following evaluation.
        Checks whether the model obeys the specific instructions in the request.
        """
        dataset = load_dataset("google/IFEval", split="train")
        results = []
        for item in dataset:
            response = model.generate(item["prompt"])
            # Check every verifiable instruction attached to the prompt;
            # check_constraint is a placeholder for the rule-based checkers
            constraint_scores = []
            for instruction_id, kwargs in zip(item["instruction_id_list"], item["kwargs"]):
                is_followed = check_constraint(response, instruction_id, kwargs)
                constraint_scores.append(is_followed)
            results.append(np.mean(constraint_scores))
        return np.mean(results)

    def evaluate_summarization(self, model):
        """
        Summarization evaluation with G-Eval or an LLM judge.
        """
        dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")
        scores = []
        for item in dataset:
            article = item["article"]
            reference = item["highlights"]
            response = model.generate(f"Summarize the following article:\n{article}")
            # Score quality with an LLM judge (placeholder, assumed to return a scalar)
            quality = llm_judge.evaluate(
                prompt=f"Rate the quality of this summary:\nSummary: {response}\nReference: {reference}",
                dimensions=["factuality", "coverage", "coherence"]
            )
            scores.append(quality)
        return np.mean(scores)

Multimodal evaluation
def evaluate_multimodal(model, benchmark="MME"):
    """
    Multimodal benchmarks
    MME: Perception and Cognition
    MMBench: Multi-modal Understanding
    Pass benchmark="all" to run every benchmark listed below.
    """
    # Loader functions are placeholders and are only called when selected
    loaders = {
        "MME": load_mme_benchmark,
        "MMBench": load_mmbench,
        "SEED-Bench": load_seed_bench,
        "OCRBench": load_ocr_bench
    }
    selected = loaders if benchmark == "all" else {benchmark: loaders[benchmark]}
    results = {}
    for name, loader in selected.items():
        dataset = loader()
        correct = 0
        for item in dataset:
            image = item["image"]
            question = item["question"]
            response = model.generate(
                image=image,
                text=question
            )
            if check_multimodal_answer(response, item["answer"]):
                correct += 1
        results[name] = correct / len(dataset)
    return results

Agent capability evaluation
def evaluate_agent_capability(model, agent_framework):
    """
    Agent capability evaluation
    - task planning
    - tool use
    - multi-step execution
    """
    benchmarks = {
        "GAIA": load_gaia_benchmark(),      # general AI assistants
        "WebArena": load_web_arena(),       # web navigation
        "AgentBench": load_agent_bench(),   # multi-environment agents
        "ToolBench": load_tool_bench()      # tool use
    }
    results = {}
    for name, benchmark in benchmarks.items():
        task_results = []
        step_counts = []
        for task in benchmark:
            trajectory = agent_framework.run(task)
            score = evaluate_trajectory(trajectory, task["expected"])
            task_results.append(score)
            # Count the steps the agent actually took (assumes a "steps" list)
            step_counts.append(len(trajectory["steps"]))
        results[name] = {
            "success_rate": np.mean(task_results),
            "avg_steps": np.mean(step_counts)
        }
    return results

Cost-efficiency evaluation
def evaluate_cost_efficiency(model, tasks: list):
    """
    Cost-efficiency evaluation:
    maximize performance under a given budget.
    """
    budget_range = [0.01, 0.1, 1.0, 10.0]  # US dollars per query
    results = []
    for budget in budget_range:
        performances = []
        costs = []
        for task in tasks:
            # Pick the strategy expected to do best within the budget
            strategy = select_efficient_strategy(task, budget)
            start_cost = get_cost()
            response = execute_strategy(model, task, strategy)
            actual_cost = get_cost() - start_cost
            performance = evaluate_response(response, task)
            performances.append(performance)
            costs.append(actual_cost)
        mean_cost = np.mean(costs)
        results.append({
            "budget": budget,
            "avg_performance": np.mean(performances),
            "avg_cost": mean_cost,
            # Performance per dollar; guarded against a zero measured cost
            "efficiency": np.mean(performances) / mean_cost if mean_cost > 0 else float("inf")
        })
    return results

Comparison of mainstream benchmarks
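| Benchmark | Tasks | Samples | Main capability | Update frequency |
|---|---|---|---|---|
| MMLU | 57 | 14,042 (test) | Knowledge understanding | Static |
| BBH | 23 | ~6,500 | Complex reasoning | Static |
| GSM8K | 1 | 8,500 | Mathematics | Static |
| MATH | 7 | 12,500 | Competition mathematics | Static |
| HumanEval | 1 | 164 | Code generation | Static |
| TruthfulQA | 1 | 817 | Truthfulness | Static |
| HellaSwag | 1 | 10,042 (validation) | Commonsense reasoning | Static |
| ARC | 2 | 7,787 | Science reasoning | Static |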
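Summary
LLM evaluation is core infrastructure for model development and deployment. Key points:
- Multi-dimensional assessment: no single benchmark captures model capability, so knowledge, reasoning, code, safety, and other dimensions have to be combined
- Evolution of evaluation: from multiple choice to open-ended generation, and from static benchmarks to dynamic evaluation
- Chinese evaluation: Chinese benchmarks such as CMMLU and C-Eval fill the gap in evaluating Chinese-language models
- Methodology: choose the evaluation method according to task type and available resources
- Trends: agent capability, multimodality, and cost efficiency are the new evaluation directions
As LLM technology advances quickly, evaluation keeps evolving with it. Building a rigorous and comprehensive evaluation culture is key to driving progress in LLMs.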
References
[1] Chang et al. "A Survey on Evaluation of Large Language Models". arXiv:2307.03109, 2023.
[2] Hendrycks et al. "Measuring Massive Multitask Language Understanding". ICLR 2021.
[3] Suzgun et al. "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them". arXiv:2210.09261, 2022.
[4] Zellers et al. "HellaSwag: Can a Machine Really Finish Your Sentence?". ACL 2019.
[5] Clark et al. "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge". arXiv:1803.05457, 2018.
[6] Cobbe et al. "Training Verifiers to Solve Math Word Problems". arXiv:2110.14168, 2021.
[7] Hendrycks et al. "Measuring Mathematical Problem Solving With the MATH Dataset". NeurIPS 2021 Datasets and Benchmarks Track.
[8] Liu et al. "LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning". IJCAI 2020.
[9] Chen et al. "Evaluating Large Language Models Trained on Code". arXiv:2107.03374, 2021.
[10] Joshi et al. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension". ACL 2017.
[11] Lin et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods". ACL 2022.
[12] Parrish et al. "BBQ: A Hand-built Bias Benchmark for Question Answering". Findings of ACL 2022.
[13] Li et al. "CMMLU: Measuring massive multitask language understanding in Chinese". arXiv:2306.09212, 2023.
[14] Huang et al. "C-Eval: A Multi-Level Multi-Domain Chinese Evaluation Benchmark for Large Language Models". arXiv:2305.08322, 2023.