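LLM Evaluation Frameworks

Overview

Evaluation of large language models (LLMs) is how teams measure model capability, safeguard quality, and steer iteration. As models such as GPT-4, Claude, Gemini, and Qwen have appeared, evaluation has moved from single accuracy numbers toward comprehensive frameworks that cover knowledge, reasoning, code, safety, and other dimensions. [1]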

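Why Evaluation Matters and Why It Is Hard

LLM evaluation faces three core challenges:

  1. Open-ended output: LLM outputs are highly varied, so no single metric captures their quality
  2. Emergent abilities: some abilities appear only once model scale passes a threshold
  3. Evaluation contamination: test data can leak into the training set and inflate scores

The central question is what counts as a "good" model. The answer depends on the application scenario and on what users need.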

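Taxonomy of Evaluation Dimensions

LLM evaluation dimensions
├── Basic capabilities
│   ├── Knowledge understanding (MMLU, BBH)
│   ├── Commonsense reasoning (HellaSwag)
│   └── Science question answering (ARC)
│
├── Reasoning
│   ├── Mathematical reasoning (GSM8K, MATH)
│   ├── Logical reasoning (LogiQA)
│   └── Code generation (HumanEval, MBPP)
│
├── Knowledge
│   ├── Open-domain QA (TriviaQA)
│   ├── Search-query QA (Natural Questions)
│   └── Knowledge editing (KE)
│
├── Alignment and safety
│   ├── Truthfulness (TruthfulQA)
│   ├── Toxicity (RealToxicityPrompts)
│   ├── Social bias (BBQ)
│   └── Multi-dimensional safety (SafetyBench)
│
└── Vertical domains
    ├── Chinese (CMMLU, C-Eval)
    ├── Medicine (MedQA)
    └── Law (LegalBench)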

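Leaderboard Problems and Standardization

Current evaluation practice has the following problems:

| Problem | Description | Impact |
|---|---|---|
| Data contamination | Test sets leak into training data | Inflated accuracy |
| Benchmark overfitting | Models are over-optimized for specific datasets | Weaker generalization |
| Single metric | Complex abilities are measured by accuracy alone | Other dimensions are ignored |
| Evaluation leakage | Details of the evaluation process leak | Unfair results |

Directions for improvement: refresh test sets dynamically, add adversarial evaluation, and assess models along several dimensions at once.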
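Data contamination is often screened with an n-gram overlap check between test items and the training corpus. The sketch below illustrates that heuristic only; the function names, whitespace tokenization, and the default window of 8 tokens are assumptions made for this example (published analyses have used other windows, such as 13 tokens).

```python
from typing import Iterable, Sequence, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: Sequence[str],
                       corpus_docs: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(test_items)


test_items = [
    "the quick brown fox jumps over the lazy dog",
    "completely unrelated sentence here today",
]
corpus = ["He said the quick brown fox jumps over the lazy dog yesterday"]
rate = contamination_rate(test_items, corpus, n=5)
print(f"flagged fraction: {rate:.2f}")
```

A real corpus is far too large for an in-memory set, so production checks usually hash the n-grams or use an index; the scoring logic stays the same.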

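Basic Capability Evaluation

MMLU: Massive Multitask Language Understanding

MMLU (Massive Multitask Language Understanding) is one of the most widely used benchmarks for basic capability. It was introduced by Hendrycks et al. at UC Berkeley and collaborating institutions and covers 57 subjects. [2]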

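from datasets import load_dataset

# Load MMLU (the "all" config of cais/mmlu merges every subject)
mmlu = load_dataset("cais/mmlu", "all")
print(f"MMLU test split: {len(mmlu['test'])} questions")
print(f"Subjects covered: {len(set(mmlu['test']['subject']))}")

# Evaluation example. `model.generate` and `parse_answer` are placeholders
# supplied by the caller; `parse_answer` should map the reply to a 0-based
# choice index (or None when no answer letter can be found).
def evaluate_mmlu(model, dataset):
    correct = 0
    total = 0

    for item in dataset:
        question = item["question"]
        choices = item["choices"]
        answer_idx = item["answer"]  # integer index of the correct choice

        # Build the multiple-choice prompt
        prompt = f"Question: {question}\nChoices:\n"
        for i, choice in enumerate(choices):
            prompt += f"{chr(65+i)}. {choice}\n"
        prompt += "Select the correct answer (output the letter only): "

        response = model.generate(prompt)
        predicted = parse_answer(response)

        if predicted == answer_idx:
            correct += 1
        total += 1

    return correct / total if total else 0.0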
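The `parse_answer` helper called in `evaluate_mmlu` is not defined in the snippet. A minimal sketch follows, assuming the model was asked to reply with a letter; the regular expressions and the `num_choices` parameter are illustrative choices, not part of any official MMLU harness.

```python
import re
from typing import Optional


def parse_answer(response: str, num_choices: int = 4) -> Optional[int]:
    """Map a model reply to a 0-based choice index, or None if no letter is found."""
    letters = "ABCDEFGHIJ"[:num_choices]
    text = response.strip()
    # Preferred form: an explicit statement such as "The answer is (C)" or "Answer: D".
    match = re.search(rf"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([{letters}])\)?\b", text)
    if match is None:
        # Fallback: the reply starts with the letter, e.g. "B" or "B. Paris".
        match = re.match(rf"\(?([{letters}])\)?(?:[\s.:)]|$)", text)
    return letters.index(match.group(1)) if match else None


for reply in ["B", "The answer is (C).", "Answer: D", "I am not sure"]:
    print(repr(reply), "->", parse_answer(reply))
```

Returning `None` for unparseable replies keeps them counted as wrong in `evaluate_mmlu` without raising, which makes the parse-failure rate easy to report separately.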

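Reading the Results

| Model | MMLU accuracy | Notes |
|---|---|---|
| GPT-3 (175B) | 43.9% | Few-shot baseline |
| PaLM (540B) | 69.3% | 5-shot |
| GPT-4 | 86.4% | Far above the non-expert human baseline |
| Claude 3.5 Sonnet | 88.7% | Close to estimated expert level |
| GPT-4o | 88.7% | Multimodal model |

These figures are reported by the model developers under different prompting setups, so they are not strictly comparable.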

BBH: The BIG-Bench Hard Subset

BBH (BIG-Bench Hard) is a set of 23 tasks selected from BIG-Bench on which, as of 2022, standard LLMs had not outperformed the average human rater. [3]

from datasets import load_dataset

# BBH subtasks: 23 tasks, two of which come in three sizes (27 files in total)
bbh_tasks = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",                # adjective ordering
    "logical_deduction_three_objects",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",                    # sarcasm detection
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_three_objects",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "web_of_lies",
    "word_sorting",
]

def evaluate_bbh(model, task_name: str):
    """Evaluate a single BBH task."""
    # Assumption: the "lukaemon/bbh" copy on the Hugging Face Hub, with one
    # config per task and "input"/"target" fields in a "test" split.
    dataset = load_dataset("lukaemon/bbh", task_name, split="test")

    results = []
    for example in dataset:
        # BBH is usually run with 3-shot chain-of-thought prompts.
        # build_cot_prompt, extract_answer and normalize are user-supplied helpers.
        prompt = build_cot_prompt(task_name, example)
        response = model.generate(prompt)
        prediction = extract_answer(response)
        target = example["target"]
        results.append({
            "target": target,
            "prediction": prediction,
            "correct": normalize(prediction) == normalize(target)
        })

    return sum(r["correct"] for r in results) / len(results)

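HellaSwag: Commonsense Reasoning

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a commonsense reasoning dataset. Humans score about 95% on it, while the models available at its release scored far lower. [4]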

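import re

# HellaSwag evaluation
def evaluate_hellaswag(model, dataset):
    """
    HellaSwag reports top-1 accuracy: given a context, pick the most
    plausible of 4 candidate endings. Harnesses commonly score each ending
    by log-likelihood; the generative prompt below is a simpler approximation.
    """
    correct = 0

    for item in dataset:
        ctx_a = item["ctx_a"]        # context sentence
        ctx_b = item["ctx_b"]        # first words of the ending (may be empty)
        endings = item["endings"]    # 4 candidate endings
        label = int(item["label"])   # gold index; may be stored as a string

        # Build the prompt
        prompt = f"Context: {ctx_a} {ctx_b}\n\nPossible endings:\n"
        for i, ending in enumerate(endings):
            prompt += f"{i+1}. {ending}\n"
        prompt += "\nSelect the most plausible ending (enter a number 1-4): "

        response = model.generate(prompt)
        # Take the first digit 1-4; unparseable replies count as wrong
        match = re.search(r"[1-4]", response)
        predicted = int(match.group()) - 1 if match else None  # 0-based

        if predicted == label:
            correct += 1

    return correct / len(dataset)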

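ARC: Science Question Answering

ARC (AI2 Reasoning Challenge) contains 7,787 multiple-choice science questions, split into an Easy set and a Challenge set, and tests multi-hop reasoning. [5]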

import re

import numpy as np
from datasets import load_dataset

# ARC evaluation framework
class ARC_Evaluator:
    def __init__(self, model):
        self.model = model

    def evaluate(self, split="test"):
        scores = {}
        # ARC ships as two configs rather than a per-item difficulty field
        for config in ("ARC-Easy", "ARC-Challenge"):
            dataset = load_dataset("allenai/ai2_arc", config, split=split)
            results = []
            for item in dataset:
                labels = item["choices"]["label"]  # usually A-D, sometimes 1-4
                texts = item["choices"]["text"]
                response = self.cot_reasoning(item["question"], labels, texts)
                results.append(self.check_answer(response, item["answerKey"]))
            scores[config] = np.mean(results) * 100
        return scores

    def cot_reasoning(self, question, labels, texts):
        prompt = f"Question: {question}\n\nChoices:\n"
        for label, text in zip(labels, texts):
            prompt += f"{label}. {text}\n"
        prompt += ("\nAnalyze the question first, then give the answer at "
                   "the end (format: The answer is X).")
        return self.model.generate(prompt)

    @staticmethod
    def check_answer(response, answer_key):
        # Use the last "answer is X" statement in the reply
        found = re.findall(r"answer is\s*\(?([A-E1-5])\)?", response, re.IGNORECASE)
        return bool(found) and found[-1].upper() == answer_key

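Reasoning Evaluation

GSM8K: Grade School Math

GSM8K (Grade School Math 8K) contains about 8,500 grade-school math word problems that typically take 2 to 8 reasoning steps to solve. [6]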

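import re
from typing import Optional

from datasets import load_dataset
from tqdm import tqdm

def evaluate_gsm8k(model, split="test"):
    """
    GSM8K evaluation notes:
    - the final answer must match the reference number exactly
    - reference solutions show their working and end with "#### <answer>"
    - problems may need several calculation steps
    """
    dataset = load_dataset("openai/gsm8k", "main")[split]

    correct = 0
    results = []

    for item in tqdm(dataset):
        question = item["question"]
        answer = item["answer"]

        # The reference answer is the number after the "####" marker
        ground_truth = extract_number(answer.split("####")[-1])

        # Generate a chain-of-thought solution
        response = model.generate(
            "Solve the following problem step by step, then state the "
            f"final answer.\n\nQuestion: {question}",
            temperature=0.3
        )

        predicted = extract_number(response)
        is_correct = predicted is not None and predicted == ground_truth

        correct += is_correct
        results.append({
            "question": question,
            "response": response,
            "expected": ground_truth,
            "predicted": predicted,
            "correct": is_correct
        })

    return {
        "accuracy": correct / len(dataset),
        "results": results
    }

def extract_number(text: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none."""
    # Drop thousands separators so "1,200" parses as 1200
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None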

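MATH: Competition Mathematics

The MATH dataset contains 12,500 high-school competition problems across algebra, geometry, number theory, and other subjects, and is markedly harder than GSM8K. [7]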

from collections import defaultdict

from datasets import load_dataset

def evaluate_math(model, dataset_path="hendrycks/competition_math"):
    """
    MATH evaluation notes:
    - answers can be exact mathematical expressions
    - the reference answer sits inside \\boxed{...} in the LaTeX solution
    - results are usually broken down by difficulty level and subject
    """
    dataset = load_dataset(dataset_path)

    # Each problem carries a "level" ("Level 1" to "Level 5") and a "type" (subject)
    results = defaultdict(lambda: {"correct": 0, "total": 0})

    for item in dataset["test"]:
        level = item["level"]

        # format_math_prompt, parse_math_answer, normalize_math and
        # check_math_equivalence are user-supplied helpers.
        response = model.generate(
            format_math_prompt(item["problem"]),
            temperature=0.5
        )

        predicted = parse_math_answer(response)
        expected = normalize_math(item["solution"])

        is_correct = check_math_equivalence(predicted, expected)
        results[level]["correct"] += is_correct
        results[level]["total"] += 1

    # Print results per difficulty level
    for level, stats in sorted(results.items()):
        print(f"{level}: {stats['correct']}/{stats['total']} = "
              f"{stats['correct']/max(stats['total'],1):.2%}")

    return dict(results)
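MATH reference solutions put the final answer inside `\boxed{...}`, so an evaluator has to extract that expression before comparing it with the model's answer. A minimal sketch of that step follows; `last_boxed_answer` is a hypothetical helper name, and brace matching is used because answers such as `\frac{3}{4}` contain nested braces.

```python
from typing import Optional


def last_boxed_answer(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a LaTeX string, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


print(last_boxed_answer(r"Thus the probability is $\boxed{\frac{3}{4}}$."))
```

Two extracted answers still need an equivalence check (for example `0.5` against `\frac{1}{2}`), which this sketch does not attempt.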

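ARC-Challenge: Complex Reasoning

ARC-Challenge is the hard subset of ARC and demands stronger multi-hop reasoning.

import numpy as np
from datasets import load_dataset

# ARC-Challenge evaluation
def evaluate_arc_challenge(model):
    """
    ARC-Challenge characteristics:
    - questions integrate information from several knowledge areas
    - choices include distractors
    - answers usually need two or more reasoning steps
    """
    dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

    def evaluate_item(item):
        # build_arc_prompt and extract_choice are user-supplied helpers
        prompt = build_arc_prompt(item)
        response = model.generate(prompt)

        # Parse the model's reply
        predicted = extract_choice(response)
        expected = item["answerKey"]

        return predicted == expected

    return np.mean([evaluate_item(item) for item in dataset])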

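LogiQA: Logical Reasoning

LogiQA is a dataset built to test logical reasoning, with more than 8,000 multiple-choice questions that require strict inference. [8]

from datasets import load_dataset

def evaluate_logiqa(model):
    """
    LogiQA evaluates logical reasoning, including propositional logic,
    deductive reasoning, and inductive reasoning.
    """
    dataset = load_dataset("lucadiliello/logiqa")

    correct = 0
    # Per-category tallies. The field names and the "category" labels follow
    # the original snippet; check them against the copy of the data you use.
    categories = {}

    for item in dataset["test"]:
        context = item["context"]
        question = item["question"]
        options = item["options"]

        prompt = (f"Read the passage and answer the question.\n\n"
                  f"Passage: {context}\n\nQuestion: {question}\n\nChoices:\n")
        for i, opt in enumerate(options):
            prompt += f"{chr(65+i)}. {opt}\n"
        prompt += "\nExplain your reasoning, then give the answer."

        response = model.generate(prompt)
        predicted = extract_choice(response)  # user-supplied helper
        expected = chr(65 + int(item["expected_answer"]))  # 0-3 -> "A"-"D"

        stats = categories.setdefault(item.get("category", "deductive"),
                                      {"correct": 0, "total": 0})
        stats["total"] += 1
        if predicted == expected:
            correct += 1
            stats["correct"] += 1

    return {
        "accuracy": correct / len(dataset["test"]),
        "by_category": categories
    }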

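Code Evaluation: HumanEval and MBPP

HumanEval and MBPP are the two most commonly used code-generation benchmarks. [9]

import math

import numpy as np
from datasets import load_dataset
from tqdm import tqdm

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass [9]."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def evaluate_code_generation(model, benchmark="humaneval", num_samples=100):
    """
    HumanEval: 164 hand-written programming problems
    MBPP: 974 entry-level Python programming problems

    Scored with the pass@k metric.
    """
    # check_correctness(code, tests, entry_point) -> bool is a user-supplied
    # sandboxed runner; never execute generated code outside a sandbox.
    from execute import check_correctness

    if benchmark == "humaneval":
        dataset = load_dataset("openai/openai_humaneval", split="test")
    else:
        dataset = load_dataset("google-research-datasets/mbpp", split="test")

    results = {"pass@1": [], "pass@10": [], "pass@100": []}

    for task in tqdm(dataset):
        if benchmark == "humaneval":
            prompt, tests, entry_point = task["prompt"], task["test"], task["entry_point"]
        else:
            # MBPP provides a text description plus a list of assert statements
            prompt = f"# {task['text']}\n"
            tests, entry_point = "\n".join(task["test_list"]), None

        # Sample several candidates and count how many pass the tests
        passed = 0
        for _ in range(num_samples):
            code = model.generate(
                f"Complete the following Python function:\n{prompt}",
                temperature=0.8,
                stop=["\nclass ", "\ndef ", "\nimport "]
            )
            try:
                passed += bool(check_correctness(prompt + code, tests, entry_point))
            except Exception:
                pass  # treat runner errors as failures

        for k in (1, 10, 100):
            if k <= num_samples:
                results[f"pass@{k}"].append(pass_at_k(num_samples, passed, k))

    return {name: np.mean(vals) * 100 for name, vals in results.items() if vals}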
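The pass@k metric named above is usually computed with the unbiased estimator from Chen et al. [9], rather than by checking whether any of the first k samples passes. In the formula below, n is the number of samples drawn per problem and c is the number that pass the unit tests; the small numeric example is added here for illustration.

```latex
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]

% Worked example with n = 5 samples, of which c = 2 pass:
%   k = 1:  1 - \binom{3}{1} / \binom{5}{1} = 1 - 3/5  = 0.4
%   k = 2:  1 - \binom{3}{2} / \binom{5}{2} = 1 - 3/10 = 0.7
```

For k = 1 the estimator reduces to c/n, the plain fraction of passing samples. When n - c < k every size-k subset contains a passing sample, so the term equals 1.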

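Knowledge Evaluation

TriviaQA: Open-Domain Question Answering

TriviaQA contains more than 650,000 question-answer-evidence triples built from roughly 95,000 question-answer pairs. It spans a broad range of topics and is used to measure open-domain knowledge. [10]

from datasets import load_dataset
from tqdm import tqdm

def evaluate_triviaqa(model, config="rc.nocontext", split="validation"):
    """
    TriviaQA in the closed-book setting:
    - questions are asked without any supporting context
    - the model must rely on its internal knowledge
    - answers vary in form (people, places, dates, and so on)
    """
    dataset = load_dataset("mandarjoshi/trivia_qa", config, split=split)

    correct = 0
    predictions = []

    for item in tqdm(dataset):
        question = item["question"]

        response = model.generate(
            f"Answer the following question with the answer only:\n{question}",
            temperature=0.3
        )

        # normalize_answer and check_answer_match are user-supplied helpers
        predicted = normalize_answer(response)
        expected = item["answer"]["normalized_aliases"]  # all accepted spellings

        is_correct = check_answer_match(predicted, expected)
        correct += is_correct
        predictions.append({
            "question": question,
            "predicted": predicted,
            "expected": expected,
            "correct": is_correct
        })

    return {
        "accuracy": correct / len(dataset),
        "predictions": predictions
    }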

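Natural Questions: Google Search Queries

Natural Questions is built from real Google search queries and provides two answer types, long and short.

from datasets import load_dataset

def evaluate_nq(model, answer_type="short"):
    """
    Natural Questions evaluation
    - long answer: paragraph level
    - short answer: entity level
    """
    dataset = load_dataset("google-research-datasets/natural_questions",
                          split="validation")

    correct = 0
    total = 0

    for item in dataset:
        question = item["question"]["text"]

        if answer_type == "short":
            # Short answers: one or more entities per annotator. The nested
            # layout (a list of texts per annotation) is assumed here.
            expected_answers = [text
                                for ans in item["annotations"]["short_answers"]
                                for text in ans["text"]]
        else:
            # Long answers: a paragraph or table (user-supplied extractor)
            expected_answers = extract_long_answers(item["annotations"])

        if not expected_answers:
            continue  # skip questions with no annotated answer
        total += 1

        response = model.generate(
            f"Answer the following question from your own knowledge:\n{question}",
            temperature=0.3
        )

        if check_answer_match(response, expected_answers):
            correct += 1

    return correct / total if total else 0.0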

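Knowledge Editing Evaluation

Knowledge editing evaluation measures how well a specific fact in a model can be updated without disturbing everything else it knows.

import numpy as np
from datasets import load_dataset

def evaluate_knowledge_editing(model, editor):
    """
    Evaluate knowledge-editing quality.
    Key metrics: reliability, generality, locality.
    """
    # The dataset id, split and field names follow the original snippet
    # (CounterFact-style records); check them against the data you use.
    dataset = load_dataset("zjunlp/KnowledgeEditing", split="train")

    metrics = {"reliability": [], "generality": [], "locality": []}

    for edit in dataset:
        target = edit["target_new"]["answer"]
        neighborhood = edit["neighborhood_prompts"]

        # Record answers to unrelated prompts before editing
        # (requires deterministic decoding for a fair comparison)
        before = [model.generate(p) for p in neighborhood]

        # Apply the edit; `editor` is assumed to start from the unedited model
        editor.apply(model, edit["prompt"], target)

        # Reliability: the edited prompt itself
        response = model.generate(edit["prompt"])
        metrics["reliability"].append(check_answer_match(response, target))

        # Generality: paraphrases of the edited prompt
        for rephrase in edit["rephrase_prompts"]:
            response = model.generate(rephrase)
            metrics["generality"].append(check_answer_match(response, target))

        # Locality: unrelated prompts should keep their pre-edit answers
        for prompt, old_answer in zip(neighborhood, before):
            metrics["locality"].append(model.generate(prompt) == old_answer)

    return {k: np.mean(v) for k, v in metrics.items()}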

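Alignment and Safety Evaluation

TruthfulQA: Truthfulness

TruthfulQA tests whether a model produces false or misleading statements when answering questions. It contains 817 questions. [11]

from datasets import load_dataset

def evaluate_truthfulqa(model):
    """
    TruthfulQA evaluation notes:
    - questions target common human misconceptions
    - models tend to imitate those falsehoods
    - answers are judged on both truthfulness and informativeness
    """
    dataset = load_dataset("truthfulqa/truthful_qa", "generation",
                          split="validation")

    results = {"truthful": 0, "informative": 0, "both": 0}

    for item in dataset:
        question = item["question"]

        response = model.generate(
            f"Answer the following question:\n{question}",
            temperature=0.7
        )

        # evaluate_truthfulness and evaluate_informativeness are user-supplied
        # judges, for example an LLM judge that compares the reply with
        # item["correct_answers"] and item["incorrect_answers"].
        is_truthful = bool(evaluate_truthfulness(response, item))
        is_informative = bool(evaluate_informativeness(response, question))

        results["truthful"] += is_truthful
        results["informative"] += is_informative
        results["both"] += is_truthful and is_informative

    n = len(dataset)
    return {
        "truthful_accuracy": results["truthful"] / n,
        "informative_accuracy": results["informative"] / n,
        # share of answers that are truthful and informative at once
        "truthful_x_info": results["both"] / n
    }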
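Besides free-form generation, TruthfulQA has a multiple-choice variant. Its MC2 score is the normalized probability mass that a model assigns to the true reference answers. The sketch below assumes the per-answer log-probabilities have already been obtained from the model; `mc2_score` and its argument names are illustrative.

```python
import math
from typing import Sequence


def mc2_score(logprobs: Sequence[float], labels: Sequence[int]) -> float:
    """Probability mass on true answers, normalized over all reference answers.

    logprobs[i] is the model's log-probability of reference answer i;
    labels[i] is 1 if that answer is true and 0 if it is false.
    """
    if len(logprobs) != len(labels):
        raise ValueError("logprobs and labels must have the same length")
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return true_mass / sum(probs)


# One question with one true and two false reference answers
score = mc2_score([math.log(0.2), math.log(0.1), math.log(0.1)], [1, 0, 0])
print(f"MC2 for this question: {score:.2f}")
```

The benchmark-level MC2 is the mean of this per-question value, so it needs no judge model, unlike the generation setting.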

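RealToxicityPrompts: Toxicity

RealToxicityPrompts measures how readily a model generates toxic text when it continues naturally occurring prompts, including prompts that are themselves toxic.

import numpy as np
from datasets import load_dataset
from tqdm import tqdm

def evaluate_toxicity(model, perspective_api_key=None):
    """
    Measure the toxicity of model outputs using the Perspective API
    or a local toxicity classifier.
    """
    dataset = load_dataset("allenai/real-toxicity-prompts", split="train")
    # Keep only the prompts flagged as "challenging" in the dataset
    dataset = dataset.filter(lambda item: item["challenging"])

    toxicity_scores = []

    for item in tqdm(dataset):
        prompt = item["prompt"]["text"]

        # Generate a continuation
        response = model.generate(prompt, max_tokens=50)

        # perspective_api and local_toxicity_detector are user-supplied
        # scorers that return a toxicity value in [0, 1]
        if perspective_api_key:
            toxicity = perspective_api.check_toxicity(response)
        else:
            toxicity = local_toxicity_detector(response)

        toxicity_scores.append(toxicity)

    return {
        "mean_toxicity": np.mean(toxicity_scores),
        # share of continuations scoring above each threshold
        "toxicity_at_10": np.mean([s > 0.1 for s in toxicity_scores]),
        "toxicity_at_50": np.mean([s > 0.5 for s in toxicity_scores]),
        "toxicity_at_90": np.mean([s > 0.9 for s in toxicity_scores]),
    }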
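The RealToxicityPrompts paper samples several continuations per prompt (25 in the paper) and reports two summary metrics: the expected maximum toxicity and the probability that at least one continuation is toxic. The sketch below computes both from precomputed scores; the function name and the 0.5 threshold default are choices made for this example.

```python
from statistics import mean
from typing import Dict, Sequence


def toxicity_metrics(scores_per_prompt: Sequence[Sequence[float]],
                     threshold: float = 0.5) -> Dict[str, float]:
    """Summarize toxicity over k sampled continuations per prompt.

    scores_per_prompt[i] holds the toxicity scores (0 to 1) of every
    continuation sampled for prompt i.
    """
    max_scores = [max(scores) for scores in scores_per_prompt]
    return {
        # average, over prompts, of the worst continuation's score
        "expected_max_toxicity": mean(max_scores),
        # share of prompts with at least one continuation at or above threshold
        "toxicity_probability": mean(
            [1.0 if m >= threshold else 0.0 for m in max_scores]
        ),
    }


example = toxicity_metrics([[0.1, 0.7], [0.2, 0.3]])
print(example)
```

Because both metrics take a maximum over samples, they grow with the number of continuations drawn, so that number should be reported alongside the scores.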

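BBQ: Social Bias

BBQ (Bias Benchmark for QA) tests whether a model shows bias when answering questions that involve social groups. [12]

from collections import defaultdict

from datasets import load_dataset

# Assumed community mirror of BBQ on the Hugging Face Hub; verify the id,
# config and split before use. The fields used below (context, question,
# ans0-ans2, label, category, context_condition) follow the original release.
BBQ_DATASET_ID = "heegyu/bbq"

def evaluate_bbq(model):
    """
    BBQ social-bias evaluation across 9 social dimensions
    (age, disability status, gender identity, nationality, physical
    appearance, race/ethnicity, religion, socioeconomic status,
    sexual orientation).
    """
    dataset = load_dataset(BBQ_DATASET_ID, split="test")

    results = defaultdict(lambda: {"biased": 0, "neutral": 0, "total": 0})

    for item in dataset:
        # This sketch scores ambiguous contexts only, where the correct
        # answer is always the "unknown" option.
        if item["context_condition"] != "ambig":
            continue

        options = [item["ans0"], item["ans1"], item["ans2"]]
        prompt = f"{item['context']}\n{item['question']}\n"
        for i, opt in enumerate(options):
            prompt += f"{chr(65+i)}. {opt}\n"
        prompt += "Answer with the letter only: "

        # parse_answer is a user-supplied helper: reply -> 0-based index
        predicted = parse_answer(model.generate(prompt))

        stats = results[item["category"]]
        stats["total"] += 1
        if predicted == item["label"]:
            stats["neutral"] += 1   # chose the "unknown" option
        else:
            # committed to a group without evidence (this count also
            # includes anti-stereotypical picks)
            stats["biased"] += 1

    return dict(results)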

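SafetyBench: Multi-Dimensional Safety

SafetyBench is a comprehensive safety benchmark of multiple-choice questions covering 7 safety categories.

from collections import defaultdict

import numpy as np
from datasets import load_dataset

def evaluate_safety_bench(model):
    """
    SafetyBench safety categories:
    - Offensiveness
    - Unfairness and Bias
    - Physical Health
    - Mental Health
    - Illegal Activities
    - Ethics and Morality
    - Privacy and Property
    """
    # Assumption: the Hub copy exposes a "test" config with per-language
    # splits ("en", "zh") and "question", "options", "category" fields.
    dataset = load_dataset("thu-coai/SafetyBench", "test")["en"]

    categories = defaultdict(list)

    for item in dataset:
        category = item["category"]
        prompt = f"Answer the following question:\n{item['question']}\n"
        for i, opt in enumerate(item["options"]):
            prompt += f"{chr(65+i)}. {opt}\n"

        response = model.generate(prompt, temperature=0.3)

        # check_safety is a user-supplied judge; gold answers for the test
        # questions may not be distributed with the data.
        is_safe = check_safety(response, category)
        categories[category].append(is_safe)

    return {cat: np.mean(scores) for cat, scores in categories.items()}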

Chinese-Language Evaluation

CMMLU: Chinese Multitask Understanding

CMMLU (Chinese Massive Multitask Language Understanding) is an MMLU-style benchmark built for Chinese, with 11,528 multiple-choice questions across 67 subjects. [13]

from datasets import load_dataset

def evaluate_cmmlu(model, subjects):
    """
    CMMLU evaluates basic Chinese-language capability across the natural
    sciences, social sciences, engineering, and other fields.
    `subjects` is a list of CMMLU config names (one config per subject).
    """
    correct = 0
    total = 0
    results = {}

    for subject in subjects:
        # Columns in each subject file: Question, A, B, C, D, Answer
        dataset = load_dataset("haonan-li/cmmlu", subject, split="test")

        subject_correct = 0
        subject_total = 0

        for item in dataset:
            # The prompt stays in Chinese because the benchmark is Chinese
            prompt = f"问题:{item['Question']}\n选项:\n"
            for letter in "ABCD":
                prompt += f"{letter}. {item[letter]}\n"
            prompt += "请选择正确答案(只需输出字母):"

            response = model.generate(prompt)
            # parse_single_char is a user-supplied helper returning one letter
            predicted = parse_single_char(response).upper()

            if predicted == item["Answer"]:
                correct += 1
                subject_correct += 1
            subject_total += 1

        total += subject_total
        results[subject] = subject_correct / max(subject_total, 1)

    return {
        "overall": correct / max(total, 1),
        "by_subject": results
    }

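C-Eval: Chinese Capability Evaluation

C-Eval is a Chinese LLM benchmark released by researchers from Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. It contains 13,948 multiple-choice questions across 52 disciplines and four difficulty levels. [14]

import numpy as np
from datasets import load_dataset

# Assumed subject -> level mapping for a few C-Eval configs; extend as needed.
SUBJECT_LEVELS = {
    "middle_school_mathematics": "middle school",
    "high_school_physics": "high school",
    "college_physics": "college",
    "physician": "professional",
}

def evaluate_c_eval(model, subject_levels=SUBJECT_LEVELS):
    """
    C-Eval evaluates Chinese-language capability at four difficulty levels.
    """
    levels = {"middle school": [], "high school": [], "college": [], "professional": []}

    for subject, level in subject_levels.items():
        # One config per subject; the "val" split includes gold answers
        dataset = load_dataset("ceval/ceval-exam", subject, split="val")

        for item in dataset:
            choices = [item[letter] for letter in "ABCD"]

            # format_ceval_prompt and parse_single_char are user-supplied helpers
            response = model.generate(format_ceval_prompt(item["question"], choices))
            predicted = parse_single_char(response)

            levels[level].append(predicted == item["answer"])

    return {level: np.mean(scores) if scores else 0
            for level, scores in levels.items()}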

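MMCU: Chinese Professional Knowledge

MMCU (Massive Multitask Chinese Understanding) evaluates knowledge in Chinese professional domains.

from datasets import load_dataset

def evaluate_mMCU(model):
    """
    MMCU evaluates Chinese professional knowledge in medicine, law,
    psychology, and other domains.
    """
    # Dataset id and domain names are as given in the original snippet and
    # have not been verified; replace them with the copy you use.
    dataset = load_dataset("RUCAIBox/mMCU")

    domains = ["medicine", "law", "psychology", "education", "history"]

    results = {}
    for domain in domains:
        domain_dataset = dataset[domain]
        correct = 0

        for item in domain_dataset:
            # format_prompt and is_correct are user-supplied helpers
            response = model.generate(format_prompt(item))
            if is_correct(response, item["answer"]):
                correct += 1

        results[domain] = correct / len(domain_dataset)

    return results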

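CMRC: Chinese Reading Comprehension

CMRC (Chinese Machine Reading Comprehension) is a benchmark for Chinese reading comprehension.

import numpy as np
from datasets import load_dataset

def evaluate_cmrc(model):
    """
    CMRC 2018 Chinese reading comprehension:
    the answer is a span extracted from the given passage.
    """
    dataset = load_dataset("cmrc2018", split="validation")

    em_scores = []  # Exact Match
    f1_scores = []  # Token-level F1

    for item in dataset:
        context = item["context"]
        question = item["question"]

        # The prompt stays in Chinese because the benchmark is Chinese
        response = model.generate(
            f"阅读理解:\n文章:{context}\n问题:{question}\n请给出答案:"
        )

        predicted = response.strip()
        references = item["answers"]["text"]  # one or more reference answers

        # Score against the best-matching reference;
        # compute_em and compute_f1 are user-supplied helpers
        em_scores.append(max(compute_em(predicted, ref) for ref in references))
        f1_scores.append(max(compute_f1(predicted, ref) for ref in references))

    return {
        "exact_match": np.mean(em_scores),
        "f1_score": np.mean(f1_scores)
    }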
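The `compute_em` and `compute_f1` helpers called in `evaluate_cmrc` are not defined in the snippet. A minimal character-level sketch follows, assuming SQuAD-style bag-of-characters overlap after stripping whitespace and punctuation; the official CMRC 2018 scorer may differ in its details, so these numbers should not be compared directly with leaderboard results.

```python
import re
from collections import Counter


def normalize_zh(text: str) -> str:
    """Lowercase and drop whitespace, punctuation and underscores."""
    return re.sub(r"[\W_]+", "", text.lower())


def compute_em(predicted: str, reference: str) -> float:
    """Exact match after normalization (1.0 or 0.0)."""
    return float(normalize_zh(predicted) == normalize_zh(reference))


def compute_f1(predicted: str, reference: str) -> float:
    """Character-level F1 between a prediction and one reference answer."""
    pred_chars = list(normalize_zh(predicted))
    ref_chars = list(normalize_zh(reference))
    if not pred_chars or not ref_chars:
        return float(pred_chars == ref_chars)
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


print(compute_em("北京。", "北京"), compute_f1("北京市", "北京"))
```

Characters are used as tokens because Chinese text has no whitespace word boundaries; a word segmenter could be swapped in if word-level scores are preferred.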

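Evaluation Methodology

Automated Evaluation vs. Human Evaluation

class EvaluationStrategy:
    """
    Choose a suitable evaluation strategy.
    """

    @staticmethod
    def select_strategy(task_type: str, budget: str) -> str:
        """
        Choose a strategy from the task type and the budget.

        Decision tree:
        - tasks with rule-checkable answers -> automated evaluation
        - open-ended generation tasks -> human evaluation
        - cost-sensitive settings -> sampling plus automation
        """
        if task_type in ["classification", "extraction", "matching"]:
            return "automated evaluation"

        elif task_type in ["summarization", "creative", "open_qa"]:
            if budget == "low":
                return "LLM-assisted evaluation (G-Eval)"
            elif budget == "medium":
                return "sampled human evaluation"
            else:
                return "full human evaluation"

        elif task_type in ["reasoning", "math", "code"]:
            return "automated evaluation + result verification"

        return "hybrid strategy"

| Method | Approximate cost | Quality | Typical use |
|---|---|---|---|
| Rule matching | $0.001 per sample | Medium | Classification, extraction |
| LLM-assisted evaluation | $0.05-0.10 per sample | | Judging generation quality |
| Human evaluation | $2-10 per sample | Highest | Key decisions, creative work |
| Hybrid strategy | $0.01-0.05 per sample | Very high | Production monitoring |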

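Prompt Engineering and Evaluation Fairness

class FairEvaluation:
    """
    Techniques for keeping evaluation fair.
    """

    @staticmethod
    def build_few_shot_prompt(task: str, n_examples: int = 5) -> str:
        """Build a few-shot prompt without leaking test answers."""
        # load_calibration_examples is a user-supplied helper; the examples
        # must come from a held-out set that is disjoint from the test items.
        examples = load_calibration_examples(task, n=n_examples)

        prompt = "Read the following examples carefully, then complete the final task:\n\n"
        for ex in examples:
            prompt += f"Task: {ex['question']}\n"
            prompt += "Answer:\n"
            # Demonstrations show their reference answers; only the answer
            # to the actual test item is withheld.
            prompt += f"{ex['reference_answer']}\n\n"

        return prompt

    @staticmethod
    def control_confounders(
        model_response: str,
        confounders: dict
    ) -> dict:
        """
        Control confounding variables so that scores stay comparable:
        - response length
        - formatting preferences
        - position bias
        """
        # normalize_length, remove_formatting and balance_order_effects are
        # user-supplied helpers; `confounders` is reserved for their settings.
        return {
            "length_normalized": normalize_length(model_response),
            "format_agnostic": remove_formatting(model_response),
            "order_balanced": balance_order_effects(model_response)
        }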

Adversarial Robustness Evaluation

def evaluate_adversarial_robustness(model):
    """
    Evaluate how robust the model is to adversarial attacks.
    """
    # The loader functions and detect_unsafe_response are user-supplied
    attacks = {
        "prompt_injection": load_prompt_injection_tests(),
        "jailbreak": load_jailbreak_prompts(),
        "data_poisoning": load_poisoning_tests(),
        "model_extraction": load_extraction_tests()
    }

    results = {}

    for attack_type, test_cases in attacks.items():
        safe_count = 0

        for prompt in test_cases:
            response = model.generate(prompt)

            # Count the response as safe if the attack did not succeed
            if not detect_unsafe_response(response, attack_type):
                safe_count += 1

        results[attack_type] = {
            "safety_rate": safe_count / max(len(test_cases), 1),
            "unsafe_examples": len(test_cases) - safe_count  # count, not a list
        }

    return results

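Long-Context Evaluation

def evaluate_long_context(model, max_length: int = 200000):
    """
    Long-context evaluation
    - Needle-in-a-Haystack
    - multi-hop reasoning
    - information synthesis
    """
    results = {}

    # Needle-in-a-Haystack: only test lengths the model can accept.
    # The three evaluate_* functions are user-supplied.
    lengths = [n for n in (8192, 32768, 100000, 200000) if n <= max_length]
    results["needle"] = evaluate_needle_task(
        model,
        context_lengths=lengths,
        num_needles=5
    )

    # Multi-hop reasoning
    results["multi_hop"] = evaluate_multi_hop_retrieval(
        model,
        context_length=min(100000, max_length),
        hops=[2, 3, 4, 5]
    )

    # Information synthesis
    results["synthesis"] = evaluate_information_synthesis(
        model,
        context_length=min(50000, max_length)
    )

    return results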
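A Needle-in-a-Haystack test plants one fact at a chosen depth inside long filler text and then asks the model to retrieve it. The sketch below shows how such a prompt could be assembled and scored; the function names, the sentence-level filler, and the substring check are assumptions made for this example.

```python
from typing import List


def build_needle_prompt(filler_sentences: List[str], needle: str,
                        depth: float, question: str) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + f"\n\nQuestion: {question}\nAnswer:"


def needle_recalled(response: str, expected: str) -> bool:
    """Case-insensitive substring check for the planted fact."""
    return expected.lower() in response.lower()


filler = [f"Sentence {i}." for i in range(10)]
prompt = build_needle_prompt(filler, "The secret code is 4217.", 0.5,
                             "What is the secret code?")
print(prompt)
```

Sweeping `depth` and the filler length gives the usual grid of retrieval accuracy by position and context size.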

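Recent Evaluation Trends

Open-Ended Generation

Traditional benchmarks such as MMLU use a multiple-choice format, which cannot measure how well a model actually expresses itself. Open-ended generation evaluation has therefore become a new trend.

import numpy as np
from datasets import load_dataset

class OpenEndedEvaluation:
    """
    Framework for open-ended generation evaluation.
    """

    def evaluate_instruction_following(self, model):
        """
        IFEval: instruction-following evaluation.
        Checks whether the model obeys the specific instructions requested.
        """
        dataset = load_dataset("google/IFEval", split="train")

        results = []
        for item in dataset:
            response = model.generate(item["prompt"])

            # Each prompt lists verifiable instruction ids with matching
            # kwargs; check_constraint is a user-supplied checker.
            constraint_scores = [
                check_constraint(response, instruction_id, kwargs)
                for instruction_id, kwargs
                in zip(item["instruction_id_list"], item["kwargs"])
            ]

            results.append(np.mean(constraint_scores))

        return np.mean(results)

    def evaluate_summarization(self, model):
        """
        Summarization evaluation with G-Eval or an LLM-as-Judge.
        """
        dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")

        scores = []
        for item in dataset:
            article = item["article"]
            reference = item["highlights"]

            response = model.generate(f"Summarize the following article:\n{article}")

            # llm_judge is a user-supplied judge, assumed to return a single
            # numeric score aggregated over the listed dimensions.
            quality = llm_judge.evaluate(
                prompt=("Rate the quality of the following summary:\n"
                        f"Summary: {response}\nReference: {reference}"),
                dimensions=["factuality", "coverage", "coherence"]
            )
            scores.append(quality)

        return np.mean(scores)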

Multimodal Evaluation

def evaluate_multimodal(model, benchmark="MME"):
    """
    Multimodal benchmarks
    MME: perception and cognition
    MMBench: multimodal understanding
    """
    # Loader functions are user-supplied; only the requested benchmark is
    # loaded. Pass benchmark="all" to run every benchmark.
    loaders = {
        "MME": load_mme_benchmark,
        "MMBench": load_mmbench,
        "SEED-Bench": load_seed_bench,
        "OCRBench": load_ocr_bench
    }
    names = list(loaders) if benchmark == "all" else [benchmark]

    results = {}
    for name in names:
        dataset = loaders[name]()
        correct = 0
        for item in dataset:
            image = item["image"]
            question = item["question"]

            response = model.generate(
                image=image,
                text=question
            )

            # check_multimodal_answer is a user-supplied helper
            if check_multimodal_answer(response, item["answer"]):
                correct += 1

        results[name] = correct / max(len(dataset), 1)

    return results

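Agent Capability Evaluation

import numpy as np

def evaluate_agent_capability(model, agent_framework):
    """
    Agent capability evaluation
    - task planning
    - tool use
    - multi-step execution
    """
    # Loader functions and evaluate_trajectory are user-supplied;
    # agent_framework is assumed to wrap `model` already.
    benchmarks = {
        "GAIA": load_gaia_benchmark(),      # general AI assistant
        "WebArena": load_web_arena(),       # web navigation
        "AgentBench": load_agent_bench(),   # multi-environment agents
        "ToolBench": load_tool_bench()      # tool use
    }

    results = {}
    for name, benchmark in benchmarks.items():
        task_results = []
        step_counts = []

        for task in benchmark:
            trajectory = agent_framework.run(task)
            score = evaluate_trajectory(trajectory, task["expected"])
            task_results.append(score)
            # Steps the agent actually took (assumes a "steps" list)
            step_counts.append(len(trajectory["steps"]))

        results[name] = {
            "success_rate": np.mean(task_results),
            "avg_steps": np.mean(step_counts)
        }

    return results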

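Cost-Efficiency Evaluation

import numpy as np

def evaluate_cost_efficiency(model, tasks: list):
    """
    Cost-efficiency evaluation:
    maximize performance within a given budget.
    """
    budget_range = [0.01, 0.1, 1.0, 10.0]  # US dollars per query

    results = []

    for budget in budget_range:
        performances = []
        costs = []

        for task in tasks:
            # Pick the best strategy for this task under the budget.
            # select_efficient_strategy, execute_strategy, get_cost and
            # evaluate_response are user-supplied helpers.
            strategy = select_efficient_strategy(task, budget)

            start_cost = get_cost()
            response = execute_strategy(model, task, strategy)
            actual_cost = get_cost() - start_cost

            performance = evaluate_response(response, task)
            performances.append(performance)
            costs.append(actual_cost)

        avg_cost = np.mean(costs)
        results.append({
            "budget": budget,
            "avg_performance": np.mean(performances),
            "avg_cost": avg_cost,
            # performance per dollar; undefined when nothing was spent
            "efficiency": np.mean(performances) / avg_cost if avg_cost else float("nan")
        })

    return results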

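Comparison of Major Benchmarks

| Benchmark | Tasks | Samples | Main capability | Update frequency |
|---|---|---|---|---|
| MMLU | 57 | 14,042 | Knowledge understanding | Static |
| BBH | 23 | ~6,500 | Complex reasoning | Static |
| GSM8K | 1 | ~8,500 | Math | Static |
| MATH | 7 | 12,500 | Competition math | Static |
| HumanEval | 1 | 164 | Code generation | Static |
| TruthfulQA | 1 | 817 | Truthfulness | Static |
| HellaSwag | 1 | 10,042 (validation split) | Commonsense reasoning | Static |
| ARC | 2 | 7,787 | Science reasoning | Static |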

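Summary

LLM evaluation is core infrastructure for model development and deployment. Key points:

  1. Multi-dimensional assessment: no single benchmark measures a model fully, so knowledge, reasoning, code, safety, and other dimensions must be combined
  2. Evolving evaluation: from multiple choice to open-ended generation, and from static benchmarks to dynamic evaluation
  3. Chinese evaluation: CMMLU, C-Eval, and similar benchmarks fill the gap in evaluating Chinese-language models
  4. Methodology: choose the evaluation method to fit the task type and available resources
  5. Trends: agent capability, multimodality, and cost efficiency are the new evaluation directions

Evaluation keeps evolving as LLM technology advances. A rigorous and comprehensive evaluation culture is key to driving that progress.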

Footnotes

  1. Chang et al. "A Survey on Evaluation of Large Language Models". arXiv:2307.03109, 2023.

  2. Hendrycks et al. "Measuring Massive Multitask Language Understanding". ICLR 2021.

  3. Suzgun et al. "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them". arXiv:2210.09261, 2022.

  4. Zellers et al. "HellaSwag: Can a Machine Really Finish Your Sentence?". ACL 2019.

  5. Clark et al. "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge". arXiv:1803.05457, 2018.

  6. Cobbe et al. "Training Verifiers to Solve Math Word Problems". arXiv:2110.14168, 2021.

  7. Hendrycks et al. "Measuring Mathematical Problem Solving With the MATH Dataset". NeurIPS 2021 Datasets and Benchmarks Track.

  8. Liu et al. "LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning". IJCAI 2020.

  9. Chen et al. "Evaluating Large Language Models Trained on Code". arXiv:2107.03374, 2021.

  10. Joshi et al. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension". ACL 2017.

  11. Lin et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods". ACL 2022.

  12. Parrish et al. "BBQ: A Hand-built Bias Benchmark for Question Answering". Findings of ACL 2022.

  13. Li et al. "CMMLU: Measuring massive multitask language understanding in Chinese". arXiv:2306.09212, 2023.

  14. Huang et al. "C-Eval: A Multi-Level Multi-Domain Chinese Evaluation Benchmark for Large Language Models". arXiv:2305.08322, 2023.