DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)

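Overview

The evolution from PPO to GRPO to DAPO

Before GRPO, PPO (Proximal Policy Optimization) was the dominant algorithm for reinforcement-learning-based alignment.¹ PPO, however, has to maintain both a policy network and a value network, which is a heavy computational burden when training large language models with billions of parameters. GRPO replaces the value function with group-relative rewards, which greatly simplifies the training pipeline.²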

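DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) was proposed in 2025 by a team at ByteDance and is a recent step along this line. Rather than a hyperparameter tweak or a pile of small tricks, it is a systematic set of algorithmic improvements. Its core insight is that an LLM response is a discrete sequence of tokens, and this property makes a finer-grained optimization strategy possible.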

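DAPO's core observations

The DAPO team identified four key problems and proposed a targeted fix for each:

Interference from extreme groups: a response group that is entirely correct or entirely wrong provides no useful learning signal
A symmetric clip range: standard PPO/GRPO clips the ratio with the same $\epsilon$ on both sides, which leaves little headroom on the upper side
Imbalanced sampling efficiency: frequently sampled prompts are learned quickly, rarely sampled prompts slowly
Coarse response-level optimization: the whole response receives one advantage value, which ignores token-level differences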

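Four core techniques

Filtering

Problem analysis

In GRPO we sample $G$ responses per question. When all $G$ responses are correct, or all are wrong, the group-normalized advantage is zero (or close to zero) for every response:

$$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G} \approx 0 \quad \text{when } r_1 = r_2 = \dots = r_G$$

In this case the numerator is exactly zero (implementations add a small constant to $\sigma_G$ to avoid dividing by zero), so the gradient signal vanishes and the model learns nothing from the group.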
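
The vanishing signal can be checked numerically. The following is a minimal sketch, not DAPO's reference code: the function name `group_advantages` and the constant `eps` are assumptions made for illustration.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Group-normalized advantage: (r_i - mean) / (std + eps).
// eps is an assumed small constant that avoids division by zero.
std::vector<double> group_advantages(const std::vector<double>& rewards,
                                     double eps = 1e-8) {
    if (rewards.empty()) return {};
    const double n = static_cast<double>(rewards.size());
    const double mean = std::accumulate(rewards.begin(), rewards.end(), 0.0) / n;

    double sq_sum = 0.0;
    for (double r : rewards) sq_sum += (r - mean) * (r - mean);
    const double stddev = std::sqrt(sq_sum / n);  // population standard deviation

    std::vector<double> adv;
    adv.reserve(rewards.size());
    for (double r : rewards) adv.push_back((r - mean) / (stddev + eps));
    return adv;
}
```

For a group with rewards `{1, 1, 1, 1}` every advantage is exactly zero, while a mixed group such as `{1, 0, 1, 0}` yields advantages close to +1 and -1.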

Filtering strategy

DAPO adds a filter that checks the diversity of each response group before advantages are computed:

// DAPO filtering (sketch)
#include <cstddef>
#include <utility>
#include <vector>

using TokenSeq = std::vector<int>;

// Returns the responses and rewards of one group (positives first), or an
// empty pair when the group is uniform and should be skipped.
std::pair<std::vector<TokenSeq>, std::vector<double>> filter_and_prepare(
    const std::vector<TokenSeq>& responses,
    const std::vector<double>& rewards,
    double reward_threshold = 0.5) {

    // Split the responses into positive and negative samples
    std::vector<TokenSeq> pos_responses;
    std::vector<TokenSeq> neg_responses;
    std::vector<double> pos_rewards, neg_rewards;

    for (std::size_t i = 0; i < responses.size() && i < rewards.size(); ++i) {
        if (rewards[i] >= reward_threshold) {
            pos_responses.push_back(responses[i]);
            pos_rewards.push_back(rewards[i]);
        } else {
            neg_responses.push_back(responses[i]);
            neg_rewards.push_back(rewards[i]);
        }
    }

    // If every response falls in the same class, drop the group
    if (pos_responses.empty() || neg_responses.empty()) {
        return {};  // empty group; the caller skips it
    }

    // Merge both classes; advantages are recomputed on this group afterwards
    std::vector<TokenSeq> filtered_responses;
    std::vector<double> all_rewards;
    filtered_responses.insert(filtered_responses.end(), pos_responses.begin(), pos_responses.end());
    filtered_responses.insert(filtered_responses.end(), neg_responses.begin(), neg_responses.end());
    all_rewards.insert(all_rewards.end(), pos_rewards.begin(), pos_rewards.end());
    all_rewards.insert(all_rewards.end(), neg_rewards.begin(), neg_rewards.end());

    return {filtered_responses, all_rewards};
}

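Mathematical form

After filtering, the advantage becomes:

$$\hat{A}_i^{\text{filtered}} = \begin{cases} \dfrac{r_i - \mu_{\text{filtered}}}{\sigma_{\text{filtered}}} & \text{if the filtered group} \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

where $\mu_{\text{filtered}}$ and $\sigma_{\text{filtered}}$ are computed only over the filtered response group.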
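
The piecewise definition can be sketched as a single function. This is an illustrative sketch under stated assumptions: the name `filtered_advantages`, the default `threshold`, and the constant `eps` are not taken from the DAPO paper, and a dropped group is represented by an empty result instead of explicit zeros.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Returns per-response advantages for one group, or an empty vector when the
// group is dropped because every reward falls on the same side of `threshold`.
std::vector<double> filtered_advantages(const std::vector<double>& rewards,
                                        double threshold = 0.5,
                                        double eps = 1e-8) {
    const auto is_pos = [threshold](double r) { return r >= threshold; };
    const bool any_pos = std::any_of(rewards.begin(), rewards.end(), is_pos);
    const bool all_pos = std::all_of(rewards.begin(), rewards.end(), is_pos);
    if (!any_pos || all_pos) return {};  // uniform group: no learning signal

    const double n = static_cast<double>(rewards.size());
    const double mean = std::accumulate(rewards.begin(), rewards.end(), 0.0) / n;
    double sq_sum = 0.0;
    for (double r : rewards) sq_sum += (r - mean) * (r - mean);
    const double stddev = std::sqrt(sq_sum / n);

    std::vector<double> adv;
    adv.reserve(rewards.size());
    for (double r : rewards) adv.push_back((r - mean) / (stddev + eps));
    return adv;
}
```

Returning an empty vector lets the caller skip the group entirely, which has the same effect on the loss as assigning every response an advantage of zero.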

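Effect

Filtering ensures that every group reaching the loss carries a non-zero learning signal, which improves training stability and final performance.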

Clip-Higher

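Asymmetric effect of the standard clip

In standard PPO/GRPO the policy ratio is clipped to the interval $[1 - \epsilon, 1 + \epsilon]$:

$$L^{\text{CLIP}} = -\min\left(\frac{\pi_\theta}{\pi_{\text{old}}}\hat{A},\ \operatorname{clip}\left(\frac{\pi_\theta}{\pi_{\text{old}}},\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}\right)$$

When $\hat{A} > 0$ (positive advantage), the clip is active for $\frac{\pi_\theta}{\pi_{\text{old}}} > 1 + \epsilon$, which stops the policy from becoming too aggressive.

When $\hat{A} < 0$ (negative advantage), the clip is active for $\frac{\pi_\theta}{\pi_{\text{old}}} < 1 - \epsilon$, which stops the policy from becoming too conservative.

Problem: for negative-advantage responses the model may over-suppress certain behaviors, so responses become too short or too conservative.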

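DAPO's fix: an asymmetric clip

DAPO uses an asymmetric clip interval $[1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}}]$:

$$L^{\text{DAPO-CLIP}} = -\min\left(\frac{\pi_\theta}{\pi_{\text{old}}}\hat{A},\ \operatorname{clip}\left(\frac{\pi_\theta}{\pi_{\text{old}}},\, 1 - \epsilon_{\text{low}},\, 1 + \epsilon_{\text{high}}\right)\hat{A}\right)$$

A typical setting is $\epsilon_{\text{low}} = 0.2$ and $\epsilon_{\text{high}} = 0.3$ or higher.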

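Theoretical analysis

Consider the gradient in the positive-advantage case. The lower bound never binds when $\hat{A} > 0$, so the only active limit is the upper one:

$$\nabla_\theta L^{\text{CLIP}} \approx \begin{cases} -\hat{A} \cdot \nabla_\theta \dfrac{\pi_\theta}{\pi_{\text{old}}} & \text{if } \dfrac{\pi_\theta}{\pi_{\text{old}}} \le 1 + \epsilon_{\text{high}} \\ 0 & \text{otherwise} \end{cases}$$

The asymmetric clip allows a larger policy increase for positive advantages while keeping a moderate constraint for negative advantages. This matches intuition: for good responses we want the model to generate them with more confidence.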

Implementation

#include <torch/torch.h>

// Clipped policy loss with separate lower and upper clip widths.
torch::Tensor dapo_clip_loss(
    const torch::Tensor& ratio,
    const torch::Tensor& advantages,
    double eps_low = 0.2,
    double eps_high = 0.3) {

    // Asymmetric clip
    torch::Tensor clipped_ratio = torch::clamp(
        ratio,
        1.0 - eps_low,  // lower bound unchanged
        1.0 + eps_high  // upper bound widened
    );

    // Element-wise minimum, as in PPO/GRPO
    torch::Tensor unclipped = ratio * advantages;
    torch::Tensor clipped = clipped_ratio * advantages;

    torch::Tensor loss = -torch::min(unclipped, clipped);

    return loss.mean();
}

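Dynamic Sampling

The sampling-frequency problem

In standard RL training every prompt is sampled with the same probability at every training step. As a result:

Easy prompts: the model quickly reaches high accuracy on them, yet they still consume a large share of training resources
Hard prompts: the model learns them slowly, yet they provide the most valuable learning signal

DAPO's solution

DAPO introduces a dynamic sampling rate that adjusts how often each prompt is sampled according to the model's performance on it:

$$P(q_i) \propto \operatorname{uncertainty}(q_i) = 1 - \frac{\max_j r_{ij}}{\sum_j r_{ij}}$$

where $r_{ij}$ is the reward of the $j$-th sample for the $i$-th prompt.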

#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

class DynamicSampler {
public:
    DynamicSampler(int batch_size, int num_prompts, double alpha = 0.7)
        : batch_size_(batch_size), alpha_(alpha), gen_(std::random_device{}()) {
        // Start from uniform sampling weights
        sampling_probs_.assign(num_prompts, 1.0 / num_prompts);
        sample_counts_.assign(num_prompts, 0);
    }

    // `uncertainties` must hold one non-negative value per prompt.
    std::vector<int> sample_prompts(
        const std::vector<double>& uncertainties) {

        // Temper the uncertainties with the exponent alpha
        double total_uncertainty = 0.0;
        for (double u : uncertainties) {
            total_uncertainty += std::pow(u, alpha_);
        }

        // Blend the new distribution with the previous one for smoothing.
        // If every uncertainty is zero, keep the previous probabilities.
        if (total_uncertainty > 0.0) {
            for (std::size_t i = 0;
                 i < uncertainties.size() && i < sampling_probs_.size(); ++i) {
                double new_prob = std::pow(uncertainties[i], alpha_) / total_uncertainty;
                sampling_probs_[i] = kMix * new_prob + (1.0 - kMix) * sampling_probs_[i];
            }
        }

        // Renormalize
        double sum = std::accumulate(sampling_probs_.begin(), sampling_probs_.end(), 0.0);
        for (double& p : sampling_probs_) p /= sum;

        // Draw batch_size prompt indices according to the probabilities
        std::vector<int> sampled;
        std::discrete_distribution<int> dist(sampling_probs_.begin(), sampling_probs_.end());

        for (int i = 0; i < batch_size_; ++i) {
            sampled.push_back(dist(gen_));
            sample_counts_[sampled.back()]++;
        }

        return sampled;
    }

private:
    static constexpr double kMix = 0.7;  // weight of the new distribution in the blend
    int batch_size_;
    double alpha_;
    std::vector<double> sampling_probs_;
    std::vector<int> sample_counts_;
    std::mt19937 gen_;  // one generator reused across calls
};

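Effect

Dynamic sampling gives hard samples more training opportunities and avoids spending compute on easy samples the model has already mastered.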

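Token-level Policy

Response level vs. token level

Standard GRPO computes the advantage at the response level:

$$\hat{A}_{\text{response}} = \operatorname{reward}(\text{response}) - b$$

Every token in the response therefore receives the same advantage, even though different tokens may contribute very differently to the final reward.

DAPO's token-level advantage

DAPO distributes the advantage over the token positions.

For a sequence $y = (y_1, y_2, \dots, y_T)$, define the token-level advantage:

$$\hat{A}_t^{\text{token}} = \gamma^{T - t} \cdot \hat{A}_{\text{response}}$$

where $\gamma \in (0, 1)$ is a decay factor, so the contribution of early tokens is attenuated.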
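
The decay schedule is easy to verify on a short sequence. A minimal sketch using only the standard library follows; the function name `token_advantages` is an assumption for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Token-level advantages A_t = gamma^(T - t) * A_response for t = 1..T.
std::vector<double> token_advantages(double response_advantage, int T,
                                     double gamma = 0.95) {
    std::vector<double> adv;
    if (T <= 0) return adv;
    adv.reserve(static_cast<std::size_t>(T));
    for (int t = 1; t <= T; ++t) {
        adv.push_back(std::pow(gamma, T - t) * response_advantage);
    }
    return adv;
}
```

For $\hat{A}_{\text{response}} = 2$, $T = 3$ and $\gamma = 0.5$ this gives 0.5, 1.0 and 2.0: the last token keeps the full advantage and earlier tokens receive geometrically less.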

Importance-sampling weighting

A finer-grained variant uses importance weighting:

$$\hat{A}_t^{\text{token}} = \frac{\sum_{\tau = t}^{T} \nabla_\theta \log \pi_\theta(y_\tau \mid y_{<\tau}, q)}{\sum_{\tau = t}^{T} 1} \cdot \hat{A}_{\text{response}}$$

#include <torch/torch.h>

// Spread a response-level advantage over tokens with weights gamma^(T - t).
torch::Tensor token_level_advantage(
    const torch::Tensor& log_probs,      // [batch, seq_len]
    const torch::Tensor& advantages,     // [batch]
    double gamma = 0.95) {

    const int64_t batch_size = log_probs.size(0);
    const int64_t seq_len = log_probs.size(1);

    // Decay weights: exponents run seq_len-1, ..., 0, so the last token has weight 1
    torch::Tensor decay = torch::pow(
        torch::full({seq_len}, gamma, log_probs.options()),
        torch::arange(seq_len, log_probs.options()).flip({0})
    );

    // Expand the advantages along the sequence dimension
    torch::Tensor adv_expanded = advantages.unsqueeze(1).expand({batch_size, seq_len});

    // Token-level advantages, shape [batch, seq_len]
    torch::Tensor token_adv = decay * adv_expanded;

    return token_adv;
}

// Used when computing the policy-gradient loss
torch::Tensor dapo_token_loss(
    const torch::Tensor& log_probs,
    const torch::Tensor& old_log_probs,
    const torch::Tensor& token_advantages,
    const torch::Tensor& mask,           // 1 for response tokens, 0 for padding
    double eps_low = 0.2,
    double eps_high = 0.3) {

    // Policy ratio
    torch::Tensor ratio = torch::exp(log_probs - old_log_probs);

    // Asymmetric clip
    torch::Tensor clipped_ratio = torch::clamp(ratio, 1.0 - eps_low, 1.0 + eps_high);

    // Token-level loss, with padding masked out
    torch::Tensor unclipped = ratio * token_advantages * mask;
    torch::Tensor clipped = clipped_ratio * token_advantages * mask;

    // Average over all valid tokens in the batch
    torch::Tensor loss = -torch::min(unclipped, clipped).sum() / mask.sum();

    return loss;
}

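Theoretical analysis

Gradient derivation for DAPO

Consider the DAPO loss for a single response:

$$L_{\text{DAPO}} = -\mathbb{E}_{q,\, y \sim \pi_{\text{old}}}\left[\sum_{t=1}^{T} \min\big(r_t \hat{A}_t,\ \operatorname{clip}(r_t,\, 1 - \epsilon_{\text{low}},\, 1 + \epsilon_{\text{high}})\, \hat{A}_t\big)\right]$$

where $r_t = \frac{\pi_\theta(y_t \mid y_{<t}, q)}{\pi_{\text{old}}(y_t \mid y_{<t}, q)}$ is the token-level policy ratio.

Taking the gradient with respect to $\theta$:

$$\nabla_\theta L_{\text{DAPO}} = -\mathbb{E}_{q, y}\left[\sum_{t=1}^{T} g_t\right]$$

where

$$g_t = \begin{cases} r_t \hat{A}_t \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, q) & \text{if } r_t \in [1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}}] \\ r_t \hat{A}_t \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, q) & \text{if } r_t \notin [1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}}] \text{ and } \hat{A}_t \cdot (r_t - \operatorname{clip}(r_t)) < 0 \\ 0 & \text{otherwise (clip active)} \end{cases}$$

The clipped branch is constant in $\theta$, so its gradient is zero; a token contributes only when the unclipped branch of the $\min$ is selected. The factor $r_t$ comes from $\nabla_\theta r_t = r_t \nabla_\theta \log \pi_\theta$.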

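Comparison with GRPO

Property	GRPO	DAPO
Advantage computation	Response-level group normalization	Token-level + filtering
Clip range	Symmetric $[1 - \epsilon, 1 + \epsilon]$	Asymmetric $[1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}}]$
Sampling strategy	Uniform sampling	Dynamic sampling (uncertainty-based)
Handling of extreme groups	Accepts zero gradient	Filters them out
Theoretical guarantee	Direct extension of PPO	Tighter KL constraint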

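KL divergence analysis

DAPO constrains the policy update in a finer-grained way than GRPO. Define the cumulative KL divergence:

$$D_{\text{KL}}^{\text{cum}}(\pi_{\text{new}} \,\|\, \pi_{\text{old}}) = \mathbb{E}_q\left[\sum_t D_{\text{KL}}\big(\pi_{\text{new}}(\cdot \mid y_{<t}, q) \,\|\, \pi_{\text{old}}(\cdot \mid y_{<t}, q)\big)\right]$$

DAPO's token-level asymmetric clip can be viewed as an adaptive constraint on the cumulative KL, such that:

For positive advantages, a larger KL increase is allowed (encouraging exploration in good directions)
For negative advantages, the KL increase is tightly limited (avoiding excessive deviation)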
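
As an illustration of the cumulative KL for one sampled response, the sketch below sums per-position KL terms between two sets of categorical next-token distributions. The alias `Dist` and both function names are assumptions; real implementations would work on logits in batched tensors.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Dist = std::vector<double>;  // one categorical distribution over the vocabulary

// KL(p || q) for two categorical distributions of equal size.
// Terms with p[k] == 0 contribute 0; q[k] is assumed positive wherever p[k] > 0.
double kl_divergence(const Dist& p, const Dist& q) {
    double kl = 0.0;
    const std::size_t n = std::min(p.size(), q.size());
    for (std::size_t k = 0; k < n; ++k) {
        if (p[k] > 0.0) kl += p[k] * std::log(p[k] / q[k]);
    }
    return kl;
}

// Sum of the per-position KL terms along one response.
double cumulative_kl(const std::vector<Dist>& new_policy,
                     const std::vector<Dist>& old_policy) {
    double total = 0.0;
    const std::size_t T = std::min(new_policy.size(), old_policy.size());
    for (std::size_t t = 0; t < T; ++t) {
        total += kl_divergence(new_policy[t], old_policy[t]);
    }
    return total;
}
```

Averaging `cumulative_kl` over prompts gives a Monte Carlo estimate of the expectation in the definition above.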

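Experimental results

Benchmark results

DAPO improves results on several benchmarks:

Model	MATH	GSM8K	MMLU
Base	42.3%	76.1%	65.2%
SFT	56.8%	85.4%	68.9%
GRPO	61.2%	89.7%	70.3%
DAPO	66.5%	92.1%	72.8%

DAPO outperforms GRPO on all three benchmarks (+5.3 points on MATH, +2.4 on GSM8K, +2.5 on MMLU), with the largest gain on MATH, the most challenging of the three.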

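Training stability

The four techniques work together to improve training stability:

Filtering removes the zero-gradient cases
The asymmetric clip allows better exploration in the positive direction
Dynamic sampling ensures hard samples are trained sufficiently
The token-level policy provides a finer-grained gradient signal

Training curves show that DAPO converges more smoothly and that final performance has lower variance.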

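Ablation study

Configuration	Incremental MATH gain	Cumulative gain over GRPO
GRPO baseline	-	-
+ Filtering	+1.8%	+1.8%
+ Clip-Higher	+1.2%	+3.0%
+ Dynamic sampling	+1.5%	+4.5%
+ Token-level policy	+1.0%	+5.5%
All	+5.3%	+5.3%

Each technique helps on its own, and combining them gives the largest gain.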

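Generalization

Models trained with DAPO generalize better to out-of-distribution tasks:

Math problems from different domains: from elementary to advanced mathematics
Extrapolation in reasoning length: responses are ≤ 512 tokens during training and can reach 1024+ tokens at test time
Adversarial samples: more robust to perturbations of math problems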

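Practical guide

Hyperparameter settings

Recommended hyperparameter configuration for DAPO:

Hyperparameter	Recommended value	Notes
$\epsilon_{\text{low}}$	0.15 ~ 0.25	Lower clip strength
$\epsilon_{\text{high}}$	0.25 ~ 0.4	Upper clip strength (may exceed $\epsilon_{\text{low}}$)
Group size $G$	8 ~ 16	Effective group size after filtering
Filter threshold	0.5	Threshold separating positive and negative samples
$\gamma$ (decay)	0.9 ~ 0.98	Decay factor for token-level advantages
$\alpha$ (sampling mix)	0.6 ~ 0.8	Uncertainty weight in dynamic sampling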
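
The table can be captured as a small configuration type with a range check. This is a sketch only: the struct name `DAPOHyperParams`, its field names, and the chosen defaults (midpoints or typical values inside each recommended range) are assumptions, not part of any published DAPO code.

```cpp
// Hypothetical container for the hyperparameters in the table above.
struct DAPOHyperParams {
    double eps_low = 0.2;           // lower clip strength
    double eps_high = 0.3;          // upper clip strength
    int group_size = 8;             // G, responses sampled per prompt
    double filter_threshold = 0.5;  // splits positive and negative samples
    double gamma = 0.95;            // token-level decay factor
    double alpha = 0.7;             // uncertainty weight in dynamic sampling
};

// True when every value lies inside the recommended range from the table.
bool in_recommended_range(const DAPOHyperParams& p) {
    return p.eps_low >= 0.15 && p.eps_low <= 0.25 &&
           p.eps_high >= 0.25 && p.eps_high <= 0.4 &&
           p.eps_high >= p.eps_low &&
           p.group_size >= 8 && p.group_size <= 16 &&
           p.gamma >= 0.9 && p.gamma <= 0.98 &&
           p.alpha >= 0.6 && p.alpha <= 0.8;
}
```

A check like this is convenient at startup, for example to log a warning when a sweep moves a value outside the recommended range.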

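Training tips

1. Enable the techniques progressively

Introduce the DAPO techniques step by step, in this order:

Start with filtering (the simplest, with stable gains)
Then add Clip-Higher (adjust the learning rate as needed)
Next add dynamic sampling (monitor the sampling distribution)
Finally enable the token-level policy (it may need more tuning)

2. Learning-rate adjustment

The token-level policy and the asymmetric clip change the effective step size, so:

Set the initial learning rate to 0.8 ~ 1.0 times the value used for GRPO
Monitor the KL divergence during training and adjust accordingly

3. Reward model quality

DAPO is more sensitive to reward model quality because:

Filtering relies on accurate reward judgments
Token-level advantages amplify reward noise

Make sure the reward model is highly accurate (> 90%) on the target distribution.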

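Combining with other methods

DAPO + Process Reward Model (PRM)

DAPO's token-level policy combines naturally with a PRM:

$$\hat{A}_t^{\text{PRM}} = r_t^{\text{process}} - b_t$$

where $r_t^{\text{process}}$ is the reward of an intermediate step, not the final response-level reward.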

DAPO + Constitutional AI

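DAPO's filtering can be extended to several dimensions:

Correctness filter: keep some incorrect responses (they serve as negative samples)
Harmfulness filter: remove harmful responses
Length filter: remove responses that are too long or too short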
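
One way to read the three filters together is as a per-response predicate followed by the usual group-level check. The sketch below is hypothetical throughout: the `Candidate` struct, its fields, and `multi_filter` are illustrative names, and the correctness and harmfulness flags are assumed to come from external checkers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-response annotations; none of these names come from DAPO.
struct Candidate {
    bool correct;    // passed the correctness check
    bool harmful;    // flagged by a safety check
    int num_tokens;  // response length in tokens
};

// Returns the indices of the responses to keep, or an empty vector when the
// surviving group no longer contains both correct and incorrect responses.
std::vector<std::size_t> multi_filter(const std::vector<Candidate>& group,
                                      int min_tokens, int max_tokens) {
    std::vector<std::size_t> kept;
    bool has_correct = false;
    bool has_incorrect = false;
    for (std::size_t i = 0; i < group.size(); ++i) {
        const Candidate& c = group[i];
        if (c.harmful) continue;                                      // harmfulness
        if (c.num_tokens < min_tokens || c.num_tokens > max_tokens) continue;  // length
        kept.push_back(i);
        if (c.correct) {
            has_correct = true;
        } else {
            has_incorrect = true;
        }
    }
    if (!has_correct || !has_incorrect) kept.clear();  // no contrast left
    return kept;
}
```

Applying the group-level check after the per-response filters matters: removing harmful or overlong responses can leave a group uniform, in which case it carries no learning signal.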

DAPO + RLAIF

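When human feedback is unavailable, AI feedback can be used instead:

// Sketch only: policy_, llm_judge_, num_steps_ and the helpers
// build_judge_prompt, parse_rewards, filter, compute_token_advantages and
// update_policy are assumed to be declared elsewhere.
class DAPORLAIF {
public:
    // Use an LLM judge to score the responses to one question
    torch::Tensor get_ai_feedback(
        const std::string& question,
        const std::vector<std::string>& responses) {

        // Build the judging prompt
        std::string judge_prompt = build_judge_prompt(question, responses);

        // Query the judge
        auto judgments = llm_judge_->generate(judge_prompt);

        // Parse the judgments into numeric rewards
        return parse_rewards(judgments);
    }

    // DAPO training loop
    void train_with_rlaif(const std::vector<std::string>& prompts) {
        for (int step = 0; step < num_steps_; ++step) {
            for (const auto& prompt : prompts) {
                // Sample a group of responses for this prompt
                auto responses = policy_->sample_batch(prompt);

                // Get AI feedback for the group
                auto rewards = get_ai_feedback(prompt, responses);

                // DAPO filtering
                auto [filtered_responses, filtered_rewards] = filter(responses, rewards);

                // Skip groups without a learning signal
                if (filtered_responses.empty()) continue;

                // Compute token-level advantages
                auto token_adv = compute_token_advantages(filtered_rewards);

                // Policy update
                update_policy(filtered_responses, token_adv);
            }
        }
    }
};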

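Implementation

Complete training loop

#include <torch/torch.h>
#include <cmath>
#include <cstdint>
#include <memory>
#include <numeric>
#include <string>
#include <utility>
#include <vector>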
 
// Sketch of a DAPO trainer. PolicyModel, ReferenceModel, RewardModel, Optimizer
// and DAPOConfig are assumed to be defined elsewhere; DynamicSampler is the
// class from the dynamic sampling section. The config fields batch_size,
// num_prompts and alpha are assumptions needed to construct the sampler.
class DAPOTrainer {
public:
    DAPOTrainer(
        std::shared_ptr<PolicyModel> policy,
        std::shared_ptr<ReferenceModel> ref_model,
        std::shared_ptr<RewardModel> reward_model,
        std::unique_ptr<Optimizer> optimizer,
        const DAPOConfig& config)
        : policy_(std::move(policy)), ref_model_(std::move(ref_model)),
          reward_model_(std::move(reward_model)), config_(config),
          optimizer_(std::move(optimizer)),
          dynamic_sampler_(config.batch_size, config.num_prompts, config.alpha),
          uncertainties_(config.num_prompts, 1.0) {}

    void train_step(const std::vector<std::string>& prompts) {
        const int G = config_.group_size;

        // 1. Dynamic sampling (optional)
        std::vector<int> sampled_indices;
        if (config_.use_dynamic_sampling) {
            // Assumes prompts.size() == config_.num_prompts and that
            // uncertainties_ is refreshed elsewhere from recent rewards.
            sampled_indices = dynamic_sampler_.sample_prompts(uncertainties_);
        } else {
            sampled_indices.resize(prompts.size());
            std::iota(sampled_indices.begin(), sampled_indices.end(), 0);
        }
        const int B = static_cast<int>(sampled_indices.size());

        // 2. Sample G responses per prompt and score them
        std::vector<std::vector<std::vector<int>>> all_responses(B);
        std::vector<std::vector<double>> all_rewards(B);
        std::vector<std::vector<torch::Tensor>> all_log_probs(B);

        for (int i = 0; i < B; ++i) {
            const auto& prompt = prompts[sampled_indices[i]];

            for (int g = 0; g < G; ++g) {
                // Sample a response; log_prob is assumed to be per token, shape [T]
                auto response = policy_->sample(prompt);
                all_responses[i].push_back(response.tokens);
                all_log_probs[i].push_back(response.log_prob);

                // Score it with the reward model
                double reward = reward_model_->score(prompt, response.text);
                all_rewards[i].push_back(reward);
            }
        }

        // 3. DAPO filtering
        std::vector<double> filtered_rewards;
        std::vector<torch::Tensor> filtered_log_probs;
        std::vector<torch::Tensor> filtered_old_log_probs;
        std::vector<torch::Tensor> filtered_advantages;
        std::vector<int> filtered_lengths;

        for (int i = 0; i < B; ++i) {
            auto filtered = dapo_filter_(all_rewards[i], config_.filter_threshold);

            if (filtered.indices.empty()) continue;  // skip groups with no signal

            // Group normalization
            auto normalized_adv = normalize_advantages(filtered.rewards);

            for (size_t j = 0; j < filtered.indices.size(); ++j) {
                const size_t idx = filtered.indices[j];
                filtered_rewards.push_back(filtered.rewards[j]);
                filtered_log_probs.push_back(all_log_probs[i][idx]);
                // Single on-policy update: the old log-probs are a detached copy
                filtered_old_log_probs.push_back(all_log_probs[i][idx].detach());
                filtered_advantages.push_back(normalized_adv[j]);
                filtered_lengths.push_back(static_cast<int>(all_responses[i][idx].size()));
            }
        }

        if (filtered_rewards.empty()) return;

        // 4. Token-level advantages
        auto token_advantages = compute_token_level_advantages(
            filtered_advantages, filtered_lengths, config_.gamma);

        // 5. DAPO loss
        torch::Tensor loss = torch::zeros({});
        for (size_t i = 0; i < filtered_log_probs.size(); ++i) {
            // Policy ratio
            torch::Tensor ratio = torch::exp(filtered_log_probs[i] - filtered_old_log_probs[i]);

            // Asymmetric clip
            torch::Tensor clipped_ratio = torch::clamp(
                ratio,
                1.0 - config_.eps_low,
                1.0 + config_.eps_high
            );

            // Token-level loss
            torch::Tensor unclipped = ratio * token_advantages[i];
            torch::Tensor clipped = clipped_ratio * token_advantages[i];

            loss = loss - torch::min(unclipped, clipped).sum();
        }
        loss = loss / static_cast<double>(filtered_log_probs.size());

        // 6. KL penalty (optional)
        if (config_.kl_coef > 0) {
            torch::Tensor kl_loss = compute_kl_divergence(policy_, ref_model_);
            loss = loss + config_.kl_coef * kl_loss;
        }

        // 7. Backpropagation
        optimizer_->zero_grad();
        loss.backward();
        torch::nn::utils::clip_grad_norm_(policy_->parameters(), 1.0);
        optimizer_->step();
    }

private:
    std::shared_ptr<PolicyModel> policy_;
    std::shared_ptr<ReferenceModel> ref_model_;
    std::shared_ptr<RewardModel> reward_model_;
    DAPOConfig config_;
    std::unique_ptr<Optimizer> optimizer_;
    DynamicSampler dynamic_sampler_;
    std::vector<double> uncertainties_;  // one value per prompt in the pool

    // Result of filtering one group: rewards[j] belongs to response indices[j]
    struct FilteredGroup {
        std::vector<double> rewards;
        std::vector<size_t> indices;
    };

    FilteredGroup dapo_filter_(const std::vector<double>& rewards, double threshold) {
        std::vector<size_t> pos_idx, neg_idx;
        for (size_t i = 0; i < rewards.size(); ++i) {
            (rewards[i] >= threshold ? pos_idx : neg_idx).push_back(i);
        }

        // The group must contain both positive and negative samples
        if (pos_idx.empty() || neg_idx.empty()) {
            return {};
        }

        // Keep indices and rewards aligned: positives first, then negatives
        FilteredGroup result;
        for (size_t idx : pos_idx) {
            result.indices.push_back(idx);
            result.rewards.push_back(rewards[idx]);
        }
        for (size_t idx : neg_idx) {
            result.indices.push_back(idx);
            result.rewards.push_back(rewards[idx]);
        }
        return result;
    }

    torch::Tensor normalize_advantages(const std::vector<double>& rewards) {
        double mean = std::accumulate(rewards.begin(), rewards.end(), 0.0) / rewards.size();
        double sq_sum = 0;
        for (double r : rewards) sq_sum += (r - mean) * (r - mean);
        double stddev = std::sqrt(sq_sum / rewards.size()) + 1e-8;

        torch::Tensor adv = torch::zeros({static_cast<int64_t>(rewards.size())});
        for (size_t i = 0; i < rewards.size(); ++i) {
            adv[i] = (rewards[i] - mean) / stddev;
        }
        return adv;
    }

    // Responses have different lengths, so return one tensor per response
    std::vector<torch::Tensor> compute_token_level_advantages(
        const std::vector<torch::Tensor>& advantages,
        const std::vector<int>& lengths,
        double gamma) {

        std::vector<torch::Tensor> token_advantages;

        for (size_t i = 0; i < advantages.size(); ++i) {
            const int64_t T = lengths[i];
            // Weights gamma^(T-1), ..., gamma^0: the last token keeps the full advantage
            torch::Tensor decay = torch::pow(
                torch::full({T}, gamma, advantages[i].options()),
                torch::arange(T, advantages[i].options()).flip({0})
            );
            token_advantages.push_back(decay * advantages[i]);
        }

        return token_advantages;
    }

    torch::Tensor compute_kl_divergence(
        std::shared_ptr<PolicyModel> policy,
        std::shared_ptr<ReferenceModel> ref) {
        // KL divergence between the policy and the reference policy.
        // Placeholder: implementation details omitted.
        return torch::tensor(0.0);
    }
};

References

DAPO is an important contribution from the ByteDance team to LLM alignment; together with GRPO and PPO it traces the evolution of policy optimization methods. For more on the training pipeline, see LLM Training Pipeline.

1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
2. Shao, Z., Wang, P., Zhu, Y., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
