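Causal Inverse Reinforcement Learning and Constraint Inference

1. Inverse Reinforcement Learning Basics

1.1 The Standard IRL Problem

Forward problem: given an MDP $(S, A, P, R, γ)$, find the optimal policy $π^{*}$.

Inverse problem: given an MDP without its reward, $(S, A, P, γ)$, and expert demonstrations $D = \{τ_{1}, τ_{2}, ..., τ_{n}\}$, recover the reward function $R$.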

┌─────────────────────────────────────────────────────────────────┐
│                 Inverse Reinforcement Learning                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Forward:  (S, A, P, γ) + R   →  π*                            │
│             ↑                                                   │
│             │                                                   │
│             ▼                                                   │
│   Inverse:  (S, A, P, γ) + π*  →  R                             │
│                                                                 │
│   Issue: many R typically make π* optimal (non-identifiability) │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

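1.2 Limitations of Standard IRL

Limitation	Description	Consequence
Reward non-identifiability	Multiple reward functions can explain the same policy	The true reward cannot be recovered
Confounding	Demonstrations may be affected by confounders	Spurious correlations are learned
No causal explanation	Causal actions cannot be told apart from coincidental ones	Poor generalization
Missing constraints	The expert is assumed to be fully rational	Safety constraints cannot be handled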

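1.3 Why Causal IRL Is Needed

Core ideas of causal IRL:

Causal reward recovery: learn the causal structure of the reward rather than surface correlations
Constraint inference: infer safety constraints and preferences from demonstrations
Counterfactual correction: correct the bias introduced by confounders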

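2. Causal Constraint Inference

2.1 The Constraint Inference Problem

Definition: given expert demonstrations $D$ and a model of the environment, infer a constraint set $C$ such that the expert policy $π_{E}$ satisfies these constraints.

$$C^{*} = \arg\max_{C} \; \text{Score}(π_{E}, π_{C}) \quad \text{s.t.} \quad C \in \mathcal{C}$$

where $π_{C}$ is the optimal policy subject to the constraints $C$, and $\mathcal{C}$ is the space of candidate constraint sets.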

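2.2 Constraint Types

Constraint type	Mathematical form	Example
Safety constraint	$P(\text{safe} \mid do(\pi)) \geq 1-\epsilon$	
Efficiency constraint	$E[\text{cost}] \leq K$	Task completion time does not exceed K
Preference constraint	$P(a_1 \mid s) > P(a_2 \mid s)$	
Causal constraint	$CE(s, a) \geq δ$	The action must have a causal effect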

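2.3 Mathematical Framework for Constraint Inference

The optimization objective for constraint inference:

$$\min_{C} \; L_{constraint}(C) + λ R(C)$$

where:

$L_{constraint}$: constraint violation loss
$R(C)$: regularization term

The constraint violation loss:

$$L_{constraint}(C) = \sum_{(s, a) \in D} \max(0, g(s, a; C))^{2}$$

where $g(s, a; C) \leq 0$ means that the constraint $C$ is satisfied.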
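
As a minimal sketch of the violation loss above, assume a single scalar constraint, a speed limit, so that $g(s, a; C) = a - C$. The constraint form, the demonstration data, and the function names are all hypothetical and chosen only to make the formula concrete.

```python
def g(state, action, speed_limit):
    """Hypothetical constraint function: g <= 0 iff the speed is within the limit."""
    return action - speed_limit


def constraint_violation_loss(demos, speed_limit):
    """L_constraint(C) = sum over (s, a) in D of max(0, g(s, a; C))^2."""
    return sum(max(0.0, g(s, a, speed_limit)) ** 2 for s, a in demos)


# Each demonstration step is (state index, chosen speed).
demos = [(0, 3.0), (1, 4.5), (2, 5.0)]

loss_tight = constraint_violation_loss(demos, speed_limit=4.0)  # 0.5^2 + 1.0^2
loss_loose = constraint_violation_loss(demos, speed_limit=5.0)  # no violations
print(loss_tight, loss_loose)
```

A candidate limit of 4.0 is violated by two of the three expert steps, while a limit of 5.0 is consistent with all of them; the regularizer $R(C)$ is what keeps the optimizer from simply choosing an arbitrarily loose constraint.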

3. Inverse Constrained Reinforcement Learning (ICRL)

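3.1 Basic ICRL Framework

ICRL (Inverse Constrained RL) infers the reward function and the constraints jointly:

$$\min_{R, C} \; E_{π_{E}}[-\log P(τ \mid R)] + λ \cdot \|C\|_{1} \quad \text{s.t.} \quad π_{E} \in Π_{C}$$

where $Π_{C}$ is the set of policies that satisfy the constraints $C$.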

3.2 ICRL Algorithm Flow

┌─────────────────────────────────────────────────────────────────┐
│                       ICRL algorithm flow                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Initialize the reward function R and the constraints C     │
│                                                                 │
│   2. Alternating optimization loop:                             │
│      ┌─────────────────────────────────────────────────┐        │
│      │  a) Given R and C, run constrained RL to get    │        │
│      │     the constrained policy π_C                  │        │
│      │  b) Given π_C, update C to minimize constraint  │        │
│      │     violation                                   │        │
│      │  c) Given C, update R to maximize the           │        │
│      │     likelihood of the demonstrations            │        │
│      └─────────────────────────────────────────────────┘        │
│                                                                 │
│   3. Repeat until convergence                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

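3.3 Constraint Update Rule

Projected gradient descent:

def update_constraints(constraints, demonstrations, policy, alpha=0.01):
    """
    One projected-gradient step on the constraint parameters.

    `compute_constraint_violations` and `project_to_constraints` are helpers
    assumed to be defined elsewhere; the first is taken to return the gradient
    of the violation loss with respect to `constraints`.
    `policy` is unused here and kept only for interface compatibility.
    """
    violation_grad = compute_constraint_violations(constraints, demonstrations)

    # Gradient descent step (minimizes violation)
    constraints = constraints - alpha * violation_grad

    # Project back onto the feasible set
    constraints = project_to_constraints(constraints)

    return constraints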
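
To make the update rule concrete, here is a self-contained toy version. It assumes a single scalar constraint parameter `c` (an upper bound on the expert's action), the squared-hinge violation loss from Section 2.3, and a linear penalty `lam * c` standing in for the regularizer; the box `[0, 10]` used for projection is likewise an illustrative assumption.

```python
def violation_grad(c, actions, lam):
    """Gradient of sum_i max(0, a_i - c)^2 + lam * c with respect to c."""
    return sum(-2.0 * max(0.0, a - c) for a in actions) + lam


def project(c, lo=0.0, hi=10.0):
    """Project c onto the feasible interval [lo, hi]."""
    return min(max(c, lo), hi)


def projected_gradient_descent(actions, lam=0.1, alpha=0.05, n_steps=500):
    """Repeat: gradient step on the loss, then projection onto the feasible set."""
    c = 0.0
    for _ in range(n_steps):
        c = project(c - alpha * violation_grad(c, actions, lam))
    return c


expert_speeds = [3.0, 4.5, 5.0]
c_hat = projected_gradient_descent(expert_speeds)
print(round(c_hat, 3))
```

In this toy setting the iterate settles just below the largest expert action, where the pull of the violation term, `2 * (5.0 - c)`, balances the penalty `lam`.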

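4. Mathematical Foundations of Causal IRL

4.1 Causal Reward Function

Definition: a causal reward function $R_{c}$ satisfies:

$$R_{c}(s, a, s') = f(CE(s, a, s'))$$

where $CE$ is the causal effect of the state transition.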

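4.2 Optimization Objective of Causal IRL

Causal maximum-entropy IRL:

$$\max_{R_{c} \in \mathcal{R}_{c}} \; E_{π_{E}}[\log P(τ \mid R_{c})] - α \cdot \|R_{c}\|_{smooth}$$

where $\mathcal{R}_{c}$ is the space of causal reward functions.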

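4.3 Identifying Causal Constraints

Theorem (identifiability of causal constraints): suppose the causal graph $G$ is known. Then the constraint set $C$ can be identified from expert demonstrations if the following hold:

Sufficiency: $C$ fully explains the expert's behavior
Minimality: $C$ is the smallest set that satisfies sufficiency
Causal completeness: every causally relevant constraint is in $C$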

5. Preference-Based IRL

5.1 Preference Learning Framework

Preference data: $P = \{(s, a_{1}, a_{2}, pref), ...\}$

where $pref \in \{a_{1} ≻ a_{2}, a_{2} ≻ a_{1}, equal\}$.

5.2 The Bradley-Terry Model

Preference probability:

$$P(a_{1} ≻ a_{2} \mid s) = \frac{1}{1 + \exp(-θ^{T}(ϕ(s, a_{1}) - ϕ(s, a_{2})))}$$

where $ϕ(s, a)$ is the state-action feature vector and $θ$ is the preference parameter vector.
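
The Bradley-Terry probability can be checked numerically with a few lines of plain Python. The feature vectors and parameter values below are made up for illustration.

```python
import math


def bradley_terry_prob(theta, phi_a1, phi_a2):
    """P(a1 preferred over a2 | s) = sigmoid(theta^T (phi(s, a1) - phi(s, a2)))."""
    score = sum(t * (x1 - x2) for t, x1, x2 in zip(theta, phi_a1, phi_a2))
    return 1.0 / (1.0 + math.exp(-score))


theta = [1.0, -0.5]   # hypothetical preference parameters
phi_a1 = [2.0, 1.0]   # hypothetical features of (s, a1)
phi_a2 = [1.0, 1.0]   # hypothetical features of (s, a2)

p12 = bradley_terry_prob(theta, phi_a1, phi_a2)
p21 = bradley_terry_prob(theta, phi_a2, phi_a1)
print(p12, p21)  # the two probabilities sum to 1
```

Only the feature difference matters: the second feature is identical for both actions, so it has no influence on the preference regardless of its weight.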

5.3 Causal Preference Learning

Causal preference model:

$$P_{causal}(a_{1} ≻ a_{2} \mid s) = σ(CE(s, a_{1}) - CE(s, a_{2}) + β \cdot CF(s, a_{1}, a_{2}))$$

where:

$CE$: causal effect
$CF$: counterfactual contrast
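
A small numeric sketch of the causal preference model follows. It treats the causal effects and the counterfactual contrast as already-estimated scalars (how they are estimated is outside this snippet), and the specific numbers are hypothetical.

```python
import math


def causal_preference_prob(ce_a1, ce_a2, cf_contrast, beta=1.0):
    """sigma(CE(s, a1) - CE(s, a2) + beta * CF(s, a1, a2)) for scalar inputs."""
    logit = ce_a1 - ce_a2 + beta * cf_contrast
    return 1.0 / (1.0 + math.exp(-logit))


# Equal causal effects and no counterfactual contrast: no preference either way.
p_tie = causal_preference_prob(0.4, 0.4, cf_contrast=0.0)

# Same causal effects, but the counterfactual term favors a1.
p_cf = causal_preference_prob(0.4, 0.4, cf_contrast=0.8, beta=0.5)
print(p_tie, p_cf)
```

When the two causal effects tie, the $β \cdot CF$ term alone decides the direction of the preference, which is the role the model gives to counterfactual information.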

6. PyTorch Implementation

6.1 Causal Constraint Inferrer

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch import Tensor
from typing import Tuple, List, Optional, Dict
import numpy as np

# Causal effect estimator (reuses the module from an earlier chapter)
from causal_effect_estimator import CausalEffectEstimator


class CausalConstraintInferrer(nn.Module):
    """
    Causal constraint inferrer.
    Learns causal constraints from expert demonstrations.
    """
    def __init__(self, state_dim: int, action_dim: int,
                 n_constraints: int = 4,
                 hidden_dim: int = 128):
        super().__init__()

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.n_constraints = n_constraints

        # State encoder
        self.state_encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Action encoder
        self.action_encoder = nn.Sequential(
            nn.Linear(action_dim, hidden_dim // 2),
            nn.ReLU()
        )

        # Constraint scoring network
        self.constraint_scorer = nn.Sequential(
            nn.Linear(hidden_dim + hidden_dim // 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_constraints),
            nn.Sigmoid()  # outputs the probability that each constraint is satisfied
        )

        # Causal effect estimator
        self.causal_estimator = CausalEffectEstimator(
            state_dim, action_dim, hidden_dim
        )
    
    def forward(self, state: Tensor, action: Tensor) -> Tensor:
        """
        Predict the satisfaction probability of each constraint.

        Returns:
            constraint_probs: (batch, n_constraints) satisfaction probability per constraint
        """
        s_enc = self.state_encoder(state)
        a_enc = self.action_encoder(action)

        combined = torch.cat([s_enc, a_enc], dim=-1)
        return self.constraint_scorer(combined)

    def compute_constraint_loss(self, states: Tensor, actions: Tensor,
                               expert_mask: Optional[Tensor] = None) -> Tensor:
        """
        Compute the constraint violation loss.

        Encourages expert actions to satisfy the constraints and
        non-expert actions to violate them.
        """
        constraint_probs = self.forward(states, actions)

        if expert_mask is not None:
            # Expert actions should satisfy the constraints:
            # penalize a low satisfaction probability.
            expert_loss = (1 - constraint_probs) * expert_mask.unsqueeze(1)
            # Non-expert actions may violate the constraints:
            # penalize a high satisfaction probability.
            non_expert_loss = constraint_probs * (1 - expert_mask).unsqueeze(1)
            loss = (expert_loss + non_expert_loss).mean()
        else:
            # Unsupervised variant: encourage constraint satisfaction
            loss = -(constraint_probs.mean())

        return loss

    def infer_constraints(self, demonstrations: List[Dict]) -> Dict[str, float]:
        """
        Infer constraints from demonstrations.

        Returns the mean satisfaction probability of each constraint.
        """
        self.eval()

        with torch.no_grad():
            constraint_satisfactions = {i: [] for i in range(self.n_constraints)}

            for demo in demonstrations:
                states = torch.FloatTensor(demo["states"])
                actions = torch.FloatTensor(demo["actions"])

                probs = self.forward(states, actions)

                for c in range(self.n_constraints):
                    constraint_satisfactions[c].extend(probs[:, c].tolist())

        return {
            f"constraint_{c}": float(np.mean(satisfactions))
            for c, satisfactions in constraint_satisfactions.items()
        }
 
 
class CausalIRL:
    """
    Causal inverse reinforcement learning.
    Combines constraint inference with causal reward learning.
    """
    def __init__(self, state_dim: int, action_dim: int,
                 n_constraints: int = 4,
                 hidden_dim: int = 256,
                 lr: float = 1e-3,
                 gamma: float = 0.99):

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma

        # Causal reward network
        self.reward_network = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Constraint inferrer
        self.constraint_inferrer = CausalConstraintInferrer(
            state_dim, action_dim, n_constraints, hidden_dim
        )

        # Causal effect estimator
        self.causal_estimator = CausalEffectEstimator(
            state_dim, action_dim, hidden_dim
        )

        # Discriminator (separates expert demonstrations from the learned policy)
        self.discriminator = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

        # Optimizer over all sub-modules
        self.optimizer = optim.Adam(
            list(self.reward_network.parameters()) +
            list(self.constraint_inferrer.parameters()) +
            list(self.causal_estimator.parameters()) +
            list(self.discriminator.parameters()),
            lr=lr
        )
    
    def compute_causal_reward(self, state: Tensor,
                             action: Tensor) -> Tensor:
        """
        Compute the causal reward.
        R_c(s, a) = f(CE(s, a))
        """
        # Estimate the causal effect
        effect_mean, effect_var = self.causal_estimator(state, action)

        # Use the magnitude of the causal effect as a reward
        causal_reward = torch.norm(effect_mean, dim=-1, keepdim=True)

        # Combine with the network prediction
        sa_combined = torch.cat([state, action], dim=-1)
        network_reward = self.reward_network(sa_combined)

        return causal_reward + 0.1 * network_reward

    def compute_constraint_reward(self, state: Tensor,
                                  action: Tensor) -> Tensor:
        """
        Compute the constraint reward (penalizes constraint-violating actions).
        """
        constraint_probs = self.constraint_inferrer(state, action)

        # Low satisfaction probability = high penalty
        constraint_penalty = (1 - constraint_probs).mean(dim=-1, keepdim=True)

        return -constraint_penalty
    
    def update(self, demonstrations: List[Dict],
              generated_trajectories: List[Dict],
              lambda_constraint: float = 0.5) -> Dict[str, float]:
        """
        Update the causal IRL model.

        Args:
            demonstrations: list of expert demonstrations
            generated_trajectories: trajectories generated by the learned policy
            lambda_constraint: weight of the constraint term
        """
        self.optimizer.zero_grad()

        total_loss = 0.0
        losses = {}

        # 1. Adversarial loss: train the discriminator to separate
        #    expert and generated trajectories
        expert_loss = 0.0
        generated_loss = 0.0

        for demo in demonstrations:
            states = torch.FloatTensor(demo["states"])
            actions = torch.FloatTensor(demo["actions"])

            sa_pairs = torch.cat([states, actions], dim=-1)
            expert_loss -= torch.log(self.discriminator(sa_pairs) + 1e-8).mean()

        for traj in generated_trajectories:
            states = torch.FloatTensor(traj["states"])
            actions = torch.FloatTensor(traj["actions"])

            sa_pairs = torch.cat([states, actions], dim=-1)
            generated_loss -= torch.log(1 - self.discriminator(sa_pairs) + 1e-8).mean()

        adversarial_loss = expert_loss + generated_loss
        losses["adversarial"] = adversarial_loss.item()

        # 2. Causal reward consistency loss
        reward_consistency_loss = 0.0
        for demo in demonstrations:
            states = torch.FloatTensor(demo["states"])
            actions = torch.FloatTensor(demo["actions"])

            # Expert actions should receive a high causal reward
            causal_reward = self.compute_causal_reward(states, actions)
            reward_consistency_loss -= causal_reward.mean()

        reward_consistency_loss /= max(len(demonstrations), 1)
        losses["reward_consistency"] = reward_consistency_loss.item()

        # 3. Constraint inference loss
        constraint_loss = 0.0
        for demo in demonstrations:
            states = torch.FloatTensor(demo["states"])
            actions = torch.FloatTensor(demo["actions"])

            # Mark every step as an expert action
            expert_mask = torch.ones(len(demo["states"]))
            constraint_loss += self.constraint_inferrer.compute_constraint_loss(
                states, actions, expert_mask
            )

        constraint_loss /= max(len(demonstrations), 1)
        losses["constraint"] = constraint_loss.item()

        # Total loss
        total_loss = (adversarial_loss +
                     reward_consistency_loss +
                     lambda_constraint * constraint_loss)

        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.optimizer.param_groups[0]["params"], 1.0)
        self.optimizer.step()

        return losses
    
    def predict_reward(self, state: Tensor, action: Tensor,
                       lambda_constraint: float = 0.5) -> Tensor:
        """Predict the total reward (causal reward plus weighted constraint penalty)."""
        causal_reward = self.compute_causal_reward(state, action)
        constraint_penalty = self.compute_constraint_reward(state, action)
        return causal_reward + lambda_constraint * constraint_penalty

6.2 ICRL Constraint Inference Implementation

class ICRLConstraintInference:
    """
    Constraint inference for Inverse Constrained RL.
    Implements the ICRL algorithm.
    """
    def __init__(self, env, irl_module: CausalIRL,
                 lambda_constraints: float = 1.0,
                 constraint_lr: float = 1e-2):

        self.env = env
        self.irl = irl_module
        self.lambda_constraints = lambda_constraints
        self.constraint_lr = constraint_lr

        # Constraint parameters (one weight per inferred constraint)
        n_constraints = irl_module.constraint_inferrer.n_constraints
        self.constraint_weights = nn.Parameter(torch.zeros(1, n_constraints))

        # Constraint optimizer
        self.constraint_optimizer = optim.Adam(
            [self.constraint_weights], lr=constraint_lr
        )
    
    def compute_constraint_violation(self, state: Tensor,
                                     action: Tensor) -> Tensor:
        """
        Compute the degree of constraint violation.

        Returns:
            violation: scalar constraint violation
        """
        constraint_probs = self.irl.constraint_inferrer(state, action)

        # Violation = 1 - satisfaction probability
        violations = 1 - constraint_probs

        # Weighted sum over constraints
        weighted_violations = violations * torch.softmax(
            self.constraint_weights, dim=-1
        )

        return weighted_violations.sum(dim=-1).mean()
    
    def update_constraints(self, demonstrations: List[Dict],
                          n_iterations: int = 100) -> List[float]:
        """
        Iteratively update the constraint parameters.
        """
        violations_history = []

        for iteration in range(n_iterations):
            total_violation = 0.0

            for demo in demonstrations:
                states = torch.FloatTensor(demo["states"])
                actions = torch.FloatTensor(demo["actions"])

                # Compute the violation
                violation = self.compute_constraint_violation(states, actions)

                # Gradient descent step (minimizes violation)
                self.constraint_optimizer.zero_grad()
                violation.backward()
                self.constraint_optimizer.step()

                total_violation += violation.item()

            avg_violation = total_violation / len(demonstrations)
            violations_history.append(avg_violation)

            # Project onto the non-negative orthant
            with torch.no_grad():
                self.constraint_weights.clamp_(min=0)

        return violations_history
    
    def combined_update(self, demonstrations: List[Dict],
                       generated_trajectories: List[Dict],
                       n_irl_iterations: int = 5,
                       n_constraint_iterations: int = 10) -> Dict[str, List]:
        """
        Alternate between IRL updates and constraint updates.

        1. Fix the constraints, update IRL
        2. Fix IRL, update the constraints
        """
        all_losses = {"irl": [], "constraint": []}

        for outer_iter in range(n_irl_iterations):
            # Step 1: fix the constraints, update IRL
            for _ in range(3):
                irl_losses = self.irl.update(
                    demonstrations, generated_trajectories,
                    self.lambda_constraints
                )
                all_losses["irl"].append(irl_losses)

            # Step 2: fix IRL, update the constraints
            constraint_violations = self.update_constraints(
                demonstrations, n_constraint_iterations
            )
            all_losses["constraint"].extend(constraint_violations)

        return all_losses
 
 
class PreferenceBasedCausalIRL:
    """
    Preference-based causal IRL.
    Learns causal rewards and constraints from preference data.
    """
    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dim: int = 256):

        self.state_dim = state_dim
        self.action_dim = action_dim

        # Causal reward network.
        # CausalRewardNetwork is not defined in this section; it is assumed to
        # come from an earlier module and to expose a `causal_estimator` attribute.
        self.reward_network = CausalRewardNetwork(state_dim, action_dim, hidden_dim)

        # Preference discriminator
        self.preference_discriminator = nn.Sequential(
            nn.Linear(state_dim + 2 * action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

        # Optimizer
        self.optimizer = optim.Adam(self.reward_network.parameters(), lr=1e-3)
    
    def bradley_terry_loss(self, states: Tensor,
                          action_pairs: Tensor,
                          preferences: Tensor) -> Tensor:
        """
        Bradley-Terry preference loss.

        Args:
            states: states, shape (batch, state_dim)
            action_pairs: action pairs, shape (batch, 2, action_dim)
            preferences: labels, shape (batch,); 1 if the first action is
                preferred, 0 if the second is
        """
        a1, a2 = action_pairs[:, 0], action_pairs[:, 1]

        # Rewards of the two actions
        r1 = self.reward_network(states, a1)
        r2 = self.reward_network(states, a2)

        # Predicted preference probability
        diff = r1 - r2
        pred_prob = torch.sigmoid(diff)

        # Binary cross-entropy loss
        loss = F.binary_cross_entropy(
            pred_prob.squeeze(-1),
            preferences.float()
        )

        return loss
    
    def counterfactual_preference_loss(self, states: Tensor,
                                      action_pairs: Tensor,
                                      preferences: Tensor) -> Tensor:
        """
        Counterfactual preference loss.
        """
        a1, a2 = action_pairs[:, 0], action_pairs[:, 1]

        # Causal effect of each action relative to the other
        ce1 = self.reward_network.causal_estimator.estimate_causal_effect(
            states, a1, a2
        )
        ce2 = self.reward_network.causal_estimator.estimate_causal_effect(
            states, a2, a1
        )

        # Preferences should agree with causal effects:
        # if a1 is preferred, CE(a1) should exceed CE(a2)
        ce_diff = ce1 - ce2

        loss = F.margin_ranking_loss(
            ce_diff.squeeze(),
            torch.zeros_like(ce_diff.squeeze()),
            preferences.float() * 2 - 1,  # map {0, 1} labels to {-1, +1}
            margin=0.5
        )

        return loss
    
    def update(self, preferences: Dict) -> float:
        """
        Update the preference-based IRL model.
        """
        states = torch.FloatTensor(preferences["states"])
        action_pairs = torch.FloatTensor(preferences["action_pairs"])
        prefs = torch.LongTensor(preferences["preferences"])

        # Standard preference loss
        bt_loss = self.bradley_terry_loss(states, action_pairs, prefs)

        # Counterfactual preference loss
        cf_loss = self.counterfactual_preference_loss(
            states, action_pairs, prefs
        )

        # Total loss
        loss = bt_loss + 0.5 * cf_loss

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def predict_reward(self, state: Tensor, action: Tensor) -> Tensor:
        """Predict the reward."""
        return self.reward_network(state, action)

7. Variants of Causal Constraint Inference

7.1 Safe IRL

Safe IRL focuses on inferring safety constraints:

$$\min_{C_{safe}} \; L(π_{E}, π_{C_{safe}}) \quad \text{s.t.} \quad P(\text{unsafe} \mid do(π_{C_{safe}})) = 0$$

7.2 Inverse Reward Design (IRD)

IRD infers preferences from uncertainty over the reward function:

$$P(R \mid D) \propto P(D \mid R) \cdot P(R)$$
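
The Bayesian update above can be sketched over a small finite set of candidate rewards. Everything specific here is an assumption made for illustration: rewards are a single weight `w` on one feature, each demonstration is a pairwise choice, and the likelihood $P(D \mid R)$ is taken to be Boltzmann-rational.

```python
import math


def choice_likelihood(w, demos):
    """P(D | R): probability of the observed choices under reward weight w."""
    lik = 1.0
    for f_chosen, f_other in demos:
        e_chosen, e_other = math.exp(w * f_chosen), math.exp(w * f_other)
        lik *= e_chosen / (e_chosen + e_other)
    return lik


def reward_posterior(candidates, prior, demos):
    """P(R | D) proportional to P(D | R) * P(R), normalized over the candidates."""
    unnorm = [p * choice_likelihood(w, demos) for w, p in zip(candidates, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


candidates = [-1.0, 0.0, 1.0]          # hypothetical reward weights
prior = [1.0 / 3, 1.0 / 3, 1.0 / 3]    # uniform prior P(R)
# The expert chose the option with feature value 1.0 over 0.0 three times.
demos = [(1.0, 0.0)] * 3

post = reward_posterior(candidates, prior, demos)
print(post)
```

The posterior concentrates on the weight that makes the observed choices most likely, while still keeping non-zero mass on the alternatives, which is the uncertainty the section refers to.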

7.3 Reward Inference from Demonstrations

RIDE infers the reward by comparing trajectories:

$$R^{*}(s, a) \propto \log \frac{P(τ_{good} \mid s, a)}{P(τ_{bad} \mid s, a)}$$
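
One simple way to read this log-ratio is as a count-based estimate: how much more often a state-action pair appears in good trajectories than in bad ones. The sketch below is an assumption about how the ratio might be estimated (add-one smoothing, trajectories as lists of hashable pairs), not a description of a specific published method.

```python
import math
from collections import Counter


def log_ratio_rewards(good_trajs, bad_trajs, smoothing=1.0):
    """Estimate log P(good | s, a) / P(bad | s, a) from smoothed visit counts."""
    good = Counter(sa for traj in good_trajs for sa in traj)
    bad = Counter(sa for traj in bad_trajs for sa in traj)
    return {sa: math.log((good[sa] + smoothing) / (bad[sa] + smoothing))
            for sa in set(good) | set(bad)}


# Hypothetical labeled trajectories of (state, action) pairs.
good_trajs = [[("s0", "right"), ("s1", "right")], [("s0", "right")]]
bad_trajs = [[("s0", "left"), ("s1", "right")]]

rewards = log_ratio_rewards(good_trajs, bad_trajs)
print(rewards)
```

Pairs seen mostly in good trajectories get a positive score, pairs seen mostly in bad ones get a negative score, and pairs seen equally often in both score zero.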

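8. Practical Applications

8.1 Autonomous Driving

Scenario: learning safety constraints from human driving data

Constraint inference:
- Invariant constraints: always stay in lane, never run a red light
- Variable constraints: following distance, lane-change timing

Causal analysis:
- Braking is causally related to stopping
- Steering angle is causally related to trajectory curvature

8.2 Surgical Robots

Scenario: learning surgical constraints from surgeon demonstrations

Constraint inference:
- Safety constraints: do not damage specific tissue
- Precision constraints: positional accuracy of the surgical instrument
- Efficiency constraints: duration of the procedure

Causal analysis:
- Instrument motion is causally related to tissue response
- Applied force is causally related to tissue deformation

8.3 Industrial Automation

Scenario: learning robot constraints from the operations of skilled workers

Constraint inference:
- Efficiency constraints: task completion time
- Quality constraints: product quality
- Safety constraints: safe distance from workers

Causal analysis:
- Actions are causally related to product quality
- Actions are causally related to energy consumption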

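9. Convergence and Theoretical Guarantees

9.1 Convergence of ICRL

Theorem (ICRL convergence): the ICRL algorithm converges under the following conditions:

The constraint space $C$ is convex
The constraint violation loss $L_{constraint}$ is convex
The learning rate satisfies the Robbins-Monro conditions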
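
The Robbins-Monro conditions require step sizes $α_t$ with $\sum_t α_t = \infty$ and $\sum_t α_t^2 < \infty$. A quick numeric check for the common schedule $α_t = 1/t$ (chosen here purely as an example) shows the first sum still growing while the second has essentially stopped.

```python
import math


def partial_sums(n_terms):
    """Return (sum of alpha_t, sum of alpha_t^2) for alpha_t = 1 / t."""
    s1 = sum(1.0 / t for t in range(1, n_terms + 1))
    s2 = sum(1.0 / t ** 2 for t in range(1, n_terms + 1))
    return s1, s2


s1_small, s2_small = partial_sums(1_000)
s1_large, s2_large = partial_sums(100_000)

# The step-size sum keeps growing (roughly like ln t) ...
print(s1_small, s1_large)
# ... while the squared sum approaches pi^2 / 6.
print(s2_small, s2_large, math.pi ** 2 / 6)
```

A constant learning rate, by contrast, fails the second condition, which is why the theorem does not cover it.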

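9.2 PAC Bound for Causal IRL

Theorem (sample complexity of causal IRL): with probability at least $1 - δ$, causal IRL converges to an $ϵ$-optimal reward and constraint set within

$$N(ϵ, δ) = O\left(\frac{1}{ϵ^{2}}\left(d_{R} + d_{C} + \ln\frac{1}{δ}\right)\right)$$

samples.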
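
To get a feel for the scaling in this bound, the expression inside $O(\cdot)$ can be evaluated directly. The constant hidden by the $O$ notation is unknown, so it is set to 1 here; $d_{R}$ and $d_{C}$ are taken to be the dimensions of the reward and constraint parameterizations, which is an assumption since the text does not define them.

```python
import math


def sample_bound(eps, delta, d_r, d_c, const=1.0):
    """Evaluate const / eps^2 * (d_R + d_C + ln(1 / delta))."""
    return const / eps ** 2 * (d_r + d_c + math.log(1.0 / delta))


n_base = sample_bound(0.1, 0.05, d_r=10, d_c=4)
n_half = sample_bound(0.05, 0.05, d_r=10, d_c=4)

print(math.ceil(n_base))   # order-of-magnitude sample count for eps = 0.1
print(n_half / n_base)     # halving eps multiplies the bound by about 4
```

The dependence on $δ$ is only logarithmic, so tightening the confidence level is cheap compared with tightening the accuracy $ϵ$.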

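10. Summary

Key Points

Constraint inference: learn implicit constraints from expert demonstrations rather than an explicit reward
Causal IRL: strengthen IRL with causal effects and counterfactual reasoning
Preference learning: infer the causal structure of preferences from pairwise comparisons
Safety: ensure the learned policy satisfies the inferred safety constraints

Algorithm Comparison

Method	Strengths	Weaknesses	Suitable scenarios
Standard IRL	Solid theoretical foundation	Non-identifiability	Settings where the reward is recoverable
ICRL	Handles constraints	Computationally expensive	Safety-critical applications
Causal IRL	Strong generalization	Requires causal assumptions	Transfer across environments
Preference-based IRL	Data-efficient	Requires preference labels	Learning from human feedback

Next Steps

Causal world models - building interpretable world models from causal structure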
