Overview

Q-Learning is one of the most important algorithms in reinforcement learning. Proposed by Watkins in 1989, it is an off-policy temporal-difference (TD) control algorithm.[1]

Core idea: learn the optimal action-value function directly from TD errors, without a model of the environment.

Review of TD Learning

The TD(0) Algorithm

TD(0) is the special case of TD learning that looks only one step ahead:

V(S) ← V(S) + α[R + γ V(S') - V(S)]

where α is the learning rate and the TD error is δ = R + γ V(S') - V(S).

The TD error

  • δ > 0: the observed target is higher than the current estimate, so the value is revised upward
  • δ < 0: the observed target is lower than the current estimate, so the value is revised downward
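A quick numeric illustration of one TD(0) backup (the values V(S) = 0.5, V(S') = 0.8, R = 1 and the hyperparameters below are made up for this example):

alpha, gamma = 0.1, 0.9                  # learning rate and discount factor
V_s, V_s_next, reward = 0.5, 0.8, 1.0    # current estimates and observed reward

td_error = reward + gamma * V_s_next - V_s   # δ = 1 + 0.9 * 0.8 - 0.5 = 1.22 > 0
V_s = V_s + alpha * td_error                 # V(S) ← 0.5 + 0.1 * 1.22 = 0.622
print(td_error, V_s)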

The Q-Learning Algorithm

Core Idea

Q-Learning learns the optimal action-value function Q*(s, a) directly, using a greedy (max) backup over the next state as the TD target:

Q(S, A) ← Q(S, A) + α[R + γ max_a' Q(S', a') - Q(S, A)]

Algorithm Outline

1. Initialization:
   - Q(s, a) = 0, ∀s ∈ S, a ∈ A
   - Choose the learning rate α and the discount factor γ

2. Repeat for each episode:
   a) Initialize the state S
   
   b) For each step of the episode:
      - Choose an action A with the ε-greedy policy
      - Take the action and observe R, S'
      - TD update:
        Q(S, A) ← Q(S, A) + α[R + γ max_a' Q(S', a') - Q(S, A)]
      - S ← S'
   
   c) Until S is a terminal state

Python Implementation

import numpy as np
 
class QLearningAgent:
    """Q-Learning智能体"""
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        
        # Q-table
        self.Q = np.zeros((n_states, n_actions))
    
    def choose_action(self, state):
        """ε-greedy策略选择动作"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # 探索
        else:
            return np.argmax(self.Q[state])  # 利用
    
    def update(self, state, action, reward, next_state, done):
        """Q-Learning TD更新"""
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        
        # TD update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
    
    def train(self, env, n_episodes=1000):
        """训练智能体"""
        rewards = []
        
        for episode in range(n_episodes):
            state = env.reset()
            episode_reward = 0
            done = False
            
            while not done:
                action = self.choose_action(state)
                next_state, reward, done = env.step(action)
                
                self.update(state, action, reward, next_state, done)
                
                state = next_state
                episode_reward += reward
            
            rewards.append(episode_reward)
            
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards[-100:])
                print(f"Episode {episode+1}, Avg Reward: {avg_reward:.3f}")
        
        return rewards

Off-Policy vs On-Policy

Comparison

| Property | On-Policy | Off-Policy |
|---|---|---|
| Behavior policy | Used to generate actions | Used to generate actions |
| Target policy | Identical to the behavior policy | The optimal policy being learned, which may differ |
| Sample source | The current policy | Any policy |
| Representative algorithms | SARSA, PPO | Q-Learning, DQN |
| Convergence guarantees | Weaker | Stronger |

Why Q-Learning Is Off-Policy

Q-Learning uses two different policies at the same time:

  • Behavior policy: ε-greedy, used to select the actions that are actually executed
  • Target policy: the greedy policy π(s) = argmax_a Q(s, a), used to compute the TD target

Because the behavior policy and the target policy differ, Q-Learning can learn the optimal policy from non-optimal (exploratory) trajectories.

SARSA: An On-Policy Comparison

The key difference between SARSA (State-Action-Reward-State-Action) and Q-Learning lies in how the TD target is formed:

# Q-Learning (off-policy)
next_action = np.argmax(self.Q[next_state])  # greedy selection for the target
target = reward + self.gamma * self.Q[next_state, next_action]

# SARSA (on-policy)
next_action = self.choose_action(next_state)  # ε-greedy selection, same as the behavior policy
target = reward + self.gamma * self.Q[next_state, next_action]

Algorithm Comparison

| Aspect | Q-Learning | SARSA |
|---|---|---|
| Update target | R + γ max_a' Q(S', a') | R + γ Q(S', A') |
| Risk attitude | Aggressive (greedy target) | Conservative |
| Convergence requirements | Robbins-Monro step sizes and infinite visitation; ε need not decay for Q → Q* | Additionally requires GLIE (e.g. ε → 0) to reach the optimal policy |
| Typical behavior | May occasionally "fall off the cliff" while exploring | Safer trajectories |

SARSA Implementation

class SARSAAgent:
    """SARSA智能体"""
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.Q = np.zeros((n_states, n_actions))
    
    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q[state])
    
    def update(self, state, action, reward, next_state, next_action, done):
        if done:
            target = reward
        else:
            # use the next action actually selected (on-policy)
            target = reward + self.gamma * self.Q[next_state, next_action]
        
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
    
    def train(self, env, n_episodes=1000):
        """Train the agent and return per-episode rewards."""
        rewards = []
        
        for episode in range(n_episodes):
            state = env.reset()
            action = self.choose_action(state)  # choose the initial action
            done = False
            episode_reward = 0
            
            while not done:
                next_state, reward, done = env.step(action)
                next_action = self.choose_action(next_state)  # choose the next action
                
                self.update(state, action, reward, next_state, next_action, done)
                
                state = next_state
                action = next_action
                episode_reward += reward
            
            rewards.append(episode_reward)
            
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards[-100:])
                print(f"Episode {episode+1}, Avg Reward: {avg_reward:.3f}")
        
        return rewards

Convergence Analysis

Convergence Conditions

Q-Learning is guaranteed to converge to the optimal action-value function Q* under the following conditions:[2]

  1. The state and action spaces are finite
  2. Every state-action pair is visited infinitely often
  3. The learning rates satisfy the Robbins-Monro conditions: Σ_t α_t = ∞ and Σ_t α_t² < ∞
  4. The environment is Markovian
  5. The GLIE condition holds: Greedy in the Limit with Infinite Exploration (strictly, the Q estimates converge without it; GLIE additionally makes the ε-greedy behavior policy itself optimal in the limit)

The GLIE Condition

GLIE consists of two parts:

  • Infinite exploration: every state-action pair is visited infinitely often
  • Greedy in the limit: the behavior policy converges to the greedy policy

A common way to satisfy GLIE is to decay the exploration rate over episodes, e.g. ε_k = 1/k, as sketched below:
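A minimal sketch of such schedules (the forms ε_k = 1/k and α = 1/n(s, a) are common textbook choices, not something prescribed by the implementations above):

import numpy as np

def epsilon_schedule(episode):
    """ε_k = 1/k: every action keeps nonzero probability, yet ε → 0 in the limit."""
    return 1.0 / (episode + 1)

def step_size(visit_count):
    """α_n = 1/n satisfies Σ α_n = ∞ and Σ α_n² < ∞ (Robbins-Monro)."""
    return 1.0 / visit_count

# Sketch of how they would plug into the training loop:
visits = np.zeros((48, 4))                   # per (state, action) visit counts
# eps = epsilon_schedule(episode)            # pass to choose_action
# visits[state, action] += 1
# alpha = step_size(visits[state, action])   # pass to update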

Example: Cliff Walking

Problem Description

A 4x12 grid:

○○○○○○○○○○○○    ○: safe cell (reward -1 per step)
○○○○○○○○○○○○    S: start, bottom-left corner (state 36)
○○○○○○○○○○○○    C: cliff (reward -100, agent is sent back to the start)
SCCCCCCCCCCG    G: goal, bottom-right corner (state 47)

The agent has to walk from S to G while avoiding the cliff.

Implementation

import numpy as np
import matplotlib.pyplot as plt
 
class CliffWalking:
    """悬崖行走环境"""
    
    def __init__(self):
        self.n_rows = 4
        self.n_cols = 12
        self.n_states = self.n_rows * self.n_cols
        self.n_actions = 4
        self.start = 36  # (3, 0), bottom-left corner
        self.goal = 47   # (3, 11), bottom-right corner
        self.cliff = list(range(37, 47))  # bottom row between start and goal
    
    def reset(self):
        self.state = self.start
        return self.state
    
    def step(self, action):
        row, col = self.state // self.n_cols, self.state % self.n_cols
        
        # move
        if action == 0: row = max(0, row - 1)      # up
        elif action == 1: row = min(3, row + 1)     # down
        elif action == 2: col = max(0, col - 1)     # left
        elif action == 3: col = min(11, col + 1)   # right
        
        self.state = row * self.n_cols + col
        
        # reward
        if self.state == self.goal:
            reward = 0
            done = True
        elif self.state in self.cliff:
            reward = -100
            done = True
            self.state = self.start  # send the agent back to the start
        else:
            reward = -1
            done = False
        
        return self.state, reward, done
 
 
def compare_algorithms():
    env = CliffWalking()
    n_episodes = 500
    
    # Q-Learning
    q_agent = QLearningAgent(48, 4, alpha=0.5, gamma=1.0, epsilon=0.1)
    q_rewards = q_agent.train(env, n_episodes)
    
    # SARSA
    sarsa_agent = SARSAAgent(48, 4, alpha=0.5, gamma=1.0, epsilon=0.1)
    sarsa_rewards = sarsa_agent.train(env, n_episodes)
    
    # plotting
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(np.convolve(q_rewards, np.ones(10)/10, mode='valid'), label='Q-Learning')
    plt.plot(np.convolve(sarsa_rewards, np.ones(10)/10, mode='valid'), label='SARSA')
    plt.xlabel('Episode')
    plt.ylabel('Reward (10-episode moving avg)')
    plt.legend()
    plt.title('Cliff Walking: Q-Learning vs SARSA')
    
    plt.subplot(1, 2, 2)
    # greedy state values V(s) = max_a Q(s, a) learned by Q-Learning
    plt.imshow(q_agent.Q.max(axis=1).reshape(4, 12), cmap='viridis')
    plt.colorbar(label='V(s)')
    plt.title('Q-Learning State Values')
    
    plt.tight_layout()
    plt.show()
 
 
compare_algorithms()

Analysis of Results

Observations:
1. Q-Learning learns the optimal, cliff-hugging path: its final greedy policy earns about -12 per episode, but during ε-greedy training it occasionally steps into the cliff, so its average training reward is lower.
2. SARSA earns about -25 per episode because it learns a longer, safer detour away from the cliff edge: its updates account for the exploratory actions of the behavior policy.
3. For both agents, the learned values of the cells adjacent to the cliff are sharply negative.
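The learned paths can be read directly off the Q-tables with a small helper (assuming q_agent and sarsa_agent from compare_algorithms are still in scope):

def print_greedy_policy(Q, n_rows=4, n_cols=12):
    """Print the greedy action per cell, using the action coding of CliffWalking.step."""
    arrows = ['↑', '↓', '←', '→']
    for r in range(n_rows):
        print(''.join(arrows[int(np.argmax(Q[r * n_cols + c]))] for c in range(n_cols)))

# print_greedy_policy(q_agent.Q)      # Q-Learning: hugs the cliff row
# print_greedy_policy(sarsa_agent.Q)  # SARSA: detours through safer rows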

Extensions of Q-Learning

1. Double Q-Learning

Addresses Q-Learning's overestimation bias: the same noisy estimates are used both to select (via argmax) and to evaluate the next action, so the max operator is biased upward. Double Q-Learning keeps two tables, using one to select the action and the other to evaluate it:

class DoubleQLearningAgent:
    """Double Q-Learning"""
    
    def __init__(self, n_states, n_actions):
        self.Q1 = np.zeros((n_states, n_actions))
        self.Q2 = np.zeros((n_states, n_actions))
        self.n_actions = n_actions
    
    def choose_action(self, state, epsilon=0.1):
        if np.random.random() < epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q1[state] + self.Q2[state])
    
    def update(self, state, action, reward, next_state, done, alpha=0.1, gamma=0.95):
        if np.random.random() < 0.5:
            # update Q1
            if done:
                target = reward
            else:
                a = np.argmax(self.Q1[next_state])
                target = reward + gamma * self.Q2[next_state, a]
            self.Q1[state, action] += alpha * (target - self.Q1[state, action])
        else:
            # update Q2
            if done:
                target = reward
            else:
                a = np.argmax(self.Q2[next_state])
                target = reward + gamma * self.Q1[next_state, a]
            self.Q2[state, action] += alpha * (target - self.Q2[state, action])

2. Delayed Q-Learning

For each state-action pair, the Q-value is updated only after the pair has been visited a threshold number of times, using the average of the accumulated targets; this reduces the variance of each update. A sketch of the batching idea follows.
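A minimal sketch of the batching idea only (the full Delayed Q-Learning algorithm of Strehl et al. additionally uses an optimism bonus and update-permission flags; the threshold m = 5 is an arbitrary illustrative value):

from collections import defaultdict
import numpy as np

class DelayedUpdateTable:
    """Accumulate m TD targets per (s, a) before applying a single averaged update."""
    
    def __init__(self, n_states, n_actions, m=5, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.m = m
        self.gamma = gamma
        self.targets = defaultdict(list)     # pending TD targets per (state, action)
    
    def observe(self, state, action, reward, next_state, done):
        target = reward if done else reward + self.gamma * self.Q[next_state].max()
        self.targets[(state, action)].append(target)
        if len(self.targets[(state, action)]) >= self.m:
            # replace the estimate with the average of the accumulated targets
            self.Q[state, action] = np.mean(self.targets[(state, action)])
            self.targets[(state, action)].clear()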

3. Prioritized Sweeping

Prioritizes updates to "surprising" state-action pairs according to the magnitude of their TD error, propagating value changes backwards through a learned model, as sketched below.
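A minimal sketch over a tabular Q like the one above (the deterministic one-step model, the threshold theta and the per-step planning budget are simplifying assumptions for illustration; in the full algorithm the priority queue persists across steps):

import heapq
import numpy as np

def prioritized_sweeping_step(Q, model, predecessors, s, a, r, s_next,
                              alpha=0.1, gamma=0.95, theta=1e-4, n_planning=5):
    """One real transition followed by up to n_planning prioritized planning backups."""
    model[(s, a)] = (r, s_next)                      # remember the observed transition
    predecessors.setdefault(s_next, set()).add((s, a))
    
    pqueue = []
    p = abs(r + gamma * Q[s_next].max() - Q[s, a])
    if p > theta:
        heapq.heappush(pqueue, (-p, s, a))           # max-heap via negated priority
    
    for _ in range(n_planning):
        if not pqueue:
            break
        _, s_i, a_i = heapq.heappop(pqueue)
        r_i, s_i_next = model[(s_i, a_i)]
        Q[s_i, a_i] += alpha * (r_i + gamma * Q[s_i_next].max() - Q[s_i, a_i])
        
        # predecessors of s_i may now be "surprising" too; queue them by |TD error|
        for (s_p, a_p) in predecessors.get(s_i, set()):
            r_p, _ = model[(s_p, a_p)]
            p = abs(r_p + gamma * Q[s_i].max() - Q[s_p, a_p])
            if p > theta:
                heapq.heappush(pqueue, (-p, s_p, a_p))

Here model and predecessors would be plain dicts maintained across steps, initialized once as empty.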

Combining with Deep Learning

Limitations of tabular Q-Learning:

  • It only handles discrete, low-dimensional state spaces
  • It cannot handle high-dimensional inputs such as images

Solution: approximate the Q-function with a deep neural network, i.e. DQN. See DQN.


Further Topics

  • DQN: deep Q-networks for high-dimensional state spaces
  • Actor-Critic: combining policy gradients with value functions
  • Dynamic programming: model-based planning methods

Footnotes

  1. Watkins, C. J. C. H., Learning from Delayed Rewards, PhD thesis, University of Cambridge, 1989.

  2. Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018, Chapter 6.