Overview

Q-Learning is one of the most important algorithms in reinforcement learning. Proposed by Watkins in 1989, it is an off-policy temporal-difference (TD) control algorithm.[1]

Core idea: learn the optimal action-value function directly from TD errors, without a model of the environment.

Review of TD Learning

The TD(0) Algorithm

TD(0) is the special case of TD learning that looks only one step ahead:

V(S) ← V(S) + α[R + γ V(S') - V(S)]

where α is the learning rate and the TD error is δ = R + γ V(S') - V(S).

The TD error

  • δ > 0: the observed target is higher than the current estimate, so the value is revised upward
  • δ < 0: the observed target is lower than the current estimate, so the value is revised downward
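A quick numeric illustration of one TD(0) backup (the values V(S) = 0.5, V(S') = 0.8, R = 1 and the hyperparameters below are made up for this example):

alpha, gamma = 0.1, 0.9                  # learning rate and discount factor
V_s, V_s_next, reward = 0.5, 0.8, 1.0    # current estimates and observed reward

td_error = reward + gamma * V_s_next - V_s   # δ = 1 + 0.9 * 0.8 - 0.5 = 1.22 > 0
V_s = V_s + alpha * td_error                 # V(S) ← 0.5 + 0.1 * 1.22 = 0.622
print(td_error, V_s)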

The Q-Learning Algorithm

Core Idea

Q-Learning learns the optimal action-value function Q*(s, a) directly, using a greedy (max) backup over the next state as the TD target:

Q(S, A) ← Q(S, A) + α[R + γ max_a' Q(S', a') - Q(S, A)]

Algorithm Outline

1. Initialization:
   - Q(s, a) = 0, ∀s ∈ S, a ∈ A
   - Choose the learning rate α and the discount factor γ

2. Repeat for each episode:
   a) Initialize the state S
   
   b) For each step of the episode:
      - Choose an action A with the ε-greedy policy
      - Take the action and observe R, S'
      - TD update:
        Q(S, A) ← Q(S, A) + α[R + γ max_a' Q(S', a') - Q(S, A)]
      - S ← S'
   
   c) Until S is a terminal state

Python Implementation

import numpy as np
 
class QLearningAgent:
    """Q-Learning智能体"""
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        
        # Q-table
        self.Q = np.zeros((n_states, n_actions))
    
    def choose_action(self, state):
        """ε-greedy策略选择动作"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # 探索
        else:
            return np.argmax(self.Q[state])  # 利用
    
    def update(self, state, action, reward, next_state, done):
        """Q-Learning TD更新"""
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        
        # TD update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
    
    def train(self, env, n_episodes=1000):
        """训练智能体"""
        rewards = []
        
        for episode in range(n_episodes):
            state = env.reset()
            episode_reward = 0
            done = False
            
            while not done:
                action = self.choose_action(state)
                next_state, reward, done = env.step(action)
                
                self.update(state, action, reward, next_state, done)
                
                state = next_state
                episode_reward += reward
            
            rewards.append(episode_reward)
            
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards[-100:])
                print(f"Episode {episode+1}, Avg Reward: {avg_reward:.3f}")
        
        return rewards

Off-Policy vs On-Policy

Comparison

| Property | On-Policy | Off-Policy |
|---|---|---|
| Behavior policy | Used to generate actions | Used to generate actions |
| Target policy | Identical to the behavior policy | The optimal policy being learned, which may differ |
| Sample source | The current policy | Any policy |
| Representative algorithms | SARSA, PPO | Q-Learning, DQN |
| Convergence guarantees | Weaker | Stronger |

Why Q-Learning Is Off-Policy

Q-Learning uses two different policies at the same time:

  • Behavior policy: ε-greedy, used to select the actions that are actually executed
  • Target policy: the greedy policy π(s) = argmax_a Q(s, a), used to compute the TD target

Because the behavior policy and the target policy differ, Q-Learning can learn the optimal policy from non-optimal (exploratory) trajectories.

SARSA: An On-Policy Comparison

The key difference between SARSA (State-Action-Reward-State-Action) and Q-Learning lies in how the TD target is formed:

# Q-Learning (off-policy)
next_action = np.argmax(self.Q[next_state])  # greedy selection for the target
target = reward + self.gamma * self.Q[next_state, next_action]

# SARSA (on-policy)
next_action = self.choose_action(next_state)  # ε-greedy selection, same as the behavior policy
target = reward + self.gamma * self.Q[next_state, next_action]

Algorithm Comparison

| Aspect | Q-Learning | SARSA |
|---|---|---|
| Update target | R + γ max_a' Q(S', a') | R + γ Q(S', A') |
| Risk attitude | Aggressive (greedy target) | Conservative |
| Convergence requirements | Robbins-Monro step sizes and infinite visitation; ε need not decay for Q → Q* | Additionally requires GLIE (e.g. ε → 0) to reach the optimal policy |
| Typical behavior | May occasionally "fall off the cliff" while exploring | Safer trajectories |

SARSA Implementation

class SARSAAgent:
    """SARSA智能体"""
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.Q = np.zeros((n_states, n_actions))
    
    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q[state])
    
    def update(self, state, action, reward, next_state, next_action, done):
        if done:
            target = reward
        else:
            # use the next action actually selected (on-policy)
            target = reward + self.gamma * self.Q[next_state, next_action]
        
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
    
    def train(self, env, n_episodes=1000):
        """Train the agent and return per-episode rewards."""
        rewards = []
        
        for episode in range(n_episodes):
            state = env.reset()
            action = self.choose_action(state)  # choose the initial action
            done = False
            episode_reward = 0
            
            while not done:
                next_state, reward, done = env.step(action)
                next_action = self.choose_action(next_state)  # choose the next action
                
                self.update(state, action, reward, next_state, next_action, done)
                
                state = next_state
                action = next_action
                episode_reward += reward
            
            rewards.append(episode_reward)
            
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards[-100:])
                print(f"Episode {episode+1}, Avg Reward: {avg_reward:.3f}")
        
        return rewards

Convergence Analysis

Convergence Conditions

Q-Learning is guaranteed to converge to the optimal action-value function Q* under the following conditions:[2]

  1. The state and action spaces are finite
  2. Every state-action pair is visited infinitely often
  3. The learning rates satisfy the Robbins-Monro conditions: Σ_t α_t = ∞ and Σ_t α_t² < ∞
  4. The environment is Markovian
  5. The GLIE condition holds: Greedy in the Limit with Infinite Exploration (strictly, the Q estimates converge without it; GLIE additionally makes the ε-greedy behavior policy itself optimal in the limit)

The GLIE Condition

GLIE consists of two parts:

  • Infinite exploration: every state-action pair is visited infinitely often
  • Greedy in the limit: the behavior policy converges to the greedy policy

A common way to satisfy GLIE is to decay the exploration rate over episodes, e.g. ε_k = 1/k, as sketched below:
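A minimal sketch of such schedules (the forms ε_k = 1/k and α = 1/n(s, a) are common textbook choices, not something prescribed by the implementations above):

import numpy as np

def epsilon_schedule(episode):
    """ε_k = 1/k: every action keeps nonzero probability, yet ε → 0 in the limit."""
    return 1.0 / (episode + 1)

def step_size(visit_count):
    """α_n = 1/n satisfies Σ α_n = ∞ and Σ α_n² < ∞ (Robbins-Monro)."""
    return 1.0 / visit_count

# Sketch of how they would plug into the training loop:
visits = np.zeros((48, 4))                   # per (state, action) visit counts
# eps = epsilon_schedule(episode)            # pass to choose_action
# visits[state, action] += 1
# alpha = step_size(visits[state, action])   # pass to update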

Example: Cliff Walking

Problem Description

A 4x12 grid:

○○○○○○○○○○○○    ○: safe cell (reward -1 per step)
○○○○○○○○○○○○    S: start, bottom-left corner (state 36)
○○○○○○○○○○○○    C: cliff (reward -100, agent is sent back to the start)
SCCCCCCCCCCG    G: goal, bottom-right corner (state 47)

The agent has to walk from S to G while avoiding the cliff.

Implementation

import numpy as np
import matplotlib.pyplot as plt
 
class CliffWalking:
    """悬崖行走环境"""
    
    def __init__(self):
        self.n_rows = 4
        self.n_cols = 12
        self.n_states = self.n_rows * self.n_cols
        self.n_actions = 4
        self.start = 36  # (3, 0), bottom-left corner
        self.goal = 47   # (3, 11), bottom-right corner
        self.cliff = list(range(37, 47))  # bottom row between start and goal
    
    def reset(self):
        self.state = self.start
        return self.state
    
    def step(self, action):
        row, col = self.state // self.n_cols, self.state % self.n_cols
        
        # move
        if action == 0: row = max(0, row - 1)      # up
        elif action == 1: row = min(3, row + 1)     # down
        elif action == 2: col = max(0, col - 1)     # left
        elif action == 3: col = min(11, col + 1)   # right
        
        self.state = row * self.n_cols + col
        
        # reward
        if self.state == self.goal:
            reward = 0
            done = True
        elif self.state in self.cliff:
            reward = -100
            done = True
            self.state = self.start  # send the agent back to the start
        else:
            reward = -1
            done = False
        
        return self.state, reward, done
 
 
def compare_algorithms():
    env = CliffWalking()
    n_episodes = 500
    
    # Q-Learning
    q_agent = QLearningAgent(48, 4, alpha=0.5, gamma=1.0, epsilon=0.1)
    q_rewards = q_agent.train(env, n_episodes)
    
    # SARSA
    sarsa_agent = SARSAAgent(48, 4, alpha=0.5, gamma=1.0, epsilon=0.1)
    sarsa_rewards = sarsa_agent.train(env, n_episodes)
    
    # plotting
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(np.convolve(q_rewards, np.ones(10)/10, mode='valid'), label='Q-Learning')
    plt.plot(np.convolve(sarsa_rewards, np.ones(10)/10, mode='valid'), label='SARSA')
    plt.xlabel('Episode')
    plt.ylabel('Reward (10-episode moving avg)')
    plt.legend()
    plt.title('Cliff Walking: Q-Learning vs SARSA')
    
    plt.subplot(1, 2, 2)
    # greedy state values V(s) = max_a Q(s, a) learned by Q-Learning
    plt.imshow(q_agent.Q.max(axis=1).reshape(4, 12), cmap='viridis')
    plt.colorbar(label='V(s)')
    plt.title('Q-Learning State Values')
    
    plt.tight_layout()
    plt.show()
 
 
compare_algorithms()

Analysis of Results

Observations:
1. Q-Learning learns the optimal, cliff-hugging path: its final greedy policy earns about -12 per episode, but during ε-greedy training it occasionally steps into the cliff, so its average training reward is lower.
2. SARSA earns about -25 per episode because it learns a longer, safer detour away from the cliff edge: its updates account for the exploratory actions of the behavior policy.
3. For both agents, the learned values of the cells adjacent to the cliff are sharply negative.
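The learned paths can be read directly off the Q-tables with a small helper (assuming q_agent and sarsa_agent from compare_algorithms are still in scope):

def print_greedy_policy(Q, n_rows=4, n_cols=12):
    """Print the greedy action per cell, using the action coding of CliffWalking.step."""
    arrows = ['↑', '↓', '←', '→']
    for r in range(n_rows):
        print(''.join(arrows[int(np.argmax(Q[r * n_cols + c]))] for c in range(n_cols)))

# print_greedy_policy(q_agent.Q)      # Q-Learning: hugs the cliff row
# print_greedy_policy(sarsa_agent.Q)  # SARSA: detours through safer rows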

Extensions of Q-Learning

1. Double Q-Learning

Addresses Q-Learning's overestimation bias: the same noisy estimates are used both to select (via argmax) and to evaluate the next action, so the max operator is biased upward. Double Q-Learning keeps two tables, using one to select the action and the other to evaluate it:

class DoubleQLearningAgent:
    """Double Q-Learning"""
    
    def __init__(self, n_states, n_actions):
        self.Q1 = np.zeros((n_states, n_actions))
        self.Q2 = np.zeros((n_states, n_actions))
        self.n_actions = n_actions
    
    def choose_action(self, state, epsilon=0.1):
        if np.random.random() < epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q1[state] + self.Q2[state])
    
    def update(self, state, action, reward, next_state, done, alpha=0.1, gamma=0.95):
        if np.random.random() < 0.5:
            # update Q1
            if done:
                target = reward
            else:
                a = np.argmax(self.Q1[next_state])
                target = reward + gamma * self.Q2[next_state, a]
            self.Q1[state, action] += alpha * (target - self.Q1[state, action])
        else:
            # update Q2
            if done:
                target = reward
            else:
                a = np.argmax(self.Q2[next_state])
                target = reward + gamma * self.Q1[next_state, a]
            self.Q2[state, action] += alpha * (target - self.Q2[state, action])

2. Delayed Q-Learning

For each state-action pair, the Q-value is updated only after the pair has been visited a threshold number of times, using the average of the accumulated targets; this reduces the variance of each update. A sketch of the batching idea follows.
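A minimal sketch of the batching idea only (the full Delayed Q-Learning algorithm of Strehl et al. additionally uses an optimism bonus and update-permission flags; the threshold m = 5 is an arbitrary illustrative value):

from collections import defaultdict
import numpy as np

class DelayedUpdateTable:
    """Accumulate m TD targets per (s, a) before applying a single averaged update."""
    
    def __init__(self, n_states, n_actions, m=5, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.m = m
        self.gamma = gamma
        self.targets = defaultdict(list)     # pending TD targets per (state, action)
    
    def observe(self, state, action, reward, next_state, done):
        target = reward if done else reward + self.gamma * self.Q[next_state].max()
        self.targets[(state, action)].append(target)
        if len(self.targets[(state, action)]) >= self.m:
            # replace the estimate with the average of the accumulated targets
            self.Q[state, action] = np.mean(self.targets[(state, action)])
            self.targets[(state, action)].clear()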

3. Prioritized Sweeping

Prioritizes updates to "surprising" state-action pairs according to the magnitude of their TD error, propagating value changes backwards through a learned model, as sketched below.
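A minimal sketch over a tabular Q like the one above (the deterministic one-step model, the threshold theta and the per-step planning budget are simplifying assumptions for illustration; in the full algorithm the priority queue persists across steps):

import heapq
import numpy as np

def prioritized_sweeping_step(Q, model, predecessors, s, a, r, s_next,
                              alpha=0.1, gamma=0.95, theta=1e-4, n_planning=5):
    """One real transition followed by up to n_planning prioritized planning backups."""
    model[(s, a)] = (r, s_next)                      # remember the observed transition
    predecessors.setdefault(s_next, set()).add((s, a))
    
    pqueue = []
    p = abs(r + gamma * Q[s_next].max() - Q[s, a])
    if p > theta:
        heapq.heappush(pqueue, (-p, s, a))           # max-heap via negated priority
    
    for _ in range(n_planning):
        if not pqueue:
            break
        _, s_i, a_i = heapq.heappop(pqueue)
        r_i, s_i_next = model[(s_i, a_i)]
        Q[s_i, a_i] += alpha * (r_i + gamma * Q[s_i_next].max() - Q[s_i, a_i])
        
        # predecessors of s_i may now be "surprising" too; queue them by |TD error|
        for (s_p, a_p) in predecessors.get(s_i, set()):
            r_p, _ = model[(s_p, a_p)]
            p = abs(r_p + gamma * Q[s_i].max() - Q[s_p, a_p])
            if p > theta:
                heapq.heappush(pqueue, (-p, s_p, a_p))

Here model and predecessors would be plain dicts maintained across steps, initialized once as empty.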

Combining with Deep Learning

Limitations of tabular Q-Learning:

  • It only handles discrete, low-dimensional state spaces
  • It cannot handle high-dimensional inputs such as images

Solution: approximate the Q-function with a deep neural network, i.e. DQN. See DQN.


Further Topics

  • DQN: deep Q-networks for high-dimensional state spaces
  • Actor-Critic: combining policy gradients with value functions
  • Dynamic programming: model-based planning methods

Footnotes

  1. Watkins, C. J. C. H., Learning from Delayed Rewards, PhD thesis, University of Cambridge, 1989.

  2. Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018, Chapter 6.