Overview
Q-Learning is one of the most important algorithms in reinforcement learning. Proposed by Watkins in 1989, it is an off-policy temporal-difference (TD) control algorithm[1].
Core idea: learn the optimal action-value function directly from the TD error, with no model of the environment.
TD Learning Recap
The TD(0) Algorithm
TD(0) is the special case of TD learning that looks only one step ahead:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

where $\alpha$ is the learning rate and $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
The TD Error
- $\delta_t > 0$: the observed return is higher than the current estimate, so the value is revised upward
- $\delta_t < 0$: the observed return is lower than the current estimate, so the value is revised downward
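To make the update concrete, here is a minimal sketch of a single tabular TD(0) step; the three-state value table and the example transition are made up purely for illustration:

import numpy as np

def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=0.95):
    """One TD(0) update: V(s) <- V(s) + alpha * (TD target - V(s))."""
    td_target = reward if done else reward + gamma * V[next_state]
    td_error = td_target - V[state]   # delta_t
    V[state] += alpha * td_error
    return td_error

# Toy example: 3 states, one observed transition
V = np.zeros(3)
delta = td0_update(V, state=0, reward=1.0, next_state=1, done=False)
print(V, delta)  # delta > 0 here, so V(0) is revised upward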
The Q-Learning Algorithm
Core Idea
Q-Learning directly learns the optimal action-value function $Q^*(s, a)$, using the greedy one-step TD target $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$$
Algorithm Outline
1. Initialization:
   - Q(s, a) = 0, ∀s ∈ S, a ∈ A
   - Choose the learning rate α and the discount factor γ
2. For each episode:
   a) Initialize the state S
   b) For each step of the episode:
      - Choose an action A (ε-greedy policy)
      - Take the action, observe R and S'
      - TD update:
        Q(S, A) ← Q(S, A) + α[R + γ max_a' Q(S', a') - Q(S, A)]
      - S ← S'
   c) Until S is a terminal state
Python Implementation

import numpy as np

class QLearningAgent:
    """Q-Learning agent."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        # Q-table
        self.Q = np.zeros((n_states, n_actions))

    def choose_action(self, state):
        """Select an action with the ε-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # explore
        else:
            return np.argmax(self.Q[state])           # exploit

    def update(self, state, action, reward, next_state, done):
        """Q-Learning TD update."""
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        # TD update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

    def train(self, env, n_episodes=1000):
        """Train the agent."""
        rewards = []
        for episode in range(n_episodes):
            state = env.reset()
            episode_reward = 0
            done = False
            while not done:
                action = self.choose_action(state)
                next_state, reward, done = env.step(action)
                self.update(state, action, reward, next_state, done)
                state = next_state
                episode_reward += reward
            rewards.append(episode_reward)
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards[-100:])
                print(f"Episode {episode+1}, Avg Reward: {avg_reward:.3f}")
        return rewards

Off-Policy vs. On-Policy
The Distinction

| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Behavior policy | Generates actions; identical to the target policy | Generates actions; may differ from the target policy |
| Target policy | The same policy that is being executed | The policy being learned (e.g., the greedy/optimal policy) |
| Sample source | The current policy | Any policy |
| Representative algorithms | SARSA, PPO | Q-Learning, DQN |
| Convergence | Reaching the optimal policy requires GLIE (e.g., SARSA) | Tabular Q-Learning converges to Q* even with a fixed exploratory behavior policy, but off-policy learning can be unstable with function approximation |
Why Q-Learning Is Off-Policy
Q-Learning uses two policies at the same time:
- Behavior policy: ε-greedy, used to select actions
- Target policy: the greedy policy $\pi(s) = \arg\max_a Q(s, a)$, used to compute the TD target
Because the behavior policy and the target policy differ, Q-Learning can learn the optimal policy from non-optimal (exploratory) trajectories.
SARSA: The On-Policy Counterpart
The key difference between SARSA (State-Action-Reward-State-Action) and Q-Learning lies in how the TD target is formed:

# Q-Learning (off-policy)
next_action = np.argmax(self.Q[next_state])   # greedy selection
target = reward + self.gamma * self.Q[next_state, next_action]

# SARSA (on-policy)
next_action = self.choose_action(next_state)  # ε-greedy selection
target = reward + self.gamma * self.Q[next_state, next_action]

Algorithm Comparison
| Aspect | Q-Learning | SARSA |
|---|---|---|
| TD target | $R + \gamma \max_{a'} Q(S', a')$ | $R + \gamma Q(S', A')$, with $A'$ drawn from the behavior policy |
| Temperament | Aggressive | Conservative |
| Convergence to the optimal policy | Does not require ε to decay (only infinite visitation and suitable step sizes) | Requires ε to decay to 0 (GLIE) |
| Behavior in practice | May take risky shortcuts (e.g., walk along the cliff edge) | Safer |
SARSA实现
class SARSAAgent:
"""SARSA智能体"""
def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
self.n_states = n_states
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.Q = np.zeros((n_states, n_actions))
def choose_action(self, state):
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
else:
return np.argmax(self.Q[state])
def update(self, state, action, reward, next_state, next_action, done):
if done:
target = reward
else:
# 使用下一个实际选择的动作
target = reward + self.gamma * self.Q[next_state, next_action]
self.Q[state, action] += self.alpha * (target - self.Q[state, action])
def train(self, env, n_episodes=1000):
for episode in range(n_episodes):
state = env.reset()
action = self.choose_action(state) # 选择初始动作
done = False
episode_reward = 0
while not done:
next_state, reward, done = env.step(action)
next_action = self.choose_action(next_state) # 选择下一个动作
self.update(state, action, reward, next_state, next_action, done)
state = next_state
action = next_action
episode_reward += reward
if (episode + 1) % 100 == 0:
print(f"Episode {episode+1}, Reward: {episode_reward}")收敛性分析
Convergence Conditions
Q-Learning is guaranteed to converge to $Q^*$ under the following conditions[2]:
- Finite state and action spaces
- Every state-action pair is visited infinitely often
- The learning rates satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (the Robbins-Monro conditions)
- The environment is Markovian
- GLIE (Greedy in the Limit with Infinite Exploration) is additionally required if the behavior policy itself, and not just the Q-values, should converge to the optimal policy
The GLIE Condition
GLIE has two parts:
- Infinite exploration: every state-action pair is visited infinitely often
- Greedy in the limit: the policy eventually converges to the greedy policy
A common way to achieve GLIE is to decay the exploration rate, e.g. $\epsilon_k = \frac{1}{k}$ (with $k$ the episode index) or another schedule that goes to 0 while preserving infinite exploration, as sketched below.
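A minimal sketch of such schedules; the 1/(k+1) decay and the per-pair step size are illustrative choices, not the only ones that satisfy the conditions above:

def glie_epsilon(k):
    """Exploration schedule epsilon_k = 1 / (k + 1): decays to 0 (greedy in the limit)
    while every state-action pair can still be tried infinitely often."""
    return 1.0 / (k + 1)

def robbins_monro_alpha(n_visits):
    """Per-(s, a) step size alpha = 1 / (n + 1): the sum over n diverges and the
    sum of squares converges, satisfying the step-size condition above."""
    return 1.0 / (n_visits + 1)

# Illustrative wiring with the QLearningAgent defined earlier (hypothetical usage):
# for episode in range(n_episodes):
#     agent.epsilon = glie_epsilon(episode)
#     ...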
Example: Cliff Walking
Problem Description
A 4x12 grid:

○○○○○○○○○○○○    S: start
○○○○○○○○○○○○    ○: safe cells (-1 reward per step)
○○○○○○○○○○○○    C: cliff (-100 reward, reset)
SCCCCCCCCCCG    G: goal

The agent must walk from S to G while avoiding the cliff.
Implementation

import numpy as np
import matplotlib.pyplot as plt

class CliffWalking:
    """Cliff Walking environment."""
    def __init__(self):
        self.n_rows = 4
        self.n_cols = 12
        self.n_states = self.n_rows * self.n_cols
        self.n_actions = 4
        self.start = 36                   # (3, 0)
        self.goal = 47                    # (3, 11)
        self.cliff = list(range(37, 47))  # cells (3, 1) .. (3, 10) in the bottom row

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        row, col = self.state // self.n_cols, self.state % self.n_cols
        # Move
        if action == 0:   row = max(0, row - 1)   # up
        elif action == 1: row = min(3, row + 1)   # down
        elif action == 2: col = max(0, col - 1)   # left
        elif action == 3: col = min(11, col + 1)  # right
        self.state = row * self.n_cols + col
        # Reward
        if self.state == self.goal:
            reward = 0
            done = True
        elif self.state in self.cliff:
            reward = -100
            done = True
            self.state = self.start  # reset
        else:
            reward = -1
            done = False
        return self.state, reward, done
def compare_algorithms():
    env = CliffWalking()
    n_episodes = 500

    # Q-Learning
    q_agent = QLearningAgent(48, 4, alpha=0.5, gamma=1.0, epsilon=0.1)
    q_rewards = q_agent.train(env, n_episodes)

    # SARSA
    sarsa_agent = SARSAAgent(48, 4, alpha=0.5, gamma=1.0, epsilon=0.1)
    sarsa_rewards = sarsa_agent.train(env, n_episodes)

    # Plot
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.plot(np.convolve(q_rewards, np.ones(10)/10, mode='valid'), label='Q-Learning')
    plt.plot(np.convolve(sarsa_rewards, np.ones(10)/10, mode='valid'), label='SARSA')
    plt.xlabel('Episode')
    plt.ylabel('Reward (10-episode moving avg)')
    plt.legend()
    plt.title('Cliff Walking: Q-Learning vs SARSA')

    plt.subplot(1, 2, 2)
    # Greedy state value max_a Q(s, a), plotted as a curve over the 48 state indices
    plt.plot(q_agent.Q.max(axis=1), label='Q-Learning')
    plt.plot(sarsa_agent.Q.max(axis=1), label='SARSA')
    plt.xlabel('State index')
    plt.title('Learned State Values (max over actions)')
    plt.legend()

    plt.tight_layout()
    plt.show()

compare_algorithms()

Result Analysis
Observations:
1. Q-Learning learns the optimal path that hugs the cliff edge, so its greedy policy achieves the best possible return (about -12 under this reward scheme), but its average return during training is pulled down because ε-greedy exploration occasionally walks it off the cliff (see the rollout sketch below).
2. SARSA's targets account for exploration, so it learns the safer detour through the upper rows and obtains a higher average return during training (around -25 in the classic setup).
3. Both methods assign strongly negative values to the states adjacent to the cliff.
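To check the claim about the two paths, a small helper can roll out each agent's greedy policy; it assumes the CliffWalking environment and the trained q_agent / sarsa_agent from above:

import numpy as np

def greedy_rollout(agent, env, max_steps=100):
    """Follow a trained agent's greedy policy (no exploration) and return
    the visited states and the accumulated reward."""
    state = env.reset()
    path, total_reward = [state], 0
    for _ in range(max_steps):
        action = np.argmax(agent.Q[state])
        state, reward, done = env.step(action)
        path.append(state)
        total_reward += reward
        if done:
            break
    return path, total_reward

# Q-Learning's greedy path typically hugs the cliff (shorter, return about -12 here),
# while SARSA's greedy path detours through the upper rows (longer but safer).
# path_q, ret_q = greedy_rollout(q_agent, env)
# path_s, ret_s = greedy_rollout(sarsa_agent, env)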
Extensions of Q-Learning
1. Double Q-Learning
Double Q-Learning addresses the overestimation bias of Q-Learning (taking a max over noisy estimates is biased upward):
class DoubleQLearningAgent:
    """Double Q-Learning agent."""
    def __init__(self, n_states, n_actions):
        self.Q1 = np.zeros((n_states, n_actions))
        self.Q2 = np.zeros((n_states, n_actions))
        self.n_actions = n_actions

    def choose_action(self, state, epsilon=0.1):
        if np.random.random() < epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q1[state] + self.Q2[state])

    def update(self, state, action, reward, next_state, done, alpha=0.1, gamma=0.95):
        if np.random.random() < 0.5:
            # Update Q1: select the action with Q1, evaluate it with Q2
            if done:
                target = reward
            else:
                a = np.argmax(self.Q1[next_state])
                target = reward + gamma * self.Q2[next_state, a]
            self.Q1[state, action] += alpha * (target - self.Q1[state, action])
        else:
            # Update Q2: select the action with Q2, evaluate it with Q1
            if done:
                target = reward
            else:
                a = np.argmax(self.Q2[next_state])
                target = reward + gamma * self.Q1[next_state, a]
            self.Q2[state, action] += alpha * (target - self.Q2[state, action])

2. Delayed Q-Learning
Each state-action pair is updated only after its visit count reaches a threshold, which reduces the variance of the updates.
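A heavily simplified sketch of this batching idea; the full Delayed Q-Learning algorithm also uses optimistic initialization and update-failure flags, which are omitted here, and the threshold m is an illustrative choice:

import numpy as np
from collections import defaultdict

class DelayedUpdateQAgent:
    """Simplified sketch: buffer TD targets per (s, a) and update Q(s, a)
    only after m visits, using the averaged target."""
    def __init__(self, n_states, n_actions, m=5, alpha=1.0, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.m = m              # visit threshold before an update is applied
        self.alpha = alpha      # 1.0 means "jump to the averaged target"
        self.gamma = gamma
        self.buffer = defaultdict(list)  # (s, a) -> list of buffered TD targets

    def update(self, state, action, reward, next_state, done):
        target = reward if done else reward + self.gamma * np.max(self.Q[next_state])
        self.buffer[(state, action)].append(target)
        if len(self.buffer[(state, action)]) >= self.m:
            avg_target = np.mean(self.buffer[(state, action)])
            self.Q[state, action] += self.alpha * (avg_target - self.Q[state, action])
            self.buffer[(state, action)].clear()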
3. Prioritized Sweeping
Prioritize updates of the most "surprising" states, ordered by the magnitude of their TD error.
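A minimal sketch of prioritized sweeping for a deterministic, tabular environment; the priority threshold theta and the number of planning steps are illustrative, and terminal-state handling is omitted:

import heapq
import numpy as np
from collections import defaultdict

class PrioritizedSweepingSketch:
    """Remember transitions as a model and replay (s, a) pairs in order of
    |TD error|, propagating value changes back to predecessor states."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 theta=1e-3, n_planning=10):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma = alpha, gamma
        self.theta = theta                      # minimum priority to enqueue
        self.n_planning = n_planning            # planning updates per real step
        self.model = {}                         # (s, a) -> (r, s')
        self.predecessors = defaultdict(set)    # s' -> set of (s, a) leading to it
        self.pqueue = []                        # max-heap via negated priorities

    def _priority(self, s, a, r, s2):
        return abs(r + self.gamma * np.max(self.Q[s2]) - self.Q[s, a])

    def step(self, s, a, r, s2):
        self.model[(s, a)] = (r, s2)
        self.predecessors[s2].add((s, a))
        p = self._priority(s, a, r, s2)
        if p > self.theta:
            heapq.heappush(self.pqueue, (-p, s, a))
        # Planning: replay the most "surprising" pairs first
        for _ in range(self.n_planning):
            if not self.pqueue:
                break
            _, ps, pa = heapq.heappop(self.pqueue)
            pr, ps2 = self.model[(ps, pa)]
            target = pr + self.gamma * np.max(self.Q[ps2])
            self.Q[ps, pa] += self.alpha * (target - self.Q[ps, pa])
            # The value of ps changed, so its predecessors may now be "surprised" too
            for (qs, qa) in self.predecessors[ps]:
                qr, qs2 = self.model[(qs, qa)]
                qp = self._priority(qs, qa, qr, qs2)
                if qp > self.theta:
                    heapq.heappush(self.pqueue, (-qp, qs, qa))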
Combining with Deep Learning
Limitations of tabular Q-Learning:
- It only handles discrete, low-dimensional state spaces
- It cannot handle high-dimensional inputs such as images
Solution: approximate the Q-function with a deep neural network, i.e. DQN. See DQN.
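As a stepping stone toward DQN, here is a minimal sketch of semi-gradient Q-Learning with a linear function approximator; the feature map phi below is a hypothetical one-hot encoding chosen only for illustration (with it, the method reduces to tabular Q-Learning):

import numpy as np

def phi(state, action, n_states, n_actions):
    """Hypothetical feature map: one-hot over (state, action) pairs."""
    x = np.zeros(n_states * n_actions)
    x[state * n_actions + action] = 1.0
    return x

def semi_gradient_q_update(w, s, a, r, s2, done, n_states, n_actions,
                           alpha=0.1, gamma=0.95):
    """One semi-gradient Q-Learning step on the weight vector w,
    where Q(s, a) = w . phi(s, a)."""
    q_sa = w @ phi(s, a, n_states, n_actions)
    if done:
        target = r
    else:
        q_next = [w @ phi(s2, b, n_states, n_actions) for b in range(n_actions)]
        target = r + gamma * max(q_next)
    # For a linear approximator, the gradient of Q w.r.t. w is just phi(s, a)
    w += alpha * (target - q_sa) * phi(s, a, n_states, n_actions)
    return w

# Usage sketch:
# n_states, n_actions = 48, 4
# w = np.zeros(n_states * n_actions)
# w = semi_gradient_q_update(w, s=0, a=1, r=-1.0, s2=12, done=False,
#                            n_states=n_states, n_actions=n_actions)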
References

Further Topics
- DQN: deep Q-networks for high-dimensional state spaces
- Actor-Critic: combining policy gradients with value functions
- Dynamic Programming: model-based planning methods