XGBoost深入解析

概述

XGBoost（eXtreme Gradient Boosting）是GBDT的高效实现版本，由陈天奇于2016年提出。¹ 通过引入正则化目标函数、二阶导数优化和高效的系统设计，XGBoost成为Kaggle竞赛和工业界最流行的机器学习算法之一。

1. XGBoost目标函数

1.1 带正则化的目标函数

XGBoost在GBDT的基础上增加了显式正则化项：

L (ϕ) = i = 1 \sum N L (y_{i}, \overset{y}{^}_{i}) + m = 1 \sum M Ω (f_{m})

其中：

$L (y_{i}, \overset{y}{^}_{i})$ 是损失函数（如MSE、LogLoss）
$Ω (f) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}$ 是正则化项
$T$ 是叶子节点数
$w_{j}$ 是叶子节点 $j$ 的预测值

1.2 正则化的物理意义

正则项	作用
$γ T$	惩罚叶子节点数量，控制树的复杂度
$\frac{1}{2} λ \sum w_{j}^{2}$	L2正则化，防止叶子权重过大
$\alpha \sum	w_j

1.3 前向分步框架

在第 $m$ 轮迭代时，目标函数为：

L^{(m)} = i = 1 \sum N L (y_{i}, F_{m - 1} (x_{i}) + f_{m} (x_{i})) + Ω (f_{m})

其中 $f_{m}$ 是第 $m$ 棵树的输出。

2. 二阶泰勒展开

2.1 泰勒展开近似

XGBoost对损失函数进行二阶泰勒展开：

L (y_{i}, F_{m - 1} + f) \approx L (y_{i}, F_{m - 1}) + g_{i} f + \frac{1}{2} h_{i} f^{2}

其中：

$g_{i} = \frac{\partial L ( y _{i} , F _{m - 1} ( x _{i} ))}{\partial F _{m - 1} ( x _{i} )}$ 是一阶导数（梯度）
$h_{i} = \frac{\partial ^{2} L ( y _{i} , F _{m - 1} ( x _{i} ))}{\partial F _{m - 1} ( x _{i} ) ^{2}}$ 是二阶导数（Hessian对角元素）

2.2 简化的目标函数

移除常数项后：

\tilde{L}^{(m)} = i = 1 \sum N [g_{i} f (x_{i}) + \frac{1}{2} h_{i} f^{2} (x_{i})] + Ω (f_{m})

2.3 不同损失函数的一阶、二阶导数

损失函数	表达式	一阶导数 $g_{i}$	二阶导数 $h_{i}$
MSE	$\frac{1}{2} (y - F)^{2}$	$F - y$	$1$
Logistic	$y ln (1 + e^{- F}) + (1 - y) ln (1 + e^{F})$	$σ (F) - y$	$σ (F) (1 - σ (F))$
Cross-Entropy	$- y ln σ (F) - (1 - y) ln (1 - σ (F))$	$σ (F) - y$	$σ (F) (1 - σ (F))$
Huber	$\begin{cases}\frac{1}{2}(y-F)^2 &	y-F	\leq\delta \ \delta(

3. 树结构打分

3.1 叶子节点输出解析解

将树结构 $f_{m}$ 写成叶子节点的形式：

f_{m} (x) = j = 1 \sum T w_{j} \cdot 1 [x \in R_{j}]

其中 $R_{j}$ 是叶子节点 $j$ 的样本集合。

3.2 目标函数重写

\tilde{L}^{(m)} = j = 1 \sum T w_{j} i \in R_{j} \sum g_{i} + \frac{1}{2} w_{j}^{2} i \in R_{j} \sum h_{i} + γ T + \frac{1}{2} λ j = 1 \sum T w_{j}^{2}

定义：

$G_{j} = \sum_{i \in R_{j}} g_{i}$ — 叶子节点 $j$ 的一阶梯度和
$H_{j} = \sum_{i \in R_{j}} h_{i}$ — 叶子节点 $j$ 的二阶梯度和

则：

\tilde{L}^{(m)} = j = 1 \sum T [G_{j} w_{j} + \frac{1}{2} (H_{j} + λ) w_{j}^{2}] + γ T

3.3 最优叶子权重

对于每个叶子节点，最优权重 $w_{j}^{*}$ 为：

\frac{\partial L ~}{\partial w _{j}} = G_{j} + (H_{j} + λ) w_{j} = 0

w_{j}^{*} = - \frac{G _{j}}{H _{j} + λ}

3.4 最优目标函数值

将最优权重代入：

\tilde{L}^{*} = - \frac{1}{2} j = 1 \sum T \frac{G _{j}^{2}}{H _{j} + λ} + γ T

重要性质：对于固定树结构，最优权重和最优目标函数值都可以闭式计算！

4. 树结构搜索算法

4.1 精确贪心算法

枚举所有可能的分裂点，计算增益：

Gain = \frac{1}{2} [\frac{G _{L}^{2}}{H _{L} + λ} + \frac{G _{R}^{2}}{H _{R} + λ} - \frac{( G _{L} + G _{R} ) ^{2}}{H _{L} + H _{R} + λ}] - γ

其中：

$\frac{G _{L}^{2}}{H _{L} + λ}$ 和 $\frac{G _{R}^{2}}{H _{R} + λ}$ 是左右子树的分数
$\frac{( G _{L} + G _{R} ) ^{2}}{H _{L} + H _{R} + λ}$ 是分裂前父节点的分数
$γ$ 是分裂惩罚

def exact_gain(G, H, G_L, H_L, G_R, H_R, gamma, lambda_reg):
    """计算精确分裂增益"""
    # 左子树分数
    score_left = G_L**2 / (H_L + lambda_reg)
    # 右子树分数
    score_right = G_R**2 / (H_R + lambda_reg)
    # 父节点分数
    score_parent = (G_L + G_R)**2 / (H_L + H_R + lambda_reg)
    
    # 增益 = 左右子树分数 - 父节点分数 - 复杂度惩罚
    gain = 0.5 * (score_left + score_right - score_parent) - gamma
    return gain
 
def find_best_split(X, G, H, gamma, lambda_reg):
    """寻找最优分裂"""
    best_gain = -float('inf')
    best_feature, best_split = None, None
    
    d = X.shape[1]
    G_total = G.sum()
    H_total = H.sum()
    
    for j in range(d):
        # 获取特征j的唯一值作为候选分裂点
        thresholds = np.unique(X[:, j])
        
        for s in thresholds:
            # 分裂样本
            left_mask = X[:, j] <= s
            right_mask = ~left_mask
            
            # 计算左右子树的梯度
            G_L, H_L = G[left_mask].sum(), H[left_mask].sum()
            G_R, H_R = G[right_mask].sum(), H[right_mask].sum()
            
            # 计算增益
            gain = exact_gain(G_total, H_total, G_L, H_L, G_R, H_R, gamma, lambda_reg)
            
            if gain > best_gain:
                best_gain = gain
                best_feature = j
                best_split = s
    
    return best_feature, best_split, best_gain

4.2 近似算法

当数据量很大时，精确枚举所有分裂点代价过高。XGBoost使用分位数（Quantile）近似：

def approximate_best_split(X, G, H, feature, n_bins=256):
    """
    近似分裂算法：使用特征分位数作为候选点
    
    参数:
        X: 特征矩阵
        G, H: 梯度统计
        feature: 特征索引
        n_bins: 分箱数量
    """
    # 计算特征的百分位数
    percentiles = np.linspace(0, 100, n_bins + 1)
    thresholds = np.percentile(X[:, feature], percentiles[1:-1])
    
    # 构建梯度统计的压缩表示
    data = np.column_stack([X[:, feature], G, H])
    data_sorted = data[data[:, 0].argsort()]
    
    # 计算每个桶的梯度统计
    bucket_size = len(data_sorted) // n_bins
    candidates = []
    
    for b in range(n_bins):
        start = b * bucket_size
        end = (b + 1) * bucket_size if b < n_bins - 1 else len(data_sorted)
        
        G_bucket = data_sorted[start:end, 1].sum()
        H_bucket = data_sorted[start:end, 2].sum()
        
        # 使用桶的右边界作为候选点
        split_value = data_sorted[end - 1, 0]
        candidates.append((split_value, G_bucket, H_bucket, start, end))
    
    return candidates

4.3 加权分位数

XGBoost使用二阶梯度 $h_{i}$ 作为样本权重，实现加权分位数：

s_{k} = \frac{1}{\sum _{i} h _{i}} i \sum h_{i} \cdot 1 [特征值 \leq threshold_{k}]

这确保候选分裂点在二阶梯度意义下均匀分布。

5. 系统设计优化

5.1 特征排序并行化

XGBoost在训练前对每个特征进行排序，然后并行地寻找最优分裂：

特征1: [排序索引] → [并行分裂计算] → 最优分裂
特征2: [排序索引] → [并行分裂计算] → 最优分裂
...

5.2 缓存感知访问

class CacheAwareGradientStatistics:
    """缓存感知的梯度统计计算"""
    
    def __init__(self, X, y, gradients, block_size=2**10):
        self.X = X
        self.y = y
        self.gradients = gradients
        self.block_size = block_size
        self._prefetch()
    
    def _prefetch(self):
        """预取数据以优化缓存"""
        n_samples = len(self.X)
        
        for start in range(0, n_samples, self.block_size):
            end = min(start + self.block_size, n_samples)
            
            # 提取数据块
            X_block = self.X[start:end]
            g_block = self.gradients[start:end, 0]
            h_block = self.gradients[start:end, 1]
            
            # 计算该块的统计
            yield X_block, g_block, h_block

5.3 缺失值处理

XGBoost对缺失值有内置支持：

def find_split_with_missing(X, G, H, feature, missing_mask):
    """
    处理缺失值的分裂查找
    
    思路：将缺失值分别放入左右子树，计算两边的增益
          选择使增益最大的方向
    """
    # 非缺失样本
    valid_mask = ~missing_mask
    X_valid = X[valid_mask, feature]
    G_valid = G[valid_mask]
    H_valid = H[valid_mask]
    
    # 缺失样本的梯度
    G_missing = G[missing_mask].sum()
    H_missing = H[missing_mask].sum()
    
    # 遍历候选分裂点
    best_gain = -float('inf')
    best_split = None
    best_missing_direction = None
    
    for split in find_candidates(X_valid):
        # 计算不包含缺失值时的增益
        left_mask = X_valid <= split
        G_L = G_valid[left_mask].sum() + G_missing
        H_L = H_valid[left_mask].sum() + H_missing
        G_R = G_valid[~left_mask].sum()
        H_R = H_valid[~left_mask].sum()
        
        gain = compute_gain(G_L, H_L, G_R, H_R)
        
        if gain > best_gain:
            best_gain = gain
            best_split = split
            best_missing_direction = 'left'
    
    return best_split, best_missing_direction

6. XGBoost vs 传统GBDT

6.1 关键改进

方面	GBDT	XGBoost
目标函数	一阶优化	二阶泰勒展开
正则化	隐式（收缩/子采样）	显式正则化项
缺失值	需要预处理	内置处理
列采样	无	支持（特征子采样）
分裂点搜索	精确贪心	精确 + 近似
并行化	弱	强（特征级并行）

6.2 二阶近似的优势

一阶近似（GBDT）:        二阶近似（XGBoost）:
     ↓                        ↓
     ●───────────────→    ●───────────────→
     ↖ 一阶导数方向        ↖↗ 考虑曲率的方向
                           
  需要更多迭代            更少的迭代达到相同精度

二阶近似考虑了损失函数的曲率信息，能更准确地估计最优下降方向。

7. PyTorch实现核心代码

import torch
import numpy as np
from sklearn.tree import DecisionTreeRegressor
 
class XGBoostRegressor:
    """XGBoost核心实现"""
    
    def __init__(self, n_estimators=100, max_depth=6, 
                 learning_rate=0.1, lambda_reg=1.0, gamma=0,
                 subsample=1.0, colsample_bytree=1.0, min_child_weight=1):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.lr = learning_rate
        self.lambda_reg = lambda_reg
        self.gamma = gamma
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.min_child_weight = min_child_weight
        self.trees = []
        self.F_0 = None
    
    def fit(self, X, y):
        """训练XGBoost模型"""
        X = np.array(X)
        y = np.array(y)
        
        # 初始化
        self.F_0 = np.mean(y)
        F = self.F_0 * np.ones(len(y))
        
        # 一阶、二阶导数（MSE损失）
        g = F - y
        h = np.ones(len(y))
        
        for m in range(self.n_estimators):
            # 子采样
            if self.subsample < 1.0:
                n_sample = int(len(y) * self.subsample)
                indices = np.random.choice(len(y), n_sample, replace=False)
                X_m, g_m, h_m, F_m = X[indices], g[indices], h[indices], F[indices]
            else:
                X_m, g_m, h_m, F_m = X, g, h, F
            
            # 列采样
            if self.colsample_bytree < 1.0:
                n_feature = int(X.shape[1] * self.colsample_bytree)
                feature_indices = np.random.choice(X.shape[1], n_feature, replace=False)
                X_m = X_m[:, feature_indices]
            else:
                feature_indices = np.arange(X.shape[1])
            
            # 训练树
            tree = self._build_tree(X_m, g_m, h_m, F_m, depth=0)
            tree['feature_indices'] = feature_indices
            self.trees.append(tree)
            
            # 更新预测和梯度
            predictions = self._predict_tree(tree, X_m)
            F_m = F_m - self.lr * predictions
            g_m = F_m - y[indices if self.subsample < 1.0 else slice(None)]
            
            # 更新全局变量（对于完整数据）
            if self.subsample < 1.0:
                F_full = self._predict_tree(tree, X)
                F = F - self.lr * F_full
                g = F - y
            else:
                F = F_m
                g = g_m
        
        return self
    
    def _build_tree(self, X, g, h, F, depth):
        """递归构建树"""
        if depth >= self.max_depth or len(X) < 2 * self.min_child_weight:
            # 叶子节点：计算最优权重
            w = -g.sum() / (h.sum() + self.lambda_reg)
            return {'leaf': True, 'w': w}
        
        # 寻找最优分裂
        best_score = -float('inf')
        best_split = None
        
        d = X.shape[1]
        for j in range(d):
            thresholds = np.unique(X[:, j])
            for s in thresholds:
                left_mask = X[:, j] <= s
                right_mask = ~left_mask
                
                if left_mask.sum() < self.min_child_weight or \
                   right_mask.sum() < self.min_child_weight:
                    continue
                
                # 计算分裂增益
                G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
                G_R, H_R = g[right_mask].sum(), h[right_mask].sum()
                
                score = 0.5 * (
                    G_L**2 / (H_L + self.lambda_reg) +
                    G_R**2 / (H_R + self.lambda_reg) -
                    (G_L + G_R)**2 / (H_L + H_R + self.lambda_reg)
                ) - self.gamma
                
                if score > best_score:
                    best_score = score
                    best_split = {'feature': j, 'threshold': s, 
                                 'left_mask': left_mask, 'right_mask': right_mask}
        
        if best_split is None or best_score <= 0:
            w = -g.sum() / (h.sum() + self.lambda_reg)
            return {'leaf': True, 'w': w}
        
        # 递归构建子树
        return {
            'leaf': False,
            'feature': best_split['feature'],
            'threshold': best_split['threshold'],
            'left': self._build_tree(X[best_split['left_mask']], 
                                     g[best_split['left_mask']],
                                     h[best_split['left_mask']],
                                     F[best_split['left_mask']], depth + 1),
            'right': self._build_tree(X[best_split['right_mask']],
                                      g[best_split['right_mask']],
                                      h[best_split['right_mask']],
                                      F[best_split['right_mask']], depth + 1)
        }
    
    def _predict_tree(self, tree, X):
        """使用树进行预测"""
        if tree['leaf']:
            return tree['w'] * np.ones(len(X))
        
        predictions = np.zeros(len(X))
        left_mask = X[:, tree['feature']] <= tree['threshold']
        right_mask = ~left_mask
        
        predictions[left_mask] = self._predict_tree(tree['left'], X[left_mask])
        predictions[right_mask] = self._predict_tree(tree['right'], X[right_mask])
        
        return predictions
    
    def predict(self, X):
        """预测"""
        X = np.array(X)
        F = self.F_0 * np.ones(len(X))
        
        for tree in self.trees:
            feature_indices = tree.pop('feature_indices')
            predictions = self._predict_tree(tree, X[:, feature_indices])
            F = F - self.lr * predictions
            tree['feature_indices'] = feature_indices
        
        return F

8. XGBoost参数调优指南

8.1 关键参数

参数	作用	建议范围
`n_estimators`	树的数量	100-1000
`max_depth`	树的最大深度	3-10
`learning_rate`	学习率（收缩）	0.01-0.3
`subsample`	行采样比例	0.5-1.0
`colsample_bytree`	列采样比例	0.5-1.0
`min_child_weight`	最小叶子权重和	1-10
`lambda_reg`	L2正则化	0-10
`gamma`	最小分裂增益	0-5

8.2 调参策略

# 两阶段调参策略
param_grid = {
    # 第一阶段：固定棵树，调学习率和深度
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    
    # 第二阶段：基于最佳参数，增加树的数量
    'n_estimators': [500, 1000, 2000],
    
    # 第三阶段：正则化微调
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

Metaphor

探索

XGBoost深入解析

XGBoost深入解析

概述

1. XGBoost目标函数

1.1 带正则化的目标函数

1.2 正则化的物理意义

1.3 前向分步框架

2. 二阶泰勒展开

2.1 泰勒展开近似

2.2 简化的目标函数

2.3 不同损失函数的一阶、二阶导数

3. 树结构打分

3.1 叶子节点输出解析解

3.2 目标函数重写

3.3 最优叶子权重

3.4 最优目标函数值

4. 树结构搜索算法

4.1 精确贪心算法

4.2 近似算法

4.3 加权分位数

5. 系统设计优化

5.1 特征排序并行化

5.2 缓存感知访问

5.3 缺失值处理

6. XGBoost vs 传统GBDT

6.1 关键改进

6.2 二阶近似的优势

7. PyTorch实现核心代码

8. XGBoost参数调优指南

8.1 关键参数

8.2 调参策略

参考

相关主题

关系图谱

目录

反向链接

Metaphor

探索

XGBoost深入解析

XGBoost深入解析

概述

1. XGBoost目标函数

1.1 带正则化的目标函数

1.2 正则化的物理意义

1.3 前向分步框架

2. 二阶泰勒展开

2.1 泰勒展开近似

2.2 简化的目标函数

2.3 不同损失函数的一阶、二阶导数

3. 树结构打分

3.1 叶子节点输出解析解

3.2 目标函数重写

3.3 最优叶子权重

3.4 最优目标函数值

4. 树结构搜索算法

4.1 精确贪心算法

4.2 近似算法

4.3 加权分位数

5. 系统设计优化

5.1 特征排序并行化

5.2 缓存感知访问

5.3 缺失值处理

6. XGBoost vs 传统GBDT

6.1 关键改进

6.2 二阶近似的优势

7. PyTorch实现核心代码

8. XGBoost参数调优指南

8.1 关键参数

8.2 调参策略

参考

相关主题

Footnotes

关系图谱

目录

反向链接