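CatBoost: A Detailed Guide

1. Overview

CatBoost is a gradient boosting library from Yandex, open-sourced in 2017 and described in a 2018 NeurIPS paper. Its central design goal is native support for categorical features, so mixed-type data can be used without elaborate preprocessing.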

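1.1 Core contributions of CatBoost

Contribution	Description
Ordered Target Encoding	Addresses target leakage when encoding categorical features
Symmetric tree structure	Speeds up inference and reduces overfitting
Target leakage prevention	Statistics are computed from preceding samples only
Automatic missing-value handling	No imputation needed for numerical features
GPU acceleration	Native CUDA support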

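1.2 Differences from other algorithms

Feature	XGBoost	LightGBM	CatBoost
Categorical features	Encoding usually required	Native support	Native support plus ordered statistics
Missing values	Learns a default direction	Handled	Handled
Tree structure	Ordinary binary tree (level-wise growth by default)	Leaf-wise	Symmetric tree
Target leakage prevention	None	None	Yes (Ordered TS)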

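2. Categorical feature handling: Ordered Target Encoding

2.1 The target leakage problem

In standard gradient boosting with target encoding, the statistic for a categorical value (for example the target mean) is computed over all of the training data, which leaks the target:

$$\bar{y}_{j} = \frac{\sum_{i \in D_{j}} y_{i}}{|D_{j}|}$$

where $D_{j}$ is the set of samples with category $j$. For any sample $i \in D_{j}$, this statistic includes that sample's own label $y_{i}$.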
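To make the leak concrete, here is a minimal sketch in plain Python. The function name `greedy_target_mean` and the toy data are illustrative assumptions, not part of CatBoost. When every category value is unique, the category mean for each sample is just its own label, so the encoded feature reproduces the target exactly.

```python
from collections import defaultdict

def greedy_target_mean(cats, ys):
    """Encode each sample with the mean target of its category over ALL samples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    return [sums[c] / counts[c] for c in cats]

# Toy data: every category value occurs exactly once (for example a user ID)
cats = ["u1", "u2", "u3", "u4"]
ys = [1, 0, 0, 1]

encoded = greedy_target_mean(cats, ys)
print(encoded)  # each encoded value equals the sample's own label
```

A model trained on this feature can fit the training set perfectly while learning nothing that transfers to unseen IDs.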

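2.2 Effects of target leakage

Target leakage leads to:

Over-optimistic training metrics: the encoded feature already contains information from the label being predicted
Worse generalization: the test set carries no such leaked information
Distorted feature importance: the importance of categorical features is overestimated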

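2.3 Ordered Target Statistics

CatBoost addresses this with Ordered Target Statistics (OTS):

For a sample x_i whose categorical feature value is cat:

1. Draw a random permutation σ of the training data

2. Compute the statistic for the sample at position i only from samples at earlier positions in σ:
   TS(cat, i) = f(y_{σ(1)}, ..., y_{σ(i-1)}, cat)
   
3. A smoothed running mean is the usual choice:
   TS(cat, i) = (sum_{j < i, x_{σ(j)} = cat} y_{σ(j)} + α * prior) / (count_{j < i, x_{σ(j)} = cat} + α)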
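The steps above can be traced by hand on a tiny example. The sketch below assumes the samples are already listed in permutation order and uses the smoothed form (sum + α · prior) / (count + α); the helper name `ordered_ts` is hypothetical.

```python
def ordered_ts(cats, ys, alpha=1.0, prior=None):
    """Ordered target statistic for samples already arranged in permutation order."""
    if prior is None:
        prior = sum(ys) / len(ys)
    sums, counts, out = {}, {}, []
    for c, y in zip(cats, ys):
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # The statistic uses only samples seen before the current one
        out.append((s + alpha * prior) / (n + alpha))
        # The current label is added afterwards, so it never encodes itself
        sums[c] = s + y
        counts[c] = n + 1
    return out

cats = ["A", "B", "A", "A", "B"]
ys = [1, 0, 1, 0, 1]
ts = ordered_ts(cats, ys)          # prior = 0.6, alpha = 1
print([round(v, 4) for v in ts])   # [0.6, 0.6, 0.8, 0.8667, 0.3]
```

The first occurrence of each category receives the prior, and later occurrences drift toward the running mean of the labels seen so far.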

2.4 OTS variants

CatBoost uses several random permutations to reduce variance. The function below illustrates the idea by averaging the statistic over permutations; CatBoost itself draws on different permutations at different boosting iterations rather than averaging them:

import numpy as np

def ordered_target_stats(X, y, cat_feature, n_permutations=4, alpha=1.0):
    """
    Compute Ordered Target Statistics for one categorical column,
    averaged over several random permutations.
    X is a pandas DataFrame; y is a pandas Series or array-like of targets.
    """
    cats = X[cat_feature].to_numpy()
    y_arr = np.asarray(y, dtype=float)
    n = len(y_arr)
    prior = y_arr.mean()
    
    result = np.zeros(n)
    
    for _ in range(n_permutations):
        # Random permutation: perm[k] is the original index of the sample at position k
        perm = np.random.permutation(n)
        
        # Running per-category sums and counts over the samples seen so far
        cat_to_sum = {}
        cat_to_count = {}
        
        for orig_idx in perm:
            cat = cats[orig_idx]
            sum_y = cat_to_sum.get(cat, 0.0)
            count = cat_to_count.get(cat, 0)
            
            # Use only the statistics of samples that precede the current one
            ts = (sum_y + alpha * prior) / (count + alpha)
            result[orig_idx] += ts / n_permutations
            
            # Add the current sample to the running statistics afterwards
            cat_to_sum[cat] = sum_y + y_arr[orig_idx]
            cat_to_count[cat] = count + 1
    
    return result

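2.5 Handling unseen categories

For a category value that appears in the test set but not in the training set, CatBoost falls back to the global prior. With a sum and count of zero, the smoothed statistic reduces to:

TS(new_cat, i) = (0 + α * prior) / (0 + α) = prior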
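A short sketch of test-time encoding shows the fallback. It assumes that test samples are encoded with statistics collected over the whole training set and with the smoothed form (sum + α · prior) / (count + α); both helper functions are hypothetical names for illustration.

```python
def fit_category_stats(cats, ys):
    """Collect per-category target sums and counts over the full training set."""
    sums, counts = {}, {}
    for c, y in zip(cats, ys):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    prior = sum(ys) / len(ys)
    return sums, counts, prior

def encode_at_test_time(cat, sums, counts, prior, alpha=1.0):
    """Smoothed target statistic; an unseen category has sum 0 and count 0."""
    return (sums.get(cat, 0.0) + alpha * prior) / (counts.get(cat, 0) + alpha)

sums, counts, prior = fit_category_stats(["A", "B", "A"], [1, 0, 1])
seen = encode_at_test_time("A", sums, counts, prior)
unseen = encode_at_test_time("Z", sums, counts, prior)
print(seen, unseen)  # the unseen category "Z" is encoded as the prior
```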

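3. Symmetric tree structure

3.1 Definition

A symmetric (oblivious) tree uses the same split feature and threshold for every node on a given level:

Level 0:            Feature A < 5
                  /              \
Level 1:   Feature B < 3      Feature B < 3      # same split across the level
            /      \           /      \
Level 2:  ...      ...       ...      ...        # every path applies the same sequence of splits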

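3.2 Symmetric trees vs. ordinary decision trees

Property	Ordinary decision tree	Symmetric tree
Split strategy	Chosen independently per node	One split shared per level
Tree shape	May be unbalanced	Balanced
Inference speed	Slower	Faster
Overfitting risk	Higher	Lower
Expressiveness	Higher	Moderate

3.3 Advantages of symmetric trees

Fast inference: every path applies the same sequence of split rules
Less overfitting: the shared splits restrict the model's degrees of freedom
Good interpretability: the logic of the whole tree can be visualized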
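The inference advantage comes from the fact that a leaf can be located without walking the tree: each level contributes one bit, and the bits together form the leaf index. The following is a minimal sketch of that idea; the data layout (`splits`, `leaf_values`) is an assumption for illustration and not CatBoost's internal format.

```python
def predict_symmetric_tree(x, splits, leaf_values):
    """
    x: dict mapping feature name to value
    splits: one (feature, threshold) pair per level, shared by all nodes on that level
    leaf_values: list of 2 ** len(splits) leaf values
    """
    index = 0
    for feature, threshold in splits:
        # Each level contributes one bit: 1 if the comparison is true, else 0
        bit = 1 if x[feature] < threshold else 0
        index = (index << 1) | bit
    return leaf_values[index]

splits = [("A", 5), ("B", 3)]        # depth-2 tree: A < 5 on level 0, B < 3 on level 1
leaf_values = [0.1, 0.2, 0.3, 0.4]   # indexed by the bits (A < 5, B < 3)

print(predict_symmetric_tree({"A": 7, "B": 1}, splits, leaf_values))  # bits 0,1 -> index 1 -> 0.2
```

Because the comparisons do not depend on one another, they can be evaluated for all levels at once, which is what makes this structure friendly to vectorized inference.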

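3.4 Prediction Stretching

Because both nodes on a level repeat the same split, a region that an ordinary tree covers with one leaf can be spread across several leaves of a symmetric tree, which is how the symmetric tree retains its expressiveness:

Ordinary tree:     Symmetric tree (Prediction Stretching):
    A<5                A<5
   /  \               /  \
  B<3  C<7           B<3  B<3
 / \  / \           / \  / \
a  b c  d          a  b c  d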

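4. Gradient boosting and ordered boosting

4.1 Standard gradient boosting

Standard gradient boosting computes the gradients from all of the data in every iteration:

# Pseudocode: compute_gradient and fit_tree stand for the loss gradient and the tree learner
for t in range(n_estimators):
    # Compute gradients (using all of the data)
    g = compute_gradient(y, F)
    
    # Fit a tree to the negative gradients
    h = fit_tree(X, -g)
    
    # Update the ensemble prediction
    F += learning_rate * h.predict(X)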
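The loop above can be made runnable with a deliberately simple learner. The sketch below assumes squared-error loss (so the negative gradient is the residual y - F) and uses depth-1 stumps on a single feature in place of full trees; `fit_stump`, `gradient_boost` and `mse` are illustrative names.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm

def gradient_boost(xs, ys, n_estimators=20, learning_rate=0.3):
    """Squared-error boosting: each stump is fitted to the negative gradient y - F."""
    F = [0.0] * len(xs)
    trees = []
    for _ in range(n_estimators):
        residuals = [y - f for y, f in zip(ys, F)]   # negative gradient, all samples
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        F = [f + learning_rate * tree(x) for f, x in zip(F, xs)]
    return trees, F

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
_, F_1 = gradient_boost(xs, ys, n_estimators=1)
_, F_20 = gradient_boost(xs, ys, n_estimators=20)
print(mse(ys, F_1), mse(ys, F_20))  # the error shrinks as more stumps are added
```

Note that every residual is computed from a model that was itself fitted on that sample's label, which is the dependency ordered boosting is designed to remove.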

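4.2 Ordered Boosting

CatBoost uses ordered boosting to keep the target from leaking into the gradients:

# Pseudocode: samples are indexed by their position in a random permutation σ
for t in range(n_estimators):
    g = [0.0] * n_samples
    for i in range(n_samples):
        # M_prev(i): the current model trained only on samples that precede i in σ
        F_prev = M_prev(i).predict(x[i])
        
        # The gradient for sample i does not depend on a model that has seen y[i]
        g[i] = compute_gradient(y[i], F_prev)
    
    # Fit one tree to the negative gradients of all samples
    h = fit_tree(X, [-g_i for g_i in g])
    
    # Update the ensemble (and each supporting model M_prev(i) on its own prefix)
    F += learning_rate * h.predict(X)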

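4.3 Practical implementation

To avoid $O(N^{2})$ complexity, CatBoost approximates the scheme with blocks of the ordering:

1. Split the data into k buckets

2. For bucket i:
   - Train a model on the data in buckets 1 to i-1
   - Predict the samples in bucket i

3. Collect the predictions for all buckets and use them to estimate the gradients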
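The bucket scheme can be sketched as follows. This is an illustration under simplifying assumptions, not CatBoost's implementation: the samples are taken to be in permutation order already, and the "model" trained on earlier buckets is just the mean target, standing in for a boosted ensemble. The function name and the `fallback` parameter are hypothetical.

```python
def bucketed_ordered_predictions(ys, k, fallback=0.0):
    """
    Split samples (already in permutation order) into k contiguous buckets.
    Each bucket is predicted by a model trained only on earlier buckets.
    The model here is simply the mean target of those earlier samples.
    """
    n = len(ys)
    bounds = [round(b * n / k) for b in range(k + 1)]
    preds = [None] * n
    for b in range(k):
        start, end = bounds[b], bounds[b + 1]
        history = ys[:start]                     # earlier buckets only
        model = sum(history) / len(history) if history else fallback
        for i in range(start, end):
            preds[i] = model                     # never uses ys[i] itself
    return preds

ys = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
preds = bucketed_ordered_predictions(ys, k=3)
residuals = [y - p for y, p in zip(ys, preds)]   # gradient estimates for squared error
print(preds)      # [0.0, 0.0, 2.0, 2.0, 2.5, 2.5]
print(residuals)
```

Only k models are needed instead of one per sample; the price is that samples within the same bucket share a model that is slightly staler than a per-sample one would be.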

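4.4 Comparison

Method	Gradient computation	Target leakage	Complexity
Standard GBDT	All data	Yes	$O(N)$
Ordered boosting	Preceding data	No	$O(N^{2})$
CatBoost	Sampled preceding data	Slight	$O(N)$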

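5. Multi-Additive Scoring

5.1 Definition

CatBoost writes the tree ensemble as a sum over trees and leaves:

$$F(x) = \sum_{t=1}^{T} \sum_{j=1}^{J_{t}} b_{t j} \cdot \mathbb{1}[x \in R_{t j}]$$

where $R_{t j}$ is the region that corresponds to leaf $j$ of tree $t$, and $b_{t j}$ is the value of that leaf.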
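The double sum translates directly into code. In the sketch below, each tree is represented as a list of (region, value) pairs, where the region is a predicate on x; this representation is an assumption chosen for readability, not how CatBoost stores trees.

```python
def ensemble_predict(x, trees):
    """
    trees: list of trees; each tree is a list of (region, value) pairs,
    where region(x) returns True if x falls into that leaf.
    Implements F(x) = sum over t and j of b_tj * 1[x in R_tj].
    """
    total = 0.0
    for leaves in trees:
        for region, value in leaves:
            total += value * (1 if region(x) else 0)
    return total

# Two toy trees on one numerical feature; leaf regions within a tree do not overlap
tree_1 = [(lambda x: x < 5, 0.5), (lambda x: x >= 5, -0.5)]
tree_2 = [(lambda x: x < 2, 1.0), (lambda x: x >= 2, 0.25)]

print(ensemble_predict(3, [tree_1, tree_2]))  # 0.5 + 0.25 = 0.75
```

Since the leaf regions of one tree partition the input space, exactly one indicator per tree is non-zero and each tree contributes a single leaf value.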

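5.2 Handling categorical features

For data that contains categorical features, the prediction function is:

$$F(x) = \sum_{t=1}^{T} \sum_{j=1}^{J_{t}} \phi_{t}(c_{1}, \dots, c_{K}, R_{t j})$$

where $c_{k}$ is the encoded value of the $k$-th categorical feature.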

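6. Advanced handling of categorical features

6.1 Category combinations

CatBoost can create combinations of categorical features automatically:

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    cat_features=['color', 'brand', 'material'],
    one_hot_max_size=10,  # one-hot encode features with at most 10 distinct values
    # features with more values are encoded with target statistics and combined automatically
)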

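6.2 Complexity of category combinations

For $K$ categorical features, the number of possible combinations of two or more features is:

$$2^{K} - K - 1$$

that is, all $2^{K}$ subsets minus the empty set and the $K$ single-feature subsets. CatBoost selects the most valuable combinations with a greedy strategy.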
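The count can be checked by enumerating every combination of at least two features. This is a small verification sketch using the feature names from the example in 6.1; it only counts candidates and does not reproduce the greedy selection.

```python
from itertools import combinations

def feature_combinations(features):
    """All combinations of two or more features."""
    combos = []
    for size in range(2, len(features) + 1):
        combos.extend(combinations(features, size))
    return combos

features = ["color", "brand", "material"]
combos = feature_combinations(features)
print(len(combos), 2 ** len(features) - len(features) - 1)  # 4 4
print(combos)
```

With K = 10 the same formula already gives 1013 candidates, which is why an exhaustive search is avoided.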

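6.3 Handling numerical features

By default CatBoost quantizes numerical features into buckets (see `border_count`) rather than encoding them. A smoothed target statistic can nevertheless be defined over a numerical ordering in the same way as for categories:

$$\mathrm{NumTS}(x_{i}) = \frac{\sum_{j : x_{j} < x_{i}} y_{j} + \alpha \cdot \mathrm{prior}}{\sum_{j : x_{j} < x_{i}} 1 + \alpha}$$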

7. GPU acceleration

7.1 GPU training

CatBoost supports GPU training natively:

from catboost import CatBoostClassifier, Pool

# X_train, y_train, X_test, y_test and cat_indices are assumed to be defined already
train_pool = Pool(X_train, y_train, cat_features=cat_indices)
valid_pool = Pool(X_test, y_test, cat_features=cat_indices)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    task_type='GPU',  # train on the GPU
    devices='0'       # GPU device ID
)

model.fit(train_pool, eval_set=valid_pool, early_stopping_rounds=50)

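7.2 GPU-optimized components

Component	GPU optimization
Target statistic computation	Computed in parallel on the GPU
Tree construction	GPU sampling
Gradient computation	GPU matrix operations
Prediction	Vectorized operations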

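7.3 Memory efficiency

CatBoost packs data to reduce GPU memory usage:

Categorical features are stored as integer codes
Numerical features are stored as float16/float32
Sparse data is stored in a compressed format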

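8. Parameter reference

8.1 Core parameters

Parameter	Default	Description
`iterations`	1000	Number of boosting iterations
`learning_rate`	0.03	Learning rate (chosen automatically for some loss functions when not set)
`depth`	6	Tree depth
`l2_leaf_reg`	3.0	L2 regularization coefficient
`border_count`	254	Number of bins for numerical features (CPU default; 128 on GPU in most cases)
`bagging_temperature`	1	Bayesian bootstrap temperature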

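8.2 Categorical feature parameters

Parameter	Default	Description
`cat_features`	None	Indices or names of the categorical features
`one_hot_max_size`	2	Maximum number of distinct values for which one-hot encoding is used (the default is higher in some training modes)
`max_ctr_complexity`	4	Maximum number of features combined in one category combination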

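8.3 Preventing overfitting

Parameter	Default	Description
`random_strength`	1	Amount of randomness added when scoring splits
`bagging_temperature`	1	Bayesian bootstrap temperature
`random_seed`	None	Random seed
`early_stopping_rounds`	None	Number of rounds without improvement before training stops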

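8.4 Tuning suggestions

# To reduce overfitting
params_regularized = {
    'depth': 6,                    # keep trees shallow (try lower values if still overfitting)
    'l2_leaf_reg': 3.0,            # L2 regularization (raise it for a stronger effect)
    'learning_rate': 0.03,         # small learning rate
    'iterations': 2000,            # more iterations to go with the small learning rate
    'bagging_temperature': 0.5,    # lower sampling temperature
    'random_strength': 0.5         # amount of randomness in split scoring
}

# To improve accuracy
params_accuracy = {
    'depth': 8,                    # deeper trees
    'l2_leaf_reg': 1.0,            # weaker regularization
    'learning_rate': 0.1,          # larger learning rate
    'iterations': 500,             # fewer iterations to go with the larger learning rate
    'border_count': 128,           # number of bins for numerical features
    'max_ctr_complexity': 2        # combine up to 2 categorical features
}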

9. Code examples

9.1 Basic usage

from catboost import CatBoostClassifier, Pool
import pandas as pd
import numpy as np

# Prepare the data
X = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'cat_feature': np.random.choice(['A', 'B', 'C'], 1000)
})
y = np.random.randint(0, 2, 1000)

# Name the categorical features
cat_features = ['cat_feature']

# Create a Pool; it carries the categorical feature information
train_pool = Pool(X, y, cat_features=cat_features)

# Train (cat_features is not repeated here because the Pool already defines it)
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=100
)
model.fit(train_pool)

# Predict
predictions = model.predict(X)
proba = model.predict_proba(X)

9.2 Regression

from catboost import CatBoostRegressor

# train_pool must hold a numerical target here; X_test is assumed to be defined
model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    loss_function='RMSE'  # regression loss
)

model.fit(train_pool)  # categorical features come from the Pool
predictions = model.predict(X_test)

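9.3 Ranking

from catboost import CatBoostRanker, Pool

# X_train, y_train, group_column and X_test are assumed to be defined;
# rows that share a group ID must be adjacent in the data
train_pool = Pool(
    X_train, y_train, 
    group_id=group_column,  # query group ID for each row
    cat_features=cat_features
)

model = CatBoostRanker(
    loss_function='YetiRankPairwise',
    iterations=500
)

model.fit(train_pool)
predictions = model.predict(X_test)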

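9.4 Custom loss functions

CatBoost expects custom objectives and metrics as objects with specific methods, not plain functions. An objective returns the first and second derivatives of the function being maximized:

from catboost import CatBoostRegressor

class SquaredErrorObjective:
    def calc_ders_range(self, approxes, targets, weights):
        # Derivatives of -(target - approx)^2 / 2 with respect to approx
        result = []
        for i in range(len(targets)):
            der1 = targets[i] - approxes[i]
            der2 = -1.0
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result

class MeanSquaredErrorMetric:
    def is_max_optimal(self):
        return False  # lower is better

    def evaluate(self, approxes, target, weight):
        approx = approxes[0]  # one container per prediction dimension
        error_sum, weight_sum = 0.0, 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            error_sum += w * (approx[i] - target[i]) ** 2
            weight_sum += w
        return error_sum, weight_sum

    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

# Squared error is a regression loss, so a regressor is used here
model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    loss_function=SquaredErrorObjective(),
    eval_metric=MeanSquaredErrorMetric()
)

model.fit(train_pool, eval_set=valid_pool)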

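9.5 Feature importance

import matplotlib.pyplot as plt

# Get the feature importances
feature_importance = model.get_feature_importance()

# Plot them with matplotlib
plt.barh(model.feature_names_, feature_importance)
plt.xlabel('Importance')
plt.show()

# The same values as a table of feature names and importances
importance_table = model.get_feature_importance(prettified=True)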

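10. Practical tips

10.1 Encoding categorical features

# Categorical features need no manual encoding;
# CatBoost handles them automatically

# The encoding can still be steered
model = CatBoostClassifier(
    cat_features=['cat1', 'cat2'],
    one_hot_max_size=10,  # one-hot encode features with at most 10 distinct values
    # features with more values use Ordered Target Statistics
)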

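10.2 Handling missing values

# CatBoost handles missing values in numerical features automatically:
# NaN is treated as missing (by default as smaller than every other value).
# Categorical features get no special treatment, so fill missing
# categories with an explicit string such as 'NA' before training.

X = pd.DataFrame({
    'feature1': [1.0, 2.0, np.nan, 4.0],  # missing numerical value
    'cat_feature': ['A', 'B', None, 'C']  # missing category
})
X['cat_feature'] = X['cat_feature'].fillna('NA')
y = [0, 1, 0, 1]  # toy labels for illustration

model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(X, y, cat_features=['cat_feature'])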

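10.3 Handling imbalanced data

# The three options below are alternatives; use only one of them

# Option 1: class_weights
model = CatBoostClassifier(
    class_weights={0: 1, 1: 10}  # positive class weighted 10x
)

# Option 2: auto_class_weights
model = CatBoostClassifier(
    auto_class_weights='Balanced'  # balance the classes automatically
)

# Option 3: scale_pos_weight
model = CatBoostClassifier(
    scale_pos_weight=10
)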
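To get a feeling for what automatic balancing produces, the sketch below computes weights under the assumption that the 'Balanced' mode weights each class by the size of the largest class divided by the size of that class (unit sample weights). The helper name is hypothetical, and the CatBoost documentation should be checked for the exact formula.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by (size of the largest class) / (size of this class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # {0: 1.0, 1: 9.0}
```

A dictionary of this shape can be passed as `class_weights` when explicit control over the weights is preferred.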

10.4 Saving and loading models

# Save
model.save_model('catboost_model.cbm')

# Load
loaded_model = CatBoostClassifier()
loaded_model.load_model('catboost_model.cbm')

# Export as JSON
model.save_model('model.json', format='json')

# Export as CoreML (export may be limited for models with categorical features)
model.save_model('model.mlmodel', format='coreml')

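11. Comparison with XGBoost and LightGBM

11.2 Performance comparison

Data type	XGBoost	LightGBM	CatBoost
Tabular data	Good	Best	Good
Many categorical features	Fair	Fair	Best
Many numerical features	Best	Best	Good
Small datasets	Good	Fair	Best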

11.3 Training time comparison

Data size	XGBoost	LightGBM	CatBoost
100,000 samples	Medium	Fast	Medium
1,000,000 samples	Slow	Fastest	Slow
Many categorical features	Slow	Medium	Fast

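11.4 Recommendations

Choose CatBoost when:
- More than 50% of the features are categorical
- Category combinations should be handled automatically
- The dataset is small
- Strong generalization is a priority

Choose LightGBM when:
- The dataset is very large
- The fastest training speed is needed
- The features are mostly numerical

Choose XGBoost when:
- The most stable performance is needed
- The model will be deployed to production
- The widest range of tuning options is needed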
