A Cornerstone of Machine Learning Stability: The Bagging (Bootstrap Aggregating) Algorithm in Depth, with Principles, Hand Calculation, and Python/Java Code

Original

jack.yang

Published 2026-03-29 16:39:31

This article is part of the column: Large Model Series

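Keywords: machine learning, Bagging algorithm, Bootstrap aggregating, ensemble learning, random forest basics, bias-variance decomposition, Python Bagging, Java Bagging, OOB error, decision tree ensembles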

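One-sentence answer: Bagging builds multiple base models by sampling the training set at random with replacement, then averages (regression) or votes on (classification) their predictions. It leaves bias essentially unchanged but can substantially reduce variance, and it is the theoretical foundation of random forests.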

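If you have been searching for:

"What is the difference between Bagging and Boosting?"
"Why does Bagging reduce overfitting?"
"How do I implement Bagging by hand?"
"Which models is Bagging suited to?"

then this article was written for you: from the bootstrap to the variance decomposition, with no steps skipped.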

1. What Is Bagging, and What Is Its Core Idea?

Bagging (Bootstrap Aggregating), proposed by Leo Breiman in 1996, is a parallel ensemble learning method.

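🔑 Two core steps:

Bootstrap sampling:
- From the original dataset (size n), draw n samples at random with replacement to form a "bootstrap sample"
- On average about 63.2% of the distinct original samples are selected (1 − (1 − 1/n)^n, which tends to 1 − 1/e for large n); the rest are the "out-of-bag" (OOB) samples
Aggregating:
- Train the same kind of base learner (e.g., a decision tree) on each bootstrap sample
- Regression: take the mean of all model predictions
- Classification: take the majority vote across all models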

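💡 Key insight: by combining injected randomness with model averaging, Bagging makes the prediction less sensitive to the particular training set any single model happens to see.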
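The 63.2% figure can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `bootstrap_indices` and `unique_fraction`, the sample size, and the seeds are illustrative choices rather than anything from the original article.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def unique_fraction(n, trials, seed=0):
    """Average share of distinct original samples per bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += len(set(bootstrap_indices(n, rng))) / n
    return total / trials

n = 1000
simulated = unique_fraction(n, trials=200)
theoretical = 1 - (1 - 1 / n) ** n  # tends to 1 - 1/e (about 0.632) as n grows

# Samples that were never drawn form the out-of-bag (OOB) set for that model.
rng = random.Random(1)
in_bag = set(bootstrap_indices(10, rng))
oob = sorted(set(range(10)) - in_bag)

print(f"simulated: {simulated:.3f}, theoretical: {theoretical:.3f}")
print("OOB indices for one bootstrap sample of size 10:", oob)
```

Each base model gets its own OOB set, which is what later makes validation without a separate hold-out set possible.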

2. Why Does Bagging Improve Performance? The Bias-Variance Decomposition

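Under squared loss, the expected generalization error of any model can be decomposed as:

E[(y − ŷ)²] = Bias(ŷ)² + Var(ŷ) + σ²

where ŷ is the model's prediction and σ² is the irreducible noise.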

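📊 How Bagging affects each term:

Component	Single model	Bagging ensemble	Reason
Bias	Fixed	Essentially unchanged	All base models are of the same type, so the expected prediction is unchanged
Variance	High	✅ Markedly lower	Base models are only partly correlated → averaging reduces variance
Overfitting risk	High (e.g., deep trees)	✅ Much lower	Averaging over resampled data smooths out fits to noise

✅ Conclusion: Bagging is particularly well suited to high-variance, low-bias models (such as unpruned decision trees).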
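To see why averaging lowers variance, and why the gain is limited when base models are correlated, here is a small simulation sketch. It assumes B base predictions with a common variance σ² and a common pairwise correlation ρ, a simplifying assumption that the article does not state. In that case the variance of the average is ρσ² + (1 − ρ)σ²/B. The function names and parameter values are illustrative.

```python
import random
import statistics

def variance_of_average(B, rho, sigma=1.0, trials=5000, seed=0):
    """Empirical variance of the mean of B predictions with pairwise correlation rho."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0, 1)  # component common to all B models
        preds = [
            sigma * (rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
            for _ in range(B)
        ]
        averages.append(sum(preds) / B)
    return statistics.pvariance(averages)

def theoretical_variance(B, rho, sigma=1.0):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B."""
    return rho * sigma ** 2 + (1 - rho) * sigma ** 2 / B

results = {}
for B in (1, 10, 50):
    results[B] = (variance_of_average(B, rho=0.3), theoretical_variance(B, rho=0.3))
    print(f"B={B:>2}: simulated={results[B][0]:.3f}, theoretical={results[B][1]:.3f}")
```

As B grows, the variance approaches the floor ρσ² rather than zero. This is one way to read the later remark that random forests add feature randomness on top of Bagging: it lowers the correlation between trees.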

3. Worked Example by Hand: Predicting House Prices with Bagging (Regression)

📊 Original dataset (5 samples)

House ID	Area (x)	Price (y, 10,000 CNY)
A	50	100
B	70	140
C	90	180
D	110	220
E	130	260

Goal: use Bagging (3 trees) to predict the price of a house with area = 100.

🔁 Step 1: Generate 3 bootstrap samples (5 draws with replacement each)

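Sample 1: [A, B, B, C, E] → y=[100,140,140,180,260]
Sample 2: [C, C, D, D, E] → y=[180,180,220,220,260]
Sample 3: [A, A, D, E, E] → y=[100,100,220,260,260]

💡 Note: duplicates are expected. B appears twice in Sample 1, C and D twice each in Sample 2, and A and E twice each in Sample 3.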

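🌲 Step 2: Train one regression tree per sample (simplified here to a single hand-picked split; leaf value = mean)

Tree 1 (from Sample 1):
- If x ≤ 80 → y = (100+140+140)/3 ≈ 126.7
- If x > 80 → y = (180+260)/2 = 220
- Prediction for x=100 → 220
Tree 2 (from Sample 2):
- If x ≤ 100 → y = (180+180)/2 = 180
- If x > 100 → y = (220+220+260)/3 ≈ 233.3
- Prediction for x=100 → 180 (x=100 satisfies x ≤ 100, so it falls in the left leaf)
Tree 3 (from Sample 3):
- If x ≤ 90 → y = (100+100)/2 = 100
- If x > 90 → y = (220+260+260)/3 ≈ 246.7
- Prediction for x=100 → 246.7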

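📊 Step 3: Bagging ensemble prediction

Bagging prediction at x=100 = (220 + 180 + 246.7) / 3 ≈ 215.6

💡 The true relationship is y = 2x, so the true value is 200. A single tree can be far off (Tree 3 predicts 246.7), but Bagging smooths out the extreme predictions.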
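The hand calculation can be reproduced in a few lines of plain Python. This is a sketch that hard-codes the three bootstrap samples and the split points used in the walkthrough; `fit_stump` is an illustrative helper and does not search for an optimal split.

```python
# Data from the table: house ID -> (area, price)
data = {"A": (50, 100), "B": (70, 140), "C": (90, 180), "D": (110, 220), "E": (130, 260)}

# The three bootstrap samples and the split point used for each tree
samples = [
    (["A", "B", "B", "C", "E"], 80),
    (["C", "C", "D", "D", "E"], 100),
    (["A", "A", "D", "E", "E"], 90),
]

def fit_stump(ids, threshold):
    """Single-split tree: predict the mean price on each side of the threshold."""
    left = [data[i][1] for i in ids if data[i][0] <= threshold]
    right = [data[i][1] for i in ids if data[i][0] > threshold]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return lambda x: left_mean if x <= threshold else right_mean

stumps = [fit_stump(ids, t) for ids, t in samples]
tree_preds = [stump(100) for stump in stumps]
bagged = sum(tree_preds) / len(tree_preds)

print([round(p, 1) for p in tree_preds])  # [220.0, 180.0, 246.7]
print(round(bagged, 1))                   # 215.6
```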

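4. Bagging vs Boosting: The Fundamental Differences

Property	Bagging	Boosting (e.g., AdaBoost)
Training	⚡ Parallel (models trained independently)	🔄 Sequential (each round depends on the previous one)
Sample weights	All samples weighted equally (though sampling is random)	❗ Adjusted dynamically (focus on misclassified samples)
Goal	✅ Reduce variance	✅ Reduce bias
Base model	High-variance models (e.g., deep trees)	Weak learners (e.g., decision stumps)
Overfitting risk	Low	Higher (especially on noisy data)
Representative algorithms	Random forest	AdaBoost, GBDT, XGBoost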

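🎯 An easy way to remember it:

Bagging = "democratic vote" (everyone judges independently, then the results are averaged)
Boosting = "master and apprentice" (each apprentice specializes in correcting the master's mistakes)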
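The "sample weights" row of the comparison can be made concrete. The sketch below contrasts one round of Bagging's uniform resampling with one round of the classic discrete AdaBoost weight update. The labels and predictions are hypothetical values chosen for illustration, with classes encoded as −1/+1.

```python
import math
import random

y_true = [1, 1, -1, -1, 1]   # hypothetical labels in {-1, +1}
y_pred = [1, -1, -1, -1, 1]  # hypothetical weak-learner output; index 1 is wrong
n = len(y_true)

# Bagging: each round draws indices uniformly, ignoring earlier models' mistakes.
rng = random.Random(0)
bagging_round = [rng.randrange(n) for _ in range(n)]

# AdaBoost: one round of the discrete weight update.
weights = [1 / n] * n
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)  # weighted error
alpha = 0.5 * math.log((1 - err) / err)                             # weight of this learner
weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]                              # renormalize to sum to 1

print("Bagging indices this round:", bagging_round)
print("AdaBoost weights after one round:", [round(w, 3) for w in weights])
# AdaBoost weights after one round: [0.125, 0.5, 0.125, 0.125, 0.125]
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such feedback loop, which is why its base models can be trained in parallel.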

5. Python Implementation: Hand-Written Bagging with Decision Tree Base Learners

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

class Bagging:
    def __init__(self, base_estimator=None, n_estimators=10, task='regression'):
        # Default base learner: a depth-limited decision tree matching the task
        if base_estimator is None:
            base_estimator = (
                DecisionTreeRegressor(max_depth=5) if task == 'regression'
                else DecisionTreeClassifier(max_depth=5)
            )
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.task = task
        self.estimators = []

    def fit(self, X, y):
        n_samples = X.shape[0]
        self.estimators = []

        for _ in range(self.n_estimators):
            # Bootstrap sampling (with replacement)
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_boot, y_boot = X[indices], y[indices]

            # Train a fresh, unfitted copy of the base model on this sample
            estimator = clone(self.base_estimator)
            estimator.fit(X_boot, y_boot)
            self.estimators.append(estimator)
        return self

    def predict(self, X):
        # Stack per-model predictions: shape (n_estimators, n_samples)
        predictions = np.array([est.predict(X) for est in self.estimators])
        if self.task == 'regression':
            return np.mean(predictions, axis=0)
        # Classification: majority vote (assumes non-negative integer class labels)
        return np.array([np.bincount(preds.astype(int)).argmax() for preds in predictions.T])
# === Regression test ===
X = np.array([[50], [70], [90], [110], [130]])
y = np.array([100, 140, 180, 220, 260])
bag_reg = Bagging(task='regression', n_estimators=3)
bag_reg.fit(X, y)
# predict() needs a 2-D query array; the result varies from run to run with the random bootstrap samples
print("Bagging prediction (100):", bag_reg.predict(np.array([[100]])))

# === Classification test ===
X_cls = np.array([[1], [2], [3], [4], [5]])
y_cls = np.array([0, 0, 1, 1, 1])
bag_cls = Bagging(task='classification', n_estimators=3)
bag_cls.fit(X_cls, y_cls)
print("Bagging classification (3.5):", bag_cls.predict(np.array([[3.5]])))  # usually 1, but depends on the bootstrap draws

6. Java Implementation: Core Bagging Logic (Using Weka Decision Trees)

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Instance;
import java.util.Random;

public class Bagging {
    private weka.classifiers.Classifier[] models;
    private int numModels;
    private Random rand = new Random();

    public void buildClassifier(Instances data, int numModels) throws Exception {
        this.numModels = numModels;
        models = new weka.classifiers.Classifier[numModels];
        
        for (int i = 0; i < numModels; i++) {
            // Bootstrap sampling: draw n instances with replacement
            Instances bootstrap = new Instances(data, data.numInstances());
            for (int j = 0; j < data.numInstances(); j++) {
                int idx = rand.nextInt(data.numInstances());
                bootstrap.add(data.instance(idx));
            }
            
            // Train the base model (J48 as an example)
            J48 tree = new J48();
            tree.buildClassifier(bootstrap);
            models[i] = tree;
        }
    }

    public double[] distributionForInstance(Instance inst) throws Exception {
        double[] avgDist = new double[inst.numClasses()];
        for (weka.classifiers.Classifier model : models) {
            double[] dist = model.distributionForInstance(inst);
            for (int i = 0; i < avgDist.length; i++) {
                avgDist[i] += dist[i];
            }
        }
        // Average the class-probability distributions (soft voting)
        for (int i = 0; i < avgDist.length; i++) {
            avgDist[i] /= numModels;
        }
        return avgDist;
    }

    // Classification: pick the class with the highest averaged probability
    public double classifyInstance(Instance inst) throws Exception {
        double[] dist = distributionForInstance(inst);
        int bestClass = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[bestClass]) bestClass = i;
        }
        return bestClass;
    }
}

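💡 For regression, replace J48 with M5P or a custom regression tree, and average the numeric predictions instead of the class distributions.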

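7. Three Key Advantages of Bagging and Where to Use It

✅ Core advantages

Variance reduction: most effective for unstable models (e.g., deep decision trees)
Built-in OOB evaluation: no separate validation set is required
Easy to parallelize: base models are trained independently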
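OOB evaluation can be sketched by hand: each sample is scored only by the trees whose bootstrap sample did not contain it. The example below assumes NumPy and scikit-learn are installed. The synthetic data (a noisy version of y = 2x), the number of trees, and the seed are illustrative assumptions rather than values from the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data for illustration: noisy y = 2x
X = rng.uniform(50, 130, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 10, size=200)

n = len(X)
pred_sum = np.zeros(n)  # running sum of OOB predictions per sample
pred_cnt = np.zeros(n)  # number of trees for which the sample was OOB

for _ in range(50):
    idx = rng.integers(0, n, size=n)  # bootstrap indices (with replacement)
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                  # True where the sample was never drawn
    tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
    pred_sum[oob] += tree.predict(X[oob])
    pred_cnt[oob] += 1

seen = pred_cnt > 0                   # samples that were OOB at least once
oob_pred = pred_sum[seen] / pred_cnt[seen]
oob_mse = np.mean((y[seen] - oob_pred) ** 2)
print(f"OOB MSE: {oob_mse:.1f} over {int(seen.sum())} samples")
```

Because each sample is judged only by models that never trained on it, the OOB error plays the role of a validation error without setting data aside.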

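🎯 Best-fit scenarios

High-variance base models: unpruned decision trees, neural networks (on small data)
Small to medium datasets: bootstrap resampling gives each model a different "view" of limited data
When model stability matters: financial risk control, medical diagnosis
As the basis of random forests: RF = Bagging + random feature selection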

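8. Limitations and Caveats

Issue	Description
❌ Does not reduce bias	If the base model is itself biased (e.g., a linear model fit to nonlinear data), Bagging cannot fix it
❌ High computational cost	Requires training N models (though they can be trained in parallel)
❌ Little benefit for stable models	e.g., linear regression, KNN (their predictions change little across bootstrap samples)
⚠️ OOB estimate can be biased	Especially on small datasets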
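The first limitation is easy to check numerically. The sketch below bags a straight-line fit on data generated from y = x² and compares its test error with that of a single line. The data, the number of bootstrap rounds, and the helper `fit_line` are illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(0, 0.5, size=200)  # nonlinear target with noise
x_test = np.linspace(-3, 3, 601)
y_test = x_test ** 2

def fit_line(xs, ys):
    """Least-squares straight line; returns [slope, intercept]."""
    return np.polyfit(xs, ys, deg=1)

# Single linear model
single = np.polyval(fit_line(x, y), x_test)

# Bagged linear models: average the predictions of 100 bootstrap fits
preds = []
for _ in range(100):
    idx = rng.integers(0, len(x), size=len(x))
    preds.append(np.polyval(fit_line(x[idx], y[idx]), x_test))
bagged = np.mean(preds, axis=0)

mse_single = np.mean((y_test - single) ** 2)
mse_bagged = np.mean((y_test - bagged) ** 2)
print(f"single line MSE: {mse_single:.2f}, bagged lines MSE: {mse_bagged:.2f}")
```

The two errors come out nearly identical: the average of many straight lines is still a straight line, so the bias from using a linear model on a curved target remains.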