Stacking Model Ensembling in Machine Learning

I recently studied model ensembling and came across Stacking as a way to combine models, so I wrote up the following summary.

1. What is Stacking?

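* Put simply, Stacking takes several simple models, usually runs each of them through K-fold cross-validation to produce predictions, merges the predictions from all models into a new set of features, and trains a new model on those features.
* A Stacking model is essentially a layered structure. For simplicity, only two-level Stacking is analyzed here. Suppose we have 3 base models: M1, M2, and M3.
* Base model M1 is trained on the training set train and then used to predict the label column of both train and test, giving P1 and T1 respectively.
* The model ensembling process is illustrated in the figure below.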
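
The two-level structure described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the author's code: the synthetic dataset, the three estimators standing in for M1, M2, and M3, and the linear meta-model are all chosen here purely for illustration. `cross_val_predict` returns, for every training row, a prediction made by a model that did not see that row during fitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed here only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical stand-ins for the base models M1, M2, M3.
base_models = [
    Ridge(alpha=1.0),
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Level 1: one out-of-fold prediction column per base model (P1, P2, P3).
P = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=kfold) for m in base_models])

# Test-set columns (T1, T2, T3): in this sketch each base model is refit
# on the whole training set before predicting test.
T = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models])

# Level 2: the meta-model is trained on the stacked predictions.
meta_model = LinearRegression().fit(P, y_train)
stacked_pred = meta_model.predict(T)

print(P.shape, T.shape, stacked_pred.shape)
```

Each column of `P` is a new feature for the second level, so the meta-model sees one column per base model rather than the original features.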

2. What are the benefits of Stacking?

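* In data competitions, people usually predict with a single model, or compare several models and pick the most suitable one; when models are combined at all, it is mostly as a weighted average. A single model is typically validated with K-fold cross-validation to reduce the risk of overfitting and improve accuracy. Stacking goes one step further: instead of fixed weights, a second-level model learns how to combine the base models' predictions.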
* Below is the Stacking code (core part) used in the Kaggle house price prediction competition:
GitHub <https://github.com/1mrliu/Kaggle_Competition/tree/master/HousePices>
```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold


class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Run K-fold cross-validation and keep each fold's held-out
        # predictions as new features
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Train the meta-model on the out-of-fold predictions and the
        # training labels
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Build the new features from the base models, then predict with the
    # meta-model and return the result
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)


# ENet, GBoost, KRR, and lasso are estimators defined elsewhere
# (not shown in this excerpt).
stacked_averaged_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR),
                                                 # meta_model=model_lgb)
                                                 meta_model=lasso)
```
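
For comparison, scikit-learn ships `StackingRegressor`, which follows the same two-level idea. The sketch below is not the author's code: the synthetic dataset and every hyperparameter are assumptions made for illustration, and the estimators are only loosely modeled on the names `ENet`, `GBoost`, `KRR`, and `lasso` used above, whose actual definitions are not shown. One difference worth noting is that `StackingRegressor` refits each base model on the full training set for prediction, whereas the class above averages the predictions of the K fold-level clones.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import train_test_split

# Synthetic data, assumed here only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Illustrative base models and meta-model; hyperparameters are placeholders.
stack = StackingRegressor(
    estimators=[
        ("enet", ElasticNet(alpha=0.01)),
        ("gboost", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("krr", KernelRidge(alpha=1.0)),
    ],
    final_estimator=Lasso(alpha=0.01),
    cv=5,  # the meta-model is trained on 5-fold out-of-fold predictions
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
print(predictions.shape)
```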
* The main process can be summarized as follows:

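* First, choose several models and run K-fold cross-validation on the existing dataset.
* During K-fold cross-validation on the training set, save the held-out predictions from each fold and merge them at the end.
* There are two ways to obtain T1 for the test set. With 2-fold cross-validation, M1 is effectively trained twice, so one way is to predict the whole test set each time M1 is trained; after the 2 folds the test set has been predicted twice, and averaging those two columns gives T1. The other common option is to refit M1 on the entire training set and predict the test set once.
* The code has two nested loops: the outer loop iterates over the base models and the inner loop over the K cross-validation folds. Each base model is trained K times, and its held-out predictions are concatenated to obtain P1.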

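* The figure shows how a single base model produces P1 and T1 with 5-fold cross-validation: the loop runs 5 times, the held-out predictions are concatenated to form P1, and the test set is predicted 5 times and averaged to form T1. This yields only one column (one feature) of the second level's input, not the whole second-level training set. With that in mind the author's code becomes clear: it is exactly the two nested loops mentioned above.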
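
The procedure for a single base model can be sketched with NumPy alone. This is an illustrative sketch, not the author's code: the helper name `oof_column`, the least-squares base model, and the random data are all assumptions introduced here. It returns P1 (one out-of-fold prediction per training row) and T1 (the mean of the five test-set predictions).

```python
import numpy as np


def fit_least_squares(X, y):
    """Hypothetical base model: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_least_squares(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def oof_column(X_train, y_train, X_test, n_folds=5, seed=0):
    """Return (P1, T1) for one base model using K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_train)), n_folds)
    P1 = np.zeros(len(X_train))
    test_preds = np.zeros((len(X_test), n_folds))
    for k, holdout in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coef = fit_least_squares(X_train[train_idx], y_train[train_idx])
        # Each fold fills in its own held-out slice of P1.
        P1[holdout] = predict_least_squares(coef, X_train[holdout])
        # The whole test set is predicted once per fold.
        test_preds[:, k] = predict_least_squares(coef, X_test)
    T1 = test_preds.mean(axis=1)  # average the K test-set predictions
    return P1, T1


# Random data, assumed here only for illustration.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
X_test = rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=100)

P1, T1 = oof_column(X_train, y_train, X_test)
print(P1.shape, T1.shape)  # (100,) (20,)
```

Repeating this for every base model and stacking the resulting columns side by side gives the second level's training and test matrices.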
