Ensemble Methods in Python

An ensemble is a group of elements viewed as a whole rather than individually. An ensemble method creates multiple models and combines them to solve a single problem. Ensemble methods help to improve the robustness and generalizability of the resulting model. In this article, we discuss some of these methods along with their implementation in Python. The examples use a dataset from the UCI repository.

Basic ensemble methods

1. Averaging method: It is mainly used for regression problems. The method consists of building multiple models independently and returning the average of the predictions of all the models. The combined output is usually better than an individual output because averaging reduces variance.

In the example below, three regression models (linear regression, xgboost, and random forest) are trained and their predictions are averaged. The final prediction output is pred_final.

Python3




# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
 
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
 
# getting target data from the dataframe
target = df["target"]
 
# getting train data from the dataframe
train = df.drop("target", axis=1)
 
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
 
# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
 
# training all the models on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)
 
# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)
 
# final prediction after averaging on the prediction of all 3 models
pred_final = (pred_1+pred_2+pred_3)/3.0
 
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))


Output:

4560
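
The variance argument behind averaging can be checked with a small simulation. The following is only a sketch under a strong assumption: each "model" is simulated as the true value plus independent Gaussian noise, and names such as TRUE_VALUE and N_TRIALS are made up for the illustration. Real models trained on the same data make correlated errors, so the gain is usually smaller than shown here.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 10.0   # hypothetical target value
N_TRIALS = 10_000   # number of simulated predictions per model
N_MODELS = 3


def mean_squared_error(y_true, y_pred):
    """Plain-Python mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# each simulated model predicts the true value plus independent noise
preds = [[random.gauss(TRUE_VALUE, 1.0) for _ in range(N_TRIALS)]
         for _ in range(N_MODELS)]

# averaging the three predictions point by point
pred_final = [sum(p) / N_MODELS for p in zip(*preds)]

y_true = [TRUE_VALUE] * N_TRIALS
single_mse = [mean_squared_error(y_true, p) for p in preds]
ensemble_mse = mean_squared_error(y_true, pred_final)

print("individual MSE:", [round(m, 3) for m in single_mse])
print("averaged MSE:  ", round(ensemble_mse, 3))
```

With independent errors of equal variance, the averaged prediction is expected to have roughly one third of the error variance of a single model.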

2. Max voting: It is mainly used for classification problems. The method consists of building multiple models independently and getting their individual output called ‘vote’. The class with maximum votes is returned as output. 

In the example below, three classification models (logistic regression, xgboost, and random forest) are combined using sklearn's VotingClassifier. The combined model is trained, and the class with the maximum votes is returned as output. The final prediction output is pred_final. Because this is a classification task rather than a regression task, the reported loss is not comparable with the mean squared errors reported for the other ensemble methods.

Python




# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
 
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
 
# importing voting classifier
from sklearn.ensemble import VotingClassifier
 
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
 
# getting target data from the dataframe
target = df["Weekday"]
 
# getting train data from the dataframe
train = df.drop("Weekday", axis=1)
 
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
 
# initializing all the model objects with default parameters
model_1 = LogisticRegression()
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()
 
# Making the final model using voting classifier
final_model = VotingClassifier(
    estimators=[('lr', model_1), ('xgb', model_2), ('rf', model_3)], voting='hard')
 
# training all the model on the train dataset
final_model.fit(X_train, y_train)
 
# predicting the output on the test dataset
pred_final = final_model.predict(X_test)
 
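# printing log loss between actual and predicted value
# (log_loss is defined on predicted probabilities, whereas hard voting returns
# class labels; accuracy_score is the more usual metric for hard votes)
print(log_loss(y_test, pred_final))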


Output:

231
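
What hard voting does with the individual predictions can be sketched in a few lines of plain Python. The three prediction lists below are hypothetical, and the tie-breaking rule (the label seen first wins) is a choice made for this sketch, not necessarily the rule VotingClassifier applies.

```python
from collections import Counter


def max_vote(votes):
    """Return the most common label among the votes for one sample."""
    return Counter(votes).most_common(1)[0][0]


# hypothetical predictions of three classifiers for five samples
pred_1 = ["Mon", "Tue", "Tue", "Fri", "Sun"]
pred_2 = ["Mon", "Wed", "Tue", "Sat", "Sun"]
pred_3 = ["Tue", "Wed", "Tue", "Fri", "Mon"]

# zip groups the three votes that belong to the same sample
pred_final = [max_vote(votes) for votes in zip(pred_1, pred_2, pred_3)]

print(pred_final)  # ['Mon', 'Wed', 'Tue', 'Fri', 'Sun']
```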

Let’s have a look at a bit more advanced ensemble methods

Advanced ensemble methods

Ensemble methods are extensively used in classical machine learning. Examples of algorithms using bagging are random forest and the bagging meta-estimator; examples of algorithms using boosting are GBM, XGBoost, and AdaBoost.

Ensemble methods are worth considering whenever you develop a machine learning model. They are used extensively in competitions and research papers.

1. Stacking: It is an ensemble method that combines multiple models (classification or regression) via a meta-model (meta-classifier or meta-regressor). The base models are trained on the complete dataset, and the meta-model is then trained on the outputs of the base models, which serve as its features. The base models in stacking are typically of different types. The meta-model learns how to combine the base models' predictions to achieve the best accuracy.

Algorithm:

  1. Split the train dataset into n parts.
  2. A base model (say linear regression) is fitted on n-1 parts and predictions are made for the nth part. This is done for each of the n parts of the train set.
  3. The base model is then fitted on the whole train dataset.
  4. This model is used to predict the test dataset.
  5. Steps 2 to 4 are repeated for another base model, which results in another set of predictions for the train and test datasets.
  6. The predictions on the train dataset are used as features to build the new model.
  7. This final model is used to make the predictions on the test dataset.

Stacking differs from the basic ensembling methods in that it has first-level and second-level models. The stacking features are first produced by training all the first-level models on the dataset. A second-level model is then trained on the train stacking features, and this model predicts the final output from the test stacking features.

Python3




# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
 
# importing stacking lib
from vecstack import stacking
 
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
 
# getting target data from the dataframe
target = df["target"]
 
# getting train data from the dataframe
train = df.drop("target", axis=1)
 
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
 
 
# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
 
# putting all base model objects in one list
all_models = [model_1, model_2, model_3]
 
# computing the stack features
s_train, s_test = stacking(all_models, X_train, X_test,
                           y_train, regression=True, n_folds=4)
 
# initializing the second-level model (a fresh model, separate from the base models)
final_model = LinearRegression()
 
# fitting the second level model with stack features
final_model = final_model.fit(s_train, y_train)
 
# predicting the final output from the stack features of the test set
pred_final = final_model.predict(s_test)
 
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))


Output:

4510 
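
Steps 1 and 2 of the stacking algorithm, the out-of-fold predictions that become the stacking features, can also be written out by hand. This is a minimal sketch and not how vecstack is implemented: the two base models (fit_mean and fit_line), the interleaved fold assignment, and the noise-free toy data are all assumptions made for the illustration.

```python
def fit_mean(xs, ys):
    """Base model 1: always predicts the mean of its training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base model 2: least-squares line through one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def out_of_fold_features(fitters, xs, ys, n_folds):
    """Return one stacking feature per base model for every train row."""
    n = len(xs)
    s_train = [[None] * len(fitters) for _ in range(n)]
    for j, fit in enumerate(fitters):
        for fold in range(n_folds):
            # rows whose index falls in this fold are held out
            held_out = [i for i in range(n) if i % n_folds == fold]
            kept = [i for i in range(n) if i % n_folds != fold]
            # fit on the other n-1 parts, predict the held-out part
            predict = fit([xs[i] for i in kept], [ys[i] for i in kept])
            for i in held_out:
                s_train[i][j] = predict(xs[i])
    return s_train


# toy data on an exact line, so the line model's features equal the targets
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]

s_train = out_of_fold_features([fit_mean, fit_line], xs, ys, n_folds=4)
print(len(s_train), "rows with", len(s_train[0]), "stacking features each")
```

Every row's feature comes from a model that never saw that row during fitting, which is what keeps the second-level model from simply learning the base models' training-set fit.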

2. Blending: It is similar to the stacking method explained above, but rather than using the whole dataset for training the base-models, a validation dataset is kept separate to make predictions. 

Algorithm: 

  1. Split the training dataset into train, test and validation datasets.
  2. Fit all the base models using the train dataset.
  3. Make predictions on the validation and test datasets.
  4. The predictions on the validation dataset are used as features (meta-features) to build a second-level model.
  5. This model is used to make the final predictions from the meta-features of the test dataset.

Python3




# importing utility modules
import pandas as pd
from sklearn.metrics import mean_squared_error
 
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
 
# importing train test split
from sklearn.model_selection import train_test_split
 
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
 
# getting target data from the dataframe
target = df["target"]
 
# getting train data from the dataframe
train = df.drop("target", axis=1)
 
# ratios used for the train, validation and test split
train_ratio = 0.70
validation_ratio = 0.20
test_ratio = 0.10
 
# performing train test split
x_train, x_test, y_train, y_test = train_test_split(
    train, target, test_size=1 - train_ratio)
 
# performing test validation split
x_val, x_test, y_val, y_test = train_test_split(
    x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
 
# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
 
# training all the model on the train dataset
 
# training first model
model_1.fit(x_train, y_train)
val_pred_1 = model_1.predict(x_val)
test_pred_1 = model_1.predict(x_test)
 
# converting to dataframe
val_pred_1 = pd.DataFrame(val_pred_1)
test_pred_1 = pd.DataFrame(test_pred_1)
 
# training second model
model_2.fit(x_train, y_train)
val_pred_2 = model_2.predict(x_val)
test_pred_2 = model_2.predict(x_test)
 
# converting to dataframe
val_pred_2 = pd.DataFrame(val_pred_2)
test_pred_2 = pd.DataFrame(test_pred_2)
 
# training third model
model_3.fit(x_train, y_train)
val_pred_3 = model_3.predict(x_val)
test_pred_3 = model_3.predict(x_test)
 
# converting to dataframe
val_pred_3 = pd.DataFrame(val_pred_3)
test_pred_3 = pd.DataFrame(test_pred_3)
 
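# concatenating validation dataset along with all the predicted validation data (meta features);
# the index is reset so that rows line up with the predictions, and the
# prediction columns get string names
df_val = pd.concat([x_val.reset_index(drop=True), val_pred_1, val_pred_2, val_pred_3], axis=1)
df_test = pd.concat([x_test.reset_index(drop=True), test_pred_1, test_pred_2, test_pred_3], axis=1)
df_val.columns = list(x_val.columns) + ["pred_1", "pred_2", "pred_3"]
df_test.columns = list(x_test.columns) + ["pred_1", "pred_2", "pred_3"]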
 
# making the final model using the meta features
final_model = LinearRegression()
final_model.fit(df_val, y_val)
 
# getting the final output
pred_final = final_model.predict(df_test)
 
#printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))


Output:

4790 
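
The five blending steps can be traced end to end on toy data without any library. The sketch below rests on several assumptions that are not part of the article: the data is synthetic, the two base models (fit_mean and fit_line) are deliberately simple, and the second-level model is reduced to a single blend weight w fitted by least squares on the validation meta-features.

```python
import random


def fit_mean(xs, ys):
    """Base model 1: always predicts the mean of its training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base model 2: least-squares line through one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def take(values, indices):
    return [values[i] for i in indices]


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]

# step 1: shuffle once, then cut 70% / 20% / 10% into train / validation / test
order = list(range(len(xs)))
rng.shuffle(order)
train_idx, val_idx, test_idx = order[:70], order[70:90], order[90:]

# step 2: the base models see only the train part
base_models = [fit(take(xs, train_idx), take(ys, train_idx))
               for fit in (fit_mean, fit_line)]

# step 3: base-model predictions on validation and test are the meta-features
val_meta = [[model(x) for model in base_models] for x in take(xs, val_idx)]
test_meta = [[model(x) for model in base_models] for x in take(xs, test_idx)]

# step 4: second-level model fitted on the validation meta-features only,
# simplified to one weight: pred = w * p_mean + (1 - w) * p_line
y_val = take(ys, val_idx)
w = (sum((y - p_line) * (p_mean - p_line)
         for (p_mean, p_line), y in zip(val_meta, y_val))
     / sum((p_mean - p_line) ** 2 for p_mean, p_line in val_meta))

# step 5: the second-level model predicts from the test meta-features
pred_final = [w * p_mean + (1 - w) * p_line for p_mean, p_line in test_meta]
y_test = take(ys, test_idx)
mse = sum((y - p) ** 2 for y, p in zip(y_test, pred_final)) / len(y_test)

print("blend weight on the mean model:", round(w, 3))
print("test MSE:", round(mse, 3))
```

Since the data is generated from a line, the fitted weight on the mean model should come out close to zero, meaning the second-level model has learned to rely on the line model.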

3. Bagging: It is also known as bootstrap aggregating. Base models are trained on bags so that each gets a fair sample of the whole dataset. A bag is a sample drawn from the dataset with replacement, which allows the bag to be the same size as the whole dataset. The final output is formed by combining the outputs of all base models.

Algorithm:

  1. Create multiple datasets from the train dataset by selecting observations with replacement.
  2. Run a base model on each of the created datasets independently.
  3. Combine the predictions of all the base models to reach the final output.

Bagging normally uses only one base model (XGBoost Regressor used in the code below).

Python




# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
# importing machine learning models for prediction
import xgboost as xgb
 
# importing bagging module
from sklearn.ensemble import BaggingRegressor
 
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
 
# getting target data from the dataframe
target = df["target"]
 
# getting train data from the dataframe
train = df.drop("target", axis=1)
 
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
 
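# initializing the bagging model using XGBoost as base model with default parameters;
# the base model is passed positionally because its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2
model = BaggingRegressor(xgb.XGBRegressor())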
 
# training model
model.fit(X_train, y_train)
 
# predicting the output on the test dataset
pred_final = model.predict(X_test)
 
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))


Output:

4666 
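
The three bagging steps can be sketched by hand as follows. This is an illustration under stated assumptions rather than what BaggingRegressor does internally: the data is synthetic, the base model is a simple least-squares line (fit_line), and the number of bags (25) is an arbitrary choice.

```python
import random


def fit_line(rows):
    """Base model: least-squares line fitted to (x, y) rows."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)


def bootstrap_sample(rows, rng):
    """One bag: len(rows) draws from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [(i / 5, 2.0 * (i / 5) + 1.0 + rng.gauss(0, 0.5)) for i in range(50)]

# step 1: create several bags from the train dataset
bags = [bootstrap_sample(rows, rng) for _ in range(25)]

# step 2: fit the same kind of base model on every bag independently
models = [fit_line(bag) for bag in bags]


# step 3: combine the base models, here by averaging their predictions
def bagged_predict(x):
    return sum(model(x) for model in models) / len(models)


print("bag size:", len(bags[0]), "distinct rows in it:", len(set(bags[0])))
print("bagged prediction at x=5:", round(bagged_predict(5.0), 2))
```

Each bag has as many rows as the dataset but fewer distinct rows, because sampling with replacement repeats some rows and leaves others out.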

4. Boosting: Boosting is a sequential method that aims to prevent a wrong base model from affecting the final output. Instead of combining independent base models, the method builds each new model so that it depends on the previous one: a new model tries to correct the errors made by its predecessor. Each of these models is called a weak learner. The final model (the strong learner) is formed by taking the weighted mean of all the weak learners.

Algorithm:

  1. Take a subset of the train dataset.
  2. Initialize all data points with the same weight.
  3. Train a base model on that dataset.
  4. Use this model to make predictions on the whole dataset.
  5. Calculate errors using the predicted values and actual values.
  6. Assign a higher weight to incorrectly predicted data points.
  7. Build another model and make predictions with it in such a way that the errors made by the previous model are mitigated or corrected.
  8. Similarly, create multiple models, each successive model correcting the errors of the previous model.
  9. The final model (strong learner) is the weighted mean of all the previous models (weak learners).

Python3




# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
# importing machine learning models for prediction
from sklearn.ensemble import GradientBoostingRegressor
 
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
 
# getting target data from the dataframe
target = df["target"]
 
# getting train data from the dataframe
train = df.drop("target", axis=1)
 
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
 
# initializing the boosting module with default parameters
model = GradientBoostingRegressor()
 
# training the model on the train dataset
model.fit(X_train, y_train)
 
# predicting the output on the test dataset
pred_final = model.predict(X_test)
 
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))


Output:

4789 
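
The sequential idea can be made concrete with a hand-written gradient-boosting loop. Note the swap: the algorithm listed above reweights data points, whereas gradient boosting (the family GradientBoostingRegressor belongs to) fits each new weak learner to the residuals of the current ensemble. The sketch below follows the residual variant for squared error. The decision stump (fit_stump), the toy data, the learning rate, and the number of rounds are assumptions made for the illustration.

```python
def fit_stump(xs, targets):
    """Weak learner: the single-threshold split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# toy data: a curve that a single stump cannot fit
xs = [float(i) for i in range(20)]
ys = [x * x / 10.0 for x in xs]

learning_rate = 0.5  # how much of each stump's correction is applied
n_rounds = 30        # number of weak learners

# start from a constant prediction, the mean of the targets
preds = [sum(ys) / len(ys)] * len(xs)
history = [mean_squared_error(ys, preds)]

for _ in range(n_rounds):
    # the errors of the current ensemble become the next learner's targets
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # add the new weak learner's (scaled) correction to the ensemble
    preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    history.append(mean_squared_error(ys, preds))

print("training MSE before boosting:", round(history[0], 3))
print("training MSE after boosting: ", round(history[-1], 3))
```

The error tracked here is the training error, which falls round by round; it says nothing by itself about performance on unseen data.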

Note: scikit-learn provides several modules and classes for ensemble methods. The scores reported above do not suggest that one method is superior to another. This article aims to give a brief introduction to ensemble methods, not to compare them. Choose the method that suits your data.



Last Updated : 27 Mar, 2023