Automated machine learning (AutoML) automates the end-to-end process of applying machine learning to real-world problems. It handles most of the steps in an ML pipeline with minimal human effort and without compromising performance.
AutoML broadly includes the following steps:
- Data Preparation and Ingestion: Real-world data can be raw or arrive in almost any format. In this step, the data is converted into a form that can be processed easily. We also need to decide the data type of each column in the dataset and have a clear idea of the task to be performed on the data (e.g. classification, regression, etc.).
- Feature Engineering: This covers the steps required for cleaning the dataset, such as handling NULL/missing values, selecting the most important features, removing low-correlation features, and dealing with skewed data.
- Hyperparameter Optimization: To obtain the best results from any model, AutoML needs to carefully tune the hyperparameter values.
- Model Selection: H2O AutoML trains a large number of models to produce the best results, and also trains stacked ensembles of those models to extract the best performance from the training data.
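The hyperparameter optimization and model selection steps above can be sketched in plain Python. The quadratic scoring function and the parameter ranges below are made-up placeholders standing in for actually training and validating a model; they only illustrate the random-search loop itself.

```python
import random

def validation_score(learning_rate, max_depth):
    # Placeholder for training a model and scoring it on a validation
    # set; a real AutoML system would fit and evaluate a model here.
    return -((learning_rate - 0.1) ** 2) - 0.001 * ((max_depth - 6) ** 2)

random.seed(1)
best_params, best_score = None, float('-inf')
for _ in range(50):  # random search over the hyperparameter space
    params = {
        'learning_rate': random.uniform(0.01, 0.5),
        'max_depth': random.randint(2, 12),
    }
    score = validation_score(**params)
    if score > best_score:  # keep the best configuration seen so far
        best_params, best_score = params, score

print(best_params)
```

An AutoML system repeats this loop (or a smarter variant of it) over many algorithm families, then compares the winners on a leaderboard.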
H2O AutoML contains cutting-edge, distributed implementations of many machine learning algorithms. These algorithms are available in Java, Python, Spark, Scala, and R. H2O also provides a web GUI that uses JSON to drive these algorithms. Models trained with H2O AutoML can be easily deployed on a Spark server, AWS, etc.
The main advantage of H2O AutoML is that it automates steps like basic data processing, model training and tuning, and the ensembling and stacking of various models to deliver the best-performing model, so that developers can focus on other tasks like data collection, feature engineering, and model deployment.
Functionalities of H2O AutoML
- H2O AutoML provides the necessary data-processing capabilities, which are also included in all of the H2O algorithms.
- Trains a random grid of algorithms like GBMs, DNNs, GLMs, etc. using a carefully chosen hyperparameter space.
- Individual models are tuned using cross-validation.
- Two Stacked Ensembles are trained. One ensemble contains all the models (optimized for model performance), and the other ensemble provides just the best performing model from each algorithm class/family (optimized for production use).
- Returns a sorted “Leaderboard” of all models.
- All models can be easily exported to production.
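The stacked ensembles mentioned above combine base models by fitting a metalearner over their cross-validated predictions. The numpy sketch below illustrates the idea with made-up predictions from three hypothetical base models and a simple least-squares metalearner; it is a simplified illustration, not H2O's actual implementation (H2O uses a GLM metalearner by default).

```python
import numpy as np

# Hypothetical cross-validated predictions from three base models
# (rows = observations, columns = models) and the true target values.
base_preds = np.array([[1.0, 1.2, 0.8],
                       [2.1, 1.9, 2.3],
                       [2.9, 3.2, 3.1],
                       [4.2, 3.8, 4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Metalearner: least-squares weights over the base model predictions.
weights, *_ = np.linalg.lstsq(base_preds, y, rcond=None)
ensemble_pred = base_preds @ weights

print(weights)
```

Because any single base model is itself a special case of the weighted combination, the ensemble's training error can never be worse than the best base model's, which is why the stacked ensembles usually top the leaderboard.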
Architecture:
H2O AutoML uses the H2O architecture, which can be divided into layers: the top layer exposes the different APIs, and the bottom layer is the H2O JVM.
H2O Software Stack
H2O provides REST API clients for Python, R, Excel, Tableau, and Flow Web UI using socket connections.
The bottom layer contains different components that will run on the H2O JVM process.
An H2O cluster consists of one or more nodes. Each node is a single JVM process. Each JVM process is split into three layers: language, algorithms, and core infrastructure.
- The first layer in the bottom section is the language layer, which consists of an expression evaluation engine for R and the Scala layer.
- The second layer is the algorithm layer. It contains the algorithms that ship with H2O, such as XGBoost, GBM, Random Forest, K-Means, etc.
- The third layer is the core infrastructure layer, which handles resource management such as memory and CPU management.
Implementation:
- In this code, we will use the California Housing dataset, which is readily available in Colab. First, we need to import the necessary packages.
Code:
python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
- Now, we load the California Housing dataset. It is already available in the sample_data folder when we load the Colab environment.
Code:
python3
df = pd.read_csv('sample_data/california_housing_train.csv')
- Let’s look at the dataset. We use the head function to list the first few rows.
Code:
python3
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
- Now, let’s check for null values in the dataset with isnull().sum(). As we can see, there are no null values.
Code:
python3
df.isnull().sum()
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
dtype: int64
- Now we need to install h2o, which we can do using pip. Note that if you are using a local environment for H2O, you need to install the Java Development Kit (JDK) first. After installing the JDK and H2O, we initialize it; if everything works, this starts an H2O instance on localhost. There are many arguments we can pass, such as:
- nthreads: Number of cores the H2O server can use; by default it uses all CPU cores.
- ip: The IP address of the server where the H2O server will run. By default, it uses localhost.
- port: The port on which the H2O server will run.
- max_mem_size: A character string specifying the maximum size, in bytes, of the memory allocation pool for H2O. This value must be a multiple of 1024 greater than 2 MB. Append the letter m or M to indicate megabytes, or g or G to indicate gigabytes. Similarly, there is a min_mem_size parameter. For more details, please see the H2O docs.
Code:
python3
! pip install h2o
import h2o
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpebz1_45i
JVM stdout: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.30.0.6
H2O_cluster_version_age: 13 days
H2O_cluster_name: H2O_from_python_unknownUser_h4lj71
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.180 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: accepting new members, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.6.9 final
- The H2O instance can also be accessed at localhost:54321; it provides a web GUI called Flow. Now, we need to convert the training DataFrame into an H2OFrame.
python3
train_df = h2o.H2OFrame(df)
train_df.describe()
Parse progress: |█████████████████████████████████████████████████████████| 100%
Rows:17000
Cols:9
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
type real real int int int int int real int
mins -124.35 32.54 1.0 2.0 1.0 3.0 1.0 0.4999 14999.0
mean -119.5621082352941 35.62522470588239 28.589352941176436 2643.6644117647143 539.4108235294095 1429.573941176477 501.2219411764718 3.8835781000000016 207300.9123529415
maxs -114.31 41.95 52.0 37937.0 6445.0 35682.0 6082.0 15.0001 500001.0
sigma 2.0051664084260357 2.137339794657087 12.586936981660406 2179.9470714527765 421.4994515798648 1147.852959159527 384.52084085590155 1.9081565183791034 115983.76438720895
zeros 0 0 0 0 0 0 0 0 0
missing 0 0 0 0 0 0 0 0 0
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.4 19.0 7650.0 1901.0 1129.0 463.0 1.82 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.925 65500.0
5 -114.58 33.63 29.0 1387.0 236.0 671.0 239.0 3.3438 74000.0
6 -114.58 33.61 25.0 2907.0 680.0 1841.0 633.0 2.6768 82400.0
7 -114.59 34.83 41.0 812.0 168.0 375.0 158.0 1.7083 48500.0
8 -114.59 33.61 34.0 4789.0 1175.0 3134.0 1056.0 2.1782 58400.0
9 -114.6 34.83 46.0 1497.0 309.0 787.0 271.0 2.1908 48100.0
- Now, we load our test dataset into pandas DataFrame and convert it into the H2O Dataframe.
Code:
python3
test = pd.read_csv('sample_data/california_housing_test.csv')
test = h2o.H2OFrame(test)
x = test.columns
y = 'median_house_value'
x.remove(y)
Parse progress: |█████████████████████████████████████████████████████████| 100%
- Now, we run AutoML and start training.
Code:
python3
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_runtime_secs=600,
                seed=1,
                balance_classes=False,
                project_name='Project 1')
aml.train(x=x, y=y, training_frame=train_df)
AutoML progress: |████████████████████████████████████████████████████████| 100%
CPU times: user 40 s, sys: 1.24 s, total: 41.2 s
Wall time: 9min 39s
- In this step, we look for the best-performing model using the leaderboard; it will most probably be one of the two stacked ensemble models.
python3
lb = aml.leaderboard
lb.head(rows=lb.nrows)
model_id mean_residual_deviance rmse mse mae rmsle
StackedEnsemble_AllModels_AutoML_20200714_173719 2.04045e+09 45171.3 2.04045e+09 29642.1 0.221447
StackedEnsemble_BestOfFamily_AutoML_20200714_173719 2.06576e+09 45450.6 2.06576e+09 29949.4 0.223522
GBM_3_AutoML_20200714_173719 2.15623e+09 46435.2 2.15623e+09 30763.8 0.227577
GBM_4_AutoML_20200714_173719 2.15913e+09 46466.4 2.15913e+09 30786.7 0.228627
XGBoost_grid__1_AutoML_20200714_173719_model_5 2.16562e+09 46536.2 2.16562e+09 31075.9 0.233288
GBM_2_AutoML_20200714_173719 2.17639e+09 46651.8 2.17639e+09 31014.8 0.229731
GBM_grid__1_AutoML_20200714_173719_model_2 2.2457e+09 47388.8 2.2457e+09 31717.9 0.236673
GBM_grid__1_AutoML_20200714_173719_model_4 2.24615e+09 47393.6 2.24615e+09 31533.6 0.235206
GBM_grid__1_AutoML_20200714_173719_model_5 2.30368e+09 47996.7 2.30368e+09 31888 0.234582
GBM_grid__1_AutoML_20200714_173719_model_3 2.31412e+09 48105.3 2.31412e+09 32428.7 0.241596
GBM_1_AutoML_20200714_173719 2.38155e+09 48801.2 2.38155e+09 32817.8 0.241261
GBM_5_AutoML_20200714_173719 2.38712e+09 48858.1 2.38712e+09 32730.3 0.238373
XGBoost_grid__1_AutoML_20200714_173719_model_2 2.41444e+09 49137 2.41444e+09 33359.3 nan
XGBoost_grid__1_AutoML_20200714_173719_model_1 2.43811e+09 49377.2 2.43811e+09 33392.7 nan
XGBoost_grid__1_AutoML_20200714_173719_model_6 2.44549e+09 49451.8 2.44549e+09 33620.7 nan
XGBoost_grid__1_AutoML_20200714_173719_model_7 2.46672e+09 49666.1 2.46672e+09 33264.5 nan
XGBoost_3_AutoML_20200714_173719 2.47346e+09 49733.9 2.47346e+09 33829 nan
XGBoost_grid__1_AutoML_20200714_173719_model_3 2.53867e+09 50385.2 2.53867e+09 33713.1 0.252152
XGBoost_grid__1_AutoML_20200714_173719_model_4 2.61998e+09 51185.8 2.61998e+09 34084.3 nan
GBM_grid__1_AutoML_20200714_173719_model_1 2.63332e+09 51315.9 2.63332e+09 35218.1 nan
XGBoost_1_AutoML_20200714_173719 2.64565e+09 51435.9 2.64565e+09 34900.5 nan
XGBoost_2_AutoML_20200714_173719 2.67031e+09 51675 2.67031e+09 35556.1 nan
DRF_1_AutoML_20200714_173719 2.90447e+09 53893.1 2.90447e+09 36925.5 0.263639
XRT_1_AutoML_20200714_173719 2.92071e+09 54043.6 2.92071e+09 37116.6 0.264397
XGBoost_grid__1_AutoML_20200714_173719_model_8 4.32541e+09 65767.9 4.32541e+09 43502.3 0.287448
DeepLearning_1_AutoML_20200714_173719 5.06767e+09 71187.6 5.06767e+09 49467.4 nan
DeepLearning_grid__2_AutoML_20200714_173719_model_1 6.01537e+09 77558.8 6.01537e+09 56478.1 0.386805
DeepLearning_grid__3_AutoML_20200714_173719_model_1 7.85515e+09 88629.3 7.85515e+09 64133.5 0.448841
GBM_grid__1_AutoML_20200714_173719_model_6 8.44986e+09 91923.1 8.44986e+09 71726.4 0.483173
DeepLearning_grid__1_AutoML_20200714_173719_model_2 8.72689e+09 93417.8 8.72689e+09 65346.1 nan
DeepLearning_grid__1_AutoML_20200714_173719_model_1 8.9643e+09 94680 8.9643e+09 68862.6 nan
GLM_1_AutoML_20200714_173719 1.34525e+10 115985 1.34525e+10 91648.3 0.592579
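For production you often want the best single (non-ensemble) model rather than a stacked ensemble. Once the leaderboard is converted to a pandas DataFrame (in H2O, via lb.as_data_frame()), the filtering is plain pandas. The rows below are a hypothetical excerpt of the leaderboard above, with shortened model ids for readability.

```python
import pandas as pd

# Hypothetical leaderboard excerpt; in H2O this would come from
# aml.leaderboard.as_data_frame(). The frame is already sorted by rmse.
lb_df = pd.DataFrame({
    'model_id': ['StackedEnsemble_AllModels_1',
                 'StackedEnsemble_BestOfFamily_1',
                 'GBM_3',
                 'XGBoost_grid_model_5'],
    'rmse': [45171.3, 45450.6, 46435.2, 46536.2],
})

# Drop the ensembles, then take the top remaining row.
singles = lb_df[~lb_df['model_id'].str.startswith('StackedEnsemble')]
best_single = singles.iloc[0]['model_id']
print(best_single)  # GBM_3
```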
- In this step, we explore the base learners of the stacked ensemble model and select the best-performing base learner.
Code:
python3
se = aml.leader
metalearner = h2o.get_model(se.metalearner()['name'])
metalearner.varimp()
[('XGBoost_grid__1_AutoML_20200714_173719_model_5',
36607.81502851827,
1.0,
0.3400955145231931),
('GBM_4_AutoML_20200714_173719',
33538.168782584005,
0.9161477885652846,
0.311577753531396),
('GBM_3_AutoML_20200714_173719',
27022.573640463357,
0.7381640674105295,
0.25104628830851705),
('XGBoost_grid__1_AutoML_20200714_173719_model_3',
7512.2319349954105,
0.2052084214570911,
0.06979046367994166),
('GBM_2_AutoML_20200714_173719',
1221.399944930078,
0.03336445903637191,
0.011347102862762904),
('XGBoost_grid__1_AutoML_20200714_173719_model_4',
897.9511180098376,
0.024528945999926915,
0.008342184510556763),
('XGBoost_grid__1_AutoML_20200714_173719_model_2',
839.6650323257486,
0.022936769967604773,
0.007800692583632669),
('GBM_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
('GBM_grid__1_AutoML_20200714_173719_model_4', 0.0, 0.0, 0.0),
('GBM_grid__1_AutoML_20200714_173719_model_5', 0.0, 0.0, 0.0),
('GBM_grid__1_AutoML_20200714_173719_model_3', 0.0, 0.0, 0.0),
('GBM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('GBM_5_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('XGBoost_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
('XGBoost_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
('XGBoost_grid__1_AutoML_20200714_173719_model_7', 0.0, 0.0, 0.0),
('XGBoost_3_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('GBM_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
('XGBoost_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('XGBoost_2_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('DRF_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('XRT_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('XGBoost_grid__1_AutoML_20200714_173719_model_8', 0.0, 0.0, 0.0),
('DeepLearning_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
('DeepLearning_grid__2_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
('DeepLearning_grid__3_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
('GBM_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
('DeepLearning_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
('DeepLearning_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
('GLM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0)]
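Each tuple in the output above is (model, relative importance, scaled importance, percentage), and the last two columns can be recomputed from the first: scaled importance is the relative importance divided by the maximum, and percentage is the relative importance divided by the sum. A quick check using the non-zero relative importances copied from the output above (the zero entries do not affect either ratio):

```python
# Non-zero relative importances from the metalearner output above.
relative = [36607.81502851827, 33538.168782584005, 27022.573640463357,
            7512.2319349954105, 1221.399944930078, 897.9511180098376,
            839.6650323257486]

scaled = [r / max(relative) for r in relative]       # scaled importance
percentage = [r / sum(relative) for r in relative]   # percentage column

print(round(scaled[1], 4), round(percentage[0], 4))  # 0.9161 0.3401
```

These match the second and fourth values of the tuples in the output, confirming how H2O derives the columns.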
- Now, we evaluate this base learner on the test data and plot its feature importance.
python3
model = h2o.get_model('XGBoost_grid__1_AutoML_20200714_173719_model_5')
model.model_performance(test)
ModelMetricsRegression: xgboost
** Reported on test data. **
MSE: 2194912948.887177
RMSE: 46849.89806698812
MAE: 31039.50846508789
RMSLE: 0.24452804591616809
Mean Residual Deviance: 2194912948.887177
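The metrics reported above are standard regression metrics and can be reproduced with numpy; the actual and predicted house values below are made up for illustration. Note that RMSLE is computed on log1p of the values, so it is undefined when predictions go negative, which is likely why some leaderboard entries show nan in the rmsle column.

```python
import numpy as np

# Made-up actual and predicted house values for illustration.
actual = np.array([66900.0, 80100.0, 85700.0, 73400.0])
pred = np.array([70000.0, 78000.0, 90000.0, 70000.0])

mse = np.mean((pred - actual) ** 2)                 # mean squared error
rmse = np.sqrt(mse)                                 # root mean squared error
mae = np.mean(np.abs(pred - actual))                # mean absolute error
rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))

print(rmse, mae)
```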
Code:
python3
model.varimp_plot(num_of_features=9)
- Finally, we can save this model using h2o.save_model; the saved model can then be deployed on various platforms.
Code:
python3
model_path = h2o.save_model(model=model, path='sample_data/', force=True)