
Model Building for Data Analytics

Last Updated : 29 May, 2023

Prerequisite – Life Cycle Phases of Data Analytics

After formulating the problem and preprocessing the data accordingly, we select the type of model we should build. For example, if our problem requires the result to be highly explainable, we use models like linear regression or a decision tree, but if we need higher accuracy, we build models like XGBoost or a deep neural network. 
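To make this trade-off concrete, here is a minimal, hedged sketch that fits an interpretable linear regression and a higher-capacity gradient-boosted model (used here as an sklearn stand-in for XGBoost) on made-up toy data; the data and variable names are illustrative assumptions, not part of this article's example.

Python3

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Toy data: 200 samples, 2 features, noisy linear target (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.1, 200)

# Explainable model: the coefficients tell us how each feature moves the target
lin = LinearRegression().fit(X_toy, y_toy)
print("Linear coefficients:", lin.coef_)

# Higher-capacity model: often more accurate, but much harder to interpret
gbr = GradientBoostingRegressor().fit(X_toy, y_toy)
print("Boosted model R^2 on training data:", gbr.score(X_toy, y_toy))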

Model Building In Data Analytics

Model building is an essential part of data analytics and is used to extract insights and knowledge from the data to make business decisions and strategies. In this phase of the project, the data science team develops data sets for training, testing, and production purposes. These data sets enable data scientists to develop an analytical method and train it while holding aside some of the data for testing the model. Model building in data analytics aims not only at high accuracy on the training data but also at the ability to generalize and perform well on new, unseen data. Therefore, the focus is on creating a model that captures the underlying patterns and relationships in the data, rather than simply memorizing the training data.

To do this, we divide our dataset into two parts:

  1. Training dataset 
  2. Test dataset 

Note: Depending on the quality and quantity of the data, one may choose to divide the dataset into three parts: training, validation, and testing data (as sketched below).
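If a third validation split is needed, one simple approach (a sketch, with illustrative ratios that are not part of the original example) is to call train_test_split twice:

Python3

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy features and target, just to make the snippet self-contained
X = np.random.rand(1000, 2)
y = np.random.rand(1000)

# First hold out 20% for testing, then carve a validation set out of the
# remaining 80% (0.25 of 80% = 20% of the full data): a 60/20/20 split.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200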

Dividing The Dataset For Model Building 

To divide the dataset, we will use the Python sklearn library, which helps us split the data into training and testing datasets. We can choose the ratio in which to split the dataset; by default, train_test_split uses 3:1 for training and testing (test_size=0.25).

Python code for creating and dividing the dataset 

We will first create a random array with 2 columns and 1000 rows and convert it into a DataFrame using pandas. After that, we will use the sklearn package to split the DataFrame into train and test datasets, and we will also separate the dataset into dependent and independent variables. 

Python3




from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Create a random 1000 x 2 array and wrap it in a DataFrame
data = np.random.randint(low=10, high=100,
                         size=2000).reshape(1000, 2)
data = pd.DataFrame(data, columns=['x', 'y'])

# Independent variables (features) and a random dependent variable (target)
X = data[['x', 'y']]
y = np.random.rand(1000)

# Hold out 25% of the data for testing
train_data_x, test_data_x, train_data_y, test_data_y = \
    train_test_split(X, y, test_size=0.25)


Scaling The Dataset 

Scaling the dataset is an important preprocessing step before feeding the data to the model. There are several benefits of scaling the data:

  • It prevents features with different scales from dominating the model. For example, suppose column A has data ranging from 1 to 1000 and column B has data ranging from 0 to 1; column A can then influence the model's decisions even if it is not an important feature. After scaling, all columns fall in a similar range (see the small demonstration after this list).  
  • It speeds up model convergence. Many optimization algorithms, such as gradient descent, are very sensitive to the scale of the data. By scaling the data between 0 and 1, these algorithms converge faster.
Effect of scaling on Gradient Descent

  • Scaling the dataset makes our model more robust to outliers. 
  • Some algorithms, like K-nearest neighbors (KNN), use the distance between data points to make predictions; in this case, if the columns have different scales, the features with larger ranges dominate the distance. 
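To make the first point concrete, here is a minimal sketch using the made-up column ranges from the example above; after min-max scaling, both columns lie in the same 0 to 1 range.

Python3

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column A ranges roughly from 1 to 1000, column B from 0 to 1 (made-up values)
raw = np.array([[1.0, 0.1],
                [500.0, 0.5],
                [1000.0, 0.9]])

scaled = MinMaxScaler().fit_transform(raw)
print(scaled)
# Both columns now lie in [0, 1] (values approximate):
# [[0.     0. ]
#  [0.4995 0.5]
#  [1.     1. ]]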

Python code for scaling the columns 

We will use the MinMaxScaler object from the sklearn library to scale the independent features of the dataset. 

Python3




from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training features only, then apply it to both sets
scaler = MinMaxScaler()
train_data_x_scaled = scaler.fit_transform(train_data_x.to_numpy())
test_data_x_scaled = scaler.transform(test_data_x.to_numpy())


Modeling The Data 

After scaling and splitting, the data is ready to be fitted to a model. The choice of model depends entirely on our problem formulation. There are a variety of models we can choose from. However, before choosing the model, we should first identify these points about the data: 

  1. Whether our problem is a regression problem or a classification problem (a quick heuristic check is sketched after this list) 
  2. Whether we want a model that is more explainable or a model that has higher accuracy 
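For the first point, a quick heuristic check (a sketch that assumes the target is stored in a NumPy array named y, as in the earlier code) is to look at the target's dtype and the number of distinct values it takes:

Python3

import numpy as np

y = np.random.rand(1000)  # the continuous target from the earlier example

# A floating-point target with many distinct values usually indicates
# regression; a small set of discrete labels usually indicates classification.
if np.issubdtype(y.dtype, np.floating) and len(np.unique(y)) > 20:
    print("Looks like a regression problem")
else:
    print("Looks like a classification problem")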

Python code for modeling the data 

Since our target value is continuous, we will treat this as a regression problem. To keep the model simple and explainable, we will use a decision tree. 

Python3




from sklearn.tree import DecisionTreeRegressor, plot_tree

# Fit a shallow decision tree on the scaled training features
reg = DecisionTreeRegressor(min_samples_split=4,
                            max_leaf_nodes=10)
reg.fit(train_data_x_scaled, train_data_y)

# Predict on the scaled test features
y_pred = reg.predict(test_data_x_scaled)


After building the model, we evaluate it using an evaluation metric. In our case, we will use mean squared error to measure the model's performance.

Python code for evaluation 

Python3




from sklearn.metrics import mean_squared_error

# mean_squared_error expects (y_true, y_pred)
print(mean_squared_error(test_data_y, y_pred))


Output:

0.109

Since our target values were randomly generated, a mean squared error of 0.109 is not bad. One good thing about the decision tree is that we can also see the decisions that were made to model the data. 

Plotting The Decision Graph 

We can use the plot_tree function from the sklearn library to visualize the basis on which the decisions are made. 

Python3




import matplotlib.pyplot as plt

# Render the fitted tree structure
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
plot_tree(reg, filled=True, ax=axes, fontsize=2)
plt.show()


Output:

Decision tree diagram for the model


