
Model Building for Data Analytics

Last Updated : 29 May, 2023

Prerequisite – Life Cycle Phases of Data Analytics

After formulating the problem and preprocessing the data accordingly, we select the type of model we should build. For example, if our problem requires the result to be highly explainable, we use models like linear regression or a decision tree, but if we need higher accuracy, we build models like XGBoost or a deep neural network. 
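To make this trade-off concrete, here is a minimal, hedged sketch that fits an interpretable linear regression and a higher-capacity gradient-boosted model (used here as an sklearn stand-in for XGBoost) on made-up toy data; the data and variable names are illustrative assumptions, not part of this article's example.

Python3

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Toy data: 200 samples, 2 features, noisy linear target (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.1, 200)

# Explainable model: the coefficients tell us how each feature moves the target
lin = LinearRegression().fit(X_toy, y_toy)
print("Linear coefficients:", lin.coef_)

# Higher-capacity model: often more accurate, but much harder to interpret
gbr = GradientBoostingRegressor().fit(X_toy, y_toy)
print("Boosted model R^2 on training data:", gbr.score(X_toy, y_toy))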

Model Building In Data Analytics

Model building is an essential part of data analytics and is used to extract insights and knowledge from the data to make business decisions and strategies. In this phase of the project, the data science team develops data sets for training, testing, and production purposes. These data sets enable data scientists to develop an analytical method and train it while holding aside some of the data for testing the model. Model building in data analytics aims not only at high accuracy on the training data but also at the ability to generalize and perform well on new, unseen data. Therefore, the focus is on creating a model that captures the underlying patterns and relationships in the data, rather than simply memorizing the training data.

To do this, we divide our dataset into two parts:

  1. Training dataset 
  2. Test dataset 

Note: Depending on the quality and quantity of the data, one may choose to divide the dataset into three parts: training, validation, and testing data (as sketched below).
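If a third validation split is needed, one simple approach (a sketch, with illustrative ratios that are not part of the original example) is to call train_test_split twice:

Python3

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy features and target, just to make the snippet self-contained
X = np.random.rand(1000, 2)
y = np.random.rand(1000)

# First hold out 20% for testing, then carve a validation set out of the
# remaining 80% (0.25 of 80% = 20% of the full data): a 60/20/20 split.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200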

Dividing The Dataset For Model Building 

To divide the dataset, we will use the Python sklearn library, which helps us split the data into training and testing datasets. We can choose the ratio in which to split the dataset; by default, train_test_split uses 3:1 for training and testing (test_size=0.25).

Python code for creating and dividing the dataset 

We will first create a random array with 2 columns and 1000 rows and convert it into a DataFrame using pandas. After that, we will use the sklearn package to split the DataFrame into train and test datasets, and we will also separate the dataset into dependent and independent variables. 

Python3




from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Create a random 1000 x 2 array and wrap it in a DataFrame
data = np.random.randint(low=10, high=100,
                         size=2000).reshape(1000, 2)
data = pd.DataFrame(data, columns=['x', 'y'])

# Independent variables (features) and a random dependent variable (target)
X = data[['x', 'y']]
y = np.random.rand(1000)

# Hold out 25% of the data for testing
train_data_x, test_data_x, train_data_y, test_data_y = \
    train_test_split(X, y, test_size=0.25)


Scaling The Dataset 

Scaling the dataset is an important preprocessing step before feeding the data to the model. There are several benefits of scaling the data:

  • It prevents features with different scales from dominating the model. For example, suppose column A has data ranging from 1 to 1000 and column B has data ranging from 0 to 1; column A can then influence the model's decisions even if it is not an important feature. After scaling, all columns fall in a similar range (see the small demonstration after this list).  
  • It speeds up model convergence. Many optimization algorithms, such as gradient descent, are very sensitive to the scale of the data. By scaling the data between 0 and 1, these algorithms converge faster.
Effect of scaling on Gradient Descent

  • Scaling the dataset makes our model more robust to outliers. 
  • Some algorithms, like K-nearest neighbors (KNN), use the distance between data points to make predictions; in this case, if the columns have different scales, the features with larger ranges dominate the distance. 
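To make the first point concrete, here is a minimal sketch using the made-up column ranges from the example above; after min-max scaling, both columns lie in the same 0 to 1 range.

Python3

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column A ranges roughly from 1 to 1000, column B from 0 to 1 (made-up values)
raw = np.array([[1.0, 0.1],
                [500.0, 0.5],
                [1000.0, 0.9]])

scaled = MinMaxScaler().fit_transform(raw)
print(scaled)
# Both columns now lie in [0, 1] (values approximate):
# [[0.     0. ]
#  [0.4995 0.5]
#  [1.     1. ]]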

Python code for scaling the columns 

We will use the MinMaxScaler object from the sklearn library to scale the independent features of the dataset. 

Python3




from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training features only, then apply it to both sets
scaler = MinMaxScaler()
train_data_x_scaled = scaler.fit_transform(train_data_x.to_numpy())
test_data_x_scaled = scaler.transform(test_data_x.to_numpy())


Modeling The Data 

After scaling and splitting, the data is ready to be fitted to a model. The choice of model depends entirely on our problem formulation. There are a variety of models we can choose from. However, before choosing the model, we should first identify these points about the data: 

  1. Whether our problem is a regression problem or a classification problem (a quick heuristic check is sketched after this list) 
  2. Whether we want a model that is more explainable or a model that has higher accuracy 
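For the first point, a quick heuristic check (a sketch that assumes the target is stored in a NumPy array named y, as in the earlier code) is to look at the target's dtype and the number of distinct values it takes:

Python3

import numpy as np

y = np.random.rand(1000)  # the continuous target from the earlier example

# A floating-point target with many distinct values usually indicates
# regression; a small set of discrete labels usually indicates classification.
if np.issubdtype(y.dtype, np.floating) and len(np.unique(y)) > 20:
    print("Looks like a regression problem")
else:
    print("Looks like a classification problem")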

Python code for modeling the data 

Since our target value is continuous, we will treat this as a regression problem. To keep the model simple and explainable, we will use a decision tree. 

Python3




from sklearn.tree import DecisionTreeRegressor, plot_tree

# Fit a shallow decision tree on the scaled training features
reg = DecisionTreeRegressor(min_samples_split=4,
                            max_leaf_nodes=10)
reg.fit(train_data_x_scaled, train_data_y)

# Predict on the scaled test features
y_pred = reg.predict(test_data_x_scaled)


After building the model, we evaluate it using an evaluation metric. In our case, we will use mean squared error to measure the model's performance.

Python code for evaluation 

Python3




from sklearn.metrics import mean_squared_error

# mean_squared_error expects (y_true, y_pred)
print(mean_squared_error(test_data_y, y_pred))


Output:

0.109

Since our target values were randomly generated, a mean squared error of 0.109 is not bad. One good thing about the decision tree is that we can also see the decisions that were made to model the data. 

Plotting The Decision Graph 

We can use the plot_tree function from the sklearn library to visualize the basis on which the decisions are made. 

Python3




import matplotlib.pyplot as plt

# Render the fitted tree structure
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
plot_tree(reg, filled=True, ax=axes, fontsize=2)
plt.show()


Output:

Decision tree diagram for the model


