Convert A Categorical Variable Into Dummy Variables

Last Updated : 11 Dec, 2020

All statistical and machine learning models are built on the foundation of data. A collection of data relevant to a particular problem is called a data set. A data set is composed of Independent Variables (the features) and Dependent Variables (the labels). Each of these variables holds one of two types of data: Quantitative or Categorical.

In this article, we look at several methods for converting Categorical Variables into Dummy Variables. This conversion is an essential part of data pre-processing, which is itself an integral step in building a Machine Learning or Statistical Model. Categorical variables can be further subdivided into the following types:

  • Binary or Dichotomous Variables can take only two values, such as Win/Lose or On/Off.
  • Nominal Variables represent groups with no particular ranking, such as colors or brands.
  • Ordinal Variables represent groups with a specified ranking order, such as finishing positions in a race or App Ratings.

Dummy Variables act as indicators of the presence or absence of a category in a Categorical Variable. By convention, 0 represents absence and 1 represents presence. Converting Categorical Variables into Dummy Variables produces a two-dimensional binary matrix in which each column represents a particular category. The following example clarifies the process.

Data set containing categorical variables:

OUTLOOK   TEMPERATURE  HUMIDITY  WINDY
Rainy     Hot          High      No
Rainy     Hot          High      Yes
Overcast  Hot          High      No
Sunny     Mild         High      No
Sunny     Cool         Normal    No

Data set containing dummy variables:

RAINY  OVERCAST  SUNNY  HOT  MILD  COOL  HIGH  NORMAL  YES  NO
1      0         0      1    0     0     1     0       0    1
1      0         0      1    0     0     1     0       1    0
0      1         0      1    0     0     1     0       0    1
0      0         1      0    1     0     1     0       0    1
0      0         1      0    0     1     0     1       0    1

Explanation:

The above data set comprises four categorical columns: OUTLOOK, TEMPERATURE, HUMIDITY, WINDY. 

Let’s consider the column WINDY, which has two categories: YES and NO. In the data set that contains the Dummy Variables, WINDY is replaced by two columns, one for each category. For each row, the YES column holds 1 if WINDY is Yes in that row and 0 otherwise. The NO column is filled in the same way for No. The same procedure is applied to every categorical column. The important thing to notice is that each categorical column is replaced by as many dummy columns as it has unique categories.
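The procedure described above can be sketched by hand for the WINDY column of the five-row example. This is only an illustration of the idea, written with plain pandas comparisons; the variable names are chosen here for the example.

```python
import pandas as pd

# The WINDY column from the five-row example table.
windy = pd.Series(['No', 'Yes', 'No', 'No', 'No'], name='WINDY')

# One indicator column per unique category: 1 where the row
# holds that category, 0 otherwise.
dummies = pd.DataFrame({
    category.upper(): (windy == category).astype(int)
    for category in ['Yes', 'No']
})

print(dummies)
```

Each row contains exactly one 1 across the YES and NO columns, because every row belongs to exactly one category.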

We are going to be exploring three approaches to convert Categorical Variables into Dummy Variables in this article. 

These approaches are as follows:

  1. Using the LabelBinarizer from sklearn
  2. Using BinaryEncoder from category_encoders
  3. Using the get_dummies() function of the pandas library

Creating the data set:

The first step is creating the data set. It comprises 4 categorical columns named OUTLOOK, TEMPERATURE, HUMIDITY, and WINDY, and has 14 rows. The following code builds it from a dictionary using pandas.DataFrame().

Python3




# code to create the dataset
  
# importing the libraries
import pandas as pd
  
# creating the dictionary
dictionary = {'OUTLOOK': ['Rainy', 'Rainy',
                          'Overcast', 'Sunny',
                          'Sunny', 'Sunny',
                          'Overcast', 'Rainy',
                          'Rainy', 'Sunny',
                          'Rainy', 'Overcast',
                          'Overcast', 'Sunny'],
              'TEMPERATURE': ['Hot', 'Hot', 'Hot',
                              'Mild', 'Cool',
                              'Cool', 'Cool',
                              'Mild', 'Cool',
                              'Mild', 'Mild',
                              'Mild', 'Hot', 'Mild'],
              'HUMIDITY': ['High', 'High', 'High',
                           'High', 'Normal', 'Normal',
                           'Normal', 'High', 'Normal',
                           'Normal', 'Normal', 'High',
                           'Normal', 'High'],
              'WINDY': ['No', 'Yes', 'No', 'No', 'No',
                        'Yes', 'Yes', 'No', 'No',
                        'No', 'Yes', 'Yes', 'No',
                        'Yes']}
  
# converting the dictionary to DataFrame
df = pd.DataFrame(dictionary)
  
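# print() works in any Python environment; in a Jupyter
# notebook, display(df) renders a formatted table instead
print(df)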


Output:

The above is the data set that we will be using for the approaches ahead.

Approach 1:

In this approach, we use LabelBinarizer from sklearn, which converts one categorical column at a time into a binary array with one column per category. We wrap that array in a data frame, which can then be joined to the main data frame when there is more than one Categorical column.

Python3




# importing the libraries
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# creating a copy of the
# original data frame
df1 = df.copy()

# creating an object
# of the LabelBinarizer
label_binarizer = LabelBinarizer()

# fitting LabelBinarizer to the column TEMPERATURE
# and transforming it into an array of 0s and 1s
label_binarizer_output = label_binarizer.fit_transform(df1['TEMPERATURE'])

# creating a data frame from the array, using the
# learned categories (classes_) as column names
result_df = pd.DataFrame(label_binarizer_output,
                         columns=label_binarizer.classes_)

print(result_df)


Output:

Conversion of TEMPERATURE column

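Similarly, we can transform other categorical columns with three or more categories, such as OUTLOOK. For a column with only two categories, such as WINDY or HUMIDITY, LabelBinarizer returns a single 0/1 column rather than two, so passing both entries of classes_ as column names would raise an error and the names must be adjusted.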
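To encode several columns and join the results, a loop along the following lines may help. This is a sketch rather than part of the original walkthrough: the helper binarize_column and the COLUMN_category naming scheme are illustrative choices, and the small sample frame stands in for the full data set. It relies on LabelBinarizer returning a single column when a feature has exactly two categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# A small stand-in frame; column names follow the article's data set.
sample = pd.DataFrame({
    'TEMPERATURE': ['Hot', 'Mild', 'Cool', 'Mild'],
    'WINDY': ['No', 'Yes', 'No', 'Yes'],
})

def binarize_column(frame, column):
    """Hypothetical helper: 0/1 indicator columns for one column."""
    binarizer = LabelBinarizer()
    encoded = binarizer.fit_transform(frame[column])
    classes = list(binarizer.classes_)
    if encoded.shape[1] == 1 and len(classes) == 2:
        # Two categories: a single column is returned, holding 1
        # for the second class in sorted order.
        names = [f'{column}_{classes[1]}']
    else:
        names = [f'{column}_{c}' for c in classes]
    return pd.DataFrame(encoded, columns=names, index=frame.index)

# Encode every column, then place the pieces side by side.
encoded_parts = [binarize_column(sample, col) for col in sample.columns]
result = pd.concat(encoded_parts, axis=1)

print(result)
```

Passing the original index to each piece keeps the rows aligned when the pieces are concatenated.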

Approach 2:

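Using the BinaryEncoder from the category_encoders library, we can encode multiple categorical columns in a single go. Note that BinaryEncoder does not create one column per category. It assigns each category an integer code and stores the binary digits of that code in separate 0/1 columns, so a column with many categories needs fewer output columns than it would with dummy variables. For one indicator column per category, the same library provides OneHotEncoder.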

category_encoders: category_encoders is a Python library of scikit-learn-style transformers, maintained under the scikit-learn-contrib project. Its primary objective is to convert categorical variables into numeric variables. Because its encoders are compatible with sklearn transformers, they can be fitted and then stored in serialized files such as pickle for later use. The library also works directly with data frames, which is useful when preparing data for machine learning and statistical models. It provides a wide range of encoding methods, which can be categorized into Supervised and Unsupervised.

For installation, run this command in the terminal:

pip install category_encoders

For conda:

conda install -c conda-forge category_encoders

Code:

Python3




# importing the libraries
import category_encoders as cat_encoder

# creating a copy of the original data frame
df2 = df.copy()

# creating a BinaryEncoder object
# here we encode all columns;
# a subset of column names can be passed instead
encoder = cat_encoder.BinaryEncoder(cols=list(df2.columns))

# fitting the encoder and transforming the data frame
df_category_encoder = encoder.fit_transform(df2)

print(df_category_encoder)


Output:

Data Frame created from all the Categorical Columns
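To see why binary encoding yields fewer columns than one indicator per category, the idea can be imitated by hand: give each category an integer code and write that code out in binary digits. The sketch below is an illustration of the idea only, not the library's implementation. The helper name binary_encode_sketch is hypothetical, and the integer assignment (order of first appearance, starting at 1) and column naming are assumptions that may differ from what category_encoders produces.

```python
import pandas as pd

def binary_encode_sketch(series):
    """Hypothetical helper: integer code per category, spelled out in bits."""
    # Number the categories 1..k in order of first appearance (assumption).
    codes = pd.factorize(series)[0] + 1
    n_bits = max(int(codes.max()).bit_length(), 1)
    columns = {}
    for bit in range(n_bits):
        shift = n_bits - 1 - bit  # most significant bit first
        columns[f'{series.name}_{bit}'] = (codes >> shift) & 1
    return pd.DataFrame(columns, index=series.index)

outlook = pd.Series(['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny'],
                    name='OUTLOOK')
encoded = binary_encode_sketch(outlook)

print(encoded)
```

Here three categories fit into two bit columns, whereas dummy variables would need three columns.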

Approach 3:

This approach is the simplest way to convert the Categorical Columns of a data frame into Dummy Columns: the get_dummies() function of the pandas library.

We can specify which columns to convert. If we do not, get_dummies() converts all the categorical columns by default.

Python3




# importing the libraries
import pandas as pd

# creating a copy of the original data frame
df3 = df.copy()

# calling the get_dummies method
# the first argument is the data frame to encode
# the columns parameter lists the columns to convert;
# if it is omitted, dummies are returned for all
# categorical columns
# columns not listed (TEMPERATURE, HUMIDITY) are kept as they are
# note: in pandas 2.0 and later the dummies are booleans
# unless dtype=int is passed
df3 = pd.get_dummies(df3,
                     columns=['WINDY', 'OUTLOOK'])

print(df3)


Output:

Using the get_dummies() for the columns WINDY and OUTLOOK
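For comparison, here is a short sketch of the default behaviour, where no columns are listed and every string column is converted. The three-row sample frame is made up for the illustration. Passing dtype=int requests 0/1 values, since recent pandas versions return True/False otherwise.

```python
import pandas as pd

# A small stand-in frame; column names follow the article's data set.
sample = pd.DataFrame({
    'OUTLOOK': ['Rainy', 'Overcast', 'Sunny'],
    'WINDY': ['No', 'Yes', 'No'],
})

# No columns argument: every string column is converted.
# New column names are built as <column>_<category>.
all_dummies = pd.get_dummies(sample, dtype=int)

print(all_dummies.columns.tolist())
print(all_dummies)
```

The original OUTLOOK and WINDY columns no longer appear in the result; only their dummy columns remain.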


