
How to Automate Data Cleaning in Python?

Last Updated : 31 May, 2023

In data science and machine learning, data cleaning plays an essential role. Data cleaning is the process of removing noise from raw data and retaining only the crucial information, so that only relevant features are sent as input to the machine learning model. It is a very crucial step in data science, and it helps in getting better results because the noise has already been removed.


But have you considered that such a process can be time-consuming, and that coding a pipeline for every new dataset can be quite tiresome? Hence, it is a good idea to automate the whole process by creating a fixed pipeline and then using it on every piece of data that needs to be cleaned. Such a pipeline makes the work easier as well as more consistent: one does not have to worry about missing a step, and all that is needed is to run the same pipeline again.

In this article, we will be working on creating a complete pipeline using multiple libraries, modules, and functions in Python to clean a CSV file.

How to Automate Data Cleaning in Python?

To understand the process of automating data cleaning by creating a pipeline in Python, we should start with the whole point of data cleaning in a machine learning task. User information, or any raw data, contains a lot of noise (unwanted parts). Sending such data to the model directly can lead to a lot of confusion and, consequently, to unsatisfactory results. Hence, removing all the unwanted and unnecessary data from the original data helps the model perform better. This is why data cleaning is an essential step in most machine learning tasks.

In the context of data cleaning, automating the process essentially means creating a set of rules (a function, in terms of code) that aligns and organizes the whole data cleaning process and makes it easy for us to run the same pipeline on different sets of data as required.

The data cleaning process can be done using various libraries; the following are some of the most popular ones:

1. Text Data Cleaning Using Regular Expressions

Regular expressions are patterns of characters matched against a given text. The matched sequences can then be removed, extracted, or replaced using Python's built-in regular expression module, imported in code as re.

Here’s a sample code for using regular expressions in Python:

Python




import re
  
# Read text
text = "Read the data from https://www.gfg.org/ posted by gfg@gmail.com"
  
# Remove all non-alphanumeric characters
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
  
# Find all email addresses in the text
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
  
# Replace all phone numbers with "PHONE NUMBER"
clean_text = re.sub(r'\d{3}[-.\s]?\d{3}[-.\s]?\d{4}', 'PHONE NUMBER', clean_text)
  
# Print the cleaned text
print(clean_text)
  
print("Emails found:", emails)


Output:

Read the data from httpswwwgfgorg posted by gfggmailcom
Emails found: ['gfg@gmail.com']

Here are some examples of the data cleaning tasks that can be done using regular expressions:

  • Extract specific patterns from text data, such as email addresses, phone numbers, or dates.
  • Remove unwanted characters or substrings from text data.
  • Replace specific patterns or substrings with new values.
  • Standardize text data by converting all characters to lowercase or uppercase.
  • Identify and remove duplicates based on text data.
  • Split text data into separate columns or rows based on delimiters or patterns.
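Several of the tasks above can be sketched briefly; the record string, delimiter, and date pattern below are made-up examples:

```python
import re

# A hypothetical raw record with inconsistent spacing around delimiters
record = "Alice Smith; 2021-04-05 ;New York"

# Split the record into fields on ";", absorbing surrounding whitespace
fields = re.split(r'\s*;\s*', record)

# Standardize case for consistent comparisons
normalized = [f.lower() for f in fields]

# Extract a date in YYYY-MM-DD form
date = re.search(r'\d{4}-\d{2}-\d{2}', record).group()

print(fields)      # ['Alice Smith', '2021-04-05', 'New York']
print(date)        # 2021-04-05
```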

2. Read the Dataset Using Pandas

Pandas is an extremely popular, well-known library used in almost all machine learning tasks. It is essentially a Python package for creating data frames and manipulating them as needed. While working on the data cleaning process of a data frame, pandas can prove to be a very helpful library.

Below is a sample code for using Pandas in Python for data cleaning:

Python




import pandas as pd
  
# Read in CSV file
df = pd.read_csv('input.csv')
  
# Drop duplicates
df = df.drop_duplicates()
  
# Fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))
  
# Convert data types
df['date'] = pd.to_datetime(df['date'])
df['amount'] = df['amount'].astype(float)
  
# Define custom cleaning function
def clean_name(name):
    return name.strip().title()
  
# Apply custom function to 'name' column
df['name'] = df['name'].apply(clean_name)
  
# Write cleaned data to CSV file
df.to_csv('output.csv', index=False)


Here are some things you can do with Pandas to automate the data-cleaning process:

Task                     | Function Used             | Description
Remove duplicates        | drop_duplicates()         | Remove duplicate rows from a dataframe.
Drop missing values      | dropna()                  | Remove rows or columns with missing values.
Impute missing values    | fillna()                  | Fill in missing values with a specified value or method.
Convert data types       | astype()                  | Convert the data type of a column in a dataframe.
Rename columns           | rename()                  | Rename columns in a dataframe.
Group and aggregate data | groupby(), agg(), apply() | Group and aggregate data in a dataframe.
Filter data              | query(), loc[], iloc[]    | Filter data in a dataframe using various methods.
Apply functions to data  | apply()                   | Apply a function to a column or row in a dataframe.
Merge data               | merge(), join(), concat() | Merge data from multiple dataframes.
Pivot data               | pivot_table()             | Reshape data into a pivot table, with support for multi-indexes and custom aggregation.

By using these functions and methods, you can create a powerful data-cleaning pipeline in Pandas to automate the data-cleaning process.
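A minimal sketch combining several of these functions on a made-up dataframe (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe with a duplicate row and a missing value
df = pd.DataFrame({
    'city': ['NY', 'NY', 'LA', 'LA', 'LA'],
    'sales': [100.0, 100.0, None, 250.0, 300.0],
})

# Remove exact duplicate rows
df = df.drop_duplicates()

# Impute missing sales with the column mean
df['sales'] = df['sales'].fillna(df['sales'].mean())

# Filter rows with query()
big_sales = df.query('sales > 150')

# Group and aggregate per city
totals = df.groupby('city')['sales'].sum()
print(totals)
```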

3. Mathematical Operations Using NumPy

NumPy is another popular library in Python for numerical computing. As its name suggests, it stands for Numerical Python. It provides a powerful array data structure that can be used for efficient data processing and analysis. NumPy has several functions for cleaning data, such as filtering, sorting, and aggregating data.

Here is an example code for using NumPy to filter and sort data:

Python




import numpy as np
  
# create a numpy array
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
  
# filter the array to keep only values greater than 5
filtered_data = data[data > 5]
  
# sort the filtered data in descending order
sorted_data = np.sort(filtered_data)[::-1]
  
print(sorted_data)


Output:

[10  9  8  7  6]

Here are some things you can do with NumPy to automate the data-cleaning process:

Task                             | Function Used              | Description
Replace missing values           | numpy.nan_to_num()         | Replaces NaN values with zeros or a specified value.
Identify missing values          | numpy.isnan()              | Returns a boolean array indicating where NaN values are found.
Replace outliers                 | numpy.clip()               | Clips values to a specified range to remove outliers.
Normalize data                   | numpy.linalg.norm()        | Computes the norm (magnitude) of a vector or matrix.
Standardize data                 | numpy.std(), numpy.mean()  | Computes the standard deviation and mean of a dataset.
Scale data                       | numpy.interp()             | Scales a dataset to a specified range.
Remove duplicate values          | numpy.unique()             | Removes duplicate values from an array.
Filter data based on a condition | numpy.where()              | Returns values from one array where a condition holds.
Split data into chunks           | numpy.array_split()        | Splits an array into roughly equal-sized subarrays.
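A minimal sketch exercising a few of these NumPy functions on a made-up array:

```python
import numpy as np

# Hypothetical array with a missing value and an outlier
data = np.array([1.0, 2.0, np.nan, 250.0, 2.0, 5.0])

# Locate missing values
mask = np.isnan(data)

# Replace NaN with 0
filled = np.nan_to_num(data)

# Clip extreme values into the range [0, 10]
clipped = np.clip(filled, 0, 10)

# Drop duplicate values (the result is sorted)
unique_vals = np.unique(clipped)
print(unique_vals)   # [ 0.  1.  2.  5. 10.]
```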

Creating a Basic Data Cleaning Pipeline in Python

Now that we have discussed some of the popular libraries for automating data cleaning in Python, let’s dive into some of the techniques for using these libraries to clean data. Following is a structure of a basic data-cleaning pipeline that covers the most essential steps:

  • Loading the CSV file: The CSV file is loaded as a data frame using the pandas module in Python.
  • Preprocessing the Data: The data has multiple attributes, and often these are not in a format that machine learning modules can understand. Hence, the following key preprocessing steps can be applied:
    • Removing duplicates: Duplicate rows in a dataset can cause errors or bias in analysis, so it’s important to remove them.
    • Correcting inconsistent data: Inconsistent data can arise due to errors in data entry or data integration. 
    • Handling outliers: Outliers can skew analysis, so it’s important to handle them appropriately. 
    • Formatting data: Data may need to be formatted to meet the requirements of the analysis. 
  • Handling missing values: Missing values can cause problems with analysis, so it’s important to handle them appropriately, either by dropping the affected rows or by imputing substitute values.
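As a concrete illustration of the last step, missing values can either be dropped or imputed with pandas (the column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with gaps in both a numeric and a text column
df = pd.DataFrame({'age': [25, np.nan, 40],
                   'city': ['NY', 'LA', None]})

# Option 1: drop any row that contains a missing value
dropped = df.dropna()

# Option 2: impute -- column mean for numbers, a placeholder for text
filled = df.copy()
filled['age'] = filled['age'].fillna(filled['age'].mean())
filled['city'] = filled['city'].fillna('Unknown')

print(filled)
```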

The steps above cover the most significant operations, but one can add or remove functions as required and clean the data using the updated pipeline.

Implementing the Pipeline

The final pipeline can be implemented as follows using Python. Here the implemented code is tested on a custom-generated dataset as well to see the effect of the data-cleaning process. 

Python




import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np
  
def drop_duplicates(df, subset_name):
    df.drop_duplicates(subset=[subset_name], inplace=True)
    return df
  
def encode(df, column_to_encode):
    le = LabelEncoder()
    # fit and transform a column using the LabelEncoder
    df[column_to_encode] = le.fit_transform(df[column_to_encode])
    return df
  
def outlier_handling(df, column_with_outliers):
    q1 = df[column_with_outliers].quantile(0.25)
    q3 = df[column_with_outliers].quantile(0.75)
    iqr = q3 - q1
    # remove outliers
    df = df[(df[column_with_outliers] > (q1 - 1.5 * iqr)) 
            & (df[column_with_outliers] < (q3 + 1.5 * iqr))] 
    return df
  
def date_formatting(df, column_with_date):
    # format date column
    df[column_with_date] = pd.to_datetime(df[column_with_date],
                                          format='%m/%d/%Y')
    return df
  
def remove_missing_values(df):
    # Find missing values
    missing_values = df.isnull().sum()
    # Remove rows with missing values
    df = df.dropna()
    # Print number of missing values removed
    print("Removed {} missing values".format(missing_values.sum()))
    return df
  
  
def data_cleaning_pipeline(df_path,
                           duplication_subset,
                           column_to_encode,
                           column_with_outliers, 
                           column_with_date):
    df = pd.read_csv(df_path)
    df_no_duplicates = drop_duplicates(df, duplication_subset)
    df_encoded = encode(df_no_duplicates , column_to_encode)
    df_no_outliers = outlier_handling(df_encoded, column_with_outliers)
    df_date_formatted = date_formatting(df_no_outliers, column_with_date)
    df_no_nulls = remove_missing_values(df_date_formatted)
    return df_no_nulls
  
# Create a sample DataFrame
data = {'Name': ['John', 'Jane', 'Bob', 'John', 'Alice'],
        'Age': [30, 25, 40, 30, np.nan],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
        'Income': [50000, 60000, 70000, 45000, 80000],
        'Birthdate': ['01/01/1990', '02/14/1996', '03/15/1981',
                      '01/01/1990', '06/30/1986'],
        'Married': [True, False, True, False, True],
        'Children': [2, 0, 1, 0, 3]}
df = pd.DataFrame(data)
print('Before Preprocessing:\n',df)
# Save DataFrame as CSV file
df.to_csv('my_data.csv', index=False)
      
clean_df = data_cleaning_pipeline('my_data.csv',
                                  'Name',
                                  'Gender',
                                  'Income',
                                  'Birthdate')
  
print('\nAfter preprocessing')
clean_df.head()


Output:

Before Preprocessing:
     Name   Age  Gender  Income   Birthdate  Married  Children
0   John  30.0    Male   50000  01/01/1990     True         2
1   Jane  25.0  Female   60000  02/14/1996    False         0
2    Bob  40.0    Male   70000  03/15/1981     True         1
3   John  30.0    Male   45000  01/01/1990    False         0
4  Alice   NaN  Female   80000  06/30/1986     True         3
Removed 1 missing values

After preprocessing
   Name   Age  Gender  Income  Birthdate  Married  Children
0  John  30.0       1   50000 1990-01-01     True         2
1  Jane  25.0       0   60000 1996-02-14    False         0
2   Bob  40.0       1   70000 1981-03-15     True         1

In the above code, a sample dataset is created and saved to a CSV file so that it can be passed through the data cleaning pipeline. You can easily reuse the same pipeline by simply passing a different file path to the data_cleaning_pipeline function.

Following is a snippet of the data before the cleaning process:

Python3




df.head()


Output:


Before Preprocessing

Following is a snippet of the data after the cleaning process:

Python3




clean_df.head()


Output:


Cleaned dataset

Automating the data cleaning process involves systematically creating all the required functions and using them as needed. Some of the key functions and their code are explained in this article and used in the pipeline.

To conclude, using a well-defined pipeline and standardizing the data-cleaning process can help save time and increase efficiency. Once a pipeline has been created, its standard functions can be reused as required, leaving less room for error.

FAQs

Q1. What is Data Cleaning?

Answer:

Data Cleaning is the process of removing noise from raw data and retaining only the crucial information, so that only relevant features are sent as input to the machine learning model.

Q2. Why is data cleaning important?

Answer:

Data cleaning is an important step to make sure that the data passed to a machine learning model or used for a data science task does not contain any unneeded information.

Q3. Can data cleaning be automated?

Answer:

Yes, data cleaning can be automated using programming languages like Python. This not only streamlines the process but also saves a great deal of time that would otherwise be spent rewriting a cleaning script for every new piece of data.

Q4. What are some common techniques used for data cleaning?

Answer:

Some common techniques used for data cleaning include identifying and handling missing values, removing duplicates, correcting inconsistent values, and handling outliers.


