How to preprocess string data within a Pandas DataFrame?
Sometimes the data we are working with is packed into a single string column, whereas analysis needs it spread across several columns, each with an appropriate data type. In that case the strings have to be preprocessed first. This article shows how to preprocess string data within a Pandas DataFrame.
Method 1: Using str.extract() function
Syntax:
Series.str.extract(pat, flags=0, expand=True)
Parameters:
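- pat: regular expression pattern with capture groups; each group becomes one output column.
- flags: int, default 0 (no flags). Accepts flags from the re module, such as re.IGNORECASE.
- expand: bool, default True. If True, returns a DataFrame with one column per capture group. If False, returns a Series when the pattern has one capture group and a DataFrame when it has several.
Returns:
A DataFrame or a Series, depending on expand and on the number of capture groups. Rows without a match contain NaN.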
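The effect of expand and of named capture groups is easier to see on a tiny input. The following is a minimal sketch; the sample values and the group names letter and digit are made up for illustration and do not come from the article's dataset.

```python
import pandas as pd

# Hypothetical sample values, not taken from the article's dataset.
s = pd.Series(["a1", "b2", "zz"])

# One capture group, expand=True: a DataFrame with a single column.
as_frame = s.str.extract(r"([ab])", expand=True)

# One capture group, expand=False: a Series.
as_series = s.str.extract(r"([ab])", expand=False)

# Named groups become column names; rows without a match are NaN.
named = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
print(named)
```

The third value, "zz", matches neither pattern, so its row holds NaN in every extracted column.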
Step 1: Import packages
The pandas package is imported under its usual alias, pd.
Step 2: Create the DataFrame
pd.DataFrame() builds a DataFrame from the given dictionary. Here we create the DataFrame that needs preprocessing: at the start, all the data sits in a single column as strings.
Python3
import pandas as pd

# Each row packs state, confirmed cases, and timestamp into one string
data = {'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                      'Beijing 14.0 2020-01-22 17:00:00',
                      'Washington 1.0 2020-01-24 17:00:00',
                      'Victoria 3.0 2020-01-31 23:59:00',
                      'Macau 10.0 2020-02-06 14:23:04']}
dataset = pd.DataFrame(data)
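str.extract() takes a regular expression and extracts the text matched by each capture group into a column. The pattern (....-..-.. ..:..:..) extracts timestamps of the form yyyy-mm-dd hh:mm:ss. Each . matches any single character, so the pattern checks only the layout, not that the characters are digits, and the extracted values are still strings rather than datetime objects.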
Python3
dataset['LastUpdated'] = dataset['CovidData'].str.extract(
    '(....-..-.. ..:..:..)', expand=True)
dataset['LastUpdated']
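If real datetime values are needed, the extracted strings can be parsed afterwards. The sketch below is one possible approach, not part of the original article: it assumes a stricter digit-only pattern and uses pd.to_datetime with an explicit format. The variable names rows, stamps, and last_updated are illustrative.

```python
import pandas as pd

rows = pd.Series(["Anhui 1.0 2020-01-22 17:00:00",
                  "Macau 10.0 2020-02-06 14:23:04"])

# \d accepts only digits, where the dot-based pattern accepts any character.
stamps = rows.str.extract(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", expand=False)

# Parse the extracted strings into datetime values using an explicit format.
last_updated = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

print(last_updated.dt.year.tolist())  # [2020, 2020]
```

Once the column holds datetime values, accessors such as .dt.year or .dt.hour become available.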
str.extract() with the pattern ([A-Za-z]+) extracts the first run of alphabetic characters in each string, which here is the state name.
Python3
dataset['State'] = dataset['CovidData'].str.extract('([A-Za-z]+)', expand=True)
dataset['State']
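One caveat: ([A-Za-z]+) stops at the first character that is not a letter, so a name containing a space would be cut short. The sketch below illustrates this with hypothetical rows ("New York" and "Hong Kong" are not in the article's data) and one possible alternative pattern. It assumes the name is always followed by whitespace and a digit.

```python
import pandas as pd

# Hypothetical rows: the multi-word names are not in the article's data.
rows = pd.Series([
    "Anhui 1.0 2020-01-22 17:00:00",
    "New York 2.0 2020-03-01 10:00:00",
    "Hong Kong 5.0 2020-01-25 12:00:00",
])

# ([A-Za-z]+) stops at the first space, so it keeps only the first word.
first_word = rows.str.extract(r"([A-Za-z]+)", expand=False)

# Allow spaces inside the name, anchor it at the start of the string,
# and require whitespace plus a digit right after it.
full_name = rows.str.extract(r"^([A-Za-z ]+?)\s+\d", expand=False)

print(first_word.tolist())  # ['Anhui', 'New', 'Hong']
print(full_name.tolist())   # ['Anhui', 'New York', 'Hong Kong']
```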
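The pattern (\d+\.\d) matches decimal numbers: \d+ matches one or more digits before the decimal point, \. matches the decimal point itself, and \d matches one digit after it, e.g. 12.1 or 3.5. The dot has to be escaped, because an unescaped . matches any character. The extracted values are still strings.
Python3
dataset['confirmed_cases'] = dataset['CovidData'].str.extract(
    r'(\d+\.\d)', expand=True)
dataset['confirmed_cases']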
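To do arithmetic on the case counts, the extracted strings can be converted to numbers. This is a small sketch of one way to do that with pd.to_numeric. The third row is a hypothetical value added to show what happens when nothing matches.

```python
import pandas as pd

rows = pd.Series(["Anhui 1.0 2020-01-22 17:00:00",
                  "Macau 10.0 2020-02-06 14:23:04",
                  "no number here"])  # hypothetical row without a match

# Escaped dot, one or more digits on each side of the decimal point.
cases = rows.str.extract(r"(\d+\.\d+)", expand=False)

# Convert the strings to floats; the unmatched row stays NaN.
cases = pd.to_numeric(cases)

print(cases.dtype)  # float64
```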
Before preprocessing, the DataFrame has a single column, CovidData. After preprocessing, it also has the LastUpdated, State, and confirmed_cases columns.
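The three extractions can also be combined into one call by giving each capture group a name. The following is a sketch of that variant, not the article's original code: the combined pattern and the names parts and result are assumptions, and it relies on every row following the state, number, timestamp layout.

```python
import pandas as pd

dataset = pd.DataFrame({"CovidData": [
    "Anhui 1.0 2020-01-22 17:00:00",
    "Beijing 14.0 2020-01-22 17:00:00",
    "Macau 10.0 2020-02-06 14:23:04",
]})

# One pattern with three named groups; the group names become column names.
pattern = (r"(?P<State>[A-Za-z]+)\s+"
           r"(?P<confirmed_cases>\d+\.\d+)\s+"
           r"(?P<LastUpdated>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

parts = dataset["CovidData"].str.extract(pattern)

# Give each extracted column an appropriate data type.
parts["confirmed_cases"] = parts["confirmed_cases"].astype(float)
parts["LastUpdated"] = pd.to_datetime(parts["LastUpdated"],
                                      format="%Y-%m-%d %H:%M:%S")

# Attach the typed columns to the original frame.
result = dataset.join(parts)
print(result.columns.tolist())
# ['CovidData', 'State', 'confirmed_cases', 'LastUpdated']
```

A single pattern keeps the column definitions in one place, at the cost of a row yielding NaN in all three columns if any one part fails to match.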
Method 2: Using apply() function
In this method, we preprocess a dataset of movie reviews, the Rotten Tomatoes dataset. The pandas, re, and stop_words packages are imported, and the English stop words are stored in a variable called stop_words. The dataset is loaded with the pd.read_csv() method. We then use the apply() method to preprocess the string data in three steps: str.lower converts every string to lower case, re.sub(r'[^\w\s]', '', x) removes punctuation marks, and a final pass removes the stop words. Because the CSV file is large, only a few rows are displayed to show the difference.
Python3
import re

import pandas as pd
from stop_words import get_stop_words

# English stop words from the stop_words package
stop_words = get_stop_words('en')
data = pd.read_csv('test.csv')

print('Before string processing : ')
print(data[(data['PhraseId'] >= 157139) & (
    data['PhraseId'] <= 157141)]['Phrase'])

# Convert every phrase to lower case
data['Phrase'] = data['Phrase'].apply(str.lower)
# Remove punctuation (anything that is not a word character or whitespace)
data['Phrase'] = data['Phrase'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
# Remove stop words
data['Phrase'] = data['Phrase'].apply(lambda x: ' '.join(
    w for w in x.split() if w not in stop_words))

print('After string processing : ')
print(data[(data['PhraseId'] >= 157139) & (data['PhraseId'] <= 157141)]['Phrase'])
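The same three steps can be tried without the CSV file or the stop_words package. The sketch below is a self-contained approximation: the STOP_WORDS set, the two phrases, and the clean() helper are hypothetical stand-ins, not the article's data or the package's actual stop-word list.

```python
import re

import pandas as pd

# Hypothetical stand-ins: a tiny stop-word set and two made-up phrases,
# so the example runs without the stop_words package or test.csv.
STOP_WORDS = {"the", "a", "is", "of", "and", "it"}
data = pd.DataFrame({"Phrase": ["The movie is GREAT, and it shows!",
                                "A waste of time..."]})


def clean(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


data["Phrase"] = data["Phrase"].apply(clean)
print(data["Phrase"].tolist())  # ['movie great shows', 'waste time']
```

Folding the steps into one function means apply() walks the column once instead of three times, and the function can be tested on plain strings.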
Last Updated: 21 Mar, 2024