Remove duplicates from a dataframe in PySpark
Last Updated :
16 Dec, 2021
In this article, we are going to drop duplicate rows from a DataFrame using PySpark in Python.
Before starting, we will create a DataFrame for demonstration:
Python3
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample employee data; rows 1 and 4 appear twice
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
columns = ['Employee ID', 'Employee NAME', 'Company']

# create the DataFrame from the nested lists
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Method 1: Using distinct() method
It removes the fully duplicated rows from the DataFrame.
Syntax: dataframe.distinct()
Where dataframe is the DataFrame created from the nested lists using PySpark.
Example 1: Python program to drop duplicate data using distinct() function
Python3
print('Distinct data after dropping duplicate rows')
dataframe.distinct().show()
Output:
Example 2: Python program to select distinct data in only two columns.
We can use the select() function along with distinct() to get the distinct values from particular columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()
Output:
Method 2: Using dropDuplicates() method
Syntax: dataframe.dropDuplicates()
Where dataframe is the DataFrame created from the nested lists using PySpark.
Example 1: Python program to remove duplicate data from the employee table.
Python3
dataframe.dropDuplicates().show()
Output:
Example 2: Python program to remove duplicate values in specific columns.
Python3
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()
Output: