Count rows based on condition in Pyspark Dataframe
Last Updated :
29 Jun, 2021
In this article, we will discuss how to count rows based on conditions in Pyspark dataframe.
For this, we are going to use these methods:
- Using where() function.
- Using filter() function.
Creating Dataframe for demonstration:
Python3
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate()
data = [[ "1" , "sravan" , "vignan" ],
[ "2" , "ojaswi" , "vvit" ],
[ "3" , "rohith" , "vvit" ],
[ "4" , "sridevi" , "vignan" ],
[ "1" , "sravan" , "vignan" ],
[ "5" , "gnanesh" , "iit" ]]
columns = [ 'ID' , 'NAME' , 'college' ]
dataframe = spark.createDataFrame(data,columns)
print ( 'Actual data in dataframe' )
dataframe.show()
|
Output:
Note: If we want to get all row count we can use count() function
Syntax: dataframe.count()
Where, dataframe is the pyspark input dataframe
Example: Python program to get all row count
Python3
print ( 'Total rows in dataframe' )
dataframe.count()
|
Output:
Total rows in dataframe
6
Method 1: using where()
where(): This clause is used to check the condition and give the results
Syntax: dataframe.where(condition)
Where the condition is the dataframe condition
Example 1: Condition to get rows in dataframe where ID =1
Python3
print ('Total rows in dataframe where\
ID = 1 with where clause')
print (dataframe.where(dataframe. ID = = '1' ).count())
print ( 'They are ' )
dataframe.where(dataframe. ID = = '1' ).show()
|
Output:
Example 2: Condition to get rows in dataframe with multiple conditions.
Python3
print ('Total rows in dataframe where\
ID except 1 with where clause')
print (dataframe.where(dataframe. ID ! = '1' ).count())
print ('Total rows in dataframe where\
college is vignan with where clause')
print (dataframe.where(dataframe.college = = 'vignan' ).count())
print ('Total rows in dataframe where ID greater\
than 2 with where clause')
print (dataframe.where(dataframe. ID > 2 ).count())
|
Output:
Total rows in dataframe where ID except 1 with where clause
4
Total rows in dataframe where college is vignan with where clause
3
Total rows in dataframe where ID greater than 2 with where clause
3
Example 3: Python program for multiple conditions
Python3
print ('Total rows in dataframe where ID \
not equal to 1 and name is sridevi')
print (dataframe.where((dataframe. ID ! = '1' ) &
(dataframe.NAME = = 'sridevi' )
).count())
print ('Total rows in dataframe where college is \
vignan or iit with where clause')
print (dataframe.where((dataframe.college = = 'vignan' ) |
(dataframe.college = = 'iit' )).count())
|
Output:
Total rows in dataframe where ID not equal to 1 and name is sridevi
1
Total rows in dataframe where college is vignan or iit with where clause
4
Method 2: Using filter()
filter(): This clause is used to check the condition and give the results, Both are similar
Syntax: dataframe.filter(condition)
Example 1: Python program to get rows where id = 1
Python3
print ('Total rows in dataframe where\
ID = 1 with filter clause')
print (dataframe. filter (dataframe. ID = = '1' ).count())
print ( 'They are ' )
dataframe. filter (dataframe. ID = = '1' ).show()
|
Output:
Example 2: Python program for multiple conditions
Python3
print ('Total rows in dataframe where ID not \
equal to 1 and name is sridevi')
print (dataframe. filter ((dataframe. ID ! = '1' ) &
(dataframe.NAME = = 'sridevi' )).count())
print ('Total rows in dataframe where college\
is vignan or iit with filter clause')
print (dataframe. filter ((dataframe.college = = 'vignan' ) |
(dataframe.college = = 'iit' )).count())
|
Output:
Total rows in dataframe where ID not equal to 1 and name is sridevi
1
Total rows in dataframe where college is vignan or iit with filter clause
4
Share your thoughts in the comments
Please Login to comment...