PySpark – Split dataframe into equal number of rows
When working with a huge dataset, it is often better to split it into equal chunks and then process each chunk individually. This is possible whenever the operation on the dataframe is independent of the rows. Each chunk, i.e. each equally sized dataframe, can then be processed in parallel, making more efficient use of the available resources. In this article, we will discuss how to split PySpark dataframes into an equal number of rows.
Creating Dataframe for demonstration:
Python
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the examples
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Sample data: two string columns and 12 records
columns = ["Brand", "Product"]
data = [
    ("HP", "Laptop"),
    ("Lenovo", "Mouse"),
    ("Dell", "Keyboard"),
    ("Samsung", "Monitor"),
    ("MSI", "Graphics Card"),
    ("Asus", "Motherboard"),
    ("Gigabyte", "Motherboard"),
    ("Zebronics", "Cabinet"),
    ("Adata", "RAM"),
    ("Transcend", "SSD"),
    ("Kingston", "HDD"),
    ("Toshiba", "DVD Writer")
]

prod_df = spark.createDataFrame(data=data, schema=columns)
prod_df.show()
Output:
In the above code block, we defined the schema for the dataframe and provided sample data. The resulting dataframe consists of two string-type columns with 12 records.
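As a quick sanity check (not part of the original example, and using only standard DataFrame methods), you can confirm the inferred schema and the row count before splitting:
Python
# Optional sanity check on the demo dataframe
prod_df.printSchema()    # both columns are inferred as string
print(prod_df.count())   # 12 records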
Example 1: Split dataframe using ‘DataFrame.limit()’
We will make use of the DataFrame.limit() method, applied repeatedly, to create ‘n’ equal dataframes.
Syntax: DataFrame.limit(num)
where num limits the result count to the specified number of rows.
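For instance, a single call to limit() simply returns a dataframe containing at most the first num rows; the snippet below is a minimal illustration and is not part of the splitting logic itself:
Python
# Returns a new dataframe with at most the first 3 rows of prod_df
prod_df.limit(3).show()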
Code:
Python
n_splits = 4

# Number of rows in each chunk (integer division)
each_len = prod_df.count() // n_splits

copy_df = prod_df
i = 0
while i < n_splits:
    # Take the first 'each_len' rows as the current chunk...
    temp_df = copy_df.limit(each_len)
    # ...and remove those rows from the remaining dataframe
    copy_df = copy_df.subtract(temp_df)
    temp_df.show(truncate=False)
    i += 1
Output:
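Two caveats are worth noting. Since each_len is computed with integer division, any remainder rows (when the row count is not exactly divisible by n_splits) are never emitted by the loop; and subtract() removes duplicate rows, so the approach assumes the records are distinct, as they are in our demo data. A minimal sketch of how the leftover rows could be recovered after the loop is shown below (the leftover_df name is just for illustration):
Python
# After the loop, whatever remains in copy_df is the remainder that
# integer division left out (empty in this demo, since 12 // 4 has no remainder)
leftover_df = copy_df
if leftover_df.count() > 0:
    leftover_df.show(truncate=False)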
Example 2: Split the dataframe, perform the operation and concatenate the result
We will now split the dataframe into ‘n’ equal parts, perform a concatenation operation on each part individually, and then append each result to a `result_df`. This extends the previous code to show how a dataframe operation can be applied separately to each chunk, with the individual outputs then unioned into a new dataframe whose length equals that of the original dataframe.
Python
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import concat, col, lit

n_splits = 4

# Number of rows in each chunk (integer division)
each_len = prod_df.count() // n_splits
copy_df = prod_df

# Operation applied to each chunk: concatenate the two columns into one
def modify_dataframe(data):
    return data.select(
        concat(col("Brand"), lit(" - "), col("Product"))
    )

# Empty dataframe that will collect the processed chunks
schema = StructType([
    StructField('Brand - Product', StringType(), True)
])
result_df = spark.createDataFrame(data=[], schema=schema)

i = 0
while i < n_splits:
    temp_df = copy_df.limit(each_len)
    copy_df = copy_df.subtract(temp_df)
    # Process the current chunk and append it to the running result
    temp_df_mod = modify_dataframe(data=temp_df)
    temp_df_mod.show(truncate=False)
    result_df = result_df.union(temp_df_mod)
    i += 1

result_df.show(truncate=False)
Output:
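As a final check (not shown in the original article), the unioned result_df should contain n_splits * each_len rows, which for this 12-row dataframe with 4 splits equals the length of the original dataframe:
Python
# Sanity check: the recombined dataframe is as long as the original
print(result_df.count() == prod_df.count())  # True: 12 == 12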