PySpark – Randomly Splitting a DataFrame
Last Updated :
01 Feb, 2023
In this article, we are going to learn how to randomly split a data frame using PySpark in Python.
PySpark is a tool created by the Apache Spark community for using Python with Spark. While working with a PySpark data frame, we are sometimes required to split it randomly. In this article, we achieve this using PySpark's randomSplit() function. This function splits the data frame according to the given weights, and, when no seed is provided, it produces a different split each time it is run.
randomSplit() function:
Syntax: data_frame.randomSplit(weights, seed=None)
Parameters:
- weights: A list of double values giving the relative sizes of the splits; if they do not sum to 1, Spark normalizes them.
- seed: The seed for sampling. With a fixed seed (and unchanged weights and data), the data frame is always divided into the same fractional parts; changing the seed or the weights changes the split.
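To see how the weights turn into fractions, here is a minimal plain-Python sketch (not a Spark API call) of the normalization that randomSplit applies: weights of [1.0, 3.0] mean roughly 25% and 75% of the rows.

```python
# Sketch of how randomSplit's weights become fractions:
# Spark normalizes the weight list so it sums to 1.
weights = [1.0, 3.0]
total = sum(weights)
fractions = [w / total for w in weights]
print(fractions)  # [0.25, 0.75] -> roughly 25% and 75% of the rows
```

This is why the example outputs below are close to, but not exactly, a 1:3 ratio: each row is assigned to a split randomly according to these fractions.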
Prerequisite
Note: Follow the usual PySpark installation steps, installing Python instead of Scala; the rest of the steps are the same.
Modules Required:
Pyspark: The Python API for Apache Spark, which lets you work with Spark from Python through a DataFrame interface. This module can be installed through the following command:
pip install pyspark
Stepwise Implementation:
Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.
data_frame = spark_session.read.csv('Path_to_csv_file',
                                    sep=',', inferSchema=True, header=True)
data_frame.show()
Step 4: Next, split the data frame randomly using the randomSplit function, passing weights (and optionally a seed) as arguments. Store the resulting data frames either in a list or in separate variables.
splits = data_frame.randomSplit(weights, seed=None)
Step 5: Finally, display the list elements or the variables to see how the data frame is split.
splits[0].count()
splits[1].count()
Example 1:
In this example, we split the data frame (link) through the randomSplit function, passing only weights as an argument, and store the result in a list. We split the data frame twice to see whether we get the same split each time. Since no seed is given, the counts differ between runs.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)
data_frame.show()

# Without a seed, each call produces a different random split.
splits = data_frame.randomSplit([1.0, 3.0])
splits[0].count()
splits[1].count()

splits = data_frame.randomSplit([1.0, 3.0])
splits[0].count()
splits[1].count()
Output:
4233
12767
4202
12798
Example 2:
In this example, we split the data frame (link) through the randomSplit function, passing weights as well as a seed as arguments, and store the result in a list. We split the data frame twice to see whether we get the same split each time. Because the seed and weights are fixed, we get the same counts each time.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)
data_frame.show()

# With a fixed seed, both calls return the same split.
splits = data_frame.randomSplit([1.0, 3.0], 26)
splits[0].count()
splits[1].count()

splits = data_frame.randomSplit([1.0, 3.0], 26)
splits[0].count()
splits[1].count()
Output:
4181
12819
4181
12819
Example 3:
In this example, we split the data frame (link) through the randomSplit function, passing only weights as an argument, and unpack the result into separate variables. We split the data frame twice to see whether we get the same split each time. Since no seed is given, the counts differ between runs.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)
data_frame.show()

# Without a seed, each call produces a different random split.
split1, split2 = data_frame.randomSplit([1.0, 5.0])
split1.count()
split2.count()

split1, split2 = data_frame.randomSplit([1.0, 5.0])
split1.count()
split2.count()
Output:
2818
14182
2783
14217
Example 4:
In this example, we split the data frame (link) through the randomSplit function, passing weights as well as a seed as arguments, and unpack the result into separate variables. We split the data frame twice to see whether we get the same split each time. Because the seed and weights are fixed, we get the same counts each time.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)
data_frame.show()

# With a fixed seed, both calls return the same split.
split1, split2 = data_frame.randomSplit([1.0, 5.0], 24)
split1.count()
split2.count()

split1, split2 = data_frame.randomSplit([1.0, 5.0], 24)
split1.count()
split2.count()
Output:
2776
14224
2776
14224