Append data to an empty dataframe in PySpark
Last Updated : 05 Apr, 2022
In this article, we will see how to append data to an empty PySpark DataFrame in Python. Spark DataFrames are immutable, so "appending" here means building a new DataFrame that contains the rows of both.
Method 1: Make an empty DataFrame and make a union with a non-empty DataFrame with the same schema
The union() function does the work here. It combines the rows of two DataFrames that have the same number of columns with compatible types. Columns are matched by position, not by name, and duplicate rows are kept.
Syntax : FirstDataFrame.union(SecondDataFrame)
Returns : DataFrame with rows of both DataFrames.
Example:
In this example, we create an empty DataFrame with a particular schema, create a second DataFrame with the same schema and some data, and combine the two using the union() function.
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark_session = SparkSession.builder.appName('Spark_Session').getOrCreate()

# Schema shared by both DataFrames
columns1 = StructType([StructField('Name', StringType(), False),
                       StructField('Salary', IntegerType(), False)])

# Empty DataFrame built from an empty RDD
emp_RDD = spark_session.sparkContext.emptyRDD()
first_df = spark_session.createDataFrame(data=emp_RDD, schema=columns1)
first_df.show()

# Non-empty DataFrame that reuses the same schema
rows = [['Ajay', 56000], ['Srikanth', 89078],
        ['Reddy', 76890], ['Gursaidutt', 98023]]
second_df = spark_session.createDataFrame(rows, columns1)
second_df.show()

# union() returns a new DataFrame, so reassign the result
first_df = first_df.union(second_df)
first_df.show()
Output :
+----+------+
|Name|Salary|
+----+------+
+----+------+

+----------+------+
|      Name|Salary|
+----------+------+
|      Ajay| 56000|
|  Srikanth| 89078|
|     Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+

+----------+------+
|      Name|Salary|
+----------+------+
|      Ajay| 56000|
|  Srikanth| 89078|
|     Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
Method 2: Add a single row to an empty DataFrame by converting the row into a DataFrame
We can use createDataFrame() to convert a single row, given as a Python list, into a DataFrame. The details of createDataFrame() are :
Syntax : CurrentSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Parameters :
- data : RDD, list or pandas.DataFrame: The rows to load into the DataFrame.
- schema : DataType, str or list, optional: A StructType, a datatype string, or a list of column names.
- samplingRatio : float, optional: The fraction of rows sampled when the schema is inferred.
- verifySchema : bool, optional: Verify data types of every row against the specified schema. The value is True by default.
Example:
In this example, we create an empty DataFrame with a particular schema using createDataFrame(), build a one-row DataFrame with the same schema, and combine the two using union(). We store the result back in the variable that held the empty DataFrame and call show() to see the change.
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark_session = SparkSession.builder.appName('Spark_Session').getOrCreate()

# Empty DataFrame with an explicit schema
emp_RDD = spark_session.sparkContext.emptyRDD()
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])
df = spark_session.createDataFrame(data=emp_RDD, schema=columns)
df.show()

# Wrap the single row in a list and give it the same schema
added_row = [['Motera Stadium', 132000]]
added_df = spark_session.createDataFrame(added_row, columns)

# union() returns a new DataFrame, so reassign the result
df = df.union(added_df)
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+

+--------------+--------+
|       Stadium|Capacity|
+--------------+--------+
|Motera Stadium|  132000|
+--------------+--------+
Method 3: Convert the empty DataFrame into a Pandas DataFrame and use the append() function
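We will use toPandas() to convert a PySpark DataFrame to a Pandas DataFrame. This collects every row to the driver, so it is suitable only for data that fits in the driver's memory. Its syntax is :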
Syntax : PySparkDataFrame.toPandas()
Returns : Corresponding Pandas DataFrame
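We will then use the Pandas append() function. Its syntax is :
Syntax : PandasDataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
Parameters :
- other : DataFrame, Series, dict-like, or a list of these: The data that has to be appended.
- ignore_index : bool: If True, the original index labels are discarded and the result is labelled 0, 1, ..., n-1.
- verify_integrity : bool: If True, raise a ValueError when the result would contain duplicate index labels.
- sort : bool: Sort the columns if the columns of other and PandasDataFrame are not aligned.
Note : DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, pandas.concat([PandasDataFrame, other], ignore_index=True) gives the same result.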
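On pandas versions where append() is no longer available, the same step can be written with pandas.concat(). The following is a minimal pandas-only sketch; the frame names are assumptions chosen for the illustration.

```python
import pandas as pd

# Empty frame with the target columns, and a one-row frame to add.
empty_pdf = pd.DataFrame(columns=["Stadium", "Capacity"])
added_pdf = pd.DataFrame({"Stadium": ["Motera Stadium"], "Capacity": [132000]})

# Counterpart of empty_pdf.append(added_pdf, ignore_index=True).
combined = pd.concat([empty_pdf, added_pdf], ignore_index=True)

print(combined)
```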
Example:
Here we create an empty DataFrame to which data will be added. We convert the data to be added into a Spark DataFrame using createDataFrame(), convert both DataFrames to Pandas DataFrames using toPandas(), and use the append() function (or pandas.concat() on pandas 2.0 and later) to add the non-empty DataFrame to the empty one, ignoring the indexes because the result is a new DataFrame. Finally, we convert the resulting Pandas DataFrame back to a Spark DataFrame using createDataFrame().
Python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark_session = SparkSession.builder.appName('Spark_Session').getOrCreate()

# Empty Spark DataFrame with an explicit schema
emp_RDD = spark_session.sparkContext.emptyRDD()
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])
df = spark_session.createDataFrame(data=emp_RDD, schema=columns)
df.show()

# Row to append, as a Spark DataFrame with the same schema
added_row = [['Motera Stadium', 132000]]
added_df = spark_session.createDataFrame(added_row, columns)

# Convert both DataFrames to Pandas
pandas_added = added_df.toPandas()
df = df.toPandas()

# DataFrame.append() was removed in pandas 2.0; pd.concat() is the equivalent call
df = pd.concat([df, pandas_added], ignore_index=True)

# Convert the result back to a Spark DataFrame
df = spark_session.createDataFrame(df)
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+

+--------------+--------+
|       Stadium|Capacity|
+--------------+--------+
|Motera Stadium|  132000|
+--------------+--------+