Renaming columns with Apache Spark (PySpark)
Introduction
In this post, you will learn how to rename the columns of a DataFrame with PySpark.
Code so far to follow along
from pyspark import SparkFiles
from pyspark.sql.types import StructType, StructField, DateType, IntegerType, DoubleType
from pyspark.sql.functions import *

url = 'https://raw.githubusercontent.com/adamrichardson14/PySpark-Tutorial/main/sales.csv'
spark.sparkContext.addFile(url)

schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("est_ref", IntegerType(), True),
        StructField("capacity", IntegerType(), True),
        StructField("occupancy", DoubleType(), True),
        StructField("rooms_sold", IntegerType(), True),
        StructField("avg_rate_paid", DoubleType(), True),
        StructField("sales_value", DoubleType(), True),
    ]
)

path = SparkFiles.get('sales.csv')

# I need to prepend "file://" as I'm using Databricks.
# If you are running this locally, just use the path without the "file://" prefix.
raw = spark.read.csv(
    "file://" + path,
    header=True,
    schema=schema,
)
Renaming a single column
Let’s take a look at how we can rename a single column with PySpark. We’re going to change the est_ref column to EstKey.
raw = raw.withColumnRenamed("est_ref", "EstKey")
Notice that we overwrite the original raw dataframe by reassigning the result of withColumnRenamed, which returns a new dataframe, back to the raw variable. We will do this a lot, as dataframes are immutable.
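If you want to convince yourself that the original is untouched, here’s a minimal sketch (the TotalRooms name is just a throwaway for illustration):

renamed = raw.withColumnRenamed("capacity", "TotalRooms")  # throwaway rename

print("capacity" in raw.columns)        # True - raw still has the old name
print("TotalRooms" in renamed.columns)  # True - only the new dataframe changed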
Renaming multiple columns
The most common way to rename multiple columns is simply to chain withColumnRenamed(). Let’s rename some more columns and change the names to follow PascalCase format:
raw = (
raw.withColumnRenamed("est_ref", "EstKey")
.withColumnRenamed("date", "Date")
.withColumnRenamed("capacity", "Capacity")
.withColumnRenamed("occupancy", "Occupancy")
.withColumnRenamed("rooms_sold", "RoomsSold")
.withColumnRenamed("avg_rate_paid", "AvgRatePaid")
.withColumnRenamed("sales_value", "SalesValue")
)
We now have the following schema (from raw.printSchema()):
|-- Date: date (nullable = true)
|-- EstKey: integer (nullable = true)
|-- Capacity: integer (nullable = true)
|-- Occupancy: double (nullable = true)
|-- RoomsSold: integer (nullable = true)
|-- AvgRatePaid: double (nullable = true)
|-- SalesValue: double (nullable = true)
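One detail worth knowing about the chain above: withColumnRenamed is a no-op when the old column name doesn’t exist, which is why renaming est_ref at the top of the chain was harmless even though we had already renamed it to EstKey. A quick way to see this behaviour (the column names here are made up):

# withColumnRenamed silently does nothing if the source column is absent
print(raw.withColumnRenamed("no_such_column", "Whatever").columns == raw.columns)  # True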
Renaming multiple columns with a helper function
In Pandas, we have a nice function that will rename columns for us; however, we don’t currently have that built in to PySpark. The beauty of programming is that we can just create one! I have this helper function imported into most notebooks and use it frequently, as it’s much cleaner than chaining withColumnRenamed().
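For reference, this is the Pandas pattern being described; a tiny sketch with made-up data:

import pandas as pd

pdf = pd.DataFrame({"est_ref": [101], "rooms_sold": [42]})
pdf = pdf.rename(columns={"est_ref": "EstKey", "rooms_sold": "RoomsSold"})

The helper below gives us the same dictionary-driven ergonomics in PySpark.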
from pyspark.sql import DataFrame


def rename_columns(df: DataFrame, columns_dict: dict) -> DataFrame:
    for old_name, new_name in columns_dict.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df
You can see this function takes in a dataframe, along with a dictionary that maps each current column name to the name you want to change it to. We loop over the dictionary, call withColumnRenamed for each pair, and simply return the new dataframe once the loop has gone through all of the items.
You can use the function like this to achieve the same as we did above. I think it’s a little cleaner:
raw = rename_columns(
    raw,
    {
        "est_ref": "EstKey",
        "date": "Date",
        "capacity": "Capacity",
        "occupancy": "Occupancy",
        "rooms_sold": "RoomsSold",
        "avg_rate_paid": "AvgRatePaid",
        "sales_value": "SalesValue",
    },
)
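As an aside, each withColumnRenamed call adds another projection, so on very wide dataframes you might prefer a single select that applies all the renames in one pass. Here’s a sketch of that variant using the same dictionary idea (and note that Spark 3.4+ also ships a built-in withColumnsRenamed that accepts a dictionary directly):

from pyspark.sql import DataFrame, functions as F


def rename_columns_select(df: DataFrame, columns_dict: dict) -> DataFrame:
    # One projection that aliases every column, keeping the
    # existing name for anything not in the mapping.
    return df.select([F.col(c).alias(columns_dict.get(c, c)) for c in df.columns])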
Renaming all columns with toDF
I would still recommend the approaches above, because toDF() assigns names purely by position: if the column order of your data ever changes, the names will silently end up on the wrong columns (and it will error if the number of names doesn’t match the number of columns). If you’re just working with some static files, though, and you want to change the names of all of the columns, this is probably the fastest way.
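To make the positional pitfall concrete, here’s a minimal sketch (the two-column selection is just for illustration):

demo = raw.select("EstKey", "Date")  # two columns, in this exact order
demo = demo.toDF("Date", "EstKey")   # positional: "Date" now labels the integer key column

demo.printSchema() would now show Date as an integer and EstKey as a date, with no error raised.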
Step 1
Create a list with your new column names in the correct order. We’re going to change the columns back to their original names after making the changes above.
new_column_names = ["date", "est_ref", "capacity", "occupancy", "rooms_sold", "avg_rate_paid", "sales_value"]
Step 2
Use toDF() to rename all of the columns
raw = raw.toDF(*new_column_names)
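If you’re worried about the list drifting out of sync with the dataframe, a cheap guard before that call is worth adding; a minimal sketch:

# toDF() errors out if the name count doesn't match the column count,
# but an explicit assert gives a clearer message.
assert len(new_column_names) == len(raw.columns), (
    f"expected {len(raw.columns)} names, got {len(new_column_names)}"
)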
And now we are back to where we started… What a ride!
|-- date: date (nullable = true)
|-- est_ref: integer (nullable = true)
|-- capacity: integer (nullable = true)
|-- occupancy: double (nullable = true)
|-- rooms_sold: integer (nullable = true)
|-- avg_rate_paid: double (nullable = true)
|-- sales_value: double (nullable = true)
Conclusion
To continue the course, I’m going to remove the last toDF() rename, so we keep our new names for the next tutorial.
Hopefully you can now rename any columns 😃