Spark SQL Column / Data Types explained
Introduction
Types can be a little confusing across SQL, Spark SQL, pandas, and so on. Here we break down all of the Spark SQL data types so you know exactly which type you should be using.
Bookmark this and use it as a reference!
All Spark Data types
Column Type | Technical Description | Example Use Case |
---|---|---|
StringType | Represents character string values | Storing names, addresses, or other textual information |
IntegerType | Represents 32-bit integer values | Storing age, quantity, or other whole numbers |
LongType | Represents 64-bit integer values | Storing large whole numbers like population, or timestamps |
FloatType | Represents single-precision 32-bit floating point values | Storing decimal numbers like price, weight, or temperature |
DoubleType | Represents double-precision 64-bit floating point values | Storing high-precision decimal numbers like scientific data |
BooleanType | Represents true or false values | Storing binary conditions like isActive, isApproved, or isValid |
TimestampType | Represents values comprising both date and time, with microsecond precision | Storing event timestamps or datetime values for analysis |
DateType | Represents date values without time information | Storing birthdates, order dates, or other date-only values |
BinaryType | Represents binary data (byte arrays) | Storing file contents like images, documents, or other files |
ArrayType | Represents arrays of elements with the same data type | Storing multiple values of the same type like tags, or scores |
MapType | Represents key-value pairs with specified key and value data types | Storing dictionaries, or lookup tables with key-value pairs |
StructType | Represents a structured data type containing a list of StructFields with various types | Storing complex data structures like address or person details |
DecimalType | Represents fixed-point decimal numbers with user-defined precision and scale | Storing exact decimal values like currency amounts or percentages |
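All of these type classes live in the pyspark.sql.types module. Before setting types yourself, it can help to check what Spark has already assigned to a DataFrame; here is a minimal sketch, assuming an active SparkSession named spark and a small hypothetical DataFrame:
# Inspect the types Spark has assigned to an existing DataFrame
df = spark.range(3).withColumnRenamed("id", "Quantity")  # hypothetical example DataFrame

df.printSchema()    # tree view, e.g. " |-- Quantity: long (nullable = false)"
print(df.dtypes)    # list of (column, type) pairs, e.g. [('Quantity', 'bigint')]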
How to set the types with Schema
When working with Spark SQL, our recommended best practice is to set the types explicitly with a schema object when creating your DataFrame.
Here is an example of how to set all of the types.
# Imports used across the examples below
from datetime import datetime, date
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import (
    StringType, IntegerType, LongType, FloatType, DoubleType, BooleanType,
    TimestampType, DateType, BinaryType, ArrayType, MapType, StructType,
    StructField, DecimalType,
)

# Sample data for each column type (TimestampType and DateType expect
# datetime and date objects rather than strings)
data = [
    ("John Doe", 30, 16000000000, 72.5, 0.987654321, True,
     datetime(2023, 4, 8, 14, 30, 0), date(2023, 4, 8),
     bytearray(b"\x01\x02\x03"), ["apple", "banana"], {"key": "value"},
     Row(a=1, b="text"), Decimal("3.14")),
]
# Define schema for the sample data
schema = StructType([
StructField("Name", StringType(), True),
StructField("Age", IntegerType(), True),
StructField("Population", LongType(), True),
StructField("Weight", FloatType(), True),
StructField("Accuracy", DoubleType(), True),
StructField("IsValid", BooleanType(), True),
StructField("Timestamp", TimestampType(), True),
StructField("Date", DateType(), True),
StructField("BinaryData", BinaryType(), True),
StructField("Fruits", ArrayType(StringType()), True),
StructField("KeyValue", MapType(StringType(), StringType()), True),
StructField("PersonDetails", StructType([
StructField("a", IntegerType(), True),
StructField("b", StringType(), True),
]), True),
StructField("Amount", DecimalType(10, 2), True),
])
# Create DataFrame using the sample data and schema
df = spark.createDataFrame(data, schema)
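With the schema applied, df.printSchema() should confirm that every column picked up the intended type; the first few lines of output look roughly like this:
df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: integer (nullable = true)
#  |-- Population: long (nullable = true)
#  ...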
How to set types when adding columns
It’s not always possible to set every type in the schema, since you may be adding new columns to an existing DataFrame. Here is an example of how to add a new column for each of the types above.
# Column functions used to build the new columns
# (the type classes and Decimal imports from the previous example are reused here)
from pyspark.sql.functions import lit, array, create_map, struct

# Sample data for the Name column
data = [
("John Doe",)
]
# Define schema for the name column
schema = StructType([
StructField("Name", StringType(), True),
])
# Create DataFrame using the sample data and schema
df = spark.createDataFrame(data, schema)
# Add new columns for each data type
df = df.withColumn("Age", lit(30).cast(IntegerType()))
df = df.withColumn("Population", lit(16000000000).cast(LongType()))
df = df.withColumn("Weight", lit(72.5).cast(FloatType()))
df = df.withColumn("Accuracy", lit(0.987654321).cast(DoubleType()))
df = df.withColumn("IsValid", lit(True).cast(BooleanType()))
df = df.withColumn("Timestamp", lit("2023-04-08 14:30:00").cast(TimestampType()))
df = df.withColumn("Date", lit("2023-04-08").cast(DateType()))
df = df.withColumn("BinaryData", lit(bytearray(b"\x01\x02\x03")).cast(BinaryType()))
df = df.withColumn("Fruits", lit(["apple", "banana"]).cast(ArrayType(StringType())))
df = df.withColumn("KeyValue", lit({"key": "value"}).cast(MapType(StringType(), StringType())))
df = df.withColumn("PersonDetails", lit(Row(a=1, b="text")).cast(StructType([
StructField("a", IntegerType(), True),
StructField("b", StringType(), True),
])))
df = df.withColumn("Amount", lit(Decimal("3.14")).cast(DecimalType(10, 2)))