
Working with Python Pandas Data Types

By: Adam Richardson

Intro

We’re going to look at data types and almost everything you need to know about them in Python Pandas. This is part of a series and the previous post was about renaming columns.

Current dataframe

Our dataframe looks like this now.

| date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |
| 2022-12-27 | 1 | 203 | 0.35 | 71 | 82.31 | 5844.01 |
| 2022-12-27 | 2 | 207 | 0.51 | 106 | 227.83 | 24149.98 |
| 2022-12-27 | 3 | 27 | 0.37 | 10 | 126.46 | 1264.60 |
| 2022-12-27 | 4 | 20 | 0.87 | 17 | 191.57 | 3256.69 |

What are data types?

Simple Explanation

Data types are like categories for the information in a table. Pandas can hold several different types of data, including the following (a short sketch after this list shows each kind in a dataframe):

  • Numbers: These can be whole numbers (like 3 or 17) or numbers with decimals (like 2.5 or 10.9). These can be used to show amounts or measurements.
  • Labels: Sometimes we want to group or filter information by categories. These categories can be written as words (like “red” or “big”) or numbers (like 1 or 2).
  • Words: Sometimes we want to store information as words or sentences. We can do this with the text data type.
  • Dates and times: Pandas has special ways to store and work with dates and times. This can be helpful when we want to see how things change over time.
  • True or false: Sometimes we just need to know if something is true or false. We can use the boolean data type for this.
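
To make this concrete, here is a minimal sketch using made-up values (not our sales data) that builds a tiny dataframe containing each kind of data and prints the type Pandas assigns to each column.

import pandas as pd

# Made-up example values, for illustration only.
example = pd.DataFrame({
    'wholeNumber': [3, 17, 100],
    'decimalNumber': [2.5, 10.9, 3.14159],
    'label': pd.Categorical(['red', 'big', 'red']),
    'words': ['hello', 'world', 'pandas'],
    'when': pd.to_datetime(['2022-12-27', '2022-12-28', '2022-12-29']),
    'flag': [True, False, True],
})

print(example.dtypes)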

Pandas Data Types

It’s important to recognise that across different ways of storing data, these data types are referred to by different names and can behave slightly differently. Here are the Pandas data types we will be using, followed by a small sketch showing each one being created explicitly.

Type Table

| Data Type | Description |
| --- | --- |
| int64 | Integer data type. Can hold whole numbers such as 3, 17, or 100. |
| float64 | Floating point data type. Can hold numbers with decimal points such as 2.5, 10.9, or 3.14159. |
| object | Object data type. Can hold a variety of data types, including strings, lists, and dictionaries. |
| bool | Boolean data type. Can hold the values True or False. |
| datetime64[ns] | Datetime data type. Can hold date and time information, such as “2022-12-28” or “14:30:00”. |
| category | Categorical data type. Can hold a fixed set of categorical values, such as “red”, “yellow”, or “green”. |
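
As an illustrative sketch (none of these values come from our data), each of these dtypes can also be set explicitly when creating a Series:

import pandas as pd

# Illustrative only: one Series per Pandas dtype from the table above.
ints = pd.Series([3, 17, 100], dtype='int64')
floats = pd.Series([2.5, 10.9, 3.14159], dtype='float64')
objects = pd.Series(['a string', ['a', 'list'], {'key': 'value'}], dtype='object')
flags = pd.Series([True, False], dtype='bool')
dates = pd.to_datetime(pd.Series(['2022-12-28', '2022-12-29']))  # datetime64[ns]
colours = pd.Series(['red', 'yellow', 'green'], dtype='category')

for s in (ints, floats, objects, flags, dates, colours):
    print(s.dtype)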

Type Use Cases

To help with understanding, here are some use cases for each of the data types, followed by a short sketch of the category use case.

| Data Type | Use Case |
| --- | --- |
| int64 | An int64 data type can be used to represent quantitative data that does not contain decimal points. For example, you might use an int64 column to store the number of items in an order, the number of employees in a company, or the number of votes in an election. |
| float64 | A float64 data type can be used to represent quantitative data that does contain decimal points. This data type is useful for storing data with a high level of precision, such as measurements, currency values, or scientific data. |
| object | An object data type can be used to store a variety of data types, including strings, lists, and dictionaries. This data type is often used when the specific data type of the column is not known in advance, or when the column contains a mix of data types. |
| bool | A bool data type can be used to store true/false values. This data type is often used to store binary data, such as whether a customer has opted in to a newsletter or whether a product is in stock. |
| datetime64[ns] | A datetime64[ns] data type can be used to store date and time information. This data type is useful for working with time-series data, such as stock prices, weather data, or event logs. |
| category | A category data type can be used to store a fixed set of categorical values. This data type is useful for storing data that can be grouped or filtered, such as product categories or geographical regions. |
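
For example, here is a small sketch of the category use case with hypothetical product data (the column names and values are invented for illustration):

import pandas as pd

# Hypothetical order data, for illustration only.
orders = pd.DataFrame({
    'productCategory': ['electronics', 'clothing', 'electronics', 'toys'],
    'sales': [120.0, 35.5, 80.0, 15.0],
})

# Store the repeated labels as a category so they can be grouped efficiently.
orders['productCategory'] = orders['productCategory'].astype('category')

print(orders['productCategory'].dtype)  # category
print(orders.groupby('productCategory', observed=True)['sales'].sum())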

Checking our types

Now that we understand the types we can use in Pandas, let’s check the types in our data. Types are inferred when using read_csv, which means a best guess is made. You should always check the data types when working with data.

You can check types on a dataframe by using the .info() method.

Let’s add that to our code

import pandas as pd

raw = pd.read_csv("sales.csv")

# Rename the columns we covered in the previous post.
raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)

# Convert the snake_case column names to camelCase.
raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)

raw.info()

Here is the important information from the output

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| date | 1451274 | object |
| estKey | 1451274 | int64 |
| capacity | 1451274 | int64 |
| occupancy | 1451274 | float64 |
| roomsSold | 1451274 | int64 |
| avgRate | 1451274 | float64 |
| salesValue | 1451274 | float64 |
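
As an aside, if you only need the types and not the null counts, the .dtypes attribute gives a more compact view of the same information:

# A compact alternative to .info(): just the dtype of each column.
print(raw.dtypes)

# Or for a single column:
print(raw['date'].dtype)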

Changing a single type

So, the date column as an object does not look correct; however, the rest of the data types do.

Changing a string to a date in Pandas

Let’s change the object to a date.

import pandas as pd

raw = pd.read_csv("sales.csv")

raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)

raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)

raw['date'] = pd.to_datetime(raw['date'])

raw.head()

The date values have now changed from 27/12/2022 to 2022-12-27.

| date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |

You can also check the type by adding raw.info(), which will now show datetime64[ns] for the date column, which is what we expect.

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| date | 1451274 | datetime64[ns] |

Explaining the code

We use the code raw['date'] = pd.to_datetime(raw['date'])

  • raw['date'] assigns a new value to the “date” column of the raw dataframe.
  • The new value is the result of calling pd.to_datetime() on the existing “date” column.
  • pd.to_datetime() converts the values in the “date” column to a datetime data type (a short sketch of its optional arguments follows this list).
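
If the conversion ever needs more control, pd.to_datetime also accepts a few useful arguments. The sketch below assumes the CSV stores dates day-first (like the 27/12/2022 value shown earlier); treat it as an illustration rather than something our current data requires.

# Being explicit about a day-first format removes any day/month ambiguity.
# This assumes the raw values look like "27/12/2022".
raw['date'] = pd.to_datetime(raw['date'], format='%d/%m/%Y')

# Alternatively, errors='coerce' turns unparseable values into NaT instead of
# raising, so you can inspect them afterwards:
# raw['date'] = pd.to_datetime(raw['date'], errors='coerce')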

Setting multiple types

It’s a good idea to always explicitly set the data types you want. If the source data changes or a type is inferred differently, this offers a failsafe. If data is in the incorrect format and cannot be converted, this will give you an error, which is actually helpful: you wouldn’t want to cause problems later in the process, so it’s good to find out early.

Code for setting the types

import pandas as pd

raw = pd.read_csv("sales.csv")

raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)

raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)

raw['date'] = pd.to_datetime(raw['date'])

raw = raw.astype({
    'date': 'datetime64[ns]',
    'estKey': 'int64',
    'capacity': 'int64',
    'occupancy': 'float64',
    'roomsSold': 'int64',
    'avgRate': 'float64',
    'salesValue': 'float64'
})

raw.info()

We don’t expect the output to change or to have any errors at this point; this is just a best practice, as described above, to prevent any problems occurring in the future.
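
You can also push some of this work into the load step itself: read_csv accepts parse_dates and dtype arguments. The sketch below is hedged — only est_ref and avg_rate_paid are raw column names we know from the rename step, and the other source column names are assumptions.

import pandas as pd

# A hedged alternative: parse dates and set some dtypes while reading the file.
# The raw column names here are partly assumptions about sales.csv.
raw = pd.read_csv(
    "sales.csv",
    parse_dates=["date"],              # assumes the source column is called "date"
    dayfirst=True,                     # source dates look like 27/12/2022
    dtype={"est_ref": "int64", "capacity": "int64"},
)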

Type checklist

This is best practice. If you’re just doing a quick, one-off piece of ad hoc analysis that won’t be repeated, it’s not necessarily worth investing the time, but it’s a good habit to get into.

  • All the types in our data are correct. If you’re not sure on a type, please reference the tables above.
  • We have explicitly set types for all of the columns in our dataframe.