
Working with Python Pandas Data Types

By: Adam Richardson

Intro

We’re going to look at data types and almost everything you need to know about them in Python Pandas. This is part of a series and the previous post was about renaming columns.

Current dataframe

Our dataframe looks like this now.

| date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |
| 2022-12-27 | 1 | 203 | 0.35 | 71 | 82.31 | 5844.01 |
| 2022-12-27 | 2 | 207 | 0.51 | 106 | 227.83 | 24149.98 |
| 2022-12-27 | 3 | 27 | 0.37 | 10 | 126.46 | 1264.60 |
| 2022-12-27 | 4 | 20 | 0.87 | 17 | 191.57 | 3256.69 |

What are data types?

Simple Explanation

Data types are like categories for the information in a table. Pandas can hold several different types of data, including the following (a short sketch after this list shows each kind in a dataframe):

  • Numbers: These can be whole numbers (like 3 or 17) or numbers with decimals (like 2.5 or 10.9). These can be used to show amounts or measurements.
  • Labels: Sometimes we want to group or filter information by categories. These categories can be written as words (like “red” or “big”) or numbers (like 1 or 2).
  • Words: Sometimes we want to store information as words or sentences. We can do this with the text data type.
  • Dates and times: Pandas has special ways to store and work with dates and times. This can be helpful when we want to see how things change over time.
  • True or false: Sometimes we just need to know if something is true or false. We can use the boolean data type for this.
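
To make this concrete, here is a minimal sketch using made-up values (not our sales data) that builds a tiny dataframe containing each kind of data and prints the type Pandas assigns to each column.

import pandas as pd

# Made-up example values, for illustration only.
example = pd.DataFrame({
    'wholeNumber': [3, 17, 100],
    'decimalNumber': [2.5, 10.9, 3.14159],
    'label': pd.Categorical(['red', 'big', 'red']),
    'words': ['hello', 'world', 'pandas'],
    'when': pd.to_datetime(['2022-12-27', '2022-12-28', '2022-12-29']),
    'flag': [True, False, True],
})

print(example.dtypes)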

Pandas Data Types

It’s important to recognise that across different ways of storing data, these data types are referred to by different names and can behave slightly differently. Here are the Pandas data types we will be using, followed by a small sketch showing each one being created explicitly.

Type Table

| Data Type | Description |
| --- | --- |
| int64 | Integer data type. Can hold whole numbers such as 3, 17, or 100. |
| float64 | Floating point data type. Can hold numbers with decimal points such as 2.5, 10.9, or 3.14159. |
| object | Object data type. Can hold a variety of data types, including strings, lists, and dictionaries. |
| bool | Boolean data type. Can hold the values True or False. |
| datetime64[ns] | Datetime data type. Can hold date and time information, such as “2022-12-28” or “14:30:00”. |
| category | Categorical data type. Can hold a fixed set of categorical values, such as “red”, “yellow”, or “green”. |
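
As an illustrative sketch (none of these values come from our data), each of these dtypes can also be set explicitly when creating a Series:

import pandas as pd

# Illustrative only: one Series per Pandas dtype from the table above.
ints = pd.Series([3, 17, 100], dtype='int64')
floats = pd.Series([2.5, 10.9, 3.14159], dtype='float64')
objects = pd.Series(['a string', ['a', 'list'], {'key': 'value'}], dtype='object')
flags = pd.Series([True, False], dtype='bool')
dates = pd.to_datetime(pd.Series(['2022-12-28', '2022-12-29']))  # datetime64[ns]
colours = pd.Series(['red', 'yellow', 'green'], dtype='category')

for s in (ints, floats, objects, flags, dates, colours):
    print(s.dtype)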

Type Use Cases

To help with understanding, here are some use cases for each of the data types, followed by a short sketch of the category use case.

| Data Type | Use Case |
| --- | --- |
| int64 | An int64 data type can be used to represent quantitative data that does not contain decimal points. For example, you might use an int64 column to store the number of items in an order, the number of employees in a company, or the number of votes in an election. |
| float64 | A float64 data type can be used to represent quantitative data that does contain decimal points. This data type is useful for storing data with a high level of precision, such as measurements, currency values, or scientific data. |
| object | An object data type can be used to store a variety of data types, including strings, lists, and dictionaries. This data type is often used when the specific data type of the column is not known in advance, or when the column contains a mix of data types. |
| bool | A bool data type can be used to store true/false values. This data type is often used to store binary data, such as whether a customer has opted in to a newsletter or whether a product is in stock. |
| datetime64[ns] | A datetime64[ns] data type can be used to store date and time information. This data type is useful for working with time-series data, such as stock prices, weather data, or event logs. |
| category | A category data type can be used to store a fixed set of categorical values. This data type is useful for storing data that can be grouped or filtered, such as product categories or geographical regions. |
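
For example, here is a small sketch of the category use case with hypothetical product data (the column names and values are invented for illustration):

import pandas as pd

# Hypothetical order data, for illustration only.
orders = pd.DataFrame({
    'productCategory': ['electronics', 'clothing', 'electronics', 'toys'],
    'sales': [120.0, 35.5, 80.0, 15.0],
})

# Store the repeated labels as a category so they can be grouped efficiently.
orders['productCategory'] = orders['productCategory'].astype('category')

print(orders['productCategory'].dtype)  # category
print(orders.groupby('productCategory', observed=True)['sales'].sum())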

Checking our types

Now that we understand the types we can use in Pandas, let’s check the types in our data. Types are inferred when using read_csv, which means a best guess is made. You should always check the data types when working with data.

You can check types on a dataframe by using the .info() method.

Let’s add that to our code

import pandas as pd

raw = pd.read_csv("sales.csv")

# Rename the columns we covered in the previous post.
raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)

# Convert the snake_case column names to camelCase.
raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)

raw.info()

Here is the important information from the output

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| date | 1451274 | object |
| estKey | 1451274 | int64 |
| capacity | 1451274 | int64 |
| occupancy | 1451274 | float64 |
| roomsSold | 1451274 | int64 |
| avgRate | 1451274 | float64 |
| salesValue | 1451274 | float64 |
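
As an aside, if you only need the types and not the null counts, the .dtypes attribute gives a more compact view of the same information:

# A compact alternative to .info(): just the dtype of each column.
print(raw.dtypes)

# Or for a single column:
print(raw['date'].dtype)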

Changing a single type

So, the date column as an object does not look correct; however, the rest of the data types do.

Changing a string to a date in Pandas

Let’s change the object to a date.

import pandas as pd

raw = pd.read_csv("sales.csv")

raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)

raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)

raw['date'] = pd.to_datetime(raw['date'])

raw.head()

The date values have now changed from 27/12/2022 to 2022-12-27.

| date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |

You can also check the type by adding raw.info(), which will now show datetime64[ns] for the date column, which is what we expect.

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| date | 1451274 | datetime64[ns] |

Explaining the code

We use the code raw['date'] = pd.to_datetime(raw['date'])

  • raw['date'] assigns a new value to the “date” column of the raw dataframe.
  • The new value is the result of calling pd.to_datetime() on the existing “date” column.
  • pd.to_datetime() converts the values in the “date” column to a datetime data type (a short sketch of its optional arguments follows this list).
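
If the conversion ever needs more control, pd.to_datetime also accepts a few useful arguments. The sketch below assumes the CSV stores dates day-first (like the 27/12/2022 value shown earlier); treat it as an illustration rather than something our current data requires.

# Being explicit about a day-first format removes any day/month ambiguity.
# This assumes the raw values look like "27/12/2022".
raw['date'] = pd.to_datetime(raw['date'], format='%d/%m/%Y')

# Alternatively, errors='coerce' turns unparseable values into NaT instead of
# raising, so you can inspect them afterwards:
# raw['date'] = pd.to_datetime(raw['date'], errors='coerce')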

Setting multiple types

It’s a good idea to always explicitly set the data types you want. If the source data changes or a type is inferred differently, this offers a failsafe. If data is in the incorrect format and cannot be converted, this will give you an error, which is actually helpful: you wouldn’t want to cause problems later in the process, so it’s good to find out early.

Code for setting the types

import pandas as pd

raw = pd.read_csv("sales.csv")

raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)

raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)

raw['date'] = pd.to_datetime(raw['date'])

raw = raw.astype({
    'date': 'datetime64[ns]',
    'estKey': 'int64',
    'capacity': 'int64',
    'occupancy': 'float64',
    'roomsSold': 'int64',
    'avgRate': 'float64',
    'salesValue': 'float64'
})

raw.info()

We don’t expect the output to change or to have any errors at this point; this is just a best practice, as described above, to prevent any problems occurring in the future.
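
You can also push some of this work into the load step itself: read_csv accepts parse_dates and dtype arguments. The sketch below is hedged — only est_ref and avg_rate_paid are raw column names we know from the rename step, and the other source column names are assumptions.

import pandas as pd

# A hedged alternative: parse dates and set some dtypes while reading the file.
# The raw column names here are partly assumptions about sales.csv.
raw = pd.read_csv(
    "sales.csv",
    parse_dates=["date"],              # assumes the source column is called "date"
    dayfirst=True,                     # source dates look like 27/12/2022
    dtype={"est_ref": "int64", "capacity": "int64"},
)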

Type checklist

This is best practice. If you’re just doing a quick, one-off piece of ad hoc analysis that won’t be repeated, it’s not necessarily worth investing the time, but it’s a good habit to get into.

  • All the types in our data are correct. If you’re not sure on a type, please reference the tables above.
  • We have explicitly set types for all of the columns in our dataframe.