~ 5 min read

Aggregating Data with Python Pandas

By: Adam Richardson

Intro

In the context of data analysis, “aggregating data” refers to summarizing data by grouping it and applying statistical functions to each group. This is useful for understanding patterns and trends, and for creating more compact, meaningful summaries of the data.

Pandas provides a variety of functions for aggregating data. In Pandas, you can use the groupby method to group data by one or more columns, and then apply statistical functions like mean, sum, min, max, or count to the groups.
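
Here’s a quick sketch of the idea on a small made-up DataFrame (the store and sales names are just for illustration):

import pandas as pd

# A tiny made-up dataset: individual sales for two stores
data = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [100, 150, 80, 120],
})

# Group the rows by store, then summarize each group
print(data.groupby("store")["sales"].sum())   # total sales per store
print(data.groupby("store")["sales"].mean())  # average sale per store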

Getting our data

In the previous post, we wrote our clean dataset to a CSV. I’ve started a new notebook now; let’s bring in pandas and load the clean.csv file.

import pandas as pd

df = pd.read_csv("clean.csv")
df.info()

One of the problems with .csv files is that they do not retain type information. In an ideal scenario, we would write this to a database table, or store the data in a file format that supports types. But I don’t want to worry about that at this stage; the principle is that we clean our data, store it, and then use it in analysis afterwards.
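
For reference, a typed format such as Parquet keeps the dtypes with the data, so a round trip would skip the re-typing step below. A minimal sketch, assuming the pyarrow (or fastparquet) package is installed:

# Parquet stores column types alongside the values
df.to_parquet("clean.parquet")
df = pd.read_parquet("clean.parquet")  # dtypes survive the round trip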

It does mean we need to assign the correct types again.

df = df.astype({
    'date': 'datetime64[ns]',
    'estKey': 'int64',
    'capacity': 'int64',
    'occupancy': 'float64',
    'roomsSold': 'int64',
    'avgRate': 'float64',
    'salesValue': 'float64',
    'revPAR': 'float64',
    'dayOfWeek': 'object',
    'month': 'object',
    'weekNumber': 'int64',
    'year': 'int64',
    'monthCode': 'int64',
    'weekCode': 'int64',
})
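
It’s worth a quick check that every column came back with the type we intended:

df.dtypes  # should now show datetime64[ns], int64, float64 and object columns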

Granularity

Granularity in data analysis describes how closely you look at something: up close, examining lots of really small pieces, or from far away, seeing just the big picture.

For example, let’s say you are trying to figure out how many cookies you sell at your cookie stand. You could look at the number of cookies you sell each day, or you could look at the number of cookies you sell each hour. The first way is like looking at the big picture, and the second way is like looking at lots of little pieces.

Looking at the little pieces can give you more information and help you understand things better, but it can also be harder and take more time. Looking at the big picture can be easier and faster, but you might not see as much detail. It’s up to you to decide which one is the best way to look at the data, depending on what you want to know and how much time and resources you have.
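
To make the cookie-stand analogy concrete, here’s a sketch with made-up timestamps, aggregating the same sales at two different granularities:

import pandas as pd

# Hypothetical cookie stand: one row per sale, with a timestamp
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 09:30", "2023-01-01 10:15",
        "2023-01-01 10:45", "2023-01-02 09:05",
    ]),
    "cookies": [3, 5, 2, 4],
})

# Big picture: cookies sold per day
daily = sales.groupby(sales["timestamp"].dt.date)["cookies"].sum()

# Little pieces: cookies sold per day and hour
hourly = sales.groupby([sales["timestamp"].dt.date,
                        sales["timestamp"].dt.hour])["cookies"].sum()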

Possible Aggregations with Pandas

Before we look at the code, let’s review the most commonly used aggregation functions in Python Pandas.

mean: Calculates the mean, or average, of the data. The mean is calculated by summing all of the values in the data and dividing by the number of values.

median: Calculates the median of the data. The median is the value that falls in the middle of the data when it is sorted. If the data has an odd number of values, the median is the middle value. If the data has an even number of values, the median is the average of the two middle values.

sum: Calculates the sum of the data. The sum is calculated by adding up all of the values in the data.

count: Counts the number of non-NA/null values in the data. This function will ignore any NA/null values and only count the valid values.

min: Calculates the minimum value in the data. The minimum value is the smallest value in the data.

max: Calculates the maximum value in the data. The maximum value is the largest value in the data.

var: Calculates the variance of the data. The variance is a measure of how spread out the data is, and is calculated by taking the average of the squared differences between the values and the mean of the data.

std: Calculates the standard deviation of the data. The standard deviation is a measure of how spread out the data is, and is calculated by taking the square root of the variance.

skew: Calculates the skew of the data. The skew is a measure of the asymmetry of the data, and is calculated by taking the third moment of the data about the mean.

kurt: Calculates the kurtosis of the data. The kurtosis is a measure of the peakedness of the data, and is calculated by taking the fourth moment of the data about the mean.

first: Selects the first value in the data. This function will return the first value in the data, regardless of its position in the original dataframe.

last: Selects the last value in the data. This function will return the last value in the data, regardless of its position in the original dataframe.
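
Any of these can be passed to .agg, on their own or in combination. For example, a few of them at once on our salesValue column:

df.groupby('estKey')['salesValue'].agg(['mean', 'sum', 'min', 'max', 'count'])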

Our first aggregation

We’re going to look at the big picture first: total sales per establishment. So we only care about the estKey column, and we are going to sum our salesValue column.

Group by and summing

estRev_agg = df.groupby('estKey')['salesValue'].agg(['sum'])
estRev_agg.head(5)

Output

This is a large dataset, so even at this high level we have ~2000 rows of data, because we have 2000 establishments.

estKey          sum
0       18437972.19
1       12820572.90
2       13579104.62
3        1769453.29
4        1294225.51
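
A side note: if you wanted the biggest earners rather than the first five keys, you could sort the aggregate before taking the head:

# Establishments with the highest total sales first
estRev_agg.sort_values('sum', ascending=False).head(5)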

More detailed granularity

Let’s say we want to look at revenue by month, by establishment. This makes the result significantly larger: instead of one row per establishment, we get one row per month for every establishment. With 10 establishments and 10 months, that would be 100 rows instead of the previous 10.

This is the code that shows us the sum of salesValue by month and year for each establishment. The only change is that we pass a list of columns to groupby.

estRev_agg = df.groupby(['estKey', 'month', 'year'])['salesValue'].agg(['sum'])
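
Note that grouping by several columns gives the result a MultiIndex. If you’d rather keep estKey, month and year as ordinary columns, as_index=False does that (here with .sum() called directly, which is equivalent to .agg(['sum'])):

# Same aggregation, but with the grouping keys as regular columns
estRev_agg = df.groupby(['estKey', 'month', 'year'], as_index=False)['salesValue'].sum()
estRev_agg.head(5)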

Conclusion

That’s all we’re going to talk about in this post for aggregating data. We’ll be talking lots more about it, though, as it’s a fundamental concept in the Pandas library.
