Published on

Grouping in Pandas

  • avatar
    Ross Kippenbrock
This post was published on the yhat blog

An Introduction to Grouping in Pandas

Grouping data is an integral part of many data analysis projects. The functionality for grouping in pandas is vast, but can be tough to grasp initially. Have no fear...we will get through a short introduction together using some data from NYC's beloved bike share program, Citi Bike.

Pandas hugging

Why would I want to group data?

Most of the time, you have a set of data that lends itself to being categorized or grouped. As a general example, let's say we have data on a wide variety of people. We may perform an analysis where we compare groups in the data based on age, gender, birth month, shoe size, or birth city; the options are as numerous as the data points!

The pandas groupby functionality draws from the Split-Apply-Combine method as described by Hadley Wickham from the land of R. It's a great approach to solving data analysis problems, and his paper on the subject is worth a read (it's linked in the resources section).

To summarize, he states that a common methodology for analyzing data comes from splitting the data into categories or groups based on some criteria, applying some aggregation function to each group (sum, mean, count), then combining the results for analysis, visualization or other means of better understanding.

Here's a graphic I came across illustrating the process:

Split, Apply, Combine

From Data Analysis in Python by Wes McKinney

Sounds handy, but how do I do it in pandas?

Codes and Stuff

Just so we're on the same page, Iā€™m running pandas 0.18.1.

I was looking around for an intriguing dataset and came across this data from Citi Bike, which is the NYC bike share program. It's pretty medium data at ~250MB CSV for one month's worth of data, and there was potential for some compelling findings with data grouping. Let's start down the rabbit hole...

df = pd.read_csv('data.csv')
tripdurationstarttimestoptimestart station idstart station namestart station latitudestart station longitudeend station idend station nameend station latitudeend station longitudebikeidusertypebirth yeargender
5385/1/2016 00:00:035/1/2016 00:09:025361 Ave & E 30 St40.741444-73.975361497E 17 St & Broadway40.737050-73.99009323097Subscriber1986.02
2245/1/2016 00:00:045/1/2016 00:03:49361Allen St & Hester St40.716059-73.991908340Madison St & Clinton St40.712690-73.98776323631Subscriber1977.01
3285/1/2016 00:00:145/1/2016 00:05:43301E 2 St & Avenue B40.722174-73.983688311Norfolk St & Broome St40.717227-73.98802123049Subscriber1980.01
11965/1/2016 00:00:205/1/2016 00:20:1731411 Ave & E 68 St40.765005-73.958185237E 11 St & 2 Ave40.730473-73.98672419019CustomerNaN0
7535/1/2016 00:00:265/1/2016 00:13:00492W 33 St & 7 Ave40.750200-73.990931228E 48 St & 3 Ave40.754601-73.97187916437Subscriber1981.01

It looks like there is a good opportunity to break the data down into groups to look for some interesting trends. Some ideas are:

  • Group on the gender column and see if there are more male or female riders.
  • Do specific stations get used more than others? We can group on the station start or finish id.
  • Group the data on the day of the week, to see if there is more utilization for a particular day, on average.

How about a few examples?

If we want to group by just the gender, then we pass this key (column name) to the groupby function as the sole argument. This example is the simplest form of grouping, so please check out the docs to get all the options!

groupedGender = df.groupby('gender')
<pandas.core.groupby.DataFrameGroupBy object at 0x1133854d0>

The output shows that groupby returns a pandas DataFrameGroupBy object. Pandas has just made some internal calculations about the new gender groups and is ready to apply some operation on each of these groups.

We can take a look at the available methods with the docstring/tab complete functionality of Rodeo!

rodeo pandas docstring

Counts of groups

Getting back to the data, if we use the count method, we can see the total number of entries for each gender group. For reference, here's what the website says for the gender codes - "Gender (Zero=unknown; 1=male; 2=female)"

0    178710
1    783723
2    249847

# look at the size as a percentage of the whole (using the trip)
total = df.gender.count()
groupedGender.size() / total * 100
0    14.741644
1    64.648679
2    20.609678

It looks like males make up the majority of Citi Bike riders (~65%). I was pretty surprised to see that male riders outnumbered female riders 3 to 1. I wonder if that's true of commuters in general, or if there's some other factor, like females tending to own their own bikes. A question for another post...

Mean and Std Dev of groups

We can use a single column from the DataFrameGroupBy object and apply some aggregation function on it - how about the mean and standard deviation of the trip durations for all three groups?

groupedGender['tripduration'].mean() / 60.
0    35.923658
1    13.778720
2    16.198230

# Don't have to use the bracket notation
groupedGender.tripduration.std() / 60.
0    193.417686
1     94.884313
2     91.675397

Although males make up the majority of Citi Bike riders, there's not much of a difference in their trip durations. Interestingly, gender unknown riders take 2x as long of rides on average. These riders are likely single-use customers (when you purchase a one time pass at a Citi Bike kiosk you are not asked for your gender).

More summary statistics

So there are some summary statistics for these groups (as an aside, you can use the describe function to get these statistics and more in one call). That's a whole lot of spread around the mean, which probably means there are some outliers in the data (maybe people that kept the bike for days). Just a brief look at this even though it's outside the scope, because I'm sure you were all interested šŸ˜Š

df[df.tripduration > 10000].tripduration.count()

Our suspicions are confirmed - there are many bike rentals outside 2:45 even though the "max" is supposed to be 30 minutes (or 45 if you're a Citi Bike member).

Brief aside / public service announcement, it costs up to $1200 to replace a Citi Bike. Don't be an outlier!

A quick (gg)plot

Okay, back on track...the plot, just because we can (using ggpy of course):

df_short = df[df.tripduration < 10000]
df_short.tripduration = df_short.tripduration / 60.
ggplot(df_short, aes(x='tripduration')) + geom_histogram(bins=30) + xlab("Trip Duration (mins)") + ylab("Count")

Grouping by station name

One last example is looking at which are the five favorite start and end stations. We'll group the data based on the start and end station names, apply the count function, and sort the values is descending order. Here's the code for that:

groupedStart = df.groupby('start station name')
groupedStart['start station name'].count().sort_values(ascending=False)[:5]

start station name
Pershing Square North    12775
West St & Chambers St    10128
Lafayette St & E 8 St     9246
W 21 St & 6 Ave           9220
E 17 St & Broadway        9036

groupedEnd = df.groupby('end station name')
groupedEnd['end station name'].count().sort_values(ascending=False)[:5]

end station name
Pershing Square North    12511
West St & Chambers St    10189
Lafayette St & E 8 St     9459
E 17 St & Broadway        9273
W 21 St & 6 Ave           9268

Wrapping Up

Hopefully, the above examples helped introduce some basic uses for the grouping process in pandas to help enhance your analysis and whet your appetite for more! What ideas do you have for further analysis on this dataset? Can you conquer the last idea of looking at the days of the week? Please, take a look at the resources linked below for further investigation!

More Resources

This post was originally published on the yhat blog.