Combining Pandas Data

Some of the most interesting analysis is done by combining different sets of data. In Pandas like a lot of what happens in Python, there are multiple ways to do certain specific things.

This post looks at combining data in Pandas. It first looks at the different data combination options available in Pandas and then at different use cases where these come in handy.

Join operations in Pandas can happen at the Pandas level or at the Object (Data Frame or Series) level.

Pandas LevelObject Level
pd.concat([s1, s2])s1.append(s2)
pd.merge(df1, df2)df1.join(df2, …)

First we’ll start by setting up the objects to be merged.

df1 = pd.DataFrame({'employee':['bob', 'lisa', 'jake', 'sue'], 
                    'group':['Accounting', 'Engineering', 'engineering', 'hr']})
df2 = pd.DataFrame({'employee':['bob', 'lisa', 'jake', 'sue'], 
                    'hire_date':[2004, 2008, 2012, 2014]})

s1 = pd.Series(data=['a', 'b', 'c'], index=[1,2,3])
s2 = pd.Series(index=['a', 'b', 'c'], data=[1,2,3])

Use Cases

Append data to an existing object

The simplest use case is when you want to append data to an existing object. For series, it is like appending data to a dictionary.

Data can be appended as additional rows or columns. For appending data the default is rows.

s1.append(s2)
df1.append(df2)

pd.concat([s1, s2])
pd.concat([df1, df2])

# Concatenate Columns
pd.concat([s1, s2], axis=1)
pd.concat([df1, df2], axis=1)

There is some overlap between merging columns in the merge function in Pandas.

Merge objects together

This is similar to SQL join operations, where you merge two objects based on common items in either.

The simplest activity is to merge two Data Frames together.

pd.merge(df1, df2)

This merges the Data Frames with a common column in both. This may not always be the case so there are additional parameters available in the function.

left_on, right_on : Column names in both objects to merge on.
left_index, right_index : set to True if you are merging on the index.
how : if you want to set SQL join operations like inner, left, right.

You have a good set of commands to do basic SQL operations and this makes dealing with multiple data objects easy.

Pandas also includes convenience functions directly on the objects for this type of operation.

df1.join(df2, lsuffix='_l', rsuffix='_r')

This operation is different to the merge function because you have to include the left and right suffixes for columns with the same name that exist in both data sets.

employee_l group employee_r hire_date
0 bob Accounting bob 2004
1 lisa Engineering lisa 2008
2 jake engineering jake 2012
3 sue hr sue 2014

Overall, combining datasets is a very useful tool that any data analyst using Pandas should have in their tool belt. I also find it useful to pick one way of doing a task, master that and stick with it. Learning different ways to do the same task sometimes makes the learning harder.

Summary

In this post we learned to append data to objects either as a row or as a column. And learned to do SQLesq join functions on objects to create merged data.

We also discussed the importance of learning one way to do a particular task and sticking with it.

Leave a comment

Your email address will not be published. Required fields are marked *