By cs95


2018-12-06 06:41:15 8 Comments

  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?
  • How do I add NaNs for missing rows after merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • Cross join with pandas?
  • How do I merge multiple DataFrames?
  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.

2 comments

@cs95 2018-12-06 06:41:15

This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • avoiding duplicate merge key column in output
  • Merging with index under different conditions
    • effectively using your named index
    • merge key as the index of one and column of another
  • Multiway merges on columns and indexes (unique and non-unique)
  • Notable alternatives to merge and join

What this post will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough Talk, just show me how to use merge!

Setup

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note
This, along with the forthcoming figures all follow this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.

pd.merge(left, right, on='key')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D).

In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how='right':

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how='outer'.

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

enter image description here

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion—

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.


Index-based *-JOIN (+ index-column merges)

Setup

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Support for index names

If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Merging on index of one, column(s) of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=... and left_index=True).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

In this special case, the index for left is named, so you can also use the index name with left_on, like this:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Since the column names are the same. This would not be a problem if they were differently named.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

For more information, see this canonical post on pd.concat by @piRSquared.


Generalizing: mergeing multiple DataFrames

Setup

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]   

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

A.merge(B, on='key').merge(C, on='key')

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce, and you can use it to achieve a INNER JOIN like so:

from functools import reduce
reduce(pd.merge, dfs)

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda.

For a FULL OUTER JOIN, you can curry pd.merge using functools.partial:

from functools import partial
outer_merge = partial(pd.merge, how='outer')
reduce(outer_merge, dfs)

  key    valueA    valueB  valueC
0   A  1.764052       NaN     NaN
1   B  0.400157  1.867558     NaN
2   C  0.978738       NaN     1.0
3   D  2.240893 -0.977278     1.0
4   E       NaN  0.950088     1.0
5   F       NaN -0.151357     NaN
6   J       NaN       NaN     1.0

As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:

partial(pd.merge, how='outer', left_index=True, right_on=...)

The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat, this is faster than a two-at-a-time multi-way merge.

pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

Multiway merge on unique indexes

If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat for better performance.

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]

pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

As always, omit join='inner' for a FULL OUTER JOIN.

Multiway merge on indexes with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, join is the best option, since it can handle non-unique indexes (join calls merge under the hood).

# For inner join. For left join, omit the "how" argument.
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

@S.V 2018-12-10 19:37:54

Could you please include some explanation on when pandas.DataFrame.align should be used? This seems to be a yet another way to merge.

@cs95 2018-12-11 01:54:31

@S.V I don't think align is appropriate for this post, but I can explain how it works and hopefully you'll understand why. align is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value of left.align(right) is (left_aligned, right_aligned). Hope that makes sense.

@philipxy 2018-12-14 02:36:06

There's no full inner join. Left join is left outer join, right join is right outer join & full join is full outer join. So: "The basics - types of joins - INNER & OUTER (LEFT, RIGHT, FULL)" You don't clearly explain what INNER JOIN returns. Or LEFT/RIGHT JOIN: INNER JOIN rows plus unmatched left/right table rows extended by NaNs. Or FULL JOIN: INNER JOIN rows plus unmatched left & right table rows extended by NaNs. (We must always know what INNER JOIN we want as part of an OUTER JOIN.)

@cs95 2018-12-14 03:17:16

@philipxy Oof, that's shameful. I've edited to fix the terminology and add venn diagrams (with source credited accordingly). Let me know if that's better. Thanks again for your comment.

@philipxy 2018-12-14 04:30:49

OH NO VENN DIAGRAMS I can't bear to look yet but: "Re Venn diagrams" Also this comment by me on this question & my others on this page. And my comments on this page. OK I peeked... limited to inner vs outer joins, good... but you do not give an initial legend for the diagrams explaining A, B & their elements. PS Still a "FULL INNER JOIN" & still INNER JOIN undefined.

@cs95 2018-12-14 04:32:47

@philipxy Okay, I've added a legend. But please keep in mind the goal of this post is not to explain what joins are, just how they are implemented with pandas. Readers are assumed to have a basic understanding of joins and how they work. Venn diagrams here are used to supplement understanding by mapping the code to SQL operations, and not so much as a pedagogic tool.

@philipxy 2018-12-15 01:00:49

Did you read my Venn diagram PS & comments? Or the link you gave? It does not explain Venn diagrams & abuses them as my links explain. From it: "This query will return all of the records in the left table (table A)" NO, the circles are not the input tables & the output rows in general are not input rows, the on condition & keys are irrelevant, etc etc etc etc. Your legend is similarly wrong. You are not reading what is actually written. You are just passing on confusions. PS A is left LEFT JOIN right. B is left RIGHT JOIN right. The intersection is left INNER JOIN right. You show it yourself.

@philipxy 2018-12-15 01:02:49

x SEMIJOIN y is rows in x with a match in y. x ANTIJOIN y is rows in x with no match in y. Venn diagrams are inappropriate for both. "All you have to do to see this is to identify what exactly are the elements of the sets represented by the circles. (Which muddled presentations never make clear.)"

@cs95 2018-12-15 01:41:13

@philipxy okay, I am not an SQL expert, and I fear adding this level of explanation only to make the Venn diagram be technically accurate will result in more confusion. Can we fix this problem by using A and B to represent the sets of join keys and go from there? Or is your suggestion to just remove them altogether?

@Datanovice 2019-01-11 20:58:40

You are an absolute legend, so you happen to have a post on the best way to concat multiple excel/csv files? I know how to do this and there are threads but I wonder what the best method or practice is.

@cs95 2019-01-11 21:13:23

@Datanovice not my post, but there is a link. Try stackoverflow.com/questions/49620538/…

@S.V 2019-03-01 21:05:38

Could you please add to your answer (if it is possible) an efficient (complexity O(N)) way to merge a list of pandas DataFrames, each having a sorted index, into a singe DataFrame with a sorted index?

@cs95 2019-03-01 21:39:51

@S.V if the indexes are unique, you can use pd.concat (already in this answer) with join='outer' argument.

@S.V 2019-03-01 22:58:54

@coldspeed The indices are unique. However, the proposed solution does not seem to work as expected: the resulting DataFrame does not have a sorted index (results are identical with and without join='outer'): import pandas as pd; t1 = pd.DataFrame(data={'i1':[0,0,1,1,2,2], 'i2':[0,1,0,1,0,1], 'x':[1.,2.,3.,4.,5.,6.]}).set_index(['i1','i2']).sort_index(‌​); t2 = pd.DataFrame(data={'i1':[0,0,1,1,2,2], 'i2':[2,3,2,3,2,3], 'x':[7.,8.,9.,10.,11.,12.]}).set_index(['i1','i2']).sort_ind‌​ex(); t = pd.concat([t1,t2], join='outer'); print(t.index.is_lexsorted())

@cs95 2019-03-01 23:31:30

@S.V that's odd. Perhaps open a new question and I (or one of my colleagues) could take a closer look?

@S.V 2019-03-04 14:08:37

@coldspeed Of course, Thank you! Such question is already posted here: stackoverflow.com/q/50380093/6461882

@Andy 2019-04-23 21:44:42

Fantastic article so far, however I feel it is lacking anything discussing returning a DataFrame object from the various merges.

@cs95 2019-04-23 22:37:24

@Andy Not sure I follow, all merge operations shown here return a DataFrame!

@Andy 2019-04-23 23:05:18

@cs95 Maybe just a note that merge operations do not change anything inplace, and that the usual inplace arg is not even present in the merge method. There is a strange boolean arg Copy, which I am not sure what it does, as the supporting documentation is vague on this. So the only way to get the merged DataFrame back as an object is to assign to a new variable, or re-assign one of the merging DataFrames

@Sandalaphon 2019-05-14 10:25:08

The values in setup dont match the resulting values of merge operations, perhaps run the cells again?...(Great great post by the way!)

@cs95 2019-05-14 17:06:54

@Sandalaphon As it turns out, the initial printed out values of left and right did not match the output of the setup code. Thanks for pointing that out! It has been fixed now. Everything else seems to be in order but let me know if I've missed anything.

@cs95 2019-06-11 04:17:08

@philipxy All venn diagrams have been replaced with more useful visual cues.

@philipxy 2019-06-11 13:41:28

Explain how to interpret those figures or leave them out. "Visual cue" here is a euphemism for "doodle" & "more useful" "visual cue" is an oxymoron.

@eliu 2019-04-25 23:43:50

A supplemental visual view of pd.concat([df0, df1], kwargs). I made it due to the axis=0 and axis=1 in this not being as intuitive as df.mean() or df.apply(foo)

on pd.concat([df0, df1])

@cs95 2019-05-20 17:27:56

This is a nice diagram. May I ask how you produced it?

@eliu 2019-05-21 18:19:31

google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

Related Questions

Sponsored Content

17 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

40 Answered Questions

[SOLVED] How to merge two dictionaries in a single expression?

22 Answered Questions

[SOLVED] Renaming columns in pandas

14 Answered Questions

[SOLVED] Delete column from pandas DataFrame

13 Answered Questions

[SOLVED] How to join (merge) data frames (inner, outer, left, right)

72 Answered Questions

[SOLVED] How to merge two arrays in JavaScript and de-duplicate items

9 Answered Questions

[SOLVED] Select rows from a DataFrame based on values in a column in pandas

22 Answered Questions

[SOLVED] How do you merge two Git repositories?

1 Answered Questions

[SOLVED] Aggregation in pandas

3 Answered Questions

[SOLVED] How to merge a specific commit in Git

  • 2009-05-19 05:27:34
  • netawater
  • 648921 View
  • 911 Score
  • 3 Answer
  • Tags:   git merge

Sponsored Content