. Advertisement .
..3..
. Advertisement .
..4..
In addition to joining and concatenating, you can make Pandas merge multiple DataFrames. Read on to find out how.
Pandas Merge Multiple DataFrames
You will need to use the method merge() to merge DataFrame objects together through a database-style join.
Syntax:
DataFrame.merge(right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
Parameters:
- right: the DataFrame you want to merge with.
- how: this parameter indicates the type of merge you want to carry out. Pandas uses the ‘inner’ option by default, which is similar to an inner join in SQL. It retains the order of keys and uses intersections from the two DataFrame against those keys.
The method also supports ‘outer’, ‘left’, ‘right’, and ‘cross’ types of merging.
For instance, ‘left’ only takes into account the keys of the left DataFrama and preserves their order. It works in a similar manner to SQL left outer joins. Meanwhile, ‘right’ only uses keys from the right DataFrame, just like right outer joins in SQL.
An ‘outer’ merge in Pandas uses keys from both DataFrames and sorts them lexicographically. It mimics full outer joins in SQL.
Finally, the ‘cross’ option creates the Cartesian product of two DataFrames, keeping the order of the keys in the left DataFrame.
- on: this is the indexes or labels of the data subset you want to join from the two DataFrames. They must be valid on both objects. Its default is None, meaning the method uses the intersection of its columns.
- left_on and right_on: columns or indexes that should be joined on the left or right DataFrame. You can provide this parameter in the form of a list or array.
- left_index and right_index: boolean parameters with the default value False. They control whether the method should use the indexes in the left or right DataFrame for the join keys.
We will create two sample DataFrames in Pandas, which you can learn more about here. They share the same indexes and the first columns.
import pandas as pd
headers1 = ['Site', 'Ranking']
data1 = [
['ITTutoria', 1],
['Stack Overflow', 2],
['Quora', 3],
['Reddit', 4]
]
headers2 = ['Site', 'Type', 'Top Language']
data2 = [
['ITTutoria', 'Tutorials', 'Python'],
['Stack Overflow', 'Q&A', 'JavaScript'],
['Quora', 'Q&A', None],
['Reddit', 'Forum', None]
]
df1 = pd.DataFrame(data1, columns = headers1)
df2 = pd.DataFrame(data2, columns = headers2)
......... ADVERTISEMENT .........
..8..
......... ADVERTISEMENT .........
..8..
If you want to merge these two DataFrames, you can just invoke the method merge()
without any extra options:
df1.merge(df2)
......... ADVERTISEMENT .........
..8..
The output is a DataFrame that uses the order of the first DataFrame’s keys. It has all the data from both DataFrames while not repeating the first column.
Here is an example of the ‘outer’ option.
data1 = {
"Name": ["Sally", "Mary", "John"],
"Age": [50, 40, 30]
}
data2 = {
"Name": ["John", "Peter", "Micky"],
"Age": [77, 44, 22]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
......... ADVERTISEMENT .........
..8..
......... ADVERTISEMENT .........
..8..
We use the method merge() to create a union of data from both DataFrames.
df1.merge(df2, how = 'outer')
......... ADVERTISEMENT .........
..8..
Meanwhile, this is the result when you use the ‘left’ or ‘right’ merging option:
df1.merge(df2, how = 'left')
df1.merge(df2, how = 'right')
......... ADVERTISEMENT .........
..8..
......... ADVERTISEMENT .........
..8..
You can see how the method returns an object similar to the first DataFrame when given the ‘left’ option (and the second DataFrame with the ‘right’ option). It occurs because these DataFrames don’t share keys, and the method only uses keys from one of them.
Conclusion
Pandas merge multiple DataFrames in the same way as SQL joins. With this knowledge, you can customize the operation to fit your desired merging purpose.
Leave a comment