. Advertisement .
..3..
. Advertisement .
..4..
What is the difference between join() and merge() in Pandas? Keep reading to find out the answers. In this article, you will find the distinction between the two and many use cases and examples for each. Let’s dive right in!
Key points of the two methods:
- The
join()
method only supports joining on rows. It does not permit joining on columns unless a column is provided as an index. join()
conducts a left join by default.- The
merge()
method is for performing a join on columns, indices, or a combination of the two. merge()
conducts an inner join by default.- These approaches are compatible with inner, outer, left, and right join kinds.
merge()
can also support cross join.
Difference Between Join() Vs. Merge() In Pandas
Both merge()
and join()
are utilized to conduct joining DataFrame on the meaning of columns. It creates a single DataFrame from the entirety of two or more DataFrames’ columns. The biggest difference between the two is that join()
is utilized to combine two Pandas DataFrames but only on the index, not the columns.
Meanwhile, The main purpose of merge()
is to define the columns you intend to combine. Additionally, it offers joining on index and combining columns and indexes.
Both of them support right on index and left on column. Still, the merge()
also supports right on column and left on index.
Another distinction is that merge()
defaults to inner join, whereas join()
defaults to left join.
Join() Vs. Merge() Examples
Let’s now examine the distinctions between join and merge using examples. Start by making two DataFrames that share a single column.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print(df1)
print(df2)
Output of left DataFrame:
Courses Fee Duration
r1 Spark 20000 30days
r2 PySpark 25000 40days
r3 Python 22000 35days
r4 Pandas 30000 50days
Output of right DataFrame:
Courses Discount
r1 Spark 2000
r6 Java 2300
r3 Python 1200
r5 Go 2000
Join On Index Vs. Merge On Index
Join is the most effective method for using indices when merging DataFrames in Pandas on row indexes.
# pandas left join two DataFrames by Index
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
Output:
Courses_left Fee Duration Course_right Discount
r1 Spark 20000 30days Spark 2000.0
r2 PySpark 25000 40days NaN NaN
r3 Python 22000 35days Python 1200.0
r4 Pandas 30000 50days NaN NaN
Keep in mind that it executed a left join on both DataFrame since that is the default action of the join().
You must define the left join using the how parameter to perform the same action with the merge() method.
# Merge two DataFrames by index using pandas.merge()
df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(df3)
Output:
Courses_x Fee Duration Course_y Discount
r1 Spark 20000 30days Spark 2000.0
r2 PySpark 25000 40days NaN NaN
r3 Python 22000 35days Python 1200.0
r4 Pandas 30000 50days NaN NaN
Join On Column Vs. Merge On Column
By default, merge()
employs an inner join when combining DataFrames. The Courses column in the example below is automatically joined because it is the only column shared by both DataFrames.
# pandas merge - inner join by Column
df3=pd.merge(df1,df2)
Output:
Courses Fee Duration Discount
0 Spark 20000 30days 2000
1 Python 22000 35days 1200
Additionally, you may specifically state which column you intend to join.
# Merge DataFrames by Column
df3=pd.merge(df1,df2, on='Courses')
You can define the column names if they differ between the left and right DataFrames.
# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
The column must be set to index usingset_index() in order to do a similar action using the join() function.
# Use join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)
Complete Example
Here is the complete example for you to use as a reference:
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print(df1)
print(df2)
# pandas left join two DataFrames by Index
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
# Merge two DataFrames by index using pandas.merge()
df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(df3)
# pandas merge - inner join by Column
df3=pd.merge(df1,df2)
print(df3)
# Merge DataFrames by Column
df3=pd.merge(df1,df2, on='Courses')
print(df3)
# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
print(df3)
# Use join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)
Conclusion
Hopefully, this article has successfully shown you every difference between join() and merge() in Pandas. Keep in mind these methods so that you can apply them correctly in your project.Now you know what merge()
and join()
are used for. Continuing to learn other functions is the next step in your coding journey. How about sum()
and groupby()
? Here is the tutorial on how to operate these two functions.
Leave a comment