How to Merge Two Dataframes in Pandas?

Challenge Inside! : Find out where you stand! Try quiz, solve problems & win rewards!

Learn via video courses

Overview

Pandas is an open-source library that works with data and related operations like data analysis, manipulation, etc. If you are asked to collect data on students of your college who have a keen interest in Machine learning irrespective of their age, would it be structured until you organize it? In real-life scenarios, data is scattered, so to work on this data, we need to combine it into one DataFrame. We might need to merge two or more dataframes and perform various operations to make working easier.

Introduction to Merge Two Dataframes Pandas

A pandas dataframe is a two-dimensional data structure represented with the help of rows and columns. Data can be stored in a dataframe in various ways, be it by importing from an Excel file, a CSV file, or a database. While importing data from such sources, we can have multiple dataframes, but wouldn't it be tedious to perform similar operations on different dataframes? Now, what do we do in such a case? Merge all of them into one, and instead of performing the same operation multiple times, we could just do it at once and get over with our task. Hence the concept of merging dataframes in Pandas is important. Let's look at all those powerful functions that would make your task easier.

Method 1: Using merge()

The merge() is very similar to SQL join function. There are various parameters to be considered while using this function. While using the merge() function we write it as:-

This function is used when we have to join columns or indexes in every possible combination, be it column to column or index to column, etc. Let us discuss these parameters in more detail.

  • right:- This is the object we will merge with.
  • how:- This contains five different parameters to choose from the right, left, inner, and outer cross. It has the default value “inner.” These parameters denote the type of merge to be performed, for example, left join, right join, inner join, etc. It is similar to the joins we have studied in SQL.
  • on:- This specifies the column or index on which the merge is supposed to happen. If the value for on is None, the dataframe will be merged based on columns in both available dataframes.
  • left_on:- When this parameter is selected columns or indexes are merged in the left dataframe.
  • right_on:-When this parameter is selected columns or indexes are merged in the right dataframe.
  • left_index:- It takes in bool values i.e True/False. We use the index from the left dataframe as the merge keys.
  • right_index:- It is similar to left_index. The only difference is that we use the index from the right dataframe as the merge key.
  • sort:- It also takes in bool values and the main function is to sort the merge keys lexicographically in the output dataframe. If the parameter value is False the order of the merge keys depends on what type of merge is being performed.
  • Suffixes:- It is the sequence of length two. The values are of string datatype and indicate the suffix to be added to the overlapping column names on the left and right respectively after the dataframes are merged. Its default value is (“_x”, “_y”).
  • copy:- It takes in bool values. The default value is True. Then, it creates a copy of the resulting dataframe.
  • indicator:- It takes either bool values or string values. If the parameter value is True, it adds a column to the resultant dataframe named “_merge” that contains information on the source of each row. We can change the name of the column by providing a different name as a string value to the parameter.
  • validate:- If the value is None, it checks what type of merge operation has been performed. For example, one-to-one, one-to-many, many-to-one, or many-to-many. In the one-to-one merge, the merge keys are unique in both datasets, and so on.

The return type for this function is a dataframe created after two objects/ dataframes are merged.

Code Example 1:

Output:-

Code Example 2:

Output:-

Code Example 3:

Output:-

Code Example 4:

Output:-

Code Example 5:

Output:-

Code Example 6:

Output:-

Code Example 7:

Output:-

Code Example 8:

Output:-

Code Example 9:

Output:-

Code Example 10:

Output:-

Code Example 11:

Output:-

Method 2: Using join()

This method is used to join columns with other dataframe either on a column or index. While using the join() function we write it as:-

The merge() and join() functions perform similar operations, but the parameters taken into consideration are very few as compared to the merge() function. Let us look into each parameter in detail, similar to how we did for the merge() function.

  • other:- The index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
  • on:- It takes in a string or a list of string values containing the column or index name on which the join is supposed to happen with respect to the other dataframe. If no value is provided, index-to-index join takes place.
  • how:- This contains four different parameters to choose from right, left, inner, and outer. It has the default value “left”. For example, while using the left value we use the index of the calling dataframe whereas while using the inner, we form the intersection of indexes of both dataframes without altering the order of the calling dataframe.
  • lsuffix:- It takes in string value by default. It provides the suffix value to use with the left dataframe’s overlapping column.
  • rsuffix:- It provides the suffix value for using the right dataframe’s overlapping column.
  • sort:- It takes in a bool value, the default value being False. It helps sort the resultant dataframe in a lexicographical manner using the join key. If the parameter value is False, the order of the join keys depends on what type of join is being performed.

This function's return type is a dataframe with columns from both data frames used.

Code Example 1:

Output:-

Code Example 2:

Output:-

Code Example 3:

Output:-

Code Example 4:

Output:-

Method 3: Using append()

This function is used to append rows of other data frames to the given dataframe. In contrast, only those columns not already present in the calling dataframe are added in appending columns. It isn’t used after version 1.4.0 instead, we use the concat() function in the advanced versions of pandas.

Code Example 1:

Output:-

Code Example 2:

Output:-

Code Example 3:

Output:-

Code Example 4:

Output:-

Code Example 5:

Output:-

Method 4: Using concat()

This function is used to merge two dataframes across rows or columns. Among the various available parameters, we have an axis whose value can be 0 or 1. When the axis = 0 the data frames are concatenated on the basis of rows, whereas when the axis = 1, they are concatenated on the basis of columns. While using the concat() function we write it as:-

Let us see how all these parameters contribute to the Concat function.

  • objs:- It takes in a sequence or mapping of Series or DataFrame objects as the parameter values.
  • axis:- The default value for this is 0, where axis = 0 means index based and axis = 1 means column based.
  • join:- This defines how the concatenation will occur, for example, outer or inner join. The default value is “outer”.
  • ignore_index:- It takes in bool values. If the value is True we tend to ignore the index values along the axis it is concatenated. The default value is False.
  • keys:- This helps us construct hierarchical indexing at the outermost level using the sequence of keys passed as the parameter values.
  • names:- It contains names of the levels in the hierarchical index obtained.
  • verify_integrity:- It takes in bool values and checks whether the new concatenated axis contains redundant values or not. The default value for this parameter is False.
  • sort:- It takes in bool values, the default value being False. During the outer join, it is responsible for sorting the non-concatenated axis if they are not aligned, whereas it has no role during the inner join as the axes are already sorted.
  • copy:- It creates the copy of the resultant object. If the value is set to False, no data is copied.

The return type for this function is a dataframe when the concatenation is done along the columns. Also, if the concatenation happens between a series and a dataframe, a dataframe is returned.

Code Example 1:

Output:-

   0
0  1
1  2
2  3

Code Example 2:

Code Example 3:

Code Example 4:

Code Example 5:

Code Example 6:

Code Example 7:

Output:-

Method 5: Using Combine_first() and Update():-

This function can help update a dataframe with null values and non-null values taken from other dataframe. It is nothing but the union of two data frames provided. The return type for this function is a dataframe.

The update() function is used when data modification in a dataframe is required from another dataframe.

We will now see the parameters that are used with the update function.

  • other:- The update() functions only when at least one common row or column label is present concerning the original DataFrame.
  • join:- Only the left join can be implemented in this function.
  • overwrite:- It takes boolean values. When the value is set to True, it will overwrite not null values with the help of the other dataframe, whereas when the value is set to False, only NA values are updated. The default value for this is True.
  • filter_func:- This function can choose if it wants to replace any value other than NA. All those values that are to be updated will return True for this function.
  • errors:- It can have two values raised or ignored. The default is “ignore”. If the value is set to “raise” it will raise a “Value Error” if the NA value location overlaps in both dataframes.

The return type is None, as it directly updates the dataframe provided rather than creating and returning a new one.

Code Example 1:

Output:-

Code Example 2:

Output:-

We can see that the update function modifies the first dataframe df with the values in the other dataframe n_df and returns the df dataframe instead of creating another one, so the return type is None for the update() function.

Conclusion

This article dealt with various powerful functions on how to merge two dataframes in pandas. We saw various functions like merge(), join(), concat(), append(), combine_first(), and update(). Now, how do we decide when to use which function()?

  • merge()- This function can be used in almost all cases, but if a certain condition is given, we might have to bring in some other function in use.
  • use the concat() function to merge two data frames row-wise.
  • Found a missing value in your dataframe? Use combine_first() or update() function.
  • Planning to add a new row in your dataframe while merging? We have the append() function.

In the end, it is all about experimenting with your data. Try hands-on with all these functions and see if you get an error or the desired result. The better you know your data, the better results you will achieve. Keep Experimenting, Keep Learning.