• No products in the cart.

104.2.8 Joining and Merging datasets in Python

Using merge function of pandas library

Link to the previous post : https://statinfer.com/104-2-7-identifying-and-removing-duplicate-values-from-dataset-in-python/

In this post we will learn how to merge two datasets in python using pandas library.

Data sets merging and Joining

  • Datasets:
    • TV Commercial Slots Analysis/orders.csv
    • TV Commercial Slots Analysis/slots.csv
In [103]:
orders=pd.read_csv("datasets\\TV Commercial Slots Analysis\\orders.csv")
orders.shape
Out[103]:
(1369, 9)
In [105]:
slots=pd.read_csv("datasets\\TV Commercial Slots Analysis\\slots.csv")
slots.shape
Out[105]:
(1764, 17)
In [108]:
# duplicates based on Unique_id
sum(orders.Unique_id.duplicated())
Out[108]:
3
In [109]:
# duplicates based on Unique_id
sum(slots.Unique_id.duplicated())
Out[109]:
13
In [110]:
#Removing Duplicates
orders1=orders.drop_duplicates(['Unique_id'])
slots1=slots.drop_duplicates(['Unique_id'])
In [111]:
sum(orders1.Unique_id.duplicated())
Out[111]:
0
In [112]:
sum(slots1.Unique_id.duplicated())
Out[112]:
0

Inner Join

In [113]:
inner_data=pd.merge(orders1, slots1, on='Unique_id', how='inner')
inner_data.shape
Out[113]:
(8, 25)

Outer Join

In [114]:
outer_data=pd.merge(orders1, slots1, on='Unique_id', how='outer')
outer_data.shape
Out[114]:
(3109, 25)

Left outer Join

In [115]:
L_outer_data=pd.merge(orders1, slots1, on='Unique_id', how='left')
L_outer_data.shape
Out[115]:
(1366, 25)

Righ outer Join

In [116]:
R_outer_data=pd.merge(orders1, slots1, on='Unique_id', how='right')
R_outer_data.shape
Out[116]:
(1751, 25)

LAB: Data Joins

  • Datasets
    • “./Telecom Data Analysis/Bill.csv”
    • “./Telecom Data Analysis/Complaints.csv”
  • Import the data and remove duplicates based on cust_id
  • Create a dataset for each of these requirements
  • All the customers who appear either in bill data or complaints data
  • All the customers who appear both in bill data and complaints data
  • All the customers from bill data: Customers who have bill data along with their complaints
  • All the customers from complaints data: Customers who have Complaints data along with their bill info
In [117]:
comp_data=pd.read_csv("datasets\\Telecom Data Analysis\\Complaints.csv")
comp_data.shape
Out[117]:
(6587, 8)
In [118]:
bill_data=pd.read_csv("datasets\\Telecom Data Analysis\\Bill.csv")
bill_data.shape
Out[118]:
(9462, 7)
In [119]:
#Import the data and remove duplicates based on cust_id
comp_data1=comp_data.drop_duplicates(['cust_id'])
comp_data1.shape
Out[119]:
(4856, 8)
In [120]:
bill_data1=bill_data.drop_duplicates(['cust_id'])
bill_data1.shape
Out[120]:
(9389, 7)
In [121]:
#All the customers who appear either in bill data or complaints data 
combined1=pd.merge(comp_data1, bill_data1, on='cust_id', how='outer')
combined1.shape
Out[121]:
(13952, 14)
In [122]:
#All the customers who appear both in bill data and complaints data
combined2=pd.merge(comp_data1, bill_data1, on='cust_id', how='inner')
combined2.shape
Out[122]:
(293, 14)
In [123]:
#All the customers from bill data: Customers who have bill data along with their complaints
combined3=pd.merge(comp_data1, bill_data1, on='cust_id', how='right')
combined3.shape
Out[123]:
(9389, 14)
In [124]:
#All the customers from complaints data: Customers who have Complaints data along with their bill info
combined4=pd.merge(comp_data1, bill_data1, on='cust_id', how='left')
combined4.shape
Out[124]:
(4856, 14)

Statinfer

Statinfer derived from Statistical inference. We provide training in various Data Analytics and Data Science courses and assist candidates in securing placements.

Contact Us

info@statinfer.com

+91- 9676098897

+91- 9494762485

 

Our Social Links

top
© 2020. All Rights Reserved.