Link to the previous post : https://statinfer.com/104-2-7-identifying-and-removing-duplicate-values-from-dataset-in-python/
In this post we will learn how to merge two datasets in python using pandas library.
Data sets merging and Joining
- Datasets:
- TV Commercial Slots Analysis/orders.csv
- TV Commercial Slots Analysis/slots.csv
In [103]:
orders=pd.read_csv("datasets\\TV Commercial Slots Analysis\\orders.csv")
orders.shape
Out[103]:
In [105]:
slots=pd.read_csv("datasets\\TV Commercial Slots Analysis\\slots.csv")
slots.shape
Out[105]:
In [108]:
# duplicates based on Unique_id
sum(orders.Unique_id.duplicated())
Out[108]:
In [109]:
# duplicates based on Unique_id
sum(slots.Unique_id.duplicated())
Out[109]:
In [110]:
#Removing Duplicates
orders1=orders.drop_duplicates(['Unique_id'])
slots1=slots.drop_duplicates(['Unique_id'])
In [111]:
sum(orders1.Unique_id.duplicated())
Out[111]:
In [112]:
sum(slots1.Unique_id.duplicated())
Out[112]:
Inner Join
In [113]:
inner_data=pd.merge(orders1, slots1, on='Unique_id', how='inner')
inner_data.shape
Out[113]:
Outer Join
In [114]:
outer_data=pd.merge(orders1, slots1, on='Unique_id', how='outer')
outer_data.shape
Out[114]:
Left outer Join
In [115]:
L_outer_data=pd.merge(orders1, slots1, on='Unique_id', how='left')
L_outer_data.shape
Out[115]:
Righ outer Join
In [116]:
R_outer_data=pd.merge(orders1, slots1, on='Unique_id', how='right')
R_outer_data.shape
Out[116]:
LAB: Data Joins
- Datasets
- “./Telecom Data Analysis/Bill.csv”
- “./Telecom Data Analysis/Complaints.csv”
- Import the data and remove duplicates based on cust_id
- Create a dataset for each of these requirements
- All the customers who appear either in bill data or complaints data
- All the customers who appear both in bill data and complaints data
- All the customers from bill data: Customers who have bill data along with their complaints
- All the customers from complaints data: Customers who have Complaints data along with their bill info
In [117]:
comp_data=pd.read_csv("datasets\\Telecom Data Analysis\\Complaints.csv")
comp_data.shape
Out[117]:
In [118]:
bill_data=pd.read_csv("datasets\\Telecom Data Analysis\\Bill.csv")
bill_data.shape
Out[118]:
In [119]:
#Import the data and remove duplicates based on cust_id
comp_data1=comp_data.drop_duplicates(['cust_id'])
comp_data1.shape
Out[119]:
In [120]:
bill_data1=bill_data.drop_duplicates(['cust_id'])
bill_data1.shape
Out[120]:
In [121]:
#All the customers who appear either in bill data or complaints data
combined1=pd.merge(comp_data1, bill_data1, on='cust_id', how='outer')
combined1.shape
Out[121]:
In [122]:
#All the customers who appear both in bill data and complaints data
combined2=pd.merge(comp_data1, bill_data1, on='cust_id', how='inner')
combined2.shape
Out[122]:
In [123]:
#All the customers from bill data: Customers who have bill data along with their complaints
combined3=pd.merge(comp_data1, bill_data1, on='cust_id', how='right')
combined3.shape
Out[123]:
In [124]:
#All the customers from complaints data: Customers who have Complaints data along with their bill info
combined4=pd.merge(comp_data1, bill_data1, on='cust_id', how='left')
combined4.shape
Out[124]: