Link to the previous post : https://statinfer.com/104-2-6-sorting-the-data-in-python/
In this post we will learn how to identify and remove duplicate values from a dataset. We will use the Bill dataset from the Telecom Data Analysis folder.
Identifying & Removing Duplicates
In [90]:
import pandas as pd
bill_data=pd.read_csv("datasets\\Telecom Data Analysis\\Bill.csv")
bill_data.shape
Out[90]:
In [87]:
#Identify duplicate records in the data
dupes=bill_data.duplicated()
sum(dupes)
Out[87]:
In [88]:
#Removing Duplicates
bill_data_uniq=bill_data.drop_duplicates()
In [89]:
bill_data_uniq.shape
Out[89]:
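To make the behaviour of duplicated() and drop_duplicates() easier to see, here is a minimal sketch on a small made-up table. The toy DataFrame and its column names (cust_id, bill_amt) are assumptions for illustration only and are not taken from Bill.csv.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from Bill.csv
toy = pd.DataFrame({
    "cust_id": [1, 2, 2, 3, 3],
    "bill_amt": [100, 200, 200, 300, 350],
})

# duplicated() marks a row True when an identical row appeared earlier
dupes = toy.duplicated()
n_dupes = int(dupes.sum())

# drop_duplicates() keeps the first occurrence of each distinct row
toy_uniq = toy.drop_duplicates()

print(n_dupes)          # number of fully repeated rows
print(toy_uniq.shape)   # shape after removing them
```

Only the second (2, 200) row is flagged here, because the two rows for cust_id 3 differ in bill_amt. The number of rows removed should always equal the number of rows flagged by duplicated().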
Identifying & Removing Duplicates based on Key
- What if we are not interested in duplicates at the level of the whole record?
- Sometimes we treat records as duplicates when only a key variable, such as cust_id, is repeated, even if the other columns differ.
- In that case, instead of applying the duplicated function to the full data, we apply it to that one variable.
In [93]:
#Identify duplicates in bill data based on cust_id
dupe_id=bill_data.cust_id.duplicated()
In [95]:
#Removing duplicates based on a variable
bill_data_cust_uniq=bill_data.drop_duplicates(['cust_id'])
bill_data_cust_uniq.shape
Out[95]:
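When duplicates are defined by a key, it matters which of the repeated rows survives. The sketch below, again on a made-up table with assumed column names, shows the keep argument of drop_duplicates() and duplicated(): the default keeps the first row per key, keep="last" keeps the last one, and keep=False flags every row whose key is repeated.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from Bill.csv
toy = pd.DataFrame({
    "cust_id": [1, 2, 2, 3, 3],
    "bill_amt": [100, 200, 250, 300, 350],
})

# No row is repeated in full, but two cust_id values are repeated
full_dupes = int(toy.duplicated().sum())
key_dupes = int(toy["cust_id"].duplicated().sum())

# Keep the first row per cust_id (the default, keep="first")
first_per_cust = toy.drop_duplicates(subset=["cust_id"])

# Keep the last row per cust_id instead
last_per_cust = toy.drop_duplicates(subset=["cust_id"], keep="last")

# keep=False flags every row that shares its cust_id with another row
repeated_rows = toy[toy.duplicated(subset=["cust_id"], keep=False)]

print(full_dupes, key_dupes)
print(first_per_cust)
print(last_per_cust)
print(repeated_rows)
```

Inspecting repeated_rows before dropping anything is a cheap way to check whether the repeated keys are true duplicates or distinct records that happen to share a customer.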
Practice : Handling Duplicates in Python
- DataSet: “./Telecom Data Analysis/Complaints.csv”
- Identify overall duplicates in complaints data
- Create a new dataset by removing overall duplicates in Complaints data
- Identify duplicates in complaints data based on cust_id
- Create a new dataset by removing duplicates based on cust_id in Complaints data
In [96]:
comp_data=pd.read_csv("datasets\\Telecom Data Analysis\\Complaints.csv")
comp_data.shape
Out[96]:
In [97]:
comp_data.columns.values
Out[97]:
In [98]:
#Identify overall duplicates in complaints data
dupe=comp_data.duplicated()
sum(dupe) # gives total number of duplicates in data
Out[98]:
In [100]:
#Create a new dataset by removing overall duplicates in Complaints data
comp_data1=comp_data.drop_duplicates()
In [101]:
#Identify duplicates in complaints data based on cust_id
dupe_id=comp_data.cust_id.duplicated()
In [102]:
#Create a new dataset by removing duplicates based on cust_id in Complaints data
comp_data2=comp_data.drop_duplicates(['cust_id'])
comp_data2.shape
Out[102]:
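The four practice steps can be bundled into one small helper for checking any dataset. This is only a sketch: duplicate_summary is a hypothetical function name, and the toy table with its comp_type column is invented for illustration rather than taken from Complaints.csv.

```python
import pandas as pd

def duplicate_summary(df, key):
    """Hypothetical helper: count full-row and key-based duplicates in df."""
    return {
        "rows": len(df),
        "full_row_dupes": int(df.duplicated().sum()),
        "key_dupes": int(df[key].duplicated().sum()),
        "rows_after_full_dedup": len(df.drop_duplicates()),
        "rows_after_key_dedup": len(df.drop_duplicates(subset=[key])),
    }

# Hypothetical toy data standing in for the complaints table
toy = pd.DataFrame({
    "cust_id": [10, 10, 11, 12, 12],
    "comp_type": ["billing", "billing", "network", "billing", "network"],
})

summary = duplicate_summary(toy, "cust_id")
print(summary)
```

On the real data the call would be duplicate_summary(comp_data, "cust_id"), assuming comp_data has been loaded as shown above.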