Link to the previous post : https://statinfer.com/104-2-6-sorting-the-data-in-python/

In this post we will see how to identify and remove duplicate values from a dataset. We will use the Bill dataset from the Telecom Data Analysis folder.

Identifying & Removing Duplicates

In [90]:

import pandas as pd

bill_data=pd.read_csv("datasets\\Telecom Data Analysis\\Bill.csv")
bill_data.shape

Out[90]:

(9462, 7)

In [87]:

#Identify duplicate records in the data
dupes=bill_data.duplicated()
sum(dupes)

Out[87]:

10

In [88]:

#Removing Duplicates
bill_data_uniq=bill_data.drop_duplicates()

In [89]:

bill_data_uniq.shape

Out[89]:

(9452, 7)
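The relationship between duplicated() and drop_duplicates() is easier to see on a small table. The sketch below uses a made-up five-row DataFrame (the column values are assumptions for illustration, not rows from Bill.csv) to show that the rows flagged by duplicated() are the ones drop_duplicates() removes, so the drop in row count equals the number of flagged rows.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from Bill.csv)
toy = pd.DataFrame({
    "cust_id": [101, 102, 101, 103, 102],
    "bill_amount": [250, 400, 250, 310, 450],
})

# True for every row that repeats an earlier row in ALL columns
mask = toy.duplicated()
n_dupes = int(mask.sum())

# Boolean indexing shows which rows were flagged
flagged_rows = toy[mask]

# drop_duplicates() keeps the first copy and removes the flagged rows
toy_uniq = toy.drop_duplicates()

print(n_dupes)
print(flagged_rows)
print(toy.shape, toy_uniq.shape)
```

Only the third row is flagged here: the last row shares a cust_id with the second one but has a different bill_amount, so it is not a full-record duplicate.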

Identifying & Removing Duplicates based on Key

What if we are not interested in duplicates at the level of the whole record?
Sometimes we treat records as duplicates when only a key variable, such as cust_id, is repeated, even if the other columns differ.
Instead of applying the duplicated function to the full data, we apply it to that one variable.

In [93]:

#Identify duplicates in bill data based on cust_id
dupe_id=bill_data.cust_id.duplicated()

In [95]:

#Removing duplicates based on a variable
bill_data_cust_uniq=bill_data.drop_duplicates(['cust_id'])

bill_data_cust_uniq.shape

Out[95]:

(9389, 7)
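When duplicates are defined by a key, it also matters which of the repeated rows survives. The sketch below, again on made-up toy data rather than Bill.csv, uses the subset and keep parameters of duplicated() and drop_duplicates() to compare the three options pandas offers.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from Bill.csv)
toy = pd.DataFrame({
    "cust_id": [101, 102, 101, 103, 102],
    "bill_amount": [250, 400, 250, 310, 450],
})

# Flag repeats of cust_id only; the other columns are ignored
key_mask = toy.duplicated(subset=["cust_id"])

# keep="first" (the default) keeps the first row seen for each cust_id
first_per_cust = toy.drop_duplicates(subset=["cust_id"], keep="first")

# keep="last" keeps the last row seen for each cust_id
last_per_cust = toy.drop_duplicates(subset=["cust_id"], keep="last")

# keep=False drops every row whose cust_id appears more than once
only_singletons = toy.drop_duplicates(subset=["cust_id"], keep=False)

print(key_mask.tolist())
print(first_per_cust)
print(last_per_cust)
print(only_singletons)
```

Because the default keeps the first occurrence, sorting the data before calling drop_duplicates() is one way to control which record per key is retained.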

Practice : Handling Duplicates in Python

DataSet: “./Telecom Data Analysis/Complaints.csv”
Identify overall duplicates in complaints data
Create a new dataset by removing overall duplicates in Complaints data
Identify duplicates in complaints data based on cust_id
Create a new dataset by removing duplicates based on cust_id in Complaints data

In [96]:

comp_data=pd.read_csv("datasets\\Telecom Data Analysis\\Complaints.csv")
comp_data.shape

Out[96]:

(6587, 8)

In [97]:

comp_data.columns.values

Out[97]:

array(['comp_id', 'month', 'incident', 'cust_id', 'sla status new',
       'incident type', 'type', 'severity'], dtype=object)

In [98]:

#Identify overall duplicates in complaints data

dupe=comp_data.duplicated()
sum(dupe) # gives total number of duplicates in data

Out[98]:

In [100]:

#Create a new dataset by removing overall duplicates in Complaints data
comp_data1=comp_data.drop_duplicates()

In [101]:

#Identify duplicates in complaints data based on cust_id
dupe_id=comp_data.cust_id.duplicated()

In [102]:

#Create a new dataset by removing duplicates based on cust_id in Complaints data
comp_data2=comp_data.drop_duplicates(['cust_id'])

comp_data2.shape

Out[102]:

(4856, 8)
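The four practice steps can be checked at a glance with a small summary function. This is only a sketch: duplicate_summary is a hypothetical helper name, and the toy table below is made up rather than read from Complaints.csv.

```python
import pandas as pd

def duplicate_summary(df, key):
    """Hypothetical helper: count full-row and key-level duplicates."""
    return {
        "rows": len(df),
        # rows that repeat an earlier row in every column
        "full_row_duplicates": int(df.duplicated().sum()),
        # rows that repeat an earlier value of the key column
        "key_duplicates": int(df.duplicated(subset=[key]).sum()),
        # number of distinct (non-missing) key values
        "unique_keys": int(df[key].nunique()),
    }

# Toy data, made up for illustration (not taken from Complaints.csv)
toy_complaints = pd.DataFrame({
    "comp_id": [1, 2, 2, 3, 4],
    "cust_id": [11, 12, 12, 11, 13],
})

summary = duplicate_summary(toy_complaints, "cust_id")
print(summary)
```

When the key column has no missing values, rows minus key_duplicates equals unique_keys, which is the row count to expect after drop_duplicates on that key.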


The next post is on joining and merging datasets in Python.
Link to the next post : https://statinfer.com/104-2-8-joining-and-merging-datasets-in-python/

11th October 2018

104.2.7 Identifying and Removing Duplicate values from dataset in Python

Statinfer
