Statinfer

104.2.7 Identifying and Removing Duplicate values from dataset in Python

Duplicate removal

Link to the previous post : https://statinfer.com/104-2-6-sorting-the-data-in-python/

In this post we will see how to identify and remove duplicate records from a dataset. We will use the Bill dataset from the Telecom Data Analysis folder.

Identifying & Removing Duplicates

In [90]:
import pandas as pd
bill_data=pd.read_csv("datasets\\Telecom Data Analysis\\Bill.csv")
bill_data.shape
Out[90]:
(9462, 7)
In [87]:
#Identify duplicate records in the data (True marks every repeat of an earlier row)
dupes=bill_data.duplicated()
sum(dupes)
Out[87]:
10
In [88]:
#Removing Duplicates
bill_data_uniq=bill_data.drop_duplicates()
In [89]:
bill_data_uniq.shape
Out[89]:
(9452, 7)
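
To make the behaviour of the two functions concrete, here is a minimal sketch on a small made-up table. The data frame `toy` and its column values are assumptions for illustration only; they are not taken from Bill.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the bill data (values are made up)
toy = pd.DataFrame({
    "cust_id": [1, 2, 2, 3, 3],
    "bill":    [100, 250, 250, 80, 95],
})

# duplicated() returns a boolean Series: True for each row that repeats an earlier row
dupes = toy.duplicated()
n_dupes = int(dupes.sum())

# drop_duplicates() keeps the first occurrence of each row by default
toy_uniq = toy.drop_duplicates()

print(dupes.tolist())
print(n_dupes, toy.shape, toy_uniq.shape)
```

Only the third row is flagged, because it matches the second row in every column. The two rows for cust_id 3 differ in the bill column, so neither counts as an overall duplicate. The same logic explains the shapes above: 9462 rows minus 10 flagged rows leaves 9452.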

Identifying & Removing Duplicates based on Key

  • What if we are not interested in duplicates at the level of the whole record?
  • Sometimes we treat records as duplicates whenever a key variable is repeated, even if the other columns differ.
  • Instead of applying the duplicated function to the full data, we apply it to that one variable.
In [93]:
#Identify duplicates in bill data based on cust_id
dupe_id=bill_data.cust_id.duplicated()
In [95]:
#Removing duplicates based on a variable
bill_data_cust_uniq=bill_data.drop_duplicates(['cust_id'])

bill_data_cust_uniq.shape
Out[95]:
(9389, 7)
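
When removing duplicates by key, it matters which of the repeated rows survives. The sketch below, again on a made-up table rather than the real bill data, shows the `subset` and `keep` parameters of `drop_duplicates` and `duplicated`. The names `first_kept`, `last_kept` and `all_repeats` are illustrative.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "cust_id": [1, 2, 2, 3, 3],
    "bill":    [100, 250, 250, 80, 95],
})

# Default keep="first": the first row seen for each cust_id survives
first_kept = toy.drop_duplicates(subset=["cust_id"])

# keep="last": the last row seen for each cust_id survives instead
last_kept = toy.drop_duplicates(subset=["cust_id"], keep="last")

# keep=False marks every row whose cust_id occurs more than once,
# which is handy for inspecting the repeats before dropping anything
all_repeats = toy[toy.duplicated(subset=["cust_id"], keep=False)]

print(first_kept["bill"].tolist())
print(last_kept["bill"].tolist())
print(len(all_repeats))
```

If the rows that share a key differ in other columns, as with cust_id 3 here, the choice of `keep` changes which values remain, so it is worth sorting the data first when a particular row should win.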

Practice : Handling Duplicates in Python

  • DataSet: “./Telecom Data Analysis/Complaints.csv”
  • Identify overall duplicates in complaints data
  • Create a new dataset by removing overall duplicates in Complaints data
  • Identify duplicates in complaints data based on cust_id
  • Create a new dataset by removing duplicates based on cust_id in Complaints data
In [96]:
comp_data=pd.read_csv("datasets\\Telecom Data Analysis\\Complaints.csv")
comp_data.shape
Out[96]:
(6587, 8)
In [97]:
comp_data.columns.values
Out[97]:
array(['comp_id', 'month', 'incident', 'cust_id', 'sla status new',
       'incident type', 'type', 'severity'], dtype=object)
In [98]:
#Identify overall duplicates in complaints data

dupe=comp_data.duplicated()
sum(dupe) # gives total number of duplicates in data
Out[98]:
0
In [100]:
#Create a new dataset by removing overall duplicates in Complaints data
comp_data1=comp_data.drop_duplicates()
In [101]:
#Identify duplicates in complaints data based on cust_id
dupe_id=comp_data.cust_id.duplicated()
In [102]:
#Create a new dataset by removing duplicates based on cust_id in Complaints data
comp_data2=comp_data.drop_duplicates(['cust_id'])

comp_data2.shape
Out[102]:
(4856, 8)
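
As a quick sanity check, the number of rows flagged by `duplicated()` on a key should equal the number of rows that `drop_duplicates()` removes for that key. In the output above, 6587 - 4856 = 1731 complaint rows repeat an earlier cust_id. A minimal sketch of that check on made-up data (the table `toy` is an assumption, not the Complaints data):

```python
import pandas as pd

# Hypothetical toy data standing in for the complaints data (values are made up)
toy = pd.DataFrame({
    "comp_id": [11, 12, 13, 14, 15, 16],
    "cust_id": [1, 2, 2, 3, 3, 3],
})

# Rows whose cust_id has already appeared earlier in the table
dupe_id = toy["cust_id"].duplicated()
n_removed = int(dupe_id.sum())

# Removing duplicates on the same key should drop exactly those rows
toy_uniq = toy.drop_duplicates(subset=["cust_id"])
assert len(toy) - n_removed == len(toy_uniq)

# value_counts() shows how many rows each cust_id has
rows_per_cust = toy["cust_id"].value_counts()
print(n_removed, len(toy_uniq))
print(rows_per_cust.to_dict())
```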


The next post is on joining and merging datasets in Python.
Link to the next post : https://statinfer.com/104-2-8-joining-and-merging-datasets-in-python/
