• No products in the cart.

103.2.6.a Handling Duplicates- An Example

Learn with Examples

In previous post we saw  Handling Duplicates

The issue of duplicate values is very frequent in many transaction datasets. Here we have an example of telecom bill and complaints data. A single user may register a complaint twice a day. This repeated data need to be found and removed from the space.

The following data on Complaints would help in understanding how this is done.

  1. DataSet: “./Telecom Data Analysis/Complaints.csv”
  2. Identify overall duplicates in complaints data
  3. Create a new dataset by removing overall duplicates in Complaints data
  4. Identify duplicates in complaints data based on cust_id
  5. Create a new dataset by removing duplicates based on cust_id in Complaints data

Solutions

  1. DataSet: “./Telecom Data Analysis/Complaints.csv”
>Complaint <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Telecom Data Analysis\\Complaints.csv")
>head(Complaint)
##     comp_id  month  incident    cust_id sla.status.new
 ## 1 318691791 12-Jul 7/22/2012 9243584301   ClosedIN SLA
 ## 2 318691795 12-Jul 7/22/2012 9243584297   ClosedIN SLA
 ## 3 318691807 12-Jul 7/22/2012 9243584279   ClosedIN SLA
 ## 4 318691809 12-Jul 7/22/2012 9243584277   ClosedIN SLA
 ## 5 317997895 12-Jul 7/17/2012 9282441943   ClosedIN SLA
 ## 6 318030735 12-Jul 7/17/2012 7204112312   ClosedIN SLA
 ##               incident.type       type severity
 ## 1    Customer Communication VAS-OTHERS     High
 ## 2    Customer Communication VAS-OTHERS     High
 ## 3    Customer Communication VAS-OTHERS     High
 ## 4    Customer Communication VAS-OTHERS     High
 ## 5         Barring/Unbarring VAS-OTHERS     High
 ## 6 Name/Address Modification       1515      Low
  1. Identify overall duplicates in complaints data
>Dupes_in_Complaint <- duplicated(Complaint)
>summary(Dupes_in_Complaint)
>Complaint_dupes <- Complaint[Dupes_in_Complaint,]
>Complaint_dupes

 

##    Mode   FALSE    NA's 
## logical    6587       0
## [1] comp_id        month          incident       cust_id       
## [5] sla.status.new incident.type  type           severity      
## <0 rows> (or 0-length row.names)

There are no record level duplicates

  1. Create a new dataset by removing overall duplicates in Complaints data

There are no record level duplicates, if there were any we can remove them using below code

>dim(Complaint)
>Complaint_unique <- Complaint[!Dupes_in_Complaint,]
>dim(Complaint_unique)

 

## [1] 6587    8
## [1] 6587    8

OR

>dim(Complaint)
>Complaint_unique <- unique(Complaint)
>dim(Complaint_unique)
## [1] 6587    8
## [1] 6587    8
  1. Identify duplicates in complaints data based on cust_id
>Dupes_in_Complaint_cust_id<-duplicated(Complaint$cust_id) 
>length(Dupes_in_Complaint_cust_id[Dupes_in_Complaint_cust_id==TRUE])
## [1] 1731

As we can see, there are 1731 duplicates if searched by using a key “cust_id”

  1. Create a new dataset by removing duplicates based on cust_id in Complaints data
>Complaint_unique<-Complaint[!Dupes_in_Complaint_cust_id,]
>nrow(Complaint)
>nrow(Complaint_unique)

OUTPUT:

## [1] 6587
## [1] 4856

From the above, we can see that the duplicated rows are eliminated, as the number of rows in the complete data set is different from the number of rows in unique data set.

In the next post we will see Merge and Join.

 

20th June 2017

Statinfer

Statinfer derived from Statistical inference. We provide training in various Data Analytics and Data Science courses and assist candidates in securing placements.

Contact Us

info@statinfer.com

+91- 9676098897

+91- 9494762485

 

Our Social Links

top
© 2020. All Rights Reserved.