In the previous post, we saw Handling Duplicates.
The issue of duplicate values is very frequent in transaction datasets. Here we have an example of telecom bill and complaints data: a single user may register the same complaint twice a day, and this repeated data needs to be found and removed from the dataset.
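As a warm-up, here is a minimal sketch of how base R's duplicated() flags repeated rows. The two-column data frame is made up for illustration and is not the actual telecom data:

> toy <- data.frame(cust_id  = c(101, 102, 101),
+                   incident = c("7/22/2012", "7/17/2012", "7/22/2012"))
> duplicated(toy)          # third row repeats the first
## [1] FALSE FALSE  TRUE
> toy[!duplicated(toy), ]  # keep only the first occurrence of each row
##   cust_id  incident
## 1     101 7/22/2012
## 2     102 7/17/2012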
The following tasks on the Complaints data will help in understanding how this is done.
- DataSet: “./Telecom Data Analysis/Complaints.csv”
- Identify overall duplicates in complaints data
- Create a new dataset by removing overall duplicates in Complaints data
- Identify duplicates in complaints data based on cust_id
- Create a new dataset by removing duplicates based on cust_id in Complaints data
Solutions
- DataSet: “./Telecom Data Analysis/Complaints.csv”
> Complaint <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Telecom Data Analysis\\Complaints.csv")
> head(Complaint)
##     comp_id  month  incident    cust_id sla.status.new
## 1 318691791 12-Jul 7/22/2012 9243584301   ClosedIN SLA
## 2 318691795 12-Jul 7/22/2012 9243584297   ClosedIN SLA
## 3 318691807 12-Jul 7/22/2012 9243584279   ClosedIN SLA
## 4 318691809 12-Jul 7/22/2012 9243584277   ClosedIN SLA
## 5 317997895 12-Jul 7/17/2012 9282441943   ClosedIN SLA
## 6 318030735 12-Jul 7/17/2012 7204112312   ClosedIN SLA
##               incident.type       type severity
## 1    Customer Communication VAS-OTHERS     High
## 2    Customer Communication VAS-OTHERS     High
## 3    Customer Communication VAS-OTHERS     High
## 4    Customer Communication VAS-OTHERS     High
## 5         Barring/Unbarring VAS-OTHERS     High
## 6 Name/Address Modification       1515      Low
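The CSV path above is from the author's machine. Per the task statement, the same file can also be read with the relative path “./Telecom Data Analysis/Complaints.csv”, assuming your working directory contains that folder:

> Complaint <- read.csv("./Telecom Data Analysis/Complaints.csv")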
- Identify overall duplicates in complaints data
> Dupes_in_Complaint <- duplicated(Complaint)
> summary(Dupes_in_Complaint)
> Complaint_dupes <- Complaint[Dupes_in_Complaint, ]
> Complaint_dupes
##    Mode   FALSE    NA's
## logical    6587       0
## [1] comp_id        month          incident       cust_id
## [5] sla.status.new incident.type  type           severity
## <0 rows> (or 0-length row.names)
There are no record-level duplicates.
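A quicker way to confirm this is base R's anyDuplicated(), which returns the index of the first duplicated row, or 0 when there are none. Given the summary above, it should return 0 here:

> anyDuplicated(Complaint)
## [1] 0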
- Create a new dataset by removing overall duplicates in Complaints data
There are no record-level duplicates; if there were any, we could remove them using the code below.
> dim(Complaint)
> Complaint_unique <- Complaint[!Dupes_in_Complaint, ]
> dim(Complaint_unique)
## [1] 6587    8
## [1] 6587    8
OR
> dim(Complaint)
> Complaint_unique <- unique(Complaint)
> dim(Complaint_unique)
## [1] 6587    8
## [1] 6587    8
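If you prefer the tidyverse, dplyr's distinct() does the same row-level deduplication; this is an optional alternative, assuming the dplyr package is installed:

> library(dplyr)
> Complaint_unique <- distinct(Complaint)
> dim(Complaint_unique)
## [1] 6587    8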
- Identify duplicates in complaints data based on cust_id
> Dupes_in_Complaint_cust_id <- duplicated(Complaint$cust_id)
> length(Dupes_in_Complaint_cust_id[Dupes_in_Complaint_cust_id == TRUE])
## [1] 1731
As we can see, there are 1731 duplicates when searching by the key “cust_id”.
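Because TRUE counts as 1 in arithmetic, sum() over the logical vector gives the same count more compactly:

> sum(Dupes_in_Complaint_cust_id)
## [1] 1731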
- Create a new dataset by removing duplicates based on cust_id in Complaints data
> Complaint_unique <- Complaint[!Dupes_in_Complaint_cust_id, ]
> nrow(Complaint)
> nrow(Complaint_unique)
## [1] 6587
## [1] 4856
From the above, we can see that the duplicate rows have been eliminated: the unique dataset has 4856 rows, 1731 fewer than the 6587 rows in the complete dataset.
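Note that duplicated() marks every occurrence after the first, so the code above keeps each customer's first complaint. If you wanted to keep the last complaint instead, duplicated() accepts a fromLast argument; either way exactly one row per cust_id remains, so the row count is the same:

> Dupes_from_last <- duplicated(Complaint$cust_id, fromLast = TRUE)
> Complaint_last_per_cust <- Complaint[!Dupes_from_last, ]
> nrow(Complaint_last_per_cust)
## [1] 4856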
In the next post, we will see Merge and Join.