Statinfer

103.2.6 Handling Duplicates

(Identifying and Removing)

In previous post we saw  An Example of Sorting the Data

Duplicates can be same or similar based on the key (more than one) entries. Like, one customer purchasing multiple times in a single day from the same store. In R, we have a function called duplicated() which can be used to find repeated entries in the dataset.We get the index of duplicated records using duplicated(). Then we can use those indexes to select the duplicated records. There are two types of duplicate records: Duplicates in the whole record and duplicates based on the key. Overall record level duplicates

>Bill <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Telecom Data Analysis\\Bill.csv")
>Dupes_in_bill<-duplicated(Bill) 
>length(Dupes_in_bill)
>head(Dupes_in_bill)
>summary(Dupes_in_bill)
## [1] 9462
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## Mode   FALSE    TRUE    NA's 
## logical    9452      10       0

In the above code, duplicated(Bill) would identify all the duplicate entries in the data set and store a logical entry for the same, i.e. If the data has no duplicate values, FALSE entry is stored in Dupes_in_Bill for that data entry, else TRUE. Using summary(), we can see that there are 10 duplicate entries in the data. Access all duplicates

>Bill_dupes<-Bill[Dupes_in_bill,]
>Bill_dupes

OUTPUT:

## cust_id act_active_dt package_id bill_zip customer_segment
## 134  8067122301     7/27/2008     608645   560001           G - TC
## 135  8067122304     7/27/2008     608645   560001           G - TC
## 7162 9243344257     9/24/2009     620693   560035           G - TC
## 7166 9243350958     9/24/2009     620693   560035           G - TC
## 7170 9243356981     9/24/2009     620693   560035           G - TC
## 7172 9243381865     9/24/2009     620693   560035           G - TC
## 7194 9243599440     9/24/2009     620693   560035           G - TC
## 8417 9243323468     9/24/2009     620693   560035           G - TC
## 8419 9243325951     9/24/2009     620693   560035           G - TC
## 8758 9243396747     9/24/2009     620693   560035           G - TC
##            status bill_amt
## 134        ACTIVE   619.41
## 135        ACTIVE   782.21
## 7162       ACTIVE     0.00
## 7166       ACTIVE     0.00
## 7170       ACTIVE     0.00
## 7172       ACTIVE     0.00
## 7194       ACTIVE     0.00
## 8417 DISCONNECTED     0.00
## 8419 DISCONNECTED     0.00
## 8758       ACTIVE     0.00

Bill_dupes contains all the duplicated entries in the data set. The user can easily identify these entries and take appropriate measures for the same.

Duplicates based on Key

The code above finds the duplicates in the overall data set. What if the user is not interested in overall level records?

There are situations when only a key variable is duplicated in the entire data set. So instead of using duplicated() on complete data set, duplicated() function can be used on a specific variable.

Let’s have a look at this example

>id<-c('C001', 'C002', 'C003', 'C001','C002')
>amount<-c(750, 600, 790, 750, 900)
>cust_sales_data<-data.frame(id, amount)
>cust_sales_data
##     id amount
## 1 C001    750
## 2 C002    600
## 3 C003    790
## 4 C001    750
## 5 C002    900

To the Above data, the records 1 and 4 are identical. They are recorded level duplicates and records 2 and 5 have the same id. So they are duplicates based on key id.

Duplicate based on Key

>Dupes_in_bill_cust_id<-duplicated(Bill$cust_id) 
>summary(Dupes_in_bill_cust_id)

 

##    Mode   FALSE    TRUE    NA's 
## logical    9389      73       0

Here the key variable is cust_id. Duplicates in cust_id are identified and the corresponding logical entry for the status of the duplication (TRUE or FALSE) is stored in Dupes_in_bill_cust_id. There are 73 duplicate entries in the data set, i.e., Cust_id is repeated 73 times in the data set. Access all duplicates

>Bill_dupes<-Bill[Dupes_in_bill_cust_id,]
>head(Bill_dupes)

 

## cust_id act_active_dt package_id bill_zip customer_segment status
## 134 8067122301     7/27/2008     608645   560001           G - TC ACTIVE
## 135 8067122304     7/27/2008     608645   560001           G - TC ACTIVE
## 561 9243538711     9/24/2009     625001   560035           G - TC ACTIVE
## 563 9243323496     9/24/2009     620693   560035           G - TC ACTIVE
## 565 9243323504     9/24/2009     620693   560035           G - TC ACTIVE
## 567 9243323511     9/24/2009     620693   560035           G - TC ACTIVE
##     bill_amt
## 134   619.41
## 135   782.21
## 561   561.80
## 563   545.98
## 565   254.76
## 567   545.98

Removing Duplicates

Duplicate entries can easily be identified and removed.For this, unique values need to be identified which is done by adding a “!” sign in front of the variable as shown below.

>Bill_unique<-Bill[!Dupes_in_bill,]
>summary(Bill_unique)
##     cust_id             act_active_dt    package_id        bill_zip     
##  Min.   :1.008e+09   7/17/2010 : 233   Min.   :   109   Min.   :   560  
##  1st Qu.:8.066e+09   10/27/2007: 224   1st Qu.:610487   1st Qu.:560005  
##  Median :9.243e+09   11/11/2009: 167   Median :614984   Median :560037  
##  Mean   :7.216e+09   11/27/2008: 152   Mean   :582327   Mean   :560575  
##  3rd Qu.:9.243e+09   9/24/2009 : 147   3rd Qu.:618402   3rd Qu.:560075  
##  Max.   :1.860e+10   5/7/2010  : 118   Max.   :630846   Max.   :600001  
##                      (Other)   :8411   NA's   :6                        
##  customer_segment          status        bill_amt         
##  G - TC :3594     ACTIVE      :9052   Min.   :-2058340.5  
##  B      :2251     DISCONNECTED:  54   1st Qu.:     244.7  
##  G      :1927     OGB         : 215   Median :     705.9  
##  S      :1135     TSP         :  21   Mean   :    3790.6  
##  N      : 410     TSP_90      :  11   3rd Qu.:    1472.3  
##         : 104     TSP_IC      :  99   Max.   : 2498793.8  
##  (Other):  31

Dupes_in_bill contains all the logical entries (TRUE or FALSE) corresponding to the duplicate data, i.e., if the data is duplicate then FALSE is stored at that index and if the data is not duplicated then TRUE is stored in the particular index. “!” Sign would convert all the FALSE values to TRUE, hence the data corresponding to TRUE index will only be stored in BILL_uniquee. Only unique values are stored and duplicate values are filtered out.

Unique()

Duplicate entries can also be filtered out using Unique() This function would return only the unique values. Consider the example below.

>Bill_unique1<-unique(Bill)
>nrow(Bill)
>nrow(Bill_unique1)

## [1] 9462
## [1] 9452

The number of rows in Bill is different from the number of rows in Bill_unique1, which shows that all the duplicate values are filtered out.

In the next post we will see An Example of Handling Duplicates.

20th January 2017

0 responses on "103.2.6 Handling Duplicates"

Leave a Message

Blog Posts

Hurry up!!!

"use coupon code for FLAT 30% discount"  datascientistoffer        ___________________________________      Subscribe to our youtube channel. Get access to video tutorials.                

Contact Us

Statinfer Software Solutions#647 2nd floor 1st Main, Indira Nagar 1st Stage, 100 feet road,Indranagar Bangalore,Karnataka, Pin code:-560038 Landmarks: Opp. Namma Metro Pillar 48.

Connect with us

linkin fn twitter g

How to become a Data Scientist.?

top