In previous post we saw An Example of Sorting the Data
Duplicates can be same or similar based on the key (more than one) entries. Like, one customer purchasing multiple times in a single day from the same store. In R, we have a function called duplicated() which can be used to find repeated entries in the dataset.We get the index of duplicated records using duplicated(). Then we can use those indexes to select the duplicated records. There are two types of duplicate records: Duplicates in the whole record and duplicates based on the key. Overall record level duplicates
>Bill <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Telecom Data Analysis\\Bill.csv") >Dupes_in_bill<-duplicated(Bill) >length(Dupes_in_bill) >head(Dupes_in_bill) >summary(Dupes_in_bill)
## [1] 9462 ## [1] FALSE FALSE FALSE FALSE FALSE FALSE ## Mode FALSE TRUE NA's ## logical 9452 10 0
In the above code, duplicated(Bill) would identify all the duplicate entries in the data set and store a logical entry for the same, i.e. If the data has no duplicate values, FALSE entry is stored in Dupes_in_Bill for that data entry, else TRUE. Using summary(), we can see that there are 10 duplicate entries in the data. Access all duplicates
>Bill_dupes<-Bill[Dupes_in_bill,] >Bill_dupes
OUTPUT:
## cust_id act_active_dt package_id bill_zip customer_segment ## 134 8067122301 7/27/2008 608645 560001 G - TC ## 135 8067122304 7/27/2008 608645 560001 G - TC ## 7162 9243344257 9/24/2009 620693 560035 G - TC ## 7166 9243350958 9/24/2009 620693 560035 G - TC ## 7170 9243356981 9/24/2009 620693 560035 G - TC ## 7172 9243381865 9/24/2009 620693 560035 G - TC ## 7194 9243599440 9/24/2009 620693 560035 G - TC ## 8417 9243323468 9/24/2009 620693 560035 G - TC ## 8419 9243325951 9/24/2009 620693 560035 G - TC ## 8758 9243396747 9/24/2009 620693 560035 G - TC ## status bill_amt ## 134 ACTIVE 619.41 ## 135 ACTIVE 782.21 ## 7162 ACTIVE 0.00 ## 7166 ACTIVE 0.00 ## 7170 ACTIVE 0.00 ## 7172 ACTIVE 0.00 ## 7194 ACTIVE 0.00 ## 8417 DISCONNECTED 0.00 ## 8419 DISCONNECTED 0.00 ## 8758 ACTIVE 0.00
Bill_dupes contains all the duplicated entries in the data set. The user can easily identify these entries and take appropriate measures for the same.
Duplicates based on Key
The code above finds the duplicates in the overall data set. What if the user is not interested in overall level records?
There are situations when only a key variable is duplicated in the entire data set. So instead of using duplicated() on complete data set, duplicated() function can be used on a specific variable.
Let’s have a look at this example
>id<-c('C001', 'C002', 'C003', 'C001','C002') >amount<-c(750, 600, 790, 750, 900) >cust_sales_data<-data.frame(id, amount) >cust_sales_data
## id amount ## 1 C001 750 ## 2 C002 600 ## 3 C003 790 ## 4 C001 750 ## 5 C002 900
To the Above data, the records 1 and 4 are identical. They are recorded level duplicates and records 2 and 5 have the same id. So they are duplicates based on key id.
Duplicate based on Key
>Dupes_in_bill_cust_id<-duplicated(Bill$cust_id) >summary(Dupes_in_bill_cust_id)
## Mode FALSE TRUE NA's ## logical 9389 73 0
Here the key variable is cust_id. Duplicates in cust_id are identified and the corresponding logical entry for the status of the duplication (TRUE or FALSE) is stored in Dupes_in_bill_cust_id. There are 73 duplicate entries in the data set, i.e., Cust_id is repeated 73 times in the data set. Access all duplicates
>Bill_dupes<-Bill[Dupes_in_bill_cust_id,] >head(Bill_dupes)
## cust_id act_active_dt package_id bill_zip customer_segment status ## 134 8067122301 7/27/2008 608645 560001 G - TC ACTIVE ## 135 8067122304 7/27/2008 608645 560001 G - TC ACTIVE ## 561 9243538711 9/24/2009 625001 560035 G - TC ACTIVE ## 563 9243323496 9/24/2009 620693 560035 G - TC ACTIVE ## 565 9243323504 9/24/2009 620693 560035 G - TC ACTIVE ## 567 9243323511 9/24/2009 620693 560035 G - TC ACTIVE ## bill_amt ## 134 619.41 ## 135 782.21 ## 561 561.80 ## 563 545.98 ## 565 254.76 ## 567 545.98
Removing Duplicates
Duplicate entries can easily be identified and removed.For this, unique values need to be identified which is done by adding a “!” sign in front of the variable as shown below.
>Bill_unique<-Bill[!Dupes_in_bill,] >summary(Bill_unique)
## cust_id act_active_dt package_id bill_zip ## Min. :1.008e+09 7/17/2010 : 233 Min. : 109 Min. : 560 ## 1st Qu.:8.066e+09 10/27/2007: 224 1st Qu.:610487 1st Qu.:560005 ## Median :9.243e+09 11/11/2009: 167 Median :614984 Median :560037 ## Mean :7.216e+09 11/27/2008: 152 Mean :582327 Mean :560575 ## 3rd Qu.:9.243e+09 9/24/2009 : 147 3rd Qu.:618402 3rd Qu.:560075 ## Max. :1.860e+10 5/7/2010 : 118 Max. :630846 Max. :600001 ## (Other) :8411 NA's :6 ## customer_segment status bill_amt ## G - TC :3594 ACTIVE :9052 Min. :-2058340.5 ## B :2251 DISCONNECTED: 54 1st Qu.: 244.7 ## G :1927 OGB : 215 Median : 705.9 ## S :1135 TSP : 21 Mean : 3790.6 ## N : 410 TSP_90 : 11 3rd Qu.: 1472.3 ## : 104 TSP_IC : 99 Max. : 2498793.8 ## (Other): 31
Dupes_in_bill contains all the logical entries (TRUE or FALSE) corresponding to the duplicate data, i.e., if the data is duplicate then FALSE is stored at that index and if the data is not duplicated then TRUE is stored in the particular index. “!” Sign would convert all the FALSE values to TRUE, hence the data corresponding to TRUE index will only be stored in BILL_uniquee. Only unique values are stored and duplicate values are filtered out.
Unique()
Duplicate entries can also be filtered out using Unique() This function would return only the unique values. Consider the example below.
>Bill_unique1<-unique(Bill) >nrow(Bill) >nrow(Bill_unique1)
## [1] 9462
## [1] 9452
The number of rows in Bill is different from the number of rows in Bill_unique1, which shows that all the duplicate values are filtered out.
In the next post we will see An Example of Handling Duplicates.