
Handout – Data Handling in R

Introduction

Data handling is one of the most important skills for any data analyst. Before jumping into the analysis of data, we may have to import, curate and validate the data. Sometimes we may have to create a new variable in the data; sometimes we may have to merge data from multiple sources into one consolidated dataset, and so on. In this session we will discuss the most useful data handling techniques. So let us have a look at the contents we are going to discuss.

Contents

  • Data importing from files
  • Database server connections
  • Working with datasets
  • Manipulating the datasets in R
  • Creating new variables in R
  • Sorting in R & Removing Duplicates
  • Exporting the R datasets into external files
  • Data Merging
  • Conclusion

Importing The Data

Data is stored in many different formats, such as CSV (Comma Separated Values), Excel, SQL databases, and many others. Various statistical tools are available for critical analysis of the data. R is a highly flexible, open-source tool that integrates with all of these storage formats, which makes it easy to operate on and handle data in R. Data handling is a very important skill for a data analyst: it involves processes like importing, curation, validation and exploration of data. Hence, before moving to the analysis of data, it is very important to become well versed with handling of data. During data preparation, one may have to create a new variable, or merge data from multiple sources and sort it further. In this session we will discuss the most useful data handling techniques. There are different in-built and external libraries which make it possible to integrate R with formats such as CSV, Excel and SQL databases. Because of this flexibility, it is very easy to import a dataset stored in any of the above-mentioned formats into R for further analysis. Not only can one import a dataset, one can also export datasets from R into these formats. The coming sections cover the various methods available to import data into R.

Data import from CSV files

CSV (Comma Separated Values) is one of the most commonly used data formats, and R is fully compatible with it. By calling the simple function read.csv(), one can easily import the data into R. A note of caution: while defining the path of the CSV file in the read.csv() function, the user must use either “/” or “\\” in the path. The Windows style of path with a single “\” doesn’t work in R. To understand how read.csv() works, a sample program that reads sales-related data from Sales_sample.csv is shown below. The function read.csv() imports the data from the CSV file (whose path is passed as the argument) into R.

**Note:** In the syntax below, we use \\ instead of \.

Note: In the syntax below, you need to replace dataset path with your local path.

Sales<-read.csv("~\\R dataset\\Superstore Sales Data\\Sales_sample.csv")

Sales


Data import from SAS files

SAS (Statistical Analysis System) is a highly specialized and famous tool used to perform data analytics. R can read data stored in SAS's sas7bdat format, but an external library needs to be installed first. To install the package, use the command install.packages("sas7bdat"); to load it into the current session, use library(sas7bdat). Once the library is loaded, the user can import the data by calling the function read.sas7bdat(). Just as the path of the file was passed to read.csv(), the path of the sas7bdat file is passed as an argument to this function.

library(sas7bdat)
gnpdata <- read.sas7bdat("~\\R dataset\\SAS datasets\\gnp.sas7bdat")
View(gnpdata)

Data import from Excel files

Excel is one of the most widely used tools for reporting and ad-hoc data analysis. Excel files mostly contain aggregated data, and sometimes even raw data. The function read.xlsx() from the xlsx package can be used to import the data from the desired file into R. The spreadsheet is read through this function and stored in a data frame. Here’s an example of how to read data from Excel into R.

library(xlsx)
wb_data <- read.xlsx("~\\Datasets\\World Bank Data\\World Bank Indicators.xlsx" , sheetName="Data by country")

There’s another way through which Excel data can be imported into R. The function loadWorkbook() from the XLConnect package reads the complete workbook, and particular worksheets can then be imported with the function readWorksheet(). Java needs to be installed before using XLConnect, since the package depends on it.

library(XLConnect)
wb_data <- readWorksheet(loadWorkbook("~\\R dataset\\World Bank Data\\World Bank Indicators.xlsx" ),sheet=1)

You may run into Java-related issues while importing Excel files. An Excel file is not just a flat file like a CSV; apart from the data itself, it contains lots of other information, such as formatting and indexes. If you run into errors even after installing the necessary packages, first save the Excel data in CSV format and then import it.
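As a sketch of that fallback (the path here is hypothetical; the worksheet is first saved as CSV from Excel itself via File > Save As):

```r
# After exporting the worksheet from Excel as CSV,
# read it with base read.csv() -- no Java or extra packages required.
wb_csv <- read.csv("~\\Datasets\\World Bank Data\\World Bank Indicators.csv")
head(wb_csv)
```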

Connecting to Database server

Importing Data from SQL

Data stored on servers can be accessed through SQL (Structured Query Language), and R is compatible with these databases. For SQL Server, this is done through ODBC. However, a few steps need to be followed to link R with the database. First, create an ODBC data source for the SQL Server. Once the ODBC source is created, you need to connect it with R. This is done through the RODBC package, so one first needs to install RODBC.

  install.packages("RODBC")
  

Once the package is installed, it is loaded into the current workspace by calling library(RODBC). This loads the functions available in the RODBC package into the current workspace for further use. The following syntax shows how to load the package and connect R with the database.

library(RODBC)
conn <-odbcConnect("dblink", uid="uid", pwd="passwd")
salesdata <- sqlFetch(conn, "sales")
ordersdat<- sqlQuery(conn, "select * from orders")
close(conn)
dbhandle <- odbcDriverConnect('driver={SQL Server}; server=MyServerName; database=MyDbName; trusted_connection=true')

In the above commands, replace “MyServerName” and “MyDbName” with your server and database names. Make sure that you have access to that server and database before connecting. Once the data is imported, the connection is closed using the function close(). The sample code below shows how to execute SQL queries from R over the connection.

spend_row_count<-sqlQuery(dbhandle, "select count(*) from MyTable") 

Working With Datasets

Once the dataset is imported/connected, there are some basic validation checks that we may have to conduct before going further with the analysis.

  1. Is the data imported correctly?
  2. Are the variables imported in the right format?
  3. Did we import all the rows and columns?

It is always a good idea to do some basic checks to get an idea of the dataset. Just printing the data is not always a good option: sometimes datasets are huge, and printing them gives us no additional benefit. It is good practice to check the number of rows and columns, take a quick look at the variable structures, and inspect a data snapshot and summary.

Importing, exploring, validating and sanitizing the data are as important as the data analysis itself. It is important to check whether any data is missing, or whether any outlier is lying in the dataset. For example, if an age is recorded as 1200, then there has been a problem in recording or importing the data. Hence fixing the imported data is equally important. One can validate the dataset by verifying its structure: the number of rows and columns, the format of the data, the limits of the values taken by the data columns, or even just a summary of the same. Here are some basic commands that you may have to run right after importing the data into R.
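The age-recorded-as-1200 example above can be sketched as a quick validation pass (the survey data frame and its columns are hypothetical, invented only for illustration):

```r
# Hypothetical survey data with one mis-recorded age and one missing value
survey <- data.frame(name = c("Asha", "Ravi", "Meena"),
                     age  = c(34, 1200, NA))

sum(is.na(survey$age))             # count of missing ages
range(survey$age, na.rm = TRUE)    # min and max expose the outlier 1200
survey[which(survey$age > 120), ]  # rows with impossible ages, for review
```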

Basic Commands on Datasets

Once the dataset is inside R, we would like to do some basic checks to get an idea of it: Is the data imported correctly? Are the variables in the right format? Did we import all the rows? Just printing the data is not always a good option. It is good practice to check the number of rows and columns, take a quick look at the variable structures, and inspect a summary and data snapshot. dim() This function returns the number of rows and columns, giving a quick overall idea of the data import. If we don’t have the number of rows and columns we expected, we can address the issue at that stage itself.

dim(Sales)

It can be seen that the dataset Sales has 11 rows and 7 columns. names() This function can be used to get or set the names of the variables in the dataset. Sometimes an import from text files does not pick up the column names, so we need to be careful while importing.

names(Sales)
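Since names() can also *set* variable names, a minimal sketch (the new name units_sold is purely illustrative, and the original name is restored afterwards):

```r
# Rename the unitsSold column, inspect, then restore the original name
names(Sales)[names(Sales) == "unitsSold"] <- "units_sold"
names(Sales)
names(Sales)[names(Sales) == "units_sold"] <- "unitsSold"
```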

In this code output we can see the names of the 7 columns in the dataset Sales. head() head() returns the first 6 observations of the dataset. This function is used to have a quick look at the first few records of the data. An optional argument n can be passed to head() to change the number of observations in the output.

head(Sales)

tail() tail() returns the last 6 observations of the dataset. This function is used as a substitute for head() wherever applicable.

tail(Sales)

The output above shows the last 6 observations of Sales. str() This is a powerful and important function which returns complete information about the structure of the dataset: the number of rows and columns, the names of the columns, the type or class of data contained in each column, and even a brief preview of the values contained in them.

str(Sales)

We can use this function to see the detailed summary of a single variable. $ can be used to specify the column whose data is to be verified.

str(Sales$unitsSold)

print()

print() can be used to display the dataset on the console itself. It works well if the dataset is small.
print(Sales)

View() The View() function gives a quick snapshot view of the dataset. The print function prints the dataset in the console; if there are many columns in the dataset, we may not get a good view of it that way. The View() function gives a better-formatted snapshot of the dataset.

View(Sales)

Check list after Import

Data: Superstore Sales Data/Sales_sample.csv

Code Description
dim(Sales) To check the number of rows and columns
names(Sales) What are the column names? Sometimes import doesn’t consider column names while importing
head(Sales) First few observations of data
tail(Sales) Last few observations of the data
str(Sales) Structure of the whole data, are all the variables in the expected format
str(Sales$unitsSold) Structure of few important variables
View(Sales) A snapshot of small datasets
print(Sales) Print the data if it is small

Other useful commands

A few more examples are stated in the table below. Give them a try and see if the output matches the description.

Code Description
str(Sales) Get an idea of numerical and non-numerical variables
summary(Sales) Overall data summary
summary(Sales$unitsSold) Get summary of numerical variables
table(Sales$salesChannel) Get frequency tables for categorical variables
table(Sales$custCountry) Get frequency tables for categorical variables
sum(is.na(Sales)) Missing value count in full data
sum(is.na(Sales$unitsSold)) Missing value count in a variable

Lab: Printing the data and meta info

Consider sales data for a Superstore which has departments in different countries; the dataset can easily be split by country. It contains all the information relating to the consumers' purchase attributes. You are to analyze the data by checking what attributes are covered, how big the sample is, and so on. Using this data, try to solve the questions below.

1. Import “Superstore Sales Data/Sales_by_country_v1.csv” data.

2. Perform the basic checks on the data.

3. How many rows and columns are there in this dataset?

4. Print only the column names in the dataset.

5. Print the first 10 observations.

6. Print the last 5 observations.

7. Get the summary of the dataset.

8. Print the structure of the data.

9. Describe the fields unitsSold and custCountry.

10. Create a new dataset by taking the first 30 observations from this data.

11. Print the resultant data.

12. Remove (delete) the new dataset.

Solutions

1. Import “Superstore Sales Data/Sales_by_country_v1.csv” data

Sales_data <- read.csv("~\\R dataset\\Superstore Sales Data\\Sales_by_country_v1.csv")
head(Sales_data)
##   custId     custName                  custCountry productSold
## 1  23262 Candice Levy                        Congo     SUPA101
## 2  23263 Xerxes Smith                       Panama     DETA200
## 3  23264 Levi Douglas Tanzania, United Republic of     DETA800
## 4  23265 Uriel Benton                 South Africa     SUPA104
## 5  23266 Celeste Pugh                        Gabon     PURA200
## 6  23267 Vance Campos         Syrian Arab Republic     PURA100
##   salesChannel unitsSold   dateSold
## 1       Retail       117 2012-08-09
## 2       Online        73 2012-07-06
## 3       Online       205 2012-08-18
## 4       Online        14 2012-08-05
## 5       Retail       170 2012-08-11
## 6       Retail       129 2012-07-11

2. Perform the basic checks on the data

dim(Sales_data)
## [1] 998   7

2. Perform the basic checks on the data

str(Sales_data)
## 'data.frame':    998 obs. of  7 variables:
##  $ custId      : int  23262 23263 23264 23265 23266 23267 23268 23269 23270 23271 ...
##  $ custName    : Factor w/ 998 levels "Aaron Edwards",..: 183 969 612 929 195 937 593 482 956 77 ...
##  $ custCountry : Factor w/ 233 levels "Afghanistan",..: 49 160 204 191 74 201 83 122 112 169 ...
##  $ productSold : Factor w/ 12 levels "DETA100","DETA200",..: 8 2 3 11 5 4 1 4 10 10 ...
##  $ salesChannel: Factor w/ 3 levels "Direct","Online",..: 3 2 2 2 3 3 3 3 2 3 ...
##  $ unitsSold   : int  117 73 205 14 170 129 82 116 67 125 ...
##  $ dateSold    : Factor w/ 464 levels "2011-01-02","2011-01-03",..: 446 416 454 442 448 421 422 386 388 434 ...
  2. Perform the basic checks on the data
head(Sales_data)
##   custId     custName                  custCountry productSold
## 1  23262 Candice Levy                        Congo     SUPA101
## 2  23263 Xerxes Smith                       Panama     DETA200
## 3  23264 Levi Douglas Tanzania, United Republic of     DETA800
## 4  23265 Uriel Benton                 South Africa     SUPA104
## 5  23266 Celeste Pugh                        Gabon     PURA200
## 6  23267 Vance Campos         Syrian Arab Republic     PURA100
##   salesChannel unitsSold   dateSold
## 1       Retail       117 2012-08-09
## 2       Online        73 2012-07-06
## 3       Online       205 2012-08-18
## 4       Online        14 2012-08-05
## 5       Retail       170 2012-08-11
## 6       Retail       129 2012-07-11
  2. Perform the basic checks on the data
tail(Sales_data)
##     custId       custName             custCountry productSold salesChannel
## 993  24254   Anika Alford                  Belize     DETA800       Online
## 994  24255      Ethan Day              Tajikistan     DETA100       Online
## 995  24256     Quail Knox                   Tonga     PURA500       Retail
## 996  24257 Noelle Sargent                 Ireland     DETA800       Direct
## 997  24258  Kuame Wallace              Montserrat     SUPA103       Online
## 998  24259  Lester Fisher Cocos (Keeling) Islands     PURA500       Direct
##     unitsSold   dateSold
## 993         6 2011-07-08
## 994       189 2011-01-09
## 995        43 2011-05-08
## 996        17 2011-02-04
## 997        80 2011-01-13
## 998       138 2011-08-10
  3. How many rows and columns are there in this dataset?
dim(Sales_data)
## [1] 998   7
  4. Print only the column names in the dataset
names(Sales_data)
## [1] "custId"       "custName"     "custCountry"  "productSold" 
## [5] "salesChannel" "unitsSold"    "dateSold"
  5. Print the first 10 observations
head(Sales_data,n=10)
##    custId           custName                  custCountry productSold
## 1   23262       Candice Levy                        Congo     SUPA101
## 2   23263       Xerxes Smith                       Panama     DETA200
## 3   23264       Levi Douglas Tanzania, United Republic of     DETA800
## 4   23265       Uriel Benton                 South Africa     SUPA104
## 5   23266       Celeste Pugh                        Gabon     PURA200
## 6   23267       Vance Campos         Syrian Arab Republic     PURA100
## 7   23268       Latifah Wall                   Guadeloupe     DETA100
## 8   23269     Jane Hernandez                    Macedonia     PURA100
## 9   23270        Wanda Garza                   Kyrgyzstan     SUPA103
## 10  23271 Athena Fitzpatrick                      Reunion     SUPA103
##    salesChannel unitsSold   dateSold
## 1        Retail       117 2012-08-09
## 2        Online        73 2012-07-06
## 3        Online       205 2012-08-18
## 4        Online        14 2012-08-05
## 5        Retail       170 2012-08-11
## 6        Retail       129 2012-07-11
## 7        Retail        82 2012-07-12
## 8        Retail       116 2012-06-03
## 9        Online        67 2012-06-07
## 10       Retail       125 2012-07-27
  6. Print the last 5 observations
tail(Sales_data,n=5)
##     custId       custName             custCountry productSold salesChannel
## 994  24255      Ethan Day              Tajikistan     DETA100       Online
## 995  24256     Quail Knox                   Tonga     PURA500       Retail
## 996  24257 Noelle Sargent                 Ireland     DETA800       Direct
## 997  24258  Kuame Wallace              Montserrat     SUPA103       Online
## 998  24259  Lester Fisher Cocos (Keeling) Islands     PURA500       Direct
##     unitsSold   dateSold
## 994       189 2011-01-09
## 995        43 2011-05-08
## 996        17 2011-02-04
## 997        80 2011-01-13
## 998       138 2011-08-10
  7. Get the summary of the dataset
summary(Sales_data)
##      custId                    custName          custCountry 
##  Min.   :23262   Aaron Edwards     :  1   Denmark      : 10  
##  1st Qu.:23511   Abigail Cunningham:  1   Swaziland    : 10  
##  Median :23761   Abraham Mcguire   :  1   Turkey       : 10  
##  Mean   :23761   Acton Mendoza     :  1   Azerbaijan   :  9  
##  3rd Qu.:24010   Acton Ratliff     :  1   Bouvet Island:  9  
##  Max.   :24259   Adam Blackburn    :  1   Nauru        :  9  
##                  (Other)           :992   (Other)      :941  
##   productSold  salesChannel   unitsSold            dateSold  
##  PURA100:112   Direct: 91   Min.   :  1.00   2011-11-11:  7  
##  SUPA103: 90   Online:511   1st Qu.: 52.25   2012-05-15:  7  
##  DETA800: 89   Retail:396   Median :111.00   2012-01-08:  6  
##  DETA100: 87                Mean   :108.26   2012-02-20:  6  
##  SUPA102: 86                3rd Qu.:163.00   2012-04-06:  6  
##  PURA200: 84                Max.   :212.00   2012-04-21:  6  
##  (Other):450                                 (Other)   :960
  8. Print the structure of the data
str(Sales_data)
## 'data.frame':    998 obs. of  7 variables:
##  $ custId      : int  23262 23263 23264 23265 23266 23267 23268 23269 23270 23271 ...
##  $ custName    : Factor w/ 998 levels "Aaron Edwards",..: 183 969 612 929 195 937 593 482 956 77 ...
##  $ custCountry : Factor w/ 233 levels "Afghanistan",..: 49 160 204 191 74 201 83 122 112 169 ...
##  $ productSold : Factor w/ 12 levels "DETA100","DETA200",..: 8 2 3 11 5 4 1 4 10 10 ...
##  $ salesChannel: Factor w/ 3 levels "Direct","Online",..: 3 2 2 2 3 3 3 3 2 3 ...
##  $ unitsSold   : int  117 73 205 14 170 129 82 116 67 125 ...
##  $ dateSold    : Factor w/ 464 levels "2011-01-02","2011-01-03",..: 446 416 454 442 448 421 422 386 388 434 ...
  9. Describe the fields unitsSold and custCountry
str(Sales_data$unitsSold)
##  int [1:998] 117 73 205 14 170 129 82 116 67 125 ...
str(Sales_data$custCountry)
##  Factor w/ 233 levels "Afghanistan",..: 49 160 204 191 74 201 83 122 112 169 ...
  10. Create a new dataset by taking the first 30 observations from this data
Sales_data_new <- Sales_data[c(1:30),]
dim(Sales_data_new)
## [1] 30  7
  11. Print the resultant data
print(Sales_data_new)
##    custId           custName                  custCountry productSold
## 1   23262       Candice Levy                        Congo     SUPA101
## 2   23263       Xerxes Smith                       Panama     DETA200
## 3   23264       Levi Douglas Tanzania, United Republic of     DETA800
## 4   23265       Uriel Benton                 South Africa     SUPA104
## 5   23266       Celeste Pugh                        Gabon     PURA200
## 6   23267       Vance Campos         Syrian Arab Republic     PURA100
## 7   23268       Latifah Wall                   Guadeloupe     DETA100
## 8   23269     Jane Hernandez                    Macedonia     PURA100
## 9   23270        Wanda Garza                   Kyrgyzstan     SUPA103
## 10  23271 Athena Fitzpatrick                      Reunion     SUPA103
## 11  23272      Anjolie Hicks     Turks and Caicos Islands     DETA200
## 12  23273       Isaac Cooper         Netherlands Antilles     SUPA104
## 13  23274        Asher Weber                    Macedonia     PURA100
## 14  23275      Ethan Gregory                       Tuvalu     DETA800
## 15  23276      Hayes Rollins                        Nepal     PURA500
## 16  23277     MacKenzie Moss                         Oman     SUPA101
## 17  23278  Aphrodite Brennan                       Malawi     SUPA105
## 18  23279        Angela Wise                      Moldova     PURA100
## 19  23280      James Spencer                 Burkina Faso     SUPA103
## 20  23281      Adria Kaufman                Bouvet Island     SUPA102
## 21  23282     Amir Alexander                      Liberia     DETA100
## 22  23283         Lani Sweet                      Vanuatu     SUPA105
## 23  23284       Clark Weaver                        Palau     PURA250
## 24  23285   Leonard Cardenas                   Madagascar     SUPA102
## 25  23286      Renee Padilla                        Yemen     DETA800
## 26  23287        Joy Vazquez                        Korea     PURA250
## 27  23288        Ingrid Bush                   Montserrat     SUPA104
## 28  23289       Deacon Craig                     Mongolia     SUPA105
## 29  23290       Rama Goodwin                      Tunisia     DETA800
## 30  23291    Jelani Odonnell                      Albania     DETA800
##    salesChannel unitsSold   dateSold
## 1        Retail       117 2012-08-09
## 2        Online        73 2012-07-06
## 3        Online       205 2012-08-18
## 4        Online        14 2012-08-05
## 5        Retail       170 2012-08-11
## 6        Retail       129 2012-07-11
## 7        Retail        82 2012-07-12
## 8        Retail       116 2012-06-03
## 9        Online        67 2012-06-07
## 10       Retail       125 2012-07-27
## 11       Retail        71 2012-07-31
## 12       Retail        22 2012-08-13
## 13       Direct       153 2012-08-22
## 14       Retail       141 2012-07-04
## 15       Direct        65 2012-08-01
## 16       Online       157 2012-07-12
## 17       Online       197 2012-08-24
## 18       Direct        10 2012-06-21
## 19       Direct        30 2012-06-03
## 20       Online       134 2012-07-13
## 21       Online       100 2012-08-21
## 22       Online       142 2012-06-24
## 23       Online       135 2012-06-17
## 24       Online         9 2012-07-24
## 25       Retail        69 2012-08-08
## 26       Retail       189 2012-06-17
## 27       Direct       141 2012-06-14
## 28       Retail       166 2012-08-02
## 29       Online       170 2012-08-11
## 30       Retail       199 2012-08-18
  12. Remove (delete) the new dataset
rm(Sales_data_new)

Subsetting the Data

Subsetting data based on rows

Subsetting allows us to access a particular part of a complete dataset. If a user wants to explore the data of a particular column or row, subsetting can be very helpful. In R, users can access a particular part of the dataset by selecting specific rows and columns, as shown below.

Import dataset: GDP.csv

gdp<- read.csv("~\\R dataset\\World Bank Data\\GDP.csv")

New dataset with selected rows. To select a particular part of the dataset, you need to vary the number of rows and columns in the syntax [number of rows, number of columns].

gdp1<-gdp[1:10, ]
gdp1
##    Country_code Rank            Country      GDP
## 1           USA    1      United States 17419000
## 2           CHN    2              China 10354832
## 3           JPN    3              Japan  4601461
## 4           DEU    4            Germany  3868291
## 5           GBR    5     United Kingdom  2988893
## 6           FRA    6             France  2829192
## 7           BRA    7             Brazil  2346076
## 8           ITA    8              Italy  2141161
## 9           IND    9              India  2048517
## 10          RUS   10 Russian Federation  1860598

In the above code, the first 10 rows of the dataset are selected by using 1:10 (which means 1 to 10), and all the columns are selected. Hence, on printing gdp1, the first 10 rows of all the columns are displayed in the output. The other way to select specific rows is:

gdp2<- gdp[c(1,4,10,12), ]
gdp2
##    Country_code Rank            Country      GDP
## 1           USA    1      United States 17419000
## 4           DEU    4            Germany  3868291
## 10          RUS   10 Russian Federation  1860598
## 12          AUS   12          Australia  1454675

To select rows which are non-continuous or not in a sequence, c(row number 1, row number 2, …) can be used. In the code above, gdp[c(1,4,10,12), ] selects rows 1, 4, 10 and 12 and stores their data in gdp2.

Sub-setting with selected number of columns

Just as we can select particular rows from the dataset, columns can be selected too. To do so, you need to change the second index in the syntax [number of rows, number of columns]. A sample code is shown below:

gdp3<- gdp[, 2:4 ]
head(gdp3)
##   Rank        Country      GDP
## 1    1  United States 17419000
## 2    2          China 10354832
## 3    3          Japan  4601461
## 4    4        Germany  3868291
## 5    5 United Kingdom  2988893
## 6    6         France  2829192

The code above shows the data for columns 2 to 4 in the output. If all the rows need to be printed, all you need to do is remove the function head(). Another way of selecting columns is shown below:

gdp4<- gdp[, c(1,2,4)]
head(gdp4)
##   Country_code Rank      GDP
## 1          USA    1 17419000
## 2          CHN    2 10354832
## 3          JPN    3  4601461
## 4          DEU    4  3868291
## 5          GBR    5  2988893
## 6          FRA    6  2829192

Just as non-sequential rows are selected, non-sequential columns can be selected using the syntax c(column number 1, column number 2, …). The code above shows the data for the 1st, 2nd and 4th columns. One can even access particular columns by using the names of the columns, as shown below:

gdp5<- gdp[, c("Country", "Rank")]
head(gdp5)
##          Country Rank
## 1  United States    1
## 2          China    2
## 3          Japan    3
## 4        Germany    4
## 5 United Kingdom    5
## 6         France    6


Sub-setting with selected number of rows and columns

The syntax used above can also be used to select collectively the varying number of rows and columns. Sample code is shown below:

gdp6<-gdp[5:20,  c(1,2,4)]
gdp6
##    Country_code Rank     GDP
## 5           GBR    5 2988893
## 6           FRA    6 2829192
## 7           BRA    7 2346076
## 8           ITA    8 2141161
## 9           IND    9 2048517
## 10          RUS   10 1860598
## 11          CAN   11 1785387
## 12          AUS   12 1454675
## 13          KOR   13 1410383
## 14          ESP   14 1381342
## 15          MEX   15 1294690
## 16          IDN   16  888538
## 17          NLD   17  879319
## 18          TUR   18  798429
## 19          SAU   19  746249
## 20          CHE   20  701037

The code above accesses rows 5 to 20 and columns 1, 2 and 4. Another way to select a varying number of rows and columns is shown below. Note: in the code below, a negative sign is used in c(-3, -4). Using the negative sign drops the rows or columns that are mentioned with it.

gdp7<-gdp[1:5,  c(-3,-4)]
gdp7
##   Country_code Rank
## 1          USA    1
## 2          CHN    2
## 3          JPN    3
## 4          DEU    4
## 5          GBR    5

From the output, we can see that the 3rd and 4th columns are omitted in the output.

Lab: Subsetting the data

As a bank manager, you might like to know the quality of the customers the bank is catering to. This would help you in profiling the target customers. We may want to roll out some marketing offers to selected customers of the bank.

1. Import : “./Bank Marketing/bank_market.csv”

2. Create separate datasets for each of the below tasks
   a) Select the first 1000 rows only
   b) Select only four columns: “Cust_num”, “age”, “default” and “balance”
   c) Select observations 20,000 to 40,000 along with four variables: “Cust_num”, “job”, “marital” and “education”
   d) Select observations 5000 to 6000 and drop “poutcome” and “y”

Solutions

  • 1. Import : “./Bank Marketing/bank_market.csv”
bank <- read.csv("C:\\Users\\ajith\\Desktop\\statinfer\\R Programming\\R dataset\\Bank Tele Marketing\\bank_market.csv")
  • 2.a) Select first 1000 rows only
bank_modified1 <- bank[c(1:1000),]
dim(bank_modified1)
## [1] 1000   18
  • 2.b) Select only four columns “Cust_num” “age” “default” and “balance”
bank_modified2 <- bank[,c("Cust_num","age","default","balance")]
dim(bank_modified2)
## [1] 45211     4
  • 2.c) Select 20,000 to 40,000 observations along with four variables “Cust_num” “job” “marital” and “education”
bank_modified3 <- bank[c(20000:40000), c("Cust_num","job","marital","education")]
dim(bank_modified3)
## [1] 20001     4
  • 2.d) Select observations 5000 to 6000 and drop “poutcome” and “y”
bank_modified4 <- bank[c(5000:6000), !(colnames(bank) %in% c("poutcome","y"))]
dim(bank_modified4)
## [1] 1001   16

Subset with Variable Filter Conditions

  • Selection with a condition on variables
  • For example, selection of complaints where the budget is greater than $5000.
  • Need to use subset function in R
 newdata <- subset(old_data,condition1 | condition2)
  • And condition & filters
 bank_subset1<-subset(bank, age>40 &  loan=="no")
  • OR condition & filters
 bank_subset2<-subset(bank, age>40 | loan=="no")

Subset with Variable Filter Conditions

Subsetting can also be done using various filters. Say we have a dataset of 5000 employees, and we want to find all the employees who have a salary of more than 100,000 per month. By using filters, the user can find the records with salary more than 100,000. Conditional operators that can be used to filter datasets include the AND (&) operator, the OR (|) operator, the NOT (!) operator, greater than or less than (> or <), the equal-to (==) operator, and many others.

Syntax: new_data <- subset(old_data, condition1 | condition2)

Examples showing the use of these conditional operators follow. AND and OR conditions: with an AND condition, filtering keeps only the rows that satisfy both criteria. The sample code is shown below:
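The 5000-employee salary example described above can be sketched as follows (the employees data frame and its columns are hypothetical, made up for illustration):

```r
# Hypothetical employee data
employees <- data.frame(emp_id = 1:5,
                        dept   = c("Sales", "IT", "IT", "HR", "Sales"),
                        salary = c(85000, 120000, 95000, 150000, 101000))

# All employees earning more than 100,000 per month
high_earners <- subset(employees, salary > 100000)
high_earners
```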

bank_subset3<-subset(bank, (age>40 &  loan=="no") | marital=="single" )

AND, OR conditions, numeric and character filters & selected fields

market4<-subset(bank, (age>40 &  loan=="no") | marital=="single" , select=c (Cust_num , age ,   marital,    loan ))

AND, OR conditions, numeric and character filters & dropping a few fields

market5<-subset(bank, (age>40 &  loan=="no") | marital=="single" , select=-c (poutcome,y, balance, contact))

Lab: Subsetting

AutoDataset contains the specifications of a set of cars, including the engine type, the mileage, the fuel type, and other attributes. If a user wants to know how many cars have rpm above 5500, a convertible body style, and a price above 20000, subsetting extracts exactly those records. The following exercises will help in understanding the subset concepts.

  1. Import : “./Automobile Data Set/AutoDataset.csv”
  2. Create a new dataset for exclusively Toyota cars
  3. Create a new dataset for all cars with city.mpg greater than 30 and engine size less than 120
  4. Create a new dataset by taking only sedan cars. Keep only four variables (make, body style, fuel type, price) in the final dataset
  5. Create a new dataset by taking Audi, BMW or Porsche makes. Drop two variables from the resultant dataset (price and normalized losses)

Solutions

  1. Import : “./Automobile Data Set/AutoDataset.csv”
auto_data <- read.csv("~\\R dataset\\Automobile Data Set\\AutoDataset.csv")
  2. Create a new dataset for exclusively Toyota cars
toyota_data <- subset(auto_data, make == "toyota")
head(toyota_data)
##     symboling normalized.losses   make fuel.type aspiration num.of.doors
## 151         1                87 toyota       gas        std          two
## 152         1                87 toyota       gas        std          two
## 153         1                74 toyota       gas        std         four
## 154         0                77 toyota       gas        std         four
## 155         0                81 toyota       gas        std         four
## 156         0                91 toyota       gas        std         four
##     body.style drive.wheels engine.location wheel.base length width height
## 151  hatchback          fwd           front       95.7  158.7  63.6   54.5
## 152  hatchback          fwd           front       95.7  158.7  63.6   54.5
## 153  hatchback          fwd           front       95.7  158.7  63.6   54.5
## 154      wagon          fwd           front       95.7  169.7  63.6   59.1
## 155      wagon          4wd           front       95.7  169.7  63.6   59.1
## 156      wagon          4wd           front       95.7  169.7  63.6   59.1
##     curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## 151        1985         ohc             four          92        2bbl 3.05
## 152        2040         ohc             four          92        2bbl 3.05
## 153        2015         ohc             four          92        2bbl 3.05
## 154        2280         ohc             four          92        2bbl 3.05
## 155        2290         ohc             four          92        2bbl 3.05
## 156        3110         ohc             four          92        2bbl 3.05
##     stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg
## 151   3.03                 9         62     4800       35          39
## 152   3.03                 9         62     4800       31          38
## 153   3.03                 9         62     4800       31          38
## 154   3.03                 9         62     4800       31          37
## 155   3.03                 9         62     4800       27          32
## 156   3.03                 9         62     4800       27          32
##     price
## 151  5348
## 152  6338
## 153  6488
## 154  6918
## 155  7898
## 156  8778

3. Create a new dataset for all cars with city.mpg greater than 30 and engine size less than 120.

auto_data1 <- subset(auto_data, (city.mpg > 30) & (engine.size < 120))
head(auto_data1)
##    symboling normalized.losses      make fuel.type aspiration num.of.doors
## 19         2               121 chevrolet       gas        std          two
## 20         1                98 chevrolet       gas        std          two
## 21         0                81 chevrolet       gas        std         four
## 22         1               118     dodge       gas        std          two
## 23         1               118     dodge       gas        std          two
## 25         1               148     dodge       gas        std         four
##    body.style drive.wheels engine.location wheel.base length width height
## 19  hatchback          fwd           front       88.4  141.1  60.3   53.2
## 20  hatchback          fwd           front       94.5  155.9  63.6   52.0
## 21      sedan          fwd           front       94.5  158.8  63.6   52.0
## 22  hatchback          fwd           front       93.7  157.3  63.8   50.8
## 23  hatchback          fwd           front       93.7  157.3  63.8   50.8
## 25  hatchback          fwd           front       93.7  157.3  63.8   50.6
##    curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## 19        1488           l            three          61        2bbl 2.91
## 20        1874         ohc             four          90        2bbl 3.03
## 21        1909         ohc             four          90        2bbl 3.03
## 22        1876         ohc             four          90        2bbl 2.97
## 23        1876         ohc             four          90        2bbl 2.97
## 25        1967         ohc             four          90        2bbl 2.97
##    stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg price
## 19   3.03              9.50         48     5100       47          53  5151
## 20   3.11              9.60         70     5400       38          43  6295
## 21   3.11              9.60         70     5400       38          43  6575
## 22   3.23              9.41         68     5500       37          41  5572
## 23   3.23              9.40         68     5500       31          38  6377
## 25   3.23              9.40         68     5500       31          38  6229
  4. Create a new dataset by taking only sedan cars. Keep only four variables (make, body style, fuel type, price) in the final dataset.
auto_data2 <- subset(auto_data, body.style == "sedan" , select = c(make, body.style,fuel.type,price))
head(auto_data2)
##    make body.style fuel.type price
## 4  audi      sedan       gas 13950
## 5  audi      sedan       gas 17450
## 6  audi      sedan       gas 15250
## 7  audi      sedan       gas 17710
## 9  audi      sedan       gas 23875
## 11  bmw      sedan       gas 16430
  5. Create a new dataset by taking Audi, BMW or Porsche makes. Drop two variables from the resultant dataset (price and normalized losses).
auto_data3 <- subset(auto_data, (make == "audi") | (make == "bmw") | (make == "porsche"), select = c(-price, -normalized.losses)) 
head(auto_data3)
##   symboling make fuel.type aspiration num.of.doors body.style drive.wheels
## 4         2 audi       gas        std         four      sedan          fwd
## 5         2 audi       gas        std         four      sedan          4wd
## 6         2 audi       gas        std          two      sedan          fwd
## 7         1 audi       gas        std         four      sedan          fwd
## 8         1 audi       gas        std         four      wagon          fwd
## 9         1 audi       gas      turbo         four      sedan          fwd
##   engine.location wheel.base length width height curb.weight engine.type
## 4           front       99.8  176.6  66.2   54.3        2337         ohc
## 5           front       99.4  176.6  66.4   54.3        2824         ohc
## 6           front       99.8  177.3  66.3   53.1        2507         ohc
## 7           front      105.8  192.7  71.4   55.7        2844         ohc
## 8           front      105.8  192.7  71.4   55.7        2954         ohc
## 9           front      105.8  192.7  71.4   55.9        3086         ohc
##   num.of.cylinders engine.size fuel.system bore stroke compression.ratio
## 4             four         109        mpfi 3.19    3.4              10.0
## 5             five         136        mpfi 3.19    3.4               8.0
## 6             five         136        mpfi 3.19    3.4               8.5
## 7             five         136        mpfi 3.19    3.4               8.5
## 8             five         136        mpfi 3.19    3.4               8.5
## 9             five         131        mpfi 3.13    3.4               8.3
##   horsepower peak.rpm city.mpg highway.mpg
## 4        102     5500       24          30
## 5        115     5500       18          22
## 6        110     5500       19          25
## 7        110     5500       19          25
## 8        110     5500       19          25
## 9        140     5500       17          20

Calculated Fields in R

A new field can be added to an existing data set by using the $ operator. Consider the following syntax, where a new field sum is added (attached) to the dataset; its contents are the sum of two other variables, x1 and x2, in the dataset.

```r
dataset$sum <- dataset$x1 + dataset$x2
```

Average of two existing variables can be stored in a new variable in dataset, as shown below

```r
dataset$mean <- (dataset$x1 + dataset$x2)/2
```

However, assigning values this way is tedious because the user has to type the name of the dataset repeatedly. This can be shortened by attaching the dataset, after which the variables on the right-hand side can be referenced by name alone. Refer to the code below:

```r
attach(dataset)
dataset$sum <- x1 + x2
dataset$mean <- (x1 + x2)/2
detach(dataset)
```
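A small self-contained sketch of a calculated field, using a made-up data frame (scores, x1, x2 are illustrative names, not part of the course data):

```r
# Adding calculated fields to a small example data frame
scores <- data.frame(x1 = c(10, 20, 30), x2 = c(4, 6, 8))

scores$sum  <- scores$x1 + scores$x2        # new column: row-wise sum
scores$mean <- (scores$x1 + scores$x2) / 2  # new column: row-wise average

scores
```

Note that transform(scores, sum = x1 + x2) achieves the same result without repeating the data-frame name, and avoids the name-masking pitfalls of attach()/detach().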

Demo-Calculated Fields in R

Getting idea on size of the car in Auto Data

dim(auto_data)
## [1] 205  26
auto_data$area<-(auto_data$length)*(auto_data$width)*(auto_data$height)
dim(auto_data)
## [1] 205  27

As we can see, before creating the new variable area in AutoDataset the dimensions were 205 rows and 26 columns; after creating it, they are 205 rows and 27 columns. Hence a new variable area was added to AutoDataset.

Creating a new variable by reducing the balance by 20%:

bank$balance_new<-bank$balance*0.8
summary(bank)
##     Cust_num          age                 job           marital     
##  Min.   :    1   Min.   :18.00   blue-collar:9732   divorced: 5207  
##  1st Qu.:11304   1st Qu.:33.00   management :9458   married :27214  
##  Median :22606   Median :39.00   technician :7597   single  :12790  
##  Mean   :22606   Mean   :40.94   admin.     :5171                   
##  3rd Qu.:33909   3rd Qu.:48.00   services   :4154                   
##  Max.   :45211   Max.   :95.00   retired    :2264                   
##                                  (Other)    :6835                   
##      education     default        balance       housing      loan      
##  primary  : 6851   no :44396   Min.   : -8019   no :20081   no :37967  
##  secondary:23202   yes:  815   1st Qu.:    72   yes:25130   yes: 7244  
##  tertiary :13301               Median :   448                          
##  unknown  : 1857               Mean   :  1362                          
##                                3rd Qu.:  1428                          
##                                Max.   :102127                          
##                                                                        
##       contact           day            month          duration     
##  cellular :29285   Min.   : 1.00   may    :13766   Min.   :   0.0  
##  telephone: 2906   1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0  
##  unknown  :13020   Median :16.00   aug    : 6247   Median : 180.0  
##                    Mean   :15.81   jun    : 5341   Mean   : 258.2  
##                    3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0  
##                    Max.   :31.00   apr    : 2932   Max.   :4918.0  
##                                    (Other): 6060                   
##     campaign          pdays          previous           poutcome    
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   failure: 4901  
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840  
##  Median : 2.000   Median : -1.0   Median :  0.0000   success: 1511  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803   unknown:36959  
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                  
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                  
##                                                                     
##    y          balance_new     
##  no :39922   Min.   :-6415.2  
##  yes: 5289   1st Qu.:   57.6  
##              Median :  358.4  
##              Mean   : 1089.8  
##              3rd Qu.: 1142.4  
##              Max.   :81701.6  
## 

If then Else

The if-else statement in R is quite intuitive:

if (condition) { Syntax1 } else { Syntax2 }

ifelse() works the same way as if-else, but it operates on whole vectors and the syntax differs. The syntax for ifelse() is shown below.

Newvar<-ifelse( Condition, True Value, False Value) 

Replace missing horse power values by -1

# horsepower was read as text because of the "?" markers; convert before replacing,
# otherwise ifelse() would return the factor's internal integer codes
auto_data$horsepower_new <- ifelse(auto_data$horsepower == "?", -1,
                                   as.numeric(as.character(auto_data$horsepower)))

Replace missing in peak.rpm values by -1

auto_data$peak_rpm_new <- ifelse(auto_data$peak.rpm == "?", -1,
                                 as.numeric(as.character(auto_data$peak.rpm)))
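The same replacement can be sketched on a plain character vector; hp_raw here is a made-up example, not a column of the auto data:

```r
# Vectorised replacement of a "?" missing-value marker with ifelse()
hp_raw <- c("111", "?", "154", "?", "95")        # "?" marks missing values
hp_num <- suppressWarnings(as.numeric(hp_raw))   # "?" becomes NA on conversion
hp_new <- ifelse(is.na(hp_num), -1, hp_num)      # replace NA with -1
hp_new
```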

LAB: If then Else

Conditional operators are very helpful in deriving new variables from a data set. Say, as the manager of a multi-store retail chain, you want to flag all the customers who are paying more than a certain amount, or find the customers who purchase in bulk quantities. Applying conditional logic to the Online Retail.csv data set helps in making these concepts clear.

  1. Import “./Online Retail Sales Data/Online Retail.csv”
  2. Create a new variable Quantity indicator. If Quantity is more than 200 the Quantity indicator =“high” else “low”
  • 100 – “Low”
  • 450 – “High”
  3. Create a new variable Price Class. If price is less than 1 then “Low”, between 1 and 80 then “Medium”, and more than 80 then “High”

Solutions

  1. Import “./Online Retail Sales Data/Online Retail.csv”
Online_Retail  <- read.csv("~\\R dataset\\Online Retail Sales Data\\Online Retail.csv")
head(Online_Retail)
##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##      InvoiceDate UnitPrice CustomerID        Country
## 1 12/1/2010 8:26      2.55      17850 United Kingdom
## 2 12/1/2010 8:26      3.39      17850 United Kingdom
## 3 12/1/2010 8:26      2.75      17850 United Kingdom
## 4 12/1/2010 8:26      3.39      17850 United Kingdom
## 5 12/1/2010 8:26      3.39      17850 United Kingdom
## 6 12/1/2010 8:26      7.65      17850 United Kingdom

2. Create a new variable Quantity indicator: if Quantity is more than 200 then “High”, else “Low” (e.g. 100 – “Low”, 450 – “High”)

Online_Retail$Quantity_indicator <- ifelse(Online_Retail$Quantity > 200, "High", "Low")
head(Online_Retail)
##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##      InvoiceDate UnitPrice CustomerID        Country Quantity_indicator
## 1 12/1/2010 8:26      2.55      17850 United Kingdom                Low
## 2 12/1/2010 8:26      3.39      17850 United Kingdom                Low
## 3 12/1/2010 8:26      2.75      17850 United Kingdom                Low
## 4 12/1/2010 8:26      3.39      17850 United Kingdom                Low
## 5 12/1/2010 8:26      3.39      17850 United Kingdom                Low
## 6 12/1/2010 8:26      7.65      17850 United Kingdom                Low

3. Create a new variable Price Class: if price is less than 1 then “Low”, between 1 and 80 then “Medium”, and more than 80 then “High”

Online_Retail$Price_Class  <- ifelse(Online_Retail$UnitPrice < 1,"Low", ifelse(Online_Retail$UnitPrice > 80, "High", "Medium"))

head(Online_Retail)
##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##      InvoiceDate UnitPrice CustomerID        Country Quantity_indicator
## 1 12/1/2010 8:26      2.55      17850 United Kingdom                Low
## 2 12/1/2010 8:26      3.39      17850 United Kingdom                Low
## 3 12/1/2010 8:26      2.75      17850 United Kingdom                Low
## 4 12/1/2010 8:26      3.39      17850 United Kingdom                Low
## 5 12/1/2010 8:26      3.39      17850 United Kingdom                Low
## 6 12/1/2010 8:26      7.65      17850 United Kingdom                Low
##   Price_Class
## 1      Medium
## 2      Medium
## 3      Medium
## 4      Medium
## 5      Medium
## 6      Medium

Sorting the Data

Sorting can be considered a fundamental part of data analysis. A user might want to sort names in alphabetical order, or sort income data to find the highest earners; sorting arranges the data in ascending or descending order. Consider an example: a bank wants to find the highest tax payers in a city. It sorts the data on income in descending order and then selects the top entries from the list. In R, the in-built function order() sorts the data in ascending or descending order. The syntax for sorting a data frame is

Newdata<-olddata[order(variables), ]   

It is ascending by default.
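A quick self-contained illustration on the built-in mtcars data:

```r
# Sorting a data frame with order(): ascending by default
data(mtcars)
by_mpg <- mtcars[order(mtcars$mpg), ]   # rows reordered, lowest mileage first
head(by_mpg$mpg)
```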

Online_Retail_sort<-Online_Retail[order(Online_Retail$UnitPrice),]
head(Online_Retail_sort)
##        InvoiceNo StockCode     Description Quantity     InvoiceDate
## 299984   A563186         B Adjust bad debt        1 8/12/2011 14:51
## 299985   A563187         B Adjust bad debt        1 8/12/2011 14:52
## 623       536414     22139                       56 12/1/2010 11:52
## 1971      536545     21134                        1 12/1/2010 14:32
## 1972      536546     22145                        1 12/1/2010 14:33
## 1973      536547     37509                        1 12/1/2010 14:33
##        UnitPrice CustomerID        Country Quantity_indicator Price_Class
## 299984 -11062.06         NA United Kingdom                Low         Low
## 299985 -11062.06         NA United Kingdom                Low         Low
## 623         0.00         NA United Kingdom                Low         Low
## 1971        0.00         NA United Kingdom                Low         Low
## 1972        0.00         NA United Kingdom                Low         Low
## 1973        0.00         NA United Kingdom                Low         Low

Use a minus sign (-) before the variable to sort in descending order.

Online_Retail_sort1<-Online_Retail[order(- Online_Retail$UnitPrice),]
head(Online_Retail_sort1)
##        InvoiceNo StockCode Description Quantity     InvoiceDate UnitPrice
## 222682   C556445         M      Manual       -1 6/10/2011 15:31  38970.00
## 524603   C580605 AMAZONFEE  AMAZON FEE       -1 12/5/2011 11:36  17836.46
## 43703    C540117 AMAZONFEE  AMAZON FEE       -1   1/5/2011 9:55  16888.02
## 43704    C540118 AMAZONFEE  AMAZON FEE       -1   1/5/2011 9:57  16453.71
## 15017    C537630 AMAZONFEE  AMAZON FEE       -1 12/7/2010 15:04  13541.33
## 15018     537632 AMAZONFEE  AMAZON FEE        1 12/7/2010 15:08  13541.33
##        CustomerID        Country Quantity_indicator Price_Class
## 222682      15098 United Kingdom                Low        High
## 524603         NA United Kingdom                Low        High
## 43703          NA United Kingdom                Low        High
## 43704          NA United Kingdom                Low        High
## 15017          NA United Kingdom                Low        High
## 15018          NA United Kingdom                Low        High

Sorting based on multiple variables

Sorting can also be done on two variables of a data set simultaneously. In the code below, Country is sorted in ascending (alphabetical) order, and whenever there are ties in Country, Quantity is sorted in ascending (numeric) order. The result is stored in Online_Retail_sort2, the output of which can be seen below.

Online_Retail_sort2<-Online_Retail[order(Online_Retail$Country,Online_Retail$Quantity),]
head(Online_Retail_sort2)
##        InvoiceNo StockCode                      Description Quantity
## 433988   C574019     22738        RIBBON REEL SNOWY VILLAGE     -120
## 170900   C551348     22990       COTTON APRON PANTRY DESIGN     -100
## 108304   C545525     22969     HOMEMADE JAM SCENTED CANDLES      -86
## 270094   C560540     22492          MINI PAINT SET VINTAGE       -36
## 279047   C561227     84978 HANGING HEART JAR T-LIGHT HOLDER      -36
## 108305   C545525     22697  GREEN REGENCY TEACUP AND SAUCER      -24
##            InvoiceDate UnitPrice CustomerID   Country Quantity_indicator
## 433988 11/2/2011 12:05      1.45      12415 Australia                Low
## 170900  4/28/2011 9:49      4.25      12415 Australia                Low
## 108304  3/3/2011 13:11      1.25      12415 Australia                Low
## 270094 7/19/2011 12:26      0.65      12415 Australia                Low
## 279047 7/26/2011 10:15      1.06      12431 Australia                Low
## 108305  3/3/2011 13:11      2.55      12415 Australia                Low
##        Price_Class
## 433988      Medium
## 170900      Medium
## 108304      Medium
## 270094         Low
## 279047      Medium
## 108305      Medium
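A two-key sort can also mix directions by negating a numeric key. A self-contained sketch on the built-in mtcars data, sorting cyl ascending and mpg descending within each cylinder group:

```r
# Two-key sort: cyl ascending, then mpg descending within each cyl group
data(mtcars)
sorted <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(sorted[, c("cyl", "mpg")])
```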

LAB: Sorting the data

  1. Import AutoDataset
  2. Sort the dataset based on length
  3. Sort the dataset based on length descending

Solutions

  1. Import AutoDataset
auto_data <- read.csv("~\\R dataset\\Automobile Data Set\\AutoDataset.csv")
  2. Sort the dataset based on length
auto_data1 <- auto_data[order(auto_data$length),]
head(auto_data1)
##    symboling normalized.losses      make fuel.type aspiration num.of.doors
## 19         2               121 chevrolet       gas        std          two
## 31         2               137     honda       gas        std          two
## 32         2               137     honda       gas        std          two
## 33         1               101     honda       gas        std          two
## 34         1               101     honda       gas        std          two
## 35         1               101     honda       gas        std          two
##    body.style drive.wheels engine.location wheel.base length width height
## 19  hatchback          fwd           front       88.4  141.1  60.3   53.2
## 31  hatchback          fwd           front       86.6  144.6  63.9   50.8
## 32  hatchback          fwd           front       86.6  144.6  63.9   50.8
## 33  hatchback          fwd           front       93.7  150.0  64.0   52.6
## 34  hatchback          fwd           front       93.7  150.0  64.0   52.6
## 35  hatchback          fwd           front       93.7  150.0  64.0   52.6
##    curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## 19        1488           l            three          61        2bbl 2.91
## 31        1713         ohc             four          92        1bbl 2.91
## 32        1819         ohc             four          92        1bbl 2.91
## 33        1837         ohc             four          79        1bbl 2.91
## 34        1940         ohc             four          92        1bbl 2.91
## 35        1956         ohc             four          92        1bbl 2.91
##    stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg price
## 19   3.03               9.5         48     5100       47          53  5151
## 31   3.41               9.6         58     4800       49          54  6479
## 32   3.41               9.2         76     6000       31          38  6855
## 33   3.07              10.1         60     5500       38          42  5399
## 34   3.41               9.2         76     6000       30          34  6529
## 35   3.41               9.2         76     6000       30          34  7129
  3. Sort the dataset based on length descending
auto_data2 <- auto_data[order(- auto_data$length),]
head(auto_data2)
##    symboling normalized.losses          make fuel.type aspiration
## 74         0                NA mercedes-benz       gas        std
## 71        -1                93 mercedes-benz    diesel      turbo
## 72        -1                NA mercedes-benz       gas        std
## 48         0               145        jaguar       gas        std
## 49         0                NA        jaguar       gas        std
## 75         1                NA mercedes-benz       gas        std
##    num.of.doors body.style drive.wheels engine.location wheel.base length
## 74         four      sedan          rwd           front      120.9  208.1
## 71         four      sedan          rwd           front      115.6  202.6
## 72         four      sedan          rwd           front      115.6  202.6
## 48         four      sedan          rwd           front      113.0  199.6
## 49         four      sedan          rwd           front      113.0  199.6
## 75          two    hardtop          rwd           front      112.0  199.2
##    width height curb.weight engine.type num.of.cylinders engine.size
## 74  71.7   56.7        3900        ohcv            eight         308
## 71  71.7   56.3        3770         ohc             five         183
## 72  71.7   56.5        3740        ohcv            eight         234
## 48  69.6   52.8        4066        dohc              six         258
## 49  69.6   52.8        4066        dohc              six         258
## 75  72.0   55.4        3715        ohcv            eight         304
##    fuel.system bore stroke compression.ratio horsepower peak.rpm city.mpg
## 74        mpfi 3.80   3.35               8.0        184     4500       14
## 71         idi 3.58   3.64              21.5        123     4350       22
## 72        mpfi 3.46   3.10               8.3        155     4750       16
## 48        mpfi 3.63   4.17               8.1        176     4750       15
## 49        mpfi 3.63   4.17               8.1        176     4750       15
## 75        mpfi 3.80   3.35               8.0        184     4500       14
##    highway.mpg price
## 74          16 40960
## 71          25 31600
## 72          18 34184
## 48          19 32250
## 49          19 35550
## 75          16 45400

Identifying Duplicates

Duplicate records can be exact copies, or can be duplicates with respect to one or more key fields — like one customer purchasing multiple times in one day from the same store. In R, the function duplicated() finds repeated entries in a dataset: it returns a logical vector marking the duplicated records, and those indexes can then be used to select them. There are two types of duplicate records: whole-record duplicates and duplicates based on a key.

Overall record level duplicates

First fetch the duplicate record indexes:

Bill <- read.csv("~\\R dataset\\Telecom Data Analysis\\Bill.csv")

Dupes_in_bill<-duplicated(Bill) 

length(Dupes_in_bill)
## [1] 9462
head(Dupes_in_bill)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
summary(Dupes_in_bill)
##    Mode   FALSE    TRUE    NA's 
## logical    9452      10       0

In the above code, duplicated(Bill) identifies all the duplicate entries in the data set and stores a logical value for each row in Dupes_in_bill: FALSE if the row is not a duplicate, TRUE if it is. The summary() output shows there are 10 duplicate entries.

Access all duplicates:

Bill_dupes<-Bill[Dupes_in_bill,]
head(Bill_dupes)
##         cust_id act_active_dt package_id bill_zip customer_segment status
## 134  8067122301     7/27/2008     608645   560001           G - TC ACTIVE
## 135  8067122304     7/27/2008     608645   560001           G - TC ACTIVE
## 7162 9243344257     9/24/2009     620693   560035           G - TC ACTIVE
## 7166 9243350958     9/24/2009     620693   560035           G - TC ACTIVE
## 7170 9243356981     9/24/2009     620693   560035           G - TC ACTIVE
## 7172 9243381865     9/24/2009     620693   560035           G - TC ACTIVE
##      bill_amt
## 134    619.41
## 135    782.21
## 7162     0.00
## 7166     0.00
## 7170     0.00
## 7172     0.00

Bill_dupes contains all the duplicated entries in the data set. User can easily identify these entries and take appropriate measure on the same.

Duplicates based on Key

The code above finds duplicates at the level of whole records. What if the user is interested in duplicates of a single key variable rather than whole records? There are situations where only a key variable is duplicated across the data set. Instead of applying duplicated() to the complete data set, it can be applied to a specific variable. Let’s have a look at this example:

Dupes_in_bill_cust_id<-duplicated(Bill$cust_id) 
summary(Dupes_in_bill_cust_id)
##    Mode   FALSE    TRUE    NA's 
## logical    9389      73       0

Access all duplicates

Bill_dupes<-Bill[Dupes_in_bill_cust_id,]
head(Bill_dupes)
##        cust_id act_active_dt package_id bill_zip customer_segment status
## 134 8067122301     7/27/2008     608645   560001           G - TC ACTIVE
## 135 8067122304     7/27/2008     608645   560001           G - TC ACTIVE
## 561 9243538711     9/24/2009     625001   560035           G - TC ACTIVE
## 563 9243323496     9/24/2009     620693   560035           G - TC ACTIVE
## 565 9243323504     9/24/2009     620693   560035           G - TC ACTIVE
## 567 9243323511     9/24/2009     620693   560035           G - TC ACTIVE
##     bill_amt
## 134   619.41
## 135   782.21
## 561   561.80
## 563   545.98
## 565   254.76
## 567   545.98

Removing Duplicates

Add “!” sign to access the unique values. Instead of selecting the duplicated record indexes, we can select the unique records using “!” symbol

Bill_unique<-Bill[!Dupes_in_bill,]
Bill_unique_cust_id<-Bill[!Dupes_in_bill_cust_id,]

Dupes_in_bill contains a logical value (TRUE or FALSE) for every row: TRUE if the row is a duplicate and FALSE otherwise. The “!” sign negates these values, so only the non-duplicated rows are selected into Bill_unique and the duplicate rows are filtered out. The unique() function can also remove overall duplicate records directly: it simply keeps the unique records.

Bill_unique1<-unique(Bill)
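The full workflow can be sketched on a small made-up orders data frame (cust_id repeats to simulate duplicates):

```r
# Identifying and removing duplicates, whole-record and key-based
orders <- data.frame(cust_id = c(101, 102, 101, 103, 101),
                     amount  = c(250, 400, 250, 120, 90))

dup_rows <- duplicated(orders)            # TRUE where an entire row repeats
dup_ids  <- duplicated(orders$cust_id)    # TRUE where the key repeats

orders_unique     <- orders[!dup_rows, ]  # drop exact duplicate rows
orders_unique_key <- orders[!dup_ids, ]   # keep the first row per cust_id
```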

LAB: Handling Duplicates in R

Duplicate records are very frequent in many transactional datasets. Here we have examples of telecom bill and complaints data: a single user may register the same complaint twice in a day, and this repeated data needs to be found and removed. The following exercises on the Complaints data show how this is done.

  1. Import DataSet: “./Telecom Data Analysis/Complaints.csv”
  2. Identify overall duplicates in complaints data
  3. Create a new dataset by removing overall duplicates in Complaints data
  4. Identify duplicates in complaints data based on cust_id
  5. Create a new dataset by removing duplicates based on cust_id in Complaints data

Solutions

  1. Import DataSet: “./Telecom Data Analysis/Complaints.csv”
Complaints <- read.csv("~\\R dataset\\Telecom Data Analysis\\Complaints.csv")
  2. Identify overall duplicates in complaints data
Dupes_in_Complaints <- duplicated(Complaints)
summary(Dupes_in_Complaints)
##    Mode   FALSE    NA's 
## logical    6587       0
Complaint_dupes <- Complaints[Dupes_in_Complaints,]
Complaint_dupes
## [1] comp_id        month          incident       cust_id       
## [5] sla.status.new incident.type  type           severity      
## <0 rows> (or 0-length row.names)
  3. Create a new dataset by removing overall duplicates in Complaints data
Complaints_unique1<-unique(Complaints)
  4. Identify duplicates in complaints data based on cust_id
Dupes_in_complaints_cust_id<-duplicated(Complaints$cust_id)
summary(Dupes_in_complaints_cust_id)
##    Mode   FALSE    TRUE    NA's 
## logical    4856    1731       0
complaints_dupes<-Complaints[Dupes_in_complaints_cust_id,]
dim(complaints_dupes)
## [1] 1731    8
  5. Create a new dataset by removing duplicates based on cust_id in Complaints data
complaints_unique_custID<-Complaints[!Dupes_in_complaints_cust_id,]
nrow(Complaints)
## [1] 6587
nrow(complaints_unique_custID)
## [1] 4856
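After removing duplicates it is worth verifying the result. One convenient check (a suggestion beyond the lab steps above) is anyDuplicated(), which returns the index of the first duplicate, or 0 when none remain, shown here on a small made-up vector:

```r
x <- c(10, 20, 10, 30)
anyDuplicated(x)            # 3: position of the first duplicated element

x_unique <- x[!duplicated(x)]
anyDuplicated(x_unique)     # 0: no duplicates remain
```

Running anyDuplicated() on the cust_id column of a de-duplicated dataset should likewise return 0.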

Merging and Joining Datasets

In R, the merge() function joins two data sets into one. The general syntax is x <- merge(dataset1, dataset2, by="primary_key"), which merges dataset1 and dataset2 and saves the combined data into x. Note: unless you specify the type of join, merge() performs an inner join by default.

  • With a single primary key the syntax is

Newdata <- merge(dataone, datatwo ,by="primary_key")
  • With composite keys. More than one key can be used to join two datasets. The syntax is
Newdata <- merge(dataone, datatwo, by=c("primary_key1", "primary_key2"))
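A composite-key merge can be sketched on two small made-up datasets (the sales/costs names and values are hypothetical, chosen so that only two key combinations match):

```r
# Two toy datasets sharing the composite key (year, region)
sales <- data.frame(year = c(2019, 2019, 2020), region = c("N", "S", "N"),
                    revenue = c(100, 80, 120))
costs <- data.frame(year = c(2019, 2020, 2020), region = c("N", "N", "S"),
                    cost = c(60, 70, 40))

# Rows are matched only when BOTH year and region agree (inner join by default)
merged <- merge(sales, costs, by = c("year", "region"))
merged
##   year region revenue cost
## 1 2019      N     100   60
## 2 2020      N     120   70
```

Only the (2019, N) and (2020, N) combinations appear in both datasets, so the merged result has two rows; the key columns come first, followed by the remaining columns of each dataset.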

Joins

Joining the data can happen in one of four ways:

  • Inner Join
  • Outer Join
  • Left Outer Join
  • Right Outer Join

The value given for the parameter “all” decides the type of join. The syntax for each type is as follows.

Inner Join

Newdata <- merge(dataone, datatwo ,by="primary_key", all=FALSE)

Outer Join

Newdata <- merge(dataone, datatwo ,by="primary_key", all=TRUE)

Left Outer Join

Newdata <- merge(dataone, datatwo ,by="primary_key", all.x=TRUE)

Right Outer Join

Newdata <- merge(dataone, datatwo ,by="primary_key", all.y=TRUE)
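The effect of the “all” parameters can be compared side by side on two small made-up datasets, where only ids 2 and 3 appear in both:

```r
# Toy datasets to compare the four join types (made-up ids)
dataone <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
datatwo <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

nrow(merge(dataone, datatwo, by = "id", all = FALSE))   # inner: 2 (ids 2, 3)
nrow(merge(dataone, datatwo, by = "id", all = TRUE))    # outer: 4 (ids 1-4)
nrow(merge(dataone, datatwo, by = "id", all.x = TRUE))  # left : 3 (all of dataone)
nrow(merge(dataone, datatwo, by = "id", all.y = TRUE))  # right: 3 (all of datatwo)
```

In the outer, left and right joins, the non-matching rows are kept and their missing columns are filled with NA, exactly as in the telecom output shown below.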

Code: Joins

We have two TV commercial slots datasets:

  • TV Commercial Slots Analysis/orders.csv
  • TV Commercial Slots Analysis/slots.csv

orders<-read.csv("~\\R dataset\\TV Commercial Slots Analysis\\orders.csv")
slots<-read.csv("~\\R dataset\\TV Commercial Slots Analysis\\slots.csv")

Inner Join: merge using the single key “Unique_id”

Order_Slot_Common <- merge(orders, slots, by="Unique_id", all=FALSE)
head(Order_Slot_Common)
##                            Unique_id      AD_ID.x      Date     Time
## 1  SPYMYA2CBSH416510.833333333333333  SPYMYA2CBSH 1/12/2014  8:00 PM
## 2             SPYMYA60C008416520.375 SPYMYA60C008 1/13/2014  9:00 AM
## 3             SPYMYA60C008416540.375 SPYMYA60C008 1/15/2014  9:00 AM
## 4 SPYMYA60C008416550.458333333333333 SPYMYA60C008 1/16/2014 11:00 AM
## 5             SPYMYA60C008416560.375 SPYMYA60C008 1/17/2014  9:00 AM
## 6 SPYMYA60C008416570.666666666666667 SPYMYA60C008 1/18/2014  4:00 PM
##   Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders
## 1                MSDR02       SPYMYA          CBS      120      3
## 2                MSDR02       SPYMYA       TBKIDS       60      1
## 3                MSDR02       SPYMYA       TBKIDS       60      1
## 4                MSDR02       SPYMYA       TBKIDS       60      1
## 5                MSDR02       SPYMYA       TBKIDS       60      1
## 6                MSDR02       SPYMYA       TBKIDS       60      1
##        AD_ID.y  Air.Date Air.Time Time.Zone Client.Product.Code.y
## 1  SPYMYA2CBSH 1/12/2014  8:00 PM       EST                MSDR02
## 2 SPYMYA60C008 1/13/2014  9:00 AM       EST                MSDR02
## 3 SPYMYA60C008 1/15/2014  9:00 AM       EST                MSDR02
## 4 SPYMYA60C008 1/16/2014 11:00 AM       EST                MSDR02
## 5 SPYMYA60C008 1/17/2014  9:00 AM       EST                MSDR02
## 6 SPYMYA60C008 1/18/2014  4:00 PM       EST                MSDR02
##   Product.ID.y Length.y Network.ID.y  Network.Name      Network.Type
## 1       SPYMYA      120          CBS           CBS Broadcast Station
## 2       SPYMYA       60       TBKIDS The Band Kids             Cable
## 3       SPYMYA       60       TBKIDS The Band Kids             Cable
## 4       SPYMYA       60       TBKIDS The Band Kids             Cable
## 5       SPYMYA       60       TBKIDS The Band Kids             Cable
## 6       SPYMYA       60       TBKIDS The Band Kids             Cable
##   Network.DMA Network.Feed HH.Impressions HH.Ratings Spot.Cost
## 1    National                       18418      15.91      4000
## 2    Regional                          NA         NA      1000
## 3    Regional                          NA         NA      1000
## 4    Regional                          NA         NA      1000
## 5    Regional                          NA         NA      1050
## 6    Regional                          NA         NA      1000
##   Spot.Cost.Discounted
## 1               3800.0
## 2                950.0
## 3                950.0
## 4                950.0
## 5                997.5
## 6                950.0

Left Outer Join: merge using the single key “Unique_id”

Order_all_Slot_match<-merge(orders, slots, by="Unique_id", all.x=TRUE)
head(Order_all_Slot_match)
##                            Unique_id      AD_ID.x      Date     Time
## 1             MYSPYR60A002416540.375 MYSPYR60A002 1/15/2014  9:00 AM
## 2 MYSPYR60A002416620.708333333333333 MYSPYR60A002 1/23/2014  5:00 PM
## 3 MYSPYR60A002416630.416666666666667 MYSPYR60A002 1/24/2014 10:00 AM
## 4             MYSPYR60A002416640.625 MYSPYR60A002 1/25/2014  3:00 PM
## 5 MYSPYR60A004416500.666666666666667 MYSPYR60A004 1/11/2014  4:00 PM
## 6 MYSPYR60A004416520.916666666666667 MYSPYR60A004 1/13/2014 10:00 PM
##   Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders AD_ID.y
## 1               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 2               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 3               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 4               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 5               MSDR02R       MYSPYR         NHLN       60      1    <NA>
## 6               MSDR02R       MYSPYR         NHLN       60      1    <NA>
##   Air.Date Air.Time Time.Zone Client.Product.Code.y Product.ID.y Length.y
## 1     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 2     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 3     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 4     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 5     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 6     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
##   Network.ID.y Network.Name Network.Type Network.DMA Network.Feed
## 1         <NA>         <NA>         <NA>        <NA>         <NA>
## 2         <NA>         <NA>         <NA>        <NA>         <NA>
## 3         <NA>         <NA>         <NA>        <NA>         <NA>
## 4         <NA>         <NA>         <NA>        <NA>         <NA>
## 5         <NA>         <NA>         <NA>        <NA>         <NA>
## 6         <NA>         <NA>         <NA>        <NA>         <NA>
##   HH.Impressions HH.Ratings Spot.Cost Spot.Cost.Discounted
## 1             NA         NA        NA                   NA
## 2             NA         NA        NA                   NA
## 3             NA         NA        NA                   NA
## 4             NA         NA        NA                   NA
## 5             NA         NA        NA                   NA
## 6             NA         NA        NA                   NA

Right Outer Join: merge using the single key “Unique_id”

Slot_all_order_match<-merge(orders, slots, by="Unique_id", all.y=TRUE)
head(Slot_all_order_match)
##                            Unique_id      AD_ID.x      Date     Time
## 1  SPYMYA2CBSH416510.833333333333333  SPYMYA2CBSH 1/12/2014  8:00 PM
## 2             SPYMYA60C008416520.375 SPYMYA60C008 1/13/2014  9:00 AM
## 3             SPYMYA60C008416540.375 SPYMYA60C008 1/15/2014  9:00 AM
## 4 SPYMYA60C008416550.458333333333333 SPYMYA60C008 1/16/2014 11:00 AM
## 5             SPYMYA60C008416560.375 SPYMYA60C008 1/17/2014  9:00 AM
## 6 SPYMYA60C008416570.666666666666667 SPYMYA60C008 1/18/2014  4:00 PM
##   Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders
## 1                MSDR02       SPYMYA          CBS      120      3
## 2                MSDR02       SPYMYA       TBKIDS       60      1
## 3                MSDR02       SPYMYA       TBKIDS       60      1
## 4                MSDR02       SPYMYA       TBKIDS       60      1
## 5                MSDR02       SPYMYA       TBKIDS       60      1
## 6                MSDR02       SPYMYA       TBKIDS       60      1
##        AD_ID.y  Air.Date Air.Time Time.Zone Client.Product.Code.y
## 1  SPYMYA2CBSH 1/12/2014  8:00 PM       EST                MSDR02
## 2 SPYMYA60C008 1/13/2014  9:00 AM       EST                MSDR02
## 3 SPYMYA60C008 1/15/2014  9:00 AM       EST                MSDR02
## 4 SPYMYA60C008 1/16/2014 11:00 AM       EST                MSDR02
## 5 SPYMYA60C008 1/17/2014  9:00 AM       EST                MSDR02
## 6 SPYMYA60C008 1/18/2014  4:00 PM       EST                MSDR02
##   Product.ID.y Length.y Network.ID.y  Network.Name      Network.Type
## 1       SPYMYA      120          CBS           CBS Broadcast Station
## 2       SPYMYA       60       TBKIDS The Band Kids             Cable
## 3       SPYMYA       60       TBKIDS The Band Kids             Cable
## 4       SPYMYA       60       TBKIDS The Band Kids             Cable
## 5       SPYMYA       60       TBKIDS The Band Kids             Cable
## 6       SPYMYA       60       TBKIDS The Band Kids             Cable
##   Network.DMA Network.Feed HH.Impressions HH.Ratings Spot.Cost
## 1    National                       18418      15.91      4000
## 2    Regional                          NA         NA      1000
## 3    Regional                          NA         NA      1000
## 4    Regional                          NA         NA      1000
## 5    Regional                          NA         NA      1050
## 6    Regional                          NA         NA      1000
##   Spot.Cost.Discounted
## 1               3800.0
## 2                950.0
## 3                950.0
## 4                950.0
## 5                997.5
## 6                950.0

Outer Join: merge using the single key “Unique_id”

Order_Slots_union<- merge(orders, slots, by="Unique_id", all=TRUE)
head(Order_Slots_union)
##                            Unique_id      AD_ID.x      Date     Time
## 1             MYSPYR60A002416540.375 MYSPYR60A002 1/15/2014  9:00 AM
## 2 MYSPYR60A002416620.708333333333333 MYSPYR60A002 1/23/2014  5:00 PM
## 3 MYSPYR60A002416630.416666666666667 MYSPYR60A002 1/24/2014 10:00 AM
## 4             MYSPYR60A002416640.625 MYSPYR60A002 1/25/2014  3:00 PM
## 5 MYSPYR60A004416500.666666666666667 MYSPYR60A004 1/11/2014  4:00 PM
## 6 MYSPYR60A004416520.916666666666667 MYSPYR60A004 1/13/2014 10:00 PM
##   Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders AD_ID.y
## 1               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 2               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 3               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 4               MSDR02R       MYSPYR          ESQ       60      1    <NA>
## 5               MSDR02R       MYSPYR         NHLN       60      1    <NA>
## 6               MSDR02R       MYSPYR         NHLN       60      1    <NA>
##   Air.Date Air.Time Time.Zone Client.Product.Code.y Product.ID.y Length.y
## 1     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 2     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 3     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 4     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 5     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
## 6     <NA>     <NA>      <NA>                  <NA>         <NA>       NA
##   Network.ID.y Network.Name Network.Type Network.DMA Network.Feed
## 1         <NA>         <NA>         <NA>        <NA>         <NA>
## 2         <NA>         <NA>         <NA>        <NA>         <NA>
## 3         <NA>         <NA>         <NA>        <NA>         <NA>
## 4         <NA>         <NA>         <NA>        <NA>         <NA>
## 5         <NA>         <NA>         <NA>        <NA>         <NA>
## 6         <NA>         <NA>         <NA>        <NA>         <NA>
##   HH.Impressions HH.Ratings Spot.Cost Spot.Cost.Discounted
## 1             NA         NA        NA                   NA
## 2             NA         NA        NA                   NA
## 3             NA         NA        NA                   NA
## 4             NA         NA        NA                   NA
## 5             NA         NA        NA                   NA
## 6             NA         NA        NA                   NA

LAB: Data Joins

  1. Import Datasets
  • “./Telecom Data Analysis/Bill.csv”
  • “./Telecom Data Analysis/Complaints.csv”
  2. Remove duplicates based on cust_id
  3. Create a dataset for each of these requirements
    1. All the customers who appear either in bill data or complaints data
    2. All the customers who appear both in bill data and complaints data
    3. All the customers from bill data: customers who have bill data along with their complaints
    4. All the customers from complaints data: customers who have complaints data along with their bill info

Solutions

  1. Import Datasets
  • “./Telecom Data Analysis/Bill.csv”
  • “./Telecom Data Analysis/Complaints.csv”
bill<-read.csv("~\\R dataset\\Telecom Data Analysis\\Bill.csv")
Complaints<-read.csv("~\\R dataset\\Telecom Data Analysis\\Complaints.csv")
  2. Remove duplicates based on cust_id
Dupes_in_bills_cust_id <- duplicated(bill$cust_id)
Bills_unique<-bill[!Dupes_in_bills_cust_id,]
nrow(bill)
## [1] 9462
nrow(Bills_unique)
## [1] 9389
Dupes_in_Complaints_cust_id <- duplicated(Complaints$cust_id)
Complaints_unique<-Complaints[!Dupes_in_Complaints_cust_id,]
nrow(Complaints)
## [1] 6587
nrow(Complaints_unique)
## [1] 4856
  3. Create a dataset for each of these requirements
    1. All the customers who appear either in bill data or complaints data
all_customer<-merge(Bills_unique,Complaints_unique,by="cust_id",all=TRUE)
nrow(Bills_unique)
## [1] 9389
nrow(Complaints_unique)
## [1] 4856
nrow(all_customer)
## [1] 13952
    2. All the customers who appear both in bill data and complaints data
customer_in_both<-merge(Bills_unique,Complaints_unique,by="cust_id",all=FALSE)
nrow(Bills_unique)
## [1] 9389
nrow(Complaints_unique)
## [1] 4856
nrow(customer_in_both)
## [1] 293
    3. All the customers from bill data: customers who have bill data along with their complaints
bill_customer<-merge(Bills_unique,Complaints_unique,by="cust_id",all.x=TRUE)
nrow(Bills_unique)
## [1] 9389
nrow(bill_customer)
## [1] 9389
    4. All the customers from complaints data: customers who have complaints data along with their bill info
complaints_customer<-merge(Bills_unique,Complaints_unique,by="cust_id",all.y=TRUE)
nrow(Complaints_unique)
## [1] 4856
nrow(complaints_customer)
## [1] 4856

Exporting the data Out of R

Another important feature available in R is exporting. With it, the data created in R can be saved in other formats such as CSV or tab-delimited text. Just as we used read.csv() to read the contents of a CSV file into R, there is a function write.csv() with which a dataset can be exported to a CSV file, and write.table() for other delimiters.

  • Make sure that the export path is correct
write.csv(data, "./path/file_name.csv") 
write.csv(customer_in_both, "./Telecom Data Analysis/Bill_and_comp_common.csv") 
write.table(all_customer, "./Telecom Data Analysis/Bill_and_comp_overall.txt", sep="\t") 
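A self-contained round-trip sketch, using a temporary file instead of the telecom paths above (note that row.names = FALSE avoids an extra index column in the output file):

```r
# Export a small data frame to CSV and read it back
df  <- data.frame(cust_id = c(1, 2), bill_amt = c(619.41, 782.21))
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)

df_back <- read.csv(out)
identical(dim(df_back), dim(df))   # TRUE: same 2 x 2 shape comes back

# For tab-separated output, use write.table with sep = "\t"
write.table(df, sub("\\.csv$", ".txt", out), sep = "\t", row.names = FALSE)
```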

Conclusion

In this session we started with importing data from various sources. We saw basic commands to work with data, learnt how to manipulate datasets and create new variables, and covered sorting datasets and handling duplicates. Joining datasets is also an important concept. There are many more topics in data handling, but the techniques covered in this session are essential for any data scientist.


Statinfer

Statinfer derived from Statistical inference. We provide training in various Data Analytics and Data Science courses and assist candidates in securing placements.
