You can download the datasets and R code file for this session here.
Introduction
Data handling is one of the most important skills for any data analyst. Before jumping into the analysis, we may have to import, curate and validate the data. Sometimes we may have to create a new variable in the data; sometimes we may have to merge data from multiple sources into one consolidated dataset, and so on. In this session we will discuss the most useful data handling techniques. Let us have a look at the contents we are going to discuss.
Contents
- Data importing from files
- Database server connections
- Working with datasets
- Manipulating the datasets in R
- Creating new variables in R
- Sorting in R & Removing Duplicates
- Exporting the R datasets into external files
- Data Merging
- Conclusion
Importing The Data
Data is stored in many different formats: CSV (Comma Separated Values), Excel, SQL databases and many others. Various statistical tools are available for the critical analysis of such data. R is a highly flexible, open-source tool that integrates with all of these storage formats, which makes it easy to handle data in R. Data handling is a very important skill for a data analyst. It covers processes such as importing, curation, validation and exploration of data, so before moving to the analysis it is important to become well versed with it. During data preparation one may have to create new variables, merge data from multiple sources, sort the data and so on. In this session we will discuss the most useful data handling techniques. R has in-built functions and external packages that make it possible to work with CSV files, Excel workbooks and SQL databases. Because of this flexibility, it is very easy to import a dataset stored in any of these formats into R for further analysis, and equally easy to export a dataset from R back to these formats. The coming sections cover the various methods available to import data into R.
Data import from CSV files
CSV (Comma Separated Values) is one of the most commonly used data formats, and R handles it natively. By calling the simple function read.csv(), one can easily import the data into R. A note of caution: while defining the path of the CSV file in read.csv(), use either “/” or “\\” in the path; the Windows-style single “\” does not work in R. To understand how read.csv() works, a sample program that reads the sales data from Sales_sample.csv is shown below. The function read.csv() imports the data from the CSV file (whose path is passed as an argument) into R.
**Note:** In the syntax below, we use \\ instead of \.
Note: In the syntax below, you need to replace dataset path with your local path.
Sales <- read.csv("~\\R dataset\\Superstore Sales Data\\Sales_sample.csv")
Sales
Data import from SAS files
SAS (Statistical Analysis System) is a popular, highly specialised analytics tool. R can read SAS datasets stored in the sas7bdat format, but an external package needs to be installed before importing them. To install the package, use install.packages("sas7bdat"), and to load it into the current session use library(sas7bdat). Once the package is installed and loaded, the data can be imported by calling read.sas7bdat(). Just as the file path was passed to read.csv(), the path of the sas7bdat file is passed as an argument to this function.
library(sas7bdat)
gnpdata <- read.sas7bdat("~\\R dataset\\SAS datasets\\gnp.sas7bdat")
View(gnpdata)
Data import from Excel files
Excel is one of the most widely used tools for reporting and ad-hoc data analysis. Excel files mostly contain aggregated data, and sometimes raw data as well. The function read.xlsx() from the xlsx package can be used to import the data from the desired file into R; the spreadsheet is read and stored in a data frame. Here is an example of how to read Excel data into R.
library(xlsx)
wb_data <- read.xlsx("~\\Datasets\\World Bank Data\\World Bank Indicators.xlsx", sheetName="Data by country")
There is another way to import Excel data into R. The function loadWorkbook() from the XLConnect package reads the complete workbook, and individual worksheets can then be imported with readWorksheet(). Note that XLConnect depends on Java, so Java needs to be installed before using it.
library(XLConnect)
wb_data <- readWorksheet(loadWorkbook("~\\R dataset\\World Bank Data\\World Bank Indicators.xlsx"), sheet=1)
You may run into Java-related issues while importing Excel files. Excel is not just a flat file like CSV; apart from the data itself, an Excel file contains a lot of other information such as indexes and formatting. If you still run into errors after installing the necessary packages, first save the Excel data in CSV format and then import it.
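As a minimal fallback sketch, assuming you have saved the worksheet from Excel as a CSV (the path below is illustrative, not a file shipped with this session):

```r
# Hypothetical path: the "Data by country" sheet saved from Excel as a CSV
wb_data <- read.csv("~\\Datasets\\World Bank Data\\World Bank Indicators.csv")
head(wb_data)
```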
Connecting to Database server
Importing Data from SQL
Data stored on database servers can be accessed through SQL (Structured Query Language), and R can connect to these databases. For SQL Server, this is done through ODBC. A few steps are needed to link R with the database: first create an ODBC data source for the SQL Server, then connect that ODBC source with R. This is done through the RODBC package, which needs to be installed first.
install.packages("RODBC")
Once the package is installed, it is loaded into the current workspace by calling library(RODBC). This makes the functions in the RODBC package available for further use. The following syntax shows how to load the package and connect R with the database.
library(RODBC)
conn <-odbcConnect("dblink", uid="uid", pwd="passwd")
salesdata <- sqlFetch(conn, "sales")
ordersdat<- sqlQuery(conn, "select * from orders")
close(conn)
dbhandle <- odbcDriverConnect('driver={SQL Server}; server=MyServerName; database=MyDbName; trusted_connection=true')
In the command above, replace “MyServerName” and “MyDbName” with your server and database names, and make sure that you have access to that server and database before connecting. Once the data is imported, the connection is closed using the function close(). The sample code below shows how to work with the database connection and execute SQL queries from R.
spend_row_count<-sqlQuery(dbhandle, "select count(*) from MyTable")
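As a further hedged sketch, the same handle can be reused for a filtered query before releasing the connection (the table and column names below are hypothetical):

```r
# Hypothetical table and columns: pull only the rows needed, then release the handle
high_spenders <- sqlQuery(dbhandle, "select cust_id, spend from MyTable where spend > 1000")
odbcClose(dbhandle)   # close the ODBC connection when done
```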
Working With Datasets
Once the dataset is imported or connected, there are some basic validation checks that we may have to run before going any further with the analysis.
- Is the data imported correctly?
- Are the variables imported in right format?
- Did we import all the rows and columns?
It is always a good idea to run some basic checks to get an idea of the dataset. Just printing the data is not always a good option: datasets can be huge, and printing them may not give us any additional benefit. It is good practice to check the number of rows and columns, have a quick look at the variable structures, and view a data snapshot and summary. Importing, exploring, validating and sanitising the data are as important as the analysis itself. It is important to check whether any data is missing or whether there are outliers in the dataset; for example, if an age is recorded as 1200, something went wrong while recording or importing the data, so fixing the imported data is equally important. One can validate the dataset by verifying its structure: the number of rows and columns, the format of the data, the limits of the values taken by the columns, or even just a summary of the same. Here are some basic commands that you may want to run right after importing the data into R.
Basic Commands on Datasets
dim(): This function returns the number of rows and columns. At the overall level it gives a quick idea of whether the data import worked; if we don't have the number of rows and columns we expected, we can address the issue at this stage itself.
dim(Sales)
It can be seen that the dataset Sales has 11 rows and 7 columns. names(): This function can be used to get or set the names of the variables in the dataset. Sometimes the import does not pick up column names (for example from text files), so we need to be careful while importing.
names(Sales)
In this output we can see the names of the 7 columns in the dataset Sales. head(): head() returns the first 6 observations of the dataset and is used to have a quick look at the first few records. The n argument of head() can be used to change the number of observations in the output.
head(Sales)
tail(): tail() returns the last 6 observations of the dataset. It is used as a counterpart to head() wherever that is applicable.
tail(Sales)
The output above shows the last 6 observations of Sales. str(): This is a powerful and important function which returns the complete structure of the dataset: the number of rows and columns, the names of the columns, the type or class of data contained in these columns, and even a brief preview of their contents.
str(Sales)
We can also use this function to see the detailed summary of a single variable; the $ operator selects the column to be inspected.
str(Sales$unitsSold)
print()
print() can be used to display the dataset on the console. It works well if the dataset is small.
print(Sales)
View(): The View() function gives a quick snapshot view of the dataset. print() writes the dataset to the console, and if there are many columns we may not get a good view of it; View() gives a better formatted snapshot of the dataset.
View(Sales)
Check list after Import
Data: Superstore Sales Data/Sales_sample.csv
Code | Description |
---|---|
dim(Sales) | To check the number of rows and columns |
names(Sales) | What are the column names? Sometimes import doesn’t consider column names while importing |
head(Sales) | First few observations of data |
tail(Sales) | Last few observations of the data |
str(Sales) | Structure of the whole data, are all the variables in the expected format |
str(Sales$unitsSold) | Structure of few important variables |
View(Sales) | A snapshot of small datasets |
print(Sales) | Print the data if it is small |
Other useful commands
A few more examples are given in the table below. Give them a try and check whether the output matches the description.
Code | Description |
---|---|
str(Sales) | Get an idea on numerical and non-numerical variables |
summary(Sales) | Overall data summary |
summary(Sales$unitsSold) | Get summary on numerical variables |
table(Sales$salesChannel) | Get frequency tables for categorical variables |
table(Sales$custCountry) | Get frequency tables for categorical variables |
sum(is.na(Sales)) | Missing value count in full data |
sum(is.na(Sales$unitsSold)) | Missing value count in a variable |
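For instance, a quick run of a few of these commands on the Sales data imported earlier might look like the sketch below (the exact output depends on your copy of the file):

```r
summary(Sales$unitsSold)    # numeric summary: min, quartiles, mean, max
table(Sales$salesChannel)   # frequency table for a categorical variable
sum(is.na(Sales))           # total count of missing values in the whole data frame
```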
Lab: Printing the data and meta info
Consider the sales data for a superstore that has departments in different countries; the dataset can easily be split by country. It contains the information relating to the customers' purchase attributes. Analyse the data by checking which attributes are covered, how big the sample is, and so on. Using this data, try to solve the questions below.
1. Import the “Superstore Sales Data/Sales_by_country_v1.csv” data.
2. Perform the basic checks on the data.
3. How many rows and columns are there in this dataset?
4. Print only the column names in the dataset.
5. Print the first 10 observations.
6. Print the last 5 observations.
7. Get the summary of the dataset.
8. Print the structure of the data.
9. Describe the fields unitsSold and custCountry.
10. Create a new dataset by taking the first 30 observations from this data.
11. Print the resultant data.
12. Remove (delete) the new dataset.
Solutions
1. Import the “Superstore Sales Data/Sales_by_country_v1.csv” data
Sales_data <- read.csv("~\\R dataset\\Superstore Sales Data\\Sales_by_country_v1.csv")
head(Sales_data)
## custId custName custCountry productSold
## 1 23262 Candice Levy Congo SUPA101
## 2 23263 Xerxes Smith Panama DETA200
## 3 23264 Levi Douglas Tanzania, United Republic of DETA800
## 4 23265 Uriel Benton South Africa SUPA104
## 5 23266 Celeste Pugh Gabon PURA200
## 6 23267 Vance Campos Syrian Arab Republic PURA100
## salesChannel unitsSold dateSold
## 1 Retail 117 2012-08-09
## 2 Online 73 2012-07-06
## 3 Online 205 2012-08-18
## 4 Online 14 2012-08-05
## 5 Retail 170 2012-08-11
## 6 Retail 129 2012-07-11
2. Perform the basic checks on the data
dim(Sales_data)
## [1] 998 7
2. Perform the basic checks on the data
str(Sales_data)
## 'data.frame': 998 obs. of 7 variables:
## $ custId : int 23262 23263 23264 23265 23266 23267 23268 23269 23270 23271 ...
## $ custName : Factor w/ 998 levels "Aaron Edwards",..: 183 969 612 929 195 937 593 482 956 77 ...
## $ custCountry : Factor w/ 233 levels "Afghanistan",..: 49 160 204 191 74 201 83 122 112 169 ...
## $ productSold : Factor w/ 12 levels "DETA100","DETA200",..: 8 2 3 11 5 4 1 4 10 10 ...
## $ salesChannel: Factor w/ 3 levels "Direct","Online",..: 3 2 2 2 3 3 3 3 2 3 ...
## $ unitsSold : int 117 73 205 14 170 129 82 116 67 125 ...
## $ dateSold : Factor w/ 464 levels "2011-01-02","2011-01-03",..: 446 416 454 442 448 421 422 386 388 434 ...
- Perform the basic checks on the data
head(Sales_data)
## custId custName custCountry productSold
## 1 23262 Candice Levy Congo SUPA101
## 2 23263 Xerxes Smith Panama DETA200
## 3 23264 Levi Douglas Tanzania, United Republic of DETA800
## 4 23265 Uriel Benton South Africa SUPA104
## 5 23266 Celeste Pugh Gabon PURA200
## 6 23267 Vance Campos Syrian Arab Republic PURA100
## salesChannel unitsSold dateSold
## 1 Retail 117 2012-08-09
## 2 Online 73 2012-07-06
## 3 Online 205 2012-08-18
## 4 Online 14 2012-08-05
## 5 Retail 170 2012-08-11
## 6 Retail 129 2012-07-11
- Perform the basic checks on the data
tail(Sales_data)
## custId custName custCountry productSold salesChannel
## 993 24254 Anika Alford Belize DETA800 Online
## 994 24255 Ethan Day Tajikistan DETA100 Online
## 995 24256 Quail Knox Tonga PURA500 Retail
## 996 24257 Noelle Sargent Ireland DETA800 Direct
## 997 24258 Kuame Wallace Montserrat SUPA103 Online
## 998 24259 Lester Fisher Cocos (Keeling) Islands PURA500 Direct
## unitsSold dateSold
## 993 6 2011-07-08
## 994 189 2011-01-09
## 995 43 2011-05-08
## 996 17 2011-02-04
## 997 80 2011-01-13
## 998 138 2011-08-10
- How many rows and columns are there in this dataset?
dim(Sales_data)
## [1] 998 7
- Print only column names in the dataset
names(Sales_data)
## [1] "custId" "custName" "custCountry" "productSold"
## [5] "salesChannel" "unitsSold" "dateSold"
- Print first 10 observations
head(Sales_data,n=10)
## custId custName custCountry productSold
## 1 23262 Candice Levy Congo SUPA101
## 2 23263 Xerxes Smith Panama DETA200
## 3 23264 Levi Douglas Tanzania, United Republic of DETA800
## 4 23265 Uriel Benton South Africa SUPA104
## 5 23266 Celeste Pugh Gabon PURA200
## 6 23267 Vance Campos Syrian Arab Republic PURA100
## 7 23268 Latifah Wall Guadeloupe DETA100
## 8 23269 Jane Hernandez Macedonia PURA100
## 9 23270 Wanda Garza Kyrgyzstan SUPA103
## 10 23271 Athena Fitzpatrick Reunion SUPA103
## salesChannel unitsSold dateSold
## 1 Retail 117 2012-08-09
## 2 Online 73 2012-07-06
## 3 Online 205 2012-08-18
## 4 Online 14 2012-08-05
## 5 Retail 170 2012-08-11
## 6 Retail 129 2012-07-11
## 7 Retail 82 2012-07-12
## 8 Retail 116 2012-06-03
## 9 Online 67 2012-06-07
## 10 Retail 125 2012-07-27
- Print the last 5 observations
tail(Sales_data,n=5)
## custId custName custCountry productSold salesChannel
## 994 24255 Ethan Day Tajikistan DETA100 Online
## 995 24256 Quail Knox Tonga PURA500 Retail
## 996 24257 Noelle Sargent Ireland DETA800 Direct
## 997 24258 Kuame Wallace Montserrat SUPA103 Online
## 998 24259 Lester Fisher Cocos (Keeling) Islands PURA500 Direct
## unitsSold dateSold
## 994 189 2011-01-09
## 995 43 2011-05-08
## 996 17 2011-02-04
## 997 80 2011-01-13
## 998 138 2011-08-10
- Get the summary of the dataset
summary(Sales_data)
## custId custName custCountry
## Min. :23262 Aaron Edwards : 1 Denmark : 10
## 1st Qu.:23511 Abigail Cunningham: 1 Swaziland : 10
## Median :23761 Abraham Mcguire : 1 Turkey : 10
## Mean :23761 Acton Mendoza : 1 Azerbaijan : 9
## 3rd Qu.:24010 Acton Ratliff : 1 Bouvet Island: 9
## Max. :24259 Adam Blackburn : 1 Nauru : 9
## (Other) :992 (Other) :941
## productSold salesChannel unitsSold dateSold
## PURA100:112 Direct: 91 Min. : 1.00 2011-11-11: 7
## SUPA103: 90 Online:511 1st Qu.: 52.25 2012-05-15: 7
## DETA800: 89 Retail:396 Median :111.00 2012-01-08: 6
## DETA100: 87 Mean :108.26 2012-02-20: 6
## SUPA102: 86 3rd Qu.:163.00 2012-04-06: 6
## PURA200: 84 Max. :212.00 2012-04-21: 6
## (Other):450 (Other) :960
- Print the structure of the data
str(Sales_data)
## 'data.frame': 998 obs. of 7 variables:
## $ custId : int 23262 23263 23264 23265 23266 23267 23268 23269 23270 23271 ...
## $ custName : Factor w/ 998 levels "Aaron Edwards",..: 183 969 612 929 195 937 593 482 956 77 ...
## $ custCountry : Factor w/ 233 levels "Afghanistan",..: 49 160 204 191 74 201 83 122 112 169 ...
## $ productSold : Factor w/ 12 levels "DETA100","DETA200",..: 8 2 3 11 5 4 1 4 10 10 ...
## $ salesChannel: Factor w/ 3 levels "Direct","Online",..: 3 2 2 2 3 3 3 3 2 3 ...
## $ unitsSold : int 117 73 205 14 170 129 82 116 67 125 ...
## $ dateSold : Factor w/ 464 levels "2011-01-02","2011-01-03",..: 446 416 454 442 448 421 422 386 388 434 ...
- Describe the field unitsSold, custCountry
str(Sales_data$unitsSold)
str(Sales_data$custCountry)
## int [1:998] 117 73 205 14 170 129 82 116 67 125 ...
- Create a new dataset by taking first 30 observations from this data
Sales_data_new <- Sales_data[c(1:30),]
dim(Sales_data_new)
## [1] 30 7
- Print the resultant data
print(Sales_data_new)
## custId custName custCountry productSold
## 1 23262 Candice Levy Congo SUPA101
## 2 23263 Xerxes Smith Panama DETA200
## 3 23264 Levi Douglas Tanzania, United Republic of DETA800
## 4 23265 Uriel Benton South Africa SUPA104
## 5 23266 Celeste Pugh Gabon PURA200
## 6 23267 Vance Campos Syrian Arab Republic PURA100
## 7 23268 Latifah Wall Guadeloupe DETA100
## 8 23269 Jane Hernandez Macedonia PURA100
## 9 23270 Wanda Garza Kyrgyzstan SUPA103
## 10 23271 Athena Fitzpatrick Reunion SUPA103
## 11 23272 Anjolie Hicks Turks and Caicos Islands DETA200
## 12 23273 Isaac Cooper Netherlands Antilles SUPA104
## 13 23274 Asher Weber Macedonia PURA100
## 14 23275 Ethan Gregory Tuvalu DETA800
## 15 23276 Hayes Rollins Nepal PURA500
## 16 23277 MacKenzie Moss Oman SUPA101
## 17 23278 Aphrodite Brennan Malawi SUPA105
## 18 23279 Angela Wise Moldova PURA100
## 19 23280 James Spencer Burkina Faso SUPA103
## 20 23281 Adria Kaufman Bouvet Island SUPA102
## 21 23282 Amir Alexander Liberia DETA100
## 22 23283 Lani Sweet Vanuatu SUPA105
## 23 23284 Clark Weaver Palau PURA250
## 24 23285 Leonard Cardenas Madagascar SUPA102
## 25 23286 Renee Padilla Yemen DETA800
## 26 23287 Joy Vazquez Korea PURA250
## 27 23288 Ingrid Bush Montserrat SUPA104
## 28 23289 Deacon Craig Mongolia SUPA105
## 29 23290 Rama Goodwin Tunisia DETA800
## 30 23291 Jelani Odonnell Albania DETA800
## salesChannel unitsSold dateSold
## 1 Retail 117 2012-08-09
## 2 Online 73 2012-07-06
## 3 Online 205 2012-08-18
## 4 Online 14 2012-08-05
## 5 Retail 170 2012-08-11
## 6 Retail 129 2012-07-11
## 7 Retail 82 2012-07-12
## 8 Retail 116 2012-06-03
## 9 Online 67 2012-06-07
## 10 Retail 125 2012-07-27
## 11 Retail 71 2012-07-31
## 12 Retail 22 2012-08-13
## 13 Direct 153 2012-08-22
## 14 Retail 141 2012-07-04
## 15 Direct 65 2012-08-01
## 16 Online 157 2012-07-12
## 17 Online 197 2012-08-24
## 18 Direct 10 2012-06-21
## 19 Direct 30 2012-06-03
## 20 Online 134 2012-07-13
## 21 Online 100 2012-08-21
## 22 Online 142 2012-06-24
## 23 Online 135 2012-06-17
## 24 Online 9 2012-07-24
## 25 Retail 69 2012-08-08
## 26 Retail 189 2012-06-17
## 27 Direct 141 2012-06-14
## 28 Retail 166 2012-08-02
## 29 Online 170 2012-08-11
## 30 Retail 199 2012-08-18
- Remove(delete) the new dataset
rm(Sales_data_new)
Subsetting the data
Subsetting data based on rows
Subsetting allows you to access a particular part of a complete dataset. If a user wants to explore the data of particular rows or columns, subsetting is very helpful. In R, users can access a particular part of the dataset by selecting specific rows and columns, as shown below. Import the dataset GDP.csv:
gdp<- read.csv("~\\R dataset\\World Bank Data\\GDP.csv")
A new dataset with selected rows: to select a particular part of the dataset, vary the row and column indices in the syntax data[rows, columns].
gdp1<-gdp[1:10, ]
gdp1
## Country_code Rank Country GDP
## 1 USA 1 United States 17419000
## 2 CHN 2 China 10354832
## 3 JPN 3 Japan 4601461
## 4 DEU 4 Germany 3868291
## 5 GBR 5 United Kingdom 2988893
## 6 FRA 6 France 2829192
## 7 BRA 7 Brazil 2346076
## 8 ITA 8 Italy 2141161
## 9 IND 9 India 2048517
## 10 RUS 10 Russian Federation 1860598
In the above code, the first 10 rows of the dataset are selected using 1:10 (which means 1 to 10), and all the columns are selected. Hence, on printing gdp1, the first 10 rows of all the columns are displayed in the output. Another way to select specific rows is:
gdp2<- gdp[c(1,4,10,12), ]
gdp2
## Country_code Rank Country GDP
## 1 USA 1 United States 17419000
## 4 DEU 4 Germany 3868291
## 10 RUS 10 Russian Federation 1860598
## 12 AUS 12 Australia 1454675
To select rows that are non-contiguous (not in a sequence), c(row number 1, row number 2, ...) can be used to pick the specific rows. In the code above, gdp[c(1,4,10,12), ] selects rows 1, 4, 10 and 12 and stores them in gdp2.
Subsetting with selected columns
Just as particular rows can be selected from the dataset, columns can be selected the same way. To do so, change the second index in the syntax data[rows, columns]. A sample code is shown below:
gdp3<- gdp[, 2:4 ]
head(gdp3)
## Rank Country GDP
## 1 1 United States 17419000
## 2 2 China 10354832
## 3 3 Japan 4601461
## 4 4 Germany 3868291
## 5 5 United Kingdom 2988893
## 6 6 France 2829192
The code above shows the data for columns 2 to 4. If all the rows need to be printed, simply remove the head() call. Another way of selecting columns is shown below:
gdp4<- gdp[, c(1,2,4)]
head(gdp4)
## Country_code Rank GDP
## 1 USA 1 17419000
## 2 CHN 2 10354832
## 3 JPN 3 4601461
## 4 DEU 4 3868291
## 5 GBR 5 2988893
## 6 FRA 6 2829192
Just as non-sequential rows are selected, non-sequential columns can be selected using the syntax c(column number 1, column number 2, ...). The code above shows the data for the 1st, 2nd and 4th columns. One can also access particular columns by using the names of the columns, as shown below:
gdp5<- gdp[, c("Country", "Rank")]
head(gdp5)
## Country Rank
## 1 United States 1
## 2 China 2
## 3 Japan 3
## 4 Germany 4
## 5 United Kingdom 5
## 6 France 6
The code above selects the columns by name, so gdp5 contains only the Country and Rank columns.
Subsetting with selected rows and columns
The syntax used above can also combine row and column selection. Sample code is shown below:
gdp6<-gdp[5:20, c(1,2,4)]
gdp6
## Country_code Rank GDP
## 5 GBR 5 2988893
## 6 FRA 6 2829192
## 7 BRA 7 2346076
## 8 ITA 8 2141161
## 9 IND 9 2048517
## 10 RUS 10 1860598
## 11 CAN 11 1785387
## 12 AUS 12 1454675
## 13 KOR 13 1410383
## 14 ESP 14 1381342
## 15 MEX 15 1294690
## 16 IDN 16 888538
## 17 NLD 17 879319
## 18 TUR 18 798429
## 19 SAU 19 746249
## 20 CHE 20 701037
The code above accesses rows 5 to 20 and columns 1, 2 and 4. Another way to select rows and columns is shown below. Note: in the code below, a negative sign is used, as in c(-3, -4); a negative index drops the rows or columns mentioned with the negative sign instead of selecting them.
gdp7<-gdp[1:5, c(-3,-4)]
gdp7
## Country_code Rank
## 1 USA 1
## 2 CHN 2
## 3 JPN 3
## 4 DEU 4
## 5 GBR 5
From the output, we can see that the 3rd and 4th columns are omitted.
Lab: Subsetting the data
As a bank manager, you might like to know the quality of the customers the bank is catering to. This would help you in profiling the target customers; we may want to roll out marketing offers to selected customers of the bank.
1. Import: “./Bank Marketing/bank_market.csv”
2. Create separate datasets for each of the below tasks:
   - a) Select the first 1000 rows only
   - b) Select only four columns: “Cust_num”, “age”, “default” and “balance”
   - c) Select observations 20,000 to 40,000 along with four variables: “Cust_num”, “job”, “marital” and “education”
   - d) Select observations 5000 to 6000 and drop “poutcome” and “y”
Solutions
- 1. Import: “./Bank Marketing/bank_market.csv”
bank <- read.csv("C:UsersajithDesktopstatinferR ProgrammingR datasetBank Tele Marketingbank_market.csv")
- 2.a) Select first 1000 rows only
bank_modified1 <- bank[c(1:1000),]
dim(bank_modified1)
## [1] 1000 18
- 2.b) Select only four columns “Cust_num” “age” “default” and “balance”
bank_modified2 <- bank[,c("Cust_num","age","default","balance")]
dim(bank_modified2)
## [1] 45211 4
- 2.c) Select 20,000 to 40,000 observations along with four variables “Cust_num” “job” “marital” and “education”
bank_modified3 <- bank[c(20000:40000), c("Cust_num","job","marital","education")]
dim(bank_modified3)
## [1] 20001 4
- 2.d) Select 5000 to 6000 observations drop “poutcome” and “y”
bank_modified4 <- bank[c(5000:6000), !(colnames(bank) %in% c("poutcome","y"))]
dim(bank_modified4)
## [1] 1001 16
Subset with Variable Filter Conditions
- Selection with a condition on variables
- For example, selecting complaints where the budget is greater than $5000
- We need to use the subset() function in R
newdata <- subset(old_data,condition1 | condition2)
- And condition & filters
bank_subset1<-subset(bank, age>40 & loan=="no")
- OR condition & filters
bank_subset2<-subset(bank, age>40 | loan=="no")
Subsetting can also be done using filter conditions. Say we have a dataset of 5000 employees and we want to find all the employees with a salary of more than 100,000 per month; a filter condition picks out exactly those records. Conditional operators that can be used to filter a dataset include AND (&), OR (|), NOT (!), greater than / less than (> or <), equal to (==) and others. The syntax is new_data <- subset(old_data, condition1 | condition2). The example below combines an AND condition with an OR condition: a row is kept if both age > 40 and loan == "no", or if marital == "single".
bank_subset3<-subset(bank, (age>40 & loan=="no") | marital=="single" )
AND and OR conditions, numeric and character filters, and selecting a few fields
market4 <- subset(bank, (age>40 & loan=="no") | marital=="single", select=c(Cust_num, age, marital, loan))
AND and OR conditions, numeric and character filters, and dropping a few fields
market5 <- subset(bank, (age>40 & loan=="no") | marital=="single", select=-c(poutcome, y, balance, contact))
Lab: Subsetting
The AutoDataset contains the specifications of cars, including the engine type, the mileage, the fuel type and other attributes. If a user would like to know, say, how many cars have rpm above 5500, a convertible body style and a price above 20000, subsetting helps in pulling out exactly that data. The following exercises will help in understanding the subsetting concepts.
- Import: “./Automobile Data Set/AutoDataset.csv”
- Create a new dataset for exclusively Toyota cars
- Create a new dataset for all cars with city.mpg greater than 30 and engine size less than 120
- Create a new dataset by taking only sedan cars. Keep only four variables (make, body style, fuel type, price) in the final dataset
- Create a new dataset by taking the Audi, BMW and Porsche makes. Drop two variables from the resultant dataset (price and normalized losses)
Solutions
- Import : “./Automobile Data Set/AutoDataset.csv”
auto_data <- read.csv("~\\R dataset\\Automobile Data Set\\AutoDataset.csv")
- Create a new dataset for exclusively Toyota cars
toyota_data <- subset(auto_data, make == "toyota")
head(toyota_data)
## symboling normalized.losses make fuel.type aspiration num.of.doors
## 151 1 87 toyota gas std two
## 152 1 87 toyota gas std two
## 153 1 74 toyota gas std four
## 154 0 77 toyota gas std four
## 155 0 81 toyota gas std four
## 156 0 91 toyota gas std four
## body.style drive.wheels engine.location wheel.base length width height
## 151 hatchback fwd front 95.7 158.7 63.6 54.5
## 152 hatchback fwd front 95.7 158.7 63.6 54.5
## 153 hatchback fwd front 95.7 158.7 63.6 54.5
## 154 wagon fwd front 95.7 169.7 63.6 59.1
## 155 wagon 4wd front 95.7 169.7 63.6 59.1
## 156 wagon 4wd front 95.7 169.7 63.6 59.1
## curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## 151 1985 ohc four 92 2bbl 3.05
## 152 2040 ohc four 92 2bbl 3.05
## 153 2015 ohc four 92 2bbl 3.05
## 154 2280 ohc four 92 2bbl 3.05
## 155 2290 ohc four 92 2bbl 3.05
## 156 3110 ohc four 92 2bbl 3.05
## stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg
## 151 3.03 9 62 4800 35 39
## 152 3.03 9 62 4800 31 38
## 153 3.03 9 62 4800 31 38
## 154 3.03 9 62 4800 31 37
## 155 3.03 9 62 4800 27 32
## 156 3.03 9 62 4800 27 32
## price
## 151 5348
## 152 6338
## 153 6488
## 154 6918
## 155 7898
## 156 8778
- Create a new dataset for all cars with city.mpg greater than 30 and engine size less than 120
auto_data1 <- subset(auto_data, (city.mpg > 30) & (engine.size < 120))
head(auto_data1)
## symboling normalized.losses make fuel.type aspiration num.of.doors
## 19 2 121 chevrolet gas std two
## 20 1 98 chevrolet gas std two
## 21 0 81 chevrolet gas std four
## 22 1 118 dodge gas std two
## 23 1 118 dodge gas std two
## 25 1 148 dodge gas std four
## body.style drive.wheels engine.location wheel.base length width height
## 19 hatchback fwd front 88.4 141.1 60.3 53.2
## 20 hatchback fwd front 94.5 155.9 63.6 52.0
## 21 sedan fwd front 94.5 158.8 63.6 52.0
## 22 hatchback fwd front 93.7 157.3 63.8 50.8
## 23 hatchback fwd front 93.7 157.3 63.8 50.8
## 25 hatchback fwd front 93.7 157.3 63.8 50.6
## curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## 19 1488 l three 61 2bbl 2.91
## 20 1874 ohc four 90 2bbl 3.03
## 21 1909 ohc four 90 2bbl 3.03
## 22 1876 ohc four 90 2bbl 2.97
## 23 1876 ohc four 90 2bbl 2.97
## 25 1967 ohc four 90 2bbl 2.97
## stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg price
## 19 3.03 9.50 48 5100 47 53 5151
## 20 3.11 9.60 70 5400 38 43 6295
## 21 3.11 9.60 70 5400 38 43 6575
## 22 3.23 9.41 68 5500 37 41 5572
## 23 3.23 9.40 68 5500 31 38 6377
## 25 3.23 9.40 68 5500 31 38 6229
- Create a new dataset by taking only sedan cars. Keep only four variables(Make, body style, fuel type, price) in the final dataset.
auto_data2 <- subset(auto_data, body.style == "sedan" , select = c(make, body.style,fuel.type,price))
head(auto_data2)
## make body.style fuel.type price
## 4 audi sedan gas 13950
## 5 audi sedan gas 17450
## 6 audi sedan gas 15250
## 7 audi sedan gas 17710
## 9 audi sedan gas 23875
## 11 bmw sedan gas 16430
- Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses)
auto_data3 <- subset(auto_data, (make == "audi") | (make == "bmw") | (make == "porsche"), select = c(-price, -normalized.losses))
head(auto_data3)
## symboling make fuel.type aspiration num.of.doors body.style drive.wheels
## 4 2 audi gas std four sedan fwd
## 5 2 audi gas std four sedan 4wd
## 6 2 audi gas std two sedan fwd
## 7 1 audi gas std four sedan fwd
## 8 1 audi gas std four wagon fwd
## 9 1 audi gas turbo four sedan fwd
## engine.location wheel.base length width height curb.weight engine.type
## 4 front 99.8 176.6 66.2 54.3 2337 ohc
## 5 front 99.4 176.6 66.4 54.3 2824 ohc
## 6 front 99.8 177.3 66.3 53.1 2507 ohc
## 7 front 105.8 192.7 71.4 55.7 2844 ohc
## 8 front 105.8 192.7 71.4 55.7 2954 ohc
## 9 front 105.8 192.7 71.4 55.9 3086 ohc
## num.of.cylinders engine.size fuel.system bore stroke compression.ratio
## 4 four 109 mpfi 3.19 3.4 10.0
## 5 five 136 mpfi 3.19 3.4 8.0
## 6 five 136 mpfi 3.19 3.4 8.5
## 7 five 136 mpfi 3.19 3.4 8.5
## 8 five 136 mpfi 3.19 3.4 8.5
## 9 five 131 mpfi 3.13 3.4 8.3
## horsepower peak.rpm city.mpg highway.mpg
## 4 102 5500 24 30
## 5 115 5500 18 22
## 6 110 5500 19 25
## 7 110 5500 19 25
## 8 110 5500 19 25
## 9 140 5500 17 20
Calculated Fields in R
A new field can be added to an existing dataset using the $ operator.
Creating Calculated Fields in R
Consider the following syntax, where a new field sum is added (attached) to the dataset; its contents are the sum of two other variables, x1 and x2, in the dataset.
```r
dataset$sum <- dataset$x1 + dataset$x2
```
The average of two existing variables can be stored in a new variable in the dataset, as shown below:
```r
dataset$mean <- (dataset$x1 + dataset$x2)/2
```
However, assigning this way can be tedious, as the user has to type the name of the dataset again and again. attach() makes the columns of the dataset directly accessible by name, so they can be referenced on the right-hand side without the dataset$ prefix (the new columns on the left-hand side still need it); remember to detach() the dataset afterwards. Refer to the code below:
```r
attach(dataset)
dataset$sum <- x1 + x2
dataset$mean <- (x1 + x2)/2
detach(dataset)
```
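As an alternative sketch that avoids attach()/detach() altogether, base R's transform() evaluates column names inside the data frame (x1 and x2 are the same placeholder column names as above):

```r
# transform() looks up x1 and x2 inside `dataset`, so no attach() is needed
dataset <- transform(dataset,
                     sum  = x1 + x2,
                     mean = (x1 + x2) / 2)
```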
Demo: Calculated Fields in R
Getting an idea of the size of the cars in the Auto data
dim(auto_data)
## [1] 205 26
auto_data$area<-(auto_data$length)*(auto_data$width)*(auto_data$height)
dim(auto_data)
## [1] 205 27
As we can see, before creating the new variable area in the AutoDataset the dimensions were 205 rows and 26 columns, and after creating the new field they are 205 rows and 27 columns. Hence a new variable area was created in the AutoDataset. Next, we create a new variable by reducing the balance by 20%:
bank$balance_new<-bank$balance*0.8
summary(bank)
## Cust_num age job marital
## Min. : 1 Min. :18.00 blue-collar:9732 divorced: 5207
## 1st Qu.:11304 1st Qu.:33.00 management :9458 married :27214
## Median :22606 Median :39.00 technician :7597 single :12790
## Mean :22606 Mean :40.94 admin. :5171
## 3rd Qu.:33909 3rd Qu.:48.00 services :4154
## Max. :45211 Max. :95.00 retired :2264
## (Other) :6835
## education default balance housing loan
## primary : 6851 no :44396 Min. : -8019 no :20081 no :37967
## secondary:23202 yes: 815 1st Qu.: 72 yes:25130 yes: 7244
## tertiary :13301 Median : 448
## unknown : 1857 Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## contact day month duration
## cellular :29285 Min. : 1.00 may :13766 Min. : 0.0
## telephone: 2906 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0
## unknown :13020 Median :16.00 aug : 6247 Median : 180.0
## Mean :15.81 jun : 5341 Mean : 258.2
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0
## Max. :31.00 apr : 2932 Max. :4918.0
## (Other): 6060
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 failure: 4901
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840
## Median : 2.000 Median : -1.0 Median : 0.0000 success: 1511
## Mean : 2.764 Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
##
## y balance_new
## no :39922 Min. :-6415.2
## yes: 5289 1st Qu.: 57.6
## Median : 358.4
## Mean : 1089.8
## 3rd Qu.: 1142.4
## Max. :81701.6
##
If then Else
The if-else statement in R is very intuitive: if (condition) { syntax1 } else { syntax2 }. The ifelse() function works the same way logically, but the difference lies in the syntax, which is shown below.
Newvar<-ifelse( Condition, True Value, False Value)
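Note that ifelse() is vectorised: it evaluates the condition for every element of a vector in one call. A tiny illustrative sketch with made-up values:

```r
units <- c(5, 250, 90, 450)
ifelse(units > 200, "high", "low")   # returns "low" "high" "low" "high"
```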
Replace missing horsepower values by -1
auto_data$horsepower_new<-ifelse(auto_data$horsepower=="?",-1, auto_data$horsepower)
Replace missing peak.rpm values by -1
auto_data$peak_rpm_new<-ifelse(auto_data$peak.rpm=="?",-1, auto_data$peak.rpm)
LAB: If then Else
Conditional operators are very helpful in filtering and flagging the required data in a dataset. Say, as the manager of a multi-store retail chain, you want to find all the customers who pay more than a certain amount, or the customers who purchase in bulk quantities; applying conditional operators to the Online Retail.csv dataset helps in making such concepts clear.
- Import “./Online Retail Sales Data/Online Retail.csv”
- Create a new variable Quantity indicator. If Quantity is more than 200 then Quantity indicator = “high”, else “low”. For example:
- 100 – “Low”
- 450 – “High”
- Create a new variable Price Class: if the price is less than 1 then “Low”, between 1 and 80 then “Medium”, and if the price is more than 80 then “High”
Solutions
- Import “./Online Retail Sales Data/Online Retail.csv”
Online_Retail <- read.csv("~\\R dataset\\Online Retail Sales Data\\Online Retail.csv")
head(Online_Retail)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 12/1/2010 8:26 2.55 17850 United Kingdom
## 2 12/1/2010 8:26 3.39 17850 United Kingdom
## 3 12/1/2010 8:26 2.75 17850 United Kingdom
## 4 12/1/2010 8:26 3.39 17850 United Kingdom
## 5 12/1/2010 8:26 3.39 17850 United Kingdom
## 6 12/1/2010 8:26 7.65 17850 United Kingdom
2. Create a new variable Quantity indicator. If Quantity is more than 200 then Quantity indicator = “high”, else “low” (e.g. 100 – “Low”, 450 – “High”)
Online_Retail$Quantity_indicator <- ifelse(Online_Retail$Quantity < 101,"Low", "High")
head(Online_Retail)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country Quantity_indicator
## 1 12/1/2010 8:26 2.55 17850 United Kingdom Low
## 2 12/1/2010 8:26 3.39 17850 United Kingdom Low
## 3 12/1/2010 8:26 2.75 17850 United Kingdom Low
## 4 12/1/2010 8:26 3.39 17850 United Kingdom Low
## 5 12/1/2010 8:26 3.39 17850 United Kingdom Low
## 6 12/1/2010 8:26 7.65 17850 United Kingdom Low
3. Create a new variable Price Class: if the price is less than 1 then “Low”, between 1 and 80 then “Medium”, and if the price is more than 80 then “High”
Online_Retail$Price_Class <- ifelse(Online_Retail$UnitPrice < 1,"Low", ifelse(Online_Retail$UnitPrice > 80, "High", "Medium"))
head(Online_Retail)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country Quantity_indicator
## 1 12/1/2010 8:26 2.55 17850 United Kingdom Low
## 2 12/1/2010 8:26 3.39 17850 United Kingdom Low
## 3 12/1/2010 8:26 2.75 17850 United Kingdom Low
## 4 12/1/2010 8:26 3.39 17850 United Kingdom Low
## 5 12/1/2010 8:26 3.39 17850 United Kingdom Low
## 6 12/1/2010 8:26 7.65 17850 United Kingdom Low
## Price_Class
## 1 Medium
## 2 Medium
## 3 Medium
## 4 Medium
## 5 Medium
## 6 Medium
Sorting the Data
Sorting can be considered a fundamental part of data analysis. A user might want to sort names in alphabetical order, or sort income data to find the highest tax payers; sorting helps in arranging the data in ascending or descending order. Consider this example: a bank wants to find the highest tax payers in a city. For this they need to sort the data by income in descending order and then select the top tax payers from the list. In R, the built-in function order() is used to sort data in ascending or descending order. The syntax for sorting the data is:
Newdata<-olddata[order(variables), ]
It is ascending by default.
Online_Retail_sort<-Online_Retail[order(Online_Retail$UnitPrice),]
head(Online_Retail_sort)
## InvoiceNo StockCode Description Quantity InvoiceDate
## 299984 A563186 B Adjust bad debt 1 8/12/2011 14:51
## 299985 A563187 B Adjust bad debt 1 8/12/2011 14:52
## 623 536414 22139 56 12/1/2010 11:52
## 1971 536545 21134 1 12/1/2010 14:32
## 1972 536546 22145 1 12/1/2010 14:33
## 1973 536547 37509 1 12/1/2010 14:33
## UnitPrice CustomerID Country Quantity_indicator Price_Class
## 299984 -11062.06 NA United Kingdom Low Low
## 299985 -11062.06 NA United Kingdom Low Low
## 623 0.00 NA United Kingdom Low Low
## 1971 0.00 NA United Kingdom Low Low
## 1972 0.00 NA United Kingdom Low Low
## 1973 0.00 NA United Kingdom Low Low
Use a negative sign before the variable to sort in descending order:
Online_Retail_sort1<-Online_Retail[order(- Online_Retail$UnitPrice),]
head(Online_Retail_sort1)
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 222682 C556445 M Manual -1 6/10/2011 15:31 38970.00
## 524603 C580605 AMAZONFEE AMAZON FEE -1 12/5/2011 11:36 17836.46
## 43703 C540117 AMAZONFEE AMAZON FEE -1 1/5/2011 9:55 16888.02
## 43704 C540118 AMAZONFEE AMAZON FEE -1 1/5/2011 9:57 16453.71
## 15017 C537630 AMAZONFEE AMAZON FEE -1 12/7/2010 15:04 13541.33
## 15018 537632 AMAZONFEE AMAZON FEE 1 12/7/2010 15:08 13541.33
## CustomerID Country Quantity_indicator Price_Class
## 222682 15098 United Kingdom Low High
## 524603 NA United Kingdom Low High
## 43703 NA United Kingdom Low High
## 43704 NA United Kingdom Low High
## 15017 NA United Kingdom Low High
## 15018 NA United Kingdom Low High
Sorting based on multiple variables: sorting can also be done on two variables of a dataset simultaneously. In the code below, Country is sorted in ascending (alphabetical) order and, within each country, Quantity is sorted in ascending (numeric) order; the result is stored in Online_Retail_sort2, the output of which can be seen below.
Online_Retail_sort2<-Online_Retail[order(Online_Retail$Country,Online_Retail$Quantity),]
head(Online_Retail_sort2)
## InvoiceNo StockCode Description Quantity
## 433988 C574019 22738 RIBBON REEL SNOWY VILLAGE -120
## 170900 C551348 22990 COTTON APRON PANTRY DESIGN -100
## 108304 C545525 22969 HOMEMADE JAM SCENTED CANDLES -86
## 270094 C560540 22492 MINI PAINT SET VINTAGE -36
## 279047 C561227 84978 HANGING HEART JAR T-LIGHT HOLDER -36
## 108305 C545525 22697 GREEN REGENCY TEACUP AND SAUCER -24
## InvoiceDate UnitPrice CustomerID Country Quantity_indicator
## 433988 11/2/2011 12:05 1.45 12415 Australia Low
## 170900 4/28/2011 9:49 4.25 12415 Australia Low
## 108304 3/3/2011 13:11 1.25 12415 Australia Low
## 270094 7/19/2011 12:26 0.65 12415 Australia Low
## 279047 7/26/2011 10:15 1.06 12431 Australia Low
## 108305 3/3/2011 13:11 2.55 12415 Australia Low
## Price_Class
## 433988 Medium
## 170900 Medium
## 108304 Medium
## 270094 Low
## 279047 Medium
## 108305 Medium
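If, instead, Quantity should be in descending order within each country, a minus sign on the numeric column inside order() does it; a minimal sketch on the same dataset:

```r
# Country ascending, Quantity descending within each country
Online_Retail_sort3 <- Online_Retail[order(Online_Retail$Country, -Online_Retail$Quantity), ]
head(Online_Retail_sort3)
```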
LAB: Sorting the data
- Import AutoDataset
- Sort the dataset based on length
- Sort the dataset based on length descending
Solutions
- Import AutoDataset
auto_data <- read.csv("~\\R dataset\\Automobile Data Set\\AutoDataset.csv")
- Sort the dataset based on length
auto_data1 <- auto_data[order(auto_data$length),]
head(auto_data1)
## symboling normalized.losses make fuel.type aspiration num.of.doors
## 19 2 121 chevrolet gas std two
## 31 2 137 honda gas std two
## 32 2 137 honda gas std two
## 33 1 101 honda gas std two
## 34 1 101 honda gas std two
## 35 1 101 honda gas std two
## body.style drive.wheels engine.location wheel.base length width height
## 19 hatchback fwd front 88.4 141.1 60.3 53.2
## 31 hatchback fwd front 86.6 144.6 63.9 50.8
## 32 hatchback fwd front 86.6 144.6 63.9 50.8
## 33 hatchback fwd front 93.7 150.0 64.0 52.6
## 34 hatchback fwd front 93.7 150.0 64.0 52.6
## 35 hatchback fwd front 93.7 150.0 64.0 52.6
## curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## 19 1488 l three 61 2bbl 2.91
## 31 1713 ohc four 92 1bbl 2.91
## 32 1819 ohc four 92 1bbl 2.91
## 33 1837 ohc four 79 1bbl 2.91
## 34 1940 ohc four 92 1bbl 2.91
## 35 1956 ohc four 92 1bbl 2.91
## stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg price
## 19 3.03 9.5 48 5100 47 53 5151
## 31 3.41 9.6 58 4800 49 54 6479
## 32 3.41 9.2 76 6000 31 38 6855
## 33 3.07 10.1 60 5500 38 42 5399
## 34 3.41 9.2 76 6000 30 34 6529
## 35 3.41 9.2 76 6000 30 34 7129
- Sort the dataset based on length descending
auto_data2 <- auto_data[order(- auto_data$length),]
head(auto_data2)
## symboling normalized.losses make fuel.type aspiration
## 74 0 NA mercedes-benz gas std
## 71 -1 93 mercedes-benz diesel turbo
## 72 -1 NA mercedes-benz gas std
## 48 0 145 jaguar gas std
## 49 0 NA jaguar gas std
## 75 1 NA mercedes-benz gas std
## num.of.doors body.style drive.wheels engine.location wheel.base length
## 74 four sedan rwd front 120.9 208.1
## 71 four sedan rwd front 115.6 202.6
## 72 four sedan rwd front 115.6 202.6
## 48 four sedan rwd front 113.0 199.6
## 49 four sedan rwd front 113.0 199.6
## 75 two hardtop rwd front 112.0 199.2
## width height curb.weight engine.type num.of.cylinders engine.size
## 74 71.7 56.7 3900 ohcv eight 308
## 71 71.7 56.3 3770 ohc five 183
## 72 71.7 56.5 3740 ohcv eight 234
## 48 69.6 52.8 4066 dohc six 258
## 49 69.6 52.8 4066 dohc six 258
## 75 72.0 55.4 3715 ohcv eight 304
## fuel.system bore stroke compression.ratio horsepower peak.rpm city.mpg
## 74 mpfi 3.80 3.35 8.0 184 4500 14
## 71 idi 3.58 3.64 21.5 123 4350 22
## 72 mpfi 3.46 3.10 8.3 155 4750 16
## 48 mpfi 3.63 4.17 8.1 176 4750 15
## 49 mpfi 3.63 4.17 8.1 176 4750 15
## 75 mpfi 3.80 3.35 8.0 184 4500 14
## highway.mpg price
## 74 16 40960
## 71 25 31600
## 72 18 34184
## 48 19 32250
## 49 19 35550
## 75 16 45400
Identifying Duplicates
Duplicates can be exact copies of entire records, or records that match on one or more key fields, like one customer purchasing multiple times in one day from the same store. In R, we have a function called duplicated() which can be used to find repeated entries in the dataset: it returns the indexes of the duplicated records, which we can then use to select those records. There are two types of duplicate records: whole-record duplicates and duplicates based on a key.
Overall record-level duplicates: first fetch the duplicate record indexes.
Bill <- read.csv("~\\R dataset\\Telecom Data Analysis\\Bill.csv")
Dupes_in_bill<-duplicated(Bill)
length(Dupes_in_bill)
## [1] 9462
head(Dupes_in_bill)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
summary(Dupes_in_bill)
## Mode FALSE TRUE NA's
## logical 9452 10 0
In the above code, duplicated(Bill) identifies all the duplicate entries in the dataset and stores a logical value for each record in Dupes_in_bill: FALSE if the record is not a duplicate, TRUE if it is. There are 10 duplicate entries, which can be seen in the summary() output. Next, access all the duplicates:
Bill_dupes<-Bill[Dupes_in_bill,]
head(Bill_dupes)
## cust_id act_active_dt package_id bill_zip customer_segment status
## 134 8067122301 7/27/2008 608645 560001 G - TC ACTIVE
## 135 8067122304 7/27/2008 608645 560001 G - TC ACTIVE
## 7162 9243344257 9/24/2009 620693 560035 G - TC ACTIVE
## 7166 9243350958 9/24/2009 620693 560035 G - TC ACTIVE
## 7170 9243356981 9/24/2009 620693 560035 G - TC ACTIVE
## 7172 9243381865 9/24/2009 620693 560035 G - TC ACTIVE
## bill_amt
## 134 619.41
## 135 782.21
## 7162 0.00
## 7166 0.00
## 7170 0.00
## 7172 0.00
Bill_dupes contains all the duplicated entries in the dataset. The user can easily identify these entries and take appropriate action on them.
Duplicates based on Key
The code above finds duplicates at the overall record level. What if the user is not interested in whole-record duplicates? There are situations where only a key variable is duplicated across the dataset. In that case, instead of using duplicated() on the complete dataset, the function can be applied to a specific variable. Let's have a look at this example:
Dupes_in_bill_cust_id<-duplicated(Bill$cust_id)
summary(Dupes_in_bill_cust_id)
## Mode FALSE TRUE NA's
## logical 9389 73 0
Access all duplicates
Bill_dupes<-Bill[Dupes_in_bill_cust_id,]
head(Bill_dupes)
## cust_id act_active_dt package_id bill_zip customer_segment status
## 134 8067122301 7/27/2008 608645 560001 G - TC ACTIVE
## 135 8067122304 7/27/2008 608645 560001 G - TC ACTIVE
## 561 9243538711 9/24/2009 625001 560035 G - TC ACTIVE
## 563 9243323496 9/24/2009 620693 560035 G - TC ACTIVE
## 565 9243323504 9/24/2009 620693 560035 G - TC ACTIVE
## 567 9243323511 9/24/2009 620693 560035 G - TC ACTIVE
## bill_amt
## 134 619.41
## 135 782.21
## 561 561.80
## 563 545.98
## 565 254.76
## 567 545.98
Removing Duplicates
Add the “!” sign to access the unique values: instead of selecting the duplicated record indexes, we can select the unique records using the “!” operator.
Bill_unique<-Bill[!Dupes_in_bill,]
Bill_unique_cust_id<-Bill[!Dupes_in_bill_cust_id,]
Dupes_in_bill contains a logical value (TRUE or FALSE) for every record: TRUE where the record is a duplicate and FALSE where it is not. The “!” sign flips these values, so only the records that were FALSE in Dupes_in_bill (the non-duplicates) are kept in Bill_unique, and the duplicate rows are filtered out. The unique() function can also remove overall duplicate records; it simply keeps the unique records.
Bill_unique1<-unique(Bill)
LAB: Handling Duplicates in R
Duplicates are very frequent in many transactional datasets. Here we have examples of telecom bill and complaints data; a single user may register the same complaint twice in one day. Such repeated data needs to be found and removed from the dataset. The following exercises on the Complaints data show how this is done.
- Import DataSet: “./Telecom Data Analysis/Complaints.csv”
- Identify overall duplicates in complaints data
- Create a new dataset by removing overall duplicates in Complaints data
- Identify duplicates in complaints data based on cust_id
- Create a new dataset by removing duplicates based on cust_id in Complaints data
Solutions
- Import DataSet: “./Telecom Data Analysis/Complaints.csv”
Complaints <- read.csv("~\\R dataset\\Telecom Data Analysis\\Complaints.csv")
- Identify overall duplicates in complaints data
Dupes_in_Complaints <- duplicated(Complaints)
summary(Dupes_in_Complaints)
## Mode FALSE NA's
## logical 6587 0
Complaint_dupes <- Complaints[Dupes_in_Complaints,]
Complaint_dupes
## [1] comp_id month incident cust_id
## [5] sla.status.new incident.type type severity
## <0 rows> (or 0-length row.names)
- Create a new dataset by removing overall duplicates in Complaints data
Complaints_unique1<-unique(Complaints)
- Identify duplicates in complaints data based on cust_id
Dupes_in_complaints_cust_id<-duplicated(Complaints$cust_id)
summary(Dupes_in_complaints_cust_id)
## Mode FALSE TRUE NA's
## logical 4856 1731 0
complaints_dupes<-Complaints[Dupes_in_complaints_cust_id,]
dim(complaints_dupes)
## [1] 1731 8
- Create a new dataset by removing duplicates based on cust_id in Complaints data
complaints_unique_custID<-Complaints[!Dupes_in_complaints_cust_id,]
nrow(Complaints)
## [1] 6587
nrow(complaints_unique_custID)
## [1] 4856
Datasets Merging and Joining
In R, we have the merge() function to join two datasets into one. The syntax is x <- merge(dataset1, dataset2, by="primary_key"); this merges dataset1 and dataset2 and saves the merged data in the variable x. Note: you need to specify the type of join, otherwise the function performs an inner join by default.
- With a single primary key, the syntax is:
Newdata <- merge(dataone, datatwo ,by="primary_key")
- With composite keys: more than one key can also be used to join two datasets, as the toy sketch below illustrates. The syntax is:
Newdata <- merge(dataone, datatwo, by=c("primary_key1", "primary_key2"))
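A small toy sketch of a composite-key merge (the data frames and column names below are made up purely for illustration):

```r
# Two hypothetical tables keyed by customer and month
bills <- data.frame(cust_id = c(1, 1, 2), month = c("Jan", "Feb", "Jan"), bill_amt = c(100, 120, 80))
calls <- data.frame(cust_id = c(1, 2, 2), month = c("Jan", "Jan", "Feb"), minutes  = c(300, 150, 200))
merge(bills, calls, by = c("cust_id", "month"))   # keeps rows where both keys match (inner join)
```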
Joins
Joining the data can happen in one of four ways: inner join, outer join, left outer join and right outer join. The value given for the parameter "all" decides the type of join.
Inner Join
Newdata <- merge(dataone, datatwo, by="primary_key", all=FALSE)
Outer Join
Newdata <- merge(dataone, datatwo, by="primary_key", all=TRUE)
Left Outer Join
Newdata <- merge(dataone, datatwo, by="primary_key", all.x=TRUE)
Right Outer Join
Newdata <- merge(dataone, datatwo, by="primary_key", all.y=TRUE)
Code: Joins
We have the TV commercial slots datasets:
- TV Commercial Slots Analysis/orders.csv
- TV Commercial Slots Analysis/slots.csv
orders<-read.csv("~\\R dataset\\TV Commercial Slots Analysis\\orders.csv")
slots<-read.csv("~\\R dataset\\TV Commercial Slots Analysis\\slots.csv")
- Inner Join: merge using the single key “Unique_id”
Order_Slot_Common <- merge(orders, slots, by="Unique_id", all=FALSE)
head(Order_Slot_Common)
## Unique_id AD_ID.x Date Time
## 1 SPYMYA2CBSH416510.833333333333333 SPYMYA2CBSH 1/12/2014 8:00 PM
## 2 SPYMYA60C008416520.375 SPYMYA60C008 1/13/2014 9:00 AM
## 3 SPYMYA60C008416540.375 SPYMYA60C008 1/15/2014 9:00 AM
## 4 SPYMYA60C008416550.458333333333333 SPYMYA60C008 1/16/2014 11:00 AM
## 5 SPYMYA60C008416560.375 SPYMYA60C008 1/17/2014 9:00 AM
## 6 SPYMYA60C008416570.666666666666667 SPYMYA60C008 1/18/2014 4:00 PM
## Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders
## 1 MSDR02 SPYMYA CBS 120 3
## 2 MSDR02 SPYMYA TBKIDS 60 1
## 3 MSDR02 SPYMYA TBKIDS 60 1
## 4 MSDR02 SPYMYA TBKIDS 60 1
## 5 MSDR02 SPYMYA TBKIDS 60 1
## 6 MSDR02 SPYMYA TBKIDS 60 1
## AD_ID.y Air.Date Air.Time Time.Zone Client.Product.Code.y
## 1 SPYMYA2CBSH 1/12/2014 8:00 PM EST MSDR02
## 2 SPYMYA60C008 1/13/2014 9:00 AM EST MSDR02
## 3 SPYMYA60C008 1/15/2014 9:00 AM EST MSDR02
## 4 SPYMYA60C008 1/16/2014 11:00 AM EST MSDR02
## 5 SPYMYA60C008 1/17/2014 9:00 AM EST MSDR02
## 6 SPYMYA60C008 1/18/2014 4:00 PM EST MSDR02
## Product.ID.y Length.y Network.ID.y Network.Name Network.Type
## 1 SPYMYA 120 CBS CBS Broadcast Station
## 2 SPYMYA 60 TBKIDS The Band Kids Cable
## 3 SPYMYA 60 TBKIDS The Band Kids Cable
## 4 SPYMYA 60 TBKIDS The Band Kids Cable
## 5 SPYMYA 60 TBKIDS The Band Kids Cable
## 6 SPYMYA 60 TBKIDS The Band Kids Cable
## Network.DMA Network.Feed HH.Impressions HH.Ratings Spot.Cost
## 1 National 18418 15.91 4000
## 2 Regional NA NA 1000
## 3 Regional NA NA 1000
## 4 Regional NA NA 1000
## 5 Regional NA NA 1050
## 6 Regional NA NA 1000
## Spot.Cost.Discounted
## 1 3800.0
## 2 950.0
## 3 950.0
## 4 950.0
## 5 997.5
## 6 950.0
- Left Outer Join: merge by using the single key “Unique_id”
Order_all_Slot_match<-merge(orders, slots, by="Unique_id", all.x=TRUE)
head(Order_all_Slot_match)
## Unique_id AD_ID.x Date Time
## 1 MYSPYR60A002416540.375 MYSPYR60A002 1/15/2014 9:00 AM
## 2 MYSPYR60A002416620.708333333333333 MYSPYR60A002 1/23/2014 5:00 PM
## 3 MYSPYR60A002416630.416666666666667 MYSPYR60A002 1/24/2014 10:00 AM
## 4 MYSPYR60A002416640.625 MYSPYR60A002 1/25/2014 3:00 PM
## 5 MYSPYR60A004416500.666666666666667 MYSPYR60A004 1/11/2014 4:00 PM
## 6 MYSPYR60A004416520.916666666666667 MYSPYR60A004 1/13/2014 10:00 PM
## Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders AD_ID.y
## 1 MSDR02R MYSPYR ESQ 60 1 <NA>
## 2 MSDR02R MYSPYR ESQ 60 1 <NA>
## 3 MSDR02R MYSPYR ESQ 60 1 <NA>
## 4 MSDR02R MYSPYR ESQ 60 1 <NA>
## 5 MSDR02R MYSPYR NHLN 60 1 <NA>
## 6 MSDR02R MYSPYR NHLN 60 1 <NA>
## Air.Date Air.Time Time.Zone Client.Product.Code.y Product.ID.y Length.y
## 1 <NA> <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> <NA> <NA> NA
## Network.ID.y Network.Name Network.Type Network.DMA Network.Feed
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## HH.Impressions HH.Ratings Spot.Cost Spot.Cost.Discounted
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
- Right Outer Join: merge by using the single key “Unique_id”
Slot_all_order_match<-merge(orders, slots, by="Unique_id", all.y=TRUE)
head(Slot_all_order_match)
## Unique_id AD_ID.x Date Time
## 1 SPYMYA2CBSH416510.833333333333333 SPYMYA2CBSH 1/12/2014 8:00 PM
## 2 SPYMYA60C008416520.375 SPYMYA60C008 1/13/2014 9:00 AM
## 3 SPYMYA60C008416540.375 SPYMYA60C008 1/15/2014 9:00 AM
## 4 SPYMYA60C008416550.458333333333333 SPYMYA60C008 1/16/2014 11:00 AM
## 5 SPYMYA60C008416560.375 SPYMYA60C008 1/17/2014 9:00 AM
## 6 SPYMYA60C008416570.666666666666667 SPYMYA60C008 1/18/2014 4:00 PM
## Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders
## 1 MSDR02 SPYMYA CBS 120 3
## 2 MSDR02 SPYMYA TBKIDS 60 1
## 3 MSDR02 SPYMYA TBKIDS 60 1
## 4 MSDR02 SPYMYA TBKIDS 60 1
## 5 MSDR02 SPYMYA TBKIDS 60 1
## 6 MSDR02 SPYMYA TBKIDS 60 1
## AD_ID.y Air.Date Air.Time Time.Zone Client.Product.Code.y
## 1 SPYMYA2CBSH 1/12/2014 8:00 PM EST MSDR02
## 2 SPYMYA60C008 1/13/2014 9:00 AM EST MSDR02
## 3 SPYMYA60C008 1/15/2014 9:00 AM EST MSDR02
## 4 SPYMYA60C008 1/16/2014 11:00 AM EST MSDR02
## 5 SPYMYA60C008 1/17/2014 9:00 AM EST MSDR02
## 6 SPYMYA60C008 1/18/2014 4:00 PM EST MSDR02
## Product.ID.y Length.y Network.ID.y Network.Name Network.Type
## 1 SPYMYA 120 CBS CBS Broadcast Station
## 2 SPYMYA 60 TBKIDS The Band Kids Cable
## 3 SPYMYA 60 TBKIDS The Band Kids Cable
## 4 SPYMYA 60 TBKIDS The Band Kids Cable
## 5 SPYMYA 60 TBKIDS The Band Kids Cable
## 6 SPYMYA 60 TBKIDS The Band Kids Cable
## Network.DMA Network.Feed HH.Impressions HH.Ratings Spot.Cost
## 1 National 18418 15.91 4000
## 2 Regional NA NA 1000
## 3 Regional NA NA 1000
## 4 Regional NA NA 1000
## 5 Regional NA NA 1050
## 6 Regional NA NA 1000
## Spot.Cost.Discounted
## 1 3800.0
## 2 950.0
## 3 950.0
## 4 950.0
## 5 997.5
## 6 950.0
- Outer Join: merge by using the single key “Unique_id”
Order_Slots_union<- merge(orders, slots, by="Unique_id", all=TRUE)
head(Order_Slots_union)
## Unique_id AD_ID.x Date Time
## 1 MYSPYR60A002416540.375 MYSPYR60A002 1/15/2014 9:00 AM
## 2 MYSPYR60A002416620.708333333333333 MYSPYR60A002 1/23/2014 5:00 PM
## 3 MYSPYR60A002416630.416666666666667 MYSPYR60A002 1/24/2014 10:00 AM
## 4 MYSPYR60A002416640.625 MYSPYR60A002 1/25/2014 3:00 PM
## 5 MYSPYR60A004416500.666666666666667 MYSPYR60A004 1/11/2014 4:00 PM
## 6 MYSPYR60A004416520.916666666666667 MYSPYR60A004 1/13/2014 10:00 PM
## Client.Product.Code.x Product.ID.x Network.ID.x Length.x Orders AD_ID.y
## 1 MSDR02R MYSPYR ESQ 60 1 <NA>
## 2 MSDR02R MYSPYR ESQ 60 1 <NA>
## 3 MSDR02R MYSPYR ESQ 60 1 <NA>
## 4 MSDR02R MYSPYR ESQ 60 1 <NA>
## 5 MSDR02R MYSPYR NHLN 60 1 <NA>
## 6 MSDR02R MYSPYR NHLN 60 1 <NA>
## Air.Date Air.Time Time.Zone Client.Product.Code.y Product.ID.y Length.y
## 1 <NA> <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> <NA> <NA> NA
## Network.ID.y Network.Name Network.Type Network.DMA Network.Feed
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## HH.Impressions HH.Ratings Spot.Cost Spot.Cost.Discounted
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
LAB: Data Joins
- Import Datasets
- “./Telecom Data Analysis/Bill.csv”
- “./Telecom Data Analysis/Complaints.csv”
- Import the data and remove duplicates based on cust_id
- Create a dataset for each of these requirements:
  - All the customers who appear either in bill data or complaints data
  - All the customers who appear both in bill data and complaints data
  - All the customers from bill data: customers who have bill data along with their complaints
  - All the customers from complaints data: customers who have complaints data along with their bill info
Solutions
- Import Datasets
- “./Telecom Data Analysis/Bill.csv”
- “./Telecom Data Analysis/Complaints.csv”
bill<-read.csv("~\\R dataset\\Telecom Data Analysis\\Bill.csv")
Complaints<-read.csv("~\\R dataset\\Telecom Data Analysis\\Complaints.csv")
- Import the data and remove duplicates based on cust_id
Dupes_in_bills_cust_id <- duplicated(bill$cust_id)
Bills_unique<-bill[!Dupes_in_bills_cust_id,]
nrow(bill)
## [1] 9462
nrow(Bills_unique)
## [1] 9389
Dupes_in_Complaints_cust_id <- duplicated(Complaints$cust_id)
Complaints_unique<-Complaints[!Dupes_in_Complaints_cust_id,]
nrow(Complaints)
## [1] 6587
nrow(Complaints_unique)
## [1] 4856
- Create a dataset for each of these requirements:
  - All the customers who appear either in bill data or complaints data
all_customer<-merge(Bills_unique,Complaints_unique,by="cust_id",all=TRUE)
nrow(Bills_unique)
## [1] 9389
nrow(Complaints_unique)
## [1] 4856
nrow(all_customer)
## [1] 13952
- Create a dataset for each of these requirements:
  - All the customers who appear both in bill data and complaints data
customer_in_both<-merge(Bills_unique,Complaints_unique,by="cust_id",all=FALSE)
nrow(Bills_unique)
## [1] 9389
nrow(Complaints_unique)
## [1] 4856
nrow(customer_in_both)
## [1] 293
- Create a dataset for each of these requirements:
  - All the customers from bill data: customers who have bill data along with their complaints
bill_customer<-merge(Bills_unique,Complaints_unique,by="cust_id",all.x=TRUE)
nrow(Bills_unique)
## [1] 9389
nrow(bill_customer)
## [1] 9389
- Create a dataset for each of these requirements:
  - All the customers from complaints data: customers who have complaints data along with their bill info
complaints_customer<-merge(Bills_unique,Complaints_unique,by="cust_id",all.y=TRUE)
nrow(Complaints_unique)
## [1] 4856
nrow(complaints_customer)
## [1] 4856
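As a sanity check on these counts, the outer join should contain as many rows as the two de-duplicated tables combined, minus the rows they have in common: 9389 + 4856 - 293 = 13952, which matches nrow(all_customer). A one-line verification (not part of the original solution):
nrow(Bills_unique) + nrow(Complaints_unique) - nrow(customer_in_both)  # 13952, same as nrow(all_customer)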
Exporting the Data Out of R
Another important feature available in R is exporting data. Data created in R can be saved into other formats such as CSV or XLSX. Just as we used the function read.csv() to read the contents of a CSV file into R, there is a function called write.csv(), with which a dataset compiled in R can be exported to a CSV file.
- Make sure that the export path is correct
write.csv(data, "./path/file_name.csv")
write.csv(customer_in_both, "./Telecom Data Analysis/Bill_and_comp_common.csv")
write.table(all_customer, "./Telecom Data Analysis/Bill_and_comp_overall.txt", sep="\t")
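Note that write.csv() also writes a column of row names by default; if that extra column is not wanted, the row.names argument can be set to FALSE. A minimal sketch (the output file name is illustrative):
write.csv(customer_in_both, "./Telecom Data Analysis/Bill_and_comp_common.csv", row.names = FALSE)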
Conclusion
In this session we started with importing data from various sources. We saw some basic commands to work with data, learnt how to manipulate datasets and create new variables, and covered sorting datasets and handling duplicates. Joining datasets is also an important concept. There are many more topics to discuss in data handling, but the topics covered in this session are essential for any data scientist.