Data Analysing is a skill
It is not like, all the data were available already in the form we require them. It is the duty of the data analyst to take the data, and sometimes he needs to prepare the data by himself, identifying the missing values and replacing them, transpose the data, transform the variables and many other operations on the data. So data handling is a very important part in R. Data Handling involves all kinds of processes like import, curate, validation and exploration of data. Hence, before moving on to Analysis of data, it is very important for one to be good with data handling.
The data can be stored in different formats, like CSV (Comma Separated Variable), Excel, SQL (database), and many other ways. Various statistical tools are available which are used for the critical analysis of the data. R is highly flexible and open source software which is integrated with all the formats of data storage due to which data handling becomes easy in R.
Here we are going to discuss various data handling which includes
- Data importing from files
- Database server connections
- Working with datasets
- Manipulating the datasets in R
- Creating new variables in R
- Sorting in R & Removing Duplicates
- Exporting the R datasets into external files
- Data Merging, etc.
Importing Data from CSV file
CSV (Comma Separated Variable) extension is one of the most commonly used data type. R is compatible with CSV format. By calling a simple function, csv (), we can easily import the data into R.
While defining the path of the CSV file in the csv() function, we must either use “/” or “\\” in the path. The windows style of giving path in which we use a single “\”, doesn’t work in R.
To understand how read.csv() function works, let’s consider a sample program to read the Sales related data from Sales_data.csv as shown below. The function read.csv() imports the data from the CSV file (the path for which is mentioned in the syntax) to R.
In the syntax below, we need to give the local path of the data set as an input to read.csv() function and replace all ‘\’ with ‘\\’.
> Sales <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Superstore Sales Data\\Sales_sample.csv")
> Sales
> Sales <- read.csv("C:\Users\venk\Google Drive\Training\Datasets\Superstore Sales Data\Sales_sample.csv")
> Sales1 <- read.csv("C:/Users/venk/Google Drive/Training/Datasets/Superstore Sales Data/Sales_sample.csv")
> Sales1
## custId custName custCountry productSold
## 1 23262 Candice Levy Congo SUPA101
## 2 23263 Xerxes Smith Panama DETA200
## 3 23264 Levi Douglas Tanzania, United Republic of DETA800
## 4 23265 Uriel Benton South Africa SUPA104
## 5 23266 Celeste Pugh Gabon PURA200
## 6 23267 Vance Campos Syrian Arab Republic PURA100
## 7 23268 Latifah Wall Guadeloupe DETA100
## 8 23269 Jane Hernandez Macedonia PURA100
## 9 23270 Wanda Garza Kyrgyzstan SUPA103
## 10 23271 Athena Fitzpatrick Reunion SUPA103
## 11 23272 Anjolie Hicks Turks and Caicos Islands DETA200
## salesChannel unitsSold dateSold
## 1 Retail 117 8/9/2012
## 2 Online 73 7/6/2012
## 3 Online 205 8/18/2012
## 4 Online 14 8/5/2012
## 5 Retail 170 8/11/2012
## 6 Retail 129 7/11/2012
## 7 Retail 82 7/12/2012
## 8 Retail 116 6/3/2012
## 9 Online 67 6/7/2012
## 10 Retail 125 7/27/2012
## 11 Retail 71 7/31/2012
Importing from SAS files
We have earlier discussed about installing packages. In order to import data from SAS files, we need to install the package “sas7bdat”.
SAS, also known as the Statistical Analysis Software, is used to perform the functions of Data Analytics. SAS is a highly specialized tool in analytics. R is compatible with the SAS and the JMV format. There is an external library available which needs to be installed before importing the data of SAS format.
To install the package, use the command packages(sas7bdat) and to use that package, we need to call the library(sas7bdat). After installing, it will automatically load the library into the R directory. Once the library is installed, users can import the data by calling the function sas7bdat(). Giving path for this function is same as we did for the csv file format.
A sample program to import the SAS data into R is shown below:
> library(sas7bdat)
> gnpdata <-read.sas7bdat("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\SAS datasets\\gnp.sas7bdat")
> View(gnpdata)
## Warning: package 'sas7bdat' was built under R version 3.2.3 ## DATE GNP CONSUMP INVEST EXPORTS GOVT ## 1 0 516.1 325.5 88.7 4.3 97.6 ## 2 91 514.5 331.6 78.1 5.1 99.6 ## 3 182 517.7 331.7 77.4 6.5 102.1 ## 4 274 513.0 333.8 68.5 7.7 103.0 ## 5 366 517.4 334.4 69.5 8.3 105.3 ## 6 456 527.9 339.1 74.7 7.0 107.1
Data import from Excel files
Excel is one of the most widely used tools for reporting and ad-hoc data analysis. The excel files contain mostly aggregated and even raw data sometimes. The function, xlsx() is used to import the desired excel file into R. The spreadsheet is read through this function and is stored into a data frame. Here’s an example of how to read the data from excel into R.
> library(xlsx) > wb_data <- read.xlsx("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\World Bank Data\\World Bank Indicators.xlsx" , sheetName="Data by country")
There’s another way through which data can be imported into R. There is a function called loadWorkbook() from the XLConnect package, which can be used to read the complete workbook. To import any such worksheet, the function readWorksheet() is called. Java package needs to be installed before using XLConnect.
> library(XLConnect)
> wb_data <- readWorksheet(loadWorkbook("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\World Bank Data\\World Bank Indicators.xlsx" ),sheet=1)
We may have some Java related issues while importing Excel files. Excel is not just a flat file like CSV. The excel file contains lots of other information such as indexes, apart from just data. If we find errors even after installing the necessary packages, then we must first store the excel data in CSV format and then import it.
In the next post you will be learning about Database server connections.