103.2.1 Data Handling In R

Data Analysis Is a Skill

Data rarely arrives in the form an analysis requires. It is the analyst's job to take the raw data and prepare it: identify and replace missing values, transpose the data, transform variables, and carry out many other operations. Data handling is therefore a very important part of working in R. It covers importing, curating, validating, and exploring data. Before moving on to the analysis itself, it is important to be comfortable with data handling.
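As a rough illustration of the preparation steps mentioned above, the following sketch finds and replaces missing values, transforms a variable, and transposes part of a data set. The data frame and its column names are invented for this example, and replacing missing numbers with the median is only one of several reasonable choices.

```r
# Toy data frame, invented for illustration only
sales <- data.frame(
  custId    = c(1, 2, 3, 4),
  unitsSold = c(117, NA, 205, 14),
  channel   = c("Retail", "Online", NA, "Online"),
  stringsAsFactors = FALSE
)

# Identify missing values: number of NAs in each column
na_counts <- colSums(is.na(sales))
print(na_counts)

# Replace missing numeric values with the column median (one common choice)
sales$unitsSold[is.na(sales$unitsSold)] <- median(sales$unitsSold, na.rm = TRUE)

# Replace missing categories with an explicit label
sales$channel[is.na(sales$channel)] <- "Unknown"

# Transform a variable: natural log of units sold
sales$logUnits <- log(sales$unitsSold)

# Transpose the numeric columns (rows become columns)
t_sales <- t(sales[, c("custId", "unitsSold")])
print(t_sales)
```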

Data can be stored in many formats, such as CSV (Comma-Separated Values), Excel workbooks, and SQL databases. Many statistical tools are available for analysing such data. R is flexible, open-source software that can read most common storage formats, either directly or through add-on packages, which makes data handling in R easy.

Here we are going to discuss the following data handling topics:

  • Data importing from files
  • Database server connections
  • Working with datasets
  • Manipulating the datasets in R
  • Creating new variables in R
  • Sorting in R & Removing Duplicates
  • Exporting the R datasets into external files
  • Data Merging, etc.

Importing Data from CSV file

CSV (Comma-Separated Values) is one of the most commonly used file formats. R reads CSV files natively: by calling a single function, read.csv(), we can import the data into R.

When giving the path of the CSV file to read.csv(), we must use either “/” or “\\” as the separator. The Windows style of path, with a single “\”, does not work in R because R treats the backslash as an escape character inside strings.
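The short sketch below shows that the two accepted spellings describe the same location. The path is a shortened, hypothetical one used only for illustration; file.path() is a base R helper that joins path components with “/”.

```r
# The same hypothetical Windows path written three ways
p_escaped <- "C:\\Users\\venk\\data\\Sales_sample.csv"   # escaped backslashes
p_forward <- "C:/Users/venk/data/Sales_sample.csv"       # forward slashes
p_built   <- file.path("C:", "Users", "venk", "data", "Sales_sample.csv")

# Each "\\" in the source code is stored as ONE backslash in the string
cat(p_escaped, "\n")

# file.path() joins the parts with "/", so it matches the forward-slash form
same_as_built <- identical(p_built, p_forward)
print(same_as_built)

# Converting backslashes to forward slashes gives the same string again
same_after_swap <- identical(gsub("\\", "/", p_escaped, fixed = TRUE), p_forward)
print(same_after_swap)
```

Building paths with file.path() avoids typing separators by hand, which removes the escaping problem altogether.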

To see how the read.csv() function works, consider a sample program that reads sales data from Sales_sample.csv, as shown below. The function read.csv() imports the data from the CSV file (whose path is given as the argument) into R.

In the code below, we pass the local path of the data set to read.csv(), either replacing every ‘\’ with ‘\\’ or using ‘/’ instead.

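> # Option 1: escape every backslash as "\\"
> Sales <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Superstore Sales  Data\\Sales_sample.csv")
> Sales
> # A path with single backslashes fails, because R reads "\U" as the start of an escape sequence:
> # Sales <- read.csv("C:\Users\venk\Google Drive\Training\Datasets\Superstore Sales  Data\Sales_sample.csv")
> # Option 2: use forward slashes
> Sales1 <- read.csv("C:/Users/venk/Google Drive/Training/Datasets/Superstore Sales  Data/Sales_sample.csv")
> Sales1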

 ##    custId           custName                  custCountry productSold
 ## 1   23262       Candice Levy                        Congo     SUPA101
 ## 2   23263       Xerxes Smith                       Panama     DETA200
 ## 3   23264       Levi Douglas Tanzania, United Republic of     DETA800
 ## 4   23265       Uriel Benton                 South Africa     SUPA104
 ## 5   23266       Celeste Pugh                        Gabon     PURA200
 ## 6   23267       Vance Campos         Syrian Arab Republic     PURA100
 ## 7   23268       Latifah Wall                   Guadeloupe     DETA100
 ## 8   23269     Jane Hernandez                    Macedonia     PURA100
 ## 9   23270        Wanda Garza                   Kyrgyzstan     SUPA103
 ## 10  23271 Athena Fitzpatrick                      Reunion     SUPA103
 ## 11  23272      Anjolie Hicks     Turks and Caicos Islands     DETA200
 ##    salesChannel unitsSold  dateSold
 ## 1        Retail       117  8/9/2012
 ## 2        Online        73  7/6/2012
 ## 3        Online       205 8/18/2012
 ## 4        Online        14  8/5/2012
 ## 5        Retail       170 8/11/2012
 ## 6        Retail       129 7/11/2012
 ## 7        Retail        82 7/12/2012
 ## 8        Retail       116  6/3/2012
 ## 9        Online        67  6/7/2012
 ## 10       Retail       125 7/27/2012
 ## 11       Retail        71 7/31/2012
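To try read.csv() without access to the original file, the sketch below writes a few rows copied from the output above to a temporary CSV file and reads them back. The temporary file, the reduced set of columns, and the object name sales_small are choices made for this example only. It also shows that dateSold is imported as text and can be converted with as.Date().

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "custId,custName,salesChannel,unitsSold,dateSold",
  "23262,Candice Levy,Retail,117,8/9/2012",
  "23263,Xerxes Smith,Online,73,7/6/2012",
  "23264,Levi Douglas,Online,205,8/18/2012"
), csv_path)

# Import it; stringsAsFactors = FALSE keeps text columns as character
sales_small <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types and the first rows
str(sales_small)
print(head(sales_small, 2))

# dateSold is read as text; convert it to Date (month/day/year)
sales_small$dateSold <- as.Date(sales_small$dateSold, format = "%m/%d/%Y")
print(class(sales_small$dateSold))
```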

Importing from SAS files

We discussed installing packages earlier. To import data from SAS files, we need to install the package “sas7bdat”.

SAS (originally short for Statistical Analysis System) is a specialized software suite for data analytics. R is compatible with the SAS and the JMV format. An external library must be installed before data in SAS format can be imported.

To install the package, use the command install.packages("sas7bdat"); to use it, load it with library(sas7bdat). Installing a package does not load it, so library() must be called in each new R session. Once the library is loaded, we can import the data by calling the function read.sas7bdat(). The path is given to this function in the same way as for the CSV file.

A sample program to import the SAS data into R is shown below:

> library(sas7bdat)
> gnpdata <- read.sas7bdat("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\SAS datasets\\gnp.sas7bdat")
> View(gnpdata)
 ## Warning: package 'sas7bdat' was built under R version 3.2.3
 ## 1    0 516.1   325.5   88.7     4.3  97.6
 ## 2   91 514.5   331.6   78.1     5.1  99.6
 ## 3  182 517.7   331.7   77.4     6.5 102.1
 ## 4  274 513.0   333.8   68.5     7.7 103.0
 ## 5  366 517.4   334.4   69.5     8.3 105.3
 ## 6  456 527.9   339.1   74.7     7.0 107.1
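The install-then-load sequence can be wrapped in a small guard so that a script does not stop when the package or the file is missing. The helper read_sas_safely() below is hypothetical (it is not part of sas7bdat), and the file name passed to it is a placeholder; only requireNamespace(), file.exists(), and read.sas7bdat() are real functions.

```r
# Hypothetical helper (not part of sas7bdat): returns a data frame, or NULL
# when the package or the file is not available
read_sas_safely <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  if (!requireNamespace("sas7bdat", quietly = TRUE)) {
    message("Package 'sas7bdat' is not installed; run install.packages(\"sas7bdat\")")
    return(NULL)
  }
  sas7bdat::read.sas7bdat(path)
}

# Placeholder path; replace it with the location of a real .sas7bdat file
gnp <- read_sas_safely(file.path(tempdir(), "gnp.sas7bdat"))
if (!is.null(gnp)) {
  str(gnp)          # column names and types
  print(head(gnp))  # first six rows
}
```

Calling sas7bdat::read.sas7bdat() with the package prefix works without a prior library() call, which keeps the helper self-contained.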

Importing Data from Excel files

Excel is one of the most widely used tools for reporting and ad-hoc data analysis. Excel files mostly contain aggregated data, and sometimes raw data as well. The function read.xlsx(), from the xlsx package, imports a worksheet from an Excel file into R and stores it in a data frame. Here is an example of how to read data from Excel into R.

> library(xlsx)
> wb_data <- read.xlsx("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\World Bank Data\\World Bank Indicators.xlsx" , sheetName="Data by country")

There is another way to import Excel data into R. The function loadWorkbook() from the XLConnect package loads the complete workbook, and readWorksheet() then imports an individual worksheet from it. Java must be installed before XLConnect can be used.

> library(XLConnect)
> wb_data <- readWorksheet(loadWorkbook("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\World Bank Data\\World Bank Indicators.xlsx" ),sheet=1)

We may run into Java-related issues while importing Excel files. An Excel file is not a flat file like a CSV: apart from the data, it contains a lot of other information, such as indexes. If errors persist even after installing the necessary packages, save the Excel data in CSV format first and then import it with read.csv().
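The CSV fallback described above can be sketched as a small wrapper that tries the Excel import first and reads a CSV export if that fails. The helper read_excel_or_csv(), the file names, and the sample values are all hypothetical; the sketch assumes the CSV copy was saved from Excel beforehand.

```r
# Hypothetical helper: try the xlsx package first, fall back to a CSV export
read_excel_or_csv <- function(xlsx_path, csv_path, sheet = 1) {
  result <- NULL
  if (file.exists(xlsx_path) && requireNamespace("xlsx", quietly = TRUE)) {
    result <- tryCatch(
      xlsx::read.xlsx(xlsx_path, sheetIndex = sheet),
      error = function(e) {
        message("Excel import failed: ", conditionMessage(e))
        NULL
      }
    )
  }
  if (is.null(result)) {
    # Fallback: read the CSV copy of the same sheet
    result <- read.csv(csv_path, stringsAsFactors = FALSE)
  }
  result
}

# Usage with a made-up CSV export; the workbook name is a placeholder
csv_copy <- tempfile(fileext = ".csv")
write.csv(data.frame(country = c("A", "B"), value = c(1.5, 2.5)),
          csv_copy, row.names = FALSE)
wb_small <- read_excel_or_csv(file.path(tempdir(), "no_such_workbook.xlsx"), csv_copy)
print(wb_small)
```

Checking file.exists() before requireNamespace() means Java is only loaded when there is actually a workbook to read.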

In the next post you will be learning about Database server connections.
