This is the 3rd session of the R programming series. The 1st session covered an introduction to R: how to write code in R, how to run it, and how to get the output. The 2nd session was all about data handling: importing data from various sources into R, merging different data sets, creating a new variable, filtering the data, combining different R data sets to create a resultant data set, exporting the data, etc. Session 3 is all about sampling, statistics, quartiles, percentiles, box plots, graphs, etc.
Sampling
Often we need only a part of the data, a sample or a subset, instead of the entire data set. For example, let us consider the sales data or purchase order data of the last 20 years. We might not be interested in all 20 years of data; we might need only the last 2 years for the analysis. How do we take a sample? We take the dataset and use the sample function to pick rows at random.
Syntax: sampleset<-dataset[sample(1:nrow(dataset),n),]
Here n is the number of rows to draw.
Let us consider the Online Retail data set.
>Retail_data<-read.csv("C:\\Amrita\\Datavedi\\Online Retail Sales Data\\Online Retail.csv")
>dim(Retail_data)
So there are 541909 rows and 8 columns. We don't want all 541909 rows; we want only a sample of 10000 rows from this Online Retail dataset.
>Sample_set<-Retail_data[sample(1:nrow(Retail_data),10000), ]
This is the syntax for sampling 10000 observations from the whole dataset. Sample_set is the new object to which we assign the new dataset of 10000 rows. Retail_data is the original data, from which we extract the 10000 sampled rows. Inside the square brackets we give 2 parameters: 1 for the rows and 1 for the columns. The row part is “sample(1:nrow(Retail_data),10000)”, which means that 10000 random observations are drawn from all the rows, i.e., from row 1 to row n. The column part is left blank, as we need all the columns.
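The row and column parts of this syntax can be tried out on a small scale. The sketch below uses R's built-in iris data set (150 rows, 5 columns) purely as a stand-in for Retail_data; the names row_ids and iris_sample and the seed value are illustrative assumptions, not part of the session material.

```r
# Stand-in data: the built-in iris data set (150 rows, 5 columns).
set.seed(42)                          # assumption: a fixed seed so the draw is repeatable

# Row part: 10 distinct row numbers picked at random from 1..150.
row_ids <- sample(1:nrow(iris), 10)

# Column part left blank after the comma, so all columns are kept.
iris_sample <- iris[row_ids, ]

dim(iris_sample)                      # 10 rows, 5 columns
```

Printing row_ids shows which rows were picked; they are not in sorted order, because sample returns them in the order drawn.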
Instead of “sample(1:nrow(Retail_data),10000)”, we can also write “sample(1:50000,10000)”, which considers only the first 50000 rows of the original dataset when sampling the 10000 rows. The range must contain at least as many rows as the sample size, because sample draws without replacement by default; a call such as “sample(1:5000,10000)” fails with an error. If the syntax seems confusing, we can simply write “sample(1:541909,10000)”, as we know there are 541909 rows in total. The 10000 rows which we get from sampling are taken at random from the dataset.
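One detail is worth checking when restricting the range: by default sample draws without replacement, so the range must hold at least as many values as the requested sample size. A minimal sketch, in which the variable names n_rows, too_many and ok are illustrative assumptions:

```r
n_rows <- 5000                        # hypothetical: a range of only 5000 rows

# Asking for 10000 values from a range of 5000 fails,
# because replace = FALSE is the default for sample().
too_many <- tryCatch(
  sample(1:n_rows, 10000),
  error = function(e) "error"         # return a marker instead of stopping the script
)
too_many                              # "error"

# A sample size no larger than the range works.
ok <- sample(1:n_rows, 1000)
length(ok)                            # 1000
```

Passing replace = TRUE would allow the larger draw, but the same row could then appear more than once in the sample.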
Sampling in R
Let us consider the census income data in the dataset folder.
>Income_data<-read.csv("C:\\Amrita\\Datavedi\\Census Income Data\\Income_data.csv")
The exercise is to take a sample of 5000 records from Income_data, which is a large data set.
>dim(Income_data) #32561 15
The Income_data consists of 32561 rows and 15 columns.
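>Income_sample<-Income_data[sample(1:nrow(Income_data),5000),]
The above command stores the 5000 records in the object Income_sample. The object is given a name other than sample so that it is not confused with the sample function. If we check its dimensions with dim(Income_sample), we get 5000 rows and 15 columns. This is how sampling is done.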
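Because the rows are picked at random, running the same command twice gives two different samples. If a repeatable sample is wanted, one option is to call set.seed before sampling. The sketch below uses a made-up data frame, demo_data, with the same number of rows as Income_data; the data frame, its columns and the seed value are all assumptions for illustration.

```r
# Hypothetical stand-in for Income_data: 32561 rows, 2 columns.
demo_data <- data.frame(id = 1:32561, value = rnorm(32561))

set.seed(123)                         # assumption: any fixed integer works as a seed
first_draw <- demo_data[sample(1:nrow(demo_data), 5000), ]

set.seed(123)                         # same seed again
second_draw <- demo_data[sample(1:nrow(demo_data), 5000), ]

dim(first_draw)                       # 5000 rows, 2 columns
identical(first_draw$id, second_draw$id)   # TRUE: same seed gives the same rows
```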
In the next section, we will study Descriptive Statistics.