103.2.3 Sub Setting

In the previous section, we discussed Database Server Connections.

Subsetting of data based on rows

Subsetting lets you access a specific part of a complete data set. It is helpful when you want to explore only certain rows or columns. In R, you select part of a data set by specifying the rows and columns you want, as shown below.

Sub-setting with selected number of rows.

 >gdp<-read.csv("C:\\Users\\user\\Desktop\\World Bank Data\\GDP.csv")
 >gdp1<-gdp[1:10, ]
 >gdp1

The gdp data is a World Bank dataset that lists countries and their GDPs, ordered by Rank. The code above stores the first 10 rows in gdp1.

##    Country_code Rank            Country      GDP
## 1           USA    1      United States 17419000
## 2           CHN    2              China 10354832
## 3           JPN    3              Japan  4601461
## 4           DEU    4            Germany  3868291
## 5           GBR    5     United Kingdom  2988893
## 6           FRA    6             France  2829192
## 7           BRA    7             Brazil  2346076
## 8           ITA    8              Italy  2141161
## 9           IND    9              India  2048517
## 10          RUS   10 Russian Federation  1860598

To select a particular part of the dataset, change the indices inside the square brackets, which follow the form [rows, columns]. In the code above, 1:10 (which means 1 to 10) selects the first 10 rows, and the empty column index selects all the columns. Hence, printing gdp1 displays the first 10 rows with every column.
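The [rows, columns] form can be tried without the CSV file. The sketch below is only an illustration: it rebuilds the first six rows of the printed output as a small stand-in data frame (the real gdp object comes from read.csv), and the name first3 is made up for this example.

```r
# Stand-in for the GDP data, copied from the first six rows printed above.
gdp <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR", "FRA"),
  Rank         = 1:6,
  Country      = c("United States", "China", "Japan",
                   "Germany", "United Kingdom", "France"),
  GDP          = c(17419000, 10354832, 4601461, 3868291, 2988893, 2829192),
  stringsAsFactors = FALSE
)

# Rows 1 to 3; the empty column index keeps every column.
first3 <- gdp[1:3, ]

print(dim(first3))  # number of rows, then number of columns
print(first3)
```

Leaving the position before the comma empty works the same way for rows: gdp[, 1:2] keeps every row.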

Another way to select particular rows is:

 >gdp2<-gdp[c(1,4,10,12), ]
 >gdp2

##    Country_code Rank            Country      GDP
## 1           USA    1      United States 17419000
## 4           DEU    4            Germany  3868291
## 10          RUS   10 Russian Federation  1860598
## 12          AUS   12          Australia  1454675

To select rows that are non-contiguous or not in a sequence, use c(row number 1, row number 2, ….). In the code above, gdp[c(1,4,10,12), ] selects rows 1, 4, 10, and 12 and stores them in gdp2.
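As a small sketch of the same idea, the example below uses a stand-in data frame built from the first six rows of the printed output (an assumption, since the CSV file is not available here). The names picked and reordered are illustrative. It also shows that the rows come back in the order requested and keep their original row labels.

```r
# Stand-in for the GDP data, copied from the first six rows printed above.
gdp <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR", "FRA"),
  Rank         = 1:6,
  Country      = c("United States", "China", "Japan",
                   "Germany", "United Kingdom", "France"),
  GDP          = c(17419000, 10354832, 4601461, 3868291, 2988893, 2829192),
  stringsAsFactors = FALSE
)

# Non-contiguous rows: the original row labels are preserved.
picked <- gdp[c(1, 4, 6), ]
print(rownames(picked))

# The order inside c() controls the order of the result.
reordered <- gdp[c(4, 1), ]
print(reordered$Country_code)
```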

Subsetting with selected number of columns

Just as a data analyst can select particular rows from a large data set, columns can be selected in the same way. To do so, change the second index in the syntax [rows, columns]. Sample code is shown below:

 >gdp3<- gdp[, 2:4 ]
 >head(gdp3)

##   Rank        Country      GDP
## 1    1  United States 17419000
## 2    2          China 10354832
## 3    3          Japan  4601461
## 4    4        Germany  3868291
## 5    5 United Kingdom  2988893
## 6    6         France  2829192

The code above stores columns 2 to 4 in gdp3, and head() prints only its first 6 rows. To print all the rows, drop the head() call and print the variable directly. Columns that are not adjacent can be selected in a similar way, as shown below:

 >gdp4<- gdp[, c(1,2,4)]
 >head(gdp4)

##   Country_code Rank      GDP
## 1          USA    1 17419000
## 2          CHN    2 10354832
## 3          JPN    3  4601461
## 4          DEU    4  3868291
## 5          GBR    5  2988893
## 6          FRA    6  2829192

Just as non-sequential rows are selected, non-sequential columns can be selected with c(column number 1, column number 2, column number 3, ….). The code above shows the data for the 1st, 2nd, and 4th columns.
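One detail worth knowing when selecting columns by position: picking a single column returns a plain vector by default rather than a data frame. The sketch below illustrates this on a stand-in data frame built from the first six rows of the printed output (an assumption). The names gdp_vec and gdp_df are made up for the example.

```r
# Stand-in for the GDP data, copied from the first six rows printed above.
gdp <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR", "FRA"),
  Rank         = 1:6,
  Country      = c("United States", "China", "Japan",
                   "Germany", "United Kingdom", "France"),
  GDP          = c(17419000, 10354832, 4601461, 3868291, 2988893, 2829192),
  stringsAsFactors = FALSE
)

# A single column is simplified to a vector by default.
gdp_vec <- gdp[, 4]
print(class(gdp_vec))

# drop = FALSE keeps the result as a one-column data frame.
gdp_df <- gdp[, 4, drop = FALSE]
print(class(gdp_df))
```

Selecting two or more columns, as in gdp[, c(1,2,4)], always returns a data frame, so drop = FALSE only matters for the single-column case.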

One can even access particular columns by using the names of the columns as shown below:

 >gdp5<- gdp[, c("Country", "Rank")]
 >head(gdp5)

##          Country Rank
## 1  United States    1
## 2          China    2
## 3          Japan    3
## 4        Germany    4
## 5 United Kingdom    5
## 6         France    6
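Selecting by name and selecting by position give the same result when the names map to those positions. The sketch below checks this on a stand-in data frame built from the first six rows of the printed output (an assumption). The variables wanted, positions, by_name, and by_position are illustrative names.

```r
# Stand-in for the GDP data, copied from the first six rows printed above.
gdp <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR", "FRA"),
  Rank         = 1:6,
  Country      = c("United States", "China", "Japan",
                   "Germany", "United Kingdom", "France"),
  GDP          = c(17419000, 10354832, 4601461, 3868291, 2988893, 2829192),
  stringsAsFactors = FALSE
)

wanted  <- c("Country", "Rank")
by_name <- gdp[, wanted]

# Look up where each requested name sits among the column names.
positions   <- match(wanted, names(gdp))
by_position <- gdp[, positions]

print(positions)
print(all.equal(by_name, by_position))
```

Names tend to be the safer choice when the column order of the source file might change, because positions would then point at different columns.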

Subsetting with selected number of rows and columns

The same syntax can also select rows and columns together. Sample code is shown below:

 >gdp6<-gdp[5:20,  c(1,2,4)]
 >gdp6

##    Country_code Rank     GDP
## 5           GBR    5 2988893
## 6           FRA    6 2829192
## 7           BRA    7 2346076
## 8           ITA    8 2141161
## 9           IND    9 2048517
## 10          RUS   10 1860598
## 11          CAN   11 1785387
## 12          AUS   12 1454675
## 13          KOR   13 1410383
## 14          ESP   14 1381342
## 15          MEX   15 1294690
## 16          IDN   16  888538
## 17          NLD   17  879319
## 18          TUR   18  798429
## 19          SAU   19  746249
## 20          CHE   20  701037

The code above accesses rows 5 to 20 and columns 1, 2, and 4.
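Row positions and column names can also be combined in one call. The following is a minimal sketch on a stand-in data frame built from the first six rows of the printed output (an assumption). The name subset_rc is hypothetical.

```r
# Stand-in for the GDP data, copied from the first six rows printed above.
gdp <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR", "FRA"),
  Rank         = 1:6,
  Country      = c("United States", "China", "Japan",
                   "Germany", "United Kingdom", "France"),
  GDP          = c(17419000, 10354832, 4601461, 3868291, 2988893, 2829192),
  stringsAsFactors = FALSE
)

# Rows by position, columns by name, in a single subsetting call.
subset_rc <- gdp[2:4, c("Country_code", "GDP")]

print(dim(subset_rc))
print(subset_rc)
```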

Another way to select rows and columns is to state what to leave out.

In the code below, the negative indices c(-3, -4) are used. A negative index excludes the row or column at that position and keeps everything else.

 >gdp7<-gdp[1:5,  c(-3,-4)]
 >gdp7

##   Country_code Rank
## 1          USA    1
## 2          CHN    2
## 3          JPN    3
## 4          DEU    4
## 5          GBR    5

From the output, we can see that the 3rd and 4th columns (Country and GDP) are omitted.
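To make the effect of negative indices concrete, the sketch below compares dropping columns 3 and 4 with keeping columns 1 and 2, and applies the same idea to rows. It runs on a stand-in data frame built from the first six rows of the printed output (an assumption). The names dropped, kept, and without_first_two are illustrative.

```r
# Stand-in for the GDP data, copied from the first six rows printed above.
gdp <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR", "FRA"),
  Rank         = 1:6,
  Country      = c("United States", "China", "Japan",
                   "Germany", "United Kingdom", "France"),
  GDP          = c(17419000, 10354832, 4601461, 3868291, 2988893, 2829192),
  stringsAsFactors = FALSE
)

# Dropping columns 3 and 4 leaves the same data as keeping columns 1 and 2.
dropped <- gdp[1:5, c(-3, -4)]
kept    <- gdp[1:5, c(1, 2)]
print(all.equal(dropped, kept))

# Negative indices work for rows too: drop the first two rows.
without_first_two <- gdp[-(1:2), ]
print(rownames(without_first_two))
```

Positive and negative positions cannot be mixed inside the same index. For example, gdp[, c(1, -3)] raises an error.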

In the next post we will see Sub Setting Example 1.

20th June 2017

Statinfer
