In the previous section, we discussed database server connections.
Subsetting of data based on rows
Subsetting allows you to access a particular part of a complete data set. If a user wants to explore the data in a particular column or row, subsetting can be very helpful. In R, users can access a particular part of the data set by selecting specific rows and columns, as shown below.
Subsetting with a selected number of rows
>gdp<-read.csv("C:\\Users\\user\\Desktop\\World Bank Data\\GDP.csv")
>gdp1<-gdp[1:10, ]
>gdp1
The gdp data is a World Bank dataset. It lists many countries and their GDPs, arranged by rank. The code above stores the first 10 rows in gdp1, which prints as follows:
##    Country_code Rank            Country      GDP
## 1           USA    1      United States 17419000
## 2           CHN    2              China 10354832
## 3           JPN    3              Japan  4601461
## 4           DEU    4            Germany  3868291
## 5           GBR    5     United Kingdom  2988893
## 6           FRA    6             France  2829192
## 7           BRA    7             Brazil  2346076
## 8           ITA    8              Italy  2141161
## 9           IND    9              India  2048517
## 10          RUS   10 Russian Federation  1860598
To select a particular part of the dataset, change the indices inside the square brackets, which follow the form [rows, columns]. In the code above, 1:10 (which means 1 to 10) selects the first 10 rows, and the column position is left empty, which selects all the columns. Hence, printing gdp1 displays the first 10 rows of all the columns.
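As a minimal sketch of the row-range indexing described above, the example below uses a small stand-in data frame built from the first five rows of the printed output. The name gdp_small is an assumption made for illustration, so that the code runs without the CSV file.

```r
# Hypothetical stand-in for the gdp data, built from the rows printed above
gdp_small <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR"),
  Rank = 1:5,
  Country = c("United States", "China", "Japan", "Germany", "United Kingdom"),
  GDP = c(17419000, 10354832, 4601461, 3868291, 2988893),
  stringsAsFactors = FALSE
)

# Rows 1 to 3; the column index is left empty, so all columns are kept
first_three <- gdp_small[1:3, ]
print(first_three)
```

Leaving the position after the comma empty is what keeps every column; the comma itself must still be written.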
Another way to select particular rows is:
>gdp2<-gdp[c(1,4,10,12), ]
>gdp2
##    Country_code Rank            Country      GDP
## 1           USA    1      United States 17419000
## 4           DEU    4            Germany  3868291
## 10          RUS   10 Russian Federation  1860598
## 12          AUS   12          Australia  1454675
To select rows that are non-contiguous or not in sequence, use c(row number 1, row number 2, ...) as the row index. In the code above, gdp[c(1,4,10,12), ] selects rows 1, 4, 10, and 12 and stores them in gdp2.
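The sketch below illustrates selecting non-contiguous rows with c(). It again uses a hypothetical stand-in data frame (gdp_small, built from rows printed earlier) rather than the real CSV file. Note that the selected rows keep their original row labels, which is why the output above is labelled 1, 4, 10, and 12.

```r
# Hypothetical stand-in for the gdp data, built from rows printed earlier
gdp_small <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR"),
  Rank = 1:5,
  Country = c("United States", "China", "Japan", "Germany", "United Kingdom"),
  GDP = c(17419000, 10354832, 4601461, 3868291, 2988893),
  stringsAsFactors = FALSE
)

# Pick rows 1, 4, and 5; rows 2 and 3 are skipped
picked <- gdp_small[c(1, 4, 5), ]
print(picked)

# The row labels of the result are the original row numbers
print(rownames(picked))
```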
Subsetting with a selected number of columns
Just as a data analyst can select particular rows from a large dataset, columns can be selected in the same way. To do so, change the second index in the syntax [rows, columns]. Sample code is shown below:
>gdp3<- gdp[, 2:4 ]
>head(gdp3)
##   Rank        Country      GDP
## 1    1  United States 17419000
## 2    2          China 10354832
## 3    3          Japan  4601461
## 4    4        Germany  3868291
## 5    5 United Kingdom  2988893
## 6    6         France  2829192
The code above stores columns 2 to 4 in gdp3, and head() prints only the first 6 rows. To print all the rows, remove the head() function and print the variable itself. Another way of selecting columns is shown below:
>gdp4<- gdp[, c(1,2,4)]
>head(gdp4)
##   Country_code Rank      GDP
## 1          USA    1 17419000
## 2          CHN    2 10354832
## 3          JPN    3  4601461
## 4          DEU    4  3868291
## 5          GBR    5  2988893
## 6          FRA    6  2829192
Just as non-sequential rows are selected, non-sequential columns can be selected by using the syntax c(column number 1, column number 2, column number 3, ...). The code above shows the data for the 1st, 2nd, and 4th columns.
One can even access particular columns by using the names of the columns as shown below:
>gdp5<- gdp[, c("Country", "Rank")]
>head(gdp5)
##          Country Rank
## 1  United States    1
## 2          China    2
## 3          Japan    3
## 4        Germany    4
## 5 United Kingdom    5
## 6         France    6
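The following sketch compares column selection by position and by name on a hypothetical stand-in data frame (gdp_small, an assumed name, built from rows printed earlier). It also shows one detail worth knowing: selecting a single column with this syntax returns a plain vector unless drop = FALSE is added.

```r
# Hypothetical stand-in for the gdp data, built from rows printed earlier
gdp_small <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR"),
  Rank = 1:5,
  Country = c("United States", "China", "Japan", "Germany", "United Kingdom"),
  GDP = c(17419000, 10354832, 4601461, 3868291, 2988893),
  stringsAsFactors = FALSE
)

# Columns 3 and 2 by position, then the same columns by name
by_position <- gdp_small[, c(3, 2)]
by_name <- gdp_small[, c("Country", "Rank")]
print(names(by_position))
print(names(by_name))

# A single column comes back as a vector by default;
# drop = FALSE keeps the result as a one-column data frame
rank_vector <- gdp_small[, "Rank"]
rank_frame <- gdp_small[, "Rank", drop = FALSE]
print(is.data.frame(rank_vector))
print(is.data.frame(rank_frame))
```

Selecting by name tends to be more robust than selecting by position, since the code keeps working if the column order in the file changes.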
Subsetting with a selected number of rows and columns
The same syntax can also be used to select rows and columns together. Sample code is shown below:
>gdp6<-gdp[5:20, c(1,2,4)]
>gdp6
##    Country_code Rank     GDP
## 5           GBR    5 2988893
## 6           FRA    6 2829192
## 7           BRA    7 2346076
## 8           ITA    8 2141161
## 9           IND    9 2048517
## 10          RUS   10 1860598
## 11          CAN   11 1785387
## 12          AUS   12 1454675
## 13          KOR   13 1410383
## 14          ESP   14 1381342
## 15          MEX   15 1294690
## 16          IDN   16  888538
## 17          NLD   17  879319
## 18          TUR   18  798429
## 19          SAU   19  746249
## 20          CHE   20  701037
The code above accesses rows 5 to 20 and columns 1, 2, and 4.
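A short sketch of combined row and column selection follows, using a hypothetical stand-in data frame (gdp_small, an assumed name) so that it runs without the CSV file. Calling dim() on the result is a quick way to confirm that the expected number of rows and columns came back.

```r
# Hypothetical stand-in for the gdp data, built from rows printed earlier
gdp_small <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR"),
  Rank = 1:5,
  Country = c("United States", "China", "Japan", "Germany", "United Kingdom"),
  GDP = c(17419000, 10354832, 4601461, 3868291, 2988893),
  stringsAsFactors = FALSE
)

# Rows 2 to 4 and columns 1, 2, and 4 in a single step
block <- gdp_small[2:4, c(1, 2, 4)]
print(block)

# dim() returns the number of rows, then the number of columns
print(dim(block))
```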
Another way to select rows and columns is to state which ones to leave out.
In the code below, the negative indices c(-3, -4) are used. A negative index excludes the rows or columns it names, so everything except those rows or columns is returned.
>gdp7<-gdp[1:5, c(-3,-4)]
>gdp7
##   Country_code Rank
## 1          USA    1
## 2          CHN    2
## 3          JPN    3
## 4          DEU    4
## 5          GBR    5
From the output, we can see that the 3rd and 4th columns (Country and GDP) are omitted.
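The sketch below illustrates negative indexing on a hypothetical stand-in data frame (gdp_small, an assumed name, built from rows printed earlier). It shows that c(-3, -4) and -c(3, 4) are equivalent ways to drop columns 3 and 4, and that the same idea applies to the row index.

```r
# Hypothetical stand-in for the gdp data, built from rows printed earlier
gdp_small <- data.frame(
  Country_code = c("USA", "CHN", "JPN", "DEU", "GBR"),
  Rank = 1:5,
  Country = c("United States", "China", "Japan", "Germany", "United Kingdom"),
  GDP = c(17419000, 10354832, 4601461, 3868291, 2988893),
  stringsAsFactors = FALSE
)

# Keep rows 1 to 3 and drop columns 3 and 4
without_cols <- gdp_small[1:3, c(-3, -4)]
print(without_cols)

# -c(3, 4) is the same index written more compactly
same_result <- gdp_small[1:3, -c(3, 4)]
print(names(same_result))

# A negative row index drops rows instead: here, everything except row 1
without_first_row <- gdp_small[-1, ]
print(nrow(without_first_row))
```

Positive and negative numbers cannot be mixed within the same index; R raises an error for something like c(1, -3).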
In the next post, we will see Sub Setting Example 1.