• No products in the cart.

104.2.2 Practice: Working with datasets in Python

Getting metadata of a dataset in python.

Link to the previous post : https://statinfer.com/104-2-1-importing-data-in-python/

In this post we will cover basic tasks we can perform on a dataset after importing it into python.

We will complete following tasks:

Printing the data and meta info

  • Import “Superstore Sales Data\Sales_by_country_v1.csv” data.
  • Perform the basic checks on the data.
  • How many rows and columns are there in this dataset?
  • Print only column names in the dataset.
  • Print first 10 observations.
  • Print the last 5 observations.
  • Get the summary of the dataset.
  • Print the structure of the data.
  • Describe the field unitsSold, custCountry.
  • Create a new dataset by taking first 30 observations from this data.
  • Print the resultant data.
  • Remove(delete) the new dataset.
In [4]:
import pandas as pd     # importing library pandas

Sales_country =pd.read_csv("datasets\\Superstore Sales Data\\Sales_by_country_v1.csv")

print(Sales)
    custId            custName                   custCountry productSold  \
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   

   salesChannel  unitsSold   dateSold  
0        Retail        117   8/9/2012  
1        Online         73   7/6/2012  
2        Online        205  8/18/2012  
3        Online         14   8/5/2012  
4        Retail        170  8/11/2012  
5        Retail        129  7/11/2012  
6        Retail         82  7/12/2012  
7        Retail        116   6/3/2012  
8        Online         67   6/7/2012  
9        Retail        125  7/27/2012  
10       Retail         71  7/31/2012  
In [10]:
 #How many rows and columns are there in this dataset?

Sales_country.shape 
Out[10]:
(998, 7)
In [11]:
#Print only column names in the dataset
Sales_country.columns.values
Out[11]:
array(['custId', 'custName', 'custCountry', 'productSold', 'salesChannel',
       'unitsSold', 'dateSold'], dtype=object)
In [12]:
#Print first 10 observations
Sales_country.head(10)
Out[12]:
custId custName custCountry productSold salesChannel unitsSold dateSold
0 23262 Candice Levy Congo SUPA101 Retail 117 2012-08-09
1 23263 Xerxes Smith Panama DETA200 Online 73 2012-07-06
2 23264 Levi Douglas Tanzania, United Republic of DETA800 Online 205 2012-08-18
3 23265 Uriel Benton South Africa SUPA104 Online 14 2012-08-05
4 23266 Celeste Pugh Gabon PURA200 Retail 170 2012-08-11
5 23267 Vance Campos Syrian Arab Republic PURA100 Retail 129 2012-07-11
6 23268 Latifah Wall Guadeloupe DETA100 Retail 82 2012-07-12
7 23269 Jane Hernandez Macedonia PURA100 Retail 116 2012-06-03
8 23270 Wanda Garza Kyrgyzstan SUPA103 Online 67 2012-06-07
9 23271 Athena Fitzpatrick Reunion SUPA103 Retail 125 2012-07-27
In [14]:
#Print the last 5 observations
Sales_country.tail(5)
Out[14]:
custId custName custCountry productSold salesChannel unitsSold dateSold
993 24255 Ethan Day Tajikistan DETA100 Online 189 2011-01-09
994 24256 Quail Knox Tonga PURA500 Retail 43 2011-05-08
995 24257 Noelle Sargent Ireland DETA800 Direct 17 2011-02-04
996 24258 Kuame Wallace Montserrat SUPA103 Online 80 2011-01-13
997 24259 Lester Fisher Cocos (Keeling) Islands PURA500 Direct 138 2011-08-10
In [15]:
#Get the summary of the dataset

Sales_country.describe()
Out[15]:
custId unitsSold
count 998.000000 998.000000
mean 23760.500000 108.256513
std 288.242086 62.167957
min 23262.000000 1.000000
25% 23511.250000 52.250000
50% 23760.500000 111.000000
75% 24009.750000 163.000000
max 24259.000000 212.000000
In [17]:
#Print the structure of the data
Sales_country.apply(lambda x: [x.unique()])  # this is close str() in R.
Out[17]:
custId          [[23262, 23263, 23264, 23265, 23266, 23267, 23...
custName        [[Candice Levy, Xerxes Smith, Levi Douglas, Ur...
custCountry     [[Congo, Panama, Tanzania, United Republic of,...
productSold     [[SUPA101, DETA200, DETA800, SUPA104, PURA200,...
salesChannel                           [[Retail, Online, Direct]]
unitsSold       [[117, 73, 205, 14, 170, 129, 82, 116, 67, 125...
dateSold        [[2012-08-09, 2012-07-06, 2012-08-18, 2012-08-...
dtype: object
In [19]:
#Describe the field unitsSold

Sales_country.unitsSold.describe()
Out[19]:
count    998.000000
mean     108.256513
std       62.167957
min        1.000000
25%       52.250000
50%      111.000000
75%      163.000000
max      212.000000
Name: unitsSold, dtype: float64
In [20]:
#Describe the field custCountry
Sales_country.custCountry.describe()       #describe wont give much info about string variable, so we will create frequency table
Out[20]:
count        998
unique       233
top       Turkey
freq          10
Name: custCountry, dtype: object
In [21]:
Sales_country.custCountry.value_counts()   #frequency table
Out[21]:
Turkey                               10
Swaziland                            10
Denmark                              10
Azerbaijan                            9
Bouvet Island                         9
Puerto Rico                           9
Nauru                                 9
Panama                                9
Sri Lanka                             8
Anguilla                              8
Netherlands Antilles                  8
Guinea                                8
United Kingdom                        8
Paraguay                              8
Indonesia                             8
Botswana                              8
Chad                                  8
Bosnia and Herzegovina                8
Faroe Islands                         8
Pakistan                              8
Macedonia                             7
Andorra                               7
Mauritius                             7
Mauritania                            7
Czech Republic                        7
Syrian Arab Republic                  7
Kuwait                                7
Malaysia                              7
Hungary                               7
Qatar                                 7
                                     ..
Seychelles                            2
Latvia                                2
French Southern Territories           2
Senegal                               2
Libyan Arab Jamahiriya                2
Mayotte                               2
Dominican Republic                    2
Sudan                                 2
Romania                               2
Slovakia                              2
Slovenia                              2
Russian Federation                    1
Ghana                                 1
France                                1
Lithuania                             1
Mexico                                1
Jamaica                               1
Cuba                                  1
Thailand                              1
Nicaragua                             1
Greenland                             1
Heard Island and Mcdonald Islands     1
Bulgaria                              1
Wallis and Futuna                     1
Turkmenistan                          1
Liechtenstein                         1
Kenya                                 1
Monaco                                1
Cayman Islands                        1
India                                 1
Name: custCountry, dtype: int64
In [23]:
#Create a new dataset by taking first 30 observations from this data

sales_new=Sales_country.head(30)
In [24]:
#Print the resultant data
print(sales_new)
    custId            custName                   custCountry productSold  \
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   
11   23273        Isaac Cooper          Netherlands Antilles     SUPA104   
12   23274         Asher Weber                     Macedonia     PURA100   
13   23275       Ethan Gregory                        Tuvalu     DETA800   
14   23276       Hayes Rollins                         Nepal     PURA500   
15   23277      MacKenzie Moss                          Oman     SUPA101   
16   23278   Aphrodite Brennan                        Malawi     SUPA105   
17   23279         Angela Wise                       Moldova     PURA100   
18   23280       James Spencer                  Burkina Faso     SUPA103   
19   23281       Adria Kaufman                 Bouvet Island     SUPA102   
20   23282      Amir Alexander                       Liberia     DETA100   
21   23283          Lani Sweet                       Vanuatu     SUPA105   
22   23284        Clark Weaver                         Palau     PURA250   
23   23285    Leonard Cardenas                    Madagascar     SUPA102   
24   23286       Renee Padilla                         Yemen     DETA800   
25   23287         Joy Vazquez                         Korea     PURA250   
26   23288         Ingrid Bush                    Montserrat     SUPA104   
27   23289        Deacon Craig                      Mongolia     SUPA105   
28   23290        Rama Goodwin                       Tunisia     DETA800   
29   23291     Jelani Odonnell                       Albania     DETA800   

   salesChannel  unitsSold    dateSold  
0        Retail        117  2012-08-09  
1        Online         73  2012-07-06  
2        Online        205  2012-08-18  
3        Online         14  2012-08-05  
4        Retail        170  2012-08-11  
5        Retail        129  2012-07-11  
6        Retail         82  2012-07-12  
7        Retail        116  2012-06-03  
8        Online         67  2012-06-07  
9        Retail        125  2012-07-27  
10       Retail         71  2012-07-31  
11       Retail         22  2012-08-13  
12       Direct        153  2012-08-22  
13       Retail        141  2012-07-04  
14       Direct         65  2012-08-01  
15       Online        157  2012-07-12  
16       Online        197  2012-08-24  
17       Direct         10  2012-06-21  
18       Direct         30  2012-06-03  
19       Online        134  2012-07-13  
20       Online        100  2012-08-21  
21       Online        142  2012-06-24  
22       Online        135  2012-06-17  
23       Online          9  2012-07-24  
24       Retail         69  2012-08-08  
25       Retail        189  2012-06-17  
26       Direct        141  2012-06-14  
27       Retail        166  2012-08-02  
28       Online        170  2012-08-11  
29       Retail        199  2012-08-18  
In [25]:
#Remove(delete) the new dataset

del(sales_new)

The next post is about manipulating datasets in python.
Link to the next post : https://statinfer.com/104-2-3-manipulting-datasets-in-python/

Statinfer

Statinfer derived from Statistical inference. We provide training in various Data Analytics and Data Science courses and assist candidates in securing placements.

Contact Us

info@statinfer.com

+91- 9676098897

+91- 9494762485

 

Our Social Links

top
© 2020. All Rights Reserved.