Link to the previous post : https://statinfer.com/104-2-1-importing-data-in-python/

In this post we will cover basic tasks we can perform on a dataset after importing it into python.

We will complete following tasks:

Printing the data and meta info

Import “Superstore Sales Data\Sales_by_country_v1.csv” data.
Perform the basic checks on the data.
How many rows and columns are there in this dataset?
Print only column names in the dataset.
Print first 10 observations.
Print the last 5 observations.
Get the summary of the dataset.
Print the structure of the data.
Describe the field unitsSold, custCountry.
Create a new dataset by taking first 30 observations from this data.
Print the resultant data.
Remove(delete) the new dataset.

In [4]:

import pandas as pd     # importing library pandas

Sales_country =pd.read_csv("datasets\\Superstore Sales Data\\Sales_by_country_v1.csv")

print(Sales)

    custId            custName                   custCountry productSold  \
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   

   salesChannel  unitsSold   dateSold  
0        Retail        117   8/9/2012  
1        Online         73   7/6/2012  
2        Online        205  8/18/2012  
3        Online         14   8/5/2012  
4        Retail        170  8/11/2012  
5        Retail        129  7/11/2012  
6        Retail         82  7/12/2012  
7        Retail        116   6/3/2012  
8        Online         67   6/7/2012  
9        Retail        125  7/27/2012  
10       Retail         71  7/31/2012

In [10]:

 #How many rows and columns are there in this dataset?

Sales_country.shape

Out[10]:

(998, 7)

In [11]:

#Print only column names in the dataset
Sales_country.columns.values

Out[11]:

array(['custId', 'custName', 'custCountry', 'productSold', 'salesChannel',
       'unitsSold', 'dateSold'], dtype=object)

In [12]:

#Print first 10 observations
Sales_country.head(10)

Out[12]:

	custId	custName	custCountry	productSold	salesChannel	unitsSold	dateSold
0	23262	Candice Levy	Congo	SUPA101	Retail	117	2012-08-09
1	23263	Xerxes Smith	Panama	DETA200	Online	73	2012-07-06
2	23264	Levi Douglas	Tanzania, United Republic of	DETA800	Online	205	2012-08-18
3	23265	Uriel Benton	South Africa	SUPA104	Online	14	2012-08-05
4	23266	Celeste Pugh	Gabon	PURA200	Retail	170	2012-08-11
5	23267	Vance Campos	Syrian Arab Republic	PURA100	Retail	129	2012-07-11
6	23268	Latifah Wall	Guadeloupe	DETA100	Retail	82	2012-07-12
7	23269	Jane Hernandez	Macedonia	PURA100	Retail	116	2012-06-03
8	23270	Wanda Garza	Kyrgyzstan	SUPA103	Online	67	2012-06-07
9	23271	Athena Fitzpatrick	Reunion	SUPA103	Retail	125	2012-07-27

In [14]:

#Print the last 5 observations
Sales_country.tail(5)

Out[14]:

	custId	custName	custCountry	productSold	salesChannel	unitsSold	dateSold
993	24255	Ethan Day	Tajikistan	DETA100	Online	189	2011-01-09
994	24256	Quail Knox	Tonga	PURA500	Retail	43	2011-05-08
995	24257	Noelle Sargent	Ireland	DETA800	Direct	17	2011-02-04
996	24258	Kuame Wallace	Montserrat	SUPA103	Online	80	2011-01-13
997	24259	Lester Fisher	Cocos (Keeling) Islands	PURA500	Direct	138	2011-08-10

In [15]:

#Get the summary of the dataset

Sales_country.describe()

Out[15]:

	custId	unitsSold
count	998.000000	998.000000
mean	23760.500000	108.256513
std	288.242086	62.167957
min	23262.000000	1.000000
25%	23511.250000	52.250000
50%	23760.500000	111.000000
75%	24009.750000	163.000000
max	24259.000000	212.000000

In [17]:

#Print the structure of the data
Sales_country.apply(lambda x: [x.unique()])  # this is close str() in R.

Out[17]:

custId          [[23262, 23263, 23264, 23265, 23266, 23267, 23...
custName        [[Candice Levy, Xerxes Smith, Levi Douglas, Ur...
custCountry     [[Congo, Panama, Tanzania, United Republic of,...
productSold     [[SUPA101, DETA200, DETA800, SUPA104, PURA200,...
salesChannel                           [[Retail, Online, Direct]]
unitsSold       [[117, 73, 205, 14, 170, 129, 82, 116, 67, 125...
dateSold        [[2012-08-09, 2012-07-06, 2012-08-18, 2012-08-...
dtype: object

In [19]:

#Describe the field unitsSold

Sales_country.unitsSold.describe()

Out[19]:

count    998.000000
mean     108.256513
std       62.167957
min        1.000000
25%       52.250000
50%      111.000000
75%      163.000000
max      212.000000
Name: unitsSold, dtype: float64

In [20]:

#Describe the field custCountry
Sales_country.custCountry.describe()       #describe wont give much info about string variable, so we will create frequency table

Out[20]:

count        998
unique       233
top       Turkey
freq          10
Name: custCountry, dtype: object

In [21]:

Sales_country.custCountry.value_counts()   #frequency table

Out[21]:

Turkey                               10
Swaziland                            10
Denmark                              10
Azerbaijan                            9
Bouvet Island                         9
Puerto Rico                           9
Nauru                                 9
Panama                                9
Sri Lanka                             8
Anguilla                              8
Netherlands Antilles                  8
Guinea                                8
United Kingdom                        8
Paraguay                              8
Indonesia                             8
Botswana                              8
Chad                                  8
Bosnia and Herzegovina                8
Faroe Islands                         8
Pakistan                              8
Macedonia                             7
Andorra                               7
Mauritius                             7
Mauritania                            7
Czech Republic                        7
Syrian Arab Republic                  7
Kuwait                                7
Malaysia                              7
Hungary                               7
Qatar                                 7
                                     ..
Seychelles                            2
Latvia                                2
French Southern Territories           2
Senegal                               2
Libyan Arab Jamahiriya                2
Mayotte                               2
Dominican Republic                    2
Sudan                                 2
Romania                               2
Slovakia                              2
Slovenia                              2
Russian Federation                    1
Ghana                                 1
France                                1
Lithuania                             1
Mexico                                1
Jamaica                               1
Cuba                                  1
Thailand                              1
Nicaragua                             1
Greenland                             1
Heard Island and Mcdonald Islands     1
Bulgaria                              1
Wallis and Futuna                     1
Turkmenistan                          1
Liechtenstein                         1
Kenya                                 1
Monaco                                1
Cayman Islands                        1
India                                 1
Name: custCountry, dtype: int64

In [23]:

#Create a new dataset by taking first 30 observations from this data

sales_new=Sales_country.head(30)

In [24]:

#Print the resultant data
print(sales_new)

    custId            custName                   custCountry productSold  \
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   
11   23273        Isaac Cooper          Netherlands Antilles     SUPA104   
12   23274         Asher Weber                     Macedonia     PURA100   
13   23275       Ethan Gregory                        Tuvalu     DETA800   
14   23276       Hayes Rollins                         Nepal     PURA500   
15   23277      MacKenzie Moss                          Oman     SUPA101   
16   23278   Aphrodite Brennan                        Malawi     SUPA105   
17   23279         Angela Wise                       Moldova     PURA100   
18   23280       James Spencer                  Burkina Faso     SUPA103   
19   23281       Adria Kaufman                 Bouvet Island     SUPA102   
20   23282      Amir Alexander                       Liberia     DETA100   
21   23283          Lani Sweet                       Vanuatu     SUPA105   
22   23284        Clark Weaver                         Palau     PURA250   
23   23285    Leonard Cardenas                    Madagascar     SUPA102   
24   23286       Renee Padilla                         Yemen     DETA800   
25   23287         Joy Vazquez                         Korea     PURA250   
26   23288         Ingrid Bush                    Montserrat     SUPA104   
27   23289        Deacon Craig                      Mongolia     SUPA105   
28   23290        Rama Goodwin                       Tunisia     DETA800   
29   23291     Jelani Odonnell                       Albania     DETA800   

   salesChannel  unitsSold    dateSold  
0        Retail        117  2012-08-09  
1        Online         73  2012-07-06  
2        Online        205  2012-08-18  
3        Online         14  2012-08-05  
4        Retail        170  2012-08-11  
5        Retail        129  2012-07-11  
6        Retail         82  2012-07-12  
7        Retail        116  2012-06-03  
8        Online         67  2012-06-07  
9        Retail        125  2012-07-27  
10       Retail         71  2012-07-31  
11       Retail         22  2012-08-13  
12       Direct        153  2012-08-22  
13       Retail        141  2012-07-04  
14       Direct         65  2012-08-01  
15       Online        157  2012-07-12  
16       Online        197  2012-08-24  
17       Direct         10  2012-06-21  
18       Direct         30  2012-06-03  
19       Online        134  2012-07-13  
20       Online        100  2012-08-21  
21       Online        142  2012-06-24  
22       Online        135  2012-06-17  
23       Online          9  2012-07-24  
24       Retail         69  2012-08-08  
25       Retail        189  2012-06-17  
26       Direct        141  2012-06-14  
27       Retail        166  2012-08-02  
28       Online        170  2012-08-11  
29       Retail        199  2012-08-18

In [25]:

#Remove(delete) the new dataset

del(sales_new)

The next post is about manipulating datasets in python.
Link to the next post : https://statinfer.com/104-2-3-manipulting-datasets-in-python/

20th June 2017

104.2.2 Practice: Working with datasets in Python

Getting metadata of a dataset in python.

Printing the data and meta info

Statinfer

Statinfer

Statinfer

104.2.2 Practice: Working with datasets in Python

Getting metadata of a dataset in python.

Printing the data and meta info

Related Courses

Python(Batch6)

Statinfer

Tableau (Batch6)

Statinfer

PowerBI (Batch6)

Statinfer