In this post we will look at how to import datasets into Python using pandas.

Data import from CSV files

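Use the pandas function read_csv (called as pd.read_csv in the code that follows). Note that read.csv is the R function, not a Python one.
Use “/” or “\\” as the separator in the path. A single Windows-style “\” doesn’t work reliably, because Python treats a backslash inside a normal string as the start of an escape sequence.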
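The path rule above can be checked without touching any file. The following is a minimal sketch using only the standard library; the name temp.csv is a hypothetical file used purely to show the escape-sequence problem, and the raw-string form (r"...") is an extra option not mentioned in the original note.

```python
from pathlib import PureWindowsPath

# Three ways to write the same relative path used in this post.
escaped = "datasets\\Superstore Sales Data\\Sales_sample.csv"   # doubled backslashes
raw = r"datasets\Superstore Sales Data\Sales_sample.csv"        # raw string, no escaping needed
forward = "datasets/Superstore Sales Data/Sales_sample.csv"     # forward slashes

# The doubled-backslash and raw-string forms are the same string.
print(escaped == raw)

# Why a single backslash is risky: "\t" is read as a tab character.
risky = "datasets\temp.csv"   # hypothetical file name, for illustration only
print("\t" in risky)

# For Windows paths, "/" and "\" are treated as the same separator.
print(PureWindowsPath(escaped) == PureWindowsPath(forward))
```

All three checks print True, which is why the doubled backslash and the forward slash are both safe choices.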

Importing from CSV files

In [2]:

import pandas as pd     # importing library pandas

Sales = pd.read_csv("datasets\\Superstore Sales Data\\Sales_sample.csv")

print(Sales)

    custId            custName                   custCountry productSold  \
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   

   salesChannel  unitsSold   dateSold  
0        Retail        117   8/9/2012  
1        Online         73   7/6/2012  
2        Online        205  8/18/2012  
3        Online         14   8/5/2012  
4        Retail        170  8/11/2012  
5        Retail        129  7/11/2012  
6        Retail         82  7/12/2012  
7        Retail        116   6/3/2012  
8        Online         67   6/7/2012  
9        Retail        125  7/27/2012  
10       Retail         71  7/31/2012
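In the output above, dateSold is printed as text such as 8/9/2012. As a hedged sketch of one way to turn it into a real date column, the example below reads three of the rows shown above from an in-memory string (so it runs without the actual file) and converts the column with pd.to_datetime. The month/day/year format is an assumption based on values like 8/18/2012 in the printed data.

```python
import io
import pandas as pd

# Three rows copied from the printed output, held in memory instead of a file.
csv_text = """custId,custName,custCountry,productSold,salesChannel,unitsSold,dateSold
23262,Candice Levy,Congo,SUPA101,Retail,117,8/9/2012
23263,Xerxes Smith,Panama,DETA200,Online,73,7/6/2012
23265,Uriel Benton,South Africa,SUPA104,Online,14,8/5/2012
"""

# read_csv accepts a file-like object as well as a path.
sales = pd.read_csv(io.StringIO(csv_text))

# Assumption: dates are month/day/year, as values like 8/18/2012 suggest.
sales["dateSold"] = pd.to_datetime(sales["dateSold"], format="%m/%d/%Y")

print(sales.dtypes)                          # dateSold is now a datetime column
print(sales["dateSold"].dt.month.tolist())   # month number of each sale
```

With the real file, the same conversion can be applied to the Sales data frame after pd.read_csv.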

Data import from Excel files

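Excel files are read with pandas as well, using pd.read_excel. Reading .xlsx files also needs an Excel engine such as openpyxl to be installed.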

In [2]:

import pandas as pd

wb_data = pd.read_excel("datasets\\World Bank Data\\World Bank Indicators.xlsx", sheet_name="Data by country", index_col=None, na_values=['NA'])

wb_data.head(5)

Out[2]:

	Country Name	Date	Transit: Passenger cars (per 1,000 people)	Business: Mobile phone subscribers	Business: Internet users (per 100 people)	Health: Mortality, under-5 (per 1,000 live births)	Health: Health expenditure per capita (current US$)	Health: Health expenditure, total (% GDP)	Population: Total (count)	Population: Urban (count)	Population:: Birth rate, crude (per 1,000)	Health: Life expectancy at birth, female (years)	Health: Life expectancy at birth, male (years)	Health: Life expectancy at birth, total (years)	Population: Ages 0-14 (% of total)	Population: Ages 15-64 (% of total)	Population: Ages 65+ (% of total)	Finance: GDP (current US$)	Finance: GDP per capita (current US$)
0	Afghanistan	2000-07-01	NaN	0.0	NaN	151.0	11.0	8.0	25950816	5527524.0	51.0	45.0	45.0	45.0	48.0	50.0	2.0	NaN	NaN
1	Afghanistan	2001-07-01	NaN	0.0	0.0	150.0	11.0	9.0	26697430	5771984.0	50.0	46.0	45.0	46.0	48.0	50.0	2.0	2.461666e+09	92.0
2	Afghanistan	2002-07-01	NaN	25000.0	0.0	150.0	22.0	7.0	27465525	6025936.0	49.0	46.0	46.0	46.0	48.0	50.0	2.0	4.338908e+09	158.0
3	Afghanistan	2003-07-01	NaN	200000.0	0.0	151.0	25.0	8.0	28255719	6289723.0	48.0	46.0	46.0	46.0	48.0	50.0	2.0	4.766127e+09	169.0
4	Afghanistan	2004-07-01	NaN	600000.0	0.0	150.0	30.0	9.0	29068646	6563700.0	47.0	46.0	46.0	46.0	48.0	50.0	2.0	5.704203e+09	196.0

Basic Commands on Datasets

Is the data imported correctly? Are the variables imported in the right format? Did we import all the rows?
Once the dataset is inside Python, we would like to run some basic checks to get an idea of the dataset.
Just printing the data is not always a good option, especially when the dataset is large.
It is good practice to check the number of rows and columns, take a quick look at the variable structures, and view a summary and a snapshot of the data.

Check list after Import

Data: Superstore Sales Data\Sales_sample.csv

Code	Description
Sales.shape	To check the number of rows and columns
Sales.columns.values	What are the column names? Sometimes the import doesn’t pick up the column names
Sales.head(10)	First few observations of data
Sales.tail(10)	Last few observations of the data
Sales.dtypes	Data types of all variables
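The checks in the table above can be tried on a small stand-in data frame. This is a minimal sketch: the four rows are copied from the output printed earlier, and only four of the seven columns are kept to keep it short, so the shape differs from the real file.

```python
import pandas as pd

# Small stand-in for the Sales data (a subset of rows and columns, for illustration).
Sales = pd.DataFrame({
    "custId": [23262, 23263, 23264, 23265],
    "custName": ["Candice Levy", "Xerxes Smith", "Levi Douglas", "Uriel Benton"],
    "salesChannel": ["Retail", "Online", "Online", "Online"],
    "unitsSold": [117, 73, 205, 14],
})

n_rows, n_cols = Sales.shape          # (rows, columns) as a tuple
print(n_rows, n_cols)

print(list(Sales.columns.values))     # column names

print(Sales.head(2))                  # first two observations
print(Sales.tail(2))                  # last two observations

print(Sales.dtypes)                   # data type of each column
```

If the row count is lower than expected or a numeric column shows up as object, that usually points to a problem in the import step.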

Quick Summary

Code	Description
Sales.describe()	Summary of all numeric variables (pass include='all' to cover every column)
Sales['custId'].describe()	Summary of a single variable
Sales.salesChannel.value_counts()	Get frequency table for a given variable
Sales.custCountry.value_counts()	Get frequency tables for categorical variables
Sales.custId.isnull().sum()	Missing value count in a variable
Sales.sample(n=10)	Take a random sample of size 10
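The summary commands above can be sketched on a small stand-in data frame. The rows are taken from the output printed earlier, except that the last unitsSold value is deliberately blanked out (an assumption made only so the missing-value count has something to find). The random_state argument is added just to make the sample repeatable.

```python
import numpy as np
import pandas as pd

# Stand-in for the Sales data; the NaN is inserted on purpose for illustration.
Sales = pd.DataFrame({
    "custId": [23262, 23263, 23264, 23265, 23266],
    "salesChannel": ["Retail", "Online", "Online", "Online", "Retail"],
    "unitsSold": [117, 73, 205, 14, np.nan],
})

summary = Sales.describe()                      # count, mean, std, min, quartiles, max
print(summary)

channel_counts = Sales.salesChannel.value_counts()   # frequency table
print(channel_counts)

missing_units = Sales.unitsSold.isnull().sum()  # number of missing values
print(missing_units)

sampled = Sales.sample(n=3, random_state=0)     # random sample of 3 rows
print(sampled)
```

Note that describe() counts only non-missing values, so the count for unitsSold is one less than the number of rows.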

The next post is a practice session with datasets in Python.
Link to the next post: https://statinfer.com/104-2-1-importing-data-in-python/

20th June 2017

104.2.1 Importing data in Python

