In this post, we will look at how to import datasets into Python.
Data import from CSV files
- Use the pandas function read_csv
- Use “/” or “\\” in the path. A single Windows-style “\” doesn’t work reliably, because Python treats it as the start of an escape sequence
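To see why a single backslash is risky, here is a small sketch that uses only the standard library. The file name `temp.csv` is hypothetical and is not one of the datasets used in this post; it is chosen only because `\t` happens to be a valid escape sequence.

```python
from pathlib import PureWindowsPath

# Hypothetical file name, used only to illustrate escaping
single = "datasets\temp.csv"     # "\t" silently becomes a TAB character
doubled = "datasets\\temp.csv"   # escaped backslash: the intended path
raw = r"datasets\temp.csv"       # raw string: backslashes kept as typed
forward = "datasets/temp.csv"    # forward slashes also work on Windows

print("\t" in single)            # the single-backslash path now contains a TAB
print(doubled == raw)            # both spell the same path
# PureWindowsPath treats "/" and "\" as the same separator
print(PureWindowsPath(forward) == PureWindowsPath(doubled))
```

Any of the last three spellings should be safe to pass to pandas; the first one points at a file name that does not exist.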
Importing from CSV files
In [2]:
import pandas as pd # importing library pandas
Sales = pd.read_csv("datasets\\Superstore Sales Data\\Sales_sample.csv")
print(Sales)
Data import from Excel files
- Use pandas again, this time with the function read_excel
In [2]:
import pandas as pd
wb_data = pd.read_excel("datasets\\World Bank Data\\World Bank Indicators.xlsx", sheet_name="Data by country", index_col=None, na_values=['NA'])
wb_data.head(5)
Basic Commands on Datasets
- Was the data imported correctly? Were the variables imported in the right format? Did we import all the rows?
- Once the dataset is inside Python, we should run some basic checks to get an idea of the dataset.
- Simply printing the data is not always a good option.
- It is good practice to check the number of rows and columns, take a quick look at the variable structures, and view a summary and a data snapshot.
Checklist after Import
Data: Superstore Sales Data\Sales_sample.csv
Code | Description |
---|---|
Sales.shape | To check the number of rows and columns |
Sales.columns.values | What are the column names? Sometimes the import does not pick up the column names |
Sales.head(10) | First few observations of data |
Sales.tail(10) | Last few observations of the data |
Sales.dtypes | Data types of all variables |
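The checklist can be tried without the original file. The sketch below assumes a few made-up rows in place of Sales_sample.csv; only the column names (custId, custCountry, salesChannel) come from this post, and the values are invented for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for Sales_sample.csv (values are illustrative only)
csv_text = """custId,custCountry,salesChannel
101,India,Online
102,USA,Retail
103,India,Online
,UK,Retail
"""
Sales = pd.read_csv(io.StringIO(csv_text))

print(Sales.shape)            # (number of rows, number of columns)
print(Sales.columns.values)   # column names picked up from the header row
print(Sales.head(2))          # first two observations
print(Sales.tail(2))          # last two observations
print(Sales.dtypes)           # data type of each column
```

In this toy data the last row has no custId, so pandas reads that column as floating point rather than integer. A surprise like this in dtypes is often the first hint of missing or badly formatted values.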
Quick Summary
Code | Description |
---|---|
Sales.describe() | Summary of all variables |
Sales['custId'].describe() | Summary of a variable |
Sales.salesChannel.value_counts() | Get frequency table for a given variable |
Sales.custCountry.value_counts() | Get frequency table for a categorical variable |
Sales.custId.isnull().sum() | Missing value count in a variable |
Sales.sample(n=10) | Take a random sample of size 10 |
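As a rough illustration of the summary commands, the sketch below builds a small made-up DataFrame. The column names follow this post, while the values and the `random_state` seed are assumptions added so the example is repeatable.

```python
import pandas as pd

# Made-up data; only the column names come from this post
Sales = pd.DataFrame({
    "custId": [101, 102, None, 104, 105],
    "custCountry": ["India", "USA", "India", "UK", "India"],
    "salesChannel": ["Online", "Retail", "Online", "Online", "Retail"],
})

summary = Sales["custId"].describe()                   # count, mean, std, min, quartiles, max
channel_counts = Sales["salesChannel"].value_counts()  # frequency table for one variable
country_counts = Sales["custCountry"].value_counts()   # frequency table for another variable
missing_ids = Sales["custId"].isnull().sum()           # number of missing values in custId
sample_rows = Sales.sample(n=3, random_state=0)        # random sample of 3 rows

print(summary)
print(channel_counts)
print(country_counts)
print(missing_ids)
print(sample_rows)
```

Note that describe() counts only the non-missing values, so comparing its count with the number of rows is another quick way to spot missing data.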
The next post is a practice session with datasets in Python.
Link to the next post: https://statinfer.com/104-2-1-importing-data-in-python/