Introduction

Data handling is the most important skill for any data analyst. As data analysts, we spend a fair amount of time handling or manipulating data. Before jumping into analysis we have to import the data, check it for accuracy and validate it. Sometimes we have to merge data from multiple sources into one consolidated dataset; sometimes we just need to create a new variable. These are the basic skills a data scientist needs in order to handle a dataset before getting into analysis.

Data Frame

Data frames are table-like structures: each column contains measurements on one variable, and each row contains one case or observation. A pandas data frame can be built from a Python dictionary and behaves much like a dictionary of columns. When working with datasets or data frames, the first step is to import the pandas library, which is one of the easiest ways to handle datasets in Python.

In [1]:

#Example: How to print a dataframe
import pandas as pd
data = {'name': ['Stan', 'Kyle', 'Eric', 'Kenny'], 'age':[9, 9, 11, 12]}
df = pd.DataFrame(data)
df

Out[1]:

	age	name
0	9	Stan
1	9	Kyle
2	11	Eric
3	12	Kenny

Observations

The code first creates a dictionary called data with two keys. The key name holds a list of four values: 'Stan', 'Kyle', 'Eric' and 'Kenny'. The key age holds a list of four values that correspond to those names. The previous session did not use a list as the value in a key-value pair, but as shown here a list works as a value too. Passing the dictionary to the pandas DataFrame function converts it into a dataframe. Remember that in pd.DataFrame the D and the F are capital letters.

Data Importing

Most datasets are imported from an external source rather than created by hand. pandas can read such a dataset and convert it into a dataframe.

Data Importing from CSV Files

Suppose we have an external CSV file that needs to be imported. For this we use the pd.read_csv() function, passing it the path of the file. In the path, use forward slashes ("/") or doubled backslashes ("\\"). The Windows style of single backslashes ("\") may not work, because Python treats a backslash inside a normal string as the start of an escape sequence.
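The path rule can be tried out with a minimal sketch like the one below. It assumes a throwaway file called sample.csv written to a temporary folder; the file name and its contents are made up for illustration and are not part of the course datasets.

```python
import os
import tempfile

import pandas as pd

# Create a small CSV in a temporary folder so the example is self-contained.
# The file name and its contents are hypothetical.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
pd.DataFrame({"name": ["Stan", "Kyle"], "age": [9, 9]}).to_csv(csv_path, index=False)

# Forward slashes are accepted on every platform, including Windows.
forward_path = csv_path.replace("\\", "/")
sample = pd.read_csv(forward_path)
print(sample.shape)

# Inside a normal string a backslash starts an escape sequence,
# so "\t" becomes a single tab character.
print(len("C:\temp"))    # the "\t" collapses into one character
print(len("C:\\temp"))   # a doubled backslash is one literal backslash
print(len(r"C:\temp"))   # a raw string keeps the backslash as written
```

A raw string (the r prefix) is a third option on Windows when a path is copied from the file explorer.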

In [2]:

#Import Superstore Sales Dataset

import pandas as pd
Sales = pd.read_csv("Data/Superstore Sales Data/Sales_sample.csv")
print(Sales)

    custId            custName                   custCountry productSold  
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   

   salesChannel  unitsSold   dateSold  
0        Retail        117   8/9/2012  
1        Online         73   7/6/2012  
2        Online        205  8/18/2012  
3        Online         14   8/5/2012  
4        Retail        170  8/11/2012  
5        Retail        129  7/11/2012  
6        Retail         82  7/12/2012  
7        Retail        116   6/3/2012  
8        Online         67   6/7/2012  
9        Retail        125  7/27/2012  
10       Retail         71  7/31/2012

Data Importing from Excel file

To import data from an Excel file, use the pd.read_excel() function. It works like read_csv with one extra argument: the name of the sheet. An Excel workbook can contain many sheets, so it is important to specify which one to import.

In [3]:

# Import World Bank Indicators dataset

import pandas as pd
wb_data = pd.read_excel("Data/World Bank Data/World Bank Indicators.xlsx", sheet_name="Data by country", index_col=None, na_values=['NA'])
wb_data.head(10)

Out[3]:

	Country Name	Date	Transit: Passenger cars (per 1,000 people)	Business: Mobile phone subscribers	Business: Internet users (per 100 people)	Health: Mortality, under-5 (per 1,000 live births)	Health: Health expenditure per capita (current US$)	Health: Health expenditure, total (% GDP)	Population: Total (count)	Population: Urban (count)	Population:: Birth rate, crude (per 1,000)	Health: Life expectancy at birth, female (years)	Health: Life expectancy at birth, male (years)	Health: Life expectancy at birth, total (years)	Population: Ages 0-14 (% of total)	Population: Ages 15-64 (% of total)	Population: Ages 65+ (% of total)	Finance: GDP (current US$)	Finance: GDP per capita (current US$)
0	Afghanistan	2000-07-01	NaN	0.0	NaN	151.0	11.0	8.0	25950816	5527524.0	51.0	45.0	45.0	45.0	48.0	50.0	2.0	NaN	NaN
1	Afghanistan	2001-07-01	NaN	0.0	0.0	150.0	11.0	9.0	26697430	5771984.0	50.0	46.0	45.0	46.0	48.0	50.0	2.0	2.461666e+09	92.0
2	Afghanistan	2002-07-01	NaN	25000.0	0.0	150.0	22.0	7.0	27465525	6025936.0	49.0	46.0	46.0	46.0	48.0	50.0	2.0	4.338908e+09	158.0
3	Afghanistan	2003-07-01	NaN	200000.0	0.0	151.0	25.0	8.0	28255719	6289723.0	48.0	46.0	46.0	46.0	48.0	50.0	2.0	4.766127e+09	169.0
4	Afghanistan	2004-07-01	NaN	600000.0	0.0	150.0	30.0	9.0	29068646	6563700.0	47.0	46.0	46.0	46.0	48.0	50.0	2.0	5.704203e+09	196.0
5	Afghanistan	2005-07-01	NaN	1200000.0	1.0	151.0	33.0	9.0	29904962	6848236.0	47.0	47.0	47.0	47.0	48.0	50.0	2.0	6.814754e+09	228.0
6	Afghanistan	2006-07-01	11.0	2520366.0	2.0	151.0	24.0	7.0	30751661	7158987.0	46.0	47.0	47.0	47.0	48.0	50.0	2.0	7.721932e+09	251.0
7	Afghanistan	2007-07-01	18.0	4668096.0	2.0	150.0	29.0	7.0	31622333	7481844.0	45.0	47.0	47.0	47.0	47.0	50.0	2.0	9.707374e+09	307.0
8	Afghanistan	2008-07-01	19.0	7898909.0	2.0	150.0	32.0	7.0	32517656	7817245.0	45.0	48.0	47.0	48.0	47.0	51.0	2.0	1.194030e+10	367.0
9	Afghanistan	2009-07-01	21.0	12000000.0	3.0	149.0	34.0	8.0	33438329	8165640.0	44.0	48.0	48.0	48.0	47.0	51.0	2.0	1.421367e+10	425.0

Observations

In the above call we passed the option na_values=['NA']. It tells pandas to treat the string "NA" in the sheet as a missing value, which is displayed as NaN in the result shown above.
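The effect of na_values can be seen on a tiny made-up table. The sketch below uses pd.read_csv on an in-memory string rather than pd.read_excel, so that it runs without a spreadsheet file; both functions accept na_values. The column names and the "missing" marker are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical data: "NA" and "missing" both stand for an absent value.
raw = io.StringIO(
    "country,gdp\n"
    "A,100\n"
    "B,NA\n"
    "C,missing\n"
)

# na_values adds extra strings to the markers pandas already treats as NaN.
# "NA" is recognised by default; "missing" is recognised only because we list it.
demo = pd.read_csv(raw, na_values=["missing"])

print(demo)
print(demo["gdp"].isna().sum())   # number of missing values in the gdp column
```

Without the na_values argument, the word "missing" would be read as text and the whole gdp column would be stored as strings.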

Basic Commands on Datasets

Is the data imported correctly? Are the variables imported in the right format? Did we import all the rows?
Once the dataset is inside Python, we would like to do some basic checks to get an idea of the dataset.
Simply printing the data is not always a good option.
It is good practice to check the number of rows and columns, take a quick look at the variable structures, and get a summary and a snapshot of the data.

Check list after Importing

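A few things should be checked right after importing a dataset: its dimensions, its column names, the first and last rows, the type of each column and a summary of the values.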
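One possible set of post-import checks is sketched below. The small sales dataframe is a hypothetical stand-in for an imported file; the methods themselves (shape, columns, head, tail, dtypes, info, describe, value_counts) are standard pandas.

```python
import pandas as pd

# Hypothetical stand-in for a dataset that has just been imported.
sales = pd.DataFrame({
    "custId": [23262, 23263, 23264, 23265],
    "custCountry": ["Congo", "Panama", "Gabon", "Congo"],
    "unitsSold": [117, 73, 205, 14],
})

print(sales.shape)                # (number of rows, number of columns)
print(sales.columns.values)       # column names only
print(sales.head(2))              # first rows
print(sales.tail(2))              # last rows
print(sales.dtypes)               # type of each column
sales.info()                      # structure: non-null counts and dtypes
print(sales.describe())           # summary statistics for numeric columns
print(sales["custCountry"].value_counts())  # frequency table for a text column
```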

Lab:Basic commands on Datasets

Import “Superstore Sales Data/Sales_by_country_v1.csv” data.
Perform the basic checks on the data.
How many rows and columns are there in this dataset?
Print only column names in the dataset.
Print first 10 observations.
Print the last 5 observations.
Get the summary of the dataset.
Print the structure of the data.
Describe the field unitsSold, custCountry.
Create a new dataset by taking first 30 observations from this data.
Print the resultant dataset.
Remove(delete) the new dataset.

Subsetting the data

Once the dataset is imported and the basic checks are done, we may want to create a subset of the dataframe and analyse only that. There are several ways to do this: we can select particular rows, select particular columns, or pick a few rows and a few columns together to form the new subset.

In [4]:

### Import GDP Dataset

import pandas as pd
GDP1 = pd.read_csv("Data/World Bank Data/GDP.csv", encoding="ISO-8859-1")

GDP1.columns.values

Out[4]:

array(['Country_code', 'Rank', 'Country', 'GDP'], dtype=object)

Observations

The above result shows the columns present in the GDP dataset.

In [5]:

### New dataset with selected rows
gdp=GDP1.head(10)
gdp

Out[5]:

	Country_code	Rank	Country	GDP
0	USA	1	United States	17419000
1	CHN	2	China	10354832
2	JPN	3	Japan	4601461
3	DEU	4	Germany	3868291
4	GBR	5	United Kingdom	2988893
5	FRA	6	France	2829192
6	BRA	7	Brazil	2346076
7	ITA	8	Italy	2141161
8	IND	9	India	2048517
9	RUS	10	Russian Federation	1860598

Observations

The above result shows the first 10 observations of the GDP dataset.

In [6]:

### New dataset with selected rows based on Index location
            
gdp1=GDP1.iloc[[2,9,15,25]]
gdp1

Out[6]:

	Country_code	Rank	Country	GDP
2	JPN	3	Japan	4601461
9	RUS	10	Russian Federation	1860598
15	IDN	16	Indonesia	888538
25	NOR	26	Norway	499817

In [7]:

### New dataset by keeping selected columns and selected rows

gdp2=GDP1[["Country", "Rank"]][0:10]
gdp2

Out[7]:

	Country	Rank
0	United States	1
1	China	2
2	Japan	3
3	Germany	4
4	United Kingdom	5
5	France	6
6	Brazil	7
7	Italy	8
8	India	9
9	Russian Federation	10

In [8]:

### New dataset with selected rows and excluding columns
gdp3=GDP1.drop(["Country_code"],axis=1)[0:12]
gdp3

Out[8]:

	Rank	Country	GDP
0	1	United States	17419000
1	2	China	10354832
2	3	Japan	4601461
3	4	Germany	3868291
4	5	United Kingdom	2988893
5	6	France	2829192
6	7	Brazil	2346076
7	8	Italy	2141161
8	9	India	2048517
9	10	Russian Federation	1860598
10	11	Canada	1785387
11	12	Australia	1454675
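The row and column selections above can also be written in a single step with .iloc (by position) or .loc (by label). The sketch below uses a hypothetical five-row miniature of the GDP table, so it runs without the data file.

```python
import pandas as pd

# Hypothetical miniature of the GDP table.
gdp_demo = pd.DataFrame({
    "Country_code": ["USA", "CHN", "JPN", "DEU", "GBR"],
    "Rank": [1, 2, 3, 4, 5],
    "Country": ["United States", "China", "Japan", "Germany", "United Kingdom"],
})

# iloc selects by position: rows 0 to 2 (the end, 3, is excluded), columns 1 and 2.
by_position = gdp_demo.iloc[0:3, [1, 2]]

# loc selects by label: index labels 0 through 2 (the end is included).
by_label = gdp_demo.loc[0:2, ["Rank", "Country"]]

print(by_position)
print(by_position.equals(by_label))
```

Note the difference in slicing: .iloc excludes the end position, while .loc includes the end label.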

LAB: Sub setting the data

Data : “./Bank Marketing/bank_market.csv”.
Create separate datasets for each of the below tasks.
Select first 1000 rows only.
Select only four columns “Cust_num” “age” “default” and “balance”.
Select observations 20,000 to 40,000 along with the four variables “Cust_num” “job” “marital” and “education”.
Select observations 5,000 to 6,000 and drop “poutcome” and “y”.

Subset with variable filter conditions

Sometimes we want to select rows by applying a condition to a particular variable.

In [9]:

### Import bank_market dataset
bank_data = pd.read_csv("Data/Bank Tele Marketing/Bank Tele Marketing/bank_market.csv")

### And Condition and Filters
bank_subset1=bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no")]
bank_subset1.head(10)

Out[9]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
7	8	42	entrepreneur	divorced	tertiary	yes	2	yes	no	unknown	5	may	380	1	-1	unknown	no
8	9	58	retired	married	primary	no	121	yes	no	unknown	5	may	50	1	-1	unknown	no
9	10	43	technician	single	secondary	no	593	yes	no	unknown	5	may	55	1	-1	unknown	no
10	11	41	admin.	divorced	secondary	no	270	yes	no	unknown	5	may	222	1	-1	unknown	no
12	13	53	technician	married	secondary	no	6	yes	no	unknown	5	may	517	1	-1	unknown	no
13	14	58	technician	married	unknown	no	71	yes	no	unknown	5	may	71	1	-1	unknown	no
14	15	57	services	married	secondary	no	162	yes	no	unknown	5	may	174	1	-1	unknown	no

Observations

The above result shows the first 10 observations of the bank_data subset of customers who are older than 40 and do not have a loan.

In [10]:

## OR Condition and Filters
bank_subset2=bank_data[(bank_data['age']>40) |  (bank_data['loan']=="no")]
bank_subset2.head(10)

Out[10]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	5	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
5	6	35	management	married	tertiary	no	231	yes	no	unknown	5	may	139	1	-1	unknown	no
7	8	42	entrepreneur	divorced	tertiary	yes	2	yes	no	unknown	5	may	380	1	-1	unknown	no
8	9	58	retired	married	primary	no	121	yes	no	unknown	5	may	50	1	-1	unknown	no
9	10	43	technician	single	secondary	no	593	yes	no	unknown	5	may	55	1	-1	unknown	no
10	11	41	admin.	divorced	secondary	no	270	yes	no	unknown	5	may	222	1	-1	unknown	no
11	12	29	admin.	single	secondary	no	390	yes	no	unknown	5	may	137	1	-1	unknown	no

Observations

The above result shows the first 10 observations of the bank_data subset of customers who are older than 40 or do not have a loan.

In [11]:

### AND, OR condition  Numeric and Character filters

bank_subset3= bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no") | (bank_data['marital']=="single" )]
bank_subset3.head(10)

Out[11]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	5	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
6	7	28	management	single	tertiary	no	447	yes	yes	unknown	5	may	217	1	-1	unknown	no
7	8	42	entrepreneur	divorced	tertiary	yes	2	yes	no	unknown	5	may	380	1	-1	unknown	no
8	9	58	retired	married	primary	no	121	yes	no	unknown	5	may	50	1	-1	unknown	no
9	10	43	technician	single	secondary	no	593	yes	no	unknown	5	may	55	1	-1	unknown	no
10	11	41	admin.	divorced	secondary	no	270	yes	no	unknown	5	may	222	1	-1	unknown	no
11	12	29	admin.	single	secondary	no	390	yes	no	unknown	5	may	137	1	-1	unknown	no

Observations

The above result shows the first 10 observations of the bank_data subset of customers who are either older than 40 with no loan, or single. Because & is evaluated before |, the condition is read as (age > 40 and no loan) or single.
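How the grouping of & and | changes a filter can be checked on a few made-up rows. The people dataframe below is hypothetical and chosen so that the two groupings select different rows.

```python
import pandas as pd

# Hypothetical customers, chosen so the two groupings give different answers.
people = pd.DataFrame({
    "age": [58, 28, 33, 45],
    "loan": ["no", "yes", "no", "yes"],
    "marital": ["married", "single", "single", "married"],
})

older = people["age"] > 40
no_loan = people["loan"] == "no"
single = people["marital"] == "single"

# & binds more tightly than |, so this is (older & no_loan) | single.
default_grouping = people[older & no_loan | single]

# Explicit parentheses change the meaning: older & (no_loan | single).
other_grouping = people[older & (no_loan | single)]

print(default_grouping.index.tolist())
print(other_grouping.index.tolist())
```

Writing the parentheses explicitly, even when they match the default, makes the intended filter easier to read.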

LAB: Subset with variable filter conditions

Data: “./Automobile Data Set/AutoDataset.csv”

Create a new dataset for exclusively Toyota cars.
Create a new dataset for all cars with city.mpg greater than 30 and engine size is less than 120.
Create a new dataset by taking only sedan cars, keep only four variables(Make, body style, fuel type, price) in the final dataset.
Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses).

Calculated Fields

Sometimes we want to create a new column based on previous existing columns in our dataset. Let us see how this is done through python code.

In [12]:

### Import AutoMobileDataset

import pandas as pd
auto_data = pd.read_csv("Data/Automobile Data Set/AutoDataset.csv")

auto_data['area']=(auto_data[' length'])*(auto_data[' width'])*(auto_data[' height'])

auto_data['area'].head(5)

Out[12]:

0    528019.904
1    528019.904
2    587592.640
3    634816.956
4    636734.832
Name: area, dtype: float64

Observations

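The above result shows that a new variable, area, has been created from the existing variables ' length', ' width' and ' height'. Since it is the product of three dimensions, it is really a volume. The column names in this dataset carry a leading space, which is why the code writes ' length' rather than 'length'.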
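A slightly tidier variant is sketched below: it strips the stray spaces from the column names first and then names the calculated field after what it measures. The two rows of car dimensions are a hypothetical stand-in for the automobile dataset.

```python
import pandas as pd

# Hypothetical rows; the leading spaces mimic the column names used above.
cars = pd.DataFrame({
    " length": [168.8, 171.2],
    " width": [64.1, 65.5],
    " height": [48.8, 52.4],
})

# Strip stray whitespace from the column names once, up front.
cars.columns = cars.columns.str.strip()

# length * width * height is a volume, so name the new column accordingly.
cars["volume"] = cars["length"] * cars["width"] * cars["height"]

print(cars)
```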

Sorting the data

To sort a dataframe by a particular column, use .sort_values(). Let us sort the Online Retail dataset by the UnitPrice variable.

In [13]:

### Import Online Retail Sales data

Online_Retail = pd.read_csv("Data/Online Retail Sales Data/Online Retail.csv", encoding="ISO-8859-1")

##Sorting Variable UnitPrice in Ascending Order

Online_Retail_sort=Online_Retail.sort_values('UnitPrice')
Online_Retail_sort.head(5)

Out[13]:

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
299984	A563187	B	Adjust bad debt	1	8/12/2011 14:52	-11062.06	NaN	United Kingdom
299983	A563186	B	Adjust bad debt	1	8/12/2011 14:51	-11062.06	NaN	United Kingdom
40984	539750	22652	TRAVEL SEWING KIT	1	12/21/2010 15:40	0.00	NaN	United Kingdom
52217	540696	84562A	NaN	1	1/11/2011 9:14	0.00	NaN	United Kingdom
52262	540699	POST	NaN	1000	1/11/2011 9:32	0.00	NaN	United Kingdom

Observations

The above result shows that the data has been sorted by UnitPrice in ascending order, i.e. from lower to higher values.

In [14]:

#Sorting variable UnitPrice in Descending Order

Online_Retail_sort=Online_Retail.sort_values('UnitPrice',ascending=False)
Online_Retail_sort.head(5)

Out[14]:

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
222681	C556445	M	Manual	-1	6/10/2011 15:31	38970.00	15098.0	United Kingdom
524602	C580605	AMAZONFEE	AMAZON FEE	-1	12/5/2011 11:36	17836.46	NaN	United Kingdom
43702	C540117	AMAZONFEE	AMAZON FEE	-1	1/5/2011 9:55	16888.02	NaN	United Kingdom
43703	C540118	AMAZONFEE	AMAZON FEE	-1	1/5/2011 9:57	16453.71	NaN	United Kingdom
15017	537632	AMAZONFEE	AMAZON FEE	1	12/7/2010 15:08	13541.33	NaN	United Kingdom

Observations

The above result shows that the data has been sorted by UnitPrice in descending order, i.e. from higher to lower values.
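sort_values also accepts a list of columns, each with its own direction. The sketch below is an illustration on made-up order lines; the column names echo the Online Retail data, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical order lines.
orders = pd.DataFrame({
    "Country": ["France", "Germany", "France", "Germany"],
    "UnitPrice": [2.5, 1.0, 7.0, 3.0],
})

# Sort by Country ascending, then by UnitPrice descending within each country.
ordered = orders.sort_values(["Country", "UnitPrice"], ascending=[True, False])

# reset_index(drop=True) renumbers the rows after sorting.
ordered = ordered.reset_index(drop=True)

print(ordered)
```

Sorting returns a new dataframe and leaves the original unchanged, which is why the result is assigned to a new name.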

LAB: Sorting the data

Import the Auto Dataset.
Sort the dataset based on length.
Sort the dataset based on length descending.

Identifying & Removing Duplicates

Duplicates are a big problem in analytics, so before analysis we need to remove any duplicates present in the dataset. pandas provides two built-in methods for this: .duplicated() flags repeated rows, and .drop_duplicates() removes them. Let us see how they work.

In [15]:

## Import Bill Dataset

import pandas as pd
Bill_data = pd.read_csv("Data/Telecom Data Analysis/Bill.csv")

#Identifying Duplicates

dupes=Bill_data.duplicated()
sum(dupes)

Out[15]:

10

In [16]:

#Dimensions of Bill Dataset

Bill_data.shape

Out[16]:

(9462, 7)

In [17]:

# Removing Duplicates

Bill_data_uniq=Bill_data.drop_duplicates()
Bill_data_uniq.shape

Out[17]:

(9452, 7)

In [18]:

## Identifying duplicates in Bill data based on cust_id

sum(Bill_data.cust_id.duplicated())

Out[18]:

73

In [19]:

#Dimensions of Bill dataset
Bill_data.shape

Out[19]:

(9462, 7)

In [20]:

#Removing Duplicates in cust_id variable

Bill_data_cust_uniq = Bill_data.drop_duplicates(['cust_id'])
Bill_data_cust_uniq.shape

Out[20]:

(9389, 7)
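The difference between whole-row duplicates and duplicates on a single column can be seen on a four-row example. The bills dataframe below is hypothetical; subset and keep are standard parameters of duplicated() and drop_duplicates().

```python
import pandas as pd

# Hypothetical bills: row 1 repeats row 0 exactly, row 3 repeats only cust_id 2.
bills = pd.DataFrame({
    "cust_id": [1, 1, 2, 2],
    "amount": [100, 100, 250, 300],
})

# Whole-row duplicates: only exact copies of an earlier row are flagged.
full_dupes = bills.duplicated().sum()
print(full_dupes)

# Duplicates on one column: every repeat of a cust_id is flagged.
id_dupes = bills.duplicated(subset=["cust_id"]).sum()
print(id_dupes)

# keep="last" retains the final occurrence of each cust_id instead of the first.
latest = bills.drop_duplicates(subset=["cust_id"], keep="last")
print(latest)
```

The number of rows removed by drop_duplicates always equals the number of rows flagged by duplicated with the same arguments, which is a quick way to cross-check the shapes shown above.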

LAB: Handling Duplicates in Python

DataSet: “./Telecom Data Analysis/Complaints.csv”

Identify overall duplicates in complaints data.
Create a new dataset by removing overall duplicates in Complaints data.
Identify duplicates in complaints data based on cust_id.
Create a new dataset by removing duplicates based on cust_id in Complaints data.

Merging Datasets

In most “real world” situations, the data we want to use comes in multiple sets. We often need to combine these files into a single DataFrame to analyse the complete data. The pandas pd.merge() function does this.

What Merge Function will do?

pd.merge() takes the two dataframes to combine as its first two parameters. The on parameter names the key column on which the merge is performed; most datasets have a column of unique identifiers that can serve as the key. The how parameter selects the kind of join. There are four join operations:

Left Join.
Right Join.
Outer Join .
Inner Join.

Working of Merging

Merging combines two tables or dataframes on a key and returns a new dataframe. The new dataframe contains the columns of both inputs, with the key column appearing once; which rows it contains depends on the type of join.

Working of Joins

Inner Join

An inner join combines two dataframes on a key and returns only the rows whose key values appear in both datasets.

Outer Join

An outer join returns all the rows from both dataframes. Rows whose key appears in both tables are combined into a single row rather than repeated; rows whose key appears in only one table are kept, with missing values (NaN) in the columns that come from the other table.

Left Outer Join

A left join returns all the rows from the left table, even for keys that have no match in the right table.

Right Outer Join

A right join returns all the rows from the right table, even for keys that have no match in the left table.
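The four join types can be compared side by side on two tiny tables. The sketch below uses hypothetical dataframes named left and right that share a key column called id; only the how argument changes between the merges.

```python
import pandas as pd

# Hypothetical tables sharing the key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "order": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "slot": ["x", "y", "z"]})

# Run the same merge with each join type and record which keys survive.
results = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="id", how=how)
    results[how] = merged["id"].tolist()
    print(how, results[how])
```

Keys 2 and 3 exist in both tables, key 1 only on the left and key 4 only on the right, so each join type keeps a different set of keys.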

Data sets merging and joining

Here we will import two datasets from Commercial Slot Analysis and we will perform all the four types of joins on the 2 datasets and we will see how it works.

Datasets:

      -TV Commercial Slots Analysis/orders.csv
      -TV Commercial Slots Analysis/slots.csv

In [21]:

# Import Orders dataset
Orders = pd.read_csv("Data/TV Commercial Slots Analysis/orders.csv")

#Dimensions of the dataset
Orders.shape

Out[21]:

(1369, 9)

In [22]:

# Import slots dataset
slots = pd.read_csv("Data/TV Commercial Slots Analysis/slots.csv")

#Dimensions of the dataset
slots.shape

Out[22]:

(1764, 17)

In [23]:

#Identifying duplicates for Unique_id variable in Orders dataset

sum(Orders.Unique_id.duplicated())

Out[23]:

3

In [24]:

#Identifying duplicates for Unique_id variable in Slots dataset

sum(slots.Unique_id.duplicated())

Out[24]:

13

In [25]:

#Removing duplicates for Unique_id variable in Orders dataset

orders1 = Orders.drop_duplicates(['Unique_id'])
sum(orders1.Unique_id.duplicated())

Out[25]:

0

In [26]:

#Removing duplicates for Unique_id variable in Slots dataset

slots1 = slots.drop_duplicates(['Unique_id'])
sum(slots1.Unique_id.duplicated())

Out[26]:

0

In [27]:

#Inner Join

inner_data = pd.merge(orders1, slots1, on = "Unique_id", how = "inner")
inner_data.shape

Out[27]:

(8, 25)

In [28]:

#Outer Join

Outer_data = pd.merge(orders1, slots1, on = "Unique_id", how = "outer")
Outer_data.shape

Out[28]:

(3109, 25)

In [29]:

#Left Outer Join

Left_outer_data = pd.merge(orders1, slots1, on = "Unique_id", how = "left")
Left_outer_data.shape

Out[29]:

(1366, 25)

In [30]:

#Right Outer Join

Right_outer_data = pd.merge(orders1, slots1, on = "Unique_id", how = "right")
Right_outer_data.shape

Out[30]:

(1751, 25)
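The four shapes above are consistent with each other: when the key is unique in both tables, outer rows = left rows + right rows - inner rows, and here 1366 + 1751 - 8 = 3109. The sketch below checks the same relation on two hypothetical tables, using the indicator option of pd.merge to label where each row came from.

```python
import pandas as pd

# Hypothetical tables sharing the key column "Unique_id".
orders_demo = pd.DataFrame({"Unique_id": ["A", "B", "C"], "cost": [10, 20, 30]})
slots_demo = pd.DataFrame({"Unique_id": ["B", "C", "D", "E"], "minutes": [5, 6, 7, 8]})

# indicator=True adds a _merge column: left_only, right_only or both.
outer_demo = pd.merge(orders_demo, slots_demo, on="Unique_id", how="outer", indicator=True)
counts = outer_demo["_merge"].value_counts()
print(counts)

# outer rows = left rows + right rows - rows matched in both
print(len(outer_demo) == len(orders_demo) + len(slots_demo) - counts["both"])
```

A very small count under "both", as in the inner join above, is often a hint that the key columns are formatted differently in the two files.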

LAB: Data Joins

Datasets

     “./Telecom Data Analysis/Bill.csv”
       “./Telecom Data Analysis/Complaints.csv”

Import the data and remove duplicates based on cust_id.
Create a dataset for each of these requirements.
All the customers who appear either in bill data or complaints data .
All the customers who appear both in bill data and complaints data.
All the customers from bill data: Customers who have bill data along with their complaints.
All the customers from complaints data: Customers who have Complaints data along with their bill info.

Exporting the Datasets or dataframe into external file

Once we have created a new dataframe or merged two dataframes, we may want to save it as a data file on our hard drive. The pandas method .to_csv() exports any dataframe to an external .csv file.

Syntax:

dataframe.to_csv('path/filename.csv')

Example:

Left_outer_data.to_csv('D:/Statinfer/outer_join.csv')
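A round trip is a simple way to confirm an export worked. The sketch below writes a hypothetical two-row dataframe to a temporary folder and reads it back; the dataframe, the folder and the file name are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe and output location.
result = pd.DataFrame({"Unique_id": ["A", "B"], "cost": [10, 20]})
out_path = os.path.join(tempfile.mkdtemp(), "outer_join.csv")

# index=False leaves the row numbers out of the file.
result.to_csv(out_path, index=False)

# Read the file back to confirm that the rows and columns survived.
check = pd.read_csv(out_path)
print(check)
print(check.shape == result.shape)
```

Without index=False, the row index is written as an extra unnamed first column, which then shows up as "Unnamed: 0" when the file is read again.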

Conclusion

In this session we started with importing data from various sources.

We saw some basic commands for working with data.
We also learned how to manipulate datasets and create new variables.
We sorted datasets and handled duplicates.
We joined datasets, which is also an important concept.
There are many more topics in data handling; the ones covered in this session are essential for any data scientist.

Handout – Data Handling in Python

Before starting the lesson, please download the datasets.
