
Handout – Data Handling in Python

Contents

  1. Introduction
  2. Data Frames
  3. Data Importing
  4. Working with datasets
  5. Manipulating the datasets
    • Creating new variables
    • Sorting
    • Removing Duplicates
  6. Merging
  7. Exporting the datasets into external files
  8. Conclusion

Introduction

Data handling is one of the most important skills for any data analyst. As data analysts we spend a fair amount of time handling and manipulating data. Before jumping into analysis we have to import the data and validate it. Sometimes we have to merge data from multiple sources into one consolidated dataset; sometimes we just need to create a new variable. These are the basic skills a data scientist needs to prepare a dataset before getting into analysis.

Data Frame

Data frames are table-like structures: each column contains measurements on one variable and each row contains one case or observation. A data frame can be built directly from a Python dictionary, with the keys as column names and lists of values as columns. When working with datasets or dataframes, the first thing to do is import the Pandas library, which is the easiest way to handle datasets in Python.

In [1]:
#Example: How to print a dataframe
import pandas as pd
data = {'name': ['Stan', 'Kyle', 'Eric', 'Kenny'], 'age':[9, 9, 11, 12]}
df = pd.DataFrame(data)
df
Out[1]:
age name
0 9 Stan
1 9 Kyle
2 11 Eric
3 12 Kenny

Observations

The above result shows that a dictionary called data has been created with two keys. The first key is "name" and its value is the list 'Stan', 'Kyle', 'Eric' and 'Kenny'; the second key is "age", with four values corresponding to the four names. This shows that a list can be used as the value in a key-value pair. The dictionary is then converted into a dataframe with the pd.DataFrame() function. Always remember that in the DataFrame function, D and F are capital letters.

Data Importing

Most datasets are imported from external sources rather than created by hand. Pandas can import such datasets and convert them into dataframes.

Data Importing from CSV Files

Suppose we have an external CSV file that needs to be imported; then we use the pd.read_csv() function and pass it the path of the file to be imported. While giving the path, use forward slashes ("/") or doubled backslashes ("\\"). The Windows style of a single backslash ("\") will not work, because the backslash is an escape character in Python strings.
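For example, the following three forms all point at the same (hypothetical) file; the raw-string form is a third option Python offers:

import pandas as pd
df = pd.read_csv("D:/Data/Sales_sample.csv")    # forward slashes
df = pd.read_csv("D:\\Data\\Sales_sample.csv")  # doubled (escaped) backslashes
df = pd.read_csv(r"D:\Data\Sales_sample.csv")   # raw string: single backslashes work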

In [2]:
#Import Superstore Sales Dataset

import pandas as pd     
Sales = pd.read_csv("Data/Superstore Sales Data/Sales_sample.csv")
print(Sales)
    custId            custName                   custCountry productSold  
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   

   salesChannel  unitsSold   dateSold  
0        Retail        117   8/9/2012  
1        Online         73   7/6/2012  
2        Online        205  8/18/2012  
3        Online         14   8/5/2012  
4        Retail        170  8/11/2012  
5        Retail        129  7/11/2012  
6        Retail         82  7/12/2012  
7        Retail        116   6/3/2012  
8        Online         67   6/7/2012  
9        Retail        125  7/27/2012  
10       Retail         71  7/31/2012  

Data Importing from Excel file

For importing data from an Excel file, the command is the pd.read_excel() function. The one difference in using this function is that we have to pass one extra value: the name of the sheet. An Excel workbook can contain many sheets, so it is important to specify the sheet we want to import.

In [3]:
# Import World Bank Indicators dataset

import pandas as pd
wb_data = pd.read_excel("Data/World Bank Data/World Bank Indicators.xlsx", "Data by country", index_col=None, na_values=['NA'])
wb_data.head(10)
Out[3]:
Country Name Date Transit: Railways, (million passenger-km) Transit: Passenger cars (per 1,000 people) Business: Mobile phone subscribers Business: Internet users (per 100 people) Health: Mortality, under-5 (per 1,000 live births) Health: Health expenditure per capita (current US$) Health: Health expenditure, total (% GDP) Population: Total (count) Population: Urban (count) Population:: Birth rate, crude (per 1,000) Health: Life expectancy at birth, female (years) Health: Life expectancy at birth, male (years) Health: Life expectancy at birth, total (years) Population: Ages 0-14 (% of total) Population: Ages 15-64 (% of total) Population: Ages 65+ (% of total) Finance: GDP (current US$) Finance: GDP per capita (current US$)
0 Afghanistan 2000-07-01 0.0 NaN 0.0 NaN 151.0 11.0 8.0 25950816 5527524.0 51.0 45.0 45.0 45.0 48.0 50.0 2.0 NaN NaN
1 Afghanistan 2001-07-01 0.0 NaN 0.0 0.0 150.0 11.0 9.0 26697430 5771984.0 50.0 46.0 45.0 46.0 48.0 50.0 2.0 2.461666e+09 92.0
2 Afghanistan 2002-07-01 0.0 NaN 25000.0 0.0 150.0 22.0 7.0 27465525 6025936.0 49.0 46.0 46.0 46.0 48.0 50.0 2.0 4.338908e+09 158.0
3 Afghanistan 2003-07-01 0.0 NaN 200000.0 0.0 151.0 25.0 8.0 28255719 6289723.0 48.0 46.0 46.0 46.0 48.0 50.0 2.0 4.766127e+09 169.0
4 Afghanistan 2004-07-01 0.0 NaN 600000.0 0.0 150.0 30.0 9.0 29068646 6563700.0 47.0 46.0 46.0 46.0 48.0 50.0 2.0 5.704203e+09 196.0
5 Afghanistan 2005-07-01 0.0 NaN 1200000.0 1.0 151.0 33.0 9.0 29904962 6848236.0 47.0 47.0 47.0 47.0 48.0 50.0 2.0 6.814754e+09 228.0
6 Afghanistan 2006-07-01 0.0 11.0 2520366.0 2.0 151.0 24.0 7.0 30751661 7158987.0 46.0 47.0 47.0 47.0 48.0 50.0 2.0 7.721932e+09 251.0
7 Afghanistan 2007-07-01 0.0 18.0 4668096.0 2.0 150.0 29.0 7.0 31622333 7481844.0 45.0 47.0 47.0 47.0 47.0 50.0 2.0 9.707374e+09 307.0
8 Afghanistan 2008-07-01 0.0 19.0 7898909.0 2.0 150.0 32.0 7.0 32517656 7817245.0 45.0 48.0 47.0 48.0 47.0 51.0 2.0 1.194030e+10 367.0
9 Afghanistan 2009-07-01 0.0 21.0 12000000.0 3.0 149.0 34.0 8.0 33438329 8165640.0 44.0 48.0 48.0 48.0 47.0 51.0 2.0 1.421367e+10 425.0

Observations

In the command above we passed the parameter na_values=['NA']. This tells pandas to treat any "NA" strings in the dataset as missing values (NaN). Running the command gives the results shown above.
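To confirm which values were read in as missing, a quick check can be run on the imported dataframe; a minimal sketch using wb_data from above:

# Count the missing (NaN) values in each column
wb_data.isnull().sum()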

Basic Commands on Datasets

  1. Is the data imported correctly? Are the variables imported in the right format? Did we import all the rows?
  2. Once the dataset is inside Python, we would like to do some basic checks to get an idea of the dataset.
  3. Just printing the data is not always a good option.
  4. It is good practice to check the number of rows and columns, take a quick look at the variable structures, and get a summary and a data snapshot.

Checklist after Importing

There are a few things to be done after importing a dataset; a sketch of the basic commands is given below.
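A minimal sketch of these checks, assuming the Sales dataframe imported earlier:

import pandas as pd
Sales = pd.read_csv("Data/Superstore Sales Data/Sales_sample.csv")

print(Sales.shape)           # number of rows and columns
print(Sales.columns.values)  # column names
print(Sales.dtypes)          # structure: type of each variable
print(Sales.head(10))        # first 10 observations
print(Sales.tail(5))         # last 5 observations
print(Sales.describe())      # summary statistics for numeric variables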

LAB: Basic commands on Datasets

  1. Import “Superstore Sales DataSales_by_country_v1.csv” data.
  2. Perform the basic checks on the data.
  3. How many rows and columns are there in this dataset?
  4. Print only column names in the dataset.
  5. Print first 10 observations.
  6. Print the last 5 observations.
  7. Get the summary of the dataset.
  8. Print the structure of the data.
  9. Describe the field unitsSold, custCountry.
  10. Create a new dataset by taking first 30 observations from this data.
  11. Print the resultant dataset.
  12. Remove(delete) the new dataset.

Subsetting the data

Once the dataset is imported and the basic checks are done, we might want to create a subset of the dataframe and perform analysis on that sub-dataset only. So how do we subset the data? There are many ways: we might want to keep only particular rows, only particular columns, or a selection of both rows and columns from the original dataset.

In [4]:
### Import GDP Dataset

import pandas as pd
GDP1 = pd.read_csv("Data/World Bank Data/GDP.csv", encoding="ISO-8859-1")

GDP1.columns.values
Out[4]:
array(['Country_code', 'Rank', 'Country', 'GDP'], dtype=object)

Observations

The above result shows the columns present in the GDP dataset.

In [5]:
### New dataset with selected rows
gdp=GDP1.head(10)
gdp
Out[5]:
Country_code Rank Country GDP
0 USA 1 United States 17419000
1 CHN 2 China 10354832
2 JPN 3 Japan 4601461
3 DEU 4 Germany 3868291
4 GBR 5 United Kingdom 2988893
5 FRA 6 France 2829192
6 BRA 7 Brazil 2346076
7 ITA 8 Italy 2141161
8 IND 9 India 2048517
9 RUS 10 Russian Federation 1860598

Observations

The above result shows the first 10 observations of the GDP dataset.

In [6]:
### New dataset with selected rows based on Index location
            
gdp1=GDP1.iloc[[2,9,15,25]]
gdp1
Out[6]:
Country_code Rank Country GDP
2 JPN 3 Japan 4601461
9 RUS 10 Russian Federation 1860598
15 IDN 16 Indonesia 888538
25 NOR 26 Norway 499817
In [7]:
### New dataset by keeping selected columns and selected rows

gdp2=GDP1[["Country", "Rank"]][0:10]
gdp2
Out[7]:
Country Rank
0 United States 1
1 China 2
2 Japan 3
3 Germany 4
4 United Kingdom 5
5 France 6
6 Brazil 7
7 Italy 8
8 India 9
9 Russian Federation 10
In [8]:
### New dataset with selected rows and excluding columns
gdp3=GDP1.drop(["Country_code"],axis=1)[0:12]
gdp3
Out[8]:
Rank Country GDP
0 1 United States 17419000
1 2 China 10354832
2 3 Japan 4601461
3 4 Germany 3868291
4 5 United Kingdom 2988893
5 6 France 2829192
6 7 Brazil 2346076
7 8 Italy 2141161
8 9 India 2048517
9 10 Russian Federation 1860598
10 11 Canada 1785387
11 12 Australia 1454675

LAB: Sub setting the data

  1. Data : “./Bank Marketing/bank_market.csv”.
  2. Create separate datasets for each of the below tasks.
  3. Select first 1000 rows only.
  4. Select only four columns “Cust_num” “age” “default” and “balance”.
  5. Select 20,000 to 40,000 observations along with four variables “Cust_num” “job” “marital” and “education” .
  6. Select 5000 to 6000 observations drop “poutcome“ and “y”.

Subset with variable filter conditions

Sometimes we want to select rows by applying conditions to particular variables.

In [9]:
### Import bank_market dataset
bank_data = pd.read_csv("Data/Bank Tele Marketing/Bank Tele Marketing/bank_market.csv")

### And Condition and Filters
bank_subset1=bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no")]
bank_subset1.head(10)
Out[9]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
7 8 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 9 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 10 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
10 11 41 admin. divorced secondary no 270 yes no unknown 5 may 222 1 -1 0 unknown no
12 13 53 technician married secondary no 6 yes no unknown 5 may 517 1 -1 0 unknown no
13 14 58 technician married unknown no 71 yes no unknown 5 may 71 1 -1 0 unknown no
14 15 57 services married secondary no 162 yes no unknown 5 may 174 1 -1 0 unknown no

Observations

The above result shows the first 10 observations of the bank_data subset: customers whose age is greater than 40 and who do not have a loan.

In [10]:
## OR Condition and Filters
bank_subset2=bank_data[(bank_data['age']>40) |  (bank_data['loan']=="no")]
bank_subset2.head(10)
Out[10]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 5 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
5 6 35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
7 8 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 9 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 10 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
10 11 41 admin. divorced secondary no 270 yes no unknown 5 may 222 1 -1 0 unknown no
11 12 29 admin. single secondary no 390 yes no unknown 5 may 137 1 -1 0 unknown no

Observations

The above result shows the first 10 observations of the bank_data subset: customers whose age is greater than 40 or who do not have a loan.

In [11]:
### AND, OR condition  Numeric and Character filters

bank_subset3= bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no") | (bank_data['marital']=="single" )]
bank_subset3.head(10)
Out[11]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 5 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
6 7 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no
7 8 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 9 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 10 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
10 11 41 admin. divorced secondary no 270 yes no unknown 5 may 222 1 -1 0 unknown no
11 12 29 admin. single secondary no 390 yes no unknown 5 may 137 1 -1 0 unknown no

Observations

The above result shows the first 10 observations of the bank_data subset: customers whose age is greater than 40 and who have no loan, or customers who are single. Note that & binds more tightly than |, so the condition is grouped as (age > 40 AND loan == "no") OR marital == "single".
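To avoid relying on operator precedence, the same filter can be written with explicit parentheses; a sketch:

# Same filter with explicit grouping: (age > 40 AND no loan) OR single
bank_subset3 = bank_data[((bank_data['age'] > 40) & (bank_data['loan'] == "no")) | (bank_data['marital'] == "single")]
bank_subset3.head(10)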

LAB: Subset with variable filter conditions

Data: “./Automobile Data Set/AutoDataset.csv”

  1. Create a new dataset for exclusively Toyota cars.
  2. Create a new dataset for all cars with city.mpg greater than 30 and engine size is less than 120.
  3. Create a new dataset by taking only sedan cars, keep only four variables(Make, body style, fuel type, price) in the final dataset.
  4. Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses).

Calculated Fields

Sometimes we want to create a new column from the existing columns in our dataset. Let us see how this is done in Python.

In [12]:
### Import AutoMobileDataset

import pandas as pd
auto_data = pd.read_csv("Data/Automobile Data Set/AutoDataset.csv")

auto_data['area']=(auto_data[' length'])*(auto_data[' width'])*(auto_data[' height'])

auto_data['area'].head(5)
Out[12]:
0    528019.904
1    528019.904
2    587592.640
3    634816.956
4    636734.832
Name: area, dtype: float64

Observations

The above result shows that a new variable area has been created from the existing variables "length", "width" and "height".

Sorting the data

If you want to sort the dataframe based on a particular column, use the .sort_values() function. Let us sort the Online Retail dataset based on the UnitPrice variable.

In [13]:
### Import Online Retail Sales data

Online_Retail = pd.read_csv("Data/Online Retail Sales Data/Online Retail.csv", encoding="ISO-8859-1")

##Sorting Variable UnitPrice in Ascending Order

Online_Retail_sort=Online_Retail.sort_values('UnitPrice')
Online_Retail_sort.head(5)
Out[13]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
299984 A563187 B Adjust bad debt 1 8/12/2011 14:52 -11062.06 NaN United Kingdom
299983 A563186 B Adjust bad debt 1 8/12/2011 14:51 -11062.06 NaN United Kingdom
40984 539750 22652 TRAVEL SEWING KIT 1 12/21/2010 15:40 0.00 NaN United Kingdom
52217 540696 84562A NaN 1 1/11/2011 9:14 0.00 NaN United Kingdom
52262 540699 POST NaN 1000 1/11/2011 9:32 0.00 NaN United Kingdom

Observations

The above result shows the dataset sorted on UnitPrice in ascending order, i.e. from low values to high values.

In [14]:
#Sorting variable UnitPrice in Descending Order

Online_Retail_sort=Online_Retail.sort_values('UnitPrice',ascending=False)
Online_Retail_sort.head(5)
Out[14]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
222681 C556445 M Manual -1 6/10/2011 15:31 38970.00 15098.0 United Kingdom
524602 C580605 AMAZONFEE AMAZON FEE -1 12/5/2011 11:36 17836.46 NaN United Kingdom
43702 C540117 AMAZONFEE AMAZON FEE -1 1/5/2011 9:55 16888.02 NaN United Kingdom
43703 C540118 AMAZONFEE AMAZON FEE -1 1/5/2011 9:57 16453.71 NaN United Kingdom
15017 537632 AMAZONFEE AMAZON FEE 1 12/7/2010 15:08 13541.33 NaN United Kingdom

Observations

The above result shows the dataset sorted on UnitPrice in descending order, i.e. from high values to low values.
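sort_values() also accepts a list of columns together with a matching list of ascending flags; a minimal sketch (the choice of columns here is illustrative):

# Sort by Country ascending, then by UnitPrice descending within each country
Online_Retail.sort_values(['Country', 'UnitPrice'], ascending=[True, False]).head(5)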

LAB: Sorting the data

  1. Import the Auto Dataset.
  2. Sort the dataset based on length.
  3. Sort the dataset based on length descending.

Identifying & Removing Duplicates

Duplicates are a big problem when we are doing analytics, so before doing any analysis we need to deal with the duplicates present in the dataset. Pandas provides the built-in .duplicated() function to identify duplicate rows and .drop_duplicates() to remove them. Let us see how this works.

In [15]:
## Import Bill Dataset

import pandas as pd
Bill_data=pd.read_csv("DataTelecom Data AnalysisBill.csv")

#Identifying Duplicates

dupes=Bill_data.duplicated()
sum(dupes)
Out[15]:
10
In [16]:
#Dimensions of Bill Dataset

Bill_data.shape
Out[16]:
(9462, 7)
In [17]:
# Removing Duplicates

Bill_data_uniq=Bill_data.drop_duplicates()
Bill_data_uniq.shape
Out[17]:
(9452, 7)
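Observations

Dropping the 10 duplicate rows identified above leaves 9462 - 10 = 9452 rows; the number of columns (7) is unchanged.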
In [18]:
## Identifying duplicates in Bill data based on cust_id

sum(Bill_data.cust_id.duplicated())
Out[18]:
73
In [19]:
#Dimensions of Bill dataset
Bill_data.shape
Out[19]:
(9462, 7)
In [20]:
#Removing Duplicates in cust_id variable

Bill_data_cust_uniq = Bill_data.drop_duplicates(['cust_id'])
Bill_data_cust_uniq.shape
Out[20]:
(9389, 7)
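By default, drop_duplicates() keeps the first occurrence of each cust_id and drops the rest; the keep parameter controls this behaviour. A minimal sketch:

# keep='first' (default) retains the first occurrence of each cust_id;
# keep='last' retains the last; keep=False drops every duplicated row
Bill_data.drop_duplicates(['cust_id'], keep='last').shape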

LAB: Handling Duplicates in Python

DataSet: “./Telecom Data Analysis/Complaints.csv”

  1. Identify overall duplicates in complaints data.
  2. Create a new dataset by removing overall duplicates in Complaints data.
  3. Identify duplicates in complaints data based on cust_id.
  4. Create a new dataset by removing duplicates based on cust_id in Complaints data.

Merging Datasets

In most real-world situations, the data we want to use comes in multiple files. We often need to combine these files into a single dataframe to analyze the complete data, using the pandas pd.merge() function.

What Merge Function will do?

pd.merge() takes two dataframes as its first two parameters (table1/dataframe1 and table2/dataframe2). The third parameter, on, names the key column on which the merge is performed; most datasets have a unique column that can serve as this key. The how parameter specifies which kind of join operation to perform (see the sketch after this list). There are four join operations:

  1. Left Join.
  2. Right Join.
  3. Outer Join .
  4. Inner Join.
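A minimal sketch of the call pattern (the dataframes and key column below are hypothetical, made up for illustration):

import pandas as pd
left_df = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})   # stands in for table1
right_df = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})  # stands in for table2
# on = the key column shared by both; how = "left", "right", "outer" or "inner"
result = pd.merge(left_df, right_df, on='key', how='inner')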

Working of Merging

Merging combines two tables or dataframes based on a key and returns a new dataframe. The new dataframe contains the columns of both inputs; which rows it contains is determined by the key column and the type of join.

Working of Joins

Inner Join

An inner join combines two dataframes based on a key and returns only the common rows, i.e. the rows whose key values match in both datasets.

Outer Join

An outer join returns all the rows from both dataframes based on the key, but common rows are not repeated: where a key value appears in both tables, the two rows are combined into one, with columns taken from both.

Left Outer Join

A left join returns all the rows from the left table, even for keys that have no matching value in the right table; the right-table columns are filled with NaN for those rows.

Right Outer Join

A right join returns all the rows from the right table, even for keys that have no matching value in the left table; the left-table columns are filled with NaN for those rows.
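To see how the four join types differ, here is a small sketch on two hypothetical dataframes that share the key values 2 and 3:

import pandas as pd
left_df = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right_df = pd.DataFrame({'key': [2, 3, 4], 'b': ['p', 'q', 'r']})
print(pd.merge(left_df, right_df, on='key', how='inner').shape)  # (2, 3): only keys 2 and 3
print(pd.merge(left_df, right_df, on='key', how='outer').shape)  # (4, 3): keys 1, 2, 3, 4
print(pd.merge(left_df, right_df, on='key', how='left').shape)   # (3, 3): keys 1, 2, 3
print(pd.merge(left_df, right_df, on='key', how='right').shape)  # (3, 3): keys 2, 3, 4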

Data sets merging and joining

Here we will import two datasets from the TV Commercial Slots Analysis data, perform all four types of joins on them, and see how each works.

Datasets:

      -TV Commercial Slots Analysis/orders.csv
      -TV Commercial Slots Analysis/slots.csv
In [21]:
# Import Orders dataset
Orders = pd.read_csv("Data/TV Commercial Slots Analysis/orders.csv")

#Dimensions of the dataset
Orders.shape
Out[21]:
(1369, 9)
In [22]:
# Import slots dataset
slots = pd.read_csv("Data/TV Commercial Slots Analysis/slots.csv")

#Dimensions of the dataset
slots.shape
Out[22]:
(1764, 17)
In [23]:
#Identifying duplicates for Unique_id variable in Orders dataset

sum(Orders.Unique_id.duplicated())
Out[23]:
3
In [24]:
#Identifying duplicates for Unique_id variable in Slots dataset

sum(slots.Unique_id.duplicated())
Out[24]:
13
In [25]:
#Removing duplicates for Unique_id variable in Orders dataset

orders1 = Orders.drop_duplicates(['Unique_id'])
sum(orders1.Unique_id.duplicated())
Out[25]:
0
In [26]:
#Removing duplicates for Unique_id variable in Slots dataset

slots1 = slots.drop_duplicates(['Unique_id'])
sum(slots1.Unique_id.duplicated())
Out[26]:
0
In [27]:
#Inner Join

inner_data = pd.merge(orders1, slots1, on = "Unique_id", how = "inner")
inner_data.shape
Out[27]:
(8, 25)
In [28]:
#Outer Join

Outer_data = pd.merge(orders1, slots1, on = "Unique_id", how = "outer")
Outer_data.shape
Out[28]:
(3109, 25)
In [29]:
#Left Outer Join

Left_outer_data = pd.merge(orders1, slots1, on = "Unique_id", how = "left")
Left_outer_data.shape
Out[29]:
(1366, 25)
In [30]:
#Right Outer Join

Right_outer_data = pd.merge(orders1, slots1, on = "Unique_id", how = "right")
Right_outer_data.shape
Out[30]:
(1751, 25)
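Observations

The row counts line up with the join definitions: orders1 has 1366 rows (1369 - 3 duplicates) and slots1 has 1751 rows (1764 - 13 duplicates). Only 8 Unique_id values appear in both datasets, so the inner join returns 8 rows, the left join keeps all 1366 order rows, the right join keeps all 1751 slot rows, and the outer join returns 1366 + 1751 - 8 = 3109 rows.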

LAB: Data Joins

Datasets

      “./Telecom Data Analysis/Bill.csv”
      “./Telecom Data Analysis/Complaints.csv”

  1. Import the data and remove duplicates based on cust_id.
  2. Create a dataset for each of these requirements.
  3. All the customers who appear either in bill data or complaints data.
  4. All the customers who appear both in bill data and complaints data.
  5. All the customers from bill data: Customers who have bill data along with their complaints.
  6. All the customers from complaints data: Customers who have Complaints data along with their bill info.

Exporting the Datasets or Dataframes into an External File

Once we have created a new dataframe or merged two dataframes, we may want to save the result as a data file on our hard drive. This can be done with the pandas .to_csv() function, which exports any dataframe into an external .csv file.

Syntax:

dataframe.to_csv('path/filename.csv')

Example:

Left_outer_data.to_csv('D:/Statinfer/outer_join.csv')
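By default, .to_csv() also writes the dataframe's row index as the first column of the file; passing index=False suppresses it. A minimal sketch (the path is illustrative):

# Export without the row-index column
Left_outer_data.to_csv('D:/Statinfer/outer_join.csv', index=False)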

Conclusion

In this session we started with data importing from various sources.

  1. We saw some basic commands to work with data.
  2. We also learned how to manipulate datasets and create new variables.
  3. We covered sorting datasets and handling duplicates.
  4. Joining datasets is also an important concept.
  5. There are many more topics to discuss in data handling, but the topics in this session are essential for any data scientist.

 
