
104.2.4 Practice: Manipulating a Dataset in Python

Subsetting a dataset in Python.

Link to the previous post: https://statinfer.com/104-2-3-manipulting-datasets-in-python/

In the previous post we saw how to manipulate a dataset using Python. In this post we put what we learned into practice.

Subsetting the data

  • Data: "./Bank Marketing/bank_market.csv"
  • Create a separate dataset for each of the tasks below:
    • Select the first 1000 rows only
    • Select only four columns: "Cust_num", "age", "default" and "balance"
    • Select observations 20,000 to 40,000 along with four variables: "Cust_num", "job", "marital" and "education"
    • Select observations 5000 to 6000 and drop "poutcome" and "y"
In [37]:
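import pandas as pd

# Load the bank marketing data; adjust the path to wherever the CSV is stored locally
bank_data = pd.read_csv("datasets\\Bank Marketing\\bank_market.csv")
bank_data.shape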
Out[37]:
(45211, 18)
In [38]:
bank_data.columns.values
Out[38]:
array(['Cust_num', 'age', 'job', 'marital', 'education', 'default',
       'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'y'], dtype=object)
In [48]:
# Select the first 1000 rows only

bank_data1 = bank_data.head(1000)
bank_data1.head(5)
Out[48]:
   Cust_num  age           job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome   y
0         1   58    management  married   tertiary       no     2143      yes    no  unknown    5    may       261         1     -1         0   unknown  no
1         2   44    technician   single  secondary       no       29      yes    no  unknown    5    may       151         1     -1         0   unknown  no
2         3   33  entrepreneur  married  secondary       no        2      yes   yes  unknown    5    may        76         1     -1         0   unknown  no
3         4   47   blue-collar  married    unknown       no     1506      yes    no  unknown    5    may        92         1     -1         0   unknown  no
4         5   33       unknown   single    unknown       no        1       no    no  unknown    5    may       198         1     -1         0   unknown  no
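
To see what `head(n)` returns without needing the CSV, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are illustrative assumptions, not part of the bank dataset. It suggests that `head(n)` gives the same rows as the positional slice `iloc[:n]`.

```python
import pandas as pd

# Hypothetical toy data standing in for bank_market.csv
toy = pd.DataFrame({
    "Cust_num": range(1, 11),
    "age": [58, 44, 33, 47, 33, 35, 28, 42, 58, 43],
})

first_three = toy.head(3)        # first 3 rows
also_first_three = toy.iloc[:3]  # same rows via positional slicing

print(first_three.equals(also_first_three))  # True
print(first_three.shape)                     # (3, 2)
```

On the real data the same idea applies with `n = 1000`.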
In [47]:
# Select only four columns: "Cust_num", "age", "default" and "balance"

bank_data2 = bank_data[["Cust_num", "age","default","balance"]]
bank_data2.head(5)
Out[47]:
   Cust_num  age  default  balance
0         1   58       no     2143
1         2   44       no       29
2         3   33       no        2
3         4   47       no     1506
4         5   33       no        1
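
The double brackets in the column selection above matter. As a small sketch on made-up data (the `toy` frame is an assumption for illustration), passing a list of column names returns a DataFrame, while a single column name in single brackets returns a Series.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the post
toy = pd.DataFrame({
    "Cust_num": [1, 2, 3],
    "age": [58, 44, 33],
    "default": ["no", "no", "no"],
    "balance": [2143, 29, 2],
})

subset = toy[["Cust_num", "age"]]  # list of names -> DataFrame
single = toy["age"]                # one name -> Series

print(type(subset).__name__)  # DataFrame
print(type(single).__name__)  # Series
print(list(subset.columns))   # ['Cust_num', 'age']
```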
In [46]:
# Select observations 20,000 to 40,000 (the slice covers positions 20000 through 39999) along with four variables: "Cust_num", "job", "marital" and "education"
bank_data3 = bank_data[["Cust_num", "job","marital","education"]][20000:40000]
bank_data3.head(5)
Out[46]:
       Cust_num         job  marital  education
20000     20001   housemaid  married    primary
20001     20002  management  married   tertiary
20002     20003  technician  married  secondary
20003     20004  technician   single   tertiary
20004     20005  management  married   tertiary
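
The row slice used above is position-based and excludes its end point, which is why the first index shown is 20000 and the last row selected is at position 39999. The sketch below illustrates this on a made-up ten-row frame (the `toy` data is an assumption) and shows two alternative spellings that select rows and columns in one call. The `.loc` form with integer labels matches only because the frame has the default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical toy data; column names follow the post
toy = pd.DataFrame({
    "Cust_num": range(1, 11),
    "job": ["management", "technician"] * 5,
    "marital": ["married", "single"] * 5,
    "education": ["tertiary", "secondary"] * 5,
    "balance": [10 * i for i in range(10)],
})

cols = ["Cust_num", "job", "marital", "education"]

# Style used in the post: pick columns, then slice rows by position (end excluded)
chained = toy[cols][2:5]

# One-step alternative: positional rows translated to labels, named columns
one_step = toy.loc[toy.index[2:5], cols]

# Label-based slice: .loc includes the end label, so 2:4 covers labels 2, 3 and 4
by_label = toy.loc[2:4, cols]

print(list(chained["Cust_num"]))  # [3, 4, 5]
print(chained.equals(one_step))   # True
print(chained.equals(by_label))   # True
```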
In [45]:
# Select observations 5000 to 6000 (positions 5000 through 5999) and drop "poutcome" and "y"
bank_data4 = bank_data.drop(['poutcome', 'y'], axis=1)[5000:6000]
bank_data4.head(5)
Out[45]:
      Cust_num  age         job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous
5000      5001   32  management   single   tertiary       no      728      yes    no  unknown   21    may       125         1     -1         0
5001      5002   38    services  married   tertiary       no     -121      yes    no  unknown   21    may       288         1     -1         0
5002      5003   29  management   single   tertiary       no      330      yes    no  unknown   21    may       315         1     -1         0
5003      5004   31  management   single   tertiary       no      825      yes    no  unknown   21    may       506         2     -1         0
5004      5005   36  management   single   tertiary       no      247       no    no  unknown   21    may       354         5     -1         0
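
As a side note on `drop`, the sketch below uses a made-up frame (the `toy` data is an assumption) to show that `drop(..., axis=1)` and the keyword form `drop(columns=...)` give the same result, and that `drop` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy data; "poutcome" and "y" mirror the columns dropped in the post
toy = pd.DataFrame({
    "Cust_num": range(1, 11),
    "balance": [100 * i for i in range(10)],
    "poutcome": ["unknown"] * 10,
    "y": ["no"] * 10,
})

# Style used in the post: drop along axis=1, then slice rows by position
dropped_axis = toy.drop(["poutcome", "y"], axis=1)[5:8]

# Equivalent keyword form, with an explicit positional row slice
dropped_kw = toy.drop(columns=["poutcome", "y"]).iloc[5:8]

print(dropped_axis.equals(dropped_kw))  # True
print(list(dropped_kw.columns))         # ['Cust_num', 'balance']
print("poutcome" in toy.columns)        # True, the original frame is untouched
```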


The next post is about subsetting data with a variable filter condition in Python.

Link to the next post: https://statinfer.com/104-2-5-subsetting-data-with-variable-filter-condition-in-python/
