Link to the previous post : https://statinfer.com/104-2-3-manipulting-datasets-in-python/

In the previous post we saw how to manipulate a dataset using Python. In this post we put what we learned into practice.

Subsetting the data

Data : “./Bank Marketing/bank_market.csv”
Create a separate dataset for each of the tasks below:
- Select the first 1000 rows only
- Select only four columns: “Cust_num”, “age”, “default” and “balance”
- Select observations 20,000 to 40,000, along with four variables: “Cust_num”, “job”, “marital” and “education”
- Select observations 5000 to 6000 and drop “poutcome” and “y”

In [37]:

import pandas as pd

bank_data=pd.read_csv("datasets\\Bank Marketing\\bank_market.csv")
bank_data.shape

Out[37]:

(45211, 18)

In [38]:

bank_data.columns.values

Out[38]:

array(['Cust_num', 'age', 'job', 'marital', 'education', 'default',
       'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'y'], dtype=object)

In [48]:

#Select first 1000 rows only

bank_data1 = bank_data.head(1000)
bank_data1.head(5)

Out[48]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
2	3	33	entrepreneur	married	secondary	no	2	yes	yes	unknown	5	may	76	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	5	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
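
As a quick cross-check on this task, `head(n)` and the positional slice `iloc[:n]` should return the same rows. The sketch below uses a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from bank_market.csv) so it runs without the data file:

```python
import pandas as pd

# Hypothetical stand-in for bank_data; all values are invented.
toy = pd.DataFrame({
    "Cust_num": [1, 2, 3, 4, 5],
    "age": [30, 41, 25, 52, 37],
    "balance": [100, -20, 0, 1500, 75],
})

first_three_head = toy.head(3)   # first 3 rows via head()
first_three_iloc = toy.iloc[:3]  # same rows via positional slicing

# Prints True when both approaches select identical rows.
print(first_three_head.equals(first_three_iloc))
```

On the real data the same comparison would use 1000 in place of 3.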

In [47]:

#Select only four columns "Cust_num", "age", "default" and "balance"

bank_data2 = bank_data[["Cust_num", "age","default","balance"]]
bank_data2.head(5)

Out[47]:

	Cust_num	age	default	balance
0	1	58	no	2143
1	2	44	no	29
2	3	33	no	2
3	4	47	no	1506
4	5	33	no	1
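
If only a few columns are ever needed, one possible alternative is to skip the others at load time with the `usecols` parameter of `pd.read_csv`. The sketch below reads from an in-memory string rather than the real file; `csv_text` and its values are hypothetical and only mimic the column names used here:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for bank_market.csv (values are made up).
csv_text = """Cust_num,age,job,default,balance
1,30,admin,no,100
2,41,services,yes,-20
"""

wanted = ["Cust_num", "age", "default", "balance"]

# usecols tells read_csv to load only the listed columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(list(subset.columns))
```

This trades flexibility for memory: the dropped columns are never loaded, so they cannot be used later without reading the file again.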

In [46]:

#Select 20,000 to 40,000 observations along with four variables "Cust_num", "job", "marital" and "education"
bank_data3 = bank_data[["Cust_num", "job","marital","education"]][20000:40000]
bank_data3.head(5)

Out[46]:

	Cust_num	job	marital	education
20000	20001	housemaid	married	primary
20001	20002	management	married	tertiary
20002	20003	technician	married	secondary
20003	20004	technician	single	tertiary
20004	20005	management	married	tertiary
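
One detail worth checking in a row-range task is which endpoint is included. Positional slices (plain `[a:b]` or `iloc[a:b]`) leave out the end position, while label-based `loc[a:b]` includes the end label. A minimal sketch on a made-up ten-row frame with the default integer index (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical 10-row stand-in for bank_data; all values are invented.
toy = pd.DataFrame({
    "Cust_num": range(1, 11),
    "job": ["admin"] * 10,
    "marital": ["single"] * 10,
    "education": ["primary"] * 10,
    "age": range(30, 40),
})
cols = ["Cust_num", "job", "marital", "education"]

by_position = toy[cols].iloc[2:6]  # positions 2..5, end position excluded
by_label = toy.loc[2:5, cols]      # labels 2..5, end label included

print(by_position.equals(by_label))
print(by_position.index.tolist())
```

By the same rule, `[20000:40000]` on the bank data covers index labels 20000 through 39999.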

In [45]:

#Select 5000 to 6000 observations, drop "poutcome" and "y"
bank_data4=bank_data.drop(['poutcome','y'], axis=1)[5000:6000]
bank_data4.head(5)

Out[45]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous
5000	5001	32	management	single	tertiary	no	728	yes	no	unknown	21	may	125	1	-1
5001	5002	38	services	married	tertiary	no	-121	yes	no	unknown	21	may	288	1	-1
5002	5003	29	management	single	tertiary	no	330	yes	no	unknown	21	may	315	1	-1
5003	5004	31	management	single	tertiary	no	825	yes	no	unknown	21	may	506	2	-1
5004	5005	36	management	single	tertiary	no	247	no	no	unknown	21	may	354	5	-1
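
The drop step can also be written with the `columns=` keyword, which avoids having to remember `axis=1`. The sketch below uses a tiny made-up frame (`toy` and its values are hypothetical) and also shows that `drop` returns a new frame and leaves the original untouched:

```python
import pandas as pd

# Hypothetical stand-in for bank_data; all values are invented.
toy = pd.DataFrame({
    "Cust_num": [1, 2, 3, 4],
    "age": [30, 41, 25, 52],
    "poutcome": ["unknown"] * 4,
    "y": ["no", "no", "yes", "no"],
})

# Drop two columns by name, then keep rows at positions 1 and 2.
trimmed = toy.drop(columns=["poutcome", "y"]).iloc[1:3]

print(list(trimmed.columns))
print("poutcome" in toy.columns)  # the original frame still has the column
```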


The next post is about subsetting data with variable filter conditions in Python.

Link to the next post : https://statinfer.com/104-2-5-subsetting-data-with-variable-filter-condition-in-python/

104.2.4 Practice : Manipulating dataset in Python

Subsetting the dataset in python.