Subset with Variable Filter Conditions

Selection with a condition on variables
- For example, selection of complains where budget is greater than $5000.
And condition & filters

In [50]:

bank_subset1=bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no")]
bank_subset1.head(5)

Out[50]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
7	8	42	entrepreneur	divorced	tertiary	yes	2	yes	no	unknown	5	may	380	1	-1	unknown	no
8	9	58	retired	married	primary	no	121	yes	no	unknown	5	may	50	1	-1	unknown	no

OR condition & filters

In [51]:

bank_subset2=bank_data[(bank_data['age']>40) |  (bank_data['loan']=="no")]
bank_subset2.head(5)

Out[51]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	5	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
5	6	35	management	married	tertiary	no	231	yes	no	unknown	5	may	139	1	-1	unknown	no

AND, OR condition Numeric and Character filters

In [53]:

bank_subset3= bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no") | (bank_data['marital']=="single" )]
bank_subset3.head(5)

Out[53]:

	Cust_num	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	5	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
6	7	28	management	single	tertiary	no	447	yes	yes	unknown	5	may	217	1	-1	unknown	no

Practice : Subset with variable filter conditions

Data : “./Automobile Data Set/AutoDataset.csv”
Create a new dataset for exclusively Toyota cars
Create a new dataset for all cars with city.mpg greater than 30 and engine size is less than 120.
Create a new dataset by taking only sedan cars. Keep only four variables(Make, body style, fuel type, price) in the final dataset.
Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses)

In [4]:

auto_data=pd.read_csv("C:\\Amrita\\Datavedi\\Automobile Data Set\\AutoDataset.csv")
auto_data.shape

Out[4]:

(205, 26)

In [68]:

auto_data.columns.values

Out[68]:

array([' symboling', ' normalized-losses', ' make', ' fuel-type',
       ' aspiration', ' num-of-doors', ' body-style', ' drive-wheels',
       ' engine-location', ' wheel-base', ' length', ' width', ' height',
       ' curb-weight', ' engine-type', ' num-of-cylinders', ' engine-size',
       ' fuel-system', ' bore', ' stroke', ' compression-ratio',
       ' horsepower', ' peak-rpm', ' city-mpg', ' highway-mpg', ' price'], dtype=object)

In [69]:

# Create a new dataset for exclusively Toyota cars
auto_data1=auto_data[(auto_data[' make']=="toyota")]
auto_data1.head(5)

Out[69]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	…	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
150	1	87	toyota	gas	std	two	hatchback	fwd	front	95.7	…	92	2bbl	3.05	3.03	9.0	62	4800	35	39	5348.0
151	1	87	toyota	gas	std	two	hatchback	fwd	front	95.7	…	92	2bbl	3.05	3.03	9.0	62	4800	31	38	6338.0
152	1	74	toyota	gas	std	four	hatchback	fwd	front	95.7	…	92	2bbl	3.05	3.03	9.0	62	4800	31	38	6488.0
153	0	77	toyota	gas	std	four	wagon	fwd	front	95.7	…	92	2bbl	3.05	3.03	9.0	62	4800	31	37	6918.0
154	0	81	toyota	gas	std	four	wagon	4wd	front	95.7	…	92	2bbl	3.05	3.03	9.0	62	4800	27	32	7898.0

5 rows × 26 columns

In [70]:

#Create a new dataset for all cars with city.mpg greater than 30 and engine size is less than 120.
auto_data2=auto_data[(auto_data[' city-mpg']>30) & (auto_data[ ' engine-size']<120)]
auto_data2.head(5)

Out[70]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	…	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
18	2	121	chevrolet	gas	std	two	hatchback	fwd	front	88.4	…	61	2bbl	2.91	3.03	9.50	48	5100	47	53	5151.0
19	1	98	chevrolet	gas	std	two	hatchback	fwd	front	94.5	…	90	2bbl	3.03	3.11	9.60	70	5400	38	43	6295.0
20	0	81	chevrolet	gas	std	four	sedan	fwd	front	94.5	…	90	2bbl	3.03	3.11	9.60	70	5400	38	43	6575.0
21	1	118	dodge	gas	std	two	hatchback	fwd	front	93.7	…	90	2bbl	2.97	3.23	9.41	68	5500	37	41	5572.0
22	1	118	dodge	gas	std	two	hatchback	fwd	front	93.7	…	90	2bbl	2.97	3.23	9.40	68	5500	31	38	6377.0

5 rows × 26 columns

In [5]:

#Create a new dataset by taking only  sedan cars. Keep only four variables(Make, body style, fuel type, price) in the final dataset. 
auto_data3=auto_data[auto_data[' body-style']=='sedan']
auto_data3.head(5)

Out[5]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	…	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	…	109	mpfi	3.19	3.4	10.0	102	5500	24	30	13950.0
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	…	136	mpfi	3.19	3.4	8.0	115	5500	18	22	17450.0
5	2	?	audi	gas	std	two	sedan	fwd	front	99.8	…	136	mpfi	3.19	3.4	8.5	110	5500	19	25	15250.0
6	1	158	audi	gas	std	four	sedan	fwd	front	105.8	…	136	mpfi	3.19	3.4	8.5	110	5500	19	25	17710.0
8	1	158	audi	gas	turbo	four	sedan	fwd	front	105.8	…	131	mpfi	3.13	3.4	8.3	140	5500	17	20	23875.0

5 rows × 26 columns

In [6]:

auto_data4=auto_data3[[' make',' body-style',' fuel-type',' price']]
auto_data4.head(5)

Out[6]:

	make	body-style	fuel-type	price
3	audi	sedan	gas	13950.0
4	audi	sedan	gas	17450.0
5	audi	sedan	gas	15250.0
6	audi	sedan	gas	17710.0
8	audi	sedan	gas	23875.0

In [7]:

#Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses)

auto_data5=auto_data[(auto_data[' make']=='audi') | (auto_data[' make']=='bmw') | (auto_data[' make']=='porsche') ]
auto_data5.head(5)

Out[7]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	…	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	…	109	mpfi	3.19	3.4	10.0	102	5500	24	30	13950.0
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	…	136	mpfi	3.19	3.4	8.0	115	5500	18	22	17450.0
5	2	?	audi	gas	std	two	sedan	fwd	front	99.8	…	136	mpfi	3.19	3.4	8.5	110	5500	19	25	15250.0
6	1	158	audi	gas	std	four	sedan	fwd	front	105.8	…	136	mpfi	3.19	3.4	8.5	110	5500	19	25	17710.0
7	1	?	audi	gas	std	four	wagon	fwd	front	105.8	…	136	mpfi	3.19	3.4	8.5	110	5500	19	25	18920.0

5 rows × 26 columns

In [8]:

auto_data6=auto_data5.drop([' price',' normalized-losses'],axis=1)
auto_data6.head(5)

Out[8]:

symboling	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	length	…	num-of-cylinders	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg
3	2	audi	gas	std	four	sedan	fwd	front	99.8	176.6	…	four	109	mpfi	3.19	3.4	10.0	102	5500	24	30
4	2	audi	gas	std	four	sedan	4wd	front	99.4	176.6	…	five	136	mpfi	3.19	3.4	8.0	115	5500	18	22
5	2	audi	gas	std	two	sedan	fwd	front	99.8	177.3	…	five	136	mpfi	3.19	3.4	8.5	110	5500	19	25
6	1	audi	gas	std	four	sedan	fwd	front	105.8	192.7	…	five	136	mpfi	3.19	3.4	8.5	110	5500	19	25
7	1	audi	gas	std	four	wagon	fwd	front	105.8	192.7	…	five	136	mpfi	3.19	3.4	8.5	110	5500	19	25

5 rows × 24 columns

Calculated Fields

Calculate and Assign it to new variable

In [9]:

auto_data['area']=(auto_data[' length'])*(auto_data[' width'])*(auto_data[' height'])

auto_data['area'].head(5)

Out[9]:

0    528019.904
1    528019.904
2    587592.640
3    634816.956
4    636734.832
Name: area, dtype: float64



The next post is on sorting data in python.
Link to the next post : https://statinfer.com/104-2-6-sorting-the-data-in-python/

104.2.5 Subsetting data with variable filter condition in Python

Conditional dataset subsetting in python

Subset with Variable Filter Conditions

Practice : Subset with variable filter conditions

Calculated Fields

Statinfer

Statinfer

Statinfer

104.2.5 Subsetting data with variable filter condition in Python

Conditional dataset subsetting in python

Subset with Variable Filter Conditions

Practice : Subset with variable filter conditions

Calculated Fields

Related Courses

Python(Batch6)

Statinfer

Tableau (Batch6)

Statinfer

PowerBI (Batch6)

Statinfer