• No products in the cart.

104.2.5 Subsetting data with variable filter condition in Python

Conditional dataset subsetting in python

Link to the previous post : https://statinfer.com/104-2-4-practice-manipulating-dataset-in-python/

 

In previous posts we saw how to create subsets in python using pandas library and practiced the same.

In this post we will try to create subsets with variable filter conditions. We will also practice the same on a different dataset.

Subset with Variable Filter Conditions

  • Selection with a condition on variables
    • For example, selection of complains where budget is greater than $5000.
  • And condition & filters
In [50]:
bank_subset1=bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no")]
bank_subset1.head(5)
Out[50]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
7 8 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 9 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
  • OR condition & filters
In [51]:
bank_subset2=bank_data[(bank_data['age']>40) |  (bank_data['loan']=="no")]
bank_subset2.head(5)
Out[51]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 5 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
5 6 35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
  • AND, OR condition Numeric and Character filters
In [53]:
bank_subset3= bank_data[(bank_data['age']>40) &  (bank_data['loan']=="no") | (bank_data['marital']=="single" )]
bank_subset3.head(5)
Out[53]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 5 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
6 7 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no

Practice : Subset with variable filter conditions

  • Data : “./Automobile Data Set/AutoDataset.csv”
  • Create a new dataset for exclusively Toyota cars
  • Create a new dataset for all cars with city.mpg greater than 30 and engine size is less than 120.
  • Create a new dataset by taking only sedan cars. Keep only four variables(Make, body style, fuel type, price) in the final dataset.
  • Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses)
In [4]:
auto_data=pd.read_csv("C:\\Amrita\\Datavedi\\Automobile Data Set\\AutoDataset.csv")
auto_data.shape
Out[4]:
(205, 26)
In [68]:
auto_data.columns.values
Out[68]:
array([' symboling', ' normalized-losses', ' make', ' fuel-type',
       ' aspiration', ' num-of-doors', ' body-style', ' drive-wheels',
       ' engine-location', ' wheel-base', ' length', ' width', ' height',
       ' curb-weight', ' engine-type', ' num-of-cylinders', ' engine-size',
       ' fuel-system', ' bore', ' stroke', ' compression-ratio',
       ' horsepower', ' peak-rpm', ' city-mpg', ' highway-mpg', ' price'], dtype=object)
In [69]:
# Create a new dataset for exclusively Toyota cars
auto_data1=auto_data[(auto_data[' make']=="toyota")]
auto_data1.head(5)
Out[69]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
150 1 87 toyota gas std two hatchback fwd front 95.7 92 2bbl 3.05 3.03 9.0 62 4800 35 39 5348.0
151 1 87 toyota gas std two hatchback fwd front 95.7 92 2bbl 3.05 3.03 9.0 62 4800 31 38 6338.0
152 1 74 toyota gas std four hatchback fwd front 95.7 92 2bbl 3.05 3.03 9.0 62 4800 31 38 6488.0
153 0 77 toyota gas std four wagon fwd front 95.7 92 2bbl 3.05 3.03 9.0 62 4800 31 37 6918.0
154 0 81 toyota gas std four wagon 4wd front 95.7 92 2bbl 3.05 3.03 9.0 62 4800 27 32 7898.0

5 rows × 26 columns

In [70]:
#Create a new dataset for all cars with city.mpg greater than 30 and engine size is less than 120.
auto_data2=auto_data[(auto_data[' city-mpg']>30) & (auto_data[ ' engine-size']<120)]
auto_data2.head(5)
Out[70]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
18 2 121 chevrolet gas std two hatchback fwd front 88.4 61 2bbl 2.91 3.03 9.50 48 5100 47 53 5151.0
19 1 98 chevrolet gas std two hatchback fwd front 94.5 90 2bbl 3.03 3.11 9.60 70 5400 38 43 6295.0
20 0 81 chevrolet gas std four sedan fwd front 94.5 90 2bbl 3.03 3.11 9.60 70 5400 38 43 6575.0
21 1 118 dodge gas std two hatchback fwd front 93.7 90 2bbl 2.97 3.23 9.41 68 5500 37 41 5572.0
22 1 118 dodge gas std two hatchback fwd front 93.7 90 2bbl 2.97 3.23 9.40 68 5500 31 38 6377.0

5 rows × 26 columns

In [5]:
#Create a new dataset by taking only  sedan cars. Keep only four variables(Make, body style, fuel type, price) in the final dataset. 
auto_data3=auto_data[auto_data[' body-style']=='sedan']
auto_data3.head(5)
Out[5]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
3 2 164 audi gas std four sedan fwd front 99.8 109 mpfi 3.19 3.4 10.0 102 5500 24 30 13950.0
4 2 164 audi gas std four sedan 4wd front 99.4 136 mpfi 3.19 3.4 8.0 115 5500 18 22 17450.0
5 2 ? audi gas std two sedan fwd front 99.8 136 mpfi 3.19 3.4 8.5 110 5500 19 25 15250.0
6 1 158 audi gas std four sedan fwd front 105.8 136 mpfi 3.19 3.4 8.5 110 5500 19 25 17710.0
8 1 158 audi gas turbo four sedan fwd front 105.8 131 mpfi 3.13 3.4 8.3 140 5500 17 20 23875.0

5 rows × 26 columns

In [6]:
auto_data4=auto_data3[[' make',' body-style',' fuel-type',' price']]
auto_data4.head(5)
Out[6]:
make body-style fuel-type price
3 audi sedan gas 13950.0
4 audi sedan gas 17450.0
5 audi sedan gas 15250.0
6 audi sedan gas 17710.0
8 audi sedan gas 23875.0
In [7]:
#Create a new dataset by taking Audi, BMW or Porsche company makes. Drop two variables from the resultant dataset(price and normalized losses)

auto_data5=auto_data[(auto_data[' make']=='audi') | (auto_data[' make']=='bmw') | (auto_data[' make']=='porsche') ]
auto_data5.head(5)
Out[7]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
3 2 164 audi gas std four sedan fwd front 99.8 109 mpfi 3.19 3.4 10.0 102 5500 24 30 13950.0
4 2 164 audi gas std four sedan 4wd front 99.4 136 mpfi 3.19 3.4 8.0 115 5500 18 22 17450.0
5 2 ? audi gas std two sedan fwd front 99.8 136 mpfi 3.19 3.4 8.5 110 5500 19 25 15250.0
6 1 158 audi gas std four sedan fwd front 105.8 136 mpfi 3.19 3.4 8.5 110 5500 19 25 17710.0
7 1 ? audi gas std four wagon fwd front 105.8 136 mpfi 3.19 3.4 8.5 110 5500 19 25 18920.0

5 rows × 26 columns

In [8]:
auto_data6=auto_data5.drop([' price',' normalized-losses'],axis=1)
auto_data6.head(5)
Out[8]:
symboling make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg
3 2 audi gas std four sedan fwd front 99.8 176.6 four 109 mpfi 3.19 3.4 10.0 102 5500 24 30
4 2 audi gas std four sedan 4wd front 99.4 176.6 five 136 mpfi 3.19 3.4 8.0 115 5500 18 22
5 2 audi gas std two sedan fwd front 99.8 177.3 five 136 mpfi 3.19 3.4 8.5 110 5500 19 25
6 1 audi gas std four sedan fwd front 105.8 192.7 five 136 mpfi 3.19 3.4 8.5 110 5500 19 25
7 1 audi gas std four wagon fwd front 105.8 192.7 five 136 mpfi 3.19 3.4 8.5 110 5500 19 25

5 rows × 24 columns

Calculated Fields

  • Calculate and Assign it to new variable
In [9]:
auto_data['area']=(auto_data[' length'])*(auto_data[' width'])*(auto_data[' height'])

auto_data['area'].head(5)
Out[9]:
0    528019.904
1    528019.904
2    587592.640
3    634816.956
4    636734.832
Name: area, dtype: float64



The next post is on sorting data in python.
Link to the next post : https://statinfer.com/104-2-6-sorting-the-data-in-python/

Statinfer

Statinfer derived from Statistical inference. We provide training in various Data Analytics and Data Science courses and assist candidates in securing placements.

Contact Us

info@statinfer.com

+91- 9676098897

+91- 9494762485

 

Our Social Links

top
© 2020. All Rights Reserved.