Introduction to Random Forests
27 Sep 2019
Through this workshop you will learn how to quickly model and understand datasets using scikit-learn. Topics include a basic introduction to decision trees and random forests, understanding feature importance, identifying model weaknesses, explaining your model, and more.
If you are not familiar with pandas, check out this post first. This article contains some high level explanations of the code not covered in the live workshop, but skips some examples given in the main presentation. Here is a recording of the main workshop, here are the slides, and here is a colab link to run all the code.
Many of the model interpretation techniques were taken from the fast.ai ML course. Check it out!!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Data Wrangling
Data munging/wrangling is the process of transforming your raw data into a form suitable for analysis or modelling. Typically this involves dealing with null values, finding (and possibly correcting) data issues, transforming non-numerical data into numerical data, and more.
In the “real world”, you will need to do much more than is shown below. There are numerous Kaggle Kernels demonstrating in depth data cleaning. For the purposes of this workshop, we will do (less than) the bare minimum for two reasons:
- We are using a relatively clean dataset (with few data oddities).
- Our model analysis will tell us where to focus our data cleaning and feature engineering efforts.
- This workshop is only an hour long :P.
Here is a great example of some data exploration and correction.
def show_all(df):
    """Shows our dataframe without cutting off any rows or columns."""
    with pd.option_context("display.max_rows", 1000,
                           "display.max_columns", 1000):
        display(df)
First, we take a look at our data to get a sense of the types of values in the columns. We do this by looking at some of our data and using numerical summaries.
df = pd.read_csv('https://raw.githubusercontent.com/n2cholas/dsc-workshops/master/Random%20Forest%20Workshop/data/train.csv')
show_all(df.head())
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
len(df)
1460
show_all(df.describe())
Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1379.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | 567.240411 | 1057.429452 | 1162.626712 | 346.992466 | 5.844521 | 1515.463699 | 0.425342 | 0.057534 | 1.565068 | 0.382877 | 2.866438 | 1.046575 | 6.517808 | 0.613014 | 1978.506164 | 1.767123 | 472.980137 | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | 441.866955 | 438.705324 | 386.587738 | 436.528436 | 48.623081 | 525.480383 | 0.518911 | 0.238753 | 0.550916 | 0.502885 | 0.815778 | 0.220338 | 1.625393 | 0.644666 | 24.689725 | 0.747315 | 213.804841 | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 1900.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | 0.000000 | 223.000000 | 795.750000 | 882.000000 | 0.000000 | 0.000000 | 1129.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 5.000000 | 0.000000 | 1961.000000 | 1.000000 | 334.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | 0.000000 | 477.500000 | 991.500000 | 1087.000000 | 0.000000 | 0.000000 | 1464.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 6.000000 | 1.000000 | 1980.000000 | 2.000000 | 480.000000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | 0.000000 | 808.000000 | 1298.250000 | 1391.250000 | 728.000000 | 0.000000 | 1776.750000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 7.000000 | 1.000000 | 2002.000000 | 2.000000 | 576.000000 | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1474.000000 | 2336.000000 | 6110.000000 | 4692.000000 | 2065.000000 | 572.000000 | 5642.000000 | 3.000000 | 2.000000 | 3.000000 | 2.000000 | 8.000000 | 3.000000 | 14.000000 | 3.000000 | 2010.000000 | 4.000000 | 1418.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
We want to ensure all our columns (or features) have numerical or categorical values. Let’s look at the data types of each column.
show_all(df.dtypes)
Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
Street object
Alley object
LotShape object
LandContour object
Utilities object
LotConfig object
LandSlope object
Neighborhood object
Condition1 object
Condition2 object
BldgType object
HouseStyle object
OverallQual int64
OverallCond int64
YearBuilt int64
YearRemodAdd int64
RoofStyle object
RoofMatl object
Exterior1st object
Exterior2nd object
MasVnrType object
MasVnrArea float64
ExterQual object
ExterCond object
Foundation object
BsmtQual object
BsmtCond object
BsmtExposure object
BsmtFinType1 object
BsmtFinSF1 int64
BsmtFinType2 object
BsmtFinSF2 int64
BsmtUnfSF int64
TotalBsmtSF int64
Heating object
HeatingQC object
CentralAir object
Electrical object
1stFlrSF int64
2ndFlrSF int64
LowQualFinSF int64
GrLivArea int64
BsmtFullBath int64
BsmtHalfBath int64
FullBath int64
HalfBath int64
BedroomAbvGr int64
KitchenAbvGr int64
KitchenQual object
TotRmsAbvGrd int64
Functional object
Fireplaces int64
FireplaceQu object
GarageType object
GarageYrBlt float64
GarageFinish object
GarageCars int64
GarageArea int64
GarageQual object
GarageCond object
PavedDrive object
WoodDeckSF int64
OpenPorchSF int64
EnclosedPorch int64
3SsnPorch int64
ScreenPorch int64
PoolArea int64
PoolQC object
Fence object
MiscFeature object
MiscVal int64
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
dtype: object
Most of the data is of type object. In general, an object column holds Python strings, but it can also hold other Python objects, such as lists or dicts. Judging by the names, these all look like categorical features. Let's check this by looking at the unique values in each column.
for c in df.columns[df.dtypes=='object']:
    print(df[c].value_counts())
    print()
RL 1151
RM 218
FV 65
RH 16
C (all) 10
Name: MSZoning, dtype: int64
Pave 1454
Grvl 6
Name: Street, dtype: int64
Grvl 50
Pave 41
Name: Alley, dtype: int64
Reg 925
IR1 484
IR2 41
IR3 10
Name: LotShape, dtype: int64
Lvl 1311
Bnk 63
HLS 50
Low 36
Name: LandContour, dtype: int64
AllPub 1459
NoSeWa 1
Name: Utilities, dtype: int64
Inside 1052
Corner 263
CulDSac 94
FR2 47
FR3 4
Name: LotConfig, dtype: int64
Gtl 1382
Mod 65
Sev 13
Name: LandSlope, dtype: int64
NAmes 225
CollgCr 150
OldTown 113
Edwards 100
Somerst 86
Gilbert 79
NridgHt 77
Sawyer 74
NWAmes 73
SawyerW 59
BrkSide 58
Crawfor 51
Mitchel 49
NoRidge 41
Timber 38
IDOTRR 37
ClearCr 28
StoneBr 25
SWISU 25
MeadowV 17
Blmngtn 17
BrDale 16
Veenker 11
NPkVill 9
Blueste 2
Name: Neighborhood, dtype: int64
Norm 1260
Feedr 81
Artery 48
RRAn 26
PosN 19
RRAe 11
PosA 8
RRNn 5
RRNe 2
Name: Condition1, dtype: int64
Norm 1445
Feedr 6
RRNn 2
PosN 2
Artery 2
RRAe 1
PosA 1
RRAn 1
Name: Condition2, dtype: int64
1Fam 1220
TwnhsE 114
Duplex 52
Twnhs 43
2fmCon 31
Name: BldgType, dtype: int64
1Story 726
2Story 445
1.5Fin 154
SLvl 65
SFoyer 37
1.5Unf 14
2.5Unf 11
2.5Fin 8
Name: HouseStyle, dtype: int64
Gable 1141
Hip 286
Flat 13
Gambrel 11
Mansard 7
Shed 2
Name: RoofStyle, dtype: int64
CompShg 1434
Tar&Grv 11
WdShngl 6
WdShake 5
Membran 1
Roll 1
ClyTile 1
Metal 1
Name: RoofMatl, dtype: int64
VinylSd 515
HdBoard 222
MetalSd 220
Wd Sdng 206
Plywood 108
CemntBd 61
BrkFace 50
WdShing 26
Stucco 25
AsbShng 20
Stone 2
BrkComm 2
AsphShn 1
ImStucc 1
CBlock 1
Name: Exterior1st, dtype: int64
VinylSd 504
MetalSd 214
HdBoard 207
Wd Sdng 197
Plywood 142
CmentBd 60
Wd Shng 38
Stucco 26
BrkFace 25
AsbShng 20
ImStucc 10
Brk Cmn 7
Stone 5
AsphShn 3
CBlock 1
Other 1
Name: Exterior2nd, dtype: int64
None 864
BrkFace 445
Stone 128
BrkCmn 15
Name: MasVnrType, dtype: int64
TA 906
Gd 488
Ex 52
Fa 14
Name: ExterQual, dtype: int64
TA 1282
Gd 146
Fa 28
Ex 3
Po 1
Name: ExterCond, dtype: int64
PConc 647
CBlock 634
BrkTil 146
Slab 24
Stone 6
Wood 3
Name: Foundation, dtype: int64
TA 649
Gd 618
Ex 121
Fa 35
Name: BsmtQual, dtype: int64
TA 1311
Gd 65
Fa 45
Po 2
Name: BsmtCond, dtype: int64
No 953
Av 221
Gd 134
Mn 114
Name: BsmtExposure, dtype: int64
Unf 430
GLQ 418
ALQ 220
BLQ 148
Rec 133
LwQ 74
Name: BsmtFinType1, dtype: int64
Unf 1256
Rec 54
LwQ 46
BLQ 33
ALQ 19
GLQ 14
Name: BsmtFinType2, dtype: int64
GasA 1428
GasW 18
Grav 7
Wall 4
OthW 2
Floor 1
Name: Heating, dtype: int64
Ex 741
TA 428
Gd 241
Fa 49
Po 1
Name: HeatingQC, dtype: int64
Y 1365
N 95
Name: CentralAir, dtype: int64
SBrkr 1334
FuseA 94
FuseF 27
FuseP 3
Mix 1
Name: Electrical, dtype: int64
TA 735
Gd 586
Ex 100
Fa 39
Name: KitchenQual, dtype: int64
Typ 1360
Min2 34
Min1 31
Mod 15
Maj1 14
Maj2 5
Sev 1
Name: Functional, dtype: int64
Gd 380
TA 313
Fa 33
Ex 24
Po 20
Name: FireplaceQu, dtype: int64
Attchd 870
Detchd 387
BuiltIn 88
Basment 19
CarPort 9
2Types 6
Name: GarageType, dtype: int64
Unf 605
RFn 422
Fin 352
Name: GarageFinish, dtype: int64
TA 1311
Fa 48
Gd 14
Ex 3
Po 3
Name: GarageQual, dtype: int64
TA 1326
Fa 35
Gd 9
Po 7
Ex 2
Name: GarageCond, dtype: int64
Y 1340
N 90
P 30
Name: PavedDrive, dtype: int64
Gd 3
Fa 2
Ex 2
Name: PoolQC, dtype: int64
MnPrv 157
GdPrv 59
GdWo 54
MnWw 11
Name: Fence, dtype: int64
Shed 49
Gar2 2
Othr 2
TenC 1
Name: MiscFeature, dtype: int64
WD 1267
New 122
COD 43
ConLD 9
ConLI 5
ConLw 5
CWD 4
Oth 3
Con 2
Name: SaleType, dtype: int64
Normal 1198
Partial 125
Abnorml 101
Family 20
Alloca 12
AdjLand 4
Name: SaleCondition, dtype: int64
Great, now we know all of our object columns contain categories. In reality, this is rarely the case and you would have to do additional cleaning steps.
We should also note that some of these categories are ordered, such as ExterQual and ExterCond.
Also, you may have a dataset with thousands of features, making it infeasible to look through all the categories as we did. In that case, you can look at the number of unique values for each feature. Features with many unique values might be numerical values that weren't properly encoded, or free-form text, which would have to be handled differently.
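As a quick sketch of that check, using only pandas functions we have already seen:
# count the unique values per object column; very high counts may indicate
# free-form text or numbers that weren't parsed properly
df.select_dtypes('object').nunique().sort_values(ascending=False).head(10)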
The next thing we want to do is deal with null values. Let's see which columns have null values by showing the proportion of values in each column that are null.
x = (df.isnull().sum()/len(df)).to_frame(name='perc_null')
x['types'] = df.dtypes
x[x.perc_null>0].sort_values('perc_null', ascending=False)
perc_null | types | |
---|---|---|
PoolQC | 0.995205 | object |
MiscFeature | 0.963014 | object |
Alley | 0.937671 | object |
Fence | 0.807534 | object |
FireplaceQu | 0.472603 | object |
LotFrontage | 0.177397 | float64 |
GarageType | 0.055479 | object |
GarageYrBlt | 0.055479 | float64 |
GarageFinish | 0.055479 | object |
GarageQual | 0.055479 | object |
GarageCond | 0.055479 | object |
BsmtExposure | 0.026027 | object |
BsmtFinType2 | 0.026027 | object |
BsmtFinType1 | 0.025342 | object |
BsmtCond | 0.025342 | object |
BsmtQual | 0.025342 | object |
MasVnrArea | 0.005479 | float64 |
MasVnrType | 0.005479 | object |
Electrical | 0.000685 | object |
For the categorical variables, we don't need to do anything special: when we convert them to the category dtype below, pandas keeps missing values as NaN and encodes them with the category code -1, which effectively acts as a "null" category for our tree-based model.
Dealing with NaNs (not-a-number values) in numerical columns is more challenging. You typically want to replace the NaNs with a default value. Is there a reasonable default for this column? Does 0 make sense? What about the min/max/median? There is much discussion about this on the web.
Here, PoolQC is null when the house does not have a pool, and a missing MasVnrArea most plausibly means there is no masonry veneer, so it makes sense to fill in 0 for these columns. For LotFrontage and GarageYrBlt, we use the median. For these latter two, the fact that the value is missing may itself follow a pattern that helps us predict the price, so we also create an indicator column recording whether the value was null. This could be useful if, for example, all houses without a LotFrontage had a very small lot and therefore a lower price; the indicator would help our model learn that.
df.PoolQC.fillna(0, inplace=True)
df.MasVnrArea.fillna(0, inplace=True)

def fill_nulls(col, filler=np.nanmedian):
    df[f'{col}_null'] = 0
    df.loc[df[col].isnull(), f'{col}_null'] = 1
    df[col].fillna(filler(df[col]), inplace=True)

fill_nulls('LotFrontage')
fill_nulls('GarageYrBlt')
Now, the “object” dtype columns are ready to be turned into actual categories. You can read more about the “category” dtype on the pandas documentation page. Essentially, it changes all the values from strings to numbers, so our model can use them.
For the vast majority of machine learning model types, we would need to do one additional step and "one-hot encode" these categorical variables. Since we are using a tree-based model, we do not need this, so we will not cover it in detail (a small sketch follows the conversion code below).
for col in df.columns[df.dtypes=='object']:
    df[col] = df[col].astype('category')
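For reference, here is a minimal sketch of what one-hot encoding would look like with pandas. We do not apply it to our dataframe; the two columns used here were picked arbitrarily for illustration.
# each category becomes its own 0/1 indicator column
onehot_example = pd.get_dummies(df[['MSZoning', 'Street']])
onehot_example.head()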
We noted above that some of the categorical features were ordered. We will encode two of them as ordered categorical columns, but leave the rest as an exercise.
from pandas.api.types import CategoricalDtype
order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
cat_type = CategoricalDtype(categories=order, ordered=True)
df['ExterQual'] = df['ExterQual'].astype(cat_type)
df['ExterCond'] = df['ExterCond'].astype(cat_type)
We can take a look at some of the category codes along with the original strings:
print(df['ExterQual'][:10].cat.codes)
print(df['ExterQual'][:10])
0 3
1 2
2 3
3 2
4 3
5 2
6 3
7 2
8 2
9 2
dtype: int8
0 Gd
1 TA
2 Gd
3 TA
4 Gd
5 TA
6 Gd
7 TA
8 TA
9 TA
Name: ExterQual, dtype: category
Categories (5, object): [Po < Fa < TA < Gd < Ex]
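As a sketch of the exercise above: several other columns use the same Po/Fa/TA/Gd/Ex scale (the column list below is an assumption based on the dataset's data description). We work on a copy here so the outputs in the rest of this post stay unchanged.
# apply the same ordered dtype to other quality/condition columns
quality_cols = ['BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual',
                'FireplaceQu', 'GarageQual', 'GarageCond']
df_ordered = df.copy()
for col in quality_cols:
    df_ordered[col] = df_ordered[col].astype(cat_type)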
So far, we've done some basic data transformations so the data can be fed into our model. We did no feature engineering or data exploration. Your dataset could have thousands of dimensions, making it infeasible to extensively explore the data before modelling. Our approach will be to let the model tell us what's important and what's not, so we can focus our efforts appropriately.
Modeling
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
Our goal is to predict the price of the houses given the rest of the information. We start by splitting our dataframe into our features (our model inputs) and our target (the price we are trying to predict).
df_y, df_x = df['SalePrice'], df.drop('SalePrice', axis=1)
# same as axis='columns'
Scikit-learn expects numerical values for all the columns, but our categorical variables contain text. We will replace the text with their numeric category number:
for col in df_x.columns[df_x.dtypes=='category']:
    df_x[col] = df_x[col].cat.codes
In machine learning, you typically split your data into a training, validation, and test set. You train your model on the training set, then evaluate the performance on the validation set. You use this validation set to tune hyperparameters (explained later). After selecting your model, you use the test set as one final check to make sure your model generalizes well (and you didn’t just “overfit” to your validation set).
These concepts will be explained in more detail below. We split our data into just training and validation for simplicity, as we won’t be doing any rigorous model selection.
X_train, X_valid, y_train, y_valid = train_test_split(
df_x, df_y, test_size=0.2, random_state=42)
Let's create some convenience functions to evaluate our model. Metric selection will not be discussed during this workshop. Here, we use the root mean squared error (RMSE) as our metric. Mean squared error is a common metric for regression problems, and taking the square root puts the error in the same units as the target.
def rmse(x, y):
    return np.sqrt(np.mean((x - y)**2))

def print_score(model, X_t=X_train, y_t=y_train,
                X_v=X_valid, y_v=y_valid):
    scores = [rmse(model.predict(X_t), y_t),
              rmse(model.predict(X_v), y_v),
              model.score(X_t, y_t),
              model.score(X_v, y_v)]
    labels = ['Train RMSE', 'Valid RMSE', 'Train R^2', 'Valid R^2']
    for t, s in zip(labels, scores):
        print(f'{t}: {s}')
Time to fit our first model: a DecisionTree! It’s as easy as:
tree = DecisionTreeRegressor(min_samples_leaf=50)
tree.fit(X_train, y_train)
We used the min_samples_leaf argument to limit the size of the tree so we can visualize it below:
from io import StringIO  # sklearn.externals.six is removed in newer scikit-learn versions
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(tree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=X_train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Given an example's features, the decision tree works as follows: we start at the root and check its condition. If the condition is true for our example, we follow the True branch, otherwise we follow the False branch. This process continues until we reach a leaf, and the value at that leaf is our prediction.
The tree is trained as follows. We look at every feature and every candidate split value to find the best split: the one that minimizes our evaluation metric (mean squared error) over the two resulting groups. In the diagram, samples is the number of training examples that reach that node, and value is the mean target value of those examples. We then repeat this splitting recursively; by default, the tree keeps splitting until each leaf contains a single sample.
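If you prefer a text view of the decisions, newer versions of scikit-learn (0.21+) can print the tree's rules directly. A minimal sketch:
from sklearn.tree import export_text

# print the learned split conditions and leaf values as indented text
print(export_text(tree, feature_names=list(X_train.columns)))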
Let’s build a full decision tree and evaluate its performance:
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
print_score(model)
Train RMSE: 0.0
Valid RMSE: 43111.0938316298
Train R^2: 1.0
Valid R^2: 0.7576939544477124
We see the decision tree has overfit: it scores perfectly on the training set but does much worse on the validation set. So, we introduce random forests:
model = RandomForestRegressor(n_estimators=30,
n_jobs=-1)
model.fit(X_train, y_train)
print_score(model)
Train RMSE: 11972.136923038881
Valid RMSE: 29315.10749027229
Train R^2: 0.9759693433070632
Valid R^2: 0.8879610196550729
There are a few key hyperparameters to tune (shown below):
model = RandomForestRegressor(n_estimators=50,
n_jobs=-1,
min_samples_leaf=1,
max_features=0.8) #'sqrt', 'log2'
model.fit(X_train, y_train)
print_score(model)
Train RMSE: 11464.59450285856
Valid RMSE: 29159.246751559134
Train R^2: 0.9779636487670959
Valid R^2: 0.889149216323862
Feature Importances
Now that we have a model with reasonable performance, we can use it to understand our dataset (and use that understanding to improve the model). The first technique is feature importance.
One way to measure feature x's importance is to shuffle column x, which breaks its relationship with the target, then re-evaluate the model and see how much its performance drops compared to the unshuffled data. The bigger the drop, the more important the feature; this is known as permutation importance. The feature_importances_ attribute we plot below is scikit-learn's built-in measure, which instead sums how much each feature's splits reduce the error across the forest. Both approaches give a useful ranking of the features.
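Here is a minimal sketch of the shuffling (permutation) approach, evaluated on the validation set with our rmse helper; the feature_importances_ attribute used in the next cell is scikit-learn's built-in alternative.
def permutation_importance_sketch(model, X, y, n_repeats=3):
    """Increase in RMSE when each column is shuffled (larger = more important)."""
    base = rmse(model.predict(X), y)
    results = {}
    for col in X.columns:
        increases = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
            increases.append(rmse(model.predict(X_shuffled), y) - base)
        results[col] = np.mean(increases)
    return pd.Series(results).sort_values(ascending=False)

# permutation_importance_sketch(model, X_valid, y_valid).head(10)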
# attributes ending in an underscore are available after you fit the model
fi = pd.DataFrame(model.feature_importances_,
                  index=X_train.columns,
                  columns=['importance'])
fi = fi.sort_values('importance', ascending=False)
fi[:30].plot.barh(figsize=(12, 7))
fi.loc['YearRemodAdd']
importance 0.00591
Name: YearRemodAdd, dtype: float64
These importances tell you where to focus your attention on your dataset. Your data may have thousands of features, so having a way to narrow your investigation scope is very helpful.
Here are a few things to think about:
- Which features are the most important? This is a good opportunity to do some feature engineering. For example, if a date turned out to be an important feature, you may want to split it up into more granular features, such as day/month/year, day of week, season, etc. (see the sketch after this list).
- Which features are less important? Can we make them more useful through feature engineering?
- Do the important features make sense? For example, suppose the data had an ID number that was assigned randomly, but turned out to be important. This is suspicious and you should look into the data and see why this is the case.
- Are there any features you expected to be important but aren't? This could indicate some data cleaning work you missed. If not, this is useful for building intuition about your data.
- If there are highly correlated features, which of the correlated features are important? How does removing the less important feature affect the model performance?
- And many more…
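As a small sketch of the date idea in the first bullet, we could derive an "age at sale" and a coarse sale-season feature (the new column names here are our own, not part of the original dataset):
X_train_fe = X_train.copy()
X_train_fe['AgeAtSale'] = X_train_fe['YrSold'] - X_train_fe['YearBuilt']
X_train_fe['SaleSeason'] = (X_train_fe['MoSold'] % 12) // 3  # 0 = winter ... 3 = fall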
Let’s try keeping only the important features and seeing how the model performs:
X_train_new = X_train[fi[fi['importance'] > 0.005].index]
X_valid_new = X_valid[fi[fi['importance'] > 0.005].index]
X_train.shape # shape of the full training set; compare with X_train_new.shape to see how many features remain
(1168, 82)
model_new = RandomForestRegressor(n_estimators=30,
n_jobs=-1,
oob_score=True,
min_samples_leaf=3,
max_features=0.5)
model_new.fit(X_train_new, y_train)
print_score(model_new, X_t=X_train_new, X_v=X_valid_new)
Train RMSE: 17820.096007344346
Valid RMSE: 29815.94532060853
Train R^2: 0.9467594702883222
Valid R^2: 0.8841000276456914
We can compare the feature importances from the model trained with all the features vs the model with just some of the features.
fi2 = pd.DataFrame(model_new.feature_importances_,
index = X_train_new.columns,
columns=['importance']).sort_values('importance', ascending=False)
fig, ax = plt.subplots(1,2, figsize=(25,8))
fi[:16].plot.barh(ax=ax[0], title='All features')
fi2.plot.barh(ax=ax[1], title='Important features only')
Identifying Correlated Features
Identifying and resolving correlated features can sometimes improve your model in a few ways:
- Decreases the dimensionality of your data
- Can reduce overfitting
- Prevents the model from learning unintended biases
- Makes the model more interpretable (treeinterpreter/partial dependence plots will be more accurate)
- And more.
For many model types, you would look at ordinary (linear) correlation, but a random forest is not affected by linear relationships per se: since the trees simply split the data at thresholds, only the rank of a feature's values matters.
If you sort the data by a feature, an example's rank is its position in that sorted order. Below, we identify rank (Spearman) correlations between our features.
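As a quick aside, here is a tiny sketch of why rank correlation is the right notion for trees: a monotonic but non-linear transform lowers the linear (Pearson) correlation but leaves the rank (Spearman) correlation at exactly 1.
from scipy import stats
import numpy as np

x = np.arange(1, 100)
print(stats.pearsonr(x, x**3)[0])             # less than 1.0
print(stats.spearmanr(x, x**3).correlation)   # exactly 1.0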
from scipy.cluster import hierarchy as hc
from scipy import stats
df2 = X_train_new[
X_train_new.columns[X_train_new.dtypes != 'category']
]
corr = np.round(stats.spearmanr(df2).correlation, 4)
fig, ax = plt.subplots(figsize=(10,8))
g = sns.heatmap(corr, ax=ax)
g.set_yticklabels(df2.columns, rotation=0)
g.set_xticklabels(df2.columns, rotation=90)
None
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df2.columns, orientation='left',
leaf_font_size=16)
plt.show()
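One hedged way to act on the dendrogram: drop one member of a tightly linked pair, retrain, and compare the validation score. The pair named here (GarageCars/GarageArea) is an assumption for illustration; pick candidates from your own plot.
# drop a candidate redundant feature and compare validation RMSE
candidate = 'GarageCars'
m = RandomForestRegressor(n_estimators=30, n_jobs=-1)
m.fit(X_train_new.drop(columns=candidate), y_train)
print(rmse(m.predict(X_valid_new.drop(columns=candidate)), y_valid))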
Tree Interpreter
Tree interpreter is a great way to understand individual predictions. Recall that in a decision tree, at each node the model makes a "decision" based on a condition on one feature and follows either the left or right child. Each node is associated with the average target value of the training samples that reach it, so every decision along the path changes the running average. Tree interpreter measures, for each feature, how much these decisions change that average (averaged across all trees in the forest). We call this change the contribution of the feature.
from treeinterpreter import treeinterpreter as ti
r = X_valid.values[None,0]
_, _, contributions = ti.predict(model, r)
ti_df = pd.DataFrame({
'feature': X_valid.columns,
'value': X_valid.iloc[0],
'contributions': contributions[0],
})
show_all(ti_df.sort_values('contributions'))
feature | value | contributions | |
---|---|---|---|
OverallQual | OverallQual | 6.0 | -23204.618284 |
GrLivArea | GrLivArea | 1068.0 | -18869.146520 |
GarageArea | GarageArea | 264.0 | -6130.961614 |
GarageCars | GarageCars | 1.0 | -4230.235767 |
ExterQual | ExterQual | 2.0 | -3690.661634 |
YearBuilt | YearBuilt | 1963.0 | -1606.690435 |
FullBath | FullBath | 1.0 | -1549.810925 |
MoSold | MoSold | 2.0 | -1044.108978 |
GarageYrBlt | GarageYrBlt | 1963.0 | -839.353054 |
1stFlrSF | 1stFlrSF | 1068.0 | -761.773371 |
FireplaceQu | FireplaceQu | -1.0 | -435.448301 |
2ndFlrSF | 2ndFlrSF | 0.0 | -353.114027 |
KitchenQual | KitchenQual | 3.0 | -274.866357 |
YrSold | YrSold | 2006.0 | -260.455148 |
HouseStyle | HouseStyle | 2.0 | -241.007907 |
HeatingQC | HeatingQC | 4.0 | -224.041546 |
MasVnrArea | MasVnrArea | 0.0 | -184.002489 |
Id | Id | 893.0 | -171.481627 |
Fireplaces | Fireplaces | 0.0 | -142.288747 |
BsmtFullBath | BsmtFullBath | 0.0 | -139.435641 |
BsmtFinSF2 | BsmtFinSF2 | 0.0 | -132.122053 |
BsmtExposure | BsmtExposure | 3.0 | -116.156015 |
BsmtQual | BsmtQual | 3.0 | -111.225296 |
Neighborhood | Neighborhood | 19.0 | -76.460459 |
ExterCond | ExterCond | 2.0 | -58.762708 |
HalfBath | HalfBath | 0.0 | -38.991079 |
SaleType | SaleType | 8.0 | -27.675000 |
LotFrontage_null | LotFrontage_null | 0.0 | -21.285714 |
ScreenPorch | ScreenPorch | 0.0 | -17.295352 |
MiscFeature | MiscFeature | -1.0 | -17.083333 |
GarageFinish | GarageFinish | 1.0 | -0.033333 |
KitchenAbvGr | KitchenAbvGr | 1.0 | 0.000000 |
Electrical | Electrical | 4.0 | 0.000000 |
LowQualFinSF | LowQualFinSF | 0.0 | 0.000000 |
Utilities | Utilities | 0.0 | 0.000000 |
LotShape | LotShape | 3.0 | 0.000000 |
LandSlope | LandSlope | 0.0 | 0.000000 |
Heating | Heating | 1.0 | 0.000000 |
GarageYrBlt_null | GarageYrBlt_null | 0.0 | 0.000000 |
Functional | Functional | 6.0 | 0.000000 |
BldgType | BldgType | 0.0 | 0.000000 |
Street | Street | 1.0 | 0.000000 |
3SsnPorch | 3SsnPorch | 0.0 | 0.000000 |
PoolArea | PoolArea | 0.0 | 0.000000 |
PoolQC | PoolQC | 0.0 | 0.000000 |
RoofMatl | RoofMatl | 1.0 | 0.000000 |
MiscVal | MiscVal | 0.0 | 0.000000 |
Condition2 | Condition2 | 2.0 | 0.000000 |
TotRmsAbvGrd | TotRmsAbvGrd | 6.0 | 0.000000 |
Alley | Alley | -1.0 | 0.000000 |
BsmtHalfBath | BsmtHalfBath | 1.0 | 4.666667 |
LandContour | LandContour | 3.0 | 19.390152 |
MasVnrType | MasVnrType | 2.0 | 20.747510 |
Condition1 | Condition1 | 2.0 | 27.096890 |
Foundation | Foundation | 1.0 | 27.947817 |
EnclosedPorch | EnclosedPorch | 0.0 | 28.570496 |
PavedDrive | PavedDrive | 2.0 | 40.605275 |
Exterior1st | Exterior1st | 6.0 | 50.324136 |
LotConfig | LotConfig | 4.0 | 56.750000 |
BsmtFinType2 | BsmtFinType2 | 5.0 | 73.333333 |
MSSubClass | MSSubClass | 20.0 | 80.537781 |
OpenPorchSF | OpenPorchSF | 0.0 | 98.498287 |
Exterior2nd | Exterior2nd | 6.0 | 99.801761 |
BsmtFinType1 | BsmtFinType1 | 2.0 | 151.642921 |
RoofStyle | RoofStyle | 3.0 | 153.249065 |
BedroomAbvGr | BedroomAbvGr | 3.0 | 161.383636 |
BsmtCond | BsmtCond | 3.0 | 255.793641 |
GarageCond | GarageCond | 4.0 | 264.679058 |
Fence | Fence | 2.0 | 321.183333 |
MSZoning | MSZoning | 3.0 | 396.914474 |
BsmtFinSF1 | BsmtFinSF1 | 663.0 | 449.878523 |
GarageQual | GarageQual | 4.0 | 462.399980 |
WoodDeckSF | WoodDeckSF | 192.0 | 487.761372 |
SaleCondition | SaleCondition | 4.0 | 490.223394 |
CentralAir | CentralAir | 1.0 | 533.540801 |
GarageType | GarageType | 1.0 | 569.096237 |
BsmtUnfSF | BsmtUnfSF | 396.0 | 601.818903 |
LotArea | LotArea | 8414.0 | 982.527398 |
LotFrontage | LotFrontage | 70.0 | 1255.136900 |
OverallCond | OverallCond | 8.0 | 2634.189802 |
YearRemodAdd | YearRemodAdd | 2003.0 | 2661.275923 |
TotalBsmtSF | TotalBsmtSF | 1059.0 | 11313.187661 |
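A useful sanity check: treeinterpreter's three outputs always satisfy prediction = bias + sum of contributions, where the bias is roughly the mean target value of the training data. We can verify this for the row we just explained:
pred, bias, contributions = ti.predict(model, r)
print(float(np.ravel(pred)[0]))                            # the model's prediction
print(float(np.ravel(bias)[0] + contributions[0].sum()))   # bias + contributions, should match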
Partial Dependence Plots
We will not cover Partial Dependence Plots in this workshop, but here is some code demonstrating how you can use them.
Partial dependence plots use the model and dataset as follows: they vary the value of the feature of interest while holding the other features fixed, then use the model to predict on this modified data. This isolates the effect of that feature alone. The library also finds clusters among the individual prediction curves for you.
from pdpbox import pdp

def plot_pdp(feat, clusters=None, feat_name=None):
    feat_name = feat_name or feat
    p = pdp.pdp_isolate(model, X_train, X_train.columns, feat)
    return pdp.pdp_plot(p, feat_name,
                        plot_lines=True,
                        cluster=clusters is not None,
                        n_cluster_centers=clusters)
plot_pdp('GrLivArea')
plot_pdp('OverallQual', clusters=5)
plt.figure()
sns.regplot(data=df, x='OverallQual', y='SalePrice')
There seems to be an issue in the library related to plotting (or perhaps my code is wrong somehow), which is why I use a try-except block. Due to this issue, the axes are not shown.
feats = ['OverallQual', 'GrLivArea']
p = pdp.pdp_interact(model, X_train, X_train.columns, feats)
try:
    pdp.pdp_interact_plot(p, feats)
except TypeError:
    pass
feats = ['OverallCond', 'OverallQual']
p = pdp.pdp_interact(model, X_train, X_train.columns, feats)
try:
    pdp.pdp_interact_plot(p, feats)
except TypeError:
    pass
Standard Process for Quick RF Development:
- Using the scoring metric (provided or decided on), create a scoring function that reports training and validation scores.
- Create a validation set with the same properties as the test set.
- Run a quick random forest on the data with minimal alterations.
- Plot the feature importances, plot the important features, and learn about those features (domain knowledge).
- Use a natural breakpoint in the importances to remove unimportant features, then plot again.
- Look at the data carefully and encode the important features better.
- Use hierarchical clustering to remove redundant features (scipy).
- For interpretation, use partial dependence plots (pdpbox).
- Use tree interpreter to explain individual predictions.