Introduction to Random Forests

27 Sep 2019

Through this workshop you will learn how to quickly model and understand datasets using scikit-learn. Topics will include a basic introduction to using decisions trees and random forests, understanding feature importance, identifying model weaknesses, explaining your model, and more.

If you are not familiar with pandas, check out this post first. This article contains some high level explanations of the code not covered in the live workshop, but skips some examples given in the main presentation. Here is a recording of the main workshop, here are the slides, and here is a colab link to run all the code.

Many of the model interpretation techniques were taken from the fast.ai ML course. Check it out!!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Data Wrangling

Data munging/wrangling is the process of transforming your raw data into a form suitable for analysis or modelling. Typically this involves dealing with null values, finding (and possibly correcting) data issues, transforming non-numerical data into numerical data, and more.

In the “real world”, you will need to do much more than is shown below. There are numerous Kaggle Kernels demonstrating in depth data cleaning. For the purposes of this workshop, we will do (less than) the bare minimum for two reasons:

We are using a relatively clean dataset (with few data oddities).
Our model analysis will tell us where to focus our data cleaning and feature engineering efforts.
This workshop is only an hour long :P.

Here is an example of a great example of some data exploration and correction.

def show_all(df):
    """Shows our dataframe without cutting off any rows or columns."""
    with pd.option_context("display.max_rows", 1000,
                           "display.max_columns", 1000): 
        display(df)

First, we take a look at our data to get a sense of the types of values in the columns. We do this by looking at some of our data and using numerical summaries.

df = pd.read_csv('https://raw.githubusercontent.com/n2cholas/dsc-workshops/master/Random%20Forest%20Workshop/data/train.csv')
show_all(df.head())

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196.0	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NaN	Attchd	2003.0	RFn	2	548	TA	TA	Y	0	61	0	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0.0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976.0	RFn	2	460	TA	TA	Y	298	0	0	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162.0	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001.0	RFn	2	608	TA	TA	Y	0	42	0	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0.0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998.0	Unf	3	642	TA	TA	Y	0	35	272	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350.0	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000.0	RFn	3	836	TA	TA	Y	192	84	0	NaN	NaN	NaN	12	2008	WD	Normal	250000

len(df)

show_all(df.describe())

	Id	MSSubClass	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SalePrice
count	1460.000000	1460.000000	1201.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1452.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1379.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	730.500000	56.897260	70.049958	10516.828082	6.099315	5.575342	1971.267808	1984.865753	103.685262	443.639726	46.549315	567.240411	1057.429452	1162.626712	346.992466	5.844521	1515.463699	0.425342	0.057534	1.565068	0.382877	2.866438	1.046575	6.517808	0.613014	1978.506164	1.767123	472.980137	94.244521	46.660274	21.954110	3.409589	15.060959	2.758904	43.489041	6.321918	2007.815753	180921.195890
std	421.610009	42.300571	24.284752	9981.264932	1.382997	1.112799	30.202904	20.645407	181.066207	456.098091	161.319273	441.866955	438.705324	386.587738	436.528436	48.623081	525.480383	0.518911	0.238753	0.550916	0.502885	0.815778	0.220338	1.625393	0.644666	24.689725	0.747315	213.804841	125.338794	66.256028	61.119149	29.317331	55.757415	40.177307	496.123024	2.703626	1.328095	79442.502883
min	1.000000	20.000000	21.000000	1300.000000	1.000000	1.000000	1872.000000	1950.000000	0.000000	0.000000	0.000000	0.000000	0.000000	334.000000	0.000000	0.000000	334.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.000000	0.000000	1900.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	2006.000000	34900.000000
25%	365.750000	20.000000	59.000000	7553.500000	5.000000	5.000000	1954.000000	1967.000000	0.000000	0.000000	0.000000	223.000000	795.750000	882.000000	0.000000	0.000000	1129.500000	0.000000	0.000000	1.000000	0.000000	2.000000	1.000000	5.000000	0.000000	1961.000000	1.000000	334.500000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	2007.000000	129975.000000
50%	730.500000	50.000000	69.000000	9478.500000	6.000000	5.000000	1973.000000	1994.000000	0.000000	383.500000	0.000000	477.500000	991.500000	1087.000000	0.000000	0.000000	1464.000000	0.000000	0.000000	2.000000	0.000000	3.000000	1.000000	6.000000	1.000000	1980.000000	2.000000	480.000000	0.000000	25.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000	2008.000000	163000.000000
75%	1095.250000	70.000000	80.000000	11601.500000	7.000000	6.000000	2000.000000	2004.000000	166.000000	712.250000	0.000000	808.000000	1298.250000	1391.250000	728.000000	0.000000	1776.750000	1.000000	0.000000	2.000000	1.000000	3.000000	1.000000	7.000000	1.000000	2002.000000	2.000000	576.000000	168.000000	68.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000	2009.000000	214000.000000
max	1460.000000	190.000000	313.000000	215245.000000	10.000000	9.000000	2010.000000	2010.000000	1600.000000	5644.000000	1474.000000	2336.000000	6110.000000	4692.000000	2065.000000	572.000000	5642.000000	3.000000	2.000000	3.000000	2.000000	8.000000	3.000000	14.000000	3.000000	2010.000000	4.000000	1418.000000	857.000000	547.000000	552.000000	508.000000	480.000000	738.000000	15500.000000	12.000000	2010.000000	755000.000000

We want to ensure all our columns (or features) have numerical or categorical values. Let’s look at the data types of each column.

show_all(df.dtypes)

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
BsmtExposure      object
BsmtFinType1      object
BsmtFinSF1         int64
BsmtFinType2      object
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
Heating           object
HeatingQC         object
CentralAir        object
Electrical        object
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64
GarageFinish      object
GarageCars         int64
GarageArea         int64
GarageQual        object
GarageCond        object
PavedDrive        object
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
PoolQC            object
Fence             object
MiscFeature       object
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
dtype: object

Most of the data is of type object. In general, this will be a Python string but could also be other Python objects, such as lists or dicts. Just based on the names, it looks like these are all categorical features. Let’s check this by looking at the unique values in each column.

for c in df.columns[df.dtypes=='object']:
    print(df[c].value_counts())
    print()

RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64

Pave    1454
Grvl       6
Name: Street, dtype: int64

Grvl    50
Pave    41
Name: Alley, dtype: int64

Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64

Lvl    1311
Bnk      63
HLS      50
Low      36
Name: LandContour, dtype: int64

AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64

Inside     1052
Corner      263
CulDSac      94
FR2          47
FR3           4
Name: LotConfig, dtype: int64

Gtl    1382
Mod      65
Sev      13
Name: LandSlope, dtype: int64

NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mitchel     49
NoRidge     41
Timber      38
IDOTRR      37
ClearCr     28
StoneBr     25
SWISU       25
MeadowV     17
Blmngtn     17
BrDale      16
Veenker     11
NPkVill      9
Blueste      2
Name: Neighborhood, dtype: int64

Norm      1260
Feedr       81
Artery      48
RRAn        26
PosN        19
RRAe        11
PosA         8
RRNn         5
RRNe         2
Name: Condition1, dtype: int64

Norm      1445
Feedr        6
RRNn         2
PosN         2
Artery       2
RRAe         1
PosA         1
RRAn         1
Name: Condition2, dtype: int64

1Fam      1220
TwnhsE     114
Duplex      52
Twnhs       43
2fmCon      31
Name: BldgType, dtype: int64

1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64

Gable      1141
Hip         286
Flat         13
Gambrel      11
Mansard       7
Shed          2
Name: RoofStyle, dtype: int64

CompShg    1434
Tar&Grv      11
WdShngl       6
WdShake       5
Membran       1
Roll          1
ClyTile       1
Metal         1
Name: RoofMatl, dtype: int64

VinylSd    515
HdBoard    222
MetalSd    220
Wd Sdng    206
Plywood    108
CemntBd     61
BrkFace     50
WdShing     26
Stucco      25
AsbShng     20
Stone        2
BrkComm      2
AsphShn      1
ImStucc      1
CBlock       1
Name: Exterior1st, dtype: int64

VinylSd    504
MetalSd    214
HdBoard    207
Wd Sdng    197
Plywood    142
CmentBd     60
Wd Shng     38
Stucco      26
BrkFace     25
AsbShng     20
ImStucc     10
Brk Cmn      7
Stone        5
AsphShn      3
CBlock       1
Other        1
Name: Exterior2nd, dtype: int64

None       864
BrkFace    445
Stone      128
BrkCmn      15
Name: MasVnrType, dtype: int64

TA    906
Gd    488
Ex     52
Fa     14
Name: ExterQual, dtype: int64

TA    1282
Gd     146
Fa      28
Ex       3
Po       1
Name: ExterCond, dtype: int64

PConc     647
CBlock    634
BrkTil    146
Slab       24
Stone       6
Wood        3
Name: Foundation, dtype: int64

TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64

TA    1311
Gd      65
Fa      45
Po       2
Name: BsmtCond, dtype: int64

No    953
Av    221
Gd    134
Mn    114
Name: BsmtExposure, dtype: int64

Unf    430
GLQ    418
ALQ    220
BLQ    148
Rec    133
LwQ     74
Name: BsmtFinType1, dtype: int64

Unf    1256
Rec      54
LwQ      46
BLQ      33
ALQ      19
GLQ      14
Name: BsmtFinType2, dtype: int64

GasA     1428
GasW       18
Grav        7
Wall        4
OthW        2
Floor       1
Name: Heating, dtype: int64

Ex    741
TA    428
Gd    241
Fa     49
Po      1
Name: HeatingQC, dtype: int64

Y    1365
N      95
Name: CentralAir, dtype: int64

SBrkr    1334
FuseA      94
FuseF      27
FuseP       3
Mix         1
Name: Electrical, dtype: int64

TA    735
Gd    586
Ex    100
Fa     39
Name: KitchenQual, dtype: int64

Typ     1360
Min2      34
Min1      31
Mod       15
Maj1      14
Maj2       5
Sev        1
Name: Functional, dtype: int64

Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: FireplaceQu, dtype: int64

Attchd     870
Detchd     387
BuiltIn     88
Basment     19
CarPort      9
2Types       6
Name: GarageType, dtype: int64

Unf    605
RFn    422
Fin    352
Name: GarageFinish, dtype: int64

TA    1311
Fa      48
Gd      14
Ex       3
Po       3
Name: GarageQual, dtype: int64

TA    1326
Fa      35
Gd       9
Po       7
Ex       2
Name: GarageCond, dtype: int64

Y    1340
N      90
P      30
Name: PavedDrive, dtype: int64

Gd    3
Fa    2
Ex    2
Name: PoolQC, dtype: int64

MnPrv    157
GdPrv     59
GdWo      54
MnWw      11
Name: Fence, dtype: int64

Shed    49
Gar2     2
Othr     2
TenC     1
Name: MiscFeature, dtype: int64

WD       1267
New       122
COD        43
ConLD       9
ConLI       5
ConLw       5
CWD         4
Oth         3
Con         2
Name: SaleType, dtype: int64

Normal     1198
Partial     125
Abnorml     101
Family       20
Alloca       12
AdjLand       4
Name: SaleCondition, dtype: int64

Great, now we know all of our object columns contain categories. In reality, this is rarely the case and you would have to do additional cleaning steps.

We should also note that some of these categories are ordered, such as ExterQual and ExterCond.

Also, you may have a dataset with thousands of features, so it is infeasible to look through all the categories like we did. In this case you can look at the number of unique values for each feature. Features with many unique values might be numerical values that weren’t properly encoded or free form text, which would have to be dealt with otherwise.

Next thing we want to do is deal with null values. Let’s see which columns have null values by showing the proportion of values in each column that are null.

x = (df.isnull().sum()/len(df)).to_frame(name='perc_null')
x['types'] = df.dtypes
x[x.perc_null>0].sort_values('perc_null', ascending=False)

	perc_null	types
PoolQC	0.995205	object
MiscFeature	0.963014	object
Alley	0.937671	object
Fence	0.807534	object
FireplaceQu	0.472603	object
LotFrontage	0.177397	float64
GarageType	0.055479	object
GarageYrBlt	0.055479	float64
GarageFinish	0.055479	object
GarageQual	0.055479	object
GarageCond	0.055479	object
BsmtExposure	0.026027	object
BsmtFinType2	0.026027	object
BsmtFinType1	0.025342	object
BsmtCond	0.025342	object
BsmtQual	0.025342	object
MasVnrArea	0.005479	float64
MasVnrType	0.005479	object
Electrical	0.000685	object

For the categorical variables, we will simply add a null category, when we turn them into categorical variables, pandas will automatically create a nan category if needed.

Dealing with nans (not a numbers) in numerical columns is more challenging. You typically want to replace the nans with a default value. Is there a reasonable default for this column? Does 0 make sense? What about the min/max/median? There is much discussion about this on the web.

Here, PoolQC and MasVnrArea are null when the house does not have a pool, so it makes sense to fill in 0 for these columns. For LotFrontage and GarageYrBuilt, we use the median. For the latter two, the missing information may have some pattern that helps us predict the price, so we will create an indicator column that tells us whether the column was null or not. This could be useful if, for example, all houses without a LotFrontage had a very small Lot, so they have a lower cost, so the indicator would help our model learn that this is the case.

df.PoolQC.fillna(0, inplace=True)
df.MasVnrArea.fillna(0, inplace=True)

def fill_nulls(col, filler=np.nanmedian):
    df[f'{col}_null'] = 0
    df.loc[df[col].isnull(), f'{col}_null'] = 1
    df[col].fillna(filler(df[col]), inplace=True)

fill_nulls('LotFrontage')
fill_nulls('GarageYrBlt')

Now, the “object” dtype columns are ready to be turned into actual categories. You can read more about the “category” dtype on the pandas documentation page. Essentially, it changes all the values from strings to numbers, so our model can use them.

For a vast majority of machine learning model types, we would need to do one additional step and “one-hot encode” these categorical variables. Since we are using a tree-based model, we will not need to do this, so we will not cover it.

for col in df.columns[df.dtypes=='object']:
    df[col] = df[col].astype('category')

We noted above that some of the categorical features were ordered. We will encode two of them as ordered categorical columns, but leave the rest as an exercise.

from pandas.api.types import CategoricalDtype

order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
cat_type = CategoricalDtype(categories=order, ordered=True)
df['ExterQual'] = df['ExterQual'].astype(cat_type)
df['ExterCond'] = df['ExterCond'].astype(cat_type)

We can take a look at some of the category codes along with the original strings:

print(df['ExterQual'][:10].cat.codes)
print(df['ExterQual'][:10])

0    3
1    2
2    3
3    2
4    3
5    2
6    3
7    2
8    2
9    2
dtype: int8
0    Gd
1    TA
2    Gd
3    TA
4    Gd
5    TA
6    Gd
7    TA
8    TA
9    TA
Name: ExterQual, dtype: category
Categories (5, object): [Po < Fa < TA < Gd < Ex]

So far, we’ve done some basic data transformations so the data can be fed into our model. We did no feature engineering or data exploration. Your dataset could have thousands of dimensions, so it is infeasible to extensively explore the data before modelling. Our approach will be to let the model tell us what’s important and what’s not so we can focus our effots appropriately.

Modeling

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

Our goal is to predict the price of the houses given the rest of the information. We start by splitting our dataframe into our features (our model inputs) and our target (the price we are trying to predict).

df_y, df_x = df['SalePrice'], df.drop('SalePrice', axis=1)
# same as axis='columns'

Scikit-learn expects numerical values for all the columns, but our categorical variables contain text. We will replace the text with their numeric category number:

for col in df_x.columns[df_x.dtypes=='category']:
    df_x[col] = df_x[col].cat.codes

In machine learning, you typically split your data into a training, validation, and test set. You train your model on the training set, then evaluate the performance on the validation set. You use this validation set to tune hyperparameters (explained later). After selecting your model, you use the test set as one final check to make sure your model generalizes well (and you didn’t just “overfit” to your validation set).

These concepts will be explained in more detail below. We split our data into just training and validation for simplicity, as we won’t be doing any rigorous model selection.

X_train, X_valid, y_train, y_valid = train_test_split(
    df_x, df_y, test_size=0.2, random_state=42)

Let’s create some convenience functions to evaluate our model. Metric selection will not be discussed during this workshop. Here, we use the root mean squared error as our metric. Mean squared error is a common metric used for regression problems, and root mean squared error is nice in that it is in the same units as your original output.

def rmse(x,y): 
    return np.sqrt(np.mean(((x-y)**2)))

def print_score(model, X_t=X_train, y_t=y_train, 
                       X_v=X_valid, y_v=y_valid):
    scores = [rmse(model.predict(X_t), y_t), 
              rmse(model.predict(X_v), y_v),
              model.score(X_t, y_t), 
              model.score(X_v, y_v)]
    
    labels = ['Train RMSE', 'Valid RMSE', 'Train R^2', 'Valid R^2']
    for t, s in zip(labels, scores):
        print(f'{t}: {s}')

Time to fit our first model: a DecisionTree! It’s as easy as:

tree = DecisionTreeRegressor(min_samples_leaf=50)
tree.fit(X_train, y_train)

We used the min_samples_leaf argument to limit the size of the tree so we can visualize it below:

from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
g = export_graphviz(tree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,
                feature_names=X_train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

png

Given the features, the decision tree works as follows: we start at the root and look at the feature in question. If the condition at the top is True, we follow the True branch, otherwise we follow the False branch. This process continues until we reach a leaf. The value at the leaf is our prediction.

The tree is trained as follows. We take the data, and we look at all the features and values of these features to find the best split. The best split is the one such that our evaluation metric (mean squared error) is minimized by the split. In other words, if this split is used, the mean squared error of the model that comes from this split is minimized. The samples represents the number of samples of data that are in the subtree (due to the splits). The value is the mean target value of all the samples in that subtree. Then we repeat until each leaf only contains one sample (usually).

Let’s build a full decision tree and evaluate its performance:

model = DecisionTreeRegressor()
model.fit(X_train, y_train)
print_score(model)

Train RMSE: 0.0
Valid RMSE: 43111.0938316298
Train R^2: 1.0
Valid R^2: 0.7576939544477124

We see the Decision Tree has overfit. So, we introduce Random Forests:

model = RandomForestRegressor(n_estimators=30, 
                              n_jobs=-1)
model.fit(X_train, y_train)
print_score(model)

Train RMSE: 11972.136923038881
Valid RMSE: 29315.10749027229
Train R^2: 0.9759693433070632
Valid R^2: 0.8879610196550729

There are a few key hyperparameters to tune (shown below):

model = RandomForestRegressor(n_estimators=50, 
                              n_jobs=-1,
                              min_samples_leaf=1,
                              max_features=0.8) #'sqrt', 'log2'
model.fit(X_train, y_train)
print_score(model)

Train RMSE: 11464.59450285856
Valid RMSE: 29159.246751559134
Train R^2: 0.9779636487670959
Valid R^2: 0.889149216323862

Feature Importances

Now that we have a model with reasonable performance, we can use it to understand our dataset (and use that to improve the model). The first is looking at feature importance.

To calculate feature x’s importance, we first shuffle x column. Now, the feature is uncorrelated with the target. Next, we evaluate the model’s performance on the training set and see how much it has decreased compared to when the feature was unshuffled. The higher this value, the more important the feature is.

# properties with an underscore are available after you fit the model
fi = pd.DataFrame(model.feature_importances_,
                  index = X_train.columns,
                  columns=['importance'])
fi = fi.sort_values('importance', ascending=False)
fi[:30].plot.barh(figsize=(12, 7))

png

fi.loc['YearRemodAdd']

importance    0.00591
Name: YearRemodAdd, dtype: float64

These importances tell you where to focus your attention on your dataset. Your data may have thousands of features, so having a way to narrow your investigation scope is very helpful.

Here are a few things to think about:

Which features are the most important? This is a good opportunity to do some feature engineering. For example, if a date turned out to be an important feature, you may want to split this up into more granular features, such as splitting up the day/month/year, day of week, adding season, etc.
Which features are less important? Can we make them more useful through feature engineering?
Do the important features make sense? For example, suppose the data had an ID number that was assigned randomly, but turned out to be important. This is suspicious and you should look into the data and see why this is the case.
Are there are any features you expected to be important but aren’t? This could indicate some data cleaning work you missed. If not, this is useful to build intuition about your dat.
If there are highly correlated features, which of the correlated features are important? How does removing the less important feature affect the model performance?
And many more…

Let’s try keeping only the important features and seeing how the model performs:

X_train_new = X_train[fi[fi['importance'] > 0.005].index]
X_valid_new = X_valid[fi[fi['importance'] > 0.005].index]

X_train.shape  # see how many features are left

(1168, 82)

model_new = RandomForestRegressor(n_estimators=30, 
                                  n_jobs=-1,
                                  oob_score=True,
                                  min_samples_leaf=3,
                                  max_features=0.5)
model_new.fit(X_train_new, y_train)
print_score(model_new, X_t=X_train_new, X_v=X_valid_new)

Train RMSE: 17820.096007344346
Valid RMSE: 29815.94532060853
Train R^2: 0.9467594702883222
Valid R^2: 0.8841000276456914

We can compare the feature importances from the model trained with all the features vs the model with just some of the features.

fi2 = pd.DataFrame(model_new.feature_importances_,
                   index = X_train_new.columns,
                   columns=['importance']).sort_values('importance', ascending=False)

fig, ax = plt.subplots(1,2, figsize=(25,8))
fi[:16].plot.barh(ax=ax[0], title='Nice')
fi2.plot.barh(ax=ax[1])

png

Identifying Correlated Features

Identifying and resolving correlated features can sometimes improve your model in a few ways:

Decreases the dimensionality of your data
Can reduce overfitting
Prevents the model from learning unintended biases
Makes the model more interpretable (treeinterpeter/partial depencence plots will be more accurate)
And more.

For many model types, you would look at normal (linear) correlation, but a random forest is not affected by just linear correlation. Since the trees simply split the data at some point, just the rank of the feature matters.

Consider you sort the data by a feature. The rank is the position of the example in the sorted dataset. Below, we identify rank correlations.

from scipy.cluster import hierarchy as hc
from scipy import stats

df2 = X_train_new[
  X_train_new.columns[X_train_new.dtypes != 'category']
]
corr = np.round(stats.spearmanr(df2).correlation, 4)
fig, ax = plt.subplots(figsize=(10,8))
g = sns.heatmap(corr, ax=ax)
g.set_yticklabels(df2.columns, rotation=0)
g.set_xticklabels(df2.columns, rotation=90)
None

png

corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df2.columns, orientation='left', 
                           leaf_font_size=16)
plt.show()

png

Tree Interpreter

Tree interpreter is a great way to understand the individual predictions. Recall for a Decision Tree, at each node, the model makes a “decision” based on a condition on a feature to follow the left child or the right child. Each node contains the average target value for the subset of data that satisfies the condition. The tree interpreter measures how much the average changes for each of these decisions for each feature (averaged across all trees). We call this average change the contribution of the feature.

from treeinterpreter import treeinterpreter as ti

r = X_valid.values[None,0]
_, _, contributions = ti.predict(model, r)

ti_df = pd.DataFrame({
    'feature': X_valid.columns,
    'value': X_valid.iloc[0],
    'contributions': contributions[0],
})
show_all(ti_df.sort_values('contributions'))

	feature	value	contributions
OverallQual	OverallQual	6.0	-23204.618284
GrLivArea	GrLivArea	1068.0	-18869.146520
GarageArea	GarageArea	264.0	-6130.961614
GarageCars	GarageCars	1.0	-4230.235767
ExterQual	ExterQual	2.0	-3690.661634
YearBuilt	YearBuilt	1963.0	-1606.690435
FullBath	FullBath	1.0	-1549.810925
MoSold	MoSold	2.0	-1044.108978
GarageYrBlt	GarageYrBlt	1963.0	-839.353054
1stFlrSF	1stFlrSF	1068.0	-761.773371
FireplaceQu	FireplaceQu	-1.0	-435.448301
2ndFlrSF	2ndFlrSF	0.0	-353.114027
KitchenQual	KitchenQual	3.0	-274.866357
YrSold	YrSold	2006.0	-260.455148
HouseStyle	HouseStyle	2.0	-241.007907
HeatingQC	HeatingQC	4.0	-224.041546
MasVnrArea	MasVnrArea	0.0	-184.002489
Id	Id	893.0	-171.481627
Fireplaces	Fireplaces	0.0	-142.288747
BsmtFullBath	BsmtFullBath	0.0	-139.435641
BsmtFinSF2	BsmtFinSF2	0.0	-132.122053
BsmtExposure	BsmtExposure	3.0	-116.156015
BsmtQual	BsmtQual	3.0	-111.225296
Neighborhood	Neighborhood	19.0	-76.460459
ExterCond	ExterCond	2.0	-58.762708
HalfBath	HalfBath	0.0	-38.991079
SaleType	SaleType	8.0	-27.675000
LotFrontage_null	LotFrontage_null	0.0	-21.285714
ScreenPorch	ScreenPorch	0.0	-17.295352
MiscFeature	MiscFeature	-1.0	-17.083333
GarageFinish	GarageFinish	1.0	-0.033333
KitchenAbvGr	KitchenAbvGr	1.0	0.000000
Electrical	Electrical	4.0	0.000000
LowQualFinSF	LowQualFinSF	0.0	0.000000
Utilities	Utilities	0.0	0.000000
LotShape	LotShape	3.0	0.000000
LandSlope	LandSlope	0.0	0.000000
Heating	Heating	1.0	0.000000
GarageYrBlt_null	GarageYrBlt_null	0.0	0.000000
Functional	Functional	6.0	0.000000
BldgType	BldgType	0.0	0.000000
Street	Street	1.0	0.000000
3SsnPorch	3SsnPorch	0.0	0.000000
PoolArea	PoolArea	0.0	0.000000
PoolQC	PoolQC	0.0	0.000000
RoofMatl	RoofMatl	1.0	0.000000
MiscVal	MiscVal	0.0	0.000000
Condition2	Condition2	2.0	0.000000
TotRmsAbvGrd	TotRmsAbvGrd	6.0	0.000000
Alley	Alley	-1.0	0.000000
BsmtHalfBath	BsmtHalfBath	1.0	4.666667
LandContour	LandContour	3.0	19.390152
MasVnrType	MasVnrType	2.0	20.747510
Condition1	Condition1	2.0	27.096890
Foundation	Foundation	1.0	27.947817
EnclosedPorch	EnclosedPorch	0.0	28.570496
PavedDrive	PavedDrive	2.0	40.605275
Exterior1st	Exterior1st	6.0	50.324136
LotConfig	LotConfig	4.0	56.750000
BsmtFinType2	BsmtFinType2	5.0	73.333333
MSSubClass	MSSubClass	20.0	80.537781
OpenPorchSF	OpenPorchSF	0.0	98.498287
Exterior2nd	Exterior2nd	6.0	99.801761
BsmtFinType1	BsmtFinType1	2.0	151.642921
RoofStyle	RoofStyle	3.0	153.249065
BedroomAbvGr	BedroomAbvGr	3.0	161.383636
BsmtCond	BsmtCond	3.0	255.793641
GarageCond	GarageCond	4.0	264.679058
Fence	Fence	2.0	321.183333
MSZoning	MSZoning	3.0	396.914474
BsmtFinSF1	BsmtFinSF1	663.0	449.878523
GarageQual	GarageQual	4.0	462.399980
WoodDeckSF	WoodDeckSF	192.0	487.761372
SaleCondition	SaleCondition	4.0	490.223394
CentralAir	CentralAir	1.0	533.540801
GarageType	GarageType	1.0	569.096237
BsmtUnfSF	BsmtUnfSF	396.0	601.818903
LotArea	LotArea	8414.0	982.527398
LotFrontage	LotFrontage	70.0	1255.136900
OverallCond	OverallCond	8.0	2634.189802
YearRemodAdd	YearRemodAdd	2003.0	2661.275923
TotalBsmtSF	TotalBsmtSF	1059.0	11313.187661

Partial Dependence Plots

We will not cover Partial Dependence Plots in this workshop, but here is some code demonstrating how you can use them.

Partial dependence use the model and dataset as follows. It varies the value of the feature of interest while keeping the rest of the features the same. Then, it uses the model to make a prediction on this augmented data. This way, we can see the effect of that feature alone. This library finds clusters in the dataset predictions for you.

from pdpbox import pdp

def plot_pdp(feat, clusters=None, feat_name=None):
    feat_name = feat_name or feat
    p = pdp.pdp_isolate(model, X_train, X_train.columns, feat)
    return pdp.pdp_plot(p, feat_name, 
                        plot_lines=True,
                        cluster=clusters is not None,
                        n_cluster_centers=clusters)

plot_pdp('GrLivArea')

png

plot_pdp('OverallQual', clusters=5)
plt.figure()
sns.regplot(data=df, x='OverallQual', y='SalePrice')

png

There seems to be an issue in the library related to plotting (or perhaps my code is wrong somehow), which is why I use a try-except block. Due to this issue, the axis are not shown.

feats = ['OverallQual', 'GrLivArea']
p = pdp.pdp_interact(model, X_train, X_train.columns, feats)

try:
  pdp.pdp_interact_plot(p, feats)
except TypeError:
  pass

png

feats = ['OverallCond', 'OverallQual']
p = pdp.pdp_interact(model, X_train, X_train.columns, feats)

try:
  pdp.pdp_interact_plot(p, feats)
except TypeError:
  pass

png

Standard Process for Quick RF Development:

Using scoring metric (provided or decided), create a scoring function (training + validation).
Create a validation set with same properties as test set.
Run a quick random forest on the data with minimal alterations.
Plot feature importance, plot the features, and learn about those features (domain knowledge).
Use a natural breakpoint to remove unimportant features then plot again.
Look at data carefully and encode important features better.
Using heirachical clustering to remove redundant features (scipy)
For interpretation, use partial dependence plots from PDP.
Use Tree Interpreter to explain individual predictions.

Nicholas Vadivelu

Introduction to Random Forests

Data Wrangling

Modeling

Feature Importances

Identifying Correlated Features

Tree Interpreter

Partial Dependence Plots

Standard Process for Quick RF Development:

Related posts

Nicholas Vadivelu

Introduction to Random Forests

Data Wrangling

Modeling

Feature Importances

Identifying Correlated Features

Tree Interpreter

Partial Dependence Plots

Standard Process for Quick RF Development:

Related posts

Optimizing k-Means in NumPy & SciPy 10 May 2021

Rejection & Importance Sampling Explained in Code 09 Mar 2021

Groupby-by From Scratch "Part 2" 16 Feb 2021