Start with the estimate \(\hat y = \bar y\) for all observations.
Split into two subsets based on one variable and one threshold. All observations below the threshold go into one group. All above go into another.
Prediction for each group is the group mean of the target variable. Calculate MSE over both groups.
The variable and threshold for the split are chosen to minimize the MSE.
Split each subset into further subsets and continue.
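As a rough illustration of how the splitting variable and threshold are chosen, here is a minimal sketch on made-up data; split_mse and find_best_split are hypothetical helper names, not library functions.

```python
import numpy as np

def split_mse(x, y, threshold):
    """MSE when the data are split at threshold and each group is predicted by its mean."""
    left, right = y[x <= threshold], y[x > threshold]
    sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    return sse / len(y)

def find_best_split(X, y):
    """Try every variable and every candidate threshold; keep the split with the lowest MSE."""
    best = (None, None, np.inf)                  # (column, threshold, mse)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:        # candidate thresholds (exclude the max)
            mse = split_mse(X[:, j], y, t)
            if mse < best[2]:
                best = (j, t, mse)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                        # made-up features
y = np.where(X[:, 0] > 0.5, 2.0, -1.0) + rng.normal(scale=0.1, size=200)
print(find_best_split(X, y))   # should find a split on column 0 near 0.5
```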
Example of Decision Tree Splitting
Another Example
Ask Julius to read mldata.xlsx.
Ask Julius to fit a decision tree regressor with max_depth=2 to predict “continuous” from x1 through x100.
Ask Julius to plot the tree.
Ask Julius to set x to be an array of 100 standard normals, to show x[:10], and to say what the tree predicts for x.
Ask Julius to use max_depth=3 and plot the tree.
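A sketch of what those prompts amount to in scikit-learn, assuming mldata.xlsx is in the working directory with a "continuous" column and features x1 through x100 (random_state is added here only for reproducibility):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

df = pd.read_excel("mldata.xlsx")              # assumes openpyxl is installed
features = [f"x{i}" for i in range(1, 101)]

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(df[features], df["continuous"])

plot_tree(tree, feature_names=features)        # draw the fitted tree
plt.show()

# One new observation: 100 standard normals
x = np.random.default_rng(0).standard_normal(100)
print(x[:10])
print(tree.predict(pd.DataFrame([x], columns=features)))

# For the deeper tree, refit with max_depth=3 and call plot_tree again
```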
Random forest
Forests
A forest is multiple trees.
For any observation - old or new - each tree makes a prediction.
Average the predictions to get the final prediction.
Generating random forests
A random forest is created by generating random datasets and fitting a tree to each.
A random dataset is generated by randomly drawing rows from the original dataset.
Default in scikit-learn is to draw, with replacement, as many rows as in the original dataset, so some rows appear more than once and others not at all.
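A minimal sketch of both ideas on made-up data: each tree gets its own bootstrap sample of rows, and the forest prediction is the average of the trees' predictions.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "y"])  # made-up data

# Each tree is fit to a bootstrap sample: as many rows as the original, drawn with
# replacement, so some rows appear more than once and others not at all
trees = []
for seed in range(3):
    boot = df.sample(n=len(df), replace=True, random_state=seed)
    trees.append(DecisionTreeRegressor(max_depth=2).fit(boot[["x1", "x2"]], boot["y"]))

# The forest prediction is the average of the individual trees' predictions
new_obs = pd.DataFrame({"x1": [0.5], "x2": [-1.0]})
print(np.mean([t.predict(new_obs) for t in trees], axis=0))
```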
Example
Ask Julius to fit random forest regression to predict “continuous” from x1 through x100 with n_estimators=2 and max_depth=2.
Ask Julius to plot both trees.
Ask Julius what the random forest predicts for x.
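The scikit-learn version of those prompts might look like the sketch below, under the same assumptions about mldata.xlsx; the individual trees are stored in the fitted forest's estimators_ attribute.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree

df = pd.read_excel("mldata.xlsx")
features = [f"x{i}" for i in range(1, 101)]

forest = RandomForestRegressor(n_estimators=2, max_depth=2, random_state=0)
forest.fit(df[features], df["continuous"])

# Plot the two trees side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
for ax, est in zip(axes, forest.estimators_):
    plot_tree(est, feature_names=features, ax=ax)
plt.show()

# The forest prediction is the average of the two trees' predictions
x = np.random.default_rng(0).standard_normal(100)
print(forest.predict(pd.DataFrame([x], columns=features)))
```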
A More Realistic Example
Ask Julius to fit a random forest regression to predict “continuous” from x1 through x100 (let Julius choose n_estimators and max_depth - will probably use defaults).
Ask Julius what the score is on the training and test data.
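A sketch of that workflow, assuming a train/test split is made first (the split itself is an added step here, with illustrative test_size and random_state); score() reports \(R^2\).

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_excel("mldata.xlsx")
features = [f"x{i}" for i in range(1, 101)]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["continuous"], test_size=0.2, random_state=0
)

# Defaults: n_estimators=100, no limit on max_depth
forest = RandomForestRegressor(random_state=0)
forest.fit(X_train, y_train)

print("train R^2:", forest.score(X_train, y_train))
print("test R^2: ", forest.score(X_test, y_test))
```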
Important Hyperparameters
How much splitting to do
Examples:
max_depth = 3 means at most 3 levels of splits (up to \(2^3 = 8\) leaves)
min_samples_split = 50 means don’t split groups smaller than 50
min_samples_leaf = 50 means don’t create groups smaller than 50
Other important hyperparameters
How to split
criterion (squared error, absolute error, …). Absolute error is less influenced by outlier values for the target variable.
max_features: Randomly choose max_features features at each split and split on the best of them. Small max_features generates more variation in the trees.
Number of trees (n_estimators)
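For reference, a sketch of how these hyperparameters are passed to scikit-learn (all values below are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=200,            # number of trees
    max_depth=3,                 # at most 3 levels of splits
    min_samples_split=50,        # don't split groups smaller than 50
    min_samples_leaf=50,         # don't create groups smaller than 50
    criterion="absolute_error",  # less influenced by outliers than "squared_error"
    max_features=10,             # consider 10 randomly chosen features at each split
    random_state=0,
)
```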
Example
Ask Julius to use GridSearchCV to find the best max_depth in [2, 4, 6, 8, 10]
Ask what the scores are on the training and test data.
Ask Julius to plot the test data predictions against x1 in the test data.
Ask Julius to tell you the feature importances.
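A sketch of that sequence of prompts, continuing with the same mldata.xlsx assumptions and an added train/test split:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_excel("mldata.xlsx")
features = [f"x{i}" for i in range(1, 101)]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["continuous"], test_size=0.2, random_state=0
)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10]},
)
search.fit(X_train, y_train)
print("best max_depth:", search.best_params_)
print("train R^2:", search.score(X_train, y_train))
print("test R^2: ", search.score(X_test, y_test))

# Feature importances of the refit best model
importances = pd.Series(search.best_estimator_.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head())

# Test-data predictions plotted against x1
plt.scatter(X_test["x1"], search.predict(X_test))
plt.xlabel("x1")
plt.ylabel("predicted")
plt.show()
```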
Another Example
Ask Julius to get the Boston house price data from sklearn.
Build a random forest model to predict MEDV using the other variables.
GridSearchCV for max_depth
Get score on test data
Get feature importances
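A sketch of that workflow. Note that the Boston house price data were removed from scikit-learn (as of version 1.2), so the code below substitutes the California housing data, with MedHouseVal playing the role of MEDV; if Julius loads Boston from another source, the steps are the same.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Boston is no longer shipped with scikit-learn, so California housing stands in here
data = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10]},
)
search.fit(X_train, y_train)

print("test R^2:", search.score(X_test, y_test))
print(pd.Series(search.best_estimator_.feature_importances_, index=data.data.columns))
```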
Boosting
How Gradient boosting works
Fit a decision tree.
Look at its errors. Fit a new decision tree to predict the errors.
New prediction is the original prediction plus a fraction of the predicted error (the fraction is the learning rate).
Look at the errors of the new predictions. Fit a new decision tree to predict these errors.
Continue …
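The loop described above can be written out by hand; this is a minimal sketch on made-up data, not how a library implements boosting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                        # made-up data
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
pred = np.full(len(y), y.mean())                     # start from the mean
for _ in range(100):
    errors = y - pred                                # errors of the current predictions
    tree = DecisionTreeRegressor(max_depth=2).fit(X, errors)
    pred += learning_rate * tree.predict(X)          # add a fraction of the predicted error

print("in-sample MSE:", ((y - pred) ** 2).mean())
```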
Key hyperparameters
Same as random forest
Plus learning rate
Extreme Gradient Boosting (xgboost)
Ask Julius to explain xgboost
Ask Julius to fit xgboost to predict “continuous” from x1 through x100 in mldata.xlsx.
Ask Julius to use GridSearchCV to find the best max_depth and learning rate.
Ask Julius
what the score is on the test data
what the feature importances are
to plot the actual and predicted target values in the test data against x1 in the test data.
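A sketch of those prompts, assuming the xgboost package is installed and the same mldata.xlsx layout as before; xgboost’s scikit-learn interface calls the learning rate learning_rate.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

df = pd.read_excel("mldata.xlsx")
features = [f"x{i}" for i in range(1, 101)]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["continuous"], test_size=0.2, random_state=0
)

search = GridSearchCV(
    XGBRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]},
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test R^2:", search.score(X_test, y_test))
importances = pd.Series(search.best_estimator_.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head())

# Actual and predicted target values against x1 in the test data
plt.scatter(X_test["x1"], y_test, label="actual")
plt.scatter(X_test["x1"], search.predict(X_test), label="predicted")
plt.xlabel("x1")
plt.legend()
plt.show()
```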
Neural Networks
Example of Multi-layer Perceptron
Rectified linear units
The usual function for the neurons (except in the last layer) is
\[ y = \max(0,b+w_1x_1 + \cdots + w_nx_n)\]
Parameters \(b\) (called bias) and \(w_1, \ldots, w_n\) (called weights) are different for different neurons.
This function is called a rectified linear unit (ReLU).
Analogy to neurons firing
If the weights \(w_i\) are positive, then \(y>0\) only when the inputs \(x_i\) are large enough.
A neuron fires when it is sufficiently stimulated by signals from other neurons (in prior layer).
Output function
The output doesn’t have a truncation, so it can be negative.
For regression problems, it is linear:
\[z = b+w_1y_1 + \cdots + w_ny_n\]
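A tiny numpy sketch of these two formulas, with made-up weights: one hidden layer of ReLU neurons feeding a linear output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                 # inputs to the network

# Hidden layer: each neuron computes y = max(0, b + w_1 x_1 + ... + w_n x_n)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # made-up weights and biases
y = np.maximum(0, b1 + W1 @ x)                         # ReLU, so no element is negative

# Output layer for regression: linear, so z can be negative
w2, b2 = rng.normal(size=4), rng.normal()
z = b2 + w2 @ y
print(y, z)
```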
Key hyperparameters
Number of hidden layers
Number of neurons in each layer
Activation function
Also, choice of optimizer can matter
Example
Ask Julius to fit a multi-layer perceptron to predict “continuous” from x1 through x100 in mldata.xlsx.
Try different hidden layer sizes. For example (64, 32) means two hidden layers with 64 neurons in the first and 32 in the second.
You can use GridSearchCV to search over different hidden layer sizes - e.g. (8, ), (4, 4, 4), etc.
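A sketch of those prompts with scikit-learn’s MLPRegressor, under the same mldata.xlsx assumptions; the hidden_layer_sizes grid is illustrative, and standardizing the features first (an added step) usually helps the optimizer converge.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_excel("mldata.xlsx")
features = [f"x{i}" for i in range(1, 101)]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["continuous"], test_size=0.2, random_state=0
)

pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0))
search = GridSearchCV(
    pipe,
    param_grid={"mlpregressor__hidden_layer_sizes": [(8,), (64, 32), (4, 4, 4)]},
)
search.fit(X_train, y_train)

print("best sizes:", search.best_params_)
print("test R^2:", search.score(X_test, y_test))
```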