MGMT 675



AI-Assisted Financial Analysis

Trees, Forests, and Nets

Outline

  • Decision trees
  • Random forests
  • Gradient boosting
  • Neural networks

Concepts from last class

  • Train and test
  • R-squared (score)
  • Underfitting and overfitting
  • Hyperparameters
  • Cross validation
  • Scaling and pipelines

How decision trees work

  • Start with the estimate \(\hat y = \bar y\) for all observations.
  • Split into two subsets based on one variable and one threshold. All observations below the threshold go into one group. All above go into another.
  • Prediction for each group is the group mean of the target variable. Calculate MSE over both groups.
  • The variable and threshold to split on are chosen to minimize this MSE.
  • Split each subset into further subsets and continue (a code sketch of a single split follows below).
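A minimal sketch of a single split using scikit-learn, on made-up data (the data and feature names are illustrative, not from the course files):

```python
# One split: max_depth=1 picks one variable and one threshold and predicts group means.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                  # made-up features
y = 2.0 * (X[:, 0] > 0.5) + rng.normal(scale=0.3, size=200)    # target jumps at a threshold in x0

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1", "x2"]))     # shows the chosen threshold and the two group means
```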

Example of Decision Tree Splitting

Another Example

  • Ask Julius to read mldata.xlsx.
  • Ask Julius to fit a decision tree regressor with max_depth=2 to predict “continuous” from x1 through x100.
  • Ask Julius to plot the tree.
  • Ask Julius to set x to be an array of 100 standard normals. Ask Julius to show x[:10] and ask Julius what the tree predicts for x.
  • Ask Julius to use max_depth=3 and plot the tree.
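If you want to see the kind of code Julius would write, a sketch along these lines should work, assuming mldata.xlsx has columns x1 through x100 and a column named “continuous” (those names are taken from the prompts above, not verified):

```python
# Sketch of the decision-tree example; file and column names are assumed from the prompts above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

df = pd.read_excel("mldata.xlsx")
features = [f"x{i}" for i in range(1, 101)]
X, y = df[features], df["continuous"]

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
plot_tree(tree, feature_names=features, filled=True)
plt.show()

x = np.random.normal(size=100)            # a new observation of 100 standard normals
print(x[:10])
print(tree.predict(x.reshape(1, -1)))     # the tree's prediction for x
```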

Random forest

Forests

  • A forest is multiple trees.
  • For any observation - old or new - each tree makes a prediction.
  • Average the predictions to get the final prediction.

Generating random forests

  • A random forest is created by generating random datasets and fitting a tree to each.
  • A random dataset is generated by randomly drawing rows from the original dataset.
  • The default in scikit-learn is to draw, with replacement, as many rows as are in the original dataset (a bootstrap sample), so some rows appear more than once and others are left out (see the sketch below).
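A small sketch of the resampling idea, using numpy (purely illustrative):

```python
# Bootstrap resampling: draw n row indices with replacement from an n-row dataset.
import numpy as np

rng = np.random.default_rng(0)
n = 10
rows = rng.choice(n, size=n, replace=True)    # some indices repeat, others never appear
print(rows)                                   # the rows one tree would be trained on

# A random forest repeats this draw for each tree, fits a tree to each sample,
# and averages the trees' predictions.
```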

Example

  • Ask Julius to fit a random forest regression to predict “continuous” from x1 through x100 with n_estimators=2 and max_depth=2.
  • Ask Julius to plot both trees.
  • Ask Julius what the random forest predicts for x.

A More Realistic Example

  • Ask Julius to fit a random forest regression to predict “continuous” from x1 through x100 (let Julius choose n_estimators and max_depth; it will probably use the defaults).
  • Ask Julius what the score is on the training and test data.
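A sketch of what this step looks like in code, again assuming the mldata.xlsx column names used above:

```python
# Fit a random forest with default hyperparameters and compare train vs. test R-squared.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_excel("mldata.xlsx")                      # assumed file and column names
X = df[[f"x{i}" for i in range(1, 101)]]
y = df["continuous"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)   # defaults: 100 trees
print("train R^2:", forest.score(X_train, y_train))
print("test  R^2:", forest.score(X_test, y_test))
```

A large gap between the training and test scores is the overfitting symptom discussed last class.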

Important Hyperparameters

  • How much splitting to do
  • Examples:
    • max_depth = 3 means at most 3 levels of splits (at most \(2^3 = 8\) leaves)
    • min_samples_split = 50 means don’t split groups smaller than 50
    • min_samples_leaf = 50 means don’t create groups smaller than 50

Other important hyperparameters

  • How to split
    • criterion (squared error, absolute error, …). Absolute error is less influenced by outlier values for the target variable.
    • max_features: Randomly choose max_features features at each split and consider only those when choosing the split. A small max_features generates more variation across the trees.
  • Number of trees (n_estimators)
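To see where these settings live, here is an illustrative constructor call (the values are arbitrary, not recommendations):

```python
# Where the hyperparameters above appear in scikit-learn (values are illustrative only).
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=200,            # number of trees
    max_depth=3,                 # at most 3 levels of splitting per tree
    min_samples_split=50,        # don't split groups smaller than 50
    min_samples_leaf=50,         # don't create groups smaller than 50
    criterion="absolute_error",  # less influenced by outliers than "squared_error"
    max_features=10,             # consider 10 randomly chosen features at each split
    random_state=0,
)
```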

Example

  • Ask Julius to use GridSearchCV to find the best max_depth in [2, 4, 6, 8, 10].
  • Ask what the scores are on the training and test data.
  • Ask Julius to plot the test data predictions against x1 in the test data.
  • Ask Julius to tell you the feature importances.
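A sketch of this grid search in code, continuing the mldata.xlsx assumptions from above:

```python
# Cross-validated search over max_depth, then evaluation on the held-out test data.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_excel("mldata.xlsx")                        # assumed file and column names
X = df[[f"x{i}" for i in range(1, 101)]]
y = df["continuous"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"max_depth": [2, 4, 6, 8, 10]}, cv=5)
grid.fit(X_train, y_train)
best = grid.best_estimator_

print("best max_depth:", grid.best_params_)
print("train R^2:", best.score(X_train, y_train))
print("test  R^2:", best.score(X_test, y_test))
print(dict(zip(X.columns, best.feature_importances_)))   # feature importances

plt.scatter(X_test["x1"], best.predict(X_test))          # test predictions against x1
plt.xlabel("x1")
plt.ylabel("predicted")
plt.show()
```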

Another Example

  • Ask Julius to get the Boston house price data from sklearn (note: it has been removed from recent scikit-learn versions, so Julius may need to fetch it from another source or substitute a different housing dataset).
  • Build a random forest model to predict MEDV using the other variables.
    • GridSearchCV for max_depth
    • Get score on test data
    • Get feature importances
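A sketch of the same workflow on a housing dataset. Because the Boston data is no longer shipped with scikit-learn, this uses the California housing data as a stand-in; its target is called MedHouseVal rather than MEDV:

```python
# Random forest on a housing dataset (California housing as a stand-in for Boston).
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target                      # target is MedHouseVal
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"max_depth": [2, 4, 6, 8, 10]}, cv=5)
grid.fit(X_train, y_train)

print("test R^2:", grid.best_estimator_.score(X_test, y_test))
print(dict(zip(X.columns, grid.best_estimator_.feature_importances_)))
```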

Boosting

How Gradient boosting works

  • Fit a decision tree.
  • Look at its errors. Fit a new decision tree to predict the errors.
  • The new prediction is the original prediction plus a fraction of the predicted error (the fraction is the learning rate).
  • Look at the errors of the new predictions. Fit a new decision tree to predict these errors.
  • Continue …
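A minimal sketch of this loop with squared error, using plain scikit-learn trees (library implementations such as GradientBoostingRegressor and xgboost add many refinements):

```python
# Hand-rolled boosting: repeatedly fit a tree to the current errors and add a fraction of it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                          # made-up data
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=500)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())                 # start from the mean
for _ in range(100):
    residuals = y - prediction                         # errors of the current prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)      # add a fraction of the fitted errors

print("in-sample MSE:", np.mean((y - prediction) ** 2))
```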

Key hyperparameters

  • Same as random forest
  • Plus learning rate

Extreme Gradient Boosting (xgboost)

  • Ask Julius to explain xgboost
  • Ask Julius to fit xgboost to predict “continuous” from x1 through x100 in mldata.xlsx.
  • Ask Julius to use GridSearchCV to find the best max_depth and learning rate.
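A sketch of the xgboost step, assuming the xgboost package is installed and the same mldata.xlsx columns as before:

```python
# Grid search over max_depth and learning_rate for xgboost's scikit-learn wrapper.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

df = pd.read_excel("mldata.xlsx")                      # assumed file and column names
X = df[[f"x{i}" for i in range(1, 101)]]
y = df["continuous"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    XGBRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 6], "learning_rate": [0.01, 0.1, 0.3]},
    cv=5,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test R^2:", grid.best_estimator_.score(X_test, y_test))
print(dict(zip(X.columns, grid.best_estimator_.feature_importances_)))
```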

Ask Julius

  • what the score is on the test data
  • what the feature importances are
  • to plot the actual and predicted target values in the test data against x1 in the test data.

Neural Networks

Example of Multi-layer Perceptron

Rectified linear units

  • The usual function for the neurons (except in the last layer) is

\[ y = \max(0,b+w_1x_1 + \cdots + w_nx_n)\]

  • Parameters \(b\) (called the bias) and \(w_1, \ldots, w_n\) (called the weights) are different for different neurons.
  • This function is called a rectified linear unit (ReLU).
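A one-line version of a ReLU neuron in numpy (the inputs, weights, and bias below are made-up values):

```python
# A single ReLU neuron: output is zero unless the weighted sum plus bias is positive.
import numpy as np

def relu_neuron(x, w, b):
    return np.maximum(0.0, b + w @ x)

x = np.array([1.0, -2.0, 0.5])      # inputs from the previous layer (illustrative)
w = np.array([0.4, 0.1, -0.3])      # weights specific to this neuron
b = -0.2                            # bias specific to this neuron
print(relu_neuron(x, w, b))
```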

Analogy to neurons firing

  • If the weights \(w_i\) are positive, then \(y>0\) only when the inputs \(x_i\) are large enough to make \(b+w_1x_1+\cdots+w_nx_n\) positive.
  • A neuron fires when it is sufficiently stimulated by signals from other neurons (in prior layer).

Output function

  • The output doesn’t have a truncation, so it can be negative.
  • For regression problems, it is linear:

\[z = b+w_1y_1 + \cdots + w_ny_n\]

Key hyperparameters

  • Number of hidden layers
  • Number of neurons in each layer
  • Activation function
  • Also, choice of optimizer can matter

Example

  • Ask Julius to fit a multi-layer perceptron to predict “continuous” from x1 through x100 in mldata.xlsx.
  • Try different hidden layer sizes. For example (64, 32) means two hidden layers with 64 neurons in the first and 32 in the second.
  • You can use GridSearchCV to search over different hidden layer sizes - e.g. (8, ), (4, 4, 4), etc.
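A sketch of this example in code, again assuming the mldata.xlsx column names; scaling matters for neural networks, so the scaler and the model are combined in a pipeline:

```python
# Grid search over hidden layer sizes for an MLP, with scaling handled in a pipeline.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_excel("mldata.xlsx")                      # assumed file and column names
X = df[[f"x{i}" for i in range(1, 101)]]
y = df["continuous"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0))
grid = GridSearchCV(
    pipe,
    param_grid={"mlpregressor__hidden_layer_sizes": [(8,), (64, 32), (4, 4, 4)]},
    cv=5,
)
grid.fit(X_train, y_train)

print("best sizes:", grid.best_params_)
print("test R^2:", grid.best_estimator_.score(X_test, y_test))
```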