MGMT 675



AI-Assisted Financial Analysis

Classification

Categorical target variables

  • Binary (off/on, yes/no, …) or multiclass
  • Forests, neural nets, linear (logistic), plus support vector machines
  • Example datasets for today:
    • sklearn breast cancer (binary)
    • sklearn digits (multiclass)
    • loan defaults (binary)

Binary classification

  • 0 or 1 (negative or positive)
  • Predict probability \(p\) of positive (e.g., 0.9)
    • Error is \(p\) is actually negative (e.g., 0.9)
    • Error is \(1-p\) if actually positive (e.g., 0.1)
  • Try to minimize sum of some function (square, log, …) of errors
  • Ultimate prediction is positive if \(p>1/2\).

Predicting probabilities

  • For logistic regression (logit), predicted probability is \[\frac{1}{1 + e^{-\alpha - \beta_1x_1 - \cdots - \beta_nx_n}}\]
  • Higher value for \(\alpha + \beta_1x_1 + \cdots + \beta_nx_n\) produces a higher probability (between 0 and 1).
  • Same function is used (usually) for neural networks.

Binary classification with trees and forests

  • Predicted probability is class frequency within the leaf.
  • Try to split to create pure leaves
  • A group is pure if it is all of the same class
  • The more mixed a group is, the higher the impurity
  • Gini index: like the reverse of diversification
    • Less diversified = less impure = more pure = higher index

Example

  • Ask Julius to get the sklearn breast cancer dataset and to show the head of the data with the target variable labeled benign or malignant.
  • Ask Julius to fit a decision tree and to plot the tree.

Errors

  • Ask Julius to fit a random forest regressor and to show the confusion matrix on the training data and the test data.

Possible adjustments

  • Can run GridSearchCV for hyperparameters like max_depth
  • Can change class weights to emphasize one class (e.g., focus on getting malignant right) or because of imbalanced data (more of one type than another)
  • Can change threshold for predicting one class or the other

Prediction threshold

  • Ask Julius to plot a histogram of the predicted probabilities for the training data.
  • The default rule is to predict the class with the highest probability.
  • But you can change the threshold. For example, predict malignant if probability >=25%. Will reduce false negatives (but also increase false positives).

ROC Curve

  • ROC (Receiver Operating Characteristics) curve shows trade-off between different types of errors.
  • True positive rate = fraction of positives that are correctly classified
  • False positive rate = fraction of negatives that are incorrectly classified
  • Ask Julius to show the ROC curve for predicting malignant (i.e., malignant=positive) for the training data.

Class Weights

  • Ask Julius to refit the random forest with the class weight of malignant 10 times higher than the class weight of benign.
  • Ask Julius to show the ROC curves for predicting malignant for the training data with and without the class weight adjustment.
  • From the ROC curves, we can probably see which model we like best and what the threshold should be. Then we can test it.

Scoring for GridSearchCV

  • Accuracy (% correctly classified) is usual score.
  • But we can also use the true positive rate as the score.
  • Or false negative rate or …

Another example

  • Ask Julius to fit a neural network.
  • Ask Julius to show the confusion matrix and the ROC curve for the test data.

Multiclass

  • For trees and forests, predicted probabilities are again class frequencies
  • Neural networks produce a separate output for each class. Probability for class \(i\) is \[\frac{e^{\text{output}_i}}{\text{sum over classes of } e^{\text{output}_j}}\]
  • Called softmax function.
  • Logistic regression is same.

Example

  • Ask Julius to get the digits dataset from sklearn and to show the head of the data.
  • Ask Julius to show the image for the first sample.
  • Ask Julius to fit a neural network and to show the confusion matrix on the test data.

Default data

  • Start a new chat.
  • Download the loan default data from the course website and upload to Julius.
  • Ask Julius to show the head of the dataset.
  • Build a model to predict defaults!