MGMT 675



AI-Assisted Financial Analysis

Introduction to Machine Learning

Objective of Machine Learning

  • Goal is prediction (or making optimal decisions in reinforcement learning)
  • Linear regression is a machine learning model (though usually not the best one), but in machine learning we don’t compute t-stats or p-values.
  • The only question is how good the predictions are on data the machine was not trained on: can we trust the predictions on new data?

Learn = Train = Fit = Estimate

  • Estimating a model like linear regression is called “fitting the model” or “training the model.”

  • To say a machine “learns” from data means its parameters are estimated from the data.

  • The predictors (x variables) are called features.

Regression vs Classification

  • Regression means predicting a continuous variable (not necessarily by linear regression).
  • Classification means predicting a categorical variable, which can be binary or multiclass.

Machine learning in finance

  • Fraud detection
  • Credit risk analysis
  • Return prediction
  • Valuation
  • Sentiment analysis
  • Time series forecasting

Train and Test

  • To assess how well a model will perform on new data, we hold out some data when training it.
  • Split our data (usually randomly) into train and test subsets.
  • For return prediction etc., we may split on some date and train on older data and test on newer data.
  • Train on the training data and test on the test data; a minimal split is sketched below.
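
Here is a minimal sketch of such a split using scikit-learn’s train_test_split; the synthetic data from make_regression is only a stand-in for whatever dataset you are actually working with.

```python
# Minimal sketch: a random train/test split with scikit-learn.
# Synthetic regression data stands in for the real dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=100, noise=10.0, random_state=0)

# Hold out 20% of the observations as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```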

Assessing Performance

  • How do we decide if performance is good or bad?
  • For continuous variables,
    • usually want to achieve a low sum of squared errors
    • equivalently, achieve a high \(R^2\).

\[R^2 = 1 - \frac{\sum (y_i - \hat y_i)^2}{\sum (y_i - \bar{y})^2}\]

  • For categorical variables, we can use the percentage accurately classified; both kinds of scores are computed in the snippet below.
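
As a small illustration, scikit-learn’s metrics module computes both scores directly (the numbers below are made up for the example).

```python
# R-squared for a continuous target and accuracy for a categorical target.
from sklearn.metrics import accuracy_score, r2_score

# Toy actual vs. predicted values for a continuous variable.
print(r2_score([3.0, 2.5, 4.1, 1.8], [2.8, 2.7, 3.9, 2.0]))

# Toy actual vs. predicted labels for a binary variable.
print(accuracy_score([1, 0, 1, 1], [1, 0, 0, 1]))
```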

Example

  • Download mldata.xlsx from the course website and upload to Julius.
  • Tell Julius you want to predict the “continuous” variable using x1 through x100 as the features.
  • Ask Julius to recommend a model.
  • Ask Julius to fit the model and report the scores on the training data and the test data. A sketch of the kind of code involved appears below.
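
Julius would likely generate scikit-learn code along these lines. The random forest here is only an illustrative choice (Julius may recommend something else), and the column names follow the description above.

```python
# Illustrative sketch: fit a model and report R-squared on train and test data.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_excel("mldata.xlsx")
X = df[[f"x{i}" for i in range(1, 101)]]   # features x1 through x100
y = df["continuous"]                        # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test R^2: ", model.score(X_test, y_test))
```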

How was the data generated?

  • In the true model that generated the data, x1 is the only useful feature.

Data we’re feeding the model

  • Added noise
  • And added 99 other completely irrelevant features

Overfitting and Shrinkage

Underfitting and Overfitting

  • A model is underfit if it makes poor predictions on its training data.
  • A model is overfit if it makes good predictions on its training data but performs poorly on new data.
  • Overfitting means that the chosen parameters reflect chance relationships in the training data.

Model complexity and shrinkage

  • We can create more complex models by increasing the number of parameters (like adding variables in a linear regression).
  • We can reduce complexity by reducing the number of parameters. Or by choosing less influential parameter values (like regression coefficients closer to zero).
  • A more complex model is less likely to underfit but more likely to overfit.

Shrinkage/Regularization/Penalization

  • To induce selection of parameter values that are less influential, we can penalize large values.
  • Example: lasso and ridge regression (both are sketched in code after this list)
    • Usually in linear regression, we minimize SSE (sum of squared errors).
    • LASSO: minimize SSE + penalty \(\times\) sum of \(|\beta_i|\).
    • Ridge: minimize SSE + penalty \(\times\) sum of \(\beta_i^2\).
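
In scikit-learn, both estimators take the penalty weight as the alpha argument. A minimal sketch on synthetic data:

```python
# Lasso penalizes the sum of |beta_i|; Ridge penalizes the sum of beta_i^2.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=100, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("lasso test R^2:", lasso.score(X_test, y_test))
print("ridge test R^2:", ridge.score(X_test, y_test))
```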

Example

Ask Julius to vary the penalty parameter (called alpha) in ridge regression and to report the scores on the training and test data.
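
The request amounts to something like the following sketch, again on synthetic data: refit ridge regression for several alpha values and compare the training and test scores.

```python
# Vary the ridge penalty (alpha) and report train/test R-squared.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=100, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, model.score(X_train, y_train), model.score(X_test, y_test))
```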

Validating Hyperparameters

Hyperparameters

  • The penalty parameter in lasso or ridge is called a hyperparameter.
  • Hyperparameters are model parameters that are not fit from the data during training.
  • Instead they are specified before training.
  • How to choose the best hyperparameter values?

Don’t optimize on your test data

  • Choosing the best hyperparameter using the test data may overfit the test data.
  • The test data may no longer provide a reliable test of how the model will perform on new data.
  • Instead, we could hold out some data within the training data (called validation data) for doing model comparison.

Cross Validation

  • Instead of splitting the training data once into train/validate, we can do it multiple times.
  • This is called cross validation.

  • Split the training data into random subsets, say, \(A, B, C, D\), and \(E\).
  • Use \(A \cup B \cup C \cup D\) as training data and validate on \(E\).
  • Then use \(B \cup C \cup D \cup E\) as training data and validate on \(A\).
  • Then, …, until we have trained and validated 5 times (or as many times as we want).

  • For each train/validate split,
    • Train for multiple values of the hyperparameter.
    • Compute the score for each on the validation data.
  • This produces 5 scores for each value of the hyperparameter. Average them.
  • Choose the value that produces the highest average score.
  • Then proceed to testing on the test data. The whole procedure is sketched in code below.
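
Scikit-learn’s GridSearchCV carries out exactly these steps. A minimal sketch, assuming synthetic data and a ridge model:

```python
# 5-fold cross validation over a grid of ridge penalties, using the training
# data only, followed by a final score on the held-out test data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=100, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,   # the five train/validate splits described above
)
search.fit(X_train, y_train)   # averages the validation scores for each alpha
print("best alpha:", search.best_params_["alpha"])
print("test R^2:  ", search.score(X_test, y_test))
```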

Scaling/Transforming Features

Why Scale or Transform?

  • A coefficient of 2 on a variable that ranges from 1,000 to 10,000 is very different from a coefficient of 2 on a variable that ranges from 0.1 to 1.0.
  • We should penalize the former more aggressively.
  • Or, better, rescale the variables so they are on the same scale before training a model.

Standard Scaler/Quantile Transformer

  • Scikit-learn’s StandardScaler subtracts the mean and divides by the standard deviation, so each feature has mean zero and standard deviation one.
  • The QuantileTransformer can be better when there are outliers.
    • It can transform each feature to a standard normal or to a uniform distribution on (0, 1). Both transformers are shown in the snippet below.
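
A small sketch of both transformers on skewed synthetic data:

```python
# StandardScaler: mean 0, standard deviation 1 per feature.
# QuantileTransformer: map each feature to a standard normal (or uniform) shape.
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

X = np.random.default_rng(0).lognormal(size=(500, 3))   # skewed features with outliers

X_std = StandardScaler().fit_transform(X)
X_qt = QuantileTransformer(output_distribution="normal", n_quantiles=100).fit_transform(X)

print(X_std.mean(axis=0).round(2), X_std.std(axis=0).round(2))
```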

Remembering Transformations

  • Suppose you scale or transform features and train a model.
  • Now you get a new observation and need to make a prediction. How do you scale or transform the new observation?
  • You need to apply the same transformation you applied to the training data. So, your model needs to remember it.

Pipelines

  • The easiest way is to put the scaling/transforming step and the model fit in a pipeline.
  • Then fit the pipeline and use the pipeline to predict.
  • We can also run cross validation on the pipeline, as sketched below.
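
A minimal sketch, assuming synthetic data and a scale-then-ridge pipeline:

```python
# The fitted pipeline remembers the scaling, so new data passed to predict()
# or score() is transformed exactly as the training data was.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=100, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)                      # scaler is fit on training data only
print("test R^2:", pipe.score(X_test, y_test))  # test data is scaled automatically

# Cross validation can be run on the whole pipeline as well.
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())
```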