MGMT 675: Generative AI for Finance
A pretrained LLM already knows language. Fine-tuning adjusts its weights on a smaller, task-specific dataset so it consistently performs that task the way you want.
Analogy: a pretrained model is like hiring a smart generalist. Fine-tuning is specialized on-the-job training.
Prompting / Skills
RAG
Fine-Tuning
Fine-tuning is not the right tool when:
Fine-tuning is worth the effort when:
One compelling use case for fine-tuning: use a large model to generate training data, then fine-tune a smaller model on it.
Cost
Large models cost 10–30x more per query
Generate training data once, then run the small model at a fraction of the cost
Speed
A fine-tuned small model responds in milliseconds for classification, seconds for generation
Large API models are 10–50x slower
Privacy
A fine-tuned small model runs entirely on your infrastructure
No data leaves your network (HIPAA, GDPR)
Well-known examples of LLM → small model distillation:
The pattern works best for specific, repeatable tasks at scale: classify emails, extract fields, generate reports. For diverse, one-off tasks, just use the LLM directly.
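The distillation pattern above can be sketched in a few lines. The labeling function below is a hypothetical stand-in for a call to a large frontier-model API (a keyword rule keeps the sketch self-contained); the file name and field names are illustrative.

```python
import json

def label_with_large_model(text):
    # Hypothetical stand-in for an API call to a large model.
    # In practice this would send `text` to the LLM and return its label.
    positive_cues = ("beat", "raises")
    return "positive" if any(c in text.lower() for c in positive_cues) else "negative"

headlines = [
    "Acme beats earnings estimates, raises guidance",
    "Widget Corp misses revenue targets amid weak demand",
]

# Step 1: pay the large model once to label the data
records = [{"input": h, "output": label_with_large_model(h)} for h in headlines]

# Step 2: save the pairs as training data for the small model
with open("distilled_train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

Once the small model is fine-tuned on `distilled_train.jsonl`, the large model is no longer needed at inference time.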
Training data is a set of input/output pairs — examples of what you want the model to do.
| Input | Output |
|---|---|
| “Summarize this 10-K risk factor…” | “The company faces supply chain risk due to…” |
| “Classify this earnings call sentence…” | “Positive guidance” |
| “Write a credit memo for…” | “Credit Assessment: BBB+ …” |
Start small — behavioral changes can appear with as few as 20 examples. Meta’s LIMA project showed strong results with just 1,000 carefully curated examples. Quality matters more than quantity.
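In practice, pairs like those in the table are usually stored as JSON Lines, one example per line. A minimal sketch (the field names and file name are illustrative; the exact schema depends on your fine-tuning framework):

```python
import json

# Input/output pairs in JSON Lines format -- one JSON object per line.
examples = [
    {"input": "Summarize this 10-K risk factor: ...",
     "output": "The company faces supply chain risk due to ..."},
    {"input": "Classify this earnings call sentence: ...",
     "output": "Positive guidance"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```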
Fine-tuning is not risk-free. Issues to watch for:
Full Fine-Tuning
LoRA (Parameter-Efficient)
In practice, almost everyone uses LoRA or a similar parameter-efficient method. Full fine-tuning is reserved for large-budget projects.
The LoRA paper (Hu et al., 2021) observed that the weight updates needed to adapt a model to a new task have low intrinsic rank: they can be captured with far fewer parameters than the full weight matrices contain.
Instead of updating an entire weight matrix, LoRA decomposes the update into two small matrices. Only these small matrices are trained; the original weights are frozen.
Full Fine-Tuning
LoRA (rank 16)

Each of the 6 nodes in layer k+1 receives input from all 6 nodes in layer k, so the weight matrix has 6 × 6 = 36 coefficients. Each node also has a bias, adding 6 more parameters — 42 total per layer. Full fine-tuning means retraining all 42.

The original 36 direct connections remain active but frozen (not shown). At real scale (4096 × 4096 layers, rank 16): 131K vs. 16.8M trainable parameters — a 128× reduction.
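The parameter arithmetic above can be checked in a few lines. For a layer mapping `d_in` inputs to `d_out` outputs, full fine-tuning trains every weight plus the biases, while LoRA trains only the factors A (r × d_in) and B (d_out × r):

```python
# Trainable-parameter counts: full fine-tuning vs. LoRA at rank r.

def full_params(d_in, d_out):
    # every weight plus one bias per output node
    return d_out * d_in + d_out

def lora_params(d_in, d_out, r):
    # A is (r x d_in), B is (d_out x r); the original W stays frozen
    return r * d_in + d_out * r

# Toy layer from the text: 6 -> 6
print(full_params(6, 6))                   # 42

# Real scale: 4096 x 4096, rank 16
print(f"{full_params(4096, 4096):,}")      # 16,781,312  (~16.8M)
print(f"{lora_params(4096, 4096, 16):,}")  # 131,072     (~131K)
print(4096 * 4096 // lora_params(4096, 4096, 16))  # 128 (weights only)
```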
LoRA trains two small matrices per layer. The adapter path is purely linear — no activation function in between:
At inference, merge the adapter into the original weights:
\[W' = W + B \times A\]
Because the adapter path is linear (just two matrix multiplications, no activation function), the product \(B \times A\) is itself a matrix that can be added directly to \(W\). The merged model has the same architecture and speed as the original — the adapter disappears into the weights.
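The merge identity can be verified numerically. The sketch below uses tiny illustrative dimensions (hidden size 4, rank 2) and plain-Python matrix math; it checks that running the frozen weights plus the adapter path gives the same output as the merged matrix \(W' = W + BA\):

```python
import random

random.seed(0)
d, r = 4, 2  # illustrative: hidden size 4, LoRA rank 2

# Frozen pretrained weights W (d x d); trained LoRA factors B (d x r), A (r x d)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

x = [1.0, -0.5, 0.25, 2.0]

# During training: adapter path runs alongside the frozen weights
y_adapter = [wx + bax for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# At inference: merge once, then use a single matrix
BA = matmul(B, A)
W_merged = [[W[i][j] + BA[i][j] for j in range(d)] for i in range(d)]
y_merged = matvec(W_merged, x)

# The two paths agree up to float rounding
print(all(abs(a - b) < 1e-9 for a, b in zip(y_adapter, y_merged)))  # True
```

Because there is no activation between B and A, this merge is exact; with a nonlinearity in the adapter path, it would not be possible.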
QLoRA (Dettmers et al., 2023) adds 4-bit quantization on top of LoRA:
Result: fine-tune a 65B-parameter model on a single 48GB GPU. For smaller models like Gemma 1B, QLoRA makes fine-tuning possible on a free Colab T4 (16GB).
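A sketch of the QLoRA setup using the Hugging Face `transformers` and `peft` libraries. The model ID and hyperparameters are illustrative choices, not the only valid ones; this is a configuration fragment, not a complete training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",             # illustrative model ID
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections (illustrative choices)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```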
Fine-tune Google Gemma 3 1B to classify financial news sentiment using QLoRA — in a Colab notebook.
What You’ll Do
What You’ll Need