MGMT 675: Generative AI for Finance
We discussed data security risks with the Deloitte Canada case. A key question is: where does inference happen?
Public Cloud API
Cloud-Hosted Private AI
On-Prem / Private Cloud
Ollama lets you run open-source models on your laptop — no cloud, no API key, no data leaving your machine. Same idea as on-prem deployment.
Setup
Go to ollama.com and download the installer for your OS
Run the installer (no terminal needed)
Once installed, open a terminal and run:
ollama run gemma4:e2b
The prompt
Give gemma4:e2b this prompt:
A private equity firm is evaluating a manufacturing company with $50M in revenue, 12% EBITDA margins, and 8% annual revenue growth. Comparable transactions have been done at 7–9x EBITDA. What is a reasonable valuation range, and what are the top three risks you would flag before proceeding?
Assess the quality of the response.
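Whatever the model says, the headline arithmetic is easy to verify yourself: $50M revenue at a 12% margin is $6M of EBITDA, and 7–9x comps imply roughly $42M–$54M of enterprise value. A quick sanity check:

```python
# Back-of-the-envelope check on the prompt's numbers (illustrative only).
revenue = 50_000_000               # $50M revenue
ebitda_margin = 0.12               # 12% EBITDA margin
ebitda = revenue * ebitda_margin   # $6M EBITDA

low, high = 7 * ebitda, 9 * ebitda  # 7-9x EBITDA from comparable transactions
print(f"EBITDA: ${ebitda / 1e6:.0f}M")
print(f"Valuation range: ${low / 1e6:.0f}M - ${high / 1e6:.0f}M")
```

If the model's range is far from $42M–$54M, its arithmetic (not just its judgment) is off.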
Now try:
ollama run gemma4:e4b
Try the same prompt and assess the response.
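If you would rather script the comparison than retype the prompt in the REPL, Ollama also serves a local HTTP API (by default on port 11434). A minimal sketch using only the standard library; the model tag is whichever one you pulled above, and nothing leaves your machine:

```python
import json
import urllib.request

def build_request(model: str, prompt: str) -> bytes:
    # JSON body for Ollama's /api/generate endpoint; stream=False returns one reply.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(model: str, prompt: str) -> str:
    # POST to the locally running Ollama server and return the model's text.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",   # Ollama's default local address
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# e.g. ask_ollama("gemma4:e2b", "What is a reasonable EBITDA multiple range?")
```

Running the same prompt through both model tags in a loop makes the quality comparison side by side rather than from memory.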
Full-precision LLMs store each parameter as a 32-bit float. Quantization shrinks those numbers to 4–8 bits by rounding them onto a discrete grid, slashing the memory requirement with surprisingly little quality loss.
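The rounding idea can be sketched in a few lines. This is a toy symmetric per-tensor scheme; production formats quantize in small blocks with per-block scales, but the principle is the same:

```python
def quantize(weights, bits=4):
    # Snap each float onto a signed integer grid with 2^(bits-1)-1 levels per side.
    levels = 2 ** (bits - 1) - 1                      # 7 levels each side for 4-bit
    scale = max(abs(w) for w in weights) / levels     # one scale for the whole tensor
    q = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats: each weight is now within scale/2 of the original.
    return [v * scale for v in q]

weights = [0.71, -1.34, 0.02, 0.98, -0.45]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Storage drops from 32 bits to 4 bits per weight (8x), at the cost of rounding error.
```

The "what you give up" is visible in `restored`: every weight is nudged by up to half a grid step, which is why aggressive quantization eventually degrades output quality.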
What Quantization Does
What You Give Up
For deployment, another consideration beyond the LLM is the coding harness. A coding agent is not just an LLM — it’s a harness of tools wrapped around the model.
The model is the brain. The harness is everything else. On-prem means building the harness yourself.
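Concretely, a harness is at minimum a registry of tools, a permission gate, and a dispatch loop that executes the model's tool calls. A minimal sketch (tool names are illustrative, not any real agent's architecture):

```python
import subprocess

def read_file(path: str) -> str:
    # File tool: return a file's contents.
    with open(path) as f:
        return f.read()

def run_shell(cmd: str) -> str:
    # Terminal tool: run a shell command and return its stdout.
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

TOOLS = {"read_file": read_file, "run_shell": run_shell}
GRANTED = {"read_file"}   # permission gate: terminal access is off by default

def dispatch(tool: str, arg: str) -> str:
    # The harness, not the model, decides whether a requested tool may run.
    if tool not in GRANTED:
        return f"[denied] {tool} requires user approval"
    return TOOLS[tool](arg)
```

Multiply this pattern by roughly forty tools, plus sandboxing, context management, and error recovery, and the scale of a production harness becomes clear.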
On March 31, 2026, Anthropic accidentally published Claude Code’s full source code via an npm packaging error — making the architectural complexity of a production coding agent visible for the first time.
512,000 lines (not all production code), ~40 permission-gated tools (file, terminal, git, web, IDE)
What happened next