Module 10: Corporate AI Deployment

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

The Deployment Spectrum

We discussed data security risks in the Deloitte Canada case. A key question is: where does inference happen?

Public Cloud API

  • Research, drafting, coding assistance
  • Non-sensitive data analysis
  • Frontier model quality
  • No infrastructure to manage

Cloud-Hosted Private AI

  • AWS Bedrock, Azure OpenAI, GCP Vertex AI
  • Models run in your VPC with private endpoints
  • Customer-managed encryption keys
  • Meets HIPAA, SOX, and FedRAMP requirements for many use cases
  • Most common choice
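From code, a private endpoint looks much like a public API call; the security difference is in the network path and key management, not the client code. Below is a minimal sketch using the Bedrock Converse API, assuming boto3 is installed and AWS credentials are configured; the model ID is illustrative, not a recommendation.

```python
# Sketch: calling a model through a cloud-hosted private endpoint (AWS Bedrock).
# Assumes boto3 is installed and credentials are configured; with a VPC
# endpoint for bedrock-runtime, this traffic never crosses the public internet.

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # illustrative model ID

def build_messages(prompt: str) -> list:
    """Bedrock Converse-style message list for a single user turn."""
    return [{"role": "user", "content": [{"text": prompt}]}]

def ask_bedrock(prompt: str) -> str:
    import boto3  # imported here so the sketch loads without the SDK present
    client = boto3.client("bedrock-runtime")
    resp = client.converse(modelId=MODEL_ID, messages=build_messages(prompt))
    return resp["output"]["message"]["content"][0]["text"]
```

Calling `ask_bedrock("Summarize our deployment options.")` would return the model's reply as a string, with customer-managed keys and network isolation handled by the account configuration rather than the code.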

On-Prem / Private Cloud

  • Only available with open-weight models (Llama, Gemma, Mistral, etc.)
  • Air-gapped or classified environments
  • Data that cannot transit any external network
  • Full physical control of hardware

Running AI Locally

Ollama lets you run open-source models on your laptop — no cloud, no API key, no data leaving your machine. Same idea as on-prem deployment.

Setup

  1. Go to ollama.com and download the installer for your OS

  2. Run the installer (no terminal needed)

  3. Once installed, open a terminal and run:

    ollama run gemma4:e2b
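Besides the interactive CLI, Ollama exposes a local HTTP API, which is how you would wire a local model into your own tools. A minimal sketch, assuming the Ollama server is running on its default port (11434):

```python
# Sketch: querying the local Ollama HTTP API instead of the interactive CLI.
# Assumes `ollama serve` is running on the default port; no data leaves
# your machine.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """JSON body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the model pulled, `ask_ollama("gemma4:e2b", "...")` returns the completion as a string, so the same prompt can be scripted against multiple local models for comparison.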

Prompt and Assess

The prompt

Give gemma4:e2b this prompt:

A private equity firm is evaluating a manufacturing company with $50M in revenue, 12% EBITDA margins, and 8% annual revenue growth. Comparable transactions have been done at 7–9x EBITDA. What is a reasonable valuation range, and what are the top three risks you would flag before proceeding?

Assess the quality of the response.

A Larger Model

Now try:

ollama run gemma4:e4b

Try the same prompt and assess the response.

Example Model Sizes and RAM

  • Gemma 4 E2B (Google, ~2B active) — runs on 4GB RAM
  • Gemma 4 E4B (Google, ~4B active) — runs on 8GB RAM
  • Llama 3.1 8B (Meta) — needs 16GB RAM
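The RAM figures above follow a rough rule of thumb: weight memory is roughly parameters times bytes per parameter, with runtime overhead on top. A quick back-of-envelope calculator:

```python
# Rough rule of thumb, not a precise requirement: weight memory is about
# (parameters x bytes per parameter); KV cache and runtime overhead add more.
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB at a given precision."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

# A 2B-active model at 4-bit needs ~1 GB for weights, so 4 GB RAM is plenty.
print(weight_gb(2, 4))
# An 8B model at 16-bit needs ~16 GB for weights alone, hence the 16 GB figure.
print(weight_gb(8, 16))
```

This is why the same model can appear with very different RAM requirements depending on the quantization it ships with.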

How Small Models Fit on a Laptop: Quantization

Full-precision LLMs store each parameter as a 32-bit float. Quantization shrinks those numbers to 4–8 bits by rounding each value to the nearest point on a discrete grid, slashing the memory requirement with surprisingly little quality loss.

What Quantization Does

  • Converts weights from 32-bit (or 16-bit) floats to 4-bit or 8-bit integers (the integer is the value's position on the grid)
  • A 7B-parameter model: ~14 GB at 16-bit → ~4 GB at 4-bit
  • Different quantizations ship as different model names, with the default usually being 4-bit
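The round-to-grid idea can be shown in a few lines. This is a toy symmetric quantizer in pure Python; real schemes (such as the per-block scales used in practice) are more elaborate, but the core step is the same.

```python
# Toy symmetric 4-bit quantization of a weight vector. Real quantization
# formats use per-block scales and other refinements; this just shows the
# round-to-grid idea: store an integer grid position plus one scale factor.

def quantize(weights, bits=4):
    """Map floats to signed integers on a 2**bits-level grid, plus a scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax     # grid spacing
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats: integer = location on the grid."""
    return [qi * scale for qi in q]

weights = [0.31, -0.07, 0.22, -0.55, 0.02]
q, scale = quantize(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
print(q)        # small integers, storable in 4 bits each
print(max_err)  # bounded by half the grid spacing (scale / 2)
```

Each weight now costs 4 bits instead of 32, at the price of a rounding error no larger than half a grid step, which is the memory/quality trade-off described above.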

What You Give Up

  • Slight loss in nuance on complex reasoning
  • Rare factual errors on edge cases
  • For most business tasks (drafting, summarization, code), the difference is negligible
  • 8-bit is nearly lossless; 4-bit is the sweet spot for laptops

Developer Tooling

For deployment, another consideration beyond the LLM is the coding harness. A coding agent is not just an LLM; it is a harness of tools wrapped around the model.

The model is the brain. The harness is everything else. On-prem means building the harness yourself.

The Claude Code Leak

On March 31, 2026, Anthropic accidentally published Claude Code’s full source code via an npm packaging error — making the architectural complexity of a production coding agent visible for the first time.

  • ~512,000 lines of code (not all production code)
  • ~40 permission-gated tools (file, terminal, git, web, IDE)

What happened next

  • claw-code: open-source reimplementation became the fastest-growing repo on GitHub
  • Provider-agnostic: works with Claude, OpenAI, or local models via Ollama
  • Other alternatives: Aider, OpenHands, Continue.dev