Module 10: Corporate AI Deployment

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

The Deployment Spectrum

We discussed data security risks in the Deloitte Canada case. A key question is: where does inference happen?

Public Cloud API

  • Research, drafting, coding assistance
  • Non-sensitive data analysis
  • Frontier model quality
  • No infrastructure to manage

Cloud-Hosted Private AI

  • AWS Bedrock, Azure OpenAI, GCP Vertex AI
  • Models run in your VPC with private endpoints
  • Customer-managed encryption keys
  • Meets HIPAA, SOX, and FedRAMP requirements for many use cases
  • Most common choice
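From code, a private endpoint looks much like a public API call; the security difference is in the network path and key management, not the client code. Below is a minimal sketch using the Bedrock Converse API, assuming boto3 is installed and AWS credentials are configured; the model ID is illustrative, not a recommendation.

```python
# Sketch: calling a model through a cloud-hosted private endpoint (AWS Bedrock).
# Assumes boto3 is installed and credentials are configured; with a VPC
# endpoint for bedrock-runtime, this traffic never crosses the public internet.

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # illustrative model ID

def build_messages(prompt: str) -> list:
    """Bedrock Converse-style message list for a single user turn."""
    return [{"role": "user", "content": [{"text": prompt}]}]

def ask_bedrock(prompt: str) -> str:
    import boto3  # imported here so the sketch loads without the SDK present
    client = boto3.client("bedrock-runtime")
    resp = client.converse(modelId=MODEL_ID, messages=build_messages(prompt))
    return resp["output"]["message"]["content"][0]["text"]
```

Calling `ask_bedrock("Summarize our deployment options.")` would return the model's reply as a string, with customer-managed keys and network isolation handled by the account configuration rather than the code.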

On-Prem / Private Cloud

  • Only available with open-weight models (Llama, Gemma, Mistral, etc.)
  • Air-gapped or classified environments
  • Data that cannot transit any external network
  • Full physical control of hardware

Running AI Locally

Ollama lets you run open-source models on your laptop — no cloud, no API key, no data leaving your machine. Same idea as on-prem deployment.

Setup

  1. Go to ollama.com and download the installer for your OS

  2. Run the installer (no terminal needed)

  3. Once installed, open a terminal and run:

    ollama run gemma4:e2b
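Besides the interactive CLI, Ollama exposes a local HTTP API, which is how you would wire a local model into your own tools. A minimal sketch, assuming the Ollama server is running on its default port (11434):

```python
# Sketch: querying the local Ollama HTTP API instead of the interactive CLI.
# Assumes `ollama serve` is running on the default port; no data leaves
# your machine.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """JSON body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the model pulled, `ask_ollama("gemma4:e2b", "...")` returns the completion as a string, so the same prompt can be scripted against multiple local models for comparison.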

Prompt and Assess

The prompt

Give gemma4:e2b this prompt:

A private equity firm is evaluating a manufacturing company with $50M in revenue, 12% EBITDA margins, and 8% annual revenue growth. Comparable transactions have been done at 7–9x EBITDA. What is a reasonable valuation range, and what are the top three risks you would flag before proceeding?

Assess the quality of the response.

A Larger Model

Now try:

ollama run gemma4:e4b

Try the same prompt and assess the response.

Example Model Sizes and RAM

  • Gemma 4 E2B (Google, ~2B active) — runs on 4GB RAM
  • Gemma 4 E4B (Google, ~4B active) — runs on 8GB RAM
  • Llama 3.1 8B (Meta) — needs 16GB RAM
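The RAM figures above follow a rough rule of thumb: weight memory is roughly parameters times bytes per parameter, with runtime overhead on top. A quick back-of-envelope calculator:

```python
# Rough rule of thumb, not a precise requirement: weight memory is about
# (parameters x bytes per parameter); KV cache and runtime overhead add more.
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB at a given precision."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

# A 2B-active model at 4-bit needs ~1 GB for weights, so 4 GB RAM is plenty.
print(weight_gb(2, 4))
# An 8B model at 16-bit needs ~16 GB for weights alone, hence the 16 GB figure.
print(weight_gb(8, 16))
```

This is why the same model can appear with very different RAM requirements depending on the quantization it ships with.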

How Small Models Fit on a Laptop: Quantization

Full-precision LLMs store each parameter as a 32-bit float. Quantization shrinks those numbers to 4–8 bits by rounding each value to the nearest point on a discrete grid, slashing the memory requirement with surprisingly little quality loss.

What Quantization Does

  • Converts weights from 32-bit (or 16-bit) floats to 4-bit or 8-bit integers (the integer is the value's position on the grid)
  • A 7B-parameter model: ~14 GB at 16-bit → ~4 GB at 4-bit
  • Different quantizations ship as different model names, with the default usually being 4-bit
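The round-to-grid idea can be shown in a few lines. This is a toy symmetric quantizer in pure Python; real schemes (such as the per-block scales used in practice) are more elaborate, but the core step is the same.

```python
# Toy symmetric 4-bit quantization of a weight vector. Real quantization
# formats use per-block scales and other refinements; this just shows the
# round-to-grid idea: store an integer grid position plus one scale factor.

def quantize(weights, bits=4):
    """Map floats to signed integers on a 2**bits-level grid, plus a scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax     # grid spacing
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats: integer = location on the grid."""
    return [qi * scale for qi in q]

weights = [0.31, -0.07, 0.22, -0.55, 0.02]
q, scale = quantize(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
print(q)        # small integers, storable in 4 bits each
print(max_err)  # bounded by half the grid spacing (scale / 2)
```

Each weight now costs 4 bits instead of 32, at the price of a rounding error no larger than half a grid step, which is the memory/quality trade-off described above.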

What You Give Up

  • Slight loss in nuance on complex reasoning
  • Rare factual errors on edge cases
  • For most business tasks (drafting, summarization, code), the difference is negligible
  • 8-bit is nearly lossless; 4-bit is the sweet spot for laptops

Developer Tooling

For deployment, another consideration beyond the LLM is the coding harness. A coding agent is not just an LLM; it is a harness of tools wrapped around the model.

The model is the brain. The harness is everything else. On-prem means building the harness yourself.

The Claude Code Leak

On March 31, 2026, Anthropic accidentally published Claude Code’s full source code via an npm packaging error — making the architectural complexity of a production coding agent visible for the first time.

  • ~512,000 lines of code (not all production code)
  • ~40 permission-gated tools (file, terminal, git, web, IDE)

What happened next

  • claw-code: open-source reimplementation became the fastest-growing repo on GitHub
  • Provider-agnostic: works with Claude, OpenAI, or local models via Ollama
  • Other alternatives: Aider, OpenHands, Continue.dev