Module 7: Verification, Auditing, and Governance

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

The Trust Spectrum

AI with Code Execution: A Fast Junior Analyst

AI with code execution is like a fast junior analyst — capable but needs oversight. The level of oversight depends on the stakes.

Low-Stakes

  • Exploratory charts
  • Directional trends
  • Internal brainstorming

Quick sanity check.

Medium-Stakes

  • Team tools and skills
  • Recurring reports
  • Internal dashboards

Test with known answers.

High-Stakes

  • Client deliverables
  • Regulatory filings
  • Board presentations

Full verification protocol.

The more consequential the decision, the more you verify. Not distrust — calibrated confidence.

AI Makes Silent Analytical Errors

The Real Risk: Not Hallucination — Bugs

AI in code-execution mode doesn’t invent facts the way a chatbot might. Instead, it makes silent analytical errors: wrong filters, dropped rows, incorrect joins.

Common errors

  • Missing data silently dropped
  • Date or currency parsing surprises
  • Filters slightly off; aggregating before vs. after filtering

Why it’s dangerous

  • The output looks professional
  • Charts render cleanly and nobody questions a polished result
  • Confidence without verification is the real risk

Catching a Silent Error

The prompt

“What’s the average profit margin by sub-category? Exclude any sub-categories with negative total profit.”

The trap

  • Does AI filter before or after aggregating?
  • Does it drop rows with negative profit, or sub-categories with negative total profit?
  • The order matters — and AI often gets it wrong

How to catch it

  • “Show me the code” — inspect the filter logic
  • “How many rows are in the result?” — does it match your expectation?
  • “List the excluded sub-categories” — are they the right ones?

Lesson: The more specific the filter logic, the more likely AI misinterprets it. Always verify multi-step operations.
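The trap is easy to reproduce. Here is a minimal pandas sketch on invented toy data (the column names and numbers are illustrative, not the course dataset): the two readings of the prompt give different answers.

```python
import pandas as pd

# Invented toy data: one sub-category (Tables) has a losing row but a
# negative TOTAL profit; another (Chairs) is profitable throughout
df = pd.DataFrame({
    "SubCategory": ["Tables", "Tables", "Chairs", "Chairs"],
    "Sales":  [100.0, 200.0, 150.0, 250.0],
    "Profit": [-50.0,  30.0,  20.0,  40.0],
})
df["Margin"] = df["Profit"] / df["Sales"]

# Correct reading: aggregate first, then drop sub-categories whose
# TOTAL profit is negative
totals = df.groupby("SubCategory")["Profit"].sum()
keep = totals[totals >= 0].index
correct = df[df["SubCategory"].isin(keep)].groupby("SubCategory")["Margin"].mean()

# Common misreading: drop negative-profit ROWS first, then aggregate
wrong = df[df["Profit"] >= 0].groupby("SubCategory")["Margin"].mean()

print(correct)  # Tables excluded (total profit is -20)
print(wrong)    # Tables kept, because its losing row was silently dropped
```

Both results render as clean tables; only inspecting the filter logic reveals which question was actually answered.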

The Verification Checklist

A 4-Step Verification Protocol

Apply this to any AI-generated analysis.

  1. Sanity-check with known answers. Ask questions where you already know the result: row count, column names, a total you can verify in Excel.

  2. Ask AI to show its methodology. “Show me the code” or “Show me the SQL.” Read the logic, not just the output.

  3. Cross-check by rephrasing. Ask the same question differently. Start a fresh conversation and ask again. Do the answers agree?

  4. Spot-check edge cases. Test the smallest group, zero values, null fields, boundary dates. Errors hide at the edges.
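When the result comes from a data file, step 1 can be scripted. A minimal sketch on invented data showing the kinds of known-answer checks to run against an AI-produced summary:

```python
import pandas as pd

# Invented raw data and a summary of it (standing in for an AI result)
raw = pd.DataFrame({"region": ["E", "E", "W"], "sales": [10.0, 20.0, 5.0]})
summary = raw.groupby("region", as_index=False)["sales"].sum()

# 1. Known answers: row count and grand total must survive aggregation
assert len(raw) == 3                                 # count you can verify by eye
assert summary["sales"].sum() == raw["sales"].sum()  # totals reconcile

# 2. Column names match what you asked for
assert list(summary.columns) == ["region", "sales"]

# 3. One figure you can check in Excel: the East region total
assert summary.loc[summary["region"] == "E", "sales"].item() == 30.0
```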

Trust is a process, not a setting.

Every AI output deserves the same scrutiny you’d give a new analyst’s first deliverable.

Hands-On: Validating Prior Work

Reproducibility Test

Take analysis from earlier modules and test whether it reproduces.

The exercise

  1. Pick a financial analysis from Module 2 (portfolio) or Module 4 (DCF)
  2. Start a fresh conversation
  3. Give the same data and ask the same question
  4. Compare: do the results match?

What you’re testing

  • Reproducibility: Same data, same question — same answer?
  • Methodology: Did AI use the same approach both times?
  • Stability: Small rephrasing shouldn’t change the result

If the numbers don’t match, that’s not a failure — it’s exactly why we verify. And it’s exactly why skills (Module 3) matter.

Auditing a DCF Model

Take the DCF built in Module 4. Start a fresh conversation. Ask Claude to critique it.

What to ask

  • “Identify 3+ questionable assumptions”
  • “Check formula consistency — does FCF match the pro formas?”
  • “Is the terminal value reasonable? Compare implied terminal growth to GDP growth.”

Why this matters

  • The most valuable use of AI is often having it check work, not produce it
  • Cross-evaluation: use a second conversation to critique the first
  • Compare original vs. revised valuations
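The terminal-value check can also be done by hand. A sketch with hypothetical numbers that backs the implied growth rate out of the Gordon growth formula TV = FCF(T+1) / (WACC - g):

```python
def implied_terminal_growth(terminal_value, next_year_fcf, wacc):
    """Back out g from the Gordon growth formula TV = FCF_{T+1} / (WACC - g)."""
    return wacc - next_year_fcf / terminal_value

# Hypothetical model outputs, for illustration only
g = implied_terminal_growth(terminal_value=1_000.0, next_year_fcf=60.0, wacc=0.09)
print(f"implied terminal growth: {g:.1%}")  # 3.0%

if g > 0.04:  # long-run nominal GDP growth is a common sanity ceiling
    print("Red flag: implied terminal growth exceeds plausible GDP growth")
```

If the implied growth rate exceeds long-run GDP growth, the terminal value assumes the company eventually outgrows the economy forever.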

Red-Teaming a Deployed Skill

A database skill is the kind of tool people stop questioning once deployed. Test it like any tool your team depends on.

Adversarial queries

  1. Query for nonexistent data (“Sales in 2030”)
  2. Request something not in the schema (“Customer satisfaction scores”)
  3. Use ambiguous language (“Best employee”)

For each failure

  • Diagnose: Bad SQL? Missing context? Wrong assumption?
  • Fix: Add clarification, examples, or constraints to SKILL.md
  • Retest: Run the query again after the fix

This is how production skills get hardened. Every failure makes the skill better.
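The retest step is easy to script. A sketch using an in-memory SQLite database as a stand-in for the real one; the SQL shown is the kind a skill might plausibly emit for case 1, and the failure mode is that SUM over zero rows returns NULL rather than an error:

```python
import sqlite3

# Tiny stand-in database (invented data)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (year INT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [(2023, 100.0), (2024, 120.0)])

def run_sql(sql):
    return con.execute(sql).fetchall()

# Adversarial case: nonexistent data ("Sales in 2030") and the SQL a
# skill might emit for it
cases = [
    ("Sales in 2030", "SELECT SUM(amount) FROM sales WHERE year = 2030"),
]

for question, sql in cases:
    rows = run_sql(sql)
    # SUM over zero rows returns NULL, not an error -- exactly the kind of
    # silent failure the skill should catch and explain to the user
    if rows and rows[0][0] is None:
        print(f"{question!r}: no matching data; skill should say so, not show a blank")
```

A hardened SKILL.md would tell the model to check for empty results and say "no data for that period" instead of presenting a blank.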

Red-Teaming Protocols

Systematic Red-Teaming

Red-teaming goes beyond ad-hoc testing. A protocol ensures you test every failure mode systematically.

| Test Category | What You Try |
|---|---|
| Out-of-scope questions | Ask about data that doesn’t exist in the schema |
| Conflicting documents | Upload docs with contradictory numbers |
| Edge cases | Zero values, null fields, single-row groups, boundary dates |
| Hand-written SQL comparison | Write the query yourself and compare to AI’s result |
| Ambiguous language | “Best customer” — by revenue? frequency? margin? |

A skill that hasn’t been red-teamed is a liability, not an asset.

Red-Teaming a DCF Model

Apply red-teaming protocols to the DCF models built in Module 4.

Financial Red-Team Tests

  1. Set revenue growth to 0% — does the model break?
  2. Use a WACC below the risk-free rate — does AI flag it?
  3. Feed a company with negative FCF and compare AI’s output to a hand-built Excel model

What You Learn

  • Where the model silently produces nonsense
  • Which assumptions the AI accepts without questioning
  • How to improve guardrails so the verification checklist catches the errors
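Tests 1 and 2 can be codified as input guardrails. A sketch; the function name and thresholds are illustrative, not a standard:

```python
def check_dcf_inputs(revenue_growth, wacc, risk_free_rate, fcf):
    """Flag inputs a DCF should question instead of silently accepting.
    Thresholds here are illustrative, not a standard."""
    warnings = []
    if wacc <= risk_free_rate:
        warnings.append("WACC at or below the risk-free rate: discounting rewards risk")
    if wacc <= revenue_growth:
        warnings.append("growth >= WACC: a Gordon terminal value blows up or goes negative")
    if all(f < 0 for f in fcf):
        warnings.append("all projected FCF negative: terminal value drives the entire valuation")
    return warnings

# Hypothetical red-team inputs: zero growth, WACC below the risk-free rate,
# and negative FCF throughout
flags = check_dcf_inputs(revenue_growth=0.0, wacc=0.03, risk_free_rate=0.04,
                         fcf=[-5.0, -2.0])
for f in flags:
    print("FLAG:", f)
```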

Audit Logging

JSONL Audit Logs

Every AI interaction in a production system should produce an audit log entry. The standard format is JSONL (one JSON object per line).

What to Log

  • Identity: timestamp, user, model
  • Content: prompt and response
  • Metadata: token count, latency, cost

Why It Matters

  • Compliance: Regulators can audit what the AI said
  • Debugging: Trace errors back to the exact prompt
  • Cost and quality: Track token usage and identify common failures for skill updates

Store logs locally or in a cloud database. If it’s not logged, it didn’t happen.
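Writing such a log takes only a few lines of Python. A sketch; the field names are illustrative, not a standard schema:

```python
import json
import time

def log_interaction(path, user, model, prompt, response,
                    tokens, latency_s, cost_usd):
    """Append one audit record as a single JSON line (JSONL)."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user": user,
        "model": model,
        "prompt": prompt,
        "response": response,
        "tokens": tokens,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")  # one object per line

# Hypothetical interaction
log_interaction("audit.jsonl", "kerry", "claude-sonnet",
                "Average margin by sub-category?", "...",
                tokens=812, latency_s=2.4, cost_usd=0.004)
```

Because each line is a complete JSON object, the file can be appended forever and parsed line by line for compliance review or cost tracking.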

Three Layers of Auditability

Defense in Depth

📄 Layer 1: RAG Citations · 💻 Layer 2: Code Transparency · 📋 Layer 3: Audit Logs

  1. RAG citations: Ground answers in source documents with traceable references (Module 8). AI answers based on your content, not its training data.

  2. Code transparency: AI displays the SQL or Python it used. Analysts review the logic, not just the answer.

  3. Audit logs: Every prompt, response, and token count recorded in JSONL. Complete trail of every AI decision.

The more layers, the harder it is for errors to reach your team.

Security and Governance

Data Policies: Training and Retention

Before sending data to any AI provider, know the answers to two questions.

Will they train on your data?

  • Consumer tiers (free ChatGPT, free Claude) may use your input to improve models
  • API and enterprise tiers typically do not
  • Check your agreement and settings

How long do they keep it?

  • Providers retain prompts for a window (often 30 days) for safety monitoring
  • Enterprise contracts may offer zero retention
  • Know the policy before you send sensitive data

The Schema-Only Pattern

AI can analyze sensitive data without ever reading it. It only needs the table structure.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 120, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  S["<b>Schema</b><br/><i>table names, columns</i>"] -->|"safe to share"| AI["<b>AI</b>"]
  AI --> SQL["<b>SQL Code</b>"]
  SQL --> L["<b>Runs Locally</b>"]
  DB["<b>Your Database</b>"] -->|"data stays here"| L

  style S fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
  style AI fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a
  style SQL fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a
  style L fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
  style DB fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a

How it works

  1. Describe tables in instructions: names, columns, types
  2. Ask AI to write the query
  3. AI writes SQL from the schema alone; the script runs locally and output stays local

Why it’s compliant

  • Table metadata is not PII — safe to share
  • Sensitive records stay on your server
  • AI produces the tool, not the output
  • Works for FERPA, HIPAA, SOX, GDPR
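The whole pattern fits in a short script. A sketch using SQLite, where an in-memory database stands in for the real one and `ai_written_sql` stands in for the query the AI returns:

```python
import sqlite3

# Step 1: the only thing shared with the AI -- table structure, no rows
schema = "CREATE TABLE grades (student_id INT, course TEXT, grade REAL)"
# Safe to paste into a prompt: metadata, not records

# Step 2: ask the AI to write the query from the schema alone, e.g.
#   "Given this schema, write SQL for the average grade per course."
# Step 3: run the SQL it returns locally; the data never leaves this machine
ai_written_sql = "SELECT course, AVG(grade) FROM grades GROUP BY course"  # illustrative

con = sqlite3.connect(":memory:")  # stands in for your real database
con.execute(schema)
con.executemany("INSERT INTO grades VALUES (?, ?, ?)",
                [(1, "FIN500", 3.7), (2, "FIN500", 3.3)])  # invented records
print(con.execute(ai_written_sql).fetchall())  # output stays local too
```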

Improving SQL Quality: Schema + Synthetic Rows

Schema alone works, but adding ~3 synthetic rows per table makes the generated SQL significantly more accurate.

Why sample rows help

  • Reveal data formats (is status 'Active' or 1?)
  • Disambiguate columns (is code a country code or product code?)
  • Help the LLM write correct WHERE clauses with properly formatted literals

Rules

  • Rows are synthetic — fabricated to match real formats without containing actual records
  • Three rows per table is the sweet spot; more tends to distract the model
  • Databricks: ~3 rows per table raises SQL accuracy by ~6 pp
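A prompt built this way might look like the following sketch; the table, columns, and all three rows are invented:

```python
# Schema plus three fabricated, format-matching rows per table.
# No actual records appear anywhere in the prompt.
schema_block = """
TABLE customers (id INT, country_code TEXT, status TEXT, signup_date TEXT)

Synthetic sample rows (fabricated, format-matching only):
(1001, 'US', 'Active',  '2024-03-15')
(1002, 'DE', 'Churned', '2023-11-02')
(1003, 'JP', 'Active',  '2024-01-28')
"""

prompt = (
    "Using only this schema and the sample-row formats, write SQL that "
    "counts active customers per country:\n" + schema_block
)
print(prompt)
```

The sample rows tell the model that `status` is a text label ('Active'), not a numeric flag, so the WHERE clause it writes will use the right literal.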

Where Should AI Infrastructure Live?

When you connect an AI agent to enterprise data, you face a fundamental question: where does the computation happen?

Public Cloud AI

  • Data sent to provider (OpenAI, Anthropic, Google)
  • Fastest to deploy
  • Provider controls infrastructure

Private / On-Prem AI

  • Data never leaves your network
  • Full control over access
  • Higher upfront cost

Hybrid

  • Sensitive data stays on-prem
  • Non-sensitive tasks use cloud AI
  • Most common enterprise pattern

Antonio Neri, CEO of HPE: “In the AI era, where data sovereignty, privacy and control of data matter, the industry is seeing a shift back to private clouds.”

Data Sovereignty

Data sovereignty means controlling where your data lives, who can access it, and what jurisdiction governs it.

Why it matters for AI

  • AI agents may transmit query results to external LLMs
  • Regulations (GDPR, HIPAA, SOX) may restrict where data can go
  • Defense agencies require air-gapped environments

Enterprise examples

  • DISA (U.S. DoD): $931M on-prem private cloud
  • Barclays: 50K+ workloads to private cloud
  • 2degrees (NZ): sovereign AI for data residency

Data Security: Running AI Locally

The Case for Local AI

Cloud AI is convenient, but local models keep data inside your machine — no transmission, no retention policy, no provider to trust.

When local AI makes sense

  • Client data under NDA
  • HIPAA, GDPR, or FERPA-covered records
  • Proprietary deal models and memos
  • Regulated industries (banking, defense, healthcare)

The trade-off

|  | Cloud AI | Local AI |
|---|---|---|
| Model quality | ★★★★★ | ★★★☆☆ |
| Data privacy | Provider policy | 100% local |
| Setup | None | Minutes |
| Compute | Unlimited | Your hardware |

For narrow, sensitive tasks, a smaller local model is often the right call.

Ollama: Local LLMs in One Installer

Ollama lets you run open-source models on your laptop — no cloud, no API key, no data leaving your machine.

How it works

  1. Download and run the Ollama installer
  2. Pull a model (e.g., llama3.2)
  3. Chat locally via a browser UI or the terminal
  4. All inference runs on your CPU/GPU

Available models

  • Llama 3.1 (Meta, 8B) — strong reasoning
  • Llama 3.2 (Meta, 3B) — faster, lighter
  • Mistral, Gemma, Phi — many options
  • Models download once and run offline
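Beyond the browser UI and terminal, Ollama also serves a local HTTP API (default port 11434), so local models can be scripted. A sketch using only the standard library; it assumes the Ollama app is running and the model has already been pulled:

```python
import json
import urllib.request

def build_payload(model, prompt):
    """Request body for Ollama's local /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(model, prompt, host="http://localhost:11434"):
    """Send one prompt to a locally running Ollama server and return its reply.
    Nothing leaves the machine: the request goes to localhost."""
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires the Ollama server running and llama3.2 pulled):
#   print(ask_ollama("llama3.2", "One-line definition of WACC."))
```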

Exercise: Install Ollama and Try Local LLMs

Exercise: Running a Local Model

Setup

  1. Go to ollama.com and download the installer for your OS

  2. Run the installer (no terminal needed)

  3. Once installed, open a terminal and run:

    ollama run llama3.2
  4. Then try:

    ollama run llama3.1

The prompt

Give each model this prompt and compare their responses:

“A private equity firm is evaluating a manufacturing company with $50M in revenue, 12% EBITDA margins, and 8% annual revenue growth. Comparable transactions have been done at 7–9x EBITDA. What is a reasonable valuation range, and what are the top three risks you would flag before proceeding?”

What to notice: quality of reasoning, how each model handles uncertainty, and whether either refuses or hedges differently.

Both models run entirely on your machine. No data is sent anywhere.

AI Adoption Roadmap

The Sandbox-Audit-Deploy Pattern

Before deploying any AI workflow to your team, follow three phases.

  🧪 1. Sandbox: test with dummy data

  🔍 2. Audit: review logs and logic

  🚀 3. Deploy: roll out with monitoring

Test with synthetic data, review logs and logic, then roll out with weekly monitoring.

Adoption Timeline: From Quick Wins to Strategic

Each adoption stage maps to a trust level.

  1. Quick wins (weeks) — light verification:
    • Chat for research, summarization, drafts
    • Upload spreadsheets for ad-hoc analysis
  2. Medium-term (months) — test-before-deploy:
    • Skills for recurring workflows (Module 3)
    • MCP integrations and Streamlit apps (Modules 7, 8)
  3. Strategic (quarters) — full audit trail:
    • Production apps with authentication
    • AI usage policy and governance framework

Small Language Models and Fine-Tuning

Improving Reliability: Fine-Tuning and SLMs

A complementary approach to verification: make the model itself more reliable for your tasks.

Fine-tuning an LLM

  • Train on your domain data (10-Ks, memos, deal docs)
  • Model learns your terminology and conventions
  • Fewer silent errors on domain-specific tasks

Small Language Models (SLMs)

  • ~1–8B parameters (vs. hundreds of billions)
  • Fine-tuned in hours on a single GPU
  • Run on-premise — data never leaves your network
  • Specialized small models can outperform large models on narrow tasks

Why SLMs for Agentic Systems?

Belcak et al. (2025), “Small Language Models are the Future of Agentic AI”:

  • Most agent subtasks are repetitive, scoped, and non-conversational — classifying documents, extracting fields, formatting outputs
  • SLMs fine-tuned on a single task outperform LLMs on format compliance and consistency
  • 10–30x cheaper in latency, energy, and compute vs. large models
  • On-device deployment enables real-time, offline inference with full data privacy
  • Agents can use multiple models: an LLM for complex reasoning, SLMs for routine subtasks

From LLM to SLM: The Conversion Process

The paper proposes a six-step migration path:

  📝 1. Log: agent interactions (prompts, responses, tool calls)

  🧹 2. Curate: remove PII, filter for quality

  🔗 3. Cluster: identify recurring task patterns

  📐 4. Select: an SLM sized to each task’s complexity

  ⚙️ 5. Fine-tune: LoRA with 10K–100K examples

  🔄 6. Repeat: retrain as usage patterns evolve

The audit logs from earlier in this module are exactly the data you need for step 1.
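A first pass at steps 1–3 can be very simple. A sketch that buckets logged prompts by their leading verb; a real pipeline would cluster embeddings, but the shape of the analysis is the same:

```python
import json
from collections import Counter

# Invented audit-log lines, as a JSONL logger would produce
log_lines = [
    '{"prompt": "Extract the invoice total from this PDF", "model": "llm"}',
    '{"prompt": "Extract the invoice date from this PDF", "model": "llm"}',
    '{"prompt": "Summarize this 10-K risk section", "model": "llm"}',
]

# Naive "clustering": bucket prompts by their first word to surface
# recurring task shapes worth delegating to a fine-tuned SLM
buckets = Counter(json.loads(line)["prompt"].split()[0].lower()
                  for line in log_lines)
print(buckets.most_common())  # [('extract', 2), ('summarize', 1)]
```

Tasks that dominate the histogram are the first candidates for a small, specialized model.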

Summary

Verification

  • Trust spectrum
  • 4-step protocol
  • Silent errors

Red-Teaming

  • Systematic protocols
  • Edge cases
  • SQL comparison

Audit Logs

  • JSONL format
  • Compliance trail
  • Cost tracking

Governance

  • Data policies
  • Schema-only
  • Adoption roadmap

Verify, govern, then deploy.