AI with code execution is like a fast junior analyst — capable but needs oversight. The level of oversight depends on the stakes.
Low-Stakes
Exploratory charts
Directional trends
Internal brainstorming
Quick sanity check.
Medium-Stakes
Team tools and skills
Recurring reports
Internal dashboards
Test with known answers.
High-Stakes
Client deliverables
Regulatory filings
Board presentations
Full verification protocol.
The more consequential the decision, the more you verify. Not distrust — calibrated confidence.
AI Makes Silent Analytical Errors
The Real Risk: Not Hallucination — Bugs
AI in code-execution mode doesn’t invent facts the way a chatbot might. Instead, it makes silent analytical errors: wrong filters, dropped rows, incorrect joins.
Common errors
Missing data silently dropped
Date or currency parsing surprises
Filters slightly off; aggregating before vs. after filtering
Why it’s dangerous
The output looks professional
Charts render cleanly and nobody questions a polished result
Confidence without verification is the real risk
Catching a Silent Error
The prompt
“What’s the average profit margin by sub-category? Exclude any sub-categories with negative total profit.”
The trap
Does AI filter before or after aggregating?
Does it drop rows with negative profit, or sub-categories with negative total profit?
The order matters — and AI often gets it wrong
How to catch it
“Show me the code” — inspect the filter logic
“How many rows are in the result?” — does it match your expectation?
“List the excluded sub-categories” — are they the right ones?
Lesson: The more specific the filter logic, the more likely AI misinterprets it. Always verify multi-step operations.
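The two readings of the prompt can be made concrete in a few lines of pandas. This is a minimal sketch with made-up Superstore-style data; the sub_category, sales, and profit columns are illustrative:

```python
import pandas as pd

# Made-up order data; column names are illustrative.
df = pd.DataFrame({
    "sub_category": ["Tables", "Tables", "Chairs", "Chairs", "Binders"],
    "sales":        [1000,     800,      500,      700,      300],
    "profit":       [-200,     50,       100,      150,      30],
})
df["margin"] = df["profit"] / df["sales"]

# Wrong reading: drop individual ROWS with negative profit, then aggregate.
wrong = (df[df["profit"] >= 0]
         .groupby("sub_category")["margin"].mean())

# Intended reading: aggregate first, then drop SUB-CATEGORIES whose total profit is negative.
totals = df.groupby("sub_category")["profit"].sum()
keep = totals[totals >= 0].index
right = (df[df["sub_category"].isin(keep)]
         .groupby("sub_category")["margin"].mean())

print(wrong.index.tolist())  # ['Binders', 'Chairs', 'Tables'] — Tables survives because its negative rows were dropped
print(right.index.tolist())  # ['Binders', 'Chairs'] — Tables excluded: its total profit is -150
```

Same prompt, two defensible interpretations, different answers. "List the excluded sub-categories" catches the difference immediately.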
The Verification Checklist
A 4-Step Verification Protocol
Apply this to any AI-generated analysis.
1. Sanity-check with known answers: ask questions where you already know the result (row count, column names, a total you can verify in Excel).
2. Ask AI to show its methodology: "Show me the code" or "Show me the SQL." Read the logic, not just the output.
3. Cross-check by rephrasing: ask the same question differently, or start a fresh conversation and ask again. Do the answers agree?
4. Spot-check edge cases: test the smallest group, zero values, null fields, boundary dates. Errors hide at the edges.
Trust is a process, not a setting.
Every AI output deserves the same scrutiny you’d give a new analyst’s first deliverable.
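Step 1 of the protocol can be captured as a small helper. A sketch, assuming you have the AI result as a DataFrame; the function name, columns, and expected values are illustrative:

```python
import pandas as pd

def sanity_check(result: pd.DataFrame, expected_rows: int,
                 expected_total: float, total_col: str, tol: float = 0.01) -> list:
    """Compare an AI result against answers you already know (step 1 of the protocol)."""
    failures = []
    if len(result) != expected_rows:
        failures.append(f"row count {len(result)} != expected {expected_rows}")
    total = result[total_col].sum()
    if abs(total - expected_total) > tol:
        failures.append(f"{total_col} total {total} != expected {expected_total}")
    return failures

# Hypothetical AI output; the expected values come from a quick Excel check.
ai_result = pd.DataFrame({"region": ["East", "West"], "revenue": [120.0, 80.0]})
print(sanity_check(ai_result, expected_rows=2, expected_total=200.0, total_col="revenue"))
# [] — empty list means the known answers match
```

An empty list is not proof of correctness; it just means the checks you chose all passed.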
Hands-On: Validating Prior Work
Reproducibility Test
Take an analysis from an earlier module and test whether it reproduces.
The exercise
Pick a financial analysis from Module 2 (portfolio) or Module 4 (DCF)
Start a fresh conversation
Give the same data and ask the same question
Compare: do the results match?
What you’re testing
Reproducibility: Same data, same question — same answer?
Methodology: Did AI use the same approach both times?
Stability: Small rephrasing shouldn’t change the result
If the numbers don’t match, that’s not a failure — it’s exactly why we verify. And it’s exactly why skills (Module 3) matter.
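A quick way to compare two runs, assuming you copy each conversation's headline metrics into a dict. The metric names and values below are made up; a 1% tolerance is an arbitrary choice, not a standard:

```python
import math

# Hypothetical: portfolio metrics from two independent conversations (Module 2 rerun).
run_1 = {"annual_return": 0.0812, "volatility": 0.1504, "sharpe": 0.54}
run_2 = {"annual_return": 0.0812, "volatility": 0.1501, "sharpe": 0.54}

def compare_runs(a: dict, b: dict, rel_tol: float = 0.01) -> dict:
    """Return the metrics that differ by more than rel_tol between two runs."""
    return {k: (a[k], b[k]) for k in a
            if not math.isclose(a[k], b[k], rel_tol=rel_tol)}

mismatches = compare_runs(run_1, run_2)
print(mismatches or "runs agree within 1%")  # → runs agree within 1%
```

Tiny numerical differences are expected; a metric that moves materially between runs is what you investigate.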
Auditing a DCF Model
Take the DCF built in Module 4. Start a fresh conversation. Ask Claude to critique it.
What to ask
“Identify 3+ questionable assumptions”
“Check formula consistency — does FCF match the pro formas?”
“Is the terminal value reasonable? Compare implied terminal growth to GDP growth.”
Why this matters
The most valuable use of AI is often having it check work, not produce it
Cross-evaluation: use a second conversation to critique the first
Compare original vs. revised valuations
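The terminal-value check is one you can do by hand: inverting the Gordon growth formula TV = FCF × (1 + g) / (WACC − g) gives the implied growth rate g. A sketch with hypothetical Module 4 numbers:

```python
def implied_terminal_growth(terminal_value: float, final_fcf: float, wacc: float) -> float:
    """Invert TV = FCF * (1 + g) / (WACC - g) to solve for the implied growth rate g."""
    return (terminal_value * wacc - final_fcf) / (terminal_value + final_fcf)

# Hypothetical inputs: $12M final-year FCF, 9% WACC, $180M terminal value.
g = implied_terminal_growth(180.0, 12.0, 0.09)
print(f"implied terminal growth: {g:.1%}")  # flag it if this exceeds long-run GDP growth (~2-3%)
```

If the implied g comes out above long-run GDP growth, the terminal value is assuming the company outgrows the economy forever.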
Red-Teaming a Deployed Skill
A database skill is the kind of tool people stop questioning once deployed. Test it like any tool your team depends on.
Adversarial queries
Query for nonexistent data (“Sales in 2030”)
Request something not in the schema (“Customer satisfaction scores”)
Use ambiguous language (“Best employee”)
For each failure
Diagnose: Bad SQL? Missing context? Wrong assumption?
Fix: Add clarification, examples, or constraints to SKILL.md
Retest: Run the query again after the fix
This is how production skills get hardened. Every failure makes the skill better.
Red-Teaming Protocols
Systematic Red-Teaming
Red-teaming goes beyond ad-hoc testing. A protocol ensures you test every failure mode systematically.
Test category: what you try
Out-of-scope questions: ask about data that doesn’t exist in the schema
Conflicting documents: upload docs with contradictory numbers
Edge cases: zero values, null fields, single-row groups, boundary dates
Hand-written SQL comparison: write the query yourself and compare to AI’s result
Ambiguous language: “Best customer” — by revenue? frequency? margin?
A skill that hasn’t been red-teamed is a liability, not an asset.
Red-Teaming a DCF Model
Apply red-teaming protocols to the DCF models built in Module 4.
Financial Red-Team Tests
Set revenue growth to 0% — does the model break?
Use a WACC below the risk-free rate — does AI flag it?
Feed a company with negative FCF and compare AI’s output to a hand-built Excel model
What You Learn
Where the model silently produces nonsense
Which assumptions the AI accepts without questioning
How to improve guardrails so the verification checklist catches the errors
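The red-team tests above can be captured as guardrail code that runs before you trust a valuation. A sketch; the function name, messages, and checks are illustrative, not a standard:

```python
def red_team_dcf(wacc: float, risk_free: float, revenue_growth: float,
                 fcf_projection: list) -> list:
    """Guardrail checks mirroring the red-team tests above; thresholds are illustrative."""
    flags = []
    if wacc <= risk_free:
        flags.append("WACC at or below the risk-free rate: discounting implies no risk premium")
    if revenue_growth == 0.0:
        flags.append("0% revenue growth: confirm the model still produces finite values")
    if all(fcf < 0 for fcf in fcf_projection):
        flags.append("all projected FCF negative: terminal-value math may produce nonsense")
    return flags

# All three red-team conditions at once: each should be flagged.
print(red_team_dcf(wacc=0.03, risk_free=0.045, revenue_growth=0.0,
                   fcf_projection=[-5.0, -3.0, -1.0]))
```

A clean model raises no flags; a model that accepts these inputs silently is exactly the liability you are hunting for.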
Audit Logging
JSONL Audit Logs
Every AI interaction in a production system should produce an audit log entry. The standard format is JSONL (one JSON object per line).
What to Log
Identity: timestamp, user, model
Content: prompt and response
Metadata: token count, latency, cost
Why It Matters
Compliance: Regulators can audit what the AI said
Debugging: Trace errors back to the exact prompt
Cost and quality: Track token usage and identify common failures for skill updates
Store logs locally or in a cloud database. If it’s not logged, it didn’t happen.
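A minimal JSONL logger sketch. The field names follow the list above; the file path and helper name are arbitrary choices:

```python
import json
import time
import uuid

def log_interaction(path: str, user: str, model: str, prompt: str,
                    response: str, tokens: int, latency_ms: float) -> dict:
    """Append one audit entry as a single JSON object on its own line (JSONL)."""
    entry = {
        "id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user": user,
        "model": model,
        "prompt": prompt,
        "response": response,
        "tokens": tokens,
        "latency_ms": latency_ms,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

entry = log_interaction("audit.jsonl", user="analyst@firm.com", model="claude",
                        prompt="Q3 revenue by region?", response="...",
                        tokens=412, latency_ms=830.0)
```

Because each line is a complete JSON object, the log can be grepped, streamed, or loaded line by line into any analysis tool.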
Three Layers of Auditability
Defense in Depth
📄
Layer 1
RAG Citations
💻
Layer 2
Code Transparency
📋
Layer 3
Audit Logs
RAG citations: Ground answers in source documents with traceable references (Module 8). AI answers based on your content, not its training data.
Code transparency: AI displays the SQL or Python it used. Analysts review the logic, not just the answer.
Audit logs: Every prompt, response, and token count recorded in JSONL. Complete trail of every AI decision.
The more layers, the harder it is for errors to reach your team.
Security and Governance
Data Policies: Training and Retention
Before sending data to any AI provider, know the answers to two questions.
Will they train on your data?
Consumer tiers (free ChatGPT, free Claude) may use your input to improve models
API and enterprise tiers typically do not
Check your agreement and settings
How long do they keep it?
Providers retain prompts for a window (often 30 days) for safety monitoring
Enterprise contracts may offer zero retention
Know the policy before you send sensitive data
The Schema-Only Pattern
AI can analyze sensitive data without ever reading it. It only needs the table structure.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 120, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
S["<b>Schema</b><br/><i>table names, columns</i>"] -->|"safe to share"| AI["<b>AI</b>"]
AI --> SQL["<b>SQL Code</b>"]
SQL --> L["<b>Runs Locally</b>"]
DB["<b>Your Database</b>"] -->|"data stays here"| L
style S fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
style AI fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a
style SQL fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a
style L fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
style DB fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
How it works
Describe tables in instructions: names, columns, types
Ask AI to write the query
AI writes SQL from the schema alone; the script runs locally and output stays local
Why it’s compliant
Table metadata is not PII — safe to share
Sensitive records stay on your server
AI produces the tool, not the output
Works for FERPA, HIPAA, SOX, GDPR
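A self-contained sketch of the pattern using sqlite3. The schema text is the only thing that would be shared with the AI; the "AI-generated" SQL string and the sample rows are made up for illustration:

```python
import sqlite3

# Only this schema text is shared with the AI; no rows ever leave the machine.
SCHEMA = """
CREATE TABLE orders (
    order_id  INTEGER PRIMARY KEY,
    region    TEXT,   -- e.g. 'East', 'West'
    revenue   REAL
);
"""

# Hypothetical SQL the AI wrote from the schema alone.
AI_SQL = "SELECT region, SUM(revenue) AS total FROM orders GROUP BY region ORDER BY region"

# The sensitive data lives only in the local database.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "East", 120.0), (2, "West", 80.0), (3, "East", 30.0)])

print(conn.execute(AI_SQL).fetchall())  # [('East', 150.0), ('West', 80.0)]
```

The AI saw table and column names; the rows, and the result, never left the local process.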
Improving SQL Quality: Schema + Synthetic Rows
Schema alone works, but adding ~3 synthetic rows per table makes the generated SQL significantly more accurate.
Why sample rows help
Reveal data formats (is status 'Active' or 1?)
Disambiguate columns (is code a country code or product code?)
Help the LLM write correct WHERE clauses with properly formatted literals
Rules
Rows are synthetic — fabricated to match real formats without containing actual records
Three rows per table is the sweet spot; more tends to distract the model
Databricks: ~3 rows per table raises SQL accuracy by ~6 pp
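A sketch of what a schema-plus-synthetic-rows prompt might look like. The table, columns, and rows below are fabricated for illustration; the point is that the rows show formats, not real records:

```python
# The rows are synthetic: fabricated to match real formats, not copied from the database.
schema = "customers(customer_id INTEGER, country_code TEXT, status TEXT, signup_date TEXT)"
synthetic_rows = [
    (1001, "US", "Active",  "2024-03-15"),
    (1002, "DE", "Churned", "2023-11-02"),
    (1003, "JP", "Active",  "2024-01-20"),
]

prompt = (
    f"Table schema: {schema}\n"
    "Example rows (synthetic, showing formats only):\n"
    + "\n".join(str(r) for r in synthetic_rows)
    + "\nWrite SQL to count active customers by country."
)
print(prompt)
```

Three rows are enough to show that status is the text 'Active', not a numeric flag, and that country_code is an ISO-style two-letter code, so the generated WHERE clauses use correctly formatted literals.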
Where Should AI Infrastructure Live?
When you connect an AI agent to enterprise data, you face a fundamental question: where does the computation happen?
Public Cloud AI
Data sent to provider (OpenAI, Anthropic, Google)
Fastest to deploy
Provider controls infrastructure
Private / On-Prem AI
Data never leaves your network
Full control over access
Higher upfront cost
Hybrid
Sensitive data stays on-prem
Non-sensitive tasks use cloud AI
Most common enterprise pattern
Antonio Neri, CEO of HPE: “In the AI era, where data sovereignty, privacy and control of data matter, the industry is seeing a shift back to private clouds.”
Data Sovereignty
Data sovereignty means controlling where your data lives, who can access it, and what jurisdiction governs it.
Why it matters for AI
AI agents may transmit query results to external LLMs
Regulations (GDPR, HIPAA, SOX) may restrict where data can go
Defense agencies require air-gapped environments
Enterprise examples
DISA (U.S. DoD): $931M on-prem private cloud
Barclays: 50K+ workloads to private cloud
2degrees (NZ): sovereign AI for data residency
Data Security: Running AI Locally
The Case for Local AI
Cloud AI is convenient, but local models keep data inside your machine — no transmission, no retention policy, no provider to trust.
For narrow, sensitive tasks, a smaller local model is often the right call.
Ollama: Local LLMs in One Installer
Ollama lets you run open-source models on your laptop — no cloud, no API key, no data leaving your machine.
How it works
Download and run the Ollama installer
Pull a model (e.g., llama3.2)
Chat locally via a browser UI or the terminal
All inference runs on your CPU/GPU
Available models
Llama 3.1 (Meta, 8B) — strong reasoning
Llama 3.2 (Meta, 3B) — faster, lighter
Mistral, Gemma, Phi — many options
Models download once and run offline
Exercise: Install Ollama and Try Local LLMs
Exercise: Running a Local Model
Setup
Go to ollama.com and download the installer for your OS
Run the installer (no terminal needed)
Once installed, open a terminal and run:
ollama run llama3.2
Then try:
ollama run llama3.1
The prompt
Give each model this prompt and compare their responses:
“A private equity firm is evaluating a manufacturing company with $50M in revenue, 12% EBITDA margins, and 8% annual revenue growth. Comparable transactions have been done at 7–9x EBITDA. What is a reasonable valuation range, and what are the top three risks you would flag before proceeding?”
What to notice: quality of reasoning, how each model handles uncertainty, and whether either refuses or hedges differently.
Both models run entirely on your machine. No data is sent anywhere.
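For scripted use, the Ollama server also exposes a local HTTP API. This sketch assumes the default endpoint on localhost:11434 and an already-pulled llama3.2 model; the helper names are arbitrary:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False asks for a single JSON response instead of a token stream.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_local(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the model's reply text."""
    req = urllib.request.Request(OLLAMA_URL, data=build_payload(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# The request body that would be sent; nothing leaves localhost.
print(build_payload("llama3.2", "One-sentence summary of WACC.").decode())
```

With the Ollama app running, `ask_local("llama3.2", prompt)` returns the reply entirely from local inference; the same loop over `["llama3.2", "llama3.1"]` automates the model comparison in this exercise.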
AI Adoption Roadmap
The Sandbox-Audit-Deploy Pattern
Before deploying any AI workflow to your team, follow three phases.
🧪
Sandbox
Test with dummy data
🔍
Audit
Review logs and logic
🚀
Deploy
Roll out with monitoring
Test with synthetic data, review logs and logic, then roll out with weekly monitoring.
Adoption Timeline: From Quick Wins to Strategic
Each adoption stage maps to a trust level.
Quick wins (weeks) — light verification:
Chat for research, summarization, drafts
Upload spreadsheets for ad-hoc analysis
Medium-term (months) — test-before-deploy:
Skills for recurring workflows (Module 3)
MCP integrations and Streamlit apps (Modules 7, 8)
Strategic (quarters) — full audit trail:
Production apps with authentication
AI usage policy and governance framework
Small Language Models and Fine-Tuning
Improving Reliability: Fine-Tuning and SLMs
A complementary approach to verification: make the model itself more reliable for your tasks.
Fine-tuning an LLM
Train on your domain data (10-Ks, memos, deal docs)
Model learns your terminology and conventions
Fewer silent errors on domain-specific tasks
Small Language Models (SLMs)
~1–8B parameters (vs. hundreds of billions)
Fine-tuned in hours on a single GPU
Run on-premise — data never leaves your network
Specialized small models can outperform large models on narrow tasks