AI with code execution is like a fast junior analyst — capable but needs oversight. The level of oversight depends on the stakes.
Low-Stakes
Exploratory charts
Directional trends
Internal brainstorming
Quick sanity check.
Medium-Stakes
Team tools and skills
Recurring reports
Internal dashboards
Test with known answers.
High-Stakes
Client deliverables
Regulatory filings
Board presentations
Full verification protocol.
The more consequential the decision, the more you verify. Not distrust — calibrated confidence.
AI Makes Silent Analytical Errors
The Real Risk: Not Hallucination — Bugs
AI in code-execution mode doesn’t invent facts the way a chatbot might. Instead, it makes silent analytical errors: wrong filters, dropped rows, incorrect joins.
Common errors
Missing data silently dropped
Date or currency parsing surprises
Filters slightly off; aggregating before vs. after filtering
Why it’s dangerous
The output looks professional
Charts render cleanly and nobody questions a polished result
Confidence without verification is the real risk
Catching a Silent Error
The prompt
“What’s the average profit margin by sub-category? Exclude any sub-categories with negative total profit.”
The trap
Does AI filter before or after aggregating?
Does it drop rows with negative profit, or sub-categories with negative total profit?
The order matters — and AI often gets it wrong
How to catch it
“Show me the code” — inspect the filter logic
“How many rows are in the result?” — does it match your expectation?
“List the excluded sub-categories” — are they the right ones?
Lesson: The more specific the filter logic, the more likely AI misinterprets it. Always verify multi-step operations.
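The two readings of the prompt can be made concrete in a few lines of pandas. This is a minimal sketch with made-up Superstore-style data; the sub_category, sales, and profit columns are illustrative:

```python
import pandas as pd

# Made-up order data; column names are illustrative.
df = pd.DataFrame({
    "sub_category": ["Tables", "Tables", "Chairs", "Chairs", "Binders"],
    "sales":        [1000,     800,      500,      700,      300],
    "profit":       [-200,     50,       100,      150,      30],
})
df["margin"] = df["profit"] / df["sales"]

# Wrong reading: drop individual ROWS with negative profit, then aggregate.
wrong = (df[df["profit"] >= 0]
         .groupby("sub_category")["margin"].mean())

# Intended reading: aggregate first, then drop SUB-CATEGORIES whose total profit is negative.
totals = df.groupby("sub_category")["profit"].sum()
keep = totals[totals >= 0].index
right = (df[df["sub_category"].isin(keep)]
         .groupby("sub_category")["margin"].mean())

print(wrong.index.tolist())  # ['Binders', 'Chairs', 'Tables'] — Tables survives because its negative rows were dropped
print(right.index.tolist())  # ['Binders', 'Chairs'] — Tables excluded: its total profit is -150
```

Same prompt, two defensible interpretations, different answers. "List the excluded sub-categories" catches the difference immediately.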
The Verification Checklist
A 4-Step Verification Protocol
Apply this to any AI-generated analysis.
1. Sanity-check with known answers: ask questions where you already know the result (row count, column names, a total you can verify in Excel).
2. Ask AI to show its methodology: "Show me the code" or "Show me the SQL." Read the logic, not just the output.
3. Cross-check by rephrasing: ask the same question differently, or start a fresh conversation and ask again. Do the answers agree?
4. Spot-check edge cases: test the smallest group, zero values, null fields, boundary dates. Errors hide at the edges.
Trust is a process, not a setting.
Every AI output deserves the same scrutiny you’d give a new analyst’s first deliverable.
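Step 1 of the protocol can be captured as a small helper. A sketch, assuming you have the AI result as a DataFrame; the function name, columns, and expected values are illustrative:

```python
import pandas as pd

def sanity_check(result: pd.DataFrame, expected_rows: int,
                 expected_total: float, total_col: str, tol: float = 0.01) -> list:
    """Compare an AI result against answers you already know (step 1 of the protocol)."""
    failures = []
    if len(result) != expected_rows:
        failures.append(f"row count {len(result)} != expected {expected_rows}")
    total = result[total_col].sum()
    if abs(total - expected_total) > tol:
        failures.append(f"{total_col} total {total} != expected {expected_total}")
    return failures

# Hypothetical AI output; the expected values come from a quick Excel check.
ai_result = pd.DataFrame({"region": ["East", "West"], "revenue": [120.0, 80.0]})
print(sanity_check(ai_result, expected_rows=2, expected_total=200.0, total_col="revenue"))
# [] — empty list means the known answers match
```

An empty list is not proof of correctness; it just means the checks you chose all passed.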
Hands-On: Validating Prior Work
Reproducibility Test
Take an analysis from an earlier module and test whether it reproduces.
The exercise
Pick a financial analysis from Module 2 (portfolio) or Module 4 (DCF)
Start a fresh conversation
Give the same data and ask the same question
Compare: do the results match?
What you’re testing
Reproducibility: Same data, same question — same answer?
Methodology: Did AI use the same approach both times?
Stability: Small rephrasing shouldn’t change the result
If the numbers don’t match, that’s not a failure — it’s exactly why we verify. And it’s exactly why skills (Module 3) matter.
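A quick way to compare two runs, assuming you copy each conversation's headline metrics into a dict. The metric names and values below are made up; a 1% tolerance is an arbitrary choice, not a standard:

```python
import math

# Hypothetical: portfolio metrics from two independent conversations (Module 2 rerun).
run_1 = {"annual_return": 0.0812, "volatility": 0.1504, "sharpe": 0.54}
run_2 = {"annual_return": 0.0812, "volatility": 0.1501, "sharpe": 0.54}

def compare_runs(a: dict, b: dict, rel_tol: float = 0.01) -> dict:
    """Return the metrics that differ by more than rel_tol between two runs."""
    return {k: (a[k], b[k]) for k in a
            if not math.isclose(a[k], b[k], rel_tol=rel_tol)}

mismatches = compare_runs(run_1, run_2)
print(mismatches or "runs agree within 1%")  # → runs agree within 1%
```

Tiny numerical differences are expected; a metric that moves materially between runs is what you investigate.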
Auditing a DCF Model
Take the DCF built in Module 4. Start a fresh conversation. Ask Claude to critique it.
What to ask
“Identify 3+ questionable assumptions”
“Check formula consistency — does FCF match the pro formas?”
“Is the terminal value reasonable? Compare implied terminal growth to GDP growth.”
Why this matters
The most valuable use of AI is often having it check work, not produce it
Cross-evaluation: use a second conversation to critique the first
Compare original vs. revised valuations
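The terminal-value check is one you can do by hand: inverting the Gordon growth formula TV = FCF × (1 + g) / (WACC − g) gives the implied growth rate g. A sketch with hypothetical Module 4 numbers:

```python
def implied_terminal_growth(terminal_value: float, final_fcf: float, wacc: float) -> float:
    """Invert TV = FCF * (1 + g) / (WACC - g) to solve for the implied growth rate g."""
    return (terminal_value * wacc - final_fcf) / (terminal_value + final_fcf)

# Hypothetical inputs: $12M final-year FCF, 9% WACC, $180M terminal value.
g = implied_terminal_growth(180.0, 12.0, 0.09)
print(f"implied terminal growth: {g:.1%}")  # flag it if this exceeds long-run GDP growth (~2-3%)
```

If the implied g comes out above long-run GDP growth, the terminal value is assuming the company outgrows the economy forever.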
Red-Teaming a Deployed Skill
A database skill is the kind of tool people stop questioning once deployed. Test it like any tool your team depends on.
Adversarial queries
Query for nonexistent data (“Sales in 2030”)
Request something not in the schema (“Customer satisfaction scores”)
Use ambiguous language (“Best employee”)
For each failure
Diagnose: Bad SQL? Missing context? Wrong assumption?
Fix: Add clarification, examples, or constraints to SKILL.md
Retest: Run the query again after the fix
This is how production skills get hardened. Every failure makes the skill better.
Red-Teaming Protocols
Systematic Red-Teaming
Red-teaming goes beyond ad-hoc testing. A protocol ensures you test every failure mode systematically.
Test category: what you try
Out-of-scope questions: ask about data that doesn’t exist in the schema
Conflicting documents: upload docs with contradictory numbers
Edge cases: zero values, null fields, single-row groups, boundary dates
Hand-written SQL comparison: write the query yourself and compare to AI’s result
Ambiguous language: “Best customer” — by revenue? frequency? margin?
A skill that hasn’t been red-teamed is a liability, not an asset.
Red-Teaming a DCF Model
Apply red-teaming protocols to the DCF models built in Module 4.
Financial Red-Team Tests
Set revenue growth to 0% — does the model break?
Use a WACC below the risk-free rate — does AI flag it?
Feed a company with negative FCF and compare AI’s output to a hand-built Excel model
What You Learn
Where the model silently produces nonsense
Which assumptions the AI accepts without questioning
How to improve guardrails so the verification checklist catches the errors
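The red-team tests above can be captured as guardrail code that runs before you trust a valuation. A sketch; the function name, messages, and checks are illustrative, not a standard:

```python
def red_team_dcf(wacc: float, risk_free: float, revenue_growth: float,
                 fcf_projection: list) -> list:
    """Guardrail checks mirroring the red-team tests above; thresholds are illustrative."""
    flags = []
    if wacc <= risk_free:
        flags.append("WACC at or below the risk-free rate: discounting implies no risk premium")
    if revenue_growth == 0.0:
        flags.append("0% revenue growth: confirm the model still produces finite values")
    if all(fcf < 0 for fcf in fcf_projection):
        flags.append("all projected FCF negative: terminal-value math may produce nonsense")
    return flags

# All three red-team conditions at once: each should be flagged.
print(red_team_dcf(wacc=0.03, risk_free=0.045, revenue_growth=0.0,
                   fcf_projection=[-5.0, -3.0, -1.0]))
```

A clean model raises no flags; a model that accepts these inputs silently is exactly the liability you are hunting for.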
Audit Logging
JSONL Audit Logs
Every AI interaction in a production system should produce an audit log entry. The standard format is JSONL (one JSON object per line).
What to Log
Identity: timestamp, user, model
Content: prompt and response
Metadata: token count, latency, cost
Why It Matters
Compliance: Regulators can audit what the AI said
Debugging: Trace errors back to the exact prompt
Cost and quality: Track token usage and identify common failures for skill updates
Store logs locally or in a cloud database. If it’s not logged, it didn’t happen.
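A minimal JSONL logger sketch. The field names follow the list above; the file path and helper name are arbitrary choices:

```python
import json
import time
import uuid

def log_interaction(path: str, user: str, model: str, prompt: str,
                    response: str, tokens: int, latency_ms: float) -> dict:
    """Append one audit entry as a single JSON object on its own line (JSONL)."""
    entry = {
        "id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user": user,
        "model": model,
        "prompt": prompt,
        "response": response,
        "tokens": tokens,
        "latency_ms": latency_ms,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

entry = log_interaction("audit.jsonl", user="analyst@firm.com", model="claude",
                        prompt="Q3 revenue by region?", response="...",
                        tokens=412, latency_ms=830.0)
```

Because each line is a complete JSON object, the log can be grepped, streamed, or loaded line by line into any analysis tool.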
Three Layers of Auditability
Defense in Depth
📄
Layer 1
RAG Citations
💻
Layer 2
Code Transparency
📋
Layer 3
Audit Logs
RAG citations: Ground answers in source documents with traceable references (Module 8). AI answers based on your content, not its training data.
Code transparency: AI displays the SQL or Python it used. Analysts review the logic, not just the answer.
Audit logs: Every prompt, response, and token count recorded in JSONL. Complete trail of every AI decision.
The more layers, the harder it is for errors to reach your team.
Security and Governance
Data Policies: Training and Retention
Before sending data to any AI provider, know the answers to two questions.
Will they train on your data?
Consumer tiers (free ChatGPT, free Claude) may use your input to improve models
API and enterprise tiers typically do not
Check your agreement and settings
How long do they keep it?
Providers retain prompts for a window (often 30 days) for safety monitoring
Enterprise contracts may offer zero retention
Know the policy before you send sensitive data
The Schema-Only Pattern
AI can analyze sensitive data without ever reading it. It only needs the table structure.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 120, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
S["<b>Schema</b><br/><i>table names, columns</i>"] -->|"safe to share"| AI["<b>AI</b>"]
AI --> SQL["<b>SQL Code</b>"]
SQL --> L["<b>Runs Locally</b>"]
DB["<b>Your Database</b>"] -->|"data stays here"| L
style S fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
style AI fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a
style SQL fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a
style L fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
style DB fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a
How it works
Describe tables in instructions: names, columns, types
Ask AI to write the query
AI writes SQL from the schema alone; the script runs locally and output stays local
Why it’s compliant
Table metadata is not PII — safe to share
Sensitive records stay on your server
AI produces the tool, not the output
Works for FERPA, HIPAA, SOX, GDPR
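A self-contained sketch of the pattern using sqlite3. The schema text is the only thing that would be shared with the AI; the "AI-generated" SQL string and the sample rows are made up for illustration:

```python
import sqlite3

# Only this schema text is shared with the AI; no rows ever leave the machine.
SCHEMA = """
CREATE TABLE orders (
    order_id  INTEGER PRIMARY KEY,
    region    TEXT,   -- e.g. 'East', 'West'
    revenue   REAL
);
"""

# Hypothetical SQL the AI wrote from the schema alone.
AI_SQL = "SELECT region, SUM(revenue) AS total FROM orders GROUP BY region ORDER BY region"

# The sensitive data lives only in the local database.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "East", 120.0), (2, "West", 80.0), (3, "East", 30.0)])

print(conn.execute(AI_SQL).fetchall())  # [('East', 150.0), ('West', 80.0)]
```

The AI saw table and column names; the rows, and the result, never left the local process.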
Improving SQL Quality: Schema + Synthetic Rows
Schema alone works, but adding ~3 synthetic rows per table makes the generated SQL significantly more accurate.
Why sample rows help
Reveal data formats (is status 'Active' or 1?)
Disambiguate columns (is code a country code or product code?)
Help the LLM write correct WHERE clauses with properly formatted literals
Rules
Rows are synthetic — fabricated to match real formats without containing actual records
Three rows per table is the sweet spot; more tends to distract the model
Databricks: ~3 rows per table raises SQL accuracy by ~6 pp
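A sketch of what a schema-plus-synthetic-rows prompt might look like. The table, columns, and rows below are fabricated for illustration; the point is that the rows show formats, not real records:

```python
# The rows are synthetic: fabricated to match real formats, not copied from the database.
schema = "customers(customer_id INTEGER, country_code TEXT, status TEXT, signup_date TEXT)"
synthetic_rows = [
    (1001, "US", "Active",  "2024-03-15"),
    (1002, "DE", "Churned", "2023-11-02"),
    (1003, "JP", "Active",  "2024-01-20"),
]

prompt = (
    f"Table schema: {schema}\n"
    "Example rows (synthetic, showing formats only):\n"
    + "\n".join(str(r) for r in synthetic_rows)
    + "\nWrite SQL to count active customers by country."
)
print(prompt)
```

Three rows are enough to show that status is the text 'Active', not a numeric flag, and that country_code is an ISO-style two-letter code, so the generated WHERE clauses use correctly formatted literals.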
Where Should AI Infrastructure Live?
When you connect an AI agent to enterprise data, you face a fundamental question: where does the computation happen?
Public Cloud AI
Data sent to provider (OpenAI, Anthropic, Google)
Fastest to deploy
Provider controls infrastructure
Private / On-Prem AI
Data never leaves your network
Full control over access
Higher upfront cost
Hybrid
Sensitive data stays on-prem
Non-sensitive tasks use cloud AI
Most common enterprise pattern
Antonio Neri, CEO of HPE: “In the AI era, where data sovereignty, privacy and control of data matter, the industry is seeing a shift back to private clouds.”
Data Sovereignty
Data sovereignty means controlling where your data lives, who can access it, and what jurisdiction governs it.
Why it matters for AI
AI agents may transmit query results to external LLMs
Regulations (GDPR, HIPAA, SOX) may restrict where data can go
Defense agencies require air-gapped environments
Enterprise examples
DISA (U.S. DoD): $931M on-prem private cloud
Barclays: 50K+ workloads to private cloud
2degrees (NZ): sovereign AI for data residency
Data Security: Running AI Locally
The Case for Local AI
Cloud AI is convenient, but local models keep data inside your machine — no transmission, no retention policy, no provider to trust.
For narrow, sensitive tasks, a smaller local model is often the right call.
Ollama: Local LLMs in One Installer
Ollama lets you run open-source models on your laptop — no cloud, no API key, no data leaving your machine.
How it works
Download and run the Ollama installer
Pull a model (e.g., llama3.2)
Chat locally via a browser UI or the terminal
All inference runs on your CPU/GPU
Available models
Llama 3.1 (Meta, 8B) — strong reasoning
Llama 3.2 (Meta, 3B) — faster, lighter
Mistral, Gemma, Phi — many options
Models download once and run offline
Exercise: Install Ollama and Try Local LLMs
Exercise: Running a Local Model
Setup
Go to ollama.com and download the installer for your OS
Run the installer (no terminal needed)
Once installed, open a terminal and run:
ollama run llama3.2
Then try:
ollama run llama3.1
The prompt
Give each model this prompt and compare their responses:
“A private equity firm is evaluating a manufacturing company with $50M in revenue, 12% EBITDA margins, and 8% annual revenue growth. Comparable transactions have been done at 7–9x EBITDA. What is a reasonable valuation range, and what are the top three risks you would flag before proceeding?”
What to notice: quality of reasoning, how each model handles uncertainty, and whether either refuses or hedges differently.
Both models run entirely on your machine. No data is sent anywhere.
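For scripted use, the Ollama server also exposes a local HTTP API. This sketch assumes the default endpoint on localhost:11434 and an already-pulled llama3.2 model; the helper names are arbitrary:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False asks for a single JSON response instead of a token stream.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_local(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the model's reply text."""
    req = urllib.request.Request(OLLAMA_URL, data=build_payload(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# The request body that would be sent; nothing leaves localhost.
print(build_payload("llama3.2", "One-sentence summary of WACC.").decode())
```

With the Ollama app running, `ask_local("llama3.2", prompt)` returns the reply entirely from local inference; the same loop over `["llama3.2", "llama3.1"]` automates the model comparison in this exercise.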
AI Adoption Roadmap
The Sandbox-Audit-Deploy Pattern
Before deploying any AI workflow to your team, follow three phases.
🧪
Sandbox
Test with dummy data
🔍
Audit
Review logs and logic
🚀
Deploy
Roll out with monitoring
Test with synthetic data, review logs and logic, then roll out with weekly monitoring.
Adoption Timeline: From Quick Wins to Strategic
Each adoption stage maps to a trust level.
Quick wins (weeks) — light verification:
Chat for research, summarization, drafts
Upload spreadsheets for ad-hoc analysis
Medium-term (months) — test-before-deploy:
Skills for recurring workflows (Module 3)
MCP integrations and Streamlit apps (Modules 7, 8)
Strategic (quarters) — full audit trail:
Production apps with authentication
AI usage policy and governance framework
Small Language Models and Fine-Tuning
Improving Reliability: Fine-Tuning and SLMs
A complementary approach to verification: make the model itself more reliable for your tasks.
Fine-tuning an LLM
Train on your domain data (10-Ks, memos, deal docs)
Model learns your terminology and conventions
Fewer silent errors on domain-specific tasks
Small Language Models (SLMs)
~1–8B parameters (vs. hundreds of billions)
Fine-tuned in hours on a single GPU
Run on-premise — data never leaves your network
Specialized small models can outperform large models on narrow tasks