Module 8: More on AI Agents

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

The API: Building Agents in Code

The API: Talking to an LLM from Code

An API lets your code communicate with an LLM service over the internet.

What You Need

  • An API key (authentication)
  • The anthropic Python package
  • A model name (e.g., claude-sonnet-4-20250514)

Getting Your Key

  • Go to console.anthropic.com
  • Create an API key under Settings
  • Add credit ($5 is plenty for the course)

Your Claude Pro subscription covers Claude.ai and Claude Code. The API is a separate product with pay-per-use pricing.

The Model Landscape: OpenRouter

OpenRouter is a marketplace that routes API calls to hundreds of models from dozens of providers — a single place to see what exists and compare pricing.

  • One API, many models — same code, swap the model name to use GPT-4o, Gemini, Llama, Mistral, Grok, and more
  • Pricing transparency — cost per million input/output tokens shown for every model
  • Good for exploration — browse capabilities and benchmarks before committing to a provider

Visit openrouter.ai/models to see the full landscape of available models.
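OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so "same code, swap the model name" is literal. A minimal sketch using only the standard library — the model names are illustrative, and the request is built but not sent:

```python
import json
import urllib.request

# OpenRouter's OpenAI-compatible chat-completions endpoint
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for OpenRouter."""
    payload = {
        "model": model,  # swap this string to route to a different provider
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Same code, different models:
req = build_request("sk-or-...", "openai/gpt-4o", "What is a P/E ratio?")
```

Sending the request with `urllib.request.urlopen(req)` (or the `openai` package pointed at the same base URL) returns the completion; only the `model` string changes across providers.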

A Single API Call

Calling the Anthropic API

import os
import anthropic

# Read the key from an environment variable rather than hard-coding it
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

prompt = "Explain the P/E ratio in one paragraph."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)
answer = response.content[0].text

Conversation History

Messages Are a List of Dictionaries

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a finance tutor...",
    messages=[
        {"role": "user", "content": "What is a P/E ratio?"},
        {"role": "assistant", "content": "A P/E ratio is..."},
        {"role": "user", "content": "How do I interpret it?"}
    ]
)
  • Each API call is independent — the LLM has no memory
  • You must send the entire conversation history each time
  • The system prompt defines the agent’s behavior and available tools
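The bullets above can be sketched as a minimal chat loop. The API client is stubbed out with `fake_llm` (a placeholder, not a real library call) so the flow is visible without a key — the point is that the whole list goes out on every call:

```python
# Stand-in for client.messages.create: echoes the latest user message
def fake_llm(messages):
    return f"(reply to: {messages[-1]['content']})"

messages = []  # the full history lives on YOUR side, not the model's

def chat(user_text):
    messages.append({"role": "user", "content": user_text})
    reply = fake_llm(messages)  # the entire history is sent each time
    messages.append({"role": "assistant", "content": reply})
    return reply

chat("What is a P/E ratio?")
chat("How do I interpret it?")
# After two turns the list holds all four messages; the second call
# carried the first exchange along with it.
```

Forget to append, and the model answers the follow-up with no idea what "it" refers to.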

Agent Loop Pseudocode

The Agent’s Decision Loop

while not done:
    response = call_llm(messages, system_prompt, tools)

    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        messages.append(tool_result(result))
    else:                          # no tool call means a final answer
        done = True
        answer = response.content

The agent is part LLM intelligence, part traditional programming.

Claude Code as an Agent Harness

What a Harness Provides

The pseudocode shows the logic. A harness is the infrastructure that actually runs that loop — and controls what the agent is allowed to do.

Runtime Infrastructure

  • Maintains conversation state across turns
  • Executes tool calls and captures results
  • Handles errors so the agent can recover
  • Enforces iteration and cost limits

Control and Safety

  • Permission system: which tools are allowed
  • Human-in-the-loop approval gates
  • Hooks: custom checks before/after actions
  • Audit log of every action taken

Claude Code is a complete harness — you supply the task; it handles the loop, tools, and guardrails.

The Real Agent Loop

A Production-Grade Loop

iteration = 0
while iteration < max_turns:
    response = call_llm(messages, system_prompt, tools)
    messages.append(response)             # Maintain full history

    if response.stop_reason == "end_turn":
        return response.text              # Task complete

    for tool_call in response.tool_uses:
        if needs_approval(tool_call) and not ask_user(tool_call):
            result = "Action blocked by user"  # agent is told, not silently skipped
        else:
            try:
                result = execute(tool_call)
            except Exception as e:
                result = f"Error: {e}"        # Agent sees error and recovers
        messages.append(tool_result(result))

    iteration += 1
raise MaxIterationsError("Did not complete")
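A toy, runnable version of the same loop: a scripted "LLM" requests one tool call, sees the result, then finishes. All names (`fake_llm`, `lookup_price`, the price itself) are illustrative stand-ins, not a real API:

```python
# Scripted stand-in for the LLM: ask for a tool once, then finish
def fake_llm(messages):
    if not any(m.get("role") == "tool" for m in messages):
        return {"stop_reason": "tool_use", "tool": ("lookup_price", "AAPL")}
    return {"stop_reason": "end_turn", "text": "AAPL last traded at 190."}

TOOLS = {"lookup_price": lambda ticker: f"{ticker}: 190"}

def run_agent(task, max_turns=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = fake_llm(messages)
        if response["stop_reason"] == "end_turn":
            return response["text"]          # task complete
        name, arg = response["tool"]
        try:
            result = TOOLS[name](arg)        # execute the requested tool
        except Exception as e:
            result = f"Error: {e}"           # the agent sees errors too
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("Did not complete within max_turns")

print(run_agent("What is AAPL's price?"))  # AAPL last traded at 190.
```

Swap `fake_llm` for a real API call and `TOOLS` for real functions, and this is the skeleton every harness runs.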

Claude Code’s Four Harness Layers

CLAUDE.md

Injected as the system prompt. Defines the agent’s persona, project context, and behavioral rules before any user message is sent.

Permissions

Controls which tools and file paths the agent can access. Separate allow/deny lists for bash commands, file writes, and network calls.

Hooks

Shell commands that fire automatically at agent lifecycle events: before a tool runs, after it completes, or when the agent stops.

Iteration Limits

--max-turns caps how many steps the agent takes. Prevents runaway loops and controls API spend.

Guardrails and Constraints

Why Guardrails?

Without constraints, an autonomous agent can cause real damage — even with good intentions.

What Can Go Wrong

  • “Clean up the repo” → deletes files you need
  • A loop bug triggers hundreds of API calls
  • Write access to the database → unintended edits
  • A misread instruction → wrong files overwritten

The Principle

Grant the agent the minimum access it needs to complete the task — and no more.

  • Read-only unless writes are required
  • Specific directories, not the whole filesystem
  • No network unless the task requires it

Permission Modes in Claude Code

Claude Code pauses and asks before taking consequential actions — unless you explicitly allow them.

Default: Human in the Loop

  • Agent stops before running bash commands, writing files, or calling the network
  • You approve or deny each action
  • Safe for unfamiliar tasks or new codebases

Allow Lists

  • Approve a specific command once — or always
  • Approve an entire tool (e.g., all file reads)
  • Approve a path prefix (e.g., ./reports/)
  • Stored in settings.json

Human approval is the default. You choose how much autonomy to grant, task by task.
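Allow and deny lists live in `settings.json`. A sketch of the shape — the exact rule syntax varies by Claude Code version, so treat the entries below as illustrative and check the permissions documentation:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Bash(git status)",
      "Edit(./reports/**)"
    ],
    "deny": [
      "Bash(rm -rf*)"
    ]
  }
}
```

Reads are always allowed, one specific bash command and one path prefix are pre-approved, and destructive deletes are blocked outright.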

CLAUDE.md as a Behavioral Guardrail

CLAUDE.md is not just documentation — it is the agent’s system prompt. Use it to enforce rules.

Examples of Constraint Instructions

- Never modify files outside the src/ directory
- Always ask before running git push or git commit
- Use read-only database connections unless told otherwise
- If a task would delete more than 3 files, pause and confirm

The system prompt is the first line of defense. Clear written rules reduce the chance of unintended actions.

Hooks: Programmatic Checks

Hooks run shell commands automatically at agent lifecycle events.

Event Types

  • PreToolUse — runs before a tool fires; can block the action
  • PostToolUse — runs after a tool completes; can log or validate
  • Stop — runs when the agent finishes

Example Uses

  • Block any bash command containing rm -rf
  • Log every file write to an audit trail
  • Send an alert when the agent finishes
  • Run tests automatically after every code edit

Hooks enforce rules programmatically — the check runs whether or not the agent remembers the instruction.
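A hook is just a script. Sketch of the check behind "block any bash command containing rm -rf": Claude Code pipes the pending tool call to a PreToolUse hook as JSON on stdin and treats a blocking exit code as a veto. The field names (`tool_input`, `command`) are assumptions here — verify them against the hooks documentation for your version:

```python
import json
import sys

DANGEROUS = ("rm -rf", "rm -fr")

def should_block(event: dict) -> bool:
    """Return True if the pending bash command looks destructive."""
    command = event.get("tool_input", {}).get("command", "")
    return any(pattern in command for pattern in DANGEROUS)

# Demonstration with a sample event of the assumed shape:
sample = {"tool_name": "Bash", "tool_input": {"command": "rm -rf build/"}}
print(should_block(sample))  # True

# A real hook script would read stdin and exit instead:
#   event = json.load(sys.stdin)
#   sys.exit(2 if should_block(event) else 0)
```

The check fires on every bash call — whether or not the agent remembers the rule in CLAUDE.md.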

Layered Guardrails

No single guardrail is sufficient. Defense in depth combines all layers.

  1. CLAUDE.md — behavioral rules in the system prompt
  2. Permissions — tool and path access control
  3. Hooks — programmatic checks at runtime
  4. Sandboxing — isolated execution environment
  5. Human approval — pause before high-stakes actions
  6. Iteration limits — prevent runaway loops and cost overruns

Each layer catches what the previous one misses. Together they make autonomous agents safe to deploy.

The Dashboard Trap

Dashboards Answer Yesterday’s Questions

Organizations spend millions building dashboards. When a new question arises, the cycle restarts: requirements → design → build → deploy.

The Typical Dashboard Lifecycle

  1. Business user requests a report
  2. Team builds queries, charts, and deploys
  3. User asks a follow-up — back to step 1

The Cost

  • Time: Weeks to months per dashboard
  • Money: BI licenses, engineering hours, maintenance
  • Rigidity: Fixed views of fixed data

Gartner: only 20% of analytic insights deliver business outcomes.

The Fundamental Problem

Dashboards answer pre-defined questions. But the most valuable analysis comes from ad-hoc questions that arise in the moment.

  • “What happened to margins in the Southeast last quarter?”
  • “Show me our top 10 customers by growth rate, excluding one-time orders”
  • “Compare Q3 headcount vs. budget by department, and flag anyone over 110%”

These are simple questions. Getting answers shouldn’t require a development cycle.

Natural Language as the Query Interface

The Shift: From Dashboards to Conversations

Traditional Dashboard

  • Click filters and select dates to query
  • Answers take minutes to weeks
  • Follow-ups require a new dashboard request

Natural Language AI

  • Ask in plain English
  • Answers in seconds
  • Follow-ups are the next sentence

The dashboard was a workaround for the fact that databases don’t speak English. Now they do.

What This Looks Like in Practice

The Conversation

  • “Show me monthly revenue by product line for 2025”
  • AI: writes SQL, produces grouped bar chart
  • “Break out Enterprise by region”
  • AI: refines query, updates chart

What the User Needed to Know

  • What questions to ask
  • Whether the answers make sense
  • Nothing else

Behind the Scenes

  • 4 different SQL queries written
  • 3 visualizations produced
  • Derived metrics calculated

The Database Agent Pattern

The most powerful dashboard replacement: an AI agent connected to your database.

How to Build It

  1. Connect database via MCP or file upload
  2. Give AI the schema: table names, columns, relationships
  3. Describe the business context
  4. Start asking questions

What the Agent Can Do

  • Write and execute SQL queries
  • Compute derived metrics (growth rates, ratios)
  • Generate charts and export to Excel or PowerPoint
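The core of the pattern fits in a few lines: the LLM writes SQL, your code executes it against a read-only connection. A minimal sketch with the LLM call stubbed out and a throwaway in-memory database standing in for yours:

```python
import sqlite3

def llm_write_sql(question: str, schema: str) -> str:
    """Stand-in for the model: in practice you send question + schema
    to the API and get SQL back."""
    return "SELECT product, SUM(amount) FROM sales GROUP BY product"

# Demo database. Against a real file, open it read-only:
#   sqlite3.connect("file:finance.db?mode=ro", uri=True)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Enterprise", 120.0), ("SMB", 80.0), ("Enterprise", 30.0)])

schema = "sales(product TEXT, amount REAL)"
sql = llm_write_sql("Monthly revenue by product line?", schema)
rows = conn.execute(sql).fetchall()
print(rows)  # e.g. [('Enterprise', 150.0), ('SMB', 80.0)]
```

The read-only connection is the guardrail: even badly generated SQL cannot modify data.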

Replacing Dashboards: FP&A and Treasury

FP&A

  • “Budget vs. actual for Q3, decompose the variance into volume, price, and cost drivers”
  • “Rolling 12-month revenue forecast with confidence bands”

Treasury & Risk

  • “Cash position over the next 90 days using AR/AP forecasts”
  • “VaR by desk for the last 30 days, flag breaches”

Each of these would take a BI team days to build. With a database agent, they take seconds.

Replacing Dashboards: Portfolio and Executive

Portfolio Management

  • “Sector allocation vs. benchmark with active weights”
  • “Performance attribution — allocation vs. selection”

Executive Reporting

  • “One-page executive summary with KPIs and trends”
  • “Board deck from this quarter’s financials — 5 slides max”

The AI doesn’t replace the analyst’s judgment — it replaces the mechanical work of pulling data and building charts.

The Weekly Ops Review — Before and After

Before: The Dashboard Era

  1. Data team pulls exports and builds slides (6+ hrs)
  2. Manager reviews and revises (3 hrs)
  3. VP asks a question — “We’ll get back to you next week”

Total: 9+ hours per week

After: Natural Language AI

  1. AI agent generates ops review (3 min)
  2. Manager reviews and iterates in chat
  3. VP asks a question — AI answers in 10 seconds

Total: 15 minutes + live Q&A

The Reporting Pipeline

Five Steps from Query to Report

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 100, 'rankSpacing': 140, 'padding': 28, 'useMaxWidth': true}}}%%
flowchart LR
  A["<b>Database<br/>Query</b>"] --> B["<b>Transform<br/>& Compute</b>"]
  B --> C["<b>Charts &<br/>Tables</b>"]
  C --> D["<b>AI<br/>Narrative</b>"]
  D --> E["<b>PowerPoint<br/>Output</b>"]

  style A fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style B fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style C fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style D fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style E fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px

From raw data to polished deck — no copy-paste, no formatting, no manual writing.

Building It with Your Coding Agent

You do not need to write this by hand. Give your coding agent a single prompt:

“Build an app that connects to a database. When the user selects a report and clicks Generate, run the SQL query, create a chart, write a three-sentence executive summary, and assemble a three-slide PowerPoint (title, chart, findings). Show the chart and summary on screen with a download button for the PowerPoint file.”

Your job is not to write the code. Your job is to test the result, refine the prompt, and iterate until the output meets your standards.

Building the Pipeline with Streamlit

Streamlit turns a Python script into a web app in minutes. Combined with an AI pipeline, it becomes a self-service reporting tool.

User Interface

  • Dropdown: select time period, department, or metric
  • Button: “Generate Report”
  • Download: auto-generated .pptx file

Behind the Scenes

  • Query database for selected parameters
  • Compute metrics and build charts
  • Send chart data to LLM for narrative
  • Assemble PowerPoint with python-pptx

The user clicks one button and gets a polished deck. No analyst needed.

Example: Monthly Ops Review Pipeline

What the Pipeline Produces

  1. Title slide, KPI summary table, and trend charts
  2. Variance waterfall (budget vs. actual)
  3. AI-generated executive summary

Time Comparison

  • Manual: 6–8 hours per month (data pull, Excel, copy-paste into PowerPoint, write narrative)
  • Pipeline: 30 seconds per run
  • ROI: First month pays for the build time

Automated PowerPoint with python-pptx

python-pptx is a Python library for creating and editing PowerPoint files programmatically.

What It Can Do

  • Create slides from templates
  • Insert charts, tables, and images
  • Apply corporate formatting (fonts, colors, logos)
  • Populate placeholders with live data

The AI Advantage

  • AI writes the python-pptx code for you
  • Tell Claude: “Create a 5-slide deck from this data with a waterfall chart on slide 3”
  • AI handles layout, formatting, and data binding

You describe the deck; AI builds the automation. The pipeline runs unattended after that.

Sandboxed Execution

Development vs. Production

Development (Your Laptop)

  • Agent runs in your environment
  • Full access to your files
  • Fine for prototyping

Production

  • Agent runs in a container (Docker)
  • Isolated, disposable environment
  • Read-only database access

A container is a disposable, isolated computing environment. A bug in AI-generated code cannot affect your other files, your database, or your network.

The Sandbox Pattern

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 120, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  U["<b>User</b>"] --> A["<b>Agent</b>"]
  A --> S["<b>Sandboxed<br>Code</b>"]
  S <--> DB["<b>Database<br>(read-only)</b>"]
  S --> R["<b>Report</b>"]
  R --> U

  style U fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style A fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style S fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style DB fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style R fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px

  • Agent plans and orchestrates; code runs in a sandboxed container
  • Container queries the database through a read-only connection
  • Finished report flows back to the user

From Prototype to Production

The reporting app on your laptop is a prototype. Moving it to production adds infrastructure, not intelligence.

  • Authentication: SSO so only authorized users can generate reports
  • Scheduling: weekly reports generated automatically
  • Logging: every query and LLM call recorded for compliance
  • Read-only credentials: the agent cannot modify data

The pipeline you prototyped is the same artifact that powers the production tool. IT wraps it in infrastructure; you provide the domain knowledge and the prompt.

Variance Analysis

What is Variance Analysis?

Variance analysis compares budgeted figures to actual results and decomposes the differences into actionable drivers. It is the core analytical task in FP&A.

Why It Matters

  • Every public company does it quarterly
  • Boards and investors ask “why did we miss?”
  • Drives re-forecasting and capital allocation

Key Decompositions

  • Revenue: volume vs. price vs. mix
  • COGS: volume vs. unit cost
  • SG&A: headcount vs. rate vs. discretionary

Revenue variance = Volume effect + Price effect + Mix effect

Variance Analysis: Example Data

Scenario: You are an FP&A analyst. Q1 actuals just closed. The CEO wants to know why operating income missed budget by $300K.

  • Revenue: Budget 100K units at $50 = $5M; Actual 95K units at $51 = $4.845M — ($155K) miss
  • COGS: Budget $30/unit = $3M; Actual $32/unit = $3.04M — ($40K) miss
  • SG&A: Budget $1.45M; Actual $1.555M — ($105K) miss
  • Operating Income: Budget $550K; Actual $250K — ($300K) total miss
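The decomposition for this data takes a few lines of arithmetic. Convention (one common choice): price and rate effects are computed at actual volume, volume effects at budgeted price/cost, and favorable effects are positive:

```python
b_units, a_units = 100_000, 95_000    # budget vs. actual units
b_price, a_price = 50, 51             # selling price per unit
b_cogs,  a_cogs  = 30, 32             # cost per unit
b_sga,   a_sga   = 1_450_000, 1_555_000

rev_volume  = (a_units - b_units) * b_price    # -250,000: sold fewer units
rev_price   = (a_price - b_price) * a_units    #  +95,000: higher price helped
cogs_volume = -(a_units - b_units) * b_cogs    # +150,000: fewer units, less cost
cogs_rate   = -(a_cogs - b_cogs) * a_units     # -190,000: unit cost rose
sga         = -(a_sga - b_sga)                 # -105,000: overspend

total = rev_volume + rev_price + cogs_volume + cogs_rate + sga
print(f"{total:,}")  # -300,000 — matches the operating income miss
```

The five signed effects are exactly the bars of the waterfall chart the CFO will ask for.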

Variance Analysis with an Agent

With a database agent connected to your financials, one prompt replaces 2–4 hours of FP&A work:

Prompt

“Here is our Q1 budget vs. actuals spreadsheet. Decompose the $300K operating income miss into volume, price, cost, and discretionary spending drivers. Produce a waterfall chart and a summary memo for the CFO.”

What the Agent Does

  1. Plans: identifies the variance decomposition formulas needed
  2. Decomposes revenue into volume + price; COGS into volume + rate
  3. Generates waterfall chart and writes CFO-ready memo in Word format

Finance Application: M&A Due Diligence

M&A Due Diligence with an Agent

Give an agent a goal: “Evaluate this acquisition target.” The agent:

  1. Ingests data from multiple formats (Excel, PDF, Word, CSV)
  2. Applies evaluation criteria and computes risk metrics
  3. Produces a summary memo with flagged risks

This is a single prompt to Claude Code. The agent plans and executes all three steps autonomously, reading multiple file formats and combining the results.

Orchestration

The Orchestration Layer

The agent’s control logic can route different tasks to different models and prompts.

Different Prompts

  • SQL generation → database schema prompt
  • Python analysis → data science prompt
  • Each task gets specialized instructions

Different Models

  • Simple classification → fast, cheap model
  • Complex reasoning → powerful model
  • Cost and speed optimization

Sub-agents: dispatch specialized workers for parallel tasks. This is what Cowork does with its parallel VMs.
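Stripped to its essentials, the orchestration layer is a routing table: each task type maps to a model and a specialized prompt. The model names and task labels below are placeholders, not real identifiers:

```python
# task type -> (model, specialized system prompt); all names illustrative
ROUTES = {
    "classify": ("fast-cheap-model",   "Label each transaction by category..."),
    "sql":      ("mid-tier-model",     "Write SQL against the schema below..."),
    "reason":   ("most-capable-model", "Work through the analysis step by step..."),
}

def route(task_type):
    """Pick (model, system_prompt); unknown tasks fall back to the mid tier."""
    return ROUTES.get(task_type, ROUTES["sql"])

model, system_prompt = route("classify")
print(model)  # fast-cheap-model
```

A few lines of routing logic can cut API spend substantially when most calls are simple classifications that never need the expensive model.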

Exercises

Exercise 1: Multi-Step Agent Workflow

Give Claude Code a single compound instruction that requires at least three steps.

Example: “Fetch Apple’s quarterly revenue from the Rice Data Portal. Create a bar chart of the last 8 quarters. Write a one-paragraph executive summary. Save the chart and summary to a reports folder.”

  • Observe the agent loop — how many distinct tool calls does the agent make?
  • Does it check its own work?
  • If the output is not right, refine your prompt and try again

Exercise 2: Workflow Decomposition

Consider the task: “Prepare the quarterly business review for the CEO.”

  1. List the 5–7 steps a human analyst would take to complete this task
  2. For each step, identify what tool the agent would use and what data it would need
  3. Mark which steps need human approval before the agent continues
  4. Write a single prompt that describes the full workflow for an agent

Exercise 3: M&A Due Diligence

  1. Download the due diligence data pack (Excel + PDF + Word + CSV)
  2. Ask Claude Code to evaluate the acquisition target end-to-end
  3. Submit: the summary report + screenshots of intermediate steps

Watch how Claude plans its approach, reads each file, and combines the results into a coherent analysis.

Exercise 4: Streamlit App with the Anthropic API

  1. Get an API key from console.anthropic.com
  2. Ask Claude Code to build a Streamlit app that:
    • Takes a company ticker from the user
    • Sends a prompt to the Anthropic API asking for an investment summary
    • Displays the AI-generated summary on screen
  3. Extend: add a system prompt with specific analysis criteria (valuation, growth, risks)