Module 8: More on AI Agents

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

The API: Building Agents in Code

The API: Talking to an LLM from Code

An API lets your code communicate with an LLM service over the internet.

What You Need

  • An API key (authentication)
  • The anthropic Python package
  • A model name (e.g., claude-sonnet-4-20250514)

Getting Your Key

  • Go to console.anthropic.com
  • Create an API key under Settings
  • Add credit ($5 is plenty for the course)

Your Claude Pro subscription covers Claude.ai and Claude Code. The API is a separate product with pay-per-use pricing.

The Model Landscape: OpenRouter

OpenRouter is a marketplace that routes API calls to hundreds of models from dozens of providers — a single place to see what exists and compare pricing.

  • One API, many models — same code, swap the model name to use GPT-4o, Gemini, Llama, Mistral, Grok, and more
  • Pricing transparency — cost per million input/output tokens shown for every model
  • Good for exploration — browse capabilities and benchmarks before committing to a provider

Visit openrouter.ai/models to see the full landscape of available models.
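OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so "same code, swap the model name" is literal. A minimal sketch using only the standard library — the model names are illustrative, and the request is built but not sent:

```python
import json
import urllib.request

# OpenRouter's OpenAI-compatible chat-completions endpoint
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for OpenRouter."""
    payload = {
        "model": model,  # swap this string to route to a different provider
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Same code, different models:
req = build_request("sk-or-...", "openai/gpt-4o", "What is a P/E ratio?")
```

Sending the request with `urllib.request.urlopen(req)` (or the `openai` package pointed at the same base URL) returns the completion; only the `model` string changes across providers.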

A Single API Call

Calling the Anthropic API

import os
import anthropic

# Read the key from an environment variable rather than hard-coding it
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

prompt = "Explain the P/E ratio in one paragraph."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)
answer = response.content[0].text

Conversation History

Messages Are a List of Dictionaries

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a finance tutor...",
    messages=[
        {"role": "user", "content": "What is a P/E ratio?"},
        {"role": "assistant", "content": "A P/E ratio is..."},
        {"role": "user", "content": "How do I interpret it?"}
    ]
)
  • Each API call is independent — the LLM has no memory
  • You must send the entire conversation history each time
  • The system prompt defines the agent’s behavior and available tools
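The bullets above can be sketched as a minimal chat loop. The API client is stubbed out with `fake_llm` (a placeholder, not a real library call) so the flow is visible without a key — the point is that the whole list goes out on every call:

```python
# Stand-in for client.messages.create: echoes the latest user message
def fake_llm(messages):
    return f"(reply to: {messages[-1]['content']})"

messages = []  # the full history lives on YOUR side, not the model's

def chat(user_text):
    messages.append({"role": "user", "content": user_text})
    reply = fake_llm(messages)  # the entire history is sent each time
    messages.append({"role": "assistant", "content": reply})
    return reply

chat("What is a P/E ratio?")
chat("How do I interpret it?")
# After two turns the list holds all four messages; the second call
# carried the first exchange along with it.
```

Forget to append, and the model answers the follow-up with no idea what "it" refers to.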

Agent Loop Pseudocode

The Agent’s Decision Loop

while not done:
    response = call_llm(messages, system_prompt, tools)

    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        messages.append(tool_result(result))
    else:                          # no tool call means a final answer
        done = True
        answer = response.content

The agent is part LLM intelligence, part traditional programming.

Claude Code as an Agent Harness

What a Harness Provides

The pseudocode shows the logic. A harness is the infrastructure that actually runs that loop — and controls what the agent is allowed to do.

Runtime Infrastructure

  • Maintains conversation state across turns
  • Executes tool calls and captures results
  • Handles errors so the agent can recover
  • Enforces iteration and cost limits

Control and Safety

  • Permission system: which tools are allowed
  • Human-in-the-loop approval gates
  • Hooks: custom checks before/after actions
  • Audit log of every action taken

Claude Code is a complete harness — you supply the task; it handles the loop, tools, and guardrails.

The Real Agent Loop

A Production-Grade Loop

iteration = 0
while iteration < max_turns:
    response = call_llm(messages, system_prompt, tools)
    messages.append(response)             # Maintain full history

    if response.stop_reason == "end_turn":
        return response.text              # Task complete

    for tool_call in response.tool_uses:
        if needs_approval(tool_call) and not ask_user(tool_call):
            result = "Action blocked by user"  # agent is told, not silently skipped
        else:
            try:
                result = execute(tool_call)
            except Exception as e:
                result = f"Error: {e}"        # Agent sees error and recovers
        messages.append(tool_result(result))

    iteration += 1
raise MaxIterationsError("Did not complete")
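A toy, runnable version of the same loop: a scripted "LLM" requests one tool call, sees the result, then finishes. All names (`fake_llm`, `lookup_price`, the price itself) are illustrative stand-ins, not a real API:

```python
# Scripted stand-in for the LLM: ask for a tool once, then finish
def fake_llm(messages):
    if not any(m.get("role") == "tool" for m in messages):
        return {"stop_reason": "tool_use", "tool": ("lookup_price", "AAPL")}
    return {"stop_reason": "end_turn", "text": "AAPL last traded at 190."}

TOOLS = {"lookup_price": lambda ticker: f"{ticker}: 190"}

def run_agent(task, max_turns=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = fake_llm(messages)
        if response["stop_reason"] == "end_turn":
            return response["text"]          # task complete
        name, arg = response["tool"]
        try:
            result = TOOLS[name](arg)        # execute the requested tool
        except Exception as e:
            result = f"Error: {e}"           # the agent sees errors too
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("Did not complete within max_turns")

print(run_agent("What is AAPL's price?"))  # AAPL last traded at 190.
```

Swap `fake_llm` for a real API call and `TOOLS` for real functions, and this is the skeleton every harness runs.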

Claude Code’s Four Harness Layers

CLAUDE.md

Injected as the system prompt. Defines the agent’s persona, project context, and behavioral rules before any user message is sent.

Permissions

Controls which tools and file paths the agent can access. Separate allow/deny lists for bash commands, file writes, and network calls.

Hooks

Shell commands that fire automatically at agent lifecycle events: before a tool runs, after it completes, or when the agent stops.

Iteration Limits

--max-turns caps how many steps the agent takes. Prevents runaway loops and controls API spend.

Guardrails and Constraints

Why Guardrails?

Without constraints, an autonomous agent can cause real damage — even with good intentions.

What Can Go Wrong

  • “Clean up the repo” → deletes files you need
  • A loop bug triggers hundreds of API calls
  • Write access to the database → unintended edits
  • A misread instruction → wrong files overwritten

The Principle

Grant the agent the minimum access it needs to complete the task — and no more.

  • Read-only unless writes are required
  • Specific directories, not the whole filesystem
  • No network unless the task requires it

Permission Modes in Claude Code

Claude Code pauses and asks before taking consequential actions — unless you explicitly allow them.

Default: Human in the Loop

  • Agent stops before running bash commands, writing files, or calling the network
  • You approve or deny each action
  • Safe for unfamiliar tasks or new codebases

Allow Lists

  • Approve a specific command once — or always
  • Approve an entire tool (e.g., all file reads)
  • Approve a path prefix (e.g., ./reports/)
  • Stored in settings.json

Human approval is the default. You choose how much autonomy to grant, task by task.
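Allow and deny lists live in `settings.json`. A sketch of the shape — the exact rule syntax varies by Claude Code version, so treat the entries below as illustrative and check the permissions documentation:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Bash(git status)",
      "Edit(./reports/**)"
    ],
    "deny": [
      "Bash(rm -rf*)"
    ]
  }
}
```

Reads are always allowed, one specific bash command and one path prefix are pre-approved, and destructive deletes are blocked outright.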

CLAUDE.md as a Behavioral Guardrail

CLAUDE.md is not just documentation — it is the agent’s system prompt. Use it to enforce rules.

Examples of Constraint Instructions

- Never modify files outside the src/ directory
- Always ask before running git push or git commit
- Use read-only database connections unless told otherwise
- If a task would delete more than 3 files, pause and confirm

The system prompt is the first line of defense. Clear written rules reduce the chance of unintended actions.

Hooks: Programmatic Checks

Hooks run shell commands automatically at agent lifecycle events.

Event Types

  • PreToolUse — runs before a tool fires; can block the action
  • PostToolUse — runs after a tool completes; can log or validate
  • Stop — runs when the agent finishes

Example Uses

  • Block any bash command containing rm -rf
  • Log every file write to an audit trail
  • Send an alert when the agent finishes
  • Run tests automatically after every code edit

Hooks enforce rules programmatically — the check runs whether or not the agent remembers the instruction.
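A hook is just a script. Sketch of the check behind "block any bash command containing rm -rf": Claude Code pipes the pending tool call to a PreToolUse hook as JSON on stdin and treats a blocking exit code as a veto. The field names (`tool_input`, `command`) are assumptions here — verify them against the hooks documentation for your version:

```python
import json
import sys

DANGEROUS = ("rm -rf", "rm -fr")

def should_block(event: dict) -> bool:
    """Return True if the pending bash command looks destructive."""
    command = event.get("tool_input", {}).get("command", "")
    return any(pattern in command for pattern in DANGEROUS)

# Demonstration with a sample event of the assumed shape:
sample = {"tool_name": "Bash", "tool_input": {"command": "rm -rf build/"}}
print(should_block(sample))  # True

# A real hook script would read stdin and exit instead:
#   event = json.load(sys.stdin)
#   sys.exit(2 if should_block(event) else 0)
```

The check fires on every bash call — whether or not the agent remembers the rule in CLAUDE.md.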

Layered Guardrails

No single guardrail is sufficient. Defense in depth combines all layers.

  1. CLAUDE.md — behavioral rules in the system prompt
  2. Permissions — tool and path access control
  3. Hooks — programmatic checks at runtime
  4. Sandboxing — isolated execution environment
  5. Human approval — pause before high-stakes actions
  6. Iteration limits — prevent runaway loops and cost overruns

Each layer catches what the previous one misses. Together they make autonomous agents safe to deploy.

The Dashboard Trap

Dashboards Answer Yesterday’s Questions

Organizations spend millions building dashboards. When a new question arises, the cycle restarts: requirements → design → build → deploy.

The Typical Dashboard Lifecycle

  1. Business user requests a report
  2. Team builds queries, charts, and deploys
  3. User asks a follow-up — back to step 1

The Cost

  • Time: Weeks to months per dashboard
  • Money: BI licenses, engineering hours, maintenance
  • Rigidity: Fixed views of fixed data

Gartner: only 20% of analytic insights deliver business outcomes.

The Fundamental Problem

Dashboards answer pre-defined questions. But the most valuable analysis comes from ad-hoc questions that arise in the moment.

  • “What happened to margins in the Southeast last quarter?”
  • “Show me our top 10 customers by growth rate, excluding one-time orders”
  • “Compare Q3 headcount vs. budget by department, and flag anyone over 110%”

These are simple questions. Getting answers shouldn’t require a development cycle.

Natural Language as the Query Interface

The Shift: From Dashboards to Conversations

Traditional Dashboard

  • Click filters and select dates to query
  • Answers take minutes to weeks
  • Follow-ups require a new dashboard request

Natural Language AI

  • Ask in plain English
  • Answers in seconds
  • Follow-ups are the next sentence

The dashboard was a workaround for the fact that databases don’t speak English. Now they do.

What This Looks Like in Practice

The Conversation

  • “Show me monthly revenue by product line for 2025”
  • AI: writes SQL, produces grouped bar chart
  • “Break out Enterprise by region”
  • AI: refines query, updates chart

What the User Needed to Know

  • What questions to ask
  • Whether the answers make sense
  • Nothing else

Behind the Scenes

  • 4 different SQL queries written
  • 3 visualizations produced
  • Derived metrics calculated

The Database Agent Pattern

The most powerful dashboard replacement: an AI agent connected to your database.

How to Build It

  1. Connect database via MCP or file upload
  2. Give AI the schema: table names, columns, relationships
  3. Describe the business context
  4. Start asking questions

What the Agent Can Do

  • Write and execute SQL queries
  • Compute derived metrics (growth rates, ratios)
  • Generate charts and export to Excel or PowerPoint
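The core of the pattern fits in a few lines: the LLM writes SQL, your code executes it against a read-only connection. A minimal sketch with the LLM call stubbed out and a throwaway in-memory database standing in for yours:

```python
import sqlite3

def llm_write_sql(question: str, schema: str) -> str:
    """Stand-in for the model: in practice you send question + schema
    to the API and get SQL back."""
    return "SELECT product, SUM(amount) FROM sales GROUP BY product"

# Demo database. Against a real file, open it read-only:
#   sqlite3.connect("file:finance.db?mode=ro", uri=True)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Enterprise", 120.0), ("SMB", 80.0), ("Enterprise", 30.0)])

schema = "sales(product TEXT, amount REAL)"
sql = llm_write_sql("Monthly revenue by product line?", schema)
rows = conn.execute(sql).fetchall()
print(rows)  # e.g. [('Enterprise', 150.0), ('SMB', 80.0)]
```

The read-only connection is the guardrail: even badly generated SQL cannot modify data.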

Replacing Dashboards: FP&A and Treasury

FP&A

  • “Budget vs. actual for Q3, decompose the variance into volume, price, and cost drivers”
  • “Rolling 12-month revenue forecast with confidence bands”

Treasury & Risk

  • “Cash position over the next 90 days using AR/AP forecasts”
  • “VaR by desk for the last 30 days, flag breaches”

Each of these would take a BI team days to build. With a database agent, they take seconds.

Replacing Dashboards: Portfolio and Executive

Portfolio Management

  • “Sector allocation vs. benchmark with active weights”
  • “Performance attribution — allocation vs. selection”

Executive Reporting

  • “One-page executive summary with KPIs and trends”
  • “Board deck from this quarter’s financials — 5 slides max”

The AI doesn’t replace the analyst’s judgment — it replaces the mechanical work of pulling data and building charts.

The Weekly Ops Review — Before and After

Before: The Dashboard Era

  1. Data team pulls exports and builds slides (6+ hrs)
  2. Manager reviews and revises (3 hrs)
  3. VP asks a question — “We’ll get back to you next week”

Total: 9+ hours per week

After: Natural Language AI

  1. AI agent generates ops review (3 min)
  2. Manager reviews and iterates in chat
  3. VP asks a question — AI answers in 10 seconds

Total: 15 minutes + live Q&A

The Reporting Pipeline

Five Steps from Query to Report

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 100, 'rankSpacing': 140, 'padding': 28, 'useMaxWidth': true}}}%%
flowchart LR
  A["<b>Database<br/>Query</b>"] --> B["<b>Transform<br/>& Compute</b>"]
  B --> C["<b>Charts &<br/>Tables</b>"]
  C --> D["<b>AI<br/>Narrative</b>"]
  D --> E["<b>PowerPoint<br/>Output</b>"]

  style A fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style B fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style C fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style D fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style E fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px

From raw data to polished deck — no copy-paste, no formatting, no manual writing.

Building It with Your Coding Agent

You do not need to write this by hand. Give your coding agent a single prompt:

“Build an app that connects to a database. When the user selects a report and clicks Generate, run the SQL query, create a chart, write a three-sentence executive summary, and assemble a three-slide PowerPoint (title, chart, findings). Show the chart and summary on screen with a download button for the PowerPoint file.”

Your job is not to write the code. Your job is to test the result, refine the prompt, and iterate until the output meets your standards.

Building the Pipeline with Streamlit

Streamlit turns a Python script into a web app in minutes. Combined with an AI pipeline, it becomes a self-service reporting tool.

User Interface

  • Dropdown: select time period, department, or metric
  • Button: “Generate Report”
  • Download: auto-generated .pptx file

Behind the Scenes

  • Query database for selected parameters
  • Compute metrics and build charts
  • Send chart data to LLM for narrative
  • Assemble PowerPoint with python-pptx

The user clicks one button and gets a polished deck. No analyst needed.

Example: Monthly Ops Review Pipeline

What the Pipeline Produces

  1. Title slide, KPI summary table, and trend charts
  2. Variance waterfall (budget vs. actual)
  3. AI-generated executive summary

Time Comparison

  • Manual: 6–8 hours per month (data pull, Excel, copy-paste into PowerPoint, write narrative)
  • Pipeline: 30 seconds per run
  • ROI: First month pays for the build time

Automated PowerPoint with python-pptx

python-pptx is a Python library for creating and editing PowerPoint files programmatically.

What It Can Do

  • Create slides from templates
  • Insert charts, tables, and images
  • Apply corporate formatting (fonts, colors, logos)
  • Populate placeholders with live data

The AI Advantage

  • AI writes the python-pptx code for you
  • Tell Claude: “Create a 5-slide deck from this data with a waterfall chart on slide 3”
  • AI handles layout, formatting, and data binding

You describe the deck; AI builds the automation. The pipeline runs unattended after that.

Sandboxed Execution

Development vs. Production

Development (Your Laptop)

  • Agent runs in your environment
  • Full access to your files
  • Fine for prototyping

Production

  • Agent runs in a container (Docker)
  • Isolated, disposable environment
  • Read-only database access

A container is a disposable, isolated computing environment. A bug in AI-generated code cannot affect your other files, your database, or your network.

The Sandbox Pattern

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 120, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  U["<b>User</b>"] --> A["<b>Agent</b>"]
  A --> S["<b>Sandboxed<br>Code</b>"]
  S <--> DB["<b>Database<br>(read-only)</b>"]
  S --> R["<b>Report</b>"]
  R --> U

  style U fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style A fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style S fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style DB fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px
  style R fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:28px,padding:24px

  • Agent plans and orchestrates; code runs in a sandboxed container
  • Container queries the database through a read-only connection
  • Finished report flows back to the user

From Prototype to Production

The reporting app on your laptop is a prototype. Moving it to production adds infrastructure, not intelligence.

  • Authentication: SSO so only authorized users can generate reports
  • Scheduling: weekly reports generated automatically
  • Logging: every query and LLM call recorded for compliance
  • Read-only credentials: the agent cannot modify data

The pipeline you prototyped is the same artifact that powers the production tool. IT wraps it in infrastructure; you provide the domain knowledge and the prompt.

Variance Analysis

What is Variance Analysis?

Variance analysis compares budgeted figures to actual results and decomposes the differences into actionable drivers. It is the core analytical task in FP&A.

Why It Matters

  • Every public company does it quarterly
  • Boards and investors ask “why did we miss?”
  • Drives re-forecasting and capital allocation

Key Decompositions

  • Revenue: volume vs. price vs. mix
  • COGS: volume vs. unit cost
  • SG&A: headcount vs. rate vs. discretionary

Revenue variance = Volume effect + Price effect + Mix effect

Variance Analysis: Example Data

Scenario: You are an FP&A analyst. Q1 actuals just closed. The CEO wants to know why operating income missed budget by $300K.

  • Revenue: Budget 100K units at $50 = $5M; Actual 95K units at $51 = $4.845M — ($155K) miss
  • COGS: Budget $30/unit = $3M; Actual $32/unit = $3.04M — ($40K) miss
  • SG&A: Budget $1.45M; Actual $1.555M — ($105K) miss
  • Operating Income: Budget $550K; Actual $250K — ($300K) total miss
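The decomposition for this data takes a few lines of arithmetic. Convention (one common choice): price and rate effects are computed at actual volume, volume effects at budgeted price/cost, and favorable effects are positive:

```python
b_units, a_units = 100_000, 95_000    # budget vs. actual units
b_price, a_price = 50, 51             # selling price per unit
b_cogs,  a_cogs  = 30, 32             # cost per unit
b_sga,   a_sga   = 1_450_000, 1_555_000

rev_volume  = (a_units - b_units) * b_price    # -250,000: sold fewer units
rev_price   = (a_price - b_price) * a_units    #  +95,000: higher price helped
cogs_volume = -(a_units - b_units) * b_cogs    # +150,000: fewer units, less cost
cogs_rate   = -(a_cogs - b_cogs) * a_units     # -190,000: unit cost rose
sga         = -(a_sga - b_sga)                 # -105,000: overspend

total = rev_volume + rev_price + cogs_volume + cogs_rate + sga
print(f"{total:,}")  # -300,000 — matches the operating income miss
```

The five signed effects are exactly the bars of the waterfall chart the CFO will ask for.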

Variance Analysis with an Agent

With a database agent connected to your financials, one prompt replaces 2–4 hours of FP&A work:

Prompt

“Here is our Q1 budget vs. actuals spreadsheet. Decompose the $300K operating income miss into volume, price, cost, and discretionary spending drivers. Produce a waterfall chart and a summary memo for the CFO.”

What the Agent Does

  1. Plans: identifies the variance decomposition formulas needed
  2. Decomposes revenue into volume + price; COGS into volume + rate
  3. Generates waterfall chart and writes CFO-ready memo in Word format

Finance Application: M&A Due Diligence

M&A Due Diligence with an Agent

Give an agent a goal: “Evaluate this acquisition target.” The agent:

  1. Ingests data from multiple formats (Excel, PDF, Word, CSV)
  2. Applies evaluation criteria and computes risk metrics
  3. Produces a summary memo with flagged risks

This is a single prompt to Claude Code. The agent plans and executes all three steps autonomously, reading multiple file formats and combining the results.

Orchestration

The Orchestration Layer

The agent’s control logic can route different tasks to different models and prompts.

Different Prompts

  • SQL generation → database schema prompt
  • Python analysis → data science prompt
  • Each task gets specialized instructions

Different Models

  • Simple classification → fast, cheap model
  • Complex reasoning → powerful model
  • Cost and speed optimization

Sub-agents: dispatch specialized workers for parallel tasks. This is what Cowork does with its parallel VMs.
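Stripped to its essentials, the orchestration layer is a routing table: each task type maps to a model and a specialized prompt. The model names and task labels below are placeholders, not real identifiers:

```python
# task type -> (model, specialized system prompt); all names illustrative
ROUTES = {
    "classify": ("fast-cheap-model",   "Label each transaction by category..."),
    "sql":      ("mid-tier-model",     "Write SQL against the schema below..."),
    "reason":   ("most-capable-model", "Work through the analysis step by step..."),
}

def route(task_type):
    """Pick (model, system_prompt); unknown tasks fall back to the mid tier."""
    return ROUTES.get(task_type, ROUTES["sql"])

model, system_prompt = route("classify")
print(model)  # fast-cheap-model
```

A few lines of routing logic can cut API spend substantially when most calls are simple classifications that never need the expensive model.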

Exercises

Exercise 1: Multi-Step Agent Workflow

Give Claude Code a single compound instruction that requires at least three steps.

Example: “Fetch Apple’s quarterly revenue from the Rice Data Portal. Create a bar chart of the last 8 quarters. Write a one-paragraph executive summary. Save the chart and summary to a reports folder.”

  • Observe the agent loop — how many distinct tool calls does the agent make?
  • Does it check its own work?
  • If the output is not right, refine your prompt and try again

Exercise 2: Workflow Decomposition

Consider the task: “Prepare the quarterly business review for the CEO.”

  1. List the 5–7 steps a human analyst would take to complete this task
  2. For each step, identify what tool the agent would use and what data it would need
  3. Mark which steps need human approval before the agent continues
  4. Write a single prompt that describes the full workflow for an agent

Exercise 3: M&A Due Diligence

  1. Download the due diligence data pack (Excel + PDF + Word + CSV)
  2. Ask Claude Code to evaluate the acquisition target end-to-end
  3. Submit: the summary report + screenshots of intermediate steps

Watch how Claude plans its approach, reads each file, and combines the results into a coherent analysis.

Exercise 4: Streamlit App with the Anthropic API

  1. Get an API key from console.anthropic.com
  2. Ask Claude Code to build a Streamlit app that:
    • Takes a company ticker from the user
    • Sends a prompt to the Anthropic API asking for an investment summary
    • Displays the AI-generated summary on screen
  3. Extend: add a system prompt with specific analysis criteria (valuation, growth, risks)