Module 12: Retrieval Augmented Generation

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

What We’ve Done

  • Financial analysis with AI: mean-variance, DCF, Excel
  • Email and calendar connectors
  • Skills and memory files
  • Don’t accept the first draft: critique skill
  • Custom chatbots (system prompts) and agents
  • Web scraping
  • Corporate deployment: data security, system prompt, logging

Where We’re Going

M12: RAG

  • Give AI knowledge it wasn’t trained on
  • Chunking, embedding, and vector search

M13: Fine-Tuning

  • Customize model behavior with your data
  • LoRA, QLoRA, and distillation

M14: Sentiment Analysis

  • LLMs as sentiment classifiers
  • News and earnings call analysis
  • Trading signals from text

M15: More on Building Agents

  • Define tools as JSON Schema
  • The agentic loop
  • Portfolio review agent

M16: How AI Works

  • Neural networks
  • Embeddings and attention
  • Training of LLMs

RAG: What Does It Mean?

Retrieval

Fetch relevant passages from a knowledge base

  • Search your documents
  • Find the most relevant passages
  • “Retrieve” what the LLM needs to know

Augmented

Pass retrieved passages alongside the user’s prompt to the LLM

  • The prompt is augmented with context
  • LLM sees: your question + relevant excerpts
  • No retraining needed

Generation

The LLM generates an answer grounded in the retrieved context

  • This is the “Gen” in Gen AI
  • Answer is based on your documents, not just training data
  • Can cite specific sources (with proper metadata)


Beyond Prompting

Four Ways to Give AI Knowledge

Prompting and skills customize how an LLM responds. But what if you need it to know private data it wasn’t trained on?

|                | Long Context | RAG | Fine-Tuning | Small Language Model |
|----------------|--------------|-----|-------------|----------------------|
| Best for       | Short docs that fit in one prompt | Large or many docs; proprietary facts | Specific tone, format, or domain expertise | Full control, privacy, specialized tasks |
| Data freshness | Real-time (paste it in) | Real-time updates | Stale after training | Stale after training |
| Cost to start  | None | Low (no model training) | Medium (GPU hours) | High (training + hosting) |
| Cost per query | High (long prompts are expensive) | Low (only relevant chunks sent) | Low | Low |
| Complexity     | None | Moderate (pipeline + vector DB) | High (dataset curation) | Very high (full ML ops) |

The RAG Pipeline

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '64px', 'lineWidth': 6}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 100, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  subgraph prep ["Preparation (done first)"]
    D["<b>Documents</b>"] --> CE["<b>Chunk &<br>Embed</b>"]
    CE --> VDB["<b>Vector DB</b>"]
  end
  UQ["<b>User Query</b>"] --> EQ["<b>Embed<br>Query</b>"]
  EQ --> R["<b>Retrieve<br>Matches</b>"]
  VDB --> R
  R -->|"query + context"| LLM["<b>LLM</b>"]

  style prep fill:none,stroke:#3b82f6,stroke-width:4px,stroke-dasharray:12 6,rx:40,ry:40,color:#3b82f6,font-size:40px
  style D fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style CE fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style VDB fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style UQ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style EQ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style R fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style LLM fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px

Embeddings: A Concrete Example

Word Vectors

An embedding model converts text into a vector of numbers. Words with similar meanings get similar vectors.

Using GloVe (50-dimensional vectors trained on Wikipedia + Gigaword):

import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

model["king"]   # array([ 0.50,  0.69, -0.60, -0.02,  0.60, ...])  50 dims
model["queen"]  # array([ 0.38,  1.82, -1.26, -0.10,  0.36, ...])
model["apple"]  # array([ 0.52, -0.83,  0.50,  1.29,  0.12, ...])

“King” and “queen” have similar vectors. “Apple” is very different. (GloVe illustrates the idea; RAG uses more advanced sentence embeddings we’ll see next.)

Word Analogies: King – Man + Woman

Word vectors capture meaning arithmetically:

\[\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\]

result = model.most_similar(
    positive=["king", "woman"], negative=["man"], topn=3
)
# [('queen', 0.86), ('daughter', 0.77), ('prince', 0.76)]

The model was never told these relationships — it learned them from patterns in text.

How Are Word Embeddings Trained?

The model reads billions of sentences and learns: words that appear in similar contexts get similar vectors.

  • “The company’s revenue grew 15% year over year”
  • “The company’s income grew 15% year over year”
  • “The company’s profit grew 15% year over year”

Revenue, income, and profit fill the same blank — so the model pushes their vectors close together.

Optimization objective: given a word, predict its neighbors (or vice versa). The vectors are the learned parameters — adjusted millions of times until the predictions are accurate.
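This objective can be seen in miniature. Below is a toy skip-gram-style trainer with negative sampling; the corpus, vector dimension, and learning rate are all made up for illustration (real models train on billions of sentences):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: "revenue", "income", and "profit" appear in identical contexts.
sentences = [
    ["the", "company", "revenue", "grew", "fast"],
    ["the", "company", "income", "grew", "fast"],
    ["the", "company", "profit", "grew", "fast"],
    ["the", "stock", "price", "fell", "sharply"],
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

dim = 16
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors (what we keep)
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(300):
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - 2), min(len(sent), i + 3)):  # window of 2
                if j == i:
                    continue
                wi, ci = idx[word], idx[sent[j]]
                # Positive pair: pull the word toward its observed neighbor.
                v, c = W_in[wi].copy(), W_out[ci].copy()
                g = sigmoid(v @ c) - 1.0
                W_in[wi] -= lr * g * c
                W_out[ci] -= lr * g * v
                # One random negative sample: push the word away from it.
                ni = int(rng.integers(len(vocab)))
                v, n = W_in[wi].copy(), W_out[ni].copy()
                g = sigmoid(v @ n)
                W_in[wi] -= lr * g * n
                W_out[ni] -= lr * g * v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that fill the same blank end up close; unrelated words do not.
print(cos(W_in[idx["revenue"]], W_in[idx["income"]]))
print(cos(W_in[idx["revenue"]], W_in[idx["sharply"]]))
```

After a few hundred passes, "revenue" and "income" have noticeably more similar vectors than "revenue" and "sharply", purely because they appeared in the same contexts.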

How Are Sentence Embeddings Trained?

Word embeddings only capture individual words. For RAG, we need to compare whole sentences. But the model can’t memorize every possible sentence.

Contrastive learning on sentence pairs:

| Sentence A | Sentence B | Label |
|------------|------------|-------|
| “Revenue rose 15%” | “Sales increased significantly” | Similar |
| “Revenue rose 15%” | “The CEO resigned yesterday” | Dissimilar |

Objective: adjust the model so that similar pairs have high cosine similarity and dissimilar pairs have low cosine similarity.
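The objective can be written down directly. A toy cosine-embedding loss, with made-up vectors standing in for encoder outputs:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(a, b, similar, margin=0.2):
    """Pull similar pairs toward cosine 1; penalize dissimilar pairs above `margin`."""
    c = cosine(a, b)
    return 1.0 - c if similar else max(0.0, c - margin)

# Made-up embeddings standing in for encoder outputs
rev = np.array([0.9, 0.4, 0.1])     # "Revenue rose 15%"
sales = np.array([0.8, 0.5, 0.2])   # "Sales increased significantly"
ceo = np.array([-0.3, 0.9, -0.2])   # "The CEO resigned yesterday"

print(contrastive_loss(rev, sales, similar=True))   # small: pair already close
print(contrastive_loss(rev, ceo, similar=False))    # zero once cosine is below margin
```

Training backpropagates this loss through the encoder, so the model learns to place paraphrases near each other and unrelated sentences far apart.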

From Sentence to Vector: The Algorithm

How does a new, never-seen sentence become a single vector?

  1. Tokenize — split the sentence into tokens (subword pieces)
  2. Transformer — each token passes through attention layers that consider every other token in the sentence
  3. Pool — average the contextualized token vectors into one sentence vector

The transformer is the key: it means “bank” gets a different vector in “river bank” vs. “bank account” because attention incorporates the surrounding words before pooling.
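Step 3, pooling, is just a masked average. With toy numbers standing in for the transformer's contextualized token vectors:

```python
import numpy as np

# Toy contextualized token vectors (4 tokens x 3 dims; real models use 384+ dims)
token_vecs = np.array([
    [0.2, 0.1, 0.5],   # "revenue"
    [0.4, 0.0, 0.3],   # "grew"
    [0.1, 0.2, 0.4],   # "15%"
    [0.0, 0.0, 0.0],   # [PAD] padding slot, should not count
])
attention_mask = np.array([1, 1, 1, 0])  # 1 = real token, 0 = padding

# Mean-pool only over real tokens
masked = token_vecs * attention_mask[:, None]
sentence_vec = masked.sum(axis=0) / attention_mask.sum()
print(sentence_vec)  # one vector for the whole sentence
```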

From Words to Sentences

Word vectors capture individual word meaning, but RAG needs to compare the meaning of entire passages.

sentence-transformers passes the full text through a transformer and outputs a single vector:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

v1 = model.encode("Revenue grew 15%")
v2 = model.encode("How did profits change?")
print(f"v1 shape: {v1.shape}  — one vector per sentence")
print(f"v1[:5]:   {v1[:5].round(2)} ...")
print(f"v2[:5]:   {v2[:5].round(2)} ...")
v1 shape: (384,)  — one vector per sentence
v1[:5]:   [ 0.05  0.01  0.01 -0.06  0.05] ...
v2[:5]:   [ 0.    0.02 -0.01 -0.01  0.04] ...

This is not an average of word vectors — the transformer considers word order, grammar, and context to produce a single representation.

Cosine Similarity

How do we measure if two vectors are “close”? Cosine similarity: the cosine of the angle between them.

  • 1.0 = identical direction (same meaning)
  • 0.0 = perpendicular (unrelated)
  • -1.0 = opposite direction
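The measure is simple to compute directly; a one-line numpy version (equivalent in spirit to the sklearn helper used later):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between a and b: (a . b) / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cos_sim(a, a))        # identical direction: 1.0
print(cos_sim(a, 2 * a))    # scaling doesn't matter: still 1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # perpendicular: 0.0
print(cos_sim(a, -a))       # opposite direction: -1.0
```

Note that only direction matters: doubling a vector's length leaves its cosine similarity to everything else unchanged.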

Cosine Similarity in Action

from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Revenue grew 15% year over year to $4.2 billion.",
    "The company opened 12 new retail locations in Q3.",
    "Net income declined 8% due to rising input costs.",
]
query = "What happened to net income?"

scores = cosine_similarity(model.encode([query]), model.encode(chunks))[0]
for chunk, score in zip(chunks, scores):
    print(f"  {score:.2f}  {chunk}")
  0.41  Revenue grew 15% year over year to $4.2 billion.
  0.09  The company opened 12 new retail locations in Q3.
  0.54  Net income declined 8% due to rising input costs.

Chunking a Document

Before embedding, documents are split into chunks — small pieces, each of which becomes one vector.

text = """Apple reported Q4 revenue of $89.5 billion, down 1% year
over year. Services revenue hit an all-time record of $22.3
billion. iPhone revenue was $43.8 billion. The company returned
over $25 billion to shareholders during the quarter through
dividends and share repurchases."""

# Split into chunks of ~50 words with overlap
chunks = [
    "Apple reported Q4 revenue of $89.5 billion, down 1% year over year. "
    "Services revenue hit an all-time record of $22.3 billion.",

    "Services revenue hit an all-time record of $22.3 billion. "
    "iPhone revenue was $43.8 billion.",

    "iPhone revenue was $43.8 billion. The company returned over $25 "
    "billion to shareholders through dividends and share repurchases.",
]

Overlap ensures context isn’t lost at chunk boundaries.
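A minimal chunker in this spirit (word-based for simplicity; production pipelines usually count tokens instead):

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into `chunk_size`-word chunks; consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

apple = """Apple reported Q4 revenue of $89.5 billion, down 1% year
over year. Services revenue hit an all-time record of $22.3
billion. iPhone revenue was $43.8 billion. The company returned
over $25 billion to shareholders during the quarter through
dividends and share repurchases."""

for c in chunk_words(apple, chunk_size=20, overlap=5):
    print(c)
    print("---")
```

Each chunk starts 15 words after the previous one, so the last 5 words of one chunk reappear at the start of the next.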

The Full Picture: Query → Retrieve → Generate

# 1. Embed chunks and store in a vector database
chunk_vectors = model.encode(chunks)

# 2. User asks a question — embed the query with the SAME model
query = "What was Apple's services revenue?"
query_vector = model.encode([query])

# 3. Find the most similar chunk (cosine similarity)
scores = cosine_similarity(query_vector, chunk_vectors)[0]
best_chunk = chunks[scores.argmax()]
print(f"Best match (score {scores.max():.2f}):")
print(f"  {best_chunk}")

# 4. Pass the chunk + question to the LLM
prompt = f"Based on this context:\n{best_chunk}\n\nAnswer: {query}"
Best match (score 0.79):
  Apple reported Q4 revenue of $89.5 billion, down 1% year over year. Services revenue hit an all-time record of $22.3 billion.

The LLM never sees the whole document — just the most relevant chunk.

Where Have You Seen This Before?

You Already Use Embeddings

The embed → cosine similarity → retrieve pipeline isn’t new. It powers tools you use every day:

| Product | What Gets Embedded | What It Does |
|---------|--------------------|--------------|
| Google Search | Web pages and queries | Matches query meaning, not just keywords (BERT, 2019) |
| Spotify | Songs, playlists, listening history | Discover Weekly, “similar songs” |
| Netflix / YouTube | Movies, watch history | “Because you watched…” recommendations |
| iPhone Photos | Images | Search “beach sunset” with no tags |
| Amazon | Products, purchase history | “Customers who bought this also bought…” |
| LinkedIn | Resumes, job postings | Job/candidate matching beyond keyword overlap |

Same Math, Different Documents

Recommendations

  • Embed songs, movies, or products into vectors
  • Find nearest neighbors by cosine similarity
  • Return the most similar items

RAG

  • Embed document chunks into vectors
  • Find nearest neighbors by cosine similarity
  • Return the most relevant passages to the LLM

RAG applies the same math that powers recommendations and search — just to documents instead of songs or products.

RAG in Finance

Finance Applications of RAG

Document Types

  • 10-K/10-Q filings and earnings transcripts
  • Analyst reports and deal documents
  • Internal policies and memos

Use Cases

  • Compliance Q&A: query regulatory filings, internal policies
  • Due diligence: search deal documents with citations
  • Research synthesis: combine multiple sources

RAG: Strengths and Limitations

Strengths

  • No model training required (pipeline engineering is the main cost)
  • Data can be updated in real time
  • Answers are traceable to source pages

Limitations

  • Quality depends on retrieval quality
  • Chunking can split important context
  • Adds infrastructure cost (embedding API, vector DB hosting) and query latency

Why not just paste the whole document into the prompt? Context windows are growing fast (100K–1M tokens), but RAG still wins when you have many documents, need to keep costs down (long prompts are expensive), or want traceable citations from specific passages.

What Can Go Wrong?

Retrieval Failures

The LLM can only answer from what it retrieves. If retrieval fails, the answer will be wrong or fabricated.

| Problem | Example | Result |
|---------|---------|--------|
| Wrong chunk retrieved | Query about “Apple revenue” retrieves a chunk about Apple’s supply chain | Answer discusses supply chain, ignores revenue |
| Relevant info split across chunks | Revenue in one chunk, costs in another | LLM sees only half the picture |
| No relevant chunk exists | Question about a topic not in the documents | LLM may hallucinate an answer anyway |
| Too many chunks retrieved | 20 chunks passed to LLM, mostly noise | Key information buried; LLM confused |
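One cheap guardrail for the "no relevant chunk exists" failure: refuse to answer when even the best similarity score is low (the 0.3 threshold below is illustrative; tune it on your own data):

```python
def retrieve_or_refuse(scores, chunks, threshold=0.3):
    """Return the best-matching chunk, or None if nothing is relevant enough."""
    best = max(range(len(chunks)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None  # tell the user "not in the documents" instead of letting the LLM guess
    return chunks[best]

chunks = ["Revenue grew 15%.", "Net income fell 8%."]
print(retrieve_or_refuse([0.41, 0.54], chunks))   # confident match
print(retrieve_or_refuse([0.08, 0.11], chunks))   # nothing relevant: None
```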

Chunking Problems

How you split documents dramatically affects retrieval quality:

Chunks Too Small

  • “Revenue was $89.5 billion” in one chunk
  • “down 1% year over year” in another
  • Neither chunk alone tells the full story
  • Query matches the wrong fragment

Chunks Too Large

  • A 3-page chunk covers revenue, costs, risks, and guidance
  • Query about revenue retrieves the whole chunk
  • LLM gets noisy context, may pull from the wrong section
  • Embedding is a blurry average of many topics

Hallucination Despite Context

Even with good retrieval, the LLM can still go wrong:

  • Over-extrapolation: “Revenue grew 15%” → LLM infers “the company is doing well” even though the next paragraph discusses a major lawsuit
  • Confidence without evidence: LLM generates a precise-sounding answer that isn’t supported by the retrieved chunks
  • Blending sources: mixes information from different chunks in ways that create false conclusions
  • Ignoring the context: especially with long or complex chunks, the LLM may fall back on its training data instead

RAG reduces hallucination but does not eliminate it. Always verify critical claims against the source document.

Best Practices

Chunking Best Practices

Chunk Size

  • 200–500 tokens for precise retrieval
  • 500–1000 tokens for more context per chunk
  • Test on your actual queries to find the sweet spot
  • Financial filings: paragraph-level chunks often work best

Overlap

  • 10–20% overlap between consecutive chunks
  • Ensures sentences at boundaries aren’t orphaned
  • Example: 400-token chunks with 80-token overlap
  • Critical for tables and lists that span chunk boundaries

There is no universal best chunk size. The right answer depends on your documents, your questions, and your embedding model. Test and iterate.

Retrieval Best Practices

  • Retrieve more, then re-rank: pull top 10 chunks, then use a re-ranker to select the best 3–5
  • Hybrid search: combine vector similarity (semantic) with keyword search (exact match) — catches both meaning and specific terms like ticker symbols
  • Metadata filters: tag chunks with source, date, section (e.g., “Risk Factors”) and filter before similarity search
  • Query expansion: rephrase the user’s question multiple ways to improve recall
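A minimal sketch of hybrid scoring, assuming vector scores are already computed (the alpha weight and naive whitespace keyword match are illustrative; real systems typically use BM25 for the keyword side):

```python
def keyword_score(query, chunk):
    """Fraction of query words that appear verbatim in the chunk (naive tokenization)."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def hybrid_scores(query, chunks, vector_scores, alpha=0.7):
    """Blend semantic and keyword scores; alpha is the weight on the vector score."""
    return [alpha * v + (1 - alpha) * keyword_score(query, c)
            for v, c in zip(vector_scores, chunks)]

chunks = [
    "AAPL reported record Q4 services revenue.",
    "The company reported strong quarterly results.",
]
# Suppose the embedding model scores both chunks about equally...
vector_scores = [0.50, 0.52]
# ...the exact-match term "AAPL" breaks the tie in favor of the first chunk.
print(hybrid_scores("AAPL Q4 revenue", chunks, vector_scores))
```

This is exactly the ticker-symbol case: embeddings alone can miss a rare exact token that keyword matching catches immediately.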

Prompt Engineering for RAG

The prompt that wraps retrieved chunks matters (pseudocode — plug into an f-string or template):

Based ONLY on the following excerpts from {company}'s 10-K filing,
answer the question below. If the excerpts do not contain enough
information to answer, say "I don't have enough information" and
do NOT guess or use outside knowledge.

EXCERPTS: {retrieved_chunks}
QUESTION: {user_query}
Cite the specific excerpt(s) that support your answer.

The instruction to say “I don’t know” and the requirement to cite sources are critical guardrails against hallucination.
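In code, the template is just an f-string. A sketch of a helper that fills it in (function and parameter names are illustrative):

```python
def build_rag_prompt(company, retrieved_chunks, user_query):
    """Wrap retrieved chunks and the user's question in the RAG prompt template."""
    excerpts = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        f"Based ONLY on the following excerpts from {company}'s 10-K filing,\n"
        "answer the question below. If the excerpts do not contain enough\n"
        "information to answer, say \"I don't have enough information\" and\n"
        "do NOT guess or use outside knowledge.\n\n"
        f"EXCERPTS:\n{excerpts}\n\n"
        f"QUESTION: {user_query}\n"
        "Cite the specific excerpt(s) that support your answer."
    )

prompt = build_rag_prompt(
    "Apple",
    ["Services revenue hit an all-time record of $22.3 billion."],
    "What was Apple's services revenue?",
)
print(prompt)
```

Numbering the excerpts gives the LLM something concrete to cite (e.g., “per excerpt [1]”), which makes answers easier to verify.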