Module 12: Retrieval Augmented Generation

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

What We’ve Done

  • Financial analysis with AI: mean-variance, DCF, Excel
  • Email and calendar connectors
  • Skills and memory files
  • Don’t accept the first draft: critique skill
  • Custom chatbots (system prompts) and agents
  • Web scraping
  • Corporate deployment: data security, system prompt, logging

Where We’re Going

M12: RAG

  • Give AI knowledge it wasn’t trained on
  • Chunking, embedding, and vector search

M13: Fine-Tuning

  • Customize model behavior with your data
  • LoRA, QLoRA, and distillation

M14: Sentiment Analysis

  • LLMs as sentiment classifiers
  • News and earnings call analysis
  • Trading signals from text

M15: More on Building Agents

  • Define tools as JSON Schema
  • The agentic loop
  • Portfolio review agent

M16: How AI Works

  • Neural networks
  • Embeddings and attention
  • Training of LLMs

RAG: What Does It Mean?

Retrieval

Fetch relevant passages from a knowledge base

  • Search your documents
  • Find the most relevant passages
  • “Retrieve” what the LLM needs to know

Augmented

Pass retrieved passages alongside the user’s prompt to the LLM

  • The prompt is augmented with context
  • LLM sees: your question + relevant excerpts
  • No retraining needed

Generation

The LLM generates an answer grounded in the retrieved context

  • This is the “Gen” in Gen AI
  • Answer is based on your documents, not just training data
  • Can cite specific sources (with proper metadata)


Beyond Prompting

Four Ways to Give AI Knowledge

Prompting and skills customize how an LLM responds. But what if you need it to know private data it wasn’t trained on?

|                | Long Context | RAG | Fine-Tuning | Small Language Model |
|----------------|--------------|-----|-------------|----------------------|
| Best for       | Short docs that fit in one prompt | Large or many docs; proprietary facts | Specific tone, format, or domain expertise | Full control, privacy, specialized tasks |
| Data freshness | Real-time (paste it in) | Real-time updates | Stale after training | Stale after training |
| Cost to start  | None | Low (no model training) | Medium (GPU hours) | High (training + hosting) |
| Cost per query | High (long prompts are expensive) | Low (only relevant chunks sent) | Low | Low |
| Complexity     | None | Moderate (pipeline + vector DB) | High (dataset curation) | Very high (full ML ops) |

The RAG Pipeline

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '64px', 'lineWidth': 6}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 100, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  subgraph prep ["Preparation (done first)"]
    D["<b>Documents</b>"] --> CE["<b>Chunk &<br>Embed</b>"]
    CE --> VDB["<b>Vector DB</b>"]
  end
  UQ["<b>User Query</b>"] --> EQ["<b>Embed<br>Query</b>"]
  EQ --> R["<b>Retrieve<br>Matches</b>"]
  VDB --> R
  R -->|"query + context"| LLM["<b>LLM</b>"]

  style prep fill:none,stroke:#3b82f6,stroke-width:4px,stroke-dasharray:12 6,rx:40,ry:40,color:#3b82f6,font-size:40px
  style D fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style CE fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style VDB fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style UQ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style EQ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style R fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px
  style LLM fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:64px,padding:20px

Embeddings: A Concrete Example

Word Vectors

An embedding model converts text into a vector of numbers. Words with similar meanings get similar vectors.

Using GloVe (50-dimensional vectors trained on Wikipedia + Gigaword):

import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

model["king"]   # array([ 0.50,  0.69, -0.60, -0.02,  0.60, ...])  50 dims
model["queen"]  # array([ 0.38,  1.82, -1.26, -0.10,  0.36, ...])
model["apple"]  # array([ 0.52, -0.83,  0.50,  1.29,  0.12, ...])

“King” and “queen” have similar vectors. “Apple” is very different. (GloVe illustrates the idea; RAG uses more advanced sentence embeddings we’ll see next.)

Word Analogies: King – Man + Woman

Word vectors capture meaning arithmetically:

\[\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\]

result = model.most_similar(
    positive=["king", "woman"], negative=["man"], topn=3
)
# [('queen', 0.86), ('daughter', 0.77), ('prince', 0.76)]

The model was never told these relationships — it learned them from patterns in text.

How Are Word Embeddings Trained?

The model reads billions of sentences and learns: words that appear in similar contexts get similar vectors.

  • “The company’s revenue grew 15% year over year”
  • “The company’s income grew 15% year over year”
  • “The company’s profit grew 15% year over year”

Revenue, income, and profit fill the same blank — so the model pushes their vectors close together.

Optimization objective: given a word, predict its neighbors (or vice versa). The vectors are the learned parameters — adjusted millions of times until the predictions are accurate.
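This objective can be seen in miniature. Below is a toy skip-gram-style trainer with negative sampling; the corpus, vector dimension, and learning rate are all made up for illustration (real models train on billions of sentences):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: "revenue", "income", and "profit" appear in identical contexts.
sentences = [
    ["the", "company", "revenue", "grew", "fast"],
    ["the", "company", "income", "grew", "fast"],
    ["the", "company", "profit", "grew", "fast"],
    ["the", "stock", "price", "fell", "sharply"],
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

dim = 16
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors (what we keep)
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(300):
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - 2), min(len(sent), i + 3)):  # window of 2
                if j == i:
                    continue
                wi, ci = idx[word], idx[sent[j]]
                # Positive pair: pull the word toward its observed neighbor.
                v, c = W_in[wi].copy(), W_out[ci].copy()
                g = sigmoid(v @ c) - 1.0
                W_in[wi] -= lr * g * c
                W_out[ci] -= lr * g * v
                # One random negative sample: push the word away from it.
                ni = int(rng.integers(len(vocab)))
                v, n = W_in[wi].copy(), W_out[ni].copy()
                g = sigmoid(v @ n)
                W_in[wi] -= lr * g * n
                W_out[ni] -= lr * g * v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that fill the same blank end up close; unrelated words do not.
print(cos(W_in[idx["revenue"]], W_in[idx["income"]]))
print(cos(W_in[idx["revenue"]], W_in[idx["sharply"]]))
```

After a few hundred passes, "revenue" and "income" have noticeably more similar vectors than "revenue" and "sharply", purely because they appeared in the same contexts.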

How Are Sentence Embeddings Trained?

Word embeddings only capture individual words. For RAG, we need to compare whole sentences. But the model can’t memorize every possible sentence.

Contrastive learning on sentence pairs:

| Sentence A | Sentence B | Label |
|------------|------------|-------|
| “Revenue rose 15%” | “Sales increased significantly” | Similar |
| “Revenue rose 15%” | “The CEO resigned yesterday” | Dissimilar |

Objective: adjust the model so that similar pairs have high cosine similarity and dissimilar pairs have low cosine similarity.
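The objective can be written down directly. A toy cosine-embedding loss, with made-up vectors standing in for encoder outputs:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(a, b, similar, margin=0.2):
    """Pull similar pairs toward cosine 1; penalize dissimilar pairs above `margin`."""
    c = cosine(a, b)
    return 1.0 - c if similar else max(0.0, c - margin)

# Made-up embeddings standing in for encoder outputs
rev = np.array([0.9, 0.4, 0.1])     # "Revenue rose 15%"
sales = np.array([0.8, 0.5, 0.2])   # "Sales increased significantly"
ceo = np.array([-0.3, 0.9, -0.2])   # "The CEO resigned yesterday"

print(contrastive_loss(rev, sales, similar=True))   # small: pair already close
print(contrastive_loss(rev, ceo, similar=False))    # zero once cosine is below margin
```

Training backpropagates this loss through the encoder, so the model learns to place paraphrases near each other and unrelated sentences far apart.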

From Sentence to Vector: The Algorithm

How does a new, never-seen sentence become a single vector?

  1. Tokenize — split the sentence into tokens (subword pieces)
  2. Transformer — each token passes through attention layers that consider every other token in the sentence
  3. Pool — average the contextualized token vectors into one sentence vector

The transformer is the key: it means “bank” gets a different vector in “river bank” vs. “bank account” because attention incorporates the surrounding words before pooling.
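Step 3, pooling, is just a masked average. With toy numbers standing in for the transformer's contextualized token vectors:

```python
import numpy as np

# Toy contextualized token vectors (4 tokens x 3 dims; real models use 384+ dims)
token_vecs = np.array([
    [0.2, 0.1, 0.5],   # "revenue"
    [0.4, 0.0, 0.3],   # "grew"
    [0.1, 0.2, 0.4],   # "15%"
    [0.0, 0.0, 0.0],   # [PAD] padding slot, should not count
])
attention_mask = np.array([1, 1, 1, 0])  # 1 = real token, 0 = padding

# Mean-pool only over real tokens
masked = token_vecs * attention_mask[:, None]
sentence_vec = masked.sum(axis=0) / attention_mask.sum()
print(sentence_vec)  # one vector for the whole sentence
```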

From Words to Sentences

Word vectors capture individual word meaning, but RAG needs to compare the meaning of entire passages.

sentence-transformers passes the full text through a transformer and outputs a single vector:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

v1 = model.encode("Revenue grew 15%")
v2 = model.encode("How did profits change?")
print(f"v1 shape: {v1.shape}  — one vector per sentence")
print(f"v1[:5]:   {v1[:5].round(2)} ...")
print(f"v2[:5]:   {v2[:5].round(2)} ...")
v1 shape: (384,)  — one vector per sentence
v1[:5]:   [ 0.05  0.01  0.01 -0.06  0.05] ...
v2[:5]:   [ 0.    0.02 -0.01 -0.01  0.04] ...

This is not an average of word vectors — the transformer considers word order, grammar, and context to produce a single representation.

Cosine Similarity

How do we measure if two vectors are “close”? Cosine similarity: the cosine of the angle between them.

  • 1.0 = identical direction (same meaning)
  • 0.0 = perpendicular (unrelated)
  • -1.0 = opposite direction
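The measure is simple to compute directly; a one-line numpy version (equivalent in spirit to the sklearn helper used later):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between a and b: (a . b) / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cos_sim(a, a))        # identical direction: 1.0
print(cos_sim(a, 2 * a))    # scaling doesn't matter: still 1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # perpendicular: 0.0
print(cos_sim(a, -a))       # opposite direction: -1.0
```

Note that only direction matters: doubling a vector's length leaves its cosine similarity to everything else unchanged.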

Cosine Similarity in Action

from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Revenue grew 15% year over year to $4.2 billion.",
    "The company opened 12 new retail locations in Q3.",
    "Net income declined 8% due to rising input costs.",
]
query = "What happened to net income?"

scores = cosine_similarity(model.encode([query]), model.encode(chunks))[0]
for chunk, score in zip(chunks, scores):
    print(f"  {score:.2f}  {chunk}")
  0.41  Revenue grew 15% year over year to $4.2 billion.
  0.09  The company opened 12 new retail locations in Q3.
  0.54  Net income declined 8% due to rising input costs.

Chunking a Document

Before embedding, documents are split into chunks — small pieces, each of which becomes one vector.

text = """Apple reported Q4 revenue of $89.5 billion, down 1% year
over year. Services revenue hit an all-time record of $22.3
billion. iPhone revenue was $43.8 billion. The company returned
over $25 billion to shareholders during the quarter through
dividends and share repurchases."""

# Split into chunks of ~50 words with overlap
chunks = [
    "Apple reported Q4 revenue of $89.5 billion, down 1% year over year. "
    "Services revenue hit an all-time record of $22.3 billion.",

    "Services revenue hit an all-time record of $22.3 billion. "
    "iPhone revenue was $43.8 billion.",

    "iPhone revenue was $43.8 billion. The company returned over $25 "
    "billion to shareholders through dividends and share repurchases.",
]

Overlap ensures context isn’t lost at chunk boundaries.
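A minimal chunker in this spirit (word-based for simplicity; production pipelines usually count tokens instead):

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into `chunk_size`-word chunks; consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

apple = """Apple reported Q4 revenue of $89.5 billion, down 1% year
over year. Services revenue hit an all-time record of $22.3
billion. iPhone revenue was $43.8 billion. The company returned
over $25 billion to shareholders during the quarter through
dividends and share repurchases."""

for c in chunk_words(apple, chunk_size=20, overlap=5):
    print(c)
    print("---")
```

Each chunk starts 15 words after the previous one, so the last 5 words of one chunk reappear at the start of the next.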

The Full Picture: Query → Retrieve → Generate

# 1. Embed chunks and store in a vector database
chunk_vectors = model.encode(chunks)

# 2. User asks a question — embed the query with the SAME model
query = "What was Apple's services revenue?"
query_vector = model.encode([query])

# 3. Find the most similar chunk (cosine similarity)
scores = cosine_similarity(query_vector, chunk_vectors)[0]
best_chunk = chunks[scores.argmax()]
print(f"Best match (score {scores.max():.2f}):")
print(f"  {best_chunk}")

# 4. Pass the chunk + question to the LLM
prompt = f"Based on this context:\n{best_chunk}\n\nAnswer: {query}"
Best match (score 0.79):
  Apple reported Q4 revenue of $89.5 billion, down 1% year over year. Services revenue hit an all-time record of $22.3 billion.

The LLM never sees the whole document — just the most relevant chunk.

Where Have You Seen This Before?

You Already Use Embeddings

The embed → cosine similarity → retrieve pipeline isn’t new. It powers tools you use every day:

| Product | What Gets Embedded | What It Does |
|---------|--------------------|--------------|
| Google Search | Web pages and queries | Matches query meaning, not just keywords (BERT, 2019) |
| Spotify | Songs, playlists, listening history | Discover Weekly, “similar songs” |
| Netflix / YouTube | Movies, watch history | “Because you watched…” recommendations |
| iPhone Photos | Images | Search “beach sunset” with no tags |
| Amazon | Products, purchase history | “Customers who bought this also bought…” |
| LinkedIn | Resumes, job postings | Job/candidate matching beyond keyword overlap |

Same Math, Different Documents

Recommendations

  • Embed songs, movies, or products into vectors
  • Find nearest neighbors by cosine similarity
  • Return the most similar items

RAG

  • Embed document chunks into vectors
  • Find nearest neighbors by cosine similarity
  • Return the most relevant passages to the LLM

RAG applies the same math that powers recommendations and search — just to documents instead of songs or products.

RAG in Finance

Finance Applications of RAG

Document Types

  • 10-K/10-Q filings and earnings transcripts
  • Analyst reports and deal documents
  • Internal policies and memos

Use Cases

  • Compliance Q&A: query regulatory filings, internal policies
  • Due diligence: search deal documents with citations
  • Research synthesis: combine multiple sources

RAG: Strengths and Limitations

Strengths

  • No model training required (pipeline engineering is the main cost)
  • Data can be updated in real time
  • Answers are traceable to source pages

Limitations

  • Quality depends on retrieval quality
  • Chunking can split important context
  • Adds infrastructure cost (embedding API, vector DB hosting) and query latency

Why not just paste the whole document into the prompt? Context windows are growing fast (100K–1M tokens), but RAG still wins when you have many documents, need to keep costs down (long prompts are expensive), or want traceable citations from specific passages.

What Can Go Wrong?

Retrieval Failures

The LLM can only answer from what it retrieves. If retrieval fails, the answer will be wrong or fabricated.

| Problem | Example | Result |
|---------|---------|--------|
| Wrong chunk retrieved | Query about “Apple revenue” retrieves a chunk about Apple’s supply chain | Answer discusses supply chain, ignores revenue |
| Relevant info split across chunks | Revenue in one chunk, costs in another | LLM sees only half the picture |
| No relevant chunk exists | Question about a topic not in the documents | LLM may hallucinate an answer anyway |
| Too many chunks retrieved | 20 chunks passed to LLM, mostly noise | Key information buried; LLM confused |
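One cheap guardrail for the "no relevant chunk exists" failure: refuse to answer when even the best similarity score is low (the 0.3 threshold below is illustrative; tune it on your own data):

```python
def retrieve_or_refuse(scores, chunks, threshold=0.3):
    """Return the best-matching chunk, or None if nothing is relevant enough."""
    best = max(range(len(chunks)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None  # tell the user "not in the documents" instead of letting the LLM guess
    return chunks[best]

chunks = ["Revenue grew 15%.", "Net income fell 8%."]
print(retrieve_or_refuse([0.41, 0.54], chunks))   # confident match
print(retrieve_or_refuse([0.08, 0.11], chunks))   # nothing relevant: None
```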

Chunking Problems

How you split documents dramatically affects retrieval quality:

Chunks Too Small

  • “Revenue was $89.5 billion” in one chunk
  • “down 1% year over year” in another
  • Neither chunk alone tells the full story
  • Query matches the wrong fragment

Chunks Too Large

  • A 3-page chunk covers revenue, costs, risks, and guidance
  • Query about revenue retrieves the whole chunk
  • LLM gets noisy context, may pull from the wrong section
  • Embedding is a blurry average of many topics

Hallucination Despite Context

Even with good retrieval, the LLM can still go wrong:

  • Over-extrapolation: “Revenue grew 15%” → LLM infers “the company is doing well” even though the next paragraph discusses a major lawsuit
  • Confidence without evidence: LLM generates a precise-sounding answer that isn’t supported by the retrieved chunks
  • Blending sources: mixes information from different chunks in ways that create false conclusions
  • Ignoring the context: especially with long or complex chunks, the LLM may fall back on its training data instead

RAG reduces hallucination but does not eliminate it. Always verify critical claims against the source document.

Best Practices

Chunking Best Practices

Chunk Size

  • 200–500 tokens for precise retrieval
  • 500–1000 tokens for more context per chunk
  • Test on your actual queries to find the sweet spot
  • Financial filings: paragraph-level chunks often work best

Overlap

  • 10–20% overlap between consecutive chunks
  • Ensures sentences at boundaries aren’t orphaned
  • Example: 400-token chunks with 80-token overlap
  • Critical for tables and lists that span chunk boundaries

There is no universal best chunk size. The right answer depends on your documents, your questions, and your embedding model. Test and iterate.

Retrieval Best Practices

  • Retrieve more, then re-rank: pull top 10 chunks, then use a re-ranker to select the best 3–5
  • Hybrid search: combine vector similarity (semantic) with keyword search (exact match) — catches both meaning and specific terms like ticker symbols
  • Metadata filters: tag chunks with source, date, section (e.g., “Risk Factors”) and filter before similarity search
  • Query expansion: rephrase the user’s question multiple ways to improve recall
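A minimal sketch of hybrid scoring, assuming vector scores are already computed (the alpha weight and naive whitespace keyword match are illustrative; real systems typically use BM25 for the keyword side):

```python
def keyword_score(query, chunk):
    """Fraction of query words that appear verbatim in the chunk (naive tokenization)."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def hybrid_scores(query, chunks, vector_scores, alpha=0.7):
    """Blend semantic and keyword scores; alpha is the weight on the vector score."""
    return [alpha * v + (1 - alpha) * keyword_score(query, c)
            for v, c in zip(vector_scores, chunks)]

chunks = [
    "AAPL reported record Q4 services revenue.",
    "The company reported strong quarterly results.",
]
# Suppose the embedding model scores both chunks about equally...
vector_scores = [0.50, 0.52]
# ...the exact-match term "AAPL" breaks the tie in favor of the first chunk.
print(hybrid_scores("AAPL Q4 revenue", chunks, vector_scores))
```

This is exactly the ticker-symbol case: embeddings alone can miss a rare exact token that keyword matching catches immediately.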

Prompt Engineering for RAG

The prompt that wraps retrieved chunks matters (pseudocode — plug into an f-string or template):

Based ONLY on the following excerpts from {company}'s 10-K filing,
answer the question below. If the excerpts do not contain enough
information to answer, say "I don't have enough information" and
do NOT guess or use outside knowledge.

EXCERPTS: {retrieved_chunks}
QUESTION: {user_query}
Cite the specific excerpt(s) that support your answer.

The instruction to say “I don’t know” and the requirement to cite sources are critical guardrails against hallucination.
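In code, the template is just an f-string. A sketch of a helper that fills it in (function and parameter names are illustrative):

```python
def build_rag_prompt(company, retrieved_chunks, user_query):
    """Wrap retrieved chunks and the user's question in the RAG prompt template."""
    excerpts = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        f"Based ONLY on the following excerpts from {company}'s 10-K filing,\n"
        "answer the question below. If the excerpts do not contain enough\n"
        "information to answer, say \"I don't have enough information\" and\n"
        "do NOT guess or use outside knowledge.\n\n"
        f"EXCERPTS:\n{excerpts}\n\n"
        f"QUESTION: {user_query}\n"
        "Cite the specific excerpt(s) that support your answer."
    )

prompt = build_rag_prompt(
    "Apple",
    ["Services revenue hit an all-time record of $22.3 billion."],
    "What was Apple's services revenue?",
)
print(prompt)
```

Numbering the excerpts gives the LLM something concrete to cite (e.g., “per excerpt [1]”), which makes answers easier to verify.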