Module 11: Sentiment Analysis

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

Why News Moves Markets

Asset prices reflect expectations. News changes expectations. The trader who understands news faster and more accurately captures the profit.

  • Earnings surprises, Fed announcements, M&A rumors, FDA decisions, geopolitical shocks
  • Quantitative firms began automating this with simple keyword rules in the 2000s
  • LLMs now understand nuance, context, and implication — a qualitative leap

The Evolution of Text-Based Trading

Era | Method | Capability
Pre-2010 | Keyword matching | Count positive/negative words
2011 | Loughran-McDonald dictionary | Finance-specific word lists
2018–19 | BERT / FinBERT | Contextual understanding of sentences
2023 | GPT-3.5 / GPT-4 | Zero-shot reasoning about market impact
2024–26 | Fine-tuned SLMs + agents | Domain-optimized, multi-step analysis
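
To make the dictionary era concrete, here is a minimal sketch of word-count sentiment. The two word lists are tiny illustrative stand-ins (an assumption for this example), not the actual Loughran-McDonald lists.

```python
import re

# Illustrative word lists only; the real Loughran-McDonald lists are far larger.
POSITIVE = {"beat", "strong", "growth", "pleased", "record"}
NEGATIVE = {"miss", "weak", "loss", "headwinds", "decline"}


def dictionary_sentiment(text: str) -> float:
    """Return (pos - neg) / (pos + neg) in [-1, 1], or 0.0 if no listed word appears."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# One positive word and one negative word cancel out, so the score is 0.0.
print(dictionary_sentiment("Despite strong revenue, guidance was weak"))
```

A word counter has no notion of which clause carries the message, which is the gap the later, context-aware models are meant to close.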

The Models

FinBERT: The First Financial Sentiment Model

FinBERT (Araci, 2019) = BERT pre-trained on financial text and fine-tuned for 3-class sentiment (positive / negative / neutral).

Architecture

  • 110M parameters (BERT-base)
  • Fine-tuned on Financial PhraseBank (4,840 labeled sentences)
  • Open source: ProsusAI/finbert on HuggingFace

Performance

  • ~87% accuracy on Financial PhraseBank
  • 14-percentage-point improvement over vanilla BERT
  • Became the standard baseline for financial NLP

BloombergGPT and FinGPT

BloombergGPT (2023)

  • 50B parameters; ~$10M training cost
  • Trained on Bloomberg’s FinPile (363B tokens) + 345B tokens of general text
  • Proprietary — weights not released

FinGPT (2023)

  • Open-source (AI4Finance Foundation)
  • Fine-tunes Llama, Falcon, etc. with LoRA
  • Training cost: under $300

BloombergGPT proved the concept; FinGPT democratized it.

GPT-4 and General-Purpose LLMs

General-purpose LLMs can classify financial sentiment with zero training data — just a well-crafted prompt.

Strengths

  • Understands: “despite strong revenue, guidance was weak” → negative
  • Handles sarcasm, hedging, implicit sentiment
  • Sentiment + event classification + summarization in one model

Weaknesses

  • Latency: 500ms–5s per API call
  • Cost: $2.50–$10 per million input tokens (varies by model)
  • Data leaves your infrastructure
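
The cost range above is easier to judge with a back-of-the-envelope calculation. The sketch below assumes a hypothetical workload of 100,000 headlines per day at about 60 input tokens each (prompt included); both numbers are made up for illustration, and output tokens are ignored.

```python
def api_cost_usd(n_headlines: int, tokens_per_headline: int, usd_per_million_tokens: float) -> float:
    """Input-token cost only; output tokens are billed separately and ignored here."""
    return n_headlines * tokens_per_headline * usd_per_million_tokens / 1_000_000


# Hypothetical workload: 100,000 headlines/day, ~60 input tokens each.
for price in (2.50, 10.00):
    cost = api_cost_usd(100_000, 60, price)
    print(f"${price:.2f} per million tokens -> ${cost:.2f} per day")
```

At this scale the dollar cost is small; latency and data governance are more likely to be the binding constraints.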

Fine-Tuning Concepts

  • Fine-tuning: Take a pre-trained model, continue training on domain-specific data
  • LoRA (Low-Rank Adaptation): Update only a small fraction of model weights — dramatically reduces cost and GPU requirements
  • Small language models (1–10B parameters): privacy (data stays local), speed (5–10ms inference), specialization
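
A short calculation shows why LoRA touches so few weights. The configuration below (rank 8 adapters on the query and value projections of a 32-layer model with hidden size 4096) is an assumption chosen for illustration, not a setup taken from any of the papers in this module.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """LoRA learns an update B @ A for a d_out x d_in weight matrix,
    with A of shape (rank, d_in) and B of shape (d_out, rank),
    so each adapted matrix adds rank * (d_in + d_out) trainable parameters."""
    return n_matrices * rank * (d_in + d_out)


# Assumed setup: rank 8, two 4096 x 4096 projections per layer, 32 layers.
trainable = lora_trainable_params(d_in=4096, d_out=4096, rank=8, n_matrices=2 * 32)
print(trainable)  # 4194304
print(f"{trainable / 7e9:.4%} of a 7B-parameter model")
```

Under these assumptions roughly 4.2M parameters are trained, well under 0.1% of a 7B model, which is why fine-tuning fits on modest hardware.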

Why this matters for trading: FinBERT processes headlines in milliseconds; GPT-4 takes seconds.

What the Research Shows

Can ChatGPT Forecast Stock Prices?

Lopez-Lira & Tang (2023). Accepted at the Journal of Finance.

  • 67,586 headlines for 4,138 companies (Oct 2021–Dec 2022)
  • Long-short strategies: overnight Sharpe ratio 2.97, intraday Sharpe ratio 2.63
  • Forecasting ability increases with model size — financial reasoning is an “emerging capability” of larger LLMs
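
For readers new to the metrics, here is a minimal sketch of a long-short return and an annualized Sharpe ratio. It assumes equal weighting, a zero risk-free rate, and 252 trading days per year; the scores and returns are made up and are not data from the paper.

```python
import math
import statistics


def long_short_return(scores: dict[str, float], next_returns: dict[str, float]) -> float:
    """Equal-weight long the positive-score stocks and short the negative-score stocks."""
    longs = [next_returns[t] for t, s in scores.items() if s > 0]
    shorts = [next_returns[t] for t, s in scores.items() if s < 0]
    long_leg = statistics.mean(longs) if longs else 0.0
    short_leg = statistics.mean(shorts) if shorts else 0.0
    return long_leg - short_leg


def annualized_sharpe(daily_returns: list[float], periods_per_year: int = 252) -> float:
    """Mean over sample standard deviation, scaled by sqrt(periods per year)."""
    mean = statistics.mean(daily_returns)
    return mean / statistics.stdev(daily_returns) * math.sqrt(periods_per_year)


# Made-up example: LLM scores for three tickers and their next-period returns.
scores = {"AAA": 1.0, "BBB": -1.0, "CCC": 0.0}
returns = {"AAA": 0.02, "BBB": -0.01, "CCC": 0.004}
print(long_short_return(scores, returns))
print(annualized_sharpe([0.004, -0.002, 0.003, 0.001, -0.001]))
```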

LLMs vs. Traditional Sentiment Analysis

Kirtac & Germano (2024), Finance Research Letters. 965,375 U.S. financial news articles.

Method | Sentiment Accuracy | Long-Short Sharpe
Loughran-McDonald dictionary | 50.1% | 1.23
FinBERT | 72.2% | 2.07
OPT (Meta’s open GPT-3-style model) | 74.4% | 3.05

Traditional bag-of-words methods are now effectively obsolete for this task.

More Key Results

GPT-4 vs. Human Analysts

  • Kim, Muhn & Nikolaev (2024), Chicago Booth
  • GPT-4 given only anonymized financial statements
  • Outperforms the median human analyst at predicting earnings direction

Fine-Tuned Small Models

  • FinLlama (2024, Imperial College)
  • Llama 2 7B fine-tuned with LoRA (4.2M trainable params)
  • Outperforms FinBERT by 44.7% in cumulative returns

Multi-Agent Trading Systems

MarketSenseAI 2.0 (2025): GPT-4 + RAG multi-agent system.

  • 125.9% cumulative returns vs. 73.5% for S&P 100

How It Works: Implementation

The News Trading Pipeline

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 120, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  A["<b>Data<br>Ingestion</b>"] --> B["<b>Entity<br>Resolution</b>"]
  B --> C["<b>Sentiment<br>Extraction</b>"]
  C --> D["<b>Signal<br>Construction</b>"]
  D --> E["<b>Trade<br>Execution</b>"]

  style A fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style B fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style C fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style D fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style E fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px

  • Data ingestion: Reuters, Bloomberg, SEC filings
  • Entity resolution: “Apple” → AAPL
  • Sentiment extraction: The LLM’s core job
  • Signal construction: Aggregate, weight, normalize
  • Trade execution: Route orders
  • An LLM can take over steps 2 and 3 (entity resolution and sentiment extraction)
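
The entity-resolution and signal-construction steps can be sketched in a few lines. The alias table is hypothetical (a production system would rely on a maintained security master), and averaging scores per ticker is just one simple aggregation choice.

```python
import re
from collections import defaultdict

# Hypothetical alias table mapping company names to tickers.
ALIASES = {"apple": "AAPL", "microsoft": "MSFT", "tesla": "TSLA"}


def resolve_entities(headline: str) -> list[str]:
    """Map company names mentioned as whole words in the headline to tickers."""
    text = headline.lower()
    return [
        ticker
        for name, ticker in ALIASES.items()
        if re.search(rf"\b{re.escape(name)}\b", text)
    ]


def build_signal(scored: list[tuple[str, float]]) -> dict[str, float]:
    """Average the sentiment scores of all headlines that mention each ticker."""
    by_ticker = defaultdict(list)
    for headline, score in scored:
        for ticker in resolve_entities(headline):
            by_ticker[ticker].append(score)
    return {ticker: sum(v) / len(v) for ticker, v in by_ticker.items()}


print(build_signal([("Apple beats estimates", 1.0), ("Apple faces lawsuit", -0.5)]))
```

The whole-word match keeps "Pineapple" from resolving to AAPL, but a lookup table still cannot tell Apple the company from apple the fruit, which is one reason this step is often handed to a language model.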

What the LLM Extracts

Sentiment is more than positive/negative. A production system classifies along multiple dimensions:

Dimension | Question | Example
Polarity | Positive, negative, or neutral? | “Revenue beat estimates” → positive
Magnitude | How strong? | “Slight miss” vs. “catastrophic failure”
Relevance | Is this market-moving? | Routine board meeting → low

LLMs handle all dimensions in a single prompt; traditional methods handle only polarity.
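
One way to request all three dimensions at once is to ask for a JSON reply and validate it before use. This is a sketch only: the prompt wording, the 1 to 5 magnitude scale, and the low/medium/high relevance labels are assumptions for illustration, and no particular LLM API is called.

```python
import json

PROMPT_TEMPLATE = (
    "Classify the headline below. Reply with JSON only, using the keys "
    '"polarity" (positive, negative, or neutral), "magnitude" (1 to 5), '
    'and "relevance" (low, medium, or high).\n\nHeadline: {headline}'
)


def build_prompt(headline: str) -> str:
    """Fill the headline into the classification prompt."""
    return PROMPT_TEMPLATE.format(headline=headline)


def parse_reply(reply: str) -> dict:
    """Validate the model's JSON reply; raise ValueError on anything unexpected."""
    data = json.loads(reply)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("polarity") not in {"positive", "negative", "neutral"}:
        raise ValueError("bad polarity")
    if data.get("magnitude") not in (1, 2, 3, 4, 5):
        raise ValueError("bad magnitude")
    if data.get("relevance") not in {"low", "medium", "high"}:
        raise ValueError("bad relevance")
    return data


print(build_prompt("Revenue beat estimates"))
print(parse_reply('{"polarity": "positive", "magnitude": 2, "relevance": "high"}'))
```

Validating the reply matters because a downstream signal should never be built from a malformed or off-schema answer.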

The Cascade Architecture

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'padding': 16, 'useMaxWidth': true}}}%%
flowchart LR
  A["<b>All Headlines</b>"] --> B["<b>FinBERT</b><br><i>~5ms</i>"]
  B -->|"high conf."| C["<b>Immediate Signal</b>"]
  B -->|"low conf."| D["<b>GPT-4</b><br><i>~1-5s</i>"]
  D --> E["<b>Enhanced Signal</b>"]

  style A fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:22px,padding:14px
  style B fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:22px,padding:14px
  style C fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:22px,padding:14px
  style D fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:22px,padding:14px
  style E fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:22px,padding:14px

Fine-Tuned Small Model

  • FinBERT, FinLlama, custom BERT
  • Inference: 5–10ms on GPU
  • Deterministic; runs on your infra

General-Purpose LLM

  • GPT-4, Claude, Gemini
  • Inference: 500ms–5s via API
  • Zero labeled data needed

Best Practice: Fast path (~5ms) for high-confidence; slow path (~1–5s) for low-confidence; batch reprocessing end-of-day.
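
The routing logic of the cascade is small enough to sketch directly. Both models are passed in as plain callables returning a label and a confidence; the 0.9 threshold and the stand-in models in the usage lines are assumptions for illustration.

```python
from typing import Callable

# Each model is a callable returning (label, confidence).
Model = Callable[[str], tuple[str, float]]


def cascade(headline: str, fast: Model, slow: Model, threshold: float = 0.9) -> tuple[str, str]:
    """Return (label, path): use the fast model when it is confident, else escalate."""
    label, confidence = fast(headline)
    if confidence >= threshold:
        return label, "fast"
    label, _ = slow(headline)
    return label, "slow"


# Stand-in models for demonstration only.
def toy_fast(headline: str) -> tuple[str, float]:
    return ("positive", 0.97) if "beat" in headline.lower() else ("neutral", 0.40)


def toy_slow(headline: str) -> tuple[str, float]:
    return ("negative", 1.0)


print(cascade("Revenue beat estimates", toy_fast, toy_slow))
print(cascade("Guidance mixed despite strong revenue", toy_fast, toy_slow))
```

The threshold is the main tuning knob: raising it sends more headlines down the slow path, trading latency and cost for accuracy.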

Latency and Alpha Decay

Alpha decay = the rate at which a signal’s profitability diminishes as the market incorporates the information.

Strategy | Latency Budget | Model
HFT / market making | <1ms | Keyword lookup
Low-latency systematic | 1–100ms | FinBERT
Event-driven | 100ms–5s | FinBERT + LLM
Daily systematic | Minutes–hours | Full LLM pipeline
Fundamental | Hours–days | Deep LLM analysis

You don’t need to be the fastest — you need to be fast enough for the alpha you’re targeting.
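
"Fast enough" can be quantified with a simple decay model. The sketch below assumes the edge decays exponentially with a fixed half-life, which is a modeling assumption rather than anything the slides establish, and the 2-second half-life is made up.

```python
def remaining_alpha(latency_s: float, half_life_s: float) -> float:
    """Fraction of the initial edge left after `latency_s`, assuming exponential decay."""
    return 0.5 ** (latency_s / half_life_s)


# Hypothetical signal whose edge halves every 2 seconds.
for name, latency in [("FinBERT", 0.005), ("LLM API call", 2.0)]:
    print(f"{name}: {remaining_alpha(latency, half_life_s=2.0):.1%} of the edge remains")
```

With a half-life measured in hours instead of seconds, the same calculation leaves both models with nearly all of the edge, which is the point of matching model latency to the strategy.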

Event-Driven Strategies

Types of News Events

Scheduled Events

  • Earnings: Compare actuals to consensus; analyze management tone
  • Fed / central bank: Single word changes carry enormous implications
  • Economic data: CPI, jobs reports, PMI

Unscheduled Events

  • M&A: Assess deal likelihood and regulatory risk
  • FDA decisions: Binary, high-impact
  • Geopolitical: Sanctions, trade policy, elections

The tone of an earnings call often matters more for returns than the reported numbers.

Earnings Call Analysis

The Prompt

“Rate management’s tone from -5 (very bearish) to +5 (very bullish). Explain your reasoning. Identify forward-looking statements that differ from consensus.”

  • The LLM processes the entire transcript (8,000–15,000 tokens)
  • Detects hedging: “We’re pleased with results but expect headwinds in Q4” → net negative
  • Traditional keyword methods would flag “pleased” as positive

This is where LLMs provide the greatest edge over traditional NLP.
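
To use the tone rating as a signal, the reply has to be turned into a number. The sketch below adds one instruction to the prompt (start the reply with "Score: <integer>"), which is an assumption made here so the score is easy to extract; sending the prompt to a model is left out.

```python
import re

TONE_PROMPT = (
    "Rate management's tone from -5 (very bearish) to +5 (very bullish). "
    "Start your reply with 'Score: <integer>'. Explain your reasoning. "
    "Identify forward-looking statements that differ from consensus.\n\n"
    "Transcript:\n{transcript}"
)


def parse_tone(reply: str) -> int:
    """Extract the tone score from a reply and check that it lies in [-5, 5]."""
    match = re.search(r"Score:\s*([+-]?\d+)", reply)
    if match is None:
        raise ValueError("no score found")
    score = int(match.group(1))
    if not -5 <= score <= 5:
        raise ValueError(f"score {score} outside [-5, 5]")
    return score


prompt = TONE_PROMPT.format(transcript="We're pleased with results but expect headwinds in Q4.")
print(parse_tone("Score: -2\nManagement hedges on Q4 despite a positive opening."))
```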

Beyond Company Sentiment: Geopolitical Risk

The Propagation Problem

Most LLM trading research focuses on company-level sentiment. The real competitive advantage: understanding how macro events propagate across sectors.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 100, 'padding': 24, 'useMaxWidth': true}}}%%
flowchart LR
  A["<b>Conflict</b>"] --> B["<b>Oil<br>prices ↑</b>"]
  B --> C["<b>Energy<br>stocks ↑</b>"]
  B --> D["<b>Fertilizer<br>costs ↑</b>"]
  D --> E["<b>Food<br>prices ↑</b>"]
  E --> F["<b>Staples<br>margins ↓</b>"]

  style A fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style B fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style C fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style D fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style E fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px
  style F fill:#eff6ff,stroke:#0f172a,stroke-width:2px,color:#0f172a,font-size:28px,padding:20px

  • “Russia invades Ukraine” — which sectors benefit? Which suffer?
  • Multi-hop causal reasoning:
    • 1st order: Oil prices rise → energy stocks up
    • 2nd order: Fertilizer costs rise → agriculture costs up
    • 3rd order: Food prices rise → consumer staples margins squeezed
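
The multi-hop chain can be represented as a signed graph and walked breadth-first. The edges below simply encode the example in the diagram; the sign convention (+1 for "moves the same way", -1 for "moves the opposite way") is an assumption for this sketch.

```python
from collections import deque

# Edges from the example: (target, sign). +1 = same direction, -1 = opposite.
EDGES = {
    "oil prices": [("energy stocks", +1), ("fertilizer costs", +1)],
    "fertilizer costs": [("food prices", +1)],
    "food prices": [("staples margins", -1)],
}


def propagate(source: str, shock: int = +1) -> dict[str, tuple[int, int]]:
    """Breadth-first walk returning {node: (direction, hops)} for every reachable node."""
    effects = {source: (shock, 0)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        direction, hops = effects[node]
        for neighbor, sign in EDGES.get(node, []):
            if neighbor not in effects:
                effects[neighbor] = (direction * sign, hops + 1)
                queue.append(neighbor)
    return effects


for node, (direction, hops) in propagate("oil prices").items():
    print(f"{node}: {'up' if direction > 0 else 'down'} ({hops} hops)")
```

The hard part in practice is not the traversal but obtaining credible edges and signs, which is where an LLM's reasoning over text comes in.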

Measuring Geopolitical Risk

Caldara-Iacoviello GPR Index

  • Published in AER (2022)
  • Counts articles in 10 newspapers across 8 categories
  • Available for 44 countries since 1900
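
A text-count index of this kind can be sketched as a share of matching articles, rescaled so the average equals 100. The rescaling over the whole sample and the monthly counts below are simplifying assumptions for illustration; the published index uses its own search terms and base period.

```python
def text_count_index(matching: list[int], total: list[int]) -> list[float]:
    """Share of matching articles per period, rescaled so the sample mean is 100."""
    shares = [m / t for m, t in zip(matching, total)]
    mean_share = sum(shares) / len(shares)
    return [100 * s / mean_share for s in shares]


# Made-up monthly counts: articles matching geopolitical terms vs. all articles.
print(text_count_index(matching=[120, 150, 480], total=[10_000, 10_000, 12_000]))
```

Using shares rather than raw counts keeps the index comparable when the volume of published articles changes over time.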

BlackRock BGRI

  • Neural network NLP on brokerage reports + news
  • For each risk, identifies the 3 most sensitive assets
  • The industry gold standard

Knowledge Graphs: Mapping Shock Propagation

A knowledge graph maps entities (firms, sectors, commodities) and their relationships. Combined with an LLM, it enables multi-hop reasoning about how shocks propagate.

FinDKG (Li et al., 2024)

  • Fine-tuned LLM builds a dynamic knowledge graph from financial news
  • Outperforms thematic ETFs at identifying sector themes
  • Open source

Supply Chain Mapping

  • LLMs extract multi-tier supplier networks from text
  • Map 3,000+ firms and 11,000+ supply links
  • Predict how disruptions cascade through supply chains
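
Once supplier links are extracted, finding who is exposed to a disruption is a tier-limited graph search. The firm names and links below are hypothetical, standing in for what an LLM might extract from filings and news.

```python
# Hypothetical (supplier, customer) links.
LINKS = [
    ("ChipCo", "BoardCo"),
    ("BoardCo", "PhoneCo"),
    ("BoardCo", "CarCo"),
    ("CarCo", "RentalCo"),
]


def exposed_customers(disrupted: str, max_tiers: int) -> dict[str, int]:
    """Return {firm: tier} for customers within `max_tiers` links of the disrupted supplier."""
    customers_of: dict[str, list[str]] = {}
    for supplier, customer in LINKS:
        customers_of.setdefault(supplier, []).append(customer)

    tiers: dict[str, int] = {}
    frontier = [disrupted]
    for tier in range(1, max_tiers + 1):
        next_frontier = []
        for firm in frontier:
            for customer in customers_of.get(firm, []):
                if customer != disrupted and customer not in tiers:
                    tiers[customer] = tier
                    next_frontier.append(customer)
        frontier = next_frontier
    return tiers


print(exposed_customers("ChipCo", max_tiers=2))
```

The tier number is a rough proxy for how quickly and how strongly a firm is likely to feel the disruption.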

Industry Adoption and Risks

Who Is Doing This?

Hedge Funds

  • Bridgewater: $2B AI fund using OpenAI, Anthropic, Perplexity
  • Two Sigma: NLP for 10+ years
  • Numerai: Crowdsourced AI fund, $550M AUM

Platforms

  • Bloomberg: BloombergGPT, terminal-integrated NLP
  • RavenPack: Structured sentiment signals
  • Kensho: Acquired by S&P Global for $550M

Permutable AI: LLM-based trading live Oct 2024 — 20.6% return, Sharpe ratio 2.85.

Cautionary Tales

Event | What Happened | Impact
AP Twitter Hack (2013) | Fake “White House explosions” tweet | S&P 500 lost $136B in seconds
Alpha Arena | ChatGPT trading competition | 63% loss
IMF Warning | AI trading increases volatility | Regulatory scrutiny

The competitive edge is real, but so are the risks.

The Crowding Problem

As more firms adopt the same LLM-based analysis, alpha from news sentiment decays faster.

  • Lopez-Lira & Tang document declining returns as LLM adoption rises — consistent with the Efficient Market Hypothesis
  • More sophisticated signals (tone analysis, cross-sector propagation) retain more alpha
  • The arms race: better NLP → alpha captured → faster decay → need even better NLP

The moat is shifting from speed of access to depth of understanding.

Exercises

Exercise 1: Headline Sentiment Classification

  1. Collect 20 recent financial headlines and have Claude classify sentiment, magnitude, and relevance for each.
  2. Compare classifications to actual stock price movements over the following day and calculate accuracy.
  3. Submit: spreadsheet with headlines, classifications, prices, and accuracy.
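
For step 2, one possible way to score accuracy is to convert each next-day return into a realized label and count matches. The ±0.5% band that defines a "neutral" move is an assumption; choose and report your own threshold.

```python
def directional_accuracy(
    predictions: list[str], next_day_returns: list[float], neutral_band: float = 0.005
) -> float:
    """Share of headlines whose predicted polarity matches the realized next-day move."""

    def realized(ret: float) -> str:
        if ret > neutral_band:
            return "positive"
        if ret < -neutral_band:
            return "negative"
        return "neutral"

    hits = sum(p == realized(r) for p, r in zip(predictions, next_day_returns))
    return hits / len(predictions)


# Made-up example with four headlines.
preds = ["positive", "negative", "neutral", "positive"]
rets = [0.02, -0.01, 0.001, -0.03]
print(directional_accuracy(preds, rets))  # 0.75
```

With only 20 headlines the estimate is noisy, so treat the accuracy figure as a discussion point rather than evidence of skill.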

Exercise 2: Earnings Call Analysis

  1. Download an earnings call transcript (e.g., from Seeking Alpha)
  2. Ask Claude to:
    • Rate management tone (-5 to +5)
    • Identify forward-looking statements that differ from consensus
    • Flag hedging language
  3. Compare to the stock’s post-earnings move
  4. Submit: transcript excerpt + Claude’s analysis + stock movement + reflection

Exercise 3: FinBERT Comparison (Bonus)

  1. Run the same 20 headlines through FinBERT (via HuggingFace)
  2. Compare FinBERT’s classifications to Claude’s
  3. Which handles nuance better? Where does each fail?
  4. Submit: side-by-side comparison + analysis
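
A possible starting point for step 1, assuming the `transformers` and `torch` packages are installed and the ProsusAI/finbert model can be downloaded. The exact shape of the pipeline's return value can differ across `transformers` versions, so inspect the raw output before relying on the parsing here; the signed-score helper is a convenience added for this sketch.

```python
def signed_score(probs: dict[str, float]) -> float:
    """Collapse 3-class probabilities into one number in [-1, 1]."""
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


def finbert_scores(headlines: list[str]) -> list[float]:
    """Score headlines with ProsusAI/finbert (downloads the model on first use)."""
    # Imported lazily so signed_score works even without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model="ProsusAI/finbert")
    # top_k=None asks for the scores of all three labels for each headline.
    results = classifier(headlines, top_k=None)
    return [
        signed_score({item["label"]: item["score"] for item in per_headline})
        for per_headline in results
    ]


# Works without the model: a headline FinBERT rates 70% positive, 10% negative.
print(signed_score({"positive": 0.7, "negative": 0.1, "neutral": 0.2}))
```

Calling `finbert_scores(["Revenue beat estimates", "Guidance was weak"])` gives one score per headline, which can be placed next to Claude's classifications for the comparison.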

Summary

What We Know

  • LLMs dramatically outperform dictionaries (50% → 74% accuracy)
  • GPT-4 can match or beat human analysts
  • Fine-tuned small models compete at a fraction of cost

What to Watch

  • Alpha decay as adoption increases
  • Adversarial attacks and manipulation
  • Crowding risk: correlated AI trades → systemic instability

The value of LLMs in trading is not just speed — it is depth of understanding.

Key References

  • Lopez-Lira & Tang (2023). Can ChatGPT Forecast Stock Price Movements? Journal of Finance.
  • Kirtac & Germano (2024). Sentiment Trading with Large Language Models. Finance Research Letters.
  • Kim, Muhn & Nikolaev (2024). Financial Statement Analysis with Large Language Models.
  • Araci (2019). FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models.
  • Wu et al. (2023). BloombergGPT: A Large Language Model for Finance.
  • Yang et al. (2023). FinGPT: Open-Source Financial Large Language Models.
  • FinLlama (2024). Financial Sentiment Analysis for Algorithmic Trading. ICAIF ’24.
  • Caldara & Iacoviello (2022). Measuring Geopolitical Risk. AER, 112(4).
  • Li et al. (2024). FinDKG: Dynamic Knowledge Graphs with LLMs. ICAIF ’24.