Speed: overnight news → trade at next-day market open, exit at close (single-day hold)
Performance: GPT-4 long-short Sharpe ratio 3.28
Forecasting ability increases with model size (GPT-4 >> GPT-3.5 >> GPT-3): financial reasoning appears to be an "emergent capability" of larger LLMs. But the Sharpe ratio declined from 6.5 to 1.2 as the sample extended from 2021 to 2024, consistent with more traders adopting the same approach.
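The Sharpe ratios quoted throughout are annualized from daily long-short portfolio returns. A minimal sketch of the computation, assuming a zero risk-free rate (the return series below is simulated for illustration, not taken from any paper):

```python
import numpy as np

def annualized_sharpe(daily_returns, trading_days=252):
    """Annualized Sharpe ratio of a daily return series.

    Assumes a zero risk-free rate, as is common for daily
    long-short (self-financing) portfolio returns.
    """
    r = np.asarray(daily_returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(trading_days)

# Illustrative only: simulated strategy earning 0.1%/day with 1% daily vol
rng = np.random.default_rng(0)
daily = rng.normal(loc=0.001, scale=0.01, size=252)
print(f"Sharpe = {annualized_sharpe(daily):.2f}")
```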
News Articles
Kirtac & Germano (2024), Finance Research Letters. 965,375 U.S. financial news articles (2010–2023).
Speed: news before 6 AM → trade at same-day open; news 6 AM–4 PM → trade at same-day close; news after 4 PM → trade at next-day open
Long top 20% sentiment, short bottom 20%, daily rebalanced
FinLlama (2024). Model: Llama 2 7B fine-tuned with LoRA (4.2M trainable parameters)
Speed: daily rebalancing based on sentiment scores
Long top 35%, short bottom 35%
Performance: Sharpe ratio 2.4, annualized return 45%
A fine-tuned 7B model outperforms FinBERT by 44.7% in cumulative returns — and runs on a single GPU at a fraction of GPT-4’s cost.
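Both news-article strategies above share the same portfolio mechanics: score each stock's news, go long the top sentiment bucket, short the bottom, and rebalance daily. A minimal pandas sketch of one day's weights (tickers and scores are made up; the 20% cutoff follows the rule above):

```python
import pandas as pd

def long_short_weights(scores: pd.Series, quantile: float = 0.20) -> pd.Series:
    """Equal-weight long the top quantile, short the bottom quantile."""
    lo, hi = scores.quantile([quantile, 1 - quantile])
    longs = scores >= hi
    shorts = scores <= lo
    w = pd.Series(0.0, index=scores.index)
    w[longs] = 1.0 / longs.sum()     # long leg sums to +1
    w[shorts] = -1.0 / shorts.sum()  # short leg sums to -1
    return w

# Illustrative sentiment scores for five tickers on one day
scores = pd.Series({"AAPL": 0.9, "MSFT": 0.4, "XOM": 0.1, "GE": -0.3, "F": -0.8})
print(long_short_weights(scores))
```

The resulting weights are dollar-neutral by construction: the long and short legs offset, so the portfolio is a pure bet on the sentiment signal.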
Social Media
Cookson, Lu, Mullins & Niessner (2024), “The Social Signal,” Journal of Financial Economics.
Sources: Twitter, StockTwits, and Seeking Alpha
Speed: social media sentiment predicts next-day returns (one-to-two-day window)
Social media sentiment positively predicts returns, while social media attention negatively predicts them; the two signals point in opposite directions
Informativeness deteriorated significantly after the Jan 2021 GameStop short squeeze, concentrated among messages by new users. Platform quality matters.
Earnings Calls and Financial Statements
Earnings Call Summaries
Kim, Muhn & Nikolaev (2024), U. Chicago Booth
GPT summarizes 10-K filings and earnings call transcripts to under one-fifth of their original length
Summary sentiment is more predictive of stock returns than original-document sentiment
Two-day event window around filing date
Financial Statement Analysis
Kim, Muhn & Nikolaev (2024), U. Chicago Booth
GPT-4 given only anonymized financial statements
Predicts earnings direction with 60% accuracy vs. 53% for human analysts
Long-short alpha > 12% per year
These are the slowest strategies — days to weeks after filings. But the alpha is large because the analysis is deep, not fast.
Multi-Source Agents
MarketSenseAI 2.0 (2025): a GPT-4o multi-agent system that combines news, SEC filings, earnings calls, price data, and macro reports.
Speed: monthly rebalancing — the slowest strategy we’ve seen, but the most comprehensive
Monthly rebalancing is the most realistic for practical implementation, and the risk-adjusted returns (Sharpe 2.76) remain competitive with far faster strategies.
Summary: Source, Speed, and Performance
| Paper | Source | Trade Delay | Sharpe |
|---|---|---|---|
| Lopez-Lira & Tang (2023) | News headlines | Next-day open | 3.28 |
| Kirtac & Germano (2024) | News articles | Same-day or next-day | 3.05 |
| FinLlama (2024) | News articles | Daily | 2.40 |
| Cookson et al. (2024) | Social media | Next day | — |
| Kim et al. (2024) | Earnings calls / 10-Ks | 2-day event window | — |
| MarketSenseAI (2025) | Multi-source | Monthly | 2.76 |
The highest Sharpe ratios require the fastest execution and highest turnover. As you move from headlines to deeper analysis, speed decreases but the alpha persists longer.
Alpha Decay
As more firms adopt LLM-based analysis, alpha from news sentiment decays faster:
Lopez-Lira & Tang document the decline: Sharpe 6.5 → 1.2 as their sample extends from 2021 to 2024
More sophisticated signals (tone analysis, cross-sector propagation, multi-source) retain more alpha
The arms race: better NLP → alpha captured → faster decay → need even better NLP
The moat is shifting from speed of access to depth of understanding.
Sentiment Classification Methods
Bag of Words
The earliest approach: count words from predefined positive/negative lists.
Notice the asymmetry: 7× more negative words than positive. Financial disclosures are legally required to discuss risks, so negative language is far more varied.
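That counting approach fits in a few lines. A toy sketch with a handful of illustrative words (the real Loughran-McDonald dictionaries contain thousands of terms):

```python
# Tiny illustrative word lists; real dictionaries (e.g. Loughran-McDonald)
# contain hundreds of positive and thousands of negative terms.
POSITIVE = {"strong", "beat", "beats", "surge", "growth", "record"}
NEGATIVE = {"loss", "losses", "weak", "decline", "miss", "impairment"}

def bow_sentiment(text: str) -> str:
    """Classify by counting positive vs. negative word hits."""
    words = [w.strip(".,") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(bow_sentiment("Apple revenue beats estimates, shares surge 5%"))  # positive
print(bow_sentiment("The company did not report a loss"))               # negative (!)
```

The second example shows the approach's blind spot: "did not report a loss" counts one negative word and misses the negation entirely, exactly the failure mode that contextual models like FinBERT address.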
How BERT Learns Language
BERT (Bidirectional Encoder Representations from Transformers) is trained in two stages:
Stage 1: Pre-Training
Masked language model: hide 15% of words, predict them from context
“The company reported strong [MASK]” → earnings (not losses)
Learns grammar, meaning, and relationships from millions of sentences
No human labels needed
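The masking step in Stage 1 can be sketched in plain Python. This only produces the masked input and the hidden targets; filling the [MASK] slots back in is what the network learns during pre-training:

```python
import random

def mask_tokens(sentence: str, mask_rate: float = 0.15, seed: int = 0):
    """Hide roughly `mask_rate` of the tokens, returning the masked
    sentence and the positions/words the model must predict."""
    rng = random.Random(seed)
    tokens = sentence.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return " ".join(masked), targets

sentence = "The company reported strong earnings growth this quarter"
masked, targets = mask_tokens(sentence, seed=1)
print(masked)   # the model's input, with some tokens hidden
print(targets)  # the words it must reconstruct from context
```

Real BERT masks subword tokens rather than whole words and mixes in random-token and keep-as-is replacements, but the training signal is the same: predict the hidden pieces from their context.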
Stage 2: Fine-Tuning
Add a classification layer on top of the pre-trained model
Only a few thousand examples needed (Financial PhraseBank: 4,840 sentences)
The pre-trained knowledge transfers to the new task
The key insight: BERT reads both directions simultaneously. “Bank” means something different in “river bank” vs. “bank earnings” — BERT uses the full context to decide.
FinBERT: BERT for Finance
FinBERT (Araci, 2019) = BERT pre-trained on financial text, then fine-tuned for 3-class sentiment.
110M parameters
Fine-tuned on Financial PhraseBank (4,840 labeled sentences)
~87% accuracy on financial sentiment (14 pp above vanilla BERT)
Inference: ~5 ms per sentence on GPU
FinBERT became the standard baseline for financial NLP. It’s fast enough for real-time trading and accurate enough for most headline classification. Open source: ProsusAI/finbert on HuggingFace.
How LLMs Classify Sentiment
Large language models (GPT-4, Claude, Gemini) classify sentiment with no task-specific training — just a well-crafted prompt:
“Classify the sentiment of this financial headline as positive, negative, or neutral. Explain your reasoning.”
Why this works:
Pre-trained on trillions of tokens including financial text
Understands nuance: “despite strong revenue, guidance was weak” → negative
Handles sarcasm, hedging, implicit sentiment
Can classify sentiment, extract events, and summarize in a single prompt
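A sketch of that single-prompt workflow. The prompt wording, the JSON schema, and the canned reply below are illustrative assumptions, not any specific vendor's API:

```python
import json

# Illustrative prompt template: classify, extract, and summarize in one call
PROMPT_TEMPLATE = """Classify the sentiment of this financial headline as
positive, negative, or neutral, extract any named company, and give a
one-line summary. Respond as JSON with keys "sentiment", "company", "summary".

Headline: {headline}"""

def build_prompt(headline: str) -> str:
    return PROMPT_TEMPLATE.format(headline=headline)

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply; real code should handle malformed output."""
    return json.loads(raw)

# A canned reply standing in for a real API response
reply = ('{"sentiment": "negative", "company": "Acme", '
         '"summary": "Weak guidance despite strong revenue"}')
print(parse_response(reply)["sentiment"])
```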
The tradeoff: 500 ms – 5 s per API call vs. 5 ms for FinBERT. For real-time trading, use FinBERT for high-confidence cases and route low-confidence cases to an LLM.
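That routing is a simple confidence gate. In this sketch, `finbert_scores` has the shape returned by the HuggingFace text-classification pipeline with `top_k=None`, and `llm_classify` is a hypothetical callable wrapping whatever LLM API you use:

```python
def classify_with_routing(headline, finbert_scores, threshold=0.85,
                          llm_classify=None):
    """Use FinBERT's label when its top-class probability clears the
    threshold; otherwise fall back to the slower, smarter LLM.

    finbert_scores: list of {"label": ..., "score": ...} dicts.
    llm_classify: hypothetical callable wrapping an LLM API.
    Returns (label, which_model_answered).
    """
    best = max(finbert_scores, key=lambda s: s["score"])
    if best["score"] >= threshold or llm_classify is None:
        return best["label"], "finbert"
    return llm_classify(headline), "llm"

# High confidence: FinBERT's answer is used directly
clear = [{"label": "positive", "score": 0.97},
         {"label": "neutral", "score": 0.02},
         {"label": "negative", "score": 0.01}]
print(classify_with_routing("Apple revenue beats estimates", clear))

# Low confidence: routed to the LLM fallback (a stub here)
ambiguous = [{"label": "positive", "score": 0.40},
             {"label": "neutral", "score": 0.35},
             {"label": "negative", "score": 0.25}]
print(classify_with_routing("Shares fell less than feared", ambiguous,
                            llm_classify=lambda h: "negative"))
```

The threshold trades cost for accuracy: with a high cutoff, only genuinely ambiguous headlines pay the 500 ms - 5 s LLM latency.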
Try It: FinBERT
import warnings
warnings.filterwarnings("ignore")
import logging
logging.disable(logging.WARNING)

from transformers import pipeline

pipe = pipeline("text-classification", model="ProsusAI/finbert", top_k=None)

headlines = [
    "Apple revenue beats estimates, shares surge 5%",
    "Despite strong revenue, the company warned about supply chain disruptions",
    "The company did not report a loss",
    "Shares fell less than feared after earnings miss",
]

results = pipe(headlines)
for headline, scores in zip(headlines, results):
    best = max(scores, key=lambda x: x["score"])
    print(f"{best['label']:>8s} ({best['score']:.0%}) {headline}")
positive (92%) Apple revenue beats estimates, shares surge 5%
negative (97%) Despite strong revenue, the company warned about supply chain disruptions
neutral (89%) The company did not report a loss
negative (97%) Shares fell less than feared after earnings miss
Try It Yourself
Try it locally: Ask Claude to use FinBERT to determine the sentiment of “Despite record revenue, management issued cautious guidance citing tariff uncertainty.”
Try it on Colab: Compare bag of words, Loughran-McDonald, FinBERT, and Gemma (via OpenRouter) on a variety of sentences — including negation, hedging, and finance-specific language.