Module 14: Sentiment Analysis

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

Trading on Sentiment

Sentiment of What?

Traders extract sentiment from many text sources. Each differs in volume, speed, and signal quality:

Source | Volume | Speed | Signal
News headlines | High | Seconds–minutes | Strong, well-studied
Full news articles | High | Minutes–hours | Richer context
Social media (Twitter, StockTwits) | Very high | Real-time | Noisy but fast
Earnings call transcripts | Low (quarterly) | Hours–days | High quality
Financial statements (10-K, 10-Q) | Low (quarterly) | Days–weeks | Deep fundamentals

The academic evidence now covers all of these.

News Headlines

Lopez-Lira & Tang (2023), accepted at the Journal of Finance.

  • GPT-3.5 / GPT-4 classify 67,586 headlines for 4,123 stocks (Oct 2021–Dec 2023)
  • Speed: overnight news → trade at next-day market open, exit at close (single-day hold)
  • Performance: GPT-4 long-short Sharpe ratio 3.28

Forecasting ability increases with model size (GPT-4 > GPT-3.5 > GPT-3), suggesting that financial reasoning is an “emergent capability” of larger LLMs. But the long-short Sharpe ratio declined from 6.5 to 1.2 as the sample was extended from 2021 to 2024, consistent with more traders adopting the same approach.

News Articles

Kirtac & Germano (2024), Finance Research Letters. 965,375 U.S. financial news articles (2010–2023).

  • Speed: news before 6 AM → trade at same-day open; news 6 AM–4 PM → trade at same-day close; news after 4 PM → trade at next-day open
  • Long top 20% sentiment, short bottom 20%, daily rebalanced
Method | Accuracy | Long-Short Sharpe
Loughran-McDonald dictionary | 50.1% | 1.23
FinBERT | 72.2% | 2.07
OPT (GPT-3 family) | 74.4% | 3.05

News Articles: Fine-Tuned Small Models

FinLlama (2024), ICAIF Conference. 204,017 articles for S&P 500 stocks (2015–2021).

  • Model: Llama 2 7B fine-tuned with LoRA (4.2M trainable parameters)
  • Speed: daily rebalancing based on sentiment scores
  • Long top 35%, short bottom 35%
  • Performance: Sharpe ratio 2.4, annualized return 45%

A fine-tuned 7B model outperforms FinBERT by 44.7% in cumulative returns — and runs on a single GPU at a fraction of GPT-4’s cost.

Social Media

Cookson, Lu, Mullins & Niessner (2024), “The Social Signal,” Journal of Financial Economics.

  • Sources: Twitter, StockTwits, and Seeking Alpha
  • Speed: social media sentiment predicts next-day returns (one-to-two-day window)
  • Social media sentiment predicts positive returns; social media attention predicts negative returns — the two signals go in opposite directions

Informativeness deteriorated significantly after the Jan 2021 GameStop short squeeze, concentrated among messages by new users. Platform quality matters.

Earnings Calls and Financial Statements

Earnings Call Summaries

Kim, Muhn & Nikolaev (2024), U. Chicago Booth

  • GPT summarizes 10-K filings and earnings call transcripts to less than one-fifth of their original length
  • Summary sentiment is more predictive of stock returns than original-document sentiment
  • Two-day event window around filing date

Financial Statement Analysis

Kim, Muhn & Nikolaev (2024), U. Chicago Booth

  • GPT-4 given only anonymized financial statements
  • Predicts earnings direction with 60% accuracy vs. 53% for human analysts
  • Long-short alpha > 12% per year

These are the slowest strategies — days to weeks after filings. But the alpha is large because the analysis is deep, not fast.

Multi-Source Agents

MarketSenseAI 2.0 (2025): a GPT-4o multi-agent system that combines news, SEC filings, earnings calls, price data, and macro reports.

  • Speed: monthly rebalancing — the slowest strategy we’ve seen, but the most comprehensive
  • Performance (S&P 100, 2023–2024): cumulative return 125.9% vs. 73.5% benchmark; Sharpe 2.76
  • Performance (S&P 500, 2024): cumulative return 48.7% vs. 25.6% benchmark; alpha 17.6%

Monthly rebalancing is the most realistic for practical implementation — and the returns are still extraordinary.

Summary: Source, Speed, and Performance

Paper | Source | Trade Delay | Sharpe
Lopez-Lira & Tang (2023) | News headlines | Next-day open | 3.28
Kirtac & Germano (2024) | News articles | Same-day or next-day | 3.05
FinLlama (2024) | News articles | Daily | 2.40
Cookson et al. (2024) | Social media | Next day | –
Kim et al. (2024) | Earnings calls / 10-Ks | 2-day event window | –
MarketSenseAI (2025) | Multi-source | Monthly | 2.76

The highest Sharpe ratios require the fastest execution and highest turnover. As you move from headlines to deeper analysis, speed decreases but the alpha persists longer.

Alpha Decay

As more firms adopt LLM-based analysis, alpha from news sentiment decays faster:

  • Lopez-Lira & Tang document the decline: Sharpe 6.5 → 1.2 as their sample extends from 2021 to 2024
  • More sophisticated signals (tone analysis, cross-sector propagation, multi-source) retain more alpha
  • The arms race: better NLP → alpha captured → faster decay → need even better NLP

The moat is shifting from speed of access to depth of understanding.

Sentiment Classification Methods

Bag of Words

The earliest approach: count words from predefined positive/negative lists.

“Apple revenue beats estimates, shares surge 5%” → beats (+1), surge (+1) → positive ✓

But bag-of-words fails on:

  • Negation: “did not report a loss” → counts loss as negative ✗
  • Hedging: “despite strong revenue, warned about disruptions” → counts strong as positive ✗
  • Domain language: liability, outstanding, capital have different meanings in finance

Loughran & McDonald (2011) improved this with finance-specific word lists — but the fundamental limitation remains: words are not sentences.
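The bag-of-words failure mode is easy to see in code. Here is a minimal sketch with tiny illustrative word lists (not a real lexicon): it scores the straightforward headline correctly but mislabels the negated sentence, because it counts loss without seeing did not.

```python
# Minimal bag-of-words sentiment scorer (illustrative word lists, not a real lexicon)
POSITIVE = {"beats", "surge", "strong", "profitable"}
NEGATIVE = {"loss", "disruptions", "warned", "miss"}

def bow_score(text):
    # Crude tokenization: lowercase, drop commas and percent signs, split on spaces
    words = text.lower().replace(",", "").replace("%", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(bow_score("Apple revenue beats estimates, shares surge 5%"))  # 2 -> positive (correct)
print(bow_score("The company did not report a loss"))               # -1 -> negative (wrong: negation ignored)
```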

The Loughran-McDonald Dictionary

Loughran & McDonald (2011) built word lists specifically for financial text by reading thousands of 10-K filings.

Negative words (2,355)

  • impairment, litigation, restated, breach, delinquent, uncollectible, insolvent, misrepresent, overdue, penalized, underperform, whistleblower

Positive words (354)

  • achieve, attain, efficient, favorable, innovative, profitable, stabilize, surpass, upturn, versatile

New categories: uncertainty (297 words), litigious (903 words), constraining (184 words), superfluous (56 words)

Notice the asymmetry: roughly 7× more negative words than positive. Financial disclosures are legally required to discuss risks, so negative language is far more varied.
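In practice, the Loughran-McDonald lists are applied by computing word counts as a fraction of document length. A sketch of that tone measure, using tiny subsets of the real lists for illustration:

```python
# Sketch of Loughran-McDonald-style tone scoring
# (tiny illustrative subsets of the real word lists)
LM_NEGATIVE = {"impairment", "litigation", "breach", "delinquent", "insolvent"}
LM_POSITIVE = {"achieve", "favorable", "innovative", "profitable", "surpass"}

def lm_tone(text):
    words = [w.strip(".,").lower() for w in text.split()]
    neg = sum(w in LM_NEGATIVE for w in words)
    pos = sum(w in LM_POSITIVE for w in words)
    return (pos - neg) / len(words)  # net tone as a fraction of total words

filing = "The company recorded an impairment charge and disclosed pending litigation."
print(f"{lm_tone(filing):.3f}")  # -0.200: 2 negative hits out of 10 words
```

Tone measures like this are typically computed over an entire 10-K, then used as a cross-sectional signal.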

How BERT Learns Language

BERT (Bidirectional Encoder Representations from Transformers) is trained in two stages:

Stage 1: Pre-Training

  • Masked language model: hide 15% of words, predict them from context
  • “The company reported strong [MASK]” → earnings (not losses)
  • Learns grammar, meaning, and relationships from millions of sentences
  • No human labels needed

Stage 2: Fine-Tuning

  • Add a classification layer on top of the pre-trained model
  • Train on labeled examples: sentence → positive / negative / neutral
  • Only a few thousand examples needed (Financial PhraseBank: 4,840 sentences)
  • The pre-trained knowledge transfers to the new task

The key insight: BERT reads both directions simultaneously. “Bank” means something different in “river bank” vs. “bank earnings” — BERT uses the full context to decide.
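Stage 2 adds surprisingly little machinery. A toy numpy sketch of the classification head (dimensions are illustrative; real BERT uses a 768-dimensional pooled embedding): a single linear layer plus softmax on top of the pre-trained encoder's output.

```python
import numpy as np

# Toy sketch of BERT fine-tuning's classification head: a linear layer + softmax
# applied to the pooled [CLS] embedding. Dimensions are illustrative, not real.
rng = np.random.default_rng(0)
hidden = 8                                   # toy hidden size (BERT uses 768)
labels = ["positive", "negative", "neutral"]

cls_embedding = rng.normal(size=hidden)      # stands in for BERT's pooled output
W = rng.normal(size=(len(labels), hidden))   # the only new weights added in fine-tuning
b = np.zeros(len(labels))

logits = W @ cls_embedding + b
probs = np.exp(logits - logits.max())        # numerically stable softmax
probs /= probs.sum()
print(labels[int(probs.argmax())], probs.round(3))
```

Fine-tuning trains W and b (and lightly adjusts the encoder) on the labeled sentences; everything else was learned in pre-training.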

FinBERT: BERT for Finance

FinBERT (Araci, 2019) = BERT pre-trained on financial text, then fine-tuned for 3-class sentiment.

  • 110M parameters
  • Fine-tuned on Financial PhraseBank (4,840 labeled sentences)
  • ~87% accuracy on financial sentiment (14 pp above vanilla BERT)
  • Inference: ~5 ms per sentence on GPU

FinBERT became the standard baseline for financial NLP. It’s fast enough for real-time trading and accurate enough for most headline classification. Open source: ProsusAI/finbert on HuggingFace.

How LLMs Classify Sentiment

Large language models (GPT-4, Claude, Gemini) classify sentiment with no task-specific training — just a well-crafted prompt:

“Classify the sentiment of this financial headline as positive, negative, or neutral. Explain your reasoning.”

Why this works:

  • Pre-trained on trillions of tokens including financial text
  • Understands nuance: “despite strong revenue, guidance was weak” → negative
  • Handles sarcasm, hedging, implicit sentiment
  • Can classify sentiment, extract events, and summarize in a single prompt

The tradeoff: 500 ms – 5 s per API call vs. 5 ms for FinBERT. For real-time trading, use FinBERT for high-confidence cases and route low-confidence cases to an LLM.
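The prompt-based workflow can be sketched without committing to any particular API: build the prompt, send it to the LLM of your choice, and parse the label out of the free-text reply. The call itself is stubbed out below; the reply string is a hypothetical example.

```python
# Sketch of LLM-based sentiment classification: build the prompt, parse the reply.
# The API call itself is omitted; in practice you'd send `prompt` to GPT-4, Claude, etc.
def build_prompt(headline):
    return (
        "Classify the sentiment of this financial headline as positive, "
        f"negative, or neutral. Explain your reasoning.\n\nHeadline: {headline}"
    )

def parse_label(reply):
    reply = reply.lower()
    for label in ("positive", "negative", "neutral"):
        if label in reply:
            return label
    return "neutral"  # fall back if the model's reply has no recognizable label

prompt = build_prompt("Despite strong revenue, guidance was weak")
# Parsing a hypothetical model reply:
print(parse_label("Negative: the weak guidance outweighs the strong revenue."))
```

Because the reply is free text, production systems usually constrain the output format (e.g., "answer with one word") or use structured-output features rather than string matching.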

Try It: FinBERT

import warnings
warnings.filterwarnings("ignore")
import logging
logging.disable(logging.WARNING)

from transformers import pipeline

pipe = pipeline("text-classification", model="ProsusAI/finbert", top_k=None)

headlines = [
    "Apple revenue beats estimates, shares surge 5%",
    "Despite strong revenue, the company warned about supply chain disruptions",
    "The company did not report a loss",
    "Shares fell less than feared after earnings miss",
]

results = pipe(headlines)
for headline, scores in zip(headlines, results):
    best = max(scores, key=lambda x: x["score"])
    print(f"{best['label']:>8s} ({best['score']:.0%})  {headline}")

Output:

positive (92%)  Apple revenue beats estimates, shares surge 5%
negative (97%)  Despite strong revenue, the company warned about supply chain disruptions
 neutral (89%)  The company did not report a loss
negative (97%)  Shares fell less than feared after earnings miss

Try It Yourself

  1. Try it locally: Ask Claude to use FinBERT to determine the sentiment of “Despite record revenue, management issued cautious guidance citing tariff uncertainty.”

  2. Try it on Colab: Compare bag of words, Loughran-McDonald, FinBERT, and Gemma (via OpenRouter) on a variety of sentences — including negation, hedging, and finance-specific language.

Open in Google Colab →

Key References

  • Lopez-Lira & Tang (2023). Can ChatGPT Forecast Stock Price Movements? Journal of Finance.
  • Kirtac & Germano (2024). Sentiment Trading with Large Language Models. Finance Research Letters.
  • Cookson, Lu, Mullins & Niessner (2024). The Social Signal. Journal of Financial Economics.
  • Kim, Muhn & Nikolaev (2024). Bloated Disclosures: Can ChatGPT Help Investors Process Information?
  • Kim, Muhn & Nikolaev (2024). Financial Statement Analysis with Large Language Models.
  • Fatouros, Metaxas, Soldatos & Kyriazis (2025). MarketSenseAI 2.0.
  • Gupta, Raghunath, Buddaraju, Krishnasamy & Mitra (2024). FinLlama. ICAIF ’24.
  • Araci (2019). FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models.
  • Loughran & McDonald (2011). When Is a Liability Not a Liability? Journal of Finance.
  • Chen, Kelly & Xiu (2023). Expected Returns and Large Language Models. NBER Working Paper 31502.