Speed: overnight news → trade at next-day market open, exit at close (single-day hold)
Performance: GPT-4 long-short Sharpe ratio 3.28
Forecasting ability increases with model size (GPT-4 >> GPT-3.5 >> GPT-3): financial reasoning appears to be an "emergent capability" of larger LLMs. But the Sharpe ratio declined from 6.5 to 1.2 as the sample extended from 2021 to 2024, consistent with more traders adopting the same approach.
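The Sharpe ratios quoted throughout are annualized from daily long-short portfolio returns. A minimal sketch of the computation, assuming a zero risk-free rate (the return series below is simulated for illustration, not taken from any paper):

```python
import numpy as np

def annualized_sharpe(daily_returns, trading_days=252):
    """Annualized Sharpe ratio of a daily return series.

    Assumes a zero risk-free rate, as is common for daily
    long-short (self-financing) portfolio returns.
    """
    r = np.asarray(daily_returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(trading_days)

# Illustrative only: simulated strategy earning 0.1%/day with 1% daily vol
rng = np.random.default_rng(0)
daily = rng.normal(loc=0.001, scale=0.01, size=252)
print(f"Sharpe = {annualized_sharpe(daily):.2f}")
```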
News Articles
Kirtac & Germano (2024), Finance Research Letters. 965,375 U.S. financial news articles (2010–2023).
Speed: news before 6 AM → trade at same-day open; news 6 AM–4 PM → trade at same-day close; news after 4 PM → trade at next-day open
Long top 20% sentiment, short bottom 20%, daily rebalanced
FinLlama (2024). Model: Llama 2 7B fine-tuned with LoRA (4.2M trainable parameters)
Speed: daily rebalancing based on sentiment scores
Long top 35%, short bottom 35%
Performance: Sharpe ratio 2.4, annualized return 45%
A fine-tuned 7B model outperforms FinBERT by 44.7% in cumulative returns — and runs on a single GPU at a fraction of GPT-4’s cost.
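Both news-article strategies above share the same portfolio mechanics: score each stock's news, go long the top sentiment bucket, short the bottom, and rebalance daily. A minimal pandas sketch of one day's weights (tickers and scores are made up; the 20% cutoff follows the rule above):

```python
import pandas as pd

def long_short_weights(scores: pd.Series, quantile: float = 0.20) -> pd.Series:
    """Equal-weight long the top quantile, short the bottom quantile."""
    lo, hi = scores.quantile([quantile, 1 - quantile])
    longs = scores >= hi
    shorts = scores <= lo
    w = pd.Series(0.0, index=scores.index)
    w[longs] = 1.0 / longs.sum()     # long leg sums to +1
    w[shorts] = -1.0 / shorts.sum()  # short leg sums to -1
    return w

# Illustrative sentiment scores for five tickers on one day
scores = pd.Series({"AAPL": 0.9, "MSFT": 0.4, "XOM": 0.1, "GE": -0.3, "F": -0.8})
print(long_short_weights(scores))
```

The resulting weights are dollar-neutral by construction: the long and short legs offset, so the portfolio is a pure bet on the sentiment signal.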
Social Media
Cookson, Lu, Mullins & Niessner (2024), “The Social Signal,” Journal of Financial Economics.
Sources: Twitter, StockTwits, and Seeking Alpha
Speed: social media sentiment predicts next-day returns (one-to-two-day window)
Social media sentiment positively predicts returns, while social media attention negatively predicts them; the two signals point in opposite directions
Informativeness deteriorated significantly after the Jan 2021 GameStop short squeeze, concentrated among messages by new users. Platform quality matters.
Earnings Calls and Financial Statements
Earnings Call Summaries
Kim, Muhn & Nikolaev (2024), U. Chicago Booth
GPT summarizes 10-K filings and earnings call transcripts to under one-fifth of their original length
Summary sentiment is more predictive of stock returns than original-document sentiment
Two-day event window around filing date
Financial Statement Analysis
Kim, Muhn & Nikolaev (2024), U. Chicago Booth
GPT-4 given only anonymized financial statements
Predicts earnings direction with 60% accuracy vs. 53% for human analysts
Long-short alpha > 12% per year
These are the slowest strategies — days to weeks after filings. But the alpha is large because the analysis is deep, not fast.
Multi-Source Agents
MarketSenseAI 2.0 (2025): a GPT-4o multi-agent system that combines news, SEC filings, earnings calls, price data, and macro reports.
Speed: monthly rebalancing — the slowest strategy we’ve seen, but the most comprehensive
Monthly rebalancing is the most realistic for practical implementation, and the risk-adjusted returns (Sharpe 2.76) remain competitive with far faster strategies.
Summary: Source, Speed, and Performance
| Paper | Source | Trade Delay | Sharpe |
|---|---|---|---|
| Lopez-Lira & Tang (2023) | News headlines | Next-day open | 3.28 |
| Kirtac & Germano (2024) | News articles | Same-day or next-day | 3.05 |
| FinLlama (2024) | News articles | Daily | 2.40 |
| Cookson et al. (2024) | Social media | Next day | — |
| Kim et al. (2024) | Earnings calls / 10-Ks | 2-day event window | — |
| MarketSenseAI (2025) | Multi-source | Monthly | 2.76 |
The highest Sharpe ratios require the fastest execution and highest turnover. As you move from headlines to deeper analysis, speed decreases but the alpha persists longer.
Alpha Decay
As more firms adopt LLM-based analysis, alpha from news sentiment decays faster:
Lopez-Lira & Tang document the decline: Sharpe 6.5 → 1.2 as their sample extends from 2021 to 2024
More sophisticated signals (tone analysis, cross-sector propagation, multi-source) retain more alpha
The arms race: better NLP → alpha captured → faster decay → need even better NLP
The moat is shifting from speed of access to depth of understanding.
Sentiment Classification Methods
Bag of Words
The earliest approach: count words from predefined positive/negative lists.
Notice the asymmetry: 7× more negative words than positive. Financial disclosures are legally required to discuss risks, so negative language is far more varied.
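That counting approach fits in a few lines. A toy sketch with a handful of illustrative words (the real Loughran-McDonald dictionaries contain thousands of terms):

```python
# Tiny illustrative word lists; real dictionaries (e.g. Loughran-McDonald)
# contain hundreds of positive and thousands of negative terms.
POSITIVE = {"strong", "beat", "beats", "surge", "growth", "record"}
NEGATIVE = {"loss", "losses", "weak", "decline", "miss", "impairment"}

def bow_sentiment(text: str) -> str:
    """Classify by counting positive vs. negative word hits."""
    words = [w.strip(".,") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(bow_sentiment("Apple revenue beats estimates, shares surge 5%"))  # positive
print(bow_sentiment("The company did not report a loss"))               # negative (!)
```

The second example shows the approach's blind spot: "did not report a loss" counts one negative word and misses the negation entirely, exactly the failure mode that contextual models like FinBERT address.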
How BERT Learns Language
BERT (Bidirectional Encoder Representations from Transformers) is trained in two stages:
Stage 1: Pre-Training
Masked language model: hide 15% of words, predict them from context
“The company reported strong [MASK]” → earnings (not losses)
Learns grammar, meaning, and relationships from millions of sentences
No human labels needed
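The masking step in Stage 1 can be sketched in plain Python. This only produces the masked input and the hidden targets; filling the [MASK] slots back in is what the network learns during pre-training:

```python
import random

def mask_tokens(sentence: str, mask_rate: float = 0.15, seed: int = 0):
    """Hide roughly `mask_rate` of the tokens, returning the masked
    sentence and the positions/words the model must predict."""
    rng = random.Random(seed)
    tokens = sentence.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return " ".join(masked), targets

sentence = "The company reported strong earnings growth this quarter"
masked, targets = mask_tokens(sentence, seed=1)
print(masked)   # the model's input, with some tokens hidden
print(targets)  # the words it must reconstruct from context
```

Real BERT masks subword tokens rather than whole words and mixes in random-token and keep-as-is replacements, but the training signal is the same: predict the hidden pieces from their context.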
Stage 2: Fine-Tuning
Add a classification layer on top of the pre-trained model
Only a few thousand examples needed (Financial PhraseBank: 4,840 sentences)
The pre-trained knowledge transfers to the new task
The key insight: BERT reads both directions simultaneously. “Bank” means something different in “river bank” vs. “bank earnings” — BERT uses the full context to decide.
FinBERT: BERT for Finance
FinBERT (Araci, 2019) = BERT pre-trained on financial text, then fine-tuned for 3-class sentiment.
110M parameters
Fine-tuned on Financial PhraseBank (4,840 labeled sentences)
~87% accuracy on financial sentiment (14 pp above vanilla BERT)
Inference: ~5 ms per sentence on GPU
FinBERT became the standard baseline for financial NLP. It’s fast enough for real-time trading and accurate enough for most headline classification. Open source: ProsusAI/finbert on HuggingFace.
How LLMs Classify Sentiment
Large language models (GPT-4, Claude, Gemini) classify sentiment with no task-specific training — just a well-crafted prompt:
“Classify the sentiment of this financial headline as positive, negative, or neutral. Explain your reasoning.”
Why this works:
Pre-trained on trillions of tokens including financial text
Understands nuance: “despite strong revenue, guidance was weak” → negative
Handles sarcasm, hedging, implicit sentiment
Can classify sentiment, extract events, and summarize in a single prompt
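A sketch of that single-prompt workflow. The prompt wording, the JSON schema, and the canned reply below are illustrative assumptions, not any specific vendor's API:

```python
import json

# Illustrative prompt template: classify, extract, and summarize in one call
PROMPT_TEMPLATE = """Classify the sentiment of this financial headline as
positive, negative, or neutral, extract any named company, and give a
one-line summary. Respond as JSON with keys "sentiment", "company", "summary".

Headline: {headline}"""

def build_prompt(headline: str) -> str:
    return PROMPT_TEMPLATE.format(headline=headline)

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply; real code should handle malformed output."""
    return json.loads(raw)

# A canned reply standing in for a real API response
reply = ('{"sentiment": "negative", "company": "Acme", '
         '"summary": "Weak guidance despite strong revenue"}')
print(parse_response(reply)["sentiment"])
```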
The tradeoff: 500 ms – 5 s per API call vs. 5 ms for FinBERT. For real-time trading, use FinBERT for high-confidence cases and route low-confidence cases to an LLM.
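That routing is a simple confidence gate. In this sketch, `finbert_scores` has the shape returned by the HuggingFace text-classification pipeline with `top_k=None`, and `llm_classify` is a hypothetical callable wrapping whatever LLM API you use:

```python
def classify_with_routing(headline, finbert_scores, threshold=0.85,
                          llm_classify=None):
    """Use FinBERT's label when its top-class probability clears the
    threshold; otherwise fall back to the slower, smarter LLM.

    finbert_scores: list of {"label": ..., "score": ...} dicts.
    llm_classify: hypothetical callable wrapping an LLM API.
    Returns (label, which_model_answered).
    """
    best = max(finbert_scores, key=lambda s: s["score"])
    if best["score"] >= threshold or llm_classify is None:
        return best["label"], "finbert"
    return llm_classify(headline), "llm"

# High confidence: FinBERT's answer is used directly
clear = [{"label": "positive", "score": 0.97},
         {"label": "neutral", "score": 0.02},
         {"label": "negative", "score": 0.01}]
print(classify_with_routing("Apple revenue beats estimates", clear))

# Low confidence: routed to the LLM fallback (a stub here)
ambiguous = [{"label": "positive", "score": 0.40},
             {"label": "neutral", "score": 0.35},
             {"label": "negative", "score": 0.25}]
print(classify_with_routing("Shares fell less than feared", ambiguous,
                            llm_classify=lambda h: "negative"))
```

The threshold trades cost for accuracy: with a high cutoff, only genuinely ambiguous headlines pay the 500 ms - 5 s LLM latency.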
Try It: FinBERT
import warnings
warnings.filterwarnings("ignore")
import logging
logging.disable(logging.WARNING)

from transformers import pipeline

pipe = pipeline("text-classification", model="ProsusAI/finbert", top_k=None)

headlines = [
    "Apple revenue beats estimates, shares surge 5%",
    "Despite strong revenue, the company warned about supply chain disruptions",
    "The company did not report a loss",
    "Shares fell less than feared after earnings miss",
]

results = pipe(headlines)
for headline, scores in zip(headlines, results):
    best = max(scores, key=lambda x: x["score"])
    print(f"{best['label']:>8s} ({best['score']:.0%}) {headline}")
positive (92%) Apple revenue beats estimates, shares surge 5%
negative (97%) Despite strong revenue, the company warned about supply chain disruptions
neutral (89%) The company did not report a loss
negative (97%) Shares fell less than feared after earnings miss
Try It Yourself
Try it locally: Ask Claude to use FinBERT to determine the sentiment of “Despite record revenue, management issued cautious guidance citing tariff uncertainty.”
Try it on Colab: Compare bag of words, Loughran-McDonald, FinBERT, and Gemma (via OpenRouter) on a variety of sentences — including negation, hedging, and finance-specific language.