Module 8: Web Scraping with AI

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

Web Scraping Fundamentals

What Is Web Scraping?

Getting data from websites is a task where AI + coding can dramatically increase efficiency.

Three levels of difficulty:

  1. Level 1 — Easy: extract a table from a static HTML page
  2. Level 2 — Medium: construct URLs and loop over downloads
  3. Level 3 — Hard: pages rendered dynamically with JavaScript

Level 1: Extracting a Table from Wikipedia

This Wikipedia page maintains a list of all S&P 500 companies with ticker, sector, headquarters, etc.

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500 = tables[0]
sp500.to_excel("sp500.xlsx", index=False)

Compared to manually typing 500 rows, this takes seconds.

How Does read_html Work?

Right-click on the Wikipedia page → “View page source” to see the raw HTML. The read_html function searches for <table> tags, then interprets:

  • <tbody> = table body
  • <tr> = table row
  • <th> = table header
  • <td> = table data

It builds a pandas DataFrame from the HTML structure — no AI required, just pattern matching.
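To see the mechanism in isolation, you can feed read_html a raw HTML string instead of a URL (the table below is a made-up two-row example):

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Symbol</th><th>Sector</th></tr>
  <tr><td>MGY</td><td>Energy</td></tr>
  <tr><td>AAPL</td><td>Information Technology</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> tag;
# rows of <th> cells become the column headers
df = pd.read_html(StringIO(html))[0]
print(df)
```

The same parsing happens on the Wikipedia page — there is just more HTML around the table.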

Level 2: FINRA Short Interest Data

FINRA publishes short interest data twice monthly. Clicking a date downloads a CSV. What’s the URL pattern?

Right-click → “View page source” → search for a date like 20250430:

https://cdn.finra.org/equity/otcmarket/biweekly/shrt20250430.csv

Now we can automate the download by looping over dates.

Automating FINRA Downloads

import pandas as pd
from datetime import date, timedelta

frames = []
d = date(2024, 1, 1)
while d <= date(2025, 4, 30):
    url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{d:%Y%m%d}.csv"
    try:
        frames.append(pd.read_csv(url, sep="|"))
    except Exception:
        pass  # not a reporting date --- skip it
    d += timedelta(days=1)

Automating FINRA Downloads (cont.)

df = pd.concat(frames)
df.to_csv("finra_short_interest.csv", index=False)

Try every possible date and silently skip failures — quick and effective. Only FINRA settlement dates (roughly the 15th and last business day) will succeed.
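If you want fewer wasted requests, you can generate only dates near that rough schedule before trying the URLs. A sketch — the "12th–18th" and "last four days of the month" windows are our assumptions, not FINRA's official calendar:

```python
from datetime import date, timedelta

def candidate_dates(start: date, end: date):
    """Yield only days near the 15th or month-end --- the rough
    FINRA settlement schedule --- instead of every calendar day."""
    d = start
    while d <= end:
        near_mid = 12 <= d.day <= 18
        near_eom = (d + timedelta(days=4)).month != d.month
        if near_mid or near_eom:
            yield d
        d += timedelta(days=1)

# Roughly 11 candidate days per month instead of ~30
print(len(list(candidate_dates(date(2024, 1, 1), date(2024, 1, 31)))))
```

The brute-force loop still works; this just cuts the number of failed downloads by about two thirds.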

Level 3: JavaScript-Rendered Pages

Many pages are dynamically generated with JavaScript. The data we want isn’t in the HTML source — it’s fetched by executing JavaScript code.

Example: Open a Tesla 10-K on EDGAR. The first table lists Tesla’s directors. But if you “View page source” and search for “Elon Musk” — 0 results.

The page content is loaded dynamically by JavaScript — the raw HTML is just a shell.

Playwright: Automating a Real Browser

Playwright is a library that controls a real browser (Chromium, Firefox, or WebKit) from Python or JavaScript. It executes the JavaScript on the page, then lets you read the rendered HTML.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.sec.gov/ix?doc=/Archives/edgar/data/...")

    # Now the JavaScript has executed --- we can see the data
    content = page.content()          # full rendered HTML
    page.screenshot(path="page.png")  # visual snapshot
    browser.close()

Unlike requests or read_html, Playwright sees the page as a human would.
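Once Playwright hands you the rendered HTML, the Level 1 tools apply again. A sketch — the helper name `tables_from_rendered` is ours, and in practice `content` would come from `page.content()` in the snippet above (the inline string here is just an offline stand-in):

```python
from io import StringIO
import pandas as pd

def tables_from_rendered(html: str) -> list:
    """Parse every <table> out of already-rendered HTML."""
    return pd.read_html(StringIO(html))

# In practice: content = page.content() from the Playwright snippet,
# then tables_from_rendered(content). Offline demo:
demo = "<table><tr><th>Director</th></tr><tr><td>Elon Musk</td></tr></table>"
print(tables_from_rendered(demo)[0])
```

In other words, Playwright solves only the rendering problem; extraction is the same pandas pattern as Level 1.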

Playwright: Key Capabilities

Navigation & Interaction

  • Click buttons, fill forms, select dropdowns
  • Wait for elements to load
  • Handle login pages and cookies
  • Download files triggered by clicks

Data Extraction

  • Read rendered HTML after JavaScript executes
  • Take screenshots of pages or elements
  • Extract text, tables, links
  • Record and replay browser sessions

Install: pip install playwright && playwright install

Playwright MCP: AI-Controlled Browsing

The Playwright MCP server connects Playwright to Claude Code, letting the AI control the browser directly.

Add to ~/.claude/.mcp.json:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}

Once configured, Claude Code can navigate pages, click links, fill forms, take screenshots, and extract data — all through natural language instructions.

When to Use Each Tool

| Scenario | Tool | Why |
|---|---|---|
| Static HTML table | pd.read_html | Simplest, fastest |
| File download by URL pattern | requests / curl | Construct URLs, loop |
| API with JSON responses | requests | Parse structured JSON (requests is the de facto standard Python HTTP library) |
| JavaScript-rendered page | Playwright | Executes JS, sees the rendered page |
| Complex multi-step scraping | Playwright MCP | AI navigates and decides |
| Login-protected content | Playwright or MCP | Handles authentication flows |

As we’ll see in the case study below, we used the EDGAR API and direct downloads where possible, avoiding the need for browser automation.

Case Study: Collecting Financial Documents

Goal: Company Research Library

We want to collect a comprehensive set of financial documents for a single company — the kind of library an analyst would build before a valuation. In Module 11, we will feed this corpus to an AI system (RAG) to generate analysis — but first we need the raw materials.

Target company: Magnolia Oil & Gas (MGY) — a Houston-based small-cap E&P company operating in the Eagle Ford Shale.

What we need:

  • 10-K annual reports (MD&A, risk factors, financial statements)
  • Earnings call transcripts and press releases
  • News articles

Why Magnolia?

  • Public since 2018 — 7 years of filings on EDGAR, enough for trend analysis
  • Houston-headquartered — relevant to Rice students
  • Simple business model — onshore oil & gas E&P, easy to understand financially
  • Thin analyst coverage — free transcript sources are spotty, so we need multiple scraping techniques
  • In your group projects, you will pick your own company

Data Sources Overview

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'padding': 16, 'useMaxWidth': true}}}%%
flowchart LR
  EDGAR["<b>SEC EDGAR</b><br>10-Ks, 8-Ks"] --> CORPUS["<b>Document Library</b><br>58 documents"]
  MF["<b>Motley Fool</b><br>Transcripts"] --> CORPUS
  IR["<b>Company IR</b><br>Press releases"] --> CORPUS
  NEWS["<b>News Sites</b><br>Articles"] --> CORPUS

  style EDGAR fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style MF fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style IR fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style NEWS fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style CORPUS fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px

Step 1: Finding the 10-Ks

The EDGAR API

SEC EDGAR provides a free REST API that returns filing metadata as JSON. No API key required — just a User-Agent header.

import requests

url = "https://data.sec.gov/submissions/CIK0001698990.json"
headers = {"User-Agent": "Kerry Back keb7@rice.edu"}
data = requests.get(url, headers=headers).json()

for i, form in enumerate(data["filings"]["recent"]["form"]):
    if form == "10-K":
        date = data["filings"]["recent"]["filingDate"][i]
        doc = data["filings"]["recent"]["primaryDocument"][i]
        print(f"{date}: {doc}")

What We Got: 7 Years of 10-Ks

| Fiscal Year | Filing Date | Document |
|---|---|---|
| 2025 | 2026-02-12 | mgy-20251231.htm |
| 2024 | 2025-02-19 | mgy-20241231.htm |
| 2023 | 2024-02-15 | mgy-20231231.htm |
| 2022 | 2023-02-16 | mgy-20221231.htm |
| 2021 | 2022-02-17 | mgy-20211231.htm |
| 2020 | 2021-02-23 | mgy-20201231.htm |
| 2019 | 2020-02-26 | a10-k2019.htm |

All downloaded as HTML directly from EDGAR.
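The metadata maps to a download URL with a fixed pattern: CIK without leading zeros, accession number without dashes, then the primary document name. A sketch — `filing_url` is our helper name, and the accession number shown is the one from the Q3 2025 8-K press-release example used later in this module:

```python
def filing_url(cik: int, accession: str, doc: str) -> str:
    # Archives path: CIK without leading zeros, accession without dashes
    return (f"https://www.sec.gov/Archives/edgar/data/{cik}/"
            f"{accession.replace('-', '')}/{doc}")

url = filing_url(1698990, "0001698990-25-000028",
                 "ex991-3q2025earningsrele.htm")
print(url)
# To actually download, pass the same User-Agent header as before:
# html = requests.get(url, headers={"User-Agent": "Name email"}).text
```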

Step 2: Extracting Sections

Extracting 10-K Sections with Claude Code

The raw EDGAR HTML files are cluttered with CSS, XBRL tags, and formatting. We asked Claude Code to extract three sections from each 10-K:

“Read the file mgy-10k-2025.htm. Extract Item 7 (MD&A), Item 1A (Risk Factors), and the Financial Statements. Strip all XBRL tags and CSS. Convert HTML tables to markdown pipe format. Save each section as a separate .md file.”

Claude used BeautifulSoup to parse the HTML, find section boundaries, and produce clean markdown. The complete extraction scripts are in the magnolia/10k/ folder.
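The core of such a script is a handful of BeautifulSoup operations. A minimal sketch of the cleaning step — `clean_text` is our illustrative helper, not the exact script Claude wrote (that is in magnolia/10k/):

```python
from bs4 import BeautifulSoup

def clean_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["style", "script"]):   # drop CSS and scripts
        tag.decompose()
    # inline-XBRL tags look like <ix:nonnumeric>; keep their text
    for tag in soup.find_all(lambda t: t.name.startswith("ix:")):
        tag.unwrap()
    return soup.get_text(" ", strip=True)

print(clean_text("<div><style>p{}</style>"
                 "<ix:nonnumeric>Revenue</ix:nonnumeric> rose 8%</div>"))
```

Finding section boundaries ("Item 7", "Item 1A") on top of this is mostly string search over the cleaned text.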

21 Extracted Files

For each of the 7 fiscal years, we extracted:

MD&A

  • Management’s Discussion & Analysis (Item 7)
  • Revenue drivers, cost trends, liquidity
  • 33–46 KB per year

Risk Factors

  • Item 1A: all risk categories
  • Commodity prices, regulatory, operational
  • 70–83 KB per year

Financial Statements

  • Balance sheets, income statements, cash flows
  • Equity statements + all footnotes
  • 120–174 KB per year

Example: Extracted Financial Data (FY2025)

The output is clean markdown tables ready for AI analysis:

| ASSETS | 2025 | 2024 |
|---|---:|---:|
| Cash & equiv. | $266,785 | $260,049 |
| Current assets | 442,435 | 410,824 |
| TOTAL ASSETS | $2,903,092 | $2,820,835 |

| REVENUES | 2025 | 2024 |
|---|---:|---:|
| Oil revenues | $918,027 | $1,046,675 |
| Total revenues | 1,311,845 | 1,315,886 |
| NET INCOME | 337,279 | 397,330 |
(In thousands. Full statements in magnolia/10k/extracted/)

Step 3: Earnings Calls

Two Sources of Earnings Data

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'padding': 16, 'useMaxWidth': true}}}%%
flowchart TD
  EC["<b>Quarterly Earnings</b><br>27 quarters (Q2 2019 -- Q4 2025)"]
  EC --> T["<b>Full Transcripts</b><br>6 quarters<br>(Motley Fool)"]
  EC --> PR["<b>Press Releases</b><br>21 quarters<br>(EDGAR 8-K filings)"]

  style EC fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style T fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style PR fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px

  • Transcripts include management commentary and analyst Q&A
  • Press releases include financial tables, production data, and guidance
  • Together they cover every quarter from Q2 2019 onward

Transcript Sources: The Challenge

Full earnings call transcripts are hard to get for free:

| Source | Coverage | Access |
|---|---|---|
| Motley Fool | Spotty for small-caps | Free |
| Seeking Alpha | Complete | Paywalled |
| MarketScreener | Complete | Paywalled |
| GuruFocus | Complete | JS-rendered / paywalled |
| Company webcasts | Audio only | Registration wall |

Our approach: Grab what’s free from Motley Fool, fill gaps with EDGAR press releases — or transcribe the audio yourself.

Option: Transcribe Company IR Audio

If no free transcript exists, you can transcribe the audio webcast from the company’s IR page. (We did not use this for Magnolia.)

How it works

# OpenAI's Whisper API ($0.006/min)
from openai import OpenAI
client = OpenAI()
with open("earnings_call.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=f
    )
print(transcript.text)

Or run locally for free: pip install openai-whisper

What to know

  • Whisper API — $0.006/min, fast, accurate
  • Local Whisper — free, runs on your machine
  • AssemblyAI — adds speaker labels (who said what)
  • Audio quality varies; no speaker labels with basic Whisper
  • A 60-min call produces 8,000–12,000 words

Fetching Transcripts from Motley Fool

“Fetch the Magnolia Oil & Gas Q3 2025 earnings call transcript from Motley Fool (https://www.fool.com/earnings/call-transcripts/2025/10/30/magnolia-oil-and-gas-mgy-earnings-transcript/). Extract the full text and save it as magnolia/transcripts/mgy-q3-2025-earnings.md.”

Claude Code used WebFetch to download and clean the page, capturing prepared remarks, financial highlights, management guidance, and the full analyst Q&A.

Fetching Press Releases from EDGAR

For quarters without free transcripts, we downloaded Exhibit 99.1 from each earnings 8-K filing:

import requests

# Same EDGAR API --- filter for 8-K filings, then download
# Exhibit 99.1 (the earnings press release). It contains the
# financial tables, production data, and guidance.
url = ("https://www.sec.gov/Archives/edgar/data/1698990/"
       "000169899025000028/ex991-3q2025earningsrele.htm")
headers = {"User-Agent": "Kerry Back keb7@rice.edu"}
html = requests.get(url, headers=headers).text
with open("mgy-q3-2025-press-release.htm", "w") as f:
    f.write(html)

Example output: magnolia/transcripts/mgy-q4-2024-press-release.md

Example: Transcript Excerpt (Q3 2025)

Christopher Stavros, CEO: “We have a lot of both financial and operational flexibility, especially considering some of those deferrals that have snaked through the system in 2025.”

Key data points extracted:

  • Record production: 100,500 BOE/day (11% YoY growth)
  • Free cash flow: $134 million
  • Adjusted EBITDAX: $219 million
  • Dividend increased 15% to $0.15/share quarterly

Step 4: News Articles

Scraping Financial News

The final layer: news articles from financial media. Claude Code’s WebSearch and WebFetch tools handle this directly.

The prompt

“Search for recent news articles about Magnolia Oil & Gas (MGY). For each article, fetch the full text and save it as a markdown file in magnolia/news/ with the format YYYY-MM-DD_source_title.md.”

What we collected

  • Earnings coverage and analyst reactions
  • Analyst upgrades/downgrades and price targets
  • Dividend increases and buyback announcements
  • Production records and guidance updates
  • 10 articles from Benzinga, AlphaStreet, Zacks, and others

The Complete Corpus

What We Built

10-K Extractions

  • 7 fiscal years (2019–2025)
  • 21 files: MD&A + Risk Factors + Financials
  • ~900 KB of clean markdown

Earnings Data

  • 6 full transcripts
  • 21 press releases
  • 27 files covering every quarter

News Articles

  • 10 articles from financial media
  • Analyst upgrades, earnings coverage
  • Dividend and strategy news

All stored as clean, structured text. Browse the full corpus: Magnolia Documents (Dropbox)

Corpus File Structure

magnolia/
├── 10k/                        7 raw HTML + extracted/
│   └── extracted/              mda, risk-factors, financial-statements
│                               (7 × 3 = 21 markdown files)
├── transcripts/                6 earnings + 21 press releases (27 files)
└── news/                       10 articles from financial media

Key Techniques Summary

| Step | Tool / API | What It Does |
|---|---|---|
| Find filings | EDGAR Submissions API | Returns filing metadata as JSON |
| Download 10-Ks | curl + EDGAR URLs | Fetches HTML filings directly |
| Parse HTML | BeautifulSoup | Strips XBRL/CSS, extracts sections |
| Convert tables | Python script | HTML tables → markdown pipe format |
| Get transcripts | WebFetch + Motley Fool | Extracts full call transcripts |
| Get press releases | EDGAR 8-K Exhibit 99.1 | Quarterly financial data |
| Scrape news | WebSearch + WebFetch | Articles from financial media |

Lessons Learned

What worked well:

  • EDGAR API is reliable and free — no authentication needed
  • Constructing URLs from patterns (like the FINRA example) scales to many filings
  • pd.read_html and BeautifulSoup handle static HTML tables well
  • Motley Fool provides full transcripts for free (when they cover the company)

What was hard:

  • Transcript coverage for small-caps is spotty on free sources
  • Most transcript providers (Seeking Alpha, MarketScreener) are paywalled
  • Company webcasts use JS-based players (the same JavaScript problem we saw with Tesla/EDGAR)
  • EDGAR HTML is messy — XBRL tags and inline styles require serious parsing

Exercises

Exercise 1: Scrape a Wikipedia Table

Use pd.read_html to extract the S&P 500 company list from Wikipedia. Save it as an Excel file. How many sectors are represented? Which sector has the most companies?

Exercise 2: Build a Mini Corpus

Pick a public company of your choice. Using Claude Code:

  1. Ask Claude Code to find the company’s CIK on EDGAR (it can use https://www.sec.gov/files/company_tickers.json)
  2. Download the most recent 10-K using the EDGAR Submissions API
  3. Ask Claude to extract the MD&A section and save it as markdown
  4. Find and save 3 news articles about the company using WebSearch + WebFetch

Submit: the four markdown files and a brief note on what worked and what was hard.

Exercise 3: Adapt to Your Group Project Company

For the company your group is analyzing, collect:

  • The 3 most recent 10-K filings from EDGAR
  • At least 2 earnings call transcripts or press releases (from Motley Fool, EDGAR 8-K Exhibit 99.1, or transcribed from IR audio)
  • 5+ news articles

This corpus will be used in Module 11 (RAG) exercises. Store everything in a corpus/ folder in your project. The Magnolia extraction scripts in magnolia/10k/ can serve as a reference.

Summary

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '20px'}, 'flowchart': {'nodeSpacing': 50, 'rankSpacing': 70, 'padding': 14, 'useMaxWidth': true}}}%%
flowchart LR
  S1["<b>1. Identify<br>Sources</b>"] --> S2["<b>2. Download<br>Documents</b>"]
  S2 --> S3["<b>3. Parse &<br>Extract</b>"]
  S3 --> S4["<b>4. Collect<br>Transcripts</b>"]
  S4 --> S5["<b>5. Scrape<br>News</b>"]
  S5 --> S6["<b>6. Clean &<br>Structure</b>"]

  style S1 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S2 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S3 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S4 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S5 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S6 fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px

Web scraping is fundamentally a data engineering problem: finding sources, downloading documents, parsing messy formats, and converting everything to clean, structured text that AI tools can work with.

What’s Next

In Module 11 (RAG), we will feed this Magnolia corpus to an AI retrieval system that can answer questions grounded in the actual documents — citing specific passages from 10-Ks, transcripts, and news articles.

The quality of RAG output depends entirely on the quality of the corpus. What you scrape determines what AI can find.