Module 8: Web Scraping with AI

MGMT 675: Generative AI for Finance

Kerry Back, Rice University

Web Scraping Fundamentals

What Is Web Scraping?

Getting data from websites is a task where AI + coding can dramatically increase efficiency.

Three levels of difficulty:

  1. Level 1 — Easy: extract a table from a static HTML page
  2. Level 2 — Medium: construct URLs and loop over downloads
  3. Level 3 — Hard: pages rendered dynamically with JavaScript

Level 1: Extracting a Table from Wikipedia

This Wikipedia page maintains a list of all S&P 500 companies with ticker, sector, headquarters, etc.

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500 = tables[0]
sp500.to_excel("sp500.xlsx", index=False)

Compared to manually typing 500 rows, this takes seconds.

How Does read_html Work?

Right-click on the Wikipedia page → “View page source” to see the raw HTML. The read_html function searches for <table> tags, then interprets:

  • <tbody> = table body
  • <tr> = table row
  • <th> = table header
  • <td> = table data

It builds a pandas DataFrame from the HTML structure — no AI required, just pattern matching.
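To see the mechanism in isolation, you can feed read_html a raw HTML string instead of a URL (the table below is a made-up two-row example):

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Symbol</th><th>Sector</th></tr>
  <tr><td>MGY</td><td>Energy</td></tr>
  <tr><td>AAPL</td><td>Information Technology</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> tag;
# rows of <th> cells become the column headers
df = pd.read_html(StringIO(html))[0]
print(df)
```

The same parsing happens on the Wikipedia page — there is just more HTML around the table.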

Level 2: FINRA Short Interest Data

FINRA publishes short interest data twice monthly. Clicking a date downloads a CSV. What’s the URL pattern?

Right-click → “View page source” → search for a date like 20250430:

https://cdn.finra.org/equity/otcmarket/biweekly/shrt20250430.csv

Now we can automate the download by looping over dates.

Automating FINRA Downloads

import pandas as pd
from datetime import date, timedelta

frames = []
d = date(2024, 1, 1)
while d <= date(2025, 4, 30):
    url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{d:%Y%m%d}.csv"
    try:
        frames.append(pd.read_csv(url, sep="|"))
    except Exception:
        pass  # not a reporting date --- skip it
    d += timedelta(days=1)

Automating FINRA Downloads (cont.)

df = pd.concat(frames)
df.to_csv("finra_short_interest.csv", index=False)

Try every possible date and silently skip failures — quick and effective. Only FINRA settlement dates (roughly the 15th and last business day) will succeed.
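If you want fewer wasted requests, you can generate only dates near that rough schedule before trying the URLs. A sketch — the "12th–18th" and "last four days of the month" windows are our assumptions, not FINRA's official calendar:

```python
from datetime import date, timedelta

def candidate_dates(start: date, end: date):
    """Yield only days near the 15th or month-end --- the rough
    FINRA settlement schedule --- instead of every calendar day."""
    d = start
    while d <= end:
        near_mid = 12 <= d.day <= 18
        near_eom = (d + timedelta(days=4)).month != d.month
        if near_mid or near_eom:
            yield d
        d += timedelta(days=1)

# Roughly 11 candidate days per month instead of ~30
print(len(list(candidate_dates(date(2024, 1, 1), date(2024, 1, 31)))))
```

The brute-force loop still works; this just cuts the number of failed downloads by about two thirds.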

Level 3: JavaScript-Rendered Pages

Many pages are dynamically generated with JavaScript. The data we want isn’t in the HTML source — it’s fetched by executing JavaScript code.

Example: Open a Tesla 10-K on EDGAR. The first table lists Tesla’s directors. But if you “View page source” and search for “Elon Musk” — 0 results.

The page content is loaded dynamically by JavaScript — the raw HTML is just a shell.

Playwright: Automating a Real Browser

Playwright is a library that controls a real browser (Chromium, Firefox, or WebKit) from Python or JavaScript. It executes the JavaScript on the page, then lets you read the rendered HTML.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.sec.gov/ix?doc=/Archives/edgar/data/...")

    # Now the JavaScript has executed --- we can see the data
    content = page.content()          # full rendered HTML
    page.screenshot(path="page.png")  # visual snapshot
    browser.close()

Unlike requests or read_html, Playwright sees the page as a human would.
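Once Playwright hands you the rendered HTML, the Level 1 tools apply again. A sketch — the helper name `tables_from_rendered` is ours, and in practice `content` would come from `page.content()` in the snippet above (the inline string here is just an offline stand-in):

```python
from io import StringIO
import pandas as pd

def tables_from_rendered(html: str) -> list:
    """Parse every <table> out of already-rendered HTML."""
    return pd.read_html(StringIO(html))

# In practice: content = page.content() from the Playwright snippet,
# then tables_from_rendered(content). Offline demo:
demo = "<table><tr><th>Director</th></tr><tr><td>Elon Musk</td></tr></table>"
print(tables_from_rendered(demo)[0])
```

In other words, Playwright solves only the rendering problem; extraction is the same pandas pattern as Level 1.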

Playwright: Key Capabilities

Navigation & Interaction

  • Click buttons, fill forms, select dropdowns
  • Wait for elements to load
  • Handle login pages and cookies
  • Download files triggered by clicks

Data Extraction

  • Read rendered HTML after JavaScript executes
  • Take screenshots of pages or elements
  • Extract text, tables, links
  • Record and replay browser sessions

Install: pip install playwright && playwright install

Playwright MCP: AI-Controlled Browsing

The Playwright MCP server connects Playwright to Claude Code, letting the AI control the browser directly.

Add to ~/.claude/.mcp.json:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}

Once configured, Claude Code can navigate pages, click links, fill forms, take screenshots, and extract data — all through natural language instructions.

When to Use Each Tool

| Scenario | Tool | Why |
|---|---|---|
| Static HTML table | pd.read_html | Simplest, fastest |
| File download by URL pattern | requests / curl | Construct URLs, loop |
| API with JSON responses | requests | Parse structured JSON (requests is the de facto standard Python HTTP library) |
| JavaScript-rendered page | Playwright | Executes JS, sees the rendered page |
| Complex multi-step scraping | Playwright MCP | AI navigates and decides |
| Login-protected content | Playwright or MCP | Handles authentication flows |

As we’ll see in the case study below, we used the EDGAR API and direct downloads where possible, avoiding the need for browser automation.

Case Study: Collecting Financial Documents

Goal: Company Research Library

We want to collect a comprehensive set of financial documents for a single company — the kind of library an analyst would build before a valuation. In Module 11, we will feed this corpus to an AI system (RAG) to generate analysis — but first we need the raw materials.

Target company: Magnolia Oil & Gas (MGY) — a Houston-based small-cap E&P company operating in the Eagle Ford Shale.

What we need:

  • 10-K annual reports (MD&A, risk factors, financial statements)
  • Earnings call transcripts and press releases
  • News articles

Why Magnolia?

  • Public since 2018 — 7 years of filings on EDGAR, enough for trend analysis
  • Houston-headquartered — relevant to Rice students
  • Simple business model — onshore oil & gas E&P, easy to understand financially
  • Thin analyst coverage — free transcript sources are spotty, so we need multiple scraping techniques
  • In your group projects, you will pick your own company

Data Sources Overview

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'padding': 16, 'useMaxWidth': true}}}%%
flowchart LR
  EDGAR["<b>SEC EDGAR</b><br>10-Ks, 8-Ks"] --> CORPUS["<b>Document Library</b><br>58 documents"]
  MF["<b>Motley Fool</b><br>Transcripts"] --> CORPUS
  IR["<b>Company IR</b><br>Press releases"] --> CORPUS
  NEWS["<b>News Sites</b><br>Articles"] --> CORPUS

  style EDGAR fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style MF fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style IR fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style NEWS fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style CORPUS fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px

Step 1: Finding the 10-Ks

The EDGAR API

SEC EDGAR provides a free REST API that returns filing metadata as JSON. No API key required — just a User-Agent header.

import requests

url = "https://data.sec.gov/submissions/CIK0001698990.json"
headers = {"User-Agent": "Kerry Back keb7@rice.edu"}
data = requests.get(url, headers=headers).json()

for i, form in enumerate(data["filings"]["recent"]["form"]):
    if form == "10-K":
        date = data["filings"]["recent"]["filingDate"][i]
        doc = data["filings"]["recent"]["primaryDocument"][i]
        print(f"{date}: {doc}")

What We Got: 7 Years of 10-Ks

| Fiscal Year | Filing Date | Document |
|---|---|---|
| 2025 | 2026-02-12 | mgy-20251231.htm |
| 2024 | 2025-02-19 | mgy-20241231.htm |
| 2023 | 2024-02-15 | mgy-20231231.htm |
| 2022 | 2023-02-16 | mgy-20221231.htm |
| 2021 | 2022-02-17 | mgy-20211231.htm |
| 2020 | 2021-02-23 | mgy-20201231.htm |
| 2019 | 2020-02-26 | a10-k2019.htm |

All downloaded as HTML directly from EDGAR.
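The metadata maps to a download URL with a fixed pattern: CIK without leading zeros, accession number without dashes, then the primary document name. A sketch — `filing_url` is our helper name, and the accession number shown is the one from the Q3 2025 8-K press-release example used later in this module:

```python
def filing_url(cik: int, accession: str, doc: str) -> str:
    # Archives path: CIK without leading zeros, accession without dashes
    return (f"https://www.sec.gov/Archives/edgar/data/{cik}/"
            f"{accession.replace('-', '')}/{doc}")

url = filing_url(1698990, "0001698990-25-000028",
                 "ex991-3q2025earningsrele.htm")
print(url)
# To actually download, pass the same User-Agent header as before:
# html = requests.get(url, headers={"User-Agent": "Name email"}).text
```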

Step 2: Extracting Sections

Extracting 10-K Sections with Claude Code

The raw EDGAR HTML files are cluttered with CSS, XBRL tags, and formatting. We asked Claude Code to extract three sections from each 10-K:

“Read the file mgy-10k-2025.htm. Extract Item 7 (MD&A), Item 1A (Risk Factors), and the Financial Statements. Strip all XBRL tags and CSS. Convert HTML tables to markdown pipe format. Save each section as a separate .md file.”

Claude used BeautifulSoup to parse the HTML, find section boundaries, and produce clean markdown. The complete extraction scripts are in the magnolia/10k/ folder.
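The core of such a script is a handful of BeautifulSoup operations. A minimal sketch of the cleaning step — `clean_text` is our illustrative helper, not the exact script Claude wrote (that is in magnolia/10k/):

```python
from bs4 import BeautifulSoup

def clean_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["style", "script"]):   # drop CSS and scripts
        tag.decompose()
    # inline-XBRL tags look like <ix:nonnumeric>; keep their text
    for tag in soup.find_all(lambda t: t.name.startswith("ix:")):
        tag.unwrap()
    return soup.get_text(" ", strip=True)

print(clean_text("<div><style>p{}</style>"
                 "<ix:nonnumeric>Revenue</ix:nonnumeric> rose 8%</div>"))
```

Finding section boundaries ("Item 7", "Item 1A") on top of this is mostly string search over the cleaned text.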

21 Extracted Files

For each of the 7 fiscal years, we extracted:

MD&A

  • Management’s Discussion & Analysis (Item 7)
  • Revenue drivers, cost trends, liquidity
  • 33–46 KB per year

Risk Factors

  • Item 1A: all risk categories
  • Commodity prices, regulatory, operational
  • 70–83 KB per year

Financial Statements

  • Balance sheets, income statements, cash flows
  • Equity statements + all footnotes
  • 120–174 KB per year

Example: Extracted Financial Data (FY2025)

The output is clean markdown tables ready for AI analysis:

| ASSETS | 2025 | 2024 |
|---|---:|---:|
| Cash & equiv. | $266,785 | $260,049 |
| Current assets | 442,435 | 410,824 |
| TOTAL ASSETS | $2,903,092 | $2,820,835 |

| REVENUES | 2025 | 2024 |
|---|---:|---:|
| Oil revenues | $918,027 | $1,046,675 |
| Total revenues | 1,311,845 | 1,315,886 |
| NET INCOME | 337,279 | 397,330 |
(In thousands. Full statements in magnolia/10k/extracted/)

Step 3: Earnings Calls

Two Sources of Earnings Data

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'padding': 16, 'useMaxWidth': true}}}%%
flowchart TD
  EC["<b>Quarterly Earnings</b><br>27 quarters (Q2 2019 -- Q4 2025)"]
  EC --> T["<b>Full Transcripts</b><br>6 quarters<br>(Motley Fool)"]
  EC --> PR["<b>Press Releases</b><br>21 quarters<br>(EDGAR 8-K filings)"]

  style EC fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style T fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
  style PR fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px

  • Transcripts include management commentary and analyst Q&A
  • Press releases include financial tables, production data, and guidance
  • Together they cover every quarter from Q2 2019 onward

Transcript Sources: The Challenge

Full earnings call transcripts are hard to get for free:

| Source | Coverage | Access |
|---|---|---|
| Motley Fool | Spotty for small-caps | Free |
| Seeking Alpha | Complete | Paywalled |
| MarketScreener | Complete | Paywalled |
| GuruFocus | Complete | JS-rendered / paywalled |
| Company webcasts | Audio only | Registration wall |

Our approach: Grab what’s free from Motley Fool, fill gaps with EDGAR press releases — or transcribe the audio yourself.

Option: Transcribe Company IR Audio

If no free transcript exists, you can transcribe the audio webcast from the company’s IR page. (We did not use this for Magnolia.)

How it works

# OpenAI's Whisper API ($0.006/min)
from openai import OpenAI
client = OpenAI()
with open("earnings_call.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=f
    )
print(transcript.text)

Or run locally for free: pip install openai-whisper

What to know

  • Whisper API — $0.006/min, fast, accurate
  • Local Whisper — free, runs on your machine
  • AssemblyAI — adds speaker labels (who said what)
  • Audio quality varies; no speaker labels with basic Whisper
  • A 60-min call produces 8,000–12,000 words

Fetching Transcripts from Motley Fool

“Fetch the Magnolia Oil & Gas Q3 2025 earnings call transcript from Motley Fool (https://www.fool.com/earnings/call-transcripts/2025/10/30/magnolia-oil-and-gas-mgy-earnings-transcript/). Extract the full text and save it as magnolia/transcripts/mgy-q3-2025-earnings.md.”

Claude Code used WebFetch to download and clean the page, capturing prepared remarks, financial highlights, management guidance, and the full analyst Q&A.

Fetching Press Releases from EDGAR

For quarters without free transcripts, we downloaded Exhibit 99.1 from each earnings 8-K filing:

import requests

# Same EDGAR API --- filter for 8-K filings, then download
# Exhibit 99.1 (the earnings press release). It contains the
# financial tables, production data, and guidance.
url = ("https://www.sec.gov/Archives/edgar/data/1698990/"
       "000169899025000028/ex991-3q2025earningsrele.htm")
headers = {"User-Agent": "Kerry Back keb7@rice.edu"}
html = requests.get(url, headers=headers).text
with open("mgy-q3-2025-press-release.htm", "w") as f:
    f.write(html)

Example output: magnolia/transcripts/mgy-q4-2024-press-release.md

Example: Transcript Excerpt (Q3 2025)

Christopher Stavros, CEO: “We have a lot of both financial and operational flexibility, especially considering some of those deferrals that have snaked through the system in 2025.”

Key data points extracted:

  • Record production: 100,500 BOE/day (11% YoY growth)
  • Free cash flow: $134 million
  • Adjusted EBITDAX: $219 million
  • Dividend increased 15% to $0.15/share quarterly

Step 4: News Articles

Scraping Financial News

The final layer: news articles from financial media. Claude Code’s WebSearch and WebFetch tools handle this directly.

The prompt

“Search for recent news articles about Magnolia Oil & Gas (MGY). For each article, fetch the full text and save it as a markdown file in magnolia/news/ with the format YYYY-MM-DD_source_title.md.”

What we collected

  • Earnings coverage and analyst reactions
  • Analyst upgrades/downgrades and price targets
  • Dividend increases and buyback announcements
  • Production records and guidance updates
  • 10 articles from Benzinga, AlphaStreet, Zacks, and others

The Complete Corpus

What We Built

10-K Extractions

  • 7 fiscal years (2019–2025)
  • 21 files: MD&A + Risk Factors + Financials
  • ~900 KB of clean markdown

Earnings Data

  • 6 full transcripts
  • 21 press releases
  • 27 files covering every quarter

News Articles

  • 10 articles from financial media
  • Analyst upgrades, earnings coverage
  • Dividend and strategy news

All stored as clean, structured text. Browse the full corpus: Magnolia Documents (Dropbox)

Corpus File Structure

magnolia/
├── 10k/                        7 raw HTML + extracted/
│   └── extracted/              mda, risk-factors, financial-statements
│                               (7 × 3 = 21 markdown files)
├── transcripts/                6 earnings + 21 press releases (27 files)
└── news/                       10 articles from financial media

Key Techniques Summary

| Step | Tool / API | What It Does |
|---|---|---|
| Find filings | EDGAR Submissions API | Returns filing metadata as JSON |
| Download 10-Ks | curl + EDGAR URLs | Fetches HTML filings directly |
| Parse HTML | BeautifulSoup | Strips XBRL/CSS, extracts sections |
| Convert tables | Python script | HTML tables → markdown pipe format |
| Get transcripts | WebFetch + Motley Fool | Extracts full call transcripts |
| Get press releases | EDGAR 8-K Exhibit 99.1 | Quarterly financial data |
| Scrape news | WebSearch + WebFetch | Articles from financial media |

Lessons Learned

What worked well:

  • EDGAR API is reliable and free — no authentication needed
  • Constructing URLs from patterns (like the FINRA example) scales to many filings
  • pd.read_html and BeautifulSoup handle static HTML tables well
  • Motley Fool provides full transcripts for free (when they cover the company)

What was hard:

  • Transcript coverage for small-caps is spotty on free sources
  • Most transcript providers (Seeking Alpha, MarketScreener) are paywalled
  • Company webcasts use JS-based players (the same JavaScript problem we saw with Tesla/EDGAR)
  • EDGAR HTML is messy — XBRL tags and inline styles require serious parsing

Exercises

Exercise 1: Scrape a Wikipedia Table

Use pd.read_html to extract the S&P 500 company list from Wikipedia. Save it as an Excel file. How many sectors are represented? Which sector has the most companies?

Exercise 2: Build a Mini Corpus

Pick a public company of your choice. Using Claude Code:

  1. Ask Claude Code to find the company’s CIK on EDGAR (it can use https://www.sec.gov/files/company_tickers.json)
  2. Download the most recent 10-K using the EDGAR Submissions API
  3. Ask Claude to extract the MD&A section and save it as markdown
  4. Find and save 3 news articles about the company using WebSearch + WebFetch

Submit: the four markdown files and a brief note on what worked and what was hard.

Exercise 3: Adapt to Your Group Project Company

For the company your group is analyzing, collect:

  • The 3 most recent 10-K filings from EDGAR
  • At least 2 earnings call transcripts or press releases (from Motley Fool, EDGAR 8-K Exhibit 99.1, or transcribed from IR audio)
  • 5+ news articles

This corpus will be used in Module 11 (RAG) exercises. Store everything in a corpus/ folder in your project. The Magnolia extraction scripts in magnolia/10k/ can serve as a reference.

Summary

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '20px'}, 'flowchart': {'nodeSpacing': 50, 'rankSpacing': 70, 'padding': 14, 'useMaxWidth': true}}}%%
flowchart LR
  S1["<b>1. Identify<br>Sources</b>"] --> S2["<b>2. Download<br>Documents</b>"]
  S2 --> S3["<b>3. Parse &<br>Extract</b>"]
  S3 --> S4["<b>4. Collect<br>Transcripts</b>"]
  S4 --> S5["<b>5. Scrape<br>News</b>"]
  S5 --> S6["<b>6. Clean &<br>Structure</b>"]

  style S1 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S2 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S3 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S4 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S5 fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px
  style S6 fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:20px,padding:12px

Web scraping is fundamentally a data engineering problem: finding sources, downloading documents, parsing messy formats, and converting everything to clean, structured text that AI tools can work with.

What’s Next

In Module 11 (RAG), we will feed this Magnolia corpus to an AI retrieval system that can answer questions grounded in the actual documents — citing specific passages from 10-Ks, transcripts, and news articles.

The quality of RAG output depends entirely on the quality of the corpus. What you scrape determines what AI can find.