Now we can automate the download by looping over dates.
Automating FINRA Downloads
```python
import pandas as pd
from datetime import date, timedelta

frames = []
d = date(2024, 1, 1)
while d <= date(2025, 4, 30):
    url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{d:%Y%m%d}.csv"
    try:
        frames.append(pd.read_csv(url, sep="|"))
    except Exception:
        pass  # not a reporting date --- skip it
    d += timedelta(days=1)
```
Try every possible date and silently skip failures — quick and effective. Only FINRA settlement dates (roughly the 15th and last business day) will succeed.
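One detail the loop leaves implicit: the collected frames still need to be stacked into a single DataFrame. A minimal follow-on sketch (the tiny stand-in frames and column names here are invented for illustration; real FINRA files have many more columns):

```python
import pandas as pd

# Stand-in for the `frames` list built by the loop above (columns invented
# for illustration; real FINRA short-interest files differ).
frames = [
    pd.DataFrame({"symbolCode": ["MGY"], "currentShortPositionQuantity": [100]}),
    pd.DataFrame({"symbolCode": ["XOM"], "currentShortPositionQuantity": [250]}),
]

# Stack all biweekly files into one DataFrame
short_interest = pd.concat(frames, ignore_index=True)
print(short_interest.shape)  # (2, 2)
```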
Level 3: JavaScript-Rendered Pages
Many pages are dynamically generated with JavaScript. The data we want isn’t in the HTML source — it’s fetched by executing JavaScript code.
Example: Open a Tesla 10-K on EDGAR. The first table lists Tesla’s directors. But if you “View page source” and search for “Elon Musk” — 0 results.
The page content is loaded dynamically by JavaScript — the raw HTML is just a shell.
Playwright: Automating a Real Browser
Playwright is a library that controls a real browser (Chromium, Firefox, or WebKit, the engine behind Safari) from Python or JavaScript. It executes the JavaScript on the page, then lets you read the rendered HTML.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.sec.gov/ix?doc=/Archives/edgar/data/...")
    # Now the JavaScript has executed --- we can see the data
    content = page.content()          # full rendered HTML
    page.screenshot(path="page.png")  # visual snapshot
    browser.close()
```
Unlike requests or read_html, Playwright sees the page as a human would.
With the Playwright MCP server configured, Claude Code can navigate pages, click links, fill forms, take screenshots, and extract data — all through natural language instructions.
When to Use Each Tool
| Scenario | Tool | Why |
|---|---|---|
| Static HTML table | pd.read_html | Simplest, fastest |
| File download by URL pattern | requests / curl | Construct URLs, loop |
| API with JSON responses | requests | Parse structured data (requests is the de facto standard Python HTTP library) |
| JavaScript-rendered page | Playwright | Executes JS, sees rendered page |
| Complex multi-step scraping | Playwright MCP | AI navigates and decides |
| Login-protected content | Playwright or MCP | Handles authentication flows |
As we’ll see in the case study below, we used the EDGAR API and direct downloads where possible, avoiding the need for browser automation.
Before You Scrape: Legal and Ethical Considerations
Respect the Rules
Check robots.txt before scraping any site
SEC EDGAR enforces a 10 requests/second rate limit — exceed it and your IP gets blocked
Motley Fool and Seeking Alpha have terms of service restricting automated access
Fair use may protect limited use for research and education, but it depends on the specifics — always check the site’s Terms of Service
Best Practices
Use APIs when available — they are faster, more reliable, and explicitly permitted
Include a User-Agent header identifying yourself (required by EDGAR)
Add delays between requests to avoid overwhelming servers
Prefer official data sources over scraping when both exist
When in doubt: if the site provides an API, use it. If it doesn’t, check the ToS.
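These practices are easy to bundle into a small helper. A sketch (the function names, interval, and placeholder User-Agent are ours; adjust the interval to each site's limits):

```python
import time
import requests

HEADERS = {"User-Agent": "Your Name your.email@example.com"}  # identify yourself
MIN_INTERVAL = 0.2  # seconds between requests; stays well under EDGAR's 10/sec

def seconds_to_wait(last_ts, now, min_interval=MIN_INTERVAL):
    """How long to sleep so consecutive requests are at least min_interval apart."""
    return max(0.0, min_interval - (now - last_ts))

_last = 0.0

def polite_get(url):
    """GET with an identifying User-Agent and a minimum delay between calls."""
    global _last
    time.sleep(seconds_to_wait(_last, time.monotonic()))
    _last = time.monotonic()
    return requests.get(url, headers=HEADERS, timeout=30)
```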
Case Study: Collecting Financial Documents
Goal: Company Research Library
We want to collect a comprehensive set of financial documents for a single company — the kind of library an analyst would build before a valuation. In Module 11, we will feed this corpus to an AI system (RAG) to generate analysis — but first we need the raw materials.
Target company: Magnolia Oil & Gas (MGY) — a Houston-based small-cap E&P company operating in the Eagle Ford Shale.
Public since 2018 — 7 years of filings on EDGAR, enough for trend analysis
Houston-headquartered — relevant to Rice students
Simple business model — onshore oil & gas E&P, easy to understand financially
Thin analyst coverage — free transcript sources are spotty, so we need multiple scraping techniques
In your group projects, you will pick your own company.
Data Sources Overview
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px'}, 'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'padding': 16, 'useMaxWidth': true}}}%%
flowchart LR
EDGAR["<b>SEC EDGAR</b><br>10-Ks, 8-Ks"] --> CORPUS["<b>Document Library</b><br>58 documents"]
MF["<b>Motley Fool</b><br>Transcripts"] --> CORPUS
IR["<b>Company IR</b><br>Press releases"] --> CORPUS
NEWS["<b>News Sites</b><br>Articles"] --> CORPUS
style EDGAR fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
style MF fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
style IR fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
style NEWS fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
style CORPUS fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#0f172a,font-size:20px,padding:14px
```
Step 1: Finding the 10-Ks
The EDGAR API
SEC EDGAR provides a free REST API that returns filing metadata as JSON. No API key required — just a User-Agent header.
```python
import requests

url = "https://data.sec.gov/submissions/CIK0001698990.json"
headers = {"User-Agent": "Kerry Back keb7@rice.edu"}
data = requests.get(url, headers=headers).json()

for i, form in enumerate(data["filings"]["recent"]["form"]):
    if form == "10-K":
        date = data["filings"]["recent"]["filingDate"][i]
        doc = data["filings"]["recent"]["primaryDocument"][i]
        print(f"{date}: {doc}")
```
What We Got: 7 Years of 10-Ks
| Fiscal Year | Filing Date | Document |
|---|---|---|
| 2025 | 2026-02-12 | mgy-20251231.htm |
| 2024 | 2025-02-19 | mgy-20241231.htm |
| 2023 | 2024-02-15 | mgy-20231231.htm |
| 2022 | 2023-02-16 | mgy-20221231.htm |
| 2021 | 2022-02-17 | mgy-20211231.htm |
| 2020 | 2021-02-23 | mgy-20201231.htm |
| 2019 | 2020-02-26 | a10-k2019.htm |
All downloaded as HTML directly from EDGAR.
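The download URLs follow EDGAR's archive pattern: `https://www.sec.gov/Archives/edgar/data/{CIK}/{accession-number-without-dashes}/{primaryDocument}`, where the accession number and primary document come from the same Submissions API response. A sketch (the helper name is ours; the example accession number is hypothetical):

```python
def filing_url(accession_number, primary_document, cik=1698990):
    """Build the EDGAR archive URL for a filing's primary document.

    `accession_number` and `primary_document` come from the Submissions API
    (data["filings"]["recent"]["accessionNumber"] / ["primaryDocument"]);
    the archive path uses the accession number with its dashes removed.
    """
    acc = accession_number.replace("-", "")
    return (f"https://www.sec.gov/Archives/edgar/data/{cik}/{acc}/"
            f"{primary_document}")

# Hypothetical accession number, for illustration only:
print(filing_url("0001698990-25-000010", "mgy-20241231.htm"))
```

Each URL can then be fetched with requests (with the same User-Agent header) or curl and written to disk.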
Step 2: Extracting Sections
Extracting 10-K Sections with Claude Code
The raw EDGAR HTML files are cluttered with CSS, XBRL tags, and formatting. We asked Claude Code to extract three sections from each 10-K:
“Read the file mgy-10k-2025.htm. Extract Item 7 (MD&A), Item 1A (Risk Factors), and the Financial Statements. Strip all XBRL tags and CSS. Convert HTML tables to markdown pipe format. Save each section as a separate .md file.”
Claude used BeautifulSoup to parse the HTML, find section boundaries, and produce clean markdown. The complete extraction scripts are in the magnolia/10k/ folder.
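The table-conversion step can be sketched in a few lines (the function and the sample HTML are ours, not the actual extraction script):

```python
from bs4 import BeautifulSoup

def table_to_markdown(table):
    """Convert a BeautifulSoup <table> into a markdown pipe table."""
    rows = [[c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")]
    rows = [r for r in rows if r]  # drop empty rows
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

# Sample HTML (made up for illustration)
html = ("<table><tr><th>Year</th><th>Revenue</th></tr>"
        "<tr><td>2024</td><td>$1,260M</td></tr></table>")
table = BeautifulSoup(html, "html.parser").find("table")
print(table_to_markdown(table))
```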
21 Extracted Files
For each of the 7 fiscal years, we extracted:
- MD&A — Management’s Discussion & Analysis (Item 7): revenue drivers, cost trends, liquidity. 33–46 KB per year
- Risk Factors — Item 1A, all risk categories: commodity prices, regulatory, operational. 70–83 KB per year
- Financial Statements — balance sheets, income statements, cash flows, equity statements plus all footnotes. 120–174 KB per year
Example: Extracted Financial Data (FY2025)
The output is clean markdown tables ready for AI analysis.
Step 3: Earnings Transcripts and Press Releases
Transcripts include management commentary and analyst Q&A
Press releases include financial tables, production data, and guidance
Together they cover every quarter from Q2 2019 onward
Transcript Sources: The Challenge
Full earnings call transcripts are hard to get for free:
| Source | Coverage | Access |
|---|---|---|
| Motley Fool | Spotty for small-caps | Free |
| Seeking Alpha | Complete | Paywalled |
| MarketScreener | Complete | Paywalled |
| GuruFocus | Complete | JS-rendered / paywalled |
| Company webcasts | Audio only | Registration wall |
Our approach: Grab what’s free from Motley Fool, fill gaps with EDGAR press releases — or transcribe the audio yourself.
Option: Transcribe Company IR Audio
If no free transcript exists, you can transcribe the audio webcast from the company’s IR page. (We did not use this for Magnolia.)
How it works
```python
# OpenAI's Whisper API ($0.006/min)
from openai import OpenAI

client = OpenAI()
with open("earnings_call.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
```
Or run locally for free: pip install openai-whisper
What to know
Whisper API — $0.006/min, fast, accurate
Local Whisper — free, runs on your machine
AssemblyAI — adds speaker labels (who said what)
Audio quality varies; no speaker labels with basic Whisper
A 60-min call produces 8,000–12,000 words
Fetching Transcripts from Motley Fool
“Fetch the Magnolia Oil & Gas Q3 2025 earnings call transcript from Motley Fool (https://www.fool.com/earnings/call-transcripts/2025/10/30/magnolia-oil-and-gas-mgy-earnings-transcript/). Extract the full text and save it as magnolia/transcripts/mgy-q3-2025-earnings.md.”
Claude Code used WebFetch to download and clean the page, capturing prepared remarks, financial highlights, management guidance, and the full analyst Q&A.
Fetching Press Releases from EDGAR
For quarters without free transcripts, we downloaded Exhibit 99.1 from each earnings 8-K filing:
```python
# Same EDGAR API --- filter for 8-K filings
# Download Exhibit 99.1 (the earnings press release)
# Contains: financial tables, production data, guidance
url = ("https://www.sec.gov/Archives/edgar/data/1698990/"
       "000169899025000028/ex991-3q2025earningsrele.htm")
```
Example output: magnolia/transcripts/mgy-q4-2024-press-release.md
Example: Transcript Excerpt (Q3 2025)
Christopher Stavros, CEO: “We have a lot of both financial and operational flexibility, especially considering some of those deferrals that have snaked through the system in 2025.”
Key data points extracted:
Record production: 100,500 BOE/day (11% YoY growth)
Free cash flow: $134 million
Adjusted EBITDAX: $219 million
Dividend increased 15% to $0.15/share quarterly
Step 4: News Articles
Scraping Financial News
The final layer: news articles from financial media. Claude Code’s WebSearch and WebFetch tools handle this directly.
The prompt
“Search for recent news articles about Magnolia Oil & Gas (MGY). For each article, fetch the full text and save it as a markdown file in magnolia/news/ with the format YYYY-MM-DD_source_title.md.”
What we collected
Earnings coverage and analyst reactions
Analyst upgrades/downgrades and price targets
Dividend increases and buyback announcements
Production records and guidance updates
10 articles from Benzinga, AlphaStreet, Zacks, and others
```
magnolia/
├── 10k/           7 raw HTML + extracted/
│   └── extracted/ mda, risk-factors, financial-statements
│                  (7 × 3 = 21 markdown files)
├── transcripts/   6 earnings + 21 press releases (27 files)
└── news/          10 articles from financial media
```
Key Techniques Summary
| Step | Tool / API | What It Does |
|---|---|---|
| Find filings | EDGAR Submissions API | Returns filing metadata as JSON |
| Download 10-Ks | curl + EDGAR URLs | Fetches HTML filings directly |
| Parse HTML | BeautifulSoup | Strips XBRL/CSS, extracts sections |
| Convert tables | Python script | HTML tables → markdown pipe format |
| Get transcripts | WebFetch + Motley Fool | Extracts full call transcripts |
| Get press releases | EDGAR 8-K Exhibit 99.1 | Quarterly financial data |
| Scrape news | WebSearch + WebFetch | Articles from financial media |
Lessons Learned
What worked well:
EDGAR API is reliable and free — no authentication needed
Constructing URLs from patterns (like the FINRA example) scales to many filings
pd.read_html and BeautifulSoup handle static HTML tables well
Motley Fool provides full transcripts for free (when they cover the company)
What was hard:
Transcript coverage for small-caps is spotty on free sources
Most transcript providers (Seeking Alpha, MarketScreener) are paywalled
Company webcasts use JS-based players (the same JavaScript problem we saw with Tesla/EDGAR)
EDGAR HTML is messy — XBRL tags and inline styles require serious parsing
Exercises
Exercise 1: Scrape a Wikipedia Table
Use pd.read_html to extract the S&P 500 company list from Wikipedia. Save it as an Excel file. How many sectors are represented? Which sector has the most companies?
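A starter sketch, if you get stuck (the Wikipedia URL and the `GICS Sector` column name reflect the page at the time of writing, so verify both; the tiny sample frame just makes the sketch self-contained):

```python
import pandas as pd

# The live step (network): uncomment to fetch and save the real table.
# tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
# sp500 = tables[0]
# sp500.to_excel("sp500.xlsx", index=False)

def sector_counts(df, col="GICS Sector"):
    """Companies per sector, largest first."""
    return df[col].value_counts()

# Tiny stand-in so the sketch runs without network access
sample = pd.DataFrame({"GICS Sector": ["Energy", "Industrials", "Energy"]})
print(sector_counts(sample))
```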
Exercise 2: Build a Mini Corpus
Pick a public company of your choice. Using Claude Code:
Ask Claude Code to find the company’s CIK on EDGAR (it can use https://www.sec.gov/files/company_tickers.json)
Download the most recent 10-K using the EDGAR Submissions API
Ask Claude to extract the MD&A section and save it as markdown
Find and save 3 news articles about the company using WebSearch + WebFetch
Submit: the four markdown files and a brief note on what worked and what was hard.
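For step 1, the ticker-to-CIK lookup can be sketched like this (the helper name is ours; the JSON structure shown is what company_tickers.json returns):

```python
# import requests  # needed for the live lookup below

def cik_for_ticker(ticker, tickers):
    """Return the zero-padded 10-digit CIK for a ticker, or None.

    `tickers` is the dict loaded from company_tickers.json; its values look
    like {"cik_str": 1698990, "ticker": "MGY", "title": "..."}.
    """
    for entry in tickers.values():
        if entry["ticker"] == ticker.upper():
            return f"{entry['cik_str']:010d}"
    return None

# Live lookup (remember the User-Agent header EDGAR requires):
# headers = {"User-Agent": "Your Name your.email@example.com"}
# tickers = requests.get("https://www.sec.gov/files/company_tickers.json",
#                        headers=headers).json()
# print(cik_for_ticker("MGY", tickers))

# Offline check against a one-entry sample:
sample = {"0": {"cik_str": 1698990, "ticker": "MGY",
                "title": "Magnolia Oil & Gas Corp"}}
print(cik_for_ticker("mgy", sample))  # -> 0001698990
```

The zero-padded CIK is exactly what the Submissions API URL expects (CIK0001698990.json).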
Exercise 3: Adapt to Your Group Project Company
For the company your group is analyzing, collect:
The 3 most recent 10-K filings from EDGAR
At least 2 earnings call transcripts or press releases (from Motley Fool, EDGAR 8-K Exhibit 99.1, or transcribed from IR audio)
5+ news articles
This corpus will be used in Module 11 (RAG) exercises. Store everything in a corpus/ folder in your project. The Magnolia extraction scripts in magnolia/10k/ can serve as a reference.
Web scraping is fundamentally a data engineering problem: finding sources, downloading documents, parsing messy formats, and converting everything to clean, structured text that AI tools can work with.
What’s Next
In Module 11 (RAG), we will feed this Magnolia corpus to an AI retrieval system that can answer questions grounded in the actual documents — citing specific passages from 10-Ks, transcripts, and news articles.
The quality of RAG output depends entirely on the quality of the corpus. What you scrape determines what AI can find.