frontier-s1
A live extractor for AI / frontier-tech S-1 prospectuses on SEC EDGAR. The poller
watches for new S-1, S-1/A, F-1, F-1/A registration statements, downloads the
prospectus HTML, and rips out the bits a VC associate or fintwit operator actually
reads on day-one: revenue lines, gross margin, AI-related risk factors, named
competitors, lead underwriters and use of proceeds. Each filing is rendered as a
one-screen comparison card; up to four can be put side-by-side.
No mock data. Every number comes from the actual prospectus on EDGAR.
What it does
- Polls SEC EDGAR for new S-1 / F-1 filings every 15 minutes.
- Hourly per-CIK pull from data.sec.gov/submissions/CIK*.json for a curated frontier-tech watchlist (SpaceX, Anthropic, OpenAI, xAI, Cerebras, Groq, Arm, Snowflake, Palantir, Reddit, Rubrik, Coinbase, Klaviyo, Astera Labs, Rocket Lab, and others).
- Downloads each filing's primary prospectus HTML.
- Runs six regex / DOM-based extractors over the HTML:
  - Income-statement lines (Total revenue, Gross profit, Net loss, Adjusted EBITDA, etc.).
  - AI risk-factor excerpts (filtered by a frontier keyword list).
  - Named competitors (from the Competition subsection).
  - Lead underwriters (from cover page / Underwriting section).
  - Use of proceeds (first ~1,200 chars of the section).
  - Proposed ticker + exchange.
- Renders a SPA feed with category filters, a filing-detail view, a four-up compare view, and a 1200×630 share-card view per filing.
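
To make the extractor idea concrete, here is a minimal sketch of a regex-based income-statement extractor. It is an illustration under stated assumptions, not the code in extractors/revenue.js: the names parseMoney and extractIncomeLines are hypothetical, and it assumes the table has already been flattened to one text line per row.

```javascript
// Hypothetical helper: turn a prospectus-style money token into a number.
// "(1,234)" is the usual accounting notation for a negative value.
function parseMoney(token) {
  const trimmed = token.trim();
  const negative = /^\(.*\)$/.test(trimmed);
  const digits = trimmed.replace(/[^0-9.]/g, '');
  if (digits === '' || digits === '.') return null;
  const value = Number(digits);
  return negative ? -value : value;
}

// Labels taken from the list above; a real statement has many more variants.
const INCOME_LABELS = ['Total revenue', 'Gross profit', 'Net loss', 'Adjusted EBITDA'];

// Hypothetical extractor: for each label, return the numeric columns found
// on the first row that starts with that label. Assumes one row per line.
function extractIncomeLines(text) {
  const found = {};
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    for (const label of INCOME_LABELS) {
      if (found[label]) continue; // first matching row wins
      if (!line.toLowerCase().startsWith(label.toLowerCase())) continue;
      const rest = line.slice(label.length);
      // Match plain or parenthesised numbers such as 1,234 or (567).
      const tokens = rest.match(/\(?\d[\d,]*(?:\.\d+)?\)?/g) || [];
      const values = tokens.map(parseMoney).filter((v) => v !== null);
      if (values.length > 0) found[label] = values;
    }
  }
  return found;
}
```

A sketch like this would misread footnote markers such as "(1)" as negative numbers, which is one reason a real extractor needs more context than a single regex.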
Data sources (all public, no API keys)
| Source | URL pattern | Refresh interval |
|---|---|---|
| EDGAR full-text search | https://efts.sec.gov/LATEST/search-index?q=&dateRange=custom&startdt=…&enddt=…&forms=S-1,S-1/A,F-1,F-1/A | every 15 min |
| EDGAR per-company submissions | https://data.sec.gov/submissions/CIK{cik10}.json | every 60 min (re-checked on 15-min ticks) |
| EDGAR filing index | https://www.sec.gov/Archives/edgar/data/{cik}/{accNoDash}/index.json | on demand, cached forever |
| EDGAR primary doc HTML | https://www.sec.gov/Archives/edgar/data/{cik}/{accNoDash}/{primary_doc} | on demand, cached on disk |
| EDGAR company-tickers map | https://www.sec.gov/files/company_tickers.json | every 24 h (and at first boot) |
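
The URL patterns in the table can be assembled with a few small helpers. This is a sketch, assuming that {cik10} means the CIK zero-padded to 10 digits, that {cik} in the Archives paths is the unpadded integer, and that {accNoDash} is the accession number with its dashes removed. The function names are hypothetical.

```javascript
// Pad a CIK to the 10-digit form used in the submissions URL (assumed meaning of {cik10}).
function cik10(cik) {
  return String(Number(cik)).padStart(10, '0');
}

// "0001234567-24-000123" -> "000123456724000123" (assumed meaning of {accNoDash}).
function accNoDash(accession) {
  return accession.replace(/-/g, '');
}

function submissionsUrl(cik) {
  return `https://data.sec.gov/submissions/CIK${cik10(cik)}.json`;
}

function filingIndexUrl(cik, accession) {
  return `https://www.sec.gov/Archives/edgar/data/${Number(cik)}/${accNoDash(accession)}/index.json`;
}

function primaryDocUrl(cik, accession, primaryDoc) {
  return `https://www.sec.gov/Archives/edgar/data/${Number(cik)}/${accNoDash(accession)}/${primaryDoc}`;
}
```

The CIK and accession number used to exercise these helpers should be treated as made-up values, not a real filing.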
All requests carry User-Agent: frontier-s1 ([email protected]) per SEC fair-access
policy, and the in-process limiter caps SEC traffic at 8 req/s. On three consecutive
poll failures /health returns degraded:true.
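
One way to get the two behaviours described above (an 8 req/s cap and a degraded flag after three consecutive failures) is sketched below. This is not the code in scrapers/edgar-client.js: createLimiter, createPollHealth and secFetch are hypothetical names, the fixed-gap scheduling is one possible technique, and the global fetch assumes Node 18 or later.

```javascript
// Hypothetical limiter: spaces request starts at least 1000 / maxPerSecond ms
// apart. `now` is injectable so the schedule can be checked without waiting.
function createLimiter(maxPerSecond, now = Date.now) {
  const minGapMs = 1000 / maxPerSecond;
  let nextFreeAt = 0;
  // Reserves the next slot and returns how many ms the caller should wait.
  return function reserve() {
    const t = now();
    const startAt = Math.max(t, nextFreeAt);
    nextFreeAt = startAt + minGapMs;
    return startAt - t;
  };
}

// Hypothetical health tracker: degraded once `threshold` polls fail in a row,
// cleared again by the next successful poll.
function createPollHealth(threshold = 3) {
  let consecutiveFailures = 0;
  return {
    record(ok) {
      consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
    },
    get degraded() {
      return consecutiveFailures >= threshold;
    },
  };
}

const secReserve = createLimiter(8);

// Hypothetical fetch wrapper: waits for a slot, then sends the request with
// the caller-supplied User-Agent string.
async function secFetch(url, userAgent) {
  const waitMs = secReserve();
  if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  return fetch(url, { headers: { 'User-Agent': userAgent } });
}
```

Reserving the slot before sleeping keeps concurrent callers from all waking at the same instant, which is the main reason to track nextFreeAt rather than the time of the last request.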
HTTP endpoints
All public, no auth. Everything is mounted under /frontier-s1.
| Method | Path | Returns |
|---|---|---|
| GET | /frontier-s1/health | {ok, filings, frontierFilings, lastPollAt, lastPollStatus, degraded} |
| GET | /frontier-s1/api/filings?limit=&category=&since= | List of frontier filings (newest first) |
| GET | /frontier-s1/api/filings/:accession | Full extracted detail for one filing |
| GET | /frontier-s1/api/feed | Last 20 frontier filings (lightweight) |
| GET | /frontier-s1/api/compare?ids=acc1,acc2,… | Up to 4 filings, side-by-side payload |
| GET | /frontier-s1/api/watchlist | Curated watchlist + last-seen-filing per company |
| GET | /frontier-s1/api/status | Poll log (last 50 ticks), per-source health |
| POST | /frontier-s1/api/refresh | Trigger immediate poll cycle (idempotent) |
| GET | /frontier-s1/ | SPA feed |
| GET | /frontier-s1/filing/:accession | SPA filing detail |
| GET | /frontier-s1/compare?ids=… | SPA compare view |
| GET | /frontier-s1/card/:accession | Standalone 1200×630 share-card HTML |
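
A client for the compare endpoint might build its URL as in the sketch below. The helper name compareUrl and the default origin are assumptions (the origin matches the port used in the Local run section), and the client-side check simply mirrors the four-filing limit in the table; how the server treats a longer list is not specified here.

```javascript
const BASE_PATH = '/frontier-s1';

// Hypothetical helper: build the compare URL for one to four accession ids.
function compareUrl(accessions, origin = 'http://localhost:4836') {
  if (!Array.isArray(accessions) || accessions.length === 0 || accessions.length > 4) {
    throw new RangeError('compare takes between 1 and 4 accession ids');
  }
  const url = new URL(`${BASE_PATH}/api/compare`, origin);
  // URLSearchParams percent-encodes the commas; the server sees "a,b" after decoding.
  url.searchParams.set('ids', accessions.join(','));
  return url.toString();
}

// Usage (Node 18+ for global fetch), with placeholder accession ids:
//   const res = await fetch(compareUrl(['acc1', 'acc2']));
//   const payload = await res.json();
```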
Frontier-tag rule
A filing is tagged is_frontier=1 if any of the following hold:
- Its CIK is on the curated watchlist (scrapers/watchlist.js).
- Its SIC code is in the curated frontier set: 7372, 7370, 3674, 3812, 3669, 3845, 8731.
- Its prospectus body mentions ≥3 of these keywords: artificial intelligence, foundation model, large language model, machine learning, spacecraft, launch vehicle, satellite constellation, semiconductor, accelerator, inference, autonomous, generative ai, neural network.
Non-frontier filings are still stored (with is_frontier=0) but excluded from the UI feed.
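
The rule above can be sketched as a small pure function. This is an illustration, not the code in extractors/classifier.js: the function names and the shape of the filing object are hypothetical, and it reads "mentions ≥3 of these keywords" as three distinct keywords matched case-insensitively as substrings, which is an assumption.

```javascript
const FRONTIER_SICS = new Set(['7372', '7370', '3674', '3812', '3669', '3845', '8731']);

const FRONTIER_KEYWORDS = [
  'artificial intelligence', 'foundation model', 'large language model',
  'machine learning', 'spacecraft', 'launch vehicle', 'satellite constellation',
  'semiconductor', 'accelerator', 'inference', 'autonomous', 'generative ai',
  'neural network',
];

// Count how many distinct keywords appear in the body (case-insensitive).
function countKeywordHits(body) {
  const text = body.toLowerCase();
  return FRONTIER_KEYWORDS.filter((kw) => text.includes(kw)).length;
}

// Returns 1 or 0 to match the is_frontier column values.
// `watchlistCiks` is assumed to be a Set of CIK strings.
function isFrontier(filing, watchlistCiks) {
  if (watchlistCiks.has(String(filing.cik))) return 1;
  if (FRONTIER_SICS.has(String(filing.sic))) return 1;
  return countKeywordHits(filing.body || '') >= 3 ? 1 : 0;
}
```

Checking the watchlist and SIC code first means the keyword scan, the only step that needs the prospectus body, runs only when the two cheaper checks fail.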
Local run
npm install
PORT=4836 node server.js
Then open http://localhost:4836/frontier-s1/.
- First boot kicks off a bootstrap poll (~30 s) that hits both the full-text search and the per-CIK submissions for each watchlist entry, downloads the primary doc for each frontier S-1/F-1, and runs all six extractors.
- The cron tick repeats every 15 minutes; an hourly tick re-runs to pick up filings that took longer to fetch.
- Hit the ↻ Refresh button in the UI (or POST /frontier-s1/api/refresh) to force an immediate poll.
- The SQLite DB lives at data/frontier-s1.db; raw prospectus HTML is cached under data/html/.
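
Because a manual refresh can arrive while a scheduled tick is still running, one possible way to keep the two from overlapping is a single-flight guard, sketched below. Whether jobs/poll.js works this way is an assumption; singleFlight is a hypothetical helper.

```javascript
// Hypothetical guard: while one run of `task` is in flight, further calls
// return the same promise instead of starting a second run.
function singleFlight(task) {
  let inFlight = null;
  return function run() {
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(task)
        .finally(() => {
          inFlight = null; // allow the next run once this one settles
        });
    }
    return inFlight;
  };
}

// Usage sketch, with pollOnce standing in for the real poll function:
//   const refresh = singleFlight(pollOnce);
//   setInterval(refresh, 15 * 60 * 1000);   // scheduled tick
//   app.post('/frontier-s1/api/refresh', async (req, res) => {
//     await refresh();
//     res.json({ ok: true });
//   });
```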
Scope cuts (intentional)
This is the 4-hour MVP. It does not do:
- PDF parsing (HTML-only; SEC EDGAR exposes the iXBRL HTML primary doc).
- XBRL fact parsing (regex over HTML body is the MVP).
- LLM summarisation of the prospectus — extractors are deterministic regex / cheerio so the output is cheap and lawyer-safe.
- Financial modelling, DCF, projections, ratios.
- Headless-browser screenshot capture of the share card. The route serves an HTML shell; users screenshot themselves.
- Auth, login, accounts, saved searches, email alerts, RSS, webhook fan-out.
- Pitchbook / Crunchbase / Bloomberg / paywalled data.
File layout
frontier-s1/
  server.js                   Express bootstrap, cron, mounts /frontier-s1
  db.js                       better-sqlite3 schema + prepared statements
  config.js                   base path, port, watchlist + keyword sets
  scrapers/
    edgar-client.js           rate-limited fetch wrapper with SEC User-Agent
    fulltext-search.js        EDGAR full-text search → recent S-1 / F-1 hits
    company-submissions.js    per-CIK submissions JSON
    filing-index.js           resolve accession → primary doc filename
    primary-doc.js            download + cache prospectus HTML
    company-tickers.js        CIK ↔ ticker ↔ name map (24-h refresh)
    watchlist.js              curated CIK list + categories
  extractors/
    html-utils.js             cheerio helpers (section finder, money parser)
    classifier.js             frontier-tag decision + AI-mention counter
    revenue.js                income-statement table regex
    risk-factors.js           Risk-Factors split + keyword filter
    competitors.js            Competition section + proper-noun phrases
    underwriters.js           Cover/Underwriting bookrunner extraction
    use-of-proceeds.js        Use of Proceeds section extraction
    ticker-exchange.js        Proposed ticker symbol + exchange
  jobs/
    poll.js                   cron tick: search → submissions → fetch → extract
    extract-queue.js          in-process extract queue
  routes/
    api.js                    JSON endpoints
    pages.js                  SPA shell routes
  public/
    index.html                SPA shell
    card.html                 share-card shell
    app.js                    vanilla SPA: feed / detail / compare / status
    card.js                   share-card renderer
    style.css                 dark theme
    favicon.svg
Honest notes
- The risk-factor extractor relies on bold-shaped sub-headers in the Risk Factors section. When that structure isn't recognised (some issuers use plain paragraph styles), the extractor falls back to scanning the post-anchor body text for frontier-keyword sentences. The fallback is best-effort and can surface non-header sentences — quality over silence.
- The competitor extractor mixes a frontier-name allowlist boost with regex on the Competition subsection. False positives like "U.S" or "FAA" sometimes leak through.
- Ticker extraction is regex-only. If a cover page uses unusual wording, the field will be null (and the UI will say "awaiting") rather than guess.
- We do not fabricate rows. If EDGAR is unreachable on first boot the feed is empty and the UI says so.
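
Two of the behaviours in these notes can be illustrated with a short sketch: the keyword-sentence fallback for risk factors, and a ticker regex that returns null when nothing matches. Both functions are hypothetical stand-ins, not the code in extractors/: the sentence splitter is deliberately naive, the keyword list is a subset chosen for the example, and the "under the symbol" wording is only one phrasing a cover page might use.

```javascript
// Subset of the frontier keyword list, for illustration only.
const RISK_KEYWORDS = ['artificial intelligence', 'large language model', 'machine learning', 'generative ai'];

// Fallback sketch: split body text into sentences and keep those that
// mention at least one keyword. Abbreviations such as "U.S." will be
// split incorrectly by this naive rule.
function fallbackRiskExcerpts(bodyText, keywords = RISK_KEYWORDS, limit = 5) {
  const sentences = bodyText.replace(/\s+/g, ' ').trim().split(/(?<=[.!?])\s+/);
  return sentences
    .filter((s) => keywords.some((kw) => s.toLowerCase().includes(kw)))
    .slice(0, limit);
}

// Ticker sketch: match one common phrasing and return null otherwise,
// so the caller never has to guess.
function extractTicker(coverText) {
  const match = coverText.match(/under the (?:ticker )?symbol\s+["“]([A-Z]{1,5})["”]/);
  return match ? match[1] : null;
}
```

Returning null rather than a best guess keeps the "awaiting" state in the UI truthful: a missing ticker is visible as missing.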