frontier-s1

A live extractor for AI / frontier-tech S-1 prospectuses on SEC EDGAR. The poller
watches for new S-1, S-1/A, F-1, and F-1/A registration statements, downloads the
prospectus HTML, and rips out the bits a VC associate or fintwit operator actually
reads on day one: revenue lines, gross margin, AI-related risk factors, named
competitors, lead underwriters, and use of proceeds. Each filing is rendered as a
one-screen comparison card; up to four can be placed side by side.

No mock data. Every number comes from the actual prospectus on EDGAR.

What it does

Polls SEC EDGAR for new S-1 / F-1 filings every 15 minutes.
Hourly per-CIK pull from data.sec.gov/submissions/CIK*.json for a curated frontier-tech watchlist (SpaceX, Anthropic, OpenAI, xAI, Cerebras, Groq, Arm, Snowflake, Palantir, Reddit, Rubrik, Coinbase, Klaviyo, Astera Labs, Rocket Lab, and others).
Downloads each filing's primary prospectus HTML.
Runs six regex / DOM-based extractors over the HTML:
  Income-statement lines (Total revenue, Gross profit, Net loss, Adjusted EBITDA, etc.)
  AI risk-factor excerpts (filtered by a frontier keyword list).
  Named competitors (from the Competition subsection).
  Lead underwriters (from cover page / Underwriting section).
  Use of proceeds (first ~1,200 chars of the section).
  Proposed ticker + exchange.
Renders a SPA feed with category filters, a filing-detail view, a four-up compare view, and a 1200×630 share-card view per filing.
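
The income-statement extractor depends on turning table cells into numbers. The sketch below shows one way a money parser could handle the usual prospectus conventions (thousands separators, parentheses for negatives, dashes for empty cells). The name `parseMoney` and its exact rules are assumptions for illustration, not the contents of html-utils.js.

```javascript
// Hypothetical money parser; the real helper in html-utils.js may differ.
function parseMoney(cell) {
  // Drop whitespace (including non-breaking spaces) and dollar signs.
  const s = String(cell).replace(/[\s$]/g, '');
  // Empty cells and dash placeholders carry no value.
  if (s === '' || /^[\u2014\u2013-]+$/.test(s)) return null;
  // Accounting style: a leading "(" marks a negative. The closing ")" is
  // sometimes in a neighbouring cell, so only the opening one is required.
  const negative = s.startsWith('(') || s.startsWith('-');
  const digits = s.replace(/[^0-9.]/g, '');
  if (digits === '' || digits === '.') return null;
  const value = Number(digits);
  if (Number.isNaN(value)) return null; // e.g. "1.2.3"
  return negative ? -value : value;
}
```

Returning null for anything unparseable keeps the card honest: a missing number is shown as missing, not as zero.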

Data sources (all public, no API keys)

| Source | URL pattern | Refresh interval |
|---|---|---|
| EDGAR full-text search | https://efts.sec.gov/LATEST/search-index?q=&dateRange=custom&startdt=…&enddt=…&forms=S-1,S-1/A,F-1,F-1/A | every 15 min |
| EDGAR per-company submissions | https://data.sec.gov/submissions/CIK{cik10}.json | every 60 min (re-checked on 15-min ticks) |
| EDGAR filing index | https://www.sec.gov/Archives/edgar/data/{cik}/{accNoDash}/index.json | on demand, cached forever |
| EDGAR primary doc HTML | https://www.sec.gov/Archives/edgar/data/{cik}/{accNoDash}/{primary_doc} | on demand, cached on disk |
| EDGAR company-tickers map | https://www.sec.gov/files/company_tickers.json | every 24 h (and at first boot) |
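
The `{cik10}` and `{accNoDash}` placeholders in the table can be filled in roughly as follows. The function names are hypothetical; only the URL patterns come from the table. The sketch assumes the Archives paths take the CIK without leading zeros, while data.sec.gov takes it zero-padded to 10 digits.

```javascript
// Hypothetical helpers; only the URL patterns are taken from the table above.
function padCik(cik) {
  // "cik10": the CIK zero-padded to 10 digits.
  return String(Number(cik)).padStart(10, '0');
}

function edgarUrls(cik, accession, primaryDoc) {
  const cikPlain = String(Number(cik));          // assumption: Archives paths use the unpadded CIK
  const accNoDash = accession.replace(/-/g, ''); // strip the dashes from the accession number
  const base = `https://www.sec.gov/Archives/edgar/data/${cikPlain}/${accNoDash}`;
  return {
    submissions: `https://data.sec.gov/submissions/CIK${padCik(cik)}.json`,
    filingIndex: `${base}/index.json`,
    primaryDoc: primaryDoc ? `${base}/${primaryDoc}` : null,
  };
}

// Usage with made-up identifiers (not a real filing):
const urls = edgarUrls('0001234567', '0001234567-24-000001', 'd1.htm');
console.log(urls.filingIndex);
```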

All requests carry User-Agent: frontier-s1 ([email protected]) per SEC fair-access
policy, and the in-process limiter caps SEC traffic at 8 req/s. After three consecutive
poll failures, /health returns degraded:true.
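
A minimal sketch of how the 8 req/s cap and the three-failure degraded flag could be implemented. This is not the code in edgar-client.js; `createLimiter`, `createHealthTracker` and `secFetch` are hypothetical names, and the limiter simply spaces request starts 125 ms apart (1000 / 8). The global `fetch` assumes Node 18 or later.

```javascript
// Hypothetical limiter: hands out start slots at least 1000/maxPerSecond ms apart.
function createLimiter(maxPerSecond, now = Date.now) {
  const minGapMs = 1000 / maxPerSecond;
  let nextSlot = 0;
  // Returns how many ms the caller should wait before sending its request.
  return function reserve() {
    const t = now();
    const startAt = Math.max(t, nextSlot);
    nextSlot = startAt + minGapMs;
    return startAt - t;
  };
}

// Hypothetical tracker: degraded once `threshold` polls in a row have failed.
function createHealthTracker(threshold = 3) {
  let consecutiveFailures = 0;
  return {
    record(ok) { consecutiveFailures = ok ? 0 : consecutiveFailures + 1; },
    get degraded() { return consecutiveFailures >= threshold; },
  };
}

// Every SEC request waits for its slot and carries the User-Agent header.
async function secFetch(url, reserve, userAgent) {
  const waitMs = reserve();
  if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  return fetch(url, { headers: { 'User-Agent': userAgent } });
}
```

A single successful poll resets the failure counter, so the degraded flag clears as soon as EDGAR is reachable again.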

HTTP endpoints

All public, no auth. Everything is mounted under /frontier-s1.

| Method | Path | Returns |
|---|---|---|
| GET | /frontier-s1/health | {ok, filings, frontierFilings, lastPollAt, lastPollStatus, degraded} |
| GET | /frontier-s1/api/filings?limit=&category=&since= | List of frontier filings (newest first) |
| GET | /frontier-s1/api/filings/:accession | Full extracted detail for one filing |
| GET | /frontier-s1/api/feed | Last 20 frontier filings (lightweight) |
| GET | /frontier-s1/api/compare?ids=acc1,acc2,… | Up to 4 filings, side-by-side payload |
| GET | /frontier-s1/api/watchlist | Curated watchlist + last-seen-filing per company |
| GET | /frontier-s1/api/status | Poll log (last 50 ticks), per-source health |
| POST | /frontier-s1/api/refresh | Trigger immediate poll cycle (idempotent) |
| GET | /frontier-s1/ | SPA feed |
| GET | /frontier-s1/filing/:accession | SPA filing detail |
| GET | /frontier-s1/compare?ids=… | SPA compare view |
| GET | /frontier-s1/card/:accession | Standalone 1200×630 share-card HTML |
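
For scripting against the JSON endpoints, URL construction might look like the sketch below. The helper names are hypothetical, the category value in the usage line is a placeholder, and rejecting more than four ids on the client side is an assumption; the table only says the compare payload holds up to four filings.

```javascript
const BASE = '/frontier-s1';

// Builds /api/filings with only the query parameters that were supplied.
function filingsUrl({ limit, category, since } = {}) {
  const qs = new URLSearchParams();
  if (limit != null) qs.set('limit', String(limit));
  if (category) qs.set('category', category);
  if (since) qs.set('since', since); // format of `since` is not specified here; passed through as-is
  const query = qs.toString();
  return `${BASE}/api/filings${query ? `?${query}` : ''}`;
}

// Builds /api/compare?ids=acc1,acc2,... for one to four accession numbers.
function compareUrl(accessions) {
  if (accessions.length === 0 || accessions.length > 4) {
    throw new RangeError('compare takes between 1 and 4 accession numbers');
  }
  const ids = accessions.map((acc) => encodeURIComponent(acc)).join(',');
  return `${BASE}/api/compare?ids=${ids}`;
}

console.log(filingsUrl({ limit: 10, category: 'ai' })); // 'ai' is a placeholder category
```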

Frontier-tag rule

A filing is tagged is_frontier=1 if any of the following hold:

Its CIK is on the curated watchlist (scrapers/watchlist.js).
Its SIC code is in the curated frontier set: 7372, 7370, 3674, 3812, 3669, 3845, 8731.
Its prospectus body mentions ≥3 of these keywords: artificial intelligence, foundation model, large language model, machine learning, spacecraft, launch vehicle, satellite constellation, semiconductor, accelerator, inference, autonomous, generative ai, neural network.

Non-frontier filings are still stored (with is_frontier=0) but excluded from the UI feed.
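
The three-way rule can be sketched as a single predicate. This is an illustration, not the code in classifier.js: the function names are hypothetical, and it assumes "≥3 of these keywords" means three distinct keywords matched case-insensitively as substrings, not three total mentions.

```javascript
const FRONTIER_SIC = new Set(['7372', '7370', '3674', '3812', '3669', '3845', '8731']);
const FRONTIER_KEYWORDS = [
  'artificial intelligence', 'foundation model', 'large language model',
  'machine learning', 'spacecraft', 'launch vehicle', 'satellite constellation',
  'semiconductor', 'accelerator', 'inference', 'autonomous', 'generative ai',
  'neural network',
];

// Counts how many distinct keywords appear at least once in the body text.
function countKeywordHits(bodyText) {
  const text = bodyText.toLowerCase();
  return FRONTIER_KEYWORDS.filter((kw) => text.includes(kw)).length;
}

// A filing is frontier if any one of the three conditions holds.
function isFrontier({ cik, sic, bodyText }, watchlistCiks) {
  if (watchlistCiks.has(String(cik))) return true;
  if (FRONTIER_SIC.has(String(sic))) return true;
  return countKeywordHits(bodyText || '') >= 3;
}
```

Checking the watchlist and SIC code first means the keyword scan only runs when the cheap metadata checks fail.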

Local run

npm install
PORT=4836 node server.js

Then open http://localhost:4836/frontier-s1/.

First boot kicks off a bootstrap poll (~30 s) that hits both the full-text search and the per-CIK submissions for each watchlist entry, downloads the primary doc for each frontier S-1/F-1, and runs all six extractors.
The cron tick repeats every 15 minutes; an hourly tick re-runs the poll to pick up filings that took longer to fetch.
Hit the ↻ Refresh button in the UI (or POST /frontier-s1/api/refresh) to force an immediate poll.
The SQLite DB lives at data/frontier-s1.db; raw prospectus HTML is cached under data/html/.
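
To force a poll from a script instead of the UI button, something like the following should work against a local instance. It is a sketch: `refreshAndCheck` is a hypothetical name, the body of the refresh response is not relied on (its shape is not documented here), and the health fields read are the ones listed in the endpoint table. The default `fetch` assumes Node 18 or later.

```javascript
// Triggers an immediate poll, then reads /health to see the outcome.
// `fetchImpl` is injectable so the function can be exercised without a server.
async function refreshAndCheck(baseUrl, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/refresh`, { method: 'POST' });
  if (!res.ok) throw new Error(`refresh failed: HTTP ${res.status}`);
  const healthRes = await fetchImpl(`${baseUrl}/health`);
  const health = await healthRes.json();
  return {
    degraded: Boolean(health.degraded),
    lastPollAt: health.lastPollAt,
    filings: health.filings,
  };
}

// Usage against the local run described above:
// refreshAndCheck('http://localhost:4836/frontier-s1').then(console.log);
```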

Scope cuts (intentional)

This is the 4-hour MVP. It does not do:

PDF parsing (HTML-only; SEC EDGAR exposes the iXBRL HTML primary doc).
XBRL fact parsing (regex over HTML body is the MVP).
LLM summarisation of the prospectus — extractors are deterministic regex / cheerio so the output is cheap and lawyer-safe.
Financial modelling, DCF, projections, ratios.
Headless-browser screenshot capture of the share card. The route serves an HTML shell; users screenshot themselves.
Auth, login, accounts, saved searches, email alerts, RSS, webhook fan-out.
Pitchbook / Crunchbase / Bloomberg / paywalled data.

File layout

frontier-s1/
  server.js              Express bootstrap, cron, mounts /frontier-s1
  db.js                  better-sqlite3 schema + prepared statements
  config.js              base path, port, watchlist + keyword sets
  scrapers/
    edgar-client.js      rate-limited fetch wrapper with SEC User-Agent
    fulltext-search.js   EDGAR full-text search → recent S-1 / F-1 hits
    company-submissions.js  per-CIK submissions JSON
    filing-index.js      resolve accession → primary doc filename
    primary-doc.js       download + cache prospectus HTML
    company-tickers.js   CIK ↔ ticker ↔ name map (24-h refresh)
    watchlist.js         curated CIK list + categories
  extractors/
    html-utils.js        cheerio helpers (section finder, money parser)
    classifier.js        frontier-tag decision + AI-mention counter
    revenue.js           income-statement table regex
    risk-factors.js      Risk-Factors split + keyword filter
    competitors.js       Competition section + proper-noun phrases
    underwriters.js      Cover/Underwriting bookrunner extraction
    use-of-proceeds.js   Use of Proceeds section extraction
    ticker-exchange.js   Proposed ticker symbol + exchange
  jobs/
    poll.js              cron tick: search → submissions → fetch → extract
    extract-queue.js     in-process extract queue
  routes/
    api.js               JSON endpoints
    pages.js             SPA shell routes
  public/
    index.html           SPA shell
    card.html            share-card shell
    app.js               vanilla SPA: feed / detail / compare / status
    card.js              share-card renderer
    style.css            dark theme
    favicon.svg

Honest notes

The risk-factor extractor relies on bold-styled sub-headers in the Risk Factors section. When that structure isn't recognised (some issuers use plain paragraph styles), the extractor falls back to scanning the post-anchor body text for frontier-keyword sentences. The fallback is best-effort and can surface non-header sentences; a noisy excerpt is preferred over an empty field.
The competitor extractor mixes a frontier-name allowlist boost with regex on the Competition subsection. False positives like "U.S" or "FAA" sometimes leak through.
Ticker extraction is regex-only. If a cover page uses unusual wording, the field is left null (and the UI says "awaiting") rather than guessed.
We do not fabricate rows. If EDGAR is unreachable on first boot the feed is empty and the UI says so.
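
The null-rather-than-guess behaviour noted for ticker extraction can be illustrated with a small regex sketch. The pattern and the name `extractTickerExchange` are assumptions for illustration; the real patterns in ticker-exchange.js are not shown in this README. It looks for the common cover-page phrasing "under the symbol" and then searches the preceding text for an exchange name.

```javascript
// Hypothetical extractor: returns nulls when the expected wording is absent.
function extractTickerExchange(coverText) {
  const text = coverText.replace(/\s+/g, ' ');
  // Ticker: 1-5 capital letters in quotes, optionally with trailing punctuation
  // inside the closing quote.
  const m = /under the (?:ticker )?symbol [\u201c"]([A-Z]{1,5})[.,]?[\u201d"]/.exec(text);
  if (!m) return { ticker: null, exchange: null };
  // Exchange: look back up to 200 characters before the symbol phrase.
  const before = text.slice(Math.max(0, m.index - 200), m.index);
  const ex = /(Nasdaq|New York Stock Exchange|NYSE)/i.exec(before);
  let exchange = null;
  if (ex) exchange = /nasdaq/i.test(ex[1]) ? 'Nasdaq' : 'NYSE';
  return { ticker: m[1], exchange };
}
```

A cover page that says something like "our shares are expected to trade as XYZ" would not match, and both fields stay null.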