rfs-matcher

Paste a 2-3 sentence startup description and get back the YC Summer 2026
Requests-for-Startups (RFS) bullets it matches, plus the already-funded YC
companies in the same lane to study before you write the application.

The YC RFS page is a wall of prose. The YC company directory is ~6000 rows
deep. rfs-matcher does the matching for you in one shot: embed your pitch,
embed every RFS bullet and every YC company, cosine-rank, surface the top 3 RFS
bullets and top 5 funded peers with a one-line "why this matches" rationale.
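
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `cosineSim` and `topK` are the names listed for lib/cosine.js, but the signatures and the row shape (`{id, embedding}`) are assumptions, and toy 3-d vectors stand in for real embeddings.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSim(a, b) {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as similarity 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Score every row against the query and keep the k best, highest first.
// The {id, embedding} row shape is an assumption for this sketch.
function topK(query, rows, k) {
  return rows
    .map((row) => ({ id: row.id, score: cosineSim(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-d vectors stand in for the 1536-d embeddings.
const pitch = [1, 0, 1];
const rows = [
  { id: 'rfs-a', embedding: [1, 0, 1] },
  { id: 'rfs-b', embedding: [0, 1, 0] },
  { id: 'rfs-c', embedding: [1, 1, 1] },
];
console.log(topK(pitch, rows, 2).map((r) => r.id)); // [ 'rfs-a', 'rfs-c' ]
```

The same function would be called twice per match under this sketch: once with k = 3 over the RFS bullets and once with k = 5 over the companies.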

No accounts. No auth. No tracking. The dataset is all public.

Run locally

npm install
cp .env.example .env   # optional — fill in OPENAI_API_KEY / ANTHROPIC_API_KEY
PORT=4819 node server.js
open http://localhost:4819/rfs-matcher/

First boot scrapes the YC RFS page (~30 KB) and the yc-oss companies JSON
(~10 MB, ~6000 rows) into ./data/rfs-matcher.db. Subsequent boots are
instant. Without API keys the service still works — see Modes below.

Endpoints

All mounted under /rfs-matcher. No auth, all public.

| Method | Path | What it does |
|---|---|---|
| GET | /rfs-matcher/ | The SPA |
| GET | /rfs-matcher/health | {ok:true, rfs_count, company_count, embeddings_mode} |
| POST | /rfs-matcher/api/match | Body {pitch} → top-3 RFS bullets + top-5 YC companies. Persisted, returns a short pid and share_url. |
| GET | /rfs-matcher/api/match/:pid | Re-fetch a prior match by id. |
| GET | /rfs-matcher/share/:pid | Server-rendered HTML with OG tags — for Twitter/LinkedIn link previews. |
| GET | /rfs-matcher/api/rfs | Every RFS bullet currently indexed. |
| GET | /rfs-matcher/api/companies?batch=&limit=&offset= | Paginated company rows. |
| GET | /rfs-matcher/api/stats | Corpus counts, batch breakdown, last-scrape timestamps, current modes. |

Rate limit: 10 matches/hour per IP (token bucket, in-process). Over-limit → 429
with a Retry-After header. No API key required.
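
For readers unfamiliar with token buckets, here is a rough sketch of how a 10-per-hour, per-IP limit with a Retry-After value could work in process. The class name, option names, and return shape are hypothetical; the project's actual limiter may differ.

```javascript
// Hypothetical in-process limiter: one bucket per client key (e.g. an IP).
class TokenBucketLimiter {
  constructor({ capacity = 10, windowMs = 60 * 60 * 1000 } = {}) {
    this.capacity = capacity;
    this.msPerToken = windowMs / capacity; // time to regain one token
    this.buckets = new Map(); // key -> { tokens, updatedAt }
  }

  // Returns { allowed, retryAfterSec }. `now` is injectable for testing.
  take(key, now = Date.now()) {
    const bucket = this.buckets.get(key) || { tokens: this.capacity, updatedAt: now };
    // Refill continuously based on elapsed time, capped at capacity.
    const refill = (now - bucket.updatedAt) / this.msPerToken;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + refill);
    bucket.updatedAt = now;
    this.buckets.set(key, bucket);

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { allowed: true, retryAfterSec: 0 };
    }
    // Time until one whole token is available again, rounded up to seconds.
    const waitMs = (1 - bucket.tokens) * this.msPerToken;
    return { allowed: false, retryAfterSec: Math.ceil(waitMs / 1000) };
  }
}

const limiter = new TokenBucketLimiter();
for (let i = 0; i < 10; i++) limiter.take('203.0.113.7', 0); // spend the bucket
console.log(limiter.take('203.0.113.7', 0)); // { allowed: false, retryAfterSec: 360 }
```

In a request handler, a denied result would map to a 429 response with `retryAfterSec` as the Retry-After header value. Because the buckets live in a Map, they reset when the process restarts.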

Data sources

| Source | URL | Refresh |
|---|---|---|
| YC Requests for Startups | https://www.ycombinator.com/rfs | weekly, 0 5 * * 1 (Mondays 05:00) |
| YC Companies directory | https://yc-oss.github.io/api/companies/all.json (community mirror of the YC company directory, refreshed daily from ycombinator.com/companies) | weekly, 0 6 * * 1 (Mondays 06:00) |
| OpenAI Embeddings (text-embedding-3-small, 1536-d) | https://api.openai.com/v1/embeddings | per user pitch + once per row at bootstrap / delta |
| Anthropic Messages (claude-haiku-4-5) | https://api.anthropic.com/v1/messages | 1 call per /api/match request |

Refreshes are weekly cron jobs registered by lib/cron.js on boot. A snapshot
is kept in SQLite so the service stays up even if upstream is briefly broken;
see /api/stats for the last successful scrape time and scrape_log for
diagnostics.
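
The "keep the last good snapshot" behaviour can be illustrated with a small wrapper. This is a sketch under assumptions: an in-memory `state` object stands in for the SQLite tables, `refreshSource` is a hypothetical name, and the `ok` / `error` status values are guesses. Only the log field names come from the scrape_log schema.

```javascript
// Hypothetical refresh wrapper: replace the snapshot only when the scrape succeeds.
async function refreshSource(source, fetchRows, state) {
  const startedAt = Date.now();
  const entry = { source, status: 'ok', row_count: 0, duration_ms: 0, error: null, ran_at: null };
  try {
    const rows = await fetchRows();
    if (!Array.isArray(rows) || rows.length === 0) {
      throw new Error('upstream returned no rows');
    }
    state.snapshots[source] = rows; // swap in the fresh snapshot
    entry.row_count = rows.length;
  } catch (err) {
    // Keep serving the previous snapshot; only record the failure.
    entry.status = 'error';
    entry.error = String((err && err.message) || err);
  }
  entry.duration_ms = Date.now() - startedAt;
  entry.ran_at = new Date().toISOString();
  state.scrapeLog.push(entry);
  return state.snapshots[source] || [];
}

// A failing upstream leaves the old snapshot untouched.
const state = { snapshots: { rfs: [{ id: 'old-bullet' }] }, scrapeLog: [] };
refreshSource('rfs', async () => { throw new Error('HTTP 503'); }, state)
  .then((rows) => console.log(rows.length, state.scrapeLog[0].status)); // prints: 1 error
```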

Modes (honest about degradation)

| Condition | What happens |
|---|---|
| OPENAI_API_KEY set | RFS + companies + your pitch are embedded with text-embedding-3-small. Ranking is cosine similarity over Float32 1536-d vectors. |
| OPENAI_API_KEY missing | No embeddings are computed. Ranking falls back to a lexical scorer: a blend of overlap coefficient and Jaccard similarity over a stopword-filtered token set. embeddings_mode in /api/stats reads jaccard-fallback. Real comparison, lower quality. |
| ANTHROPIC_API_KEY set | Each result gets a one-line rationale tying a word from your pitch to a word from the target. |
| ANTHROPIC_API_KEY missing | Results ship with empty rationale strings and rationale_mode: "no-key". The UI shows the cards without the rationale line. |

Nothing is mocked. If a column would be fake, it's omitted.
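
To make the lexical fallback concrete, here is a minimal sketch of an overlap-coefficient and Jaccard blend. The function names `tokenize`, `jaccard`, and `overlap` match those listed for lib/cosine.js, but the stopword list, the tokenizer regex, the 50/50 weighting, and `lexicalScore` itself are assumptions made for illustration.

```javascript
// Illustrative stopword list; the project's real list is not shown in this README.
const STOPWORDS = new Set(['a', 'an', 'and', 'for', 'in', 'of', 'on', 'the', 'to', 'with']);

// Lowercase, split on non-alphanumerics, drop stopwords, dedupe into a Set.
function tokenize(text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return new Set(words.filter((w) => !STOPWORDS.has(w)));
}

function intersectionSize(a, b) {
  let n = 0;
  for (const t of a) if (b.has(t)) n++;
  return n;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|
function jaccard(a, b) {
  const inter = intersectionSize(a, b);
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Overlap coefficient: |A ∩ B| / min(|A|, |B|)
function overlap(a, b) {
  const smaller = Math.min(a.size, b.size);
  return smaller === 0 ? 0 : intersectionSize(a, b) / smaller;
}

// Assumed 50/50 blend; the actual weights are not documented here.
function lexicalScore(textA, textB, weight = 0.5) {
  const a = tokenize(textA);
  const b = tokenize(textB);
  return weight * overlap(a, b) + (1 - weight) * jaccard(a, b);
}

console.log(lexicalScore('AI tools for lawyers and contracts', 'AI tools')); // 0.75
```

The overlap coefficient rewards a short text that is fully contained in a longer one, which suits comparing a short pitch against a long description, while Jaccard penalises the unmatched remainder.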

Architecture

server.js — Express bootstrap, helmet/compression, mounts everything under /rfs-matcher, kicks the first-boot bootstrap.
db.js — better-sqlite3 (WAL), schema migration, all prepared statements, Float32 ↔ BLOB helpers.
routes/ — pages.js (root + health), data.js (rfs/companies/stats), match.js (POST match, GET match, share view).
scrapers/rfs.js — fetch + cheerio parse of ycombinator.com/rfs into {id, title, body} rows.
scrapers/companies.js — fetch of yc-oss.github.io/api/companies/all.json, normalised into yc_companies rows.
lib/embeddings.js — OpenAI client, batched at 96, exponential-backoff retry, per-IP token bucket.
lib/cosine.js — cosineSim, topK, tokenize, jaccard, overlap for the fallback path.
lib/rationale.js — Anthropic call, JSON-mode prompt, lenient JSON-extraction parse.
lib/cron.js — first-boot bootstrap + delta-embedding + the two weekly cron jobs.
lib/shareCard.js — server-rendered share page with OG tags.
public/ — vanilla-JS SPA (index.html, app.js, style.css).
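
As an illustration of the "batched at 96, exponential-backoff retry" note for lib/embeddings.js, the sketch below chunks the input and retries each batch with doubling delays. `embedBatch` is a hypothetical stand-in for the real API call, and the retry count and base delay are assumed defaults, not the project's values.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `fn`, waiting baseMs, 2*baseMs, 4*baseMs, ... between attempts.
async function withBackoff(fn, { retries = 4, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await sleep(baseMs * 2 ** attempt);
    }
  }
}

// `embedBatch` is hypothetical: (string[]) => Promise<number[][]>,
// returning one vector per input text, in order.
async function embedAll(texts, embedBatch, { batchSize = 96, retries = 4, baseMs = 500 } = {}) {
  const vectors = [];
  for (const batch of chunk(texts, batchSize)) {
    const result = await withBackoff(() => embedBatch(batch), { retries, baseMs });
    vectors.push(...result);
  }
  return vectors;
}
```

With roughly 6000 company rows, batches of 96 would mean about 63 sequential requests at bootstrap under this sketch.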

Total: ~1700 LOC across 13 source files (plus the spec and frontmatter).
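
The "lenient JSON-extraction parse" mentioned for lib/rationale.js could look roughly like the following. This is only one plausible approach (strict parse first, then the outermost brace-delimited span); `extractJson` and `parseObject` are hypothetical names, and the `rationale` key in the example is made up.

```javascript
// Parse text as JSON, accepting only objects/arrays; return null on any failure.
function parseObject(text) {
  try {
    const value = JSON.parse(text);
    return value !== null && typeof value === 'object' ? value : null;
  } catch (err) {
    return null;
  }
}

// Try strict JSON first, then fall back to the span between the first "{"
// and the last "}" so that surrounding chatter or code fences are ignored.
function extractJson(text) {
  const strict = parseObject(text);
  if (strict) return strict;
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  return parseObject(text.slice(start, end + 1));
}

console.log(extractJson('Sure! {"rationale": "both target legal teams"} Hope that helps.'));
// { rationale: 'both target legal teams' }
```

Returning null instead of throwing lets the caller degrade to an empty rationale string, in keeping with the degradation behaviour described under Modes.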

SQLite schema

4 tables in ./data/rfs-matcher.db:

rfs_bullets — {id, batch_label, title, body, source_url, embedding BLOB, embedding_model, scraped_at}
yc_companies — {slug, name, batch, industry, one_liner, long_desc, url, status, embedding BLOB, embedding_model, scraped_at}
matches — {pid, pitch, top_rfs JSON, top_companies JSON, embeddings_mode, rationale_mode, created_at, client_ip_hash}
scrape_log — {source, status, row_count, duration_ms, error, ran_at}

WAL mode. Embeddings stored as Float32 little-endian BLOBs. ~6000 × 1536 × 4 bytes ≈ 37 MB once embedded.
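
The Float32 ↔ BLOB round trip and the storage estimate can be checked with Node's built-in Buffer. The helper names below are hypothetical; the explicit `writeFloatLE` / `readFloatLE` calls make the little-endian layout independent of the host's byte order.

```javascript
// Encode a vector as a little-endian Float32 BLOB (4 bytes per dimension).
function vectorToBlob(vector) {
  const buf = Buffer.alloc(vector.length * 4);
  vector.forEach((v, i) => buf.writeFloatLE(v, i * 4));
  return buf;
}

// Decode a little-endian Float32 BLOB back into a Float32Array.
function blobToVector(buf) {
  if (buf.length % 4 !== 0) throw new Error('BLOB length is not a multiple of 4');
  const vector = new Float32Array(buf.length / 4);
  for (let i = 0; i < vector.length; i++) vector[i] = buf.readFloatLE(i * 4);
  return vector;
}

const blob = vectorToBlob([0.5, -1.25, 3]);
console.log(blob.length); // 12
console.log(Array.from(blobToVector(blob))); // [ 0.5, -1.25, 3 ]

// Storage estimate from the text: rows × dimensions × bytes per float.
console.log(6000 * 1536 * 4); // 36864000 bytes, i.e. about 37 MB
```

Values that are not exactly representable in 32 bits lose precision on the way in, which is generally acceptable for cosine ranking.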

Limits / out-of-scope

No payments, no accounts. A paid "$19 full report" is a possible later addition; v1 is fully free.
Single language (English).
Only YC. Other accelerators are intentionally not supported in v1.
Pitch must be 20–2000 characters.
10 matches per IP per hour.
Every submission creates a new pid — matches are immutable; there's no edit.
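
A client can mirror these limits before spending one of its ten hourly requests. The sketch below is a hypothetical caller, not part of the project: it assumes the server accepts a JSON body with a `pitch` field (as the endpoint table suggests), that whitespace is trimmed before the length check, and that Node 18+ is available for the global `fetch`.

```javascript
// Mirror the documented 20-2000 character limit on the client side.
// Trimming before measuring is an assumption about server behaviour.
function validatePitch(pitch) {
  const text = String(pitch).trim();
  if (text.length < 20) return { ok: false, error: 'pitch is shorter than 20 characters' };
  if (text.length > 2000) return { ok: false, error: 'pitch is longer than 2000 characters' };
  return { ok: true, pitch: text };
}

// Submit a pitch to a locally running instance and surface rate limiting.
async function submitPitch(pitch, baseUrl = 'http://localhost:4819') {
  const check = validatePitch(pitch);
  if (!check.ok) throw new Error(check.error);

  const res = await fetch(`${baseUrl}/rfs-matcher/api/match`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ pitch: check.pitch }),
  });

  if (res.status === 429) {
    // The server documents a Retry-After header on over-limit responses.
    throw new Error(`rate limited; Retry-After: ${res.headers.get('Retry-After')}`);
  }
  if (!res.ok) throw new Error(`match failed with HTTP ${res.status}`);
  return res.json(); // per the endpoint table, includes pid and share_url
}

console.log(validatePitch('too short'));
// { ok: false, error: 'pitch is shorter than 20 characters' }
```

Since matches are immutable, a revised pitch is simply a new `submitPitch` call that yields a new pid.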