Brand selection on this site was performed by an autonomous pipeline running in a single Claude Code session, not by hand-curation. The pipeline ran for ~85 minutes, made 396 model calls, and cost $98.21. This page documents what it did and how, so the methodology is auditable.
## The work model
Everything is a task. Tasks live in a JSON queue. A driver pops the next runnable task, dispatches it to a worker, writes results back, and repeats until the queue is empty or a budget cap is hit.
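The pop/dispatch/write-back cycle can be sketched in a few lines of bash and jq. This is an illustration of the pattern, not the pipeline's actual `loop.sh`; the queue layout (an array of objects with `id` and `state`) is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative driver loop: pop the next pending task, mark it in_progress,
# "run" it, and record the result. queue.json is assumed to be a JSON array
# of task objects with "id" and "state" fields.
set -euo pipefail
QUEUE=${QUEUE:-queue.json}

next_task() {
  # ID of the first pending task; empty output when none remain.
  jq -r '[.[] | select(.state == "pending")][0].id // empty' "$QUEUE"
}

mark() { # mark <id> <state>
  local tmp
  tmp=$(mktemp)
  jq --arg id "$1" --arg s "$2" \
     'map(if .id == $id then .state = $s else . end)' "$QUEUE" > "$tmp"
  mv "$tmp" "$QUEUE"
}

run_queue() {
  local id
  while id=$(next_task) && [ -n "$id" ]; do
    mark "$id" in_progress
    echo "dispatching $id"   # the real pipeline would exec a worker here
    mark "$id" succeeded
  done
}
```

The loop is state-driven rather than list-driven, so a crashed run can be resumed by simply re-invoking it against the same queue file.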
Task types:
| Type | Worker | Outputs |
|---|---|---|
| `discover_webfetch` | `claude -p` + WebFetch | One source's brand list |
| `discover_browser` | `claude -p` + chrome-devtools MCP (headless, isolated) | Same, for JS/Cloudflare-gated sources |
| `merge_lists` | pure `jq` | Deduplicated `brands.json` |
| `enrich_brand` | `claude -p` + WebSearch/WebFetch | Founding year, ownership, extension verdict |
| `verify_extension` | `claude -p` (Opus, sharper prompt) | Second opinion on low-confidence verdicts |
| `query` | `claude -p` (no tools) | Final markdown answer |
| `finalize` | `jq` + bash | Run summary |
Each task carries `state`, `attempts`, `max_attempts`, `cost_usd`, and `result_path` fields. The state machine: `pending` → `in_progress` → `succeeded`, with failed tasks re-queued (incrementing `attempts`) up to `max_attempts`, then escalated or marked `manual_review`.
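The failed-task transition reduces to a single jq filter over the queue. A sketch, using the field names from the schema above (the filter itself is illustrative):

```shell
# Illustrative retry transition: a failed task re-queues with attempts
# incremented until max_attempts is reached, then drops to manual_review.
retry_or_escalate() {
  jq 'map(
    if .state == "failed" then
      if .attempts < .max_attempts
      then .attempts += 1 | .state = "pending"
      else .state = "manual_review"
      end
    else .
    end
  )'
}
```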
## Discovery: WebFetch vs. headless browser
Sources have an `extraction_method` hint: `webfetch`, `browser`, or `auto`.
- WebFetch is cheap and fast. It works for most sources where brand names appear in HTML text or accessible JSON.
- Headless browser is required for Cloudflare-gated, JS-rendered, or chart-rendered sources (Bain's Insurgent Brands editions are the canonical case: the brand names live in SVG `<text>` nodes inside a chart that loads after page render, and Cloudflare blocks most non-browser user agents).
The browser worker invokes `claude -p` with `chrome-devtools-mcp` running in `--isolated --headless` mode (each subprocess gets a temp user-data-dir so multiple workers can run in parallel). Three extraction strategies, in order:
- DOM/SVG text: query all `<text>` nodes inside the chart container.
- Network capture: inspect XHR/fetch responses during page load. Charts often pull JSON from a data endpoint.
- Screenshot + vision: `page.screenshot()` of the chart, handed to a vision-capable `claude -p` call to read brand names.
If WebFetch returns `completeness: partial` or `sample`, the task auto-escalates to `discover_browser` for the same source. If the browser worker fails three times, the source is marked `manual_review: true` and surfaced on the Sources page.
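The escalation step might look like this; the result-file shape (`completeness` and `source` fields) is an assumption based on the description above, not the pipeline's actual schema.

```shell
# Illustrative auto-escalation: a partial WebFetch result enqueues a
# discover_browser task for the same source.
escalate_if_partial() { # escalate_if_partial <result.json> <queue.json>
  local completeness source tmp
  completeness=$(jq -r '.completeness' "$1")
  source=$(jq -r '.source' "$1")
  case "$completeness" in
    partial|sample)
      tmp=$(mktemp)
      jq --arg src "$source" '. + [{
        "type": "discover_browser", "source": $src,
        "state": "pending", "attempts": 0, "max_attempts": 3
      }]' "$2" > "$tmp"
      mv "$tmp" "$2"
      ;;
  esac
}
```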
## Enrichment: per-brand research
For each unique brand, an `enrich_brand` task runs `claude -p` with WebSearch and WebFetch tools. The model is asked to fill in:
- `founded_year`, `founder`, `first_retail_year`
- `us_market` (boolean)
- `is_extension` (boolean) with `extension_reasoning` (why)
- `ownership_status` (independent / acquired / public), `acquired_by`, `acquired_year`
- `latest_revenue_usd`, `valuation_usd`
- `sources[]`: list of URLs cited per field
Each output field also carries a confidence rating (high / medium / low). Low confidence on `is_extension` auto-enqueues a `verify_extension` task using a different model (Opus 4.7 instead of Haiku) and a sharper prompt that lists known edge cases (Athletic Greens, Ghost, Liquid Death-style category creators). If the verifier disagrees with the enricher, the brand is marked `manual_review: true`: it stays in the database but is flagged.
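A hypothetical enrichment record, showing how values, per-field confidence, and citations sit together (the brand, values, and URL here are invented for illustration; the exact nesting is an assumption):

```json
{
  "brand": "Example Seltzer Co.",
  "founded_year": { "value": 2016, "confidence": "high" },
  "us_market": { "value": true, "confidence": "high" },
  "is_extension": { "value": false, "confidence": "low" },
  "extension_reasoning": "Independent launch; no parent brand found.",
  "ownership_status": { "value": "independent", "confidence": "medium" },
  "sources": ["https://example.com/about"]
}
```

A record like this one, with `"confidence": "low"` on `is_extension`, is exactly what would trigger the second-opinion task.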
## Parallelism
Default `MAX_WORKERS=5`, env-overridable. Implementation: `xargs -P $MAX_WORKERS` reading task IDs from the queue. Each worker is independent; they touch different rows in `brands.json`. Writes go through a `flock`-guarded merge helper to avoid lost updates.
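The fan-out and the guarded write can be sketched as follows; `merge_result`, its merge semantics, and the lock-file path are illustrative, not the pipeline's actual helper.

```shell
# Illustrative parallel dispatch plus flock-guarded merge.
MAX_WORKERS=${MAX_WORKERS:-5}
LOCK=${LOCK:-brands.lock}

dispatch_all() {
  # Read task IDs on stdin; run up to MAX_WORKERS workers concurrently.
  # A real worker command would replace the echo.
  xargs -P "$MAX_WORKERS" -n1 echo handling
}

merge_result() { # merge_result <partial.json> <brands.json>
  # flock serializes writers so two concurrent workers cannot clobber
  # each other's updates; jq deep-merges the partial into the main file.
  flock "$LOCK" sh -c \
    'jq -s ".[0] * .[1]" "$2" "$1" > "$2.tmp" && mv "$2.tmp" "$2"' _ "$1" "$2"
}
```

The write-then-rename in `merge_result` keeps `brands.json` valid even if a worker dies mid-merge.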
## Cost and budget
Every `claude -p` call returns its cost in the JSON wrapper. The dispatcher accumulates total cost in `run-state.json`. Before dispatching any task, it checks `total_cost < BUDGET_USD`. Hitting the budget cap is a graceful halt: the final report is written with whatever's done, and the run can be resumed later.
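The gate is a one-line comparison, but dollar amounts are floats, so the arithmetic belongs in jq rather than in shell math. A sketch, assuming `run-state.json` holds a single `total_cost_usd` number (the field name is an assumption):

```shell
# Illustrative budget gate: succeed while total spend is under the cap.
BUDGET_USD=${BUDGET_USD:-1000}

within_budget() {
  local total
  total=$(jq -r '.total_cost_usd' run-state.json)
  # jq handles the float comparison; shell arithmetic is integer-only.
  jq -e -n --argjson t "$total" --argjson b "$BUDGET_USD" '$t < $b' > /dev/null
}
```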
This run:
- Total cost: $98.21
- Model calls: 396
- Budget: $1,000
- Runtime: ~85 minutes (started 20:22 UTC, finalized 21:47 UTC)
Average enrichment cost per brand: ~$0.23. Average discovery cost per source: ~$0.65 (WebFetch) / ~$1.50 (browser).
## Termination and audit
The loop exits when `(pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted`. Then `finalize.sh` runs, which:
- Ensures the `query` task for the headline question has run.
- Counts: sources covered, sources skipped, brands found, brands enriched, brands flagged `manual_review`.
- Writes a complete-run report with numerical summary, the answer to the question, the manual-review section, the skipped-sources section, total cost, runtime, and call count.
- Posts a one-line summary via push notification.
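The exit condition itself reduces to one jq expression over the queue plus the running cost total. A sketch under the same assumed queue layout as above:

```shell
# Illustrative termination check: true when nothing is runnable
# (no pending or escalated tasks) or the budget is exhausted.
should_stop() { # should_stop <queue.json> <total_cost> <budget_usd>
  jq -e --argjson cost "$2" --argjson budget "$3" '
    ([.[] | select(.state == "pending" or .state == "escalated")] | length) == 0
    or $cost >= $budget
  ' "$1" > /dev/null
}
```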
Every brand in the database has a `sources[]` array citing the URLs the model used to fill each field. Every model call has a JSONL log entry in `logs/`. The complete `run-state.json`, `queue.json`, and `brands.json` are preserved.
## What this guarantees, and what it doesn't
Guarantees:
- No brand was added or excluded by hand-picking. The criteria are fixed and applied uniformly.
- Every classification is traceable to source URLs.
- The dataset is reproducible — re-running the loop with the same sources will produce a comparable dataset (model nondeterminism aside).
Doesn't guarantee:
- That the source lists themselves are exhaustive. If a brand isn't on Bain, Numerator, Circana, Inc. 5000 F&B, Pear Commerce, Food Institute, or Food Dive, it isn't here. See Sources for the full list.
- That every model classification is correct. The `verify_extension` second opinion catches obvious disagreements; subtler edge cases are flagged as `manual_review` (currently 0 in this run, with 5 brands flagged for incomplete data; see the Brands page).
- That the data won't drift. Acquisitions, valuations, and revenue change. The dataset reflects state as of the run timestamp.
## Source code and design doc
The full design document is in the project repository at `DESIGN.md`. Key files:

- `pipeline/loop.sh`: top-level driver
- `pipeline/dispatch.sh`: one task → one worker
- `pipeline/workers/*.sh`: per-task-type implementations
- `pipeline/prompts/*.md`: prompt templates passed to `claude -p`
- `pipeline/schemas/*.json`: JSON schemas for validation
- `pipeline/lib/chrome-devtools-isolated.json`: MCP config for browser workers