Breakout Brands Timeline

How the autonomous pipeline produced this

Brand selection on this site was performed by an autonomous pipeline running in a single Claude Code session, not by hand-curation. The pipeline ran for ~85 minutes, made 396 model calls, and cost $98.21. This page documents what it did and how, so the methodology is auditable.


The work model

Everything is a task. Tasks live in a JSON queue. A driver pops the next runnable task, dispatches it to a worker, writes results back, and repeats until the queue is empty or a budget cap is hit.
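
A minimal sketch of that loop in the pipeline's own bash + jq idiom — dispatch_task is a hypothetical helper standing in for the per-type workers, and the field names follow the task schema described below:

```bash
#!/usr/bin/env bash
# Illustrative driver loop; dispatch_task is a stand-in for the real workers.
set -euo pipefail
QUEUE=queue.json

while true; do
  # Pop the next runnable task: pending and still under its attempt limit.
  task_id=$(jq -r '[.tasks[] | select(.state == "pending"
                      and .attempts < .max_attempts)][0].id // empty' "$QUEUE")
  [ -z "$task_id" ] && break   # queue drained (or everything escalated)

  dispatch_task "$task_id"     # runs the worker, writes results back
done
```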

Task types:

| Type | Worker | Outputs |
| --- | --- | --- |
| discover_webfetch | claude -p + WebFetch | One source's brand list |
| discover_browser | claude -p + chrome-devtools MCP (headless, isolated) | Same, for JS/Cloudflare-gated sources |
| merge_lists | pure jq | Deduplicated brands.json |
| enrich_brand | claude -p + WebSearch/WebFetch | Founding year, ownership, extension verdict |
| verify_extension | claude -p (Opus, sharper prompt) | Second opinion on low-confidence verdicts |
| query | claude -p (no tools) | Final markdown answer |
| finalize | jq + bash | Run summary |

Each task carries state, attempts, max_attempts, cost_usd, and result_path fields. The state machine: pending → in_progress → succeeded, with failed tasks re-queued (incrementing attempts) up to max_attempts, then escalated or marked manual_review.
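
For illustration, a freshly re-queued enrichment task might look like this — the field names come from the schema above, while the id, type, and values are invented:

```json
{
  "id": "enrich-0042",
  "type": "enrich_brand",
  "state": "pending",
  "attempts": 1,
  "max_attempts": 3,
  "cost_usd": 0.19,
  "result_path": "results/enrich-0042.json"
}
```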


Discovery: WebFetch vs. headless browser

Sources have an extraction_method hint: webfetch, browser, or auto.

The browser worker invokes claude -p with chrome-devtools-mcp running in --isolated --headless mode (each subprocess gets a temp user-data-dir so multiple workers can run in parallel); a config sketch follows the strategy list below. Three extraction strategies, tried in order:

  1. DOM/SVG text: query all <text> nodes inside the chart container.
  2. Network capture: inspect XHR/fetch responses during page load. Charts often pull JSON from a data endpoint.
  3. Screenshot + vision: take a page.screenshot() of the chart and hand it to a vision-capable claude -p call to read the brand names.
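
The config sketch mentioned above — this assumes the chrome-devtools-mcp npm package and the claude CLI's --mcp-config flag, so treat the exact wiring as an assumption rather than the run's actual setup:

```bash
# Hypothetical wiring: one isolated, headless Chrome per worker subprocess.
cat > "/tmp/mcp-$$.json" <<'EOF'
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp", "--isolated", "--headless"]
    }
  }
}
EOF

claude -p "Extract the brand list from $SOURCE_URL" \
  --mcp-config "/tmp/mcp-$$.json"
```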

If WebFetch returns completeness: partial or sample, the task auto-escalates to discover_browser for the same source. If browser fails three times, the source is marked manual_review: true and surfaced in the Sources page.
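
Expressed as a sketch, assuming each discovery result records its completeness field:

```bash
# Sketch of the auto-escalation rule; enqueue_task is a hypothetical helper.
completeness=$(jq -r '.completeness // "full"' "results/$task_id.json")
if [ "$completeness" = "partial" ] || [ "$completeness" = "sample" ]; then
  enqueue_task discover_browser "$source_id"   # retry the source in a browser
fi
```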


Enrichment: per-brand research

For each unique brand, an enrich_brand task runs claude -p with the WebSearch and WebFetch tools. The model is asked to fill in the fields from the task table above: founding year, ownership, and an extension verdict (is_extension).

Each output field also carries a confidence rating (high / medium / low). Low confidence on is_extension auto-enqueues a verify_extension task using a different model (Opus 4.7 instead of Haiku) and a sharper prompt that lists known edge cases (Athletic Greens, Ghost, Liquid Death-style category creators). If the verifier disagrees with the enricher, the brand is marked manual_review: true — it stays in the database but is flagged.
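
The trigger itself could be as small as this — a sketch, assuming the enrichment result nests a per-field confidence rating:

```bash
# Sketch: a low-confidence extension verdict gets a second opinion.
conf=$(jq -r '.is_extension.confidence' "results/$task_id.json")
if [ "$conf" = "low" ]; then
  enqueue_task verify_extension "$brand_id"   # same hypothetical helper as above
fi
```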


Parallelism

Default MAX_WORKERS=5, env-overridable. Implementation: xargs -P $MAX_WORKERS reading task IDs from the queue. Each worker is independent; they touch different rows in brands.json. Writes go through a flock-guarded merge helper to avoid lost updates.
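
A sketch of that fan-out and the guarded write path — run-worker.sh and brands.lock are illustrative names:

```bash
# Fan out: up to $MAX_WORKERS workers, one task ID per invocation.
jq -r '.tasks[] | select(.state == "pending").id' queue.json \
  | xargs -P "${MAX_WORKERS:-5}" -I{} ./run-worker.sh {}

# Inside the worker: serialize brands.json writes behind a file lock,
# merging the task's result in with jq and swapping the file atomically.
flock brands.lock bash -c '
  jq -s ".[0] * .[1]" brands.json "$1" > brands.tmp && mv brands.tmp brands.json
' _ "results/$task_id.json"
```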


Cost and budget

Every claude -p call returns its cost in the JSON wrapper. The dispatcher accumulates total cost in run-state.json. Before dispatching any task, it checks total_cost < BUDGET_USD. Hitting budget is a graceful halt — the final report is written with whatever's done, and the run can be resumed later.
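
A sketch of the accounting, assuming the CLI is invoked with --output-format json and that its wrapper exposes a total_cost_usd field (true of recent versions, but verify against yours):

```bash
# Extract this call's cost from the JSON wrapper and add it to the total.
out=$(claude -p "$prompt" --output-format json)
cost=$(jq -r '.total_cost_usd // 0' <<<"$out")

jq --argjson c "$cost" '.total_cost += $c' run-state.json \
  > run-state.tmp && mv run-state.tmp run-state.json

# The budget gate, checked before every dispatch (awk does the float compare).
total=$(jq -r '.total_cost' run-state.json)
if awk -v t="$total" -v b="${BUDGET_USD:-100}" 'BEGIN { exit !(t >= b) }'; then
  echo "Budget exhausted; writing final report and halting." >&2
  exit 0
fi
```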

This run: ~85 minutes, 396 model calls, $98.21 total.

Average enrichment cost per brand: ~$0.23. Average discovery cost per source: ~$0.65 (WebFetch) / ~$1.50 (browser).


Termination and audit

The loop exits when (pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted. Then finalize.sh runs, which:

  1. Ensures the query task for the headline question has run.
  2. Counts: sources covered, sources skipped, brands found, brands enriched, brands flagged manual_review.
  3. Writes a complete-run report with numerical summary, the answer to the question, the manual-review section, the skipped-sources section, total cost, runtime, and call count.
  4. Posts a one-line summary via push notification.
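
Returning to the exit condition, a minimal sketch of the test the loop runs each iteration, with state names as used above:

```bash
# Sketch: the loop's termination test.
pending=$(jq '[.tasks[] | select(.state == "pending")] | length' queue.json)
escalated=$(jq '[.tasks[] | select(.state == "escalated")] | length' queue.json)

if { [ "$pending" -eq 0 ] && [ "$escalated" -eq 0 ]; } \
   || [ "$budget_exhausted" = "true" ]; then
  ./finalize.sh
  break
fi
```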

Every brand in the database has a sources[] array citing the URLs the model used to fill each field. Every model call has a JSONL log entry in logs/. The complete run-state.json, queue.json, and brands.json are preserved.
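
An enriched brand record might therefore look like the following — the shape is inferred from the fields described on this page, and every value is invented:

```json
{
  "name": "Example Brand",
  "founding_year": 2019,
  "ownership": "independent",
  "is_extension": false,
  "confidence": { "is_extension": "high" },
  "manual_review": false,
  "sources": [
    "https://example.com/breakout-brands-chart",
    "https://example.com/brand-about-page"
  ]
}
```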


What this guarantees, and what it doesn't

Guarantees:

Doesn't guarantee:


Source code and design doc

The full design document is in the project repository at DESIGN.md. Key files: