Step 11 · The toolbelt · Loop Engineering
Module 4 · The toolbelt · Lesson 11

Bright Data CLI: real web evidence, always

The web-evidence tool every agent in the suite shares. When the loop needs a fact, a page, or a structured record from the live web, it does not guess — it runs brightdata and quotes what came back.

Read the plain version, or open the technical layer on any section.
1

The big idea


An agent's memory is frozen at the moment it was trained. Ask it a price, a version number, what a page says today, or whether some claim still holds, and it will happily answer — from a snapshot that may be months or years stale. That is a guess wearing a confident voice.

The fix is simple and it is a rule, not a preference: when you need something from the live web, go and get it. The whole suite reaches for one tool to do that — a command-line program called Bright Data, run as brightdata from any agent's shell. It can search the web, open and read any page (even ones that try to block robots), drive a real browser, and pull tidy records from forty-plus sites.

This is the same instinct as the loop's Proof Gate — "prove it at the real boundary, never claim it." For code, the boundary is running the test. For a fact about the world, the boundary is the live web, and brightdata is how every agent reaches it.

Think of it like… a courtroom. A lawyer doesn't tell the jury "I'm fairly sure the contract said X" from memory — that's hearsay, and it gets thrown out. They enter the document into evidence and read the exact line aloud. brightdata is how an agent enters web facts into evidence: it fetches the real source and quotes it, instead of testifying from a memory that might be wrong.

Why a CLI, and why this one

The CLI lives at /opt/homebrew/bin/brightdata and runs from any agent's shell — Claude Code, Codex, Grok, Kimi, OpenCode, Crush, Goose, pi, the Council. That uniformity is the point: every agent in a cross-agent run grounds facts the exact same way, so behaviour is identical no matter who is driving. A per-agent MCP or a per-model browser tool would fragment that; the shell is the one surface they all have.

Bright Data's edge is the Web Unlocker: a geo-targeted proxy network that renders JavaScript and gets past bot-walls, CAPTCHAs, and rate-limits that would return an empty page or a 403 to a naïve curl. So "scrape any site" means any site, including the ones that actively resist scraping. It is a paid API, so cost discipline matters — check brightdata budget before a bulk run, and use the right mode for the job.

The hard rule, repeated everywhere in this course and in the global config: web data is ALWAYS the brightdata CLI — never WebSearch/WebFetch, and never the Bright Data MCP (mcp__Bright_Data__*). The CLI is the single uniform path; the MCP and the built-in web tools are explicitly forbidden so the suite never has two different grounding behaviours.
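
One way to make that rule mechanical is a small check an agent harness could run before it dispatches any web tool call. The sketch below is an assumption about how such a guard might look and is not part of the brightdata CLI: the function name `isAllowedWebTool` and the idea of passing tool names as strings are hypothetical, and only the tool names themselves come from the rule above.

```javascript
// Hypothetical guard: decides whether a web-facing tool call respects the rule.
// Only the names WebSearch, WebFetch, mcp__Bright_Data__* and brightdata come from the lesson.
const FORBIDDEN_EXACT = new Set(["WebSearch", "WebFetch"]);
const FORBIDDEN_PREFIX = "mcp__Bright_Data__";

function isAllowedWebTool(toolName) {
  if (FORBIDDEN_EXACT.has(toolName)) return false;          // built-in web tools
  if (toolName.startsWith(FORBIDDEN_PREFIX)) return false;  // any Bright Data MCP tool
  return toolName === "brightdata";                         // the one uniform path
}

console.log(isAllowedWebTool("brightdata"));                // true
console.log(isAllowedWebTool("WebFetch"));                  // false
console.log(isAllowedWebTool("mcp__Bright_Data__example")); // false (made-up MCP tool name)
```

The guard is deliberately an allow-list: anything that is not the CLI is rejected, so a newly added web tool cannot slip past by having an unfamiliar name.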

2

The one rule


Before anything else, internalise the rule. It is short, and it overrides any default an agent might have.

The rule

For any web search or web scrape, use the brightdata CLI. Quote the evidence you got back — the command plus the key line or figure — instead of paraphrasing from memory.

NEVER: WebSearch / WebFetch · the Bright Data MCP (mcp__Bright_Data__*) · answering a live-web question from training memory

ALWAYS: the brightdata CLI, from the shell — the one uniform path every agent shares

Why so absolute? Because "usually verify, sometimes guess" collapses under pressure into "guess." A bright line — always the CLI, no exceptions — is the only version of the rule an autonomous loop can actually keep. The same logic as the Proof Gate: a verification you sometimes skip is a verification you cannot trust.
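
"Quote the evidence" has a concrete shape: the command that was run, plus the key line it returned. A minimal sketch of that shape follows. The helper name `formatEvidence` and its field names (`command`, `quote`, `pulled`) are hypothetical, chosen only for illustration; the sample values are borrowed from this lesson's worked example.

```javascript
// Hypothetical helper: turns a fetch result into the quotable form the rule asks for,
// i.e. the command that was run plus the key line it returned.
function formatEvidence({ command, quote, pulled }) {
  if (!command || !quote) {
    throw new Error("evidence needs both the command and the quoted line");
  }
  const when = pulled ? ` (pulled ${pulled})` : "";
  return `$ ${command}\n> ${quote}${when}`;
}

console.log(formatEvidence({
  command: 'brightdata scrape "https://acme.dev/releases" -f markdown',
  quote: "Latest: 4.0.1",
  pulled: "2026-06-14",
}));
```

Refusing to format a claim that has no command or no quoted line is the point of the sketch: a statement without both is testimony from memory, not evidence.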
3

The four modes of the toolbox


The CLI is really four tools under one command. Each answers a different shape of "I need something from the web." Read them as a ladder — reach for the lightest one that does the job.

brightdata search

Find candidate pages (SERP)

Run a query and get back a search-engine results page: the titles, links, and snippets of candidate pages.

Reach for it when: you don't yet know which URL holds the answer and need to find it first.

brightdata scrape

Read any page (Web Unlocker)

Open one known URL and bring back its content as clean markdown (or HTML, a screenshot, or JSON). Gets past bot-walls and CAPTCHAs that block a plain fetch.

Reach for it when: you already know the exact page and just need what it says, right now.

brightdata browser

Drive a live browser

A real, geo-targeted browser you can steer — click, scroll, wait, snapshot the page tree. For pages that only reveal their content after you interact with them.

Reach for it when: a one-shot scrape isn't enough — the content is behind a click, a scroll, or a multi-step flow.

brightdata pipelines

40+ structured datasets

Hand a URL or handle to a purpose-built extractor and get back a clean, typed record — a tweet, a Reddit thread, a product, a profile, a listing — with named fields instead of raw page soup.

Reach for it when: the source is a known platform (X, Reddit, YouTube, Amazon, LinkedIn, …) and you want structured fields, not prose.

[Diagram: brightdata, one CLI with four modes. search: SERP, find candidate pages. scrape: read one page via Web Unlocker. browser: drive a live session. pipelines: 40+ structured datasets. All four lead to real evidence: quote it, don't guess.]
Read left → right: one command splits into four modes; each returns the same thing — real evidence you quote instead of guessing.
4

Mode deep-dive: scrape, end to end


Take one mode — scrape — and look at it closely: what it does, the command you actually type, and the questions people ask the first time they use it. The other three follow the same shape.

a

What scrape does


brightdata scrape takes one URL and gives you back what is on that page — cleaned up and ready to read. By default it returns markdown: the page's text and structure, stripped of ads, navigation, and the rest of the clutter, so your agent reads the substance, not the soup.

The magic word is Web Unlocker. Many pages detect robots and serve them a blank wall or a CAPTCHA. The Unlocker routes the request through a real-looking browser from a real-looking location, runs the page's JavaScript, and gets the actual content — the same thing a person would see.

Think of it like… sending a well-dressed person to pick up a document instead of an obvious courier robot. The front desk that would have turned the robot away just hands the person the file. Unlike a person, though, the Unlocker can do it thousands of times a minute and never gets tired — that's the part the analogy drops.

Output formats and async

The format flag is -f markdown|html|screenshot|json (markdown is the default and is usually what an agent wants — least tokens, most signal). Add --country us to pick the exit geography, --mobile for the mobile render. For big or slow pages, --async returns a job id you poll with brightdata status <job-id> so you are not blocked waiting.

When a plain fetch would return a 403, an empty shell, or a JS-challenge page, the Unlocker is what turns that into real content. That is the entire reason this CLI exists rather than curl: it is built to beat the anti-bot layer.
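
The flags above compose, so an agent that shells out from code may want to build the argument list in one place. The following is a minimal sketch under stated assumptions: `scrapeArgs` is a hypothetical helper, it uses only the flags named in this section (`-f`, `--country`, `--mobile`, `--async`), and the order in which it emits them is a guess rather than something the CLI is known to require.

```javascript
// Hypothetical builder for the argv of a scrape call.
// Only flags named in the lesson are used: -f, --country, --mobile, --async.
const FORMATS = ["markdown", "html", "screenshot", "json"];

function scrapeArgs(url, { format = "markdown", country, mobile = false, async: isAsync = false } = {}) {
  if (!FORMATS.includes(format)) {
    throw new Error(`unknown format: ${format}`);
  }
  const args = ["scrape", url, "-f", format];
  if (country) args.push("--country", country); // exit geography, e.g. "us"
  if (mobile) args.push("--mobile");            // mobile render
  if (isAsync) args.push("--async");            // returns a job id to poll
  return args;                                  // meant to be passed as an argv array
}

console.log(scrapeArgs("https://example.com/pricing", { country: "us", async: true }).join(" "));
// scrape https://example.com/pricing -f markdown --country us --async
```

Returning an array rather than a string lets the caller hand it to something like Node's `execFileSync("brightdata", args)`, which avoids building a shell command line out of a URL.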

b

The command


Three views of the same mode: the everyday scrape, the same thing pointed at a bot-walled page, and the async form for something big.

shell · any agent
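# one known page → clean markdown (default format)
brightdata scrape "https://example.com/pricing"

# pick the format explicitly
brightdata scrape "https://example.com/pricing" -f markdown

# a bot-walled page takes the same command: scrape goes through the Web Unlocker
# a big or slow page: --async returns a job id, which you then poll
brightdata scrape "https://example.com/pricing" --async
brightdata status <job-id>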

Default format is markdown. Other modes: brightdata search … · brightdata browser … · brightdata pipelines <type> …

c

FAQ


Why not just use the built-in WebSearch/WebFetch?
Two reasons. First, the built-in tools get blocked by exactly the bot-walls and CAPTCHAs the Web Unlocker is built to beat, so on hard pages they return nothing. Second, and more important, the suite wants one grounding path that behaves identically on every agent. The rule is absolute: brightdata only, never WebSearch/WebFetch, never the MCP.

When should I use scrape, and when pipelines?
If the page is a known platform (a tweet, a Reddit thread, a YouTube video, an Amazon product), prefer pipelines: you get named, typed fields instead of having to parse prose. Use scrape for arbitrary pages that have no dedicated dataset: a docs page, a blog post, a company's pricing page.

How do I keep the cost down?
Use the lightest mode that works (search/scrape for one-offs; pipelines or --async for bulk), and run brightdata budget before a large job to see your balance and spend. Cache findings into research.md so you fetch each fact once, not on every loop turn.

How do I authenticate?
Run brightdata login (it opens a browser) or pass an API key with -k <key>. Configuration (the API URL and the default Web Unlocker zone) lives in brightdata config. Never paste an API key into a prompt or a log; redact it.
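
The advice to cache findings into research.md so each fact is fetched once can be sketched as a lookup-before-append over the file's text. This is an illustration under assumptions, not the suite's actual cache: the helper names `lookupFinding` and `cacheFinding` are hypothetical, and the one-line-per-fact format is modelled on the note used in this lesson's worked example.

```javascript
// Hypothetical cache over the text of research.md: one line per fact, keyed by name.
// Assumed line format: "- <key> = <value> (<source>, pulled <date>)"
function lookupFinding(researchText, key) {
  const prefix = `- ${key} = `;
  const line = researchText.split("\n").find((l) => l.startsWith(prefix));
  return line ? line.slice(2) : null; // drop the leading "- "
}

function cacheFinding(researchText, { key, value, source, pulled }) {
  if (lookupFinding(researchText, key) !== null) return researchText; // already cached
  const entry = `- ${key} = ${value} (${source}, pulled ${pulled})`;
  const sep = researchText === "" || researchText.endsWith("\n") ? "" : "\n";
  return `${researchText}${sep}${entry}\n`;
}

let notes = "";
notes = cacheFinding(notes, { key: "acme-lib latest", value: "4.0.1", source: "acme.dev/releases", pulled: "2026-06-14" });
notes = cacheFinding(notes, { key: "acme-lib latest", value: "4.0.1", source: "acme.dev/releases", pulled: "2026-06-15" });
console.log(notes); // a single line: the second call found the cached entry
```

An agent would call `lookupFinding` before spending a fetch, and only run brightdata when it returns `null`. Deciding when a cached line is too old to trust is a separate question that this sketch leaves out.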
5

Try it: guess, or get real evidence?


Same question asked of two agents. One answers from memory; one runs brightdata and quotes the source. Only one of them is safe to trust.

From memory

Guessing agent

Answers straight from training data. No fetch, no source, no way to know if it is still true.

Grounded

Evidence agent

Runs brightdata, reads the live page, and quotes the exact line back to you.


Both agents are asked the same live-web question. The difference is entirely whether they fetched the answer or invented it.

Guessing — no fetch, no source

function answer(question) {
  // straight from frozen training memory — may be stale
  return model.recall(question);   // confident, possibly wrong
}

Grounded — fetch first, then quote

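function answer(question) {
  // sh() and quote() are placeholders: run a command and return its stdout; pick the line to cite
  const hit = JSON.parse(sh(`brightdata search "${question}" --json`));
  const page = sh(`brightdata scrape "${hit.topUrl}"`);   // topUrl: illustrative field name
  return quote(page);              // the exact live line + its source
}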

A few terms, in plain words

SERP
A search-engine results page — the list of titles, links, and snippets you get back from a query. e.g. what brightdata search returns.
Web Unlocker
Bright Data's proxy layer that renders JavaScript and gets past bot-walls and CAPTCHAs, so a scrape returns the real page. e.g. what lets brightdata scrape read a page a plain fetch can't.
Structured dataset
A purpose-built extractor for one platform that returns named, typed fields instead of raw page text. e.g. brightdata pipelines reddit_posts <url> → a record with title, upvotes, comments.
Grounding
Backing a claim with a real fetched source rather than parametric memory: the Proof Gate instinct, applied to facts. e.g. quoting the scraped line, not "I think it said…".
Parametric memory
What a model "knows" from training — frozen at its cutoff date, with no awareness of what changed since. e.g. why a price or version from memory may be silently stale.
6

Where it plugs into the loop


The CLI isn't a side tool — it feeds three specific points of the loop. LEARN uses it to read real external state. Research (a Forge step) uses it to cache real findings into research.md. And VERIFY uses it as the Proof Gate for facts — pulling real evidence instead of letting a claim stand on memory.

[Diagram: a web fact is needed → brightdata (search / scrape / pipelines) → real evidence, a quotable line → one of three destinations: LEARN (the real state), research.md (cached findings), or the VERIFY gate (proof for facts).]
The same three steps every time — need a fact, run brightdata, get evidence — then it lands in LEARN, in research.md, or in the VERIFY gate.

Each path threads the same three steps — only the destination changes.

7

Anatomy of a pipelines call


When you ask for a structured record, your input travels through a few hand-offs before a clean record comes back. Here is the whole journey, one stage at a time.

What happens on a pipelines call


You hand brightdata pipelines a dataset type (say reddit_posts) and a URL or handle. Behind the scenes the Web Unlocker fetches the live page, a purpose-built extractor for that platform pulls out the fields that matter, and you get back a tidy record — named, typed values instead of a wall of HTML.

Think of it like… handing a messy receipt to an accountant who knows that exact store's layout. You don't read the receipt line by line — they hand back a clean form: date, total, tax, items. The extractor is that accountant, and it already knows the shape of every store it supports.

Under the hood

The catalog is live — brightdata pipelines list prints every supported type. There are 40+: social (x_posts, reddit_posts, instagram_posts, tiktok_posts, facebook_posts), video (youtube_videos), professional (linkedin_person_profile, linkedin_job_listings), commerce (amazon_product, walmart_product, google_shopping), maps/travel (google_maps_reviews, booking_hotel_listings, zillow_properties_listing), finance/news (yahoo_finance_business, reuter_news), and dev (github_repository_file).

Output format is --format json|csv|ndjson|jsonl, and -o <file> writes it straight to disk. For many records, the pipeline can run as a bulk job rather than a single blocking call — the same --async + status pattern as scrape.

In one picture


[Diagram: your call (type + URL) → Web Unlocker (fetch live page) → Extractor (parse fields) → Record (JSON fields). Hand-offs: dispatch, html, parsed; the clean record is handed back to the agent.]
Read left → right: type + URL → Unlocker fetch → extractor parse → a structured record returns along the lower lane.
1 · You name a dataset and a target (~0 ms)

You pick the dataset type that matches the platform — reddit_posts for a Reddit thread — and pass the URL or handle. Nothing has been fetched yet; you've just told the CLI which extractor to use and what to point it at.

2 · The Unlocker fetches the live page (network)

The Web Unlocker routes the request through a real-looking browser and location, runs the page's JavaScript, and gets the actual live content — past any bot-wall. This is the same engine scrape uses; pipelines just points it at a known platform.

3 · A per-platform extractor parses it (~ms)

An extractor that already knows this platform's layout pulls out the meaningful values — title, author, upvotes, timestamp, comments — and drops the ads, chrome, and noise. This is the step that turns a page into fields.

4 · A clean record returns (done)

You get back a structured record — named, typed fields as JSON (or CSV/NDJSON). Your agent reads record.upvotes directly instead of hunting through HTML. That record is the evidence you cache and quote.
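
The four stages above can be sketched as a toy model. Nothing in it calls Bright Data: `runPipeline`, the `extractors` map, `fakeFetch`, and the HTML shape are all made up for the illustration, and the real extractors are certainly more involved. It only shows how a dataset type selects an extractor and how a page turns into fields.

```javascript
// Toy model of the four stages. fetchPage and the extractor are stand-ins,
// and the HTML shape is invented for the example.
const extractors = new Map([
  // stage 3: an extractor that knows one (fake) platform layout
  ["reddit_posts", (html) => ({
    title: /<h1>(.*?)<\/h1>/.exec(html)[1],
    upvotes: Number(/data-upvotes="(\d+)"/.exec(html)[1]),
  })],
]);

function runPipeline(type, target, fetchPage) {
  const extract = extractors.get(type);  // stage 1: the dataset type picks the extractor
  if (!extract) throw new Error(`unsupported dataset type: ${type}`);
  const html = fetchPage(target);        // stage 2: fetch the live page
  return extract(html);                  // stages 3 and 4: parse, then return the record
}

// Stand-in for the network fetch: always returns the same fixed page.
const fakeFetch = () =>
  '<h1>Anyone benchmarked the new CLI?</h1><span data-upvotes="1284"></span><div class="ad">noise</div>';

const toyRecord = runPipeline("reddit_posts", "https://reddit.com/r/…/comments/…", fakeFetch);
console.log(toyRecord.upvotes); // 1284
```

The ad markup in the fake page never reaches the record, which is the practical difference between a page and a set of fields.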

In the code


The whole journey, as the one command you actually type — and how to discover what's available.

shell · any agent
# 0 · discover the live catalog of dataset types
brightdata pipelines list

# 1–4 · name a dataset + a target → a clean record
brightdata pipelines reddit_posts "https://reddit.com/r/…/comments/…" --pretty

# write many records straight to a file
brightdata pipelines amazon_product_search "wireless earbuds" \
  --format csv -o earbuds.csv

Access it yourself

The full toolbox and dataset catalog live in the brightdata-cli skill: cat ~/.claude/skills/brightdata-cli/SKILL.md. The live list of dataset types is always brightdata pipelines list — trust that over any written list, since the catalog grows.

Check your balance and spend before a bulk run with brightdata budget; inspect the configured zones with brightdata zones and brightdata config.

8

Inside a dataset record


So what does a structured record actually look like? Here is a real-shaped record from brightdata pipelines reddit_posts, followed by a note on one field: what it is and how the loop uses it.

brightdata pipelines reddit_posts <url> --pretty
{
  "title": "Anyone benchmarked the new CLI?",
  "author": "u/bench_nerd",
  "upvotes": 1284,
  "num_comments": 73,
  "created_utc": "2026-06-12T08:41:00Z",
  "subreddit": "r/commandline",
  "url": "https://reddit.com/r/…/abc",
  "flair": "Discussion"
}

title

string

The post's headline, already pulled out of the page for you — no HTML parsing.

Loop use: the human-readable thing you quote when this post is your evidence.

Eight named fields. That is the whole point of a structured dataset: no parsing, just record.field.
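
To make "just record.field" concrete, the sketch below parses the record shown above and builds one quotable line from its named fields. The helper name `quotePost` and the wording of the line are hypothetical; the JSON is the sample record from this section, copied as-is.

```javascript
// The sample record from this section, as the JSON text a --pretty call would print.
const raw = `{
  "title": "Anyone benchmarked the new CLI?",
  "author": "u/bench_nerd",
  "upvotes": 1284,
  "num_comments": 73,
  "created_utc": "2026-06-12T08:41:00Z",
  "subreddit": "r/commandline",
  "url": "https://reddit.com/r/…/abc",
  "flair": "Discussion"
}`;

const post = JSON.parse(raw);

// Hypothetical helper: one quotable line built from named fields only, no HTML parsing.
function quotePost(p) {
  return `"${p.title}" by ${p.author} in ${p.subreddit}: ` +
    `${p.upvotes} upvotes, ${p.num_comments} comments (${p.url})`;
}

console.log(typeof post.upvotes); // number
console.log(quotePost(post));
```

Because `upvotes` arrives as a number rather than a string scraped out of markup, the agent can compare or sum it directly.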

[Diagram: raw page (<div><span>…</span>, ads, chrome, noise) → extract → typed fields (title: string, upvotes: number, created_utc: timestamp, url: link) → the agent reads record.upvotes.]
A messy page becomes a handful of typed fields the agent reads by name — no HTML hunting, just evidence.
9

A report from scraped & structured data


Once findings are fetched, the loop reads them as a report, not as a wall of raw output. Here a research run has pulled from several sources; the dashboard shows what each brightdata mode returned, how fresh it is, and what the numbers say.

Research evidence — "is the CLI still the recommended path?"

brightdata · cached into research.md · pulled just now

Status: All sources grounded · last updated just now

Sources pulled: 6 (+2)
Avg fetch: 1.4 s (−0.3 s)
Blocked (unwalled): 0 (Unlocker on)
Records: 31 (+12)

Per-source evidence
Columns: Source · Freshness · Fetched · Mode used

One model, two views

A single list of source objects — each tagging which mode fetched it (search, scrape, pipelines, browser), how stale it is, and the fetch time — drives both the table and the rollup pill. "Blocked = 0" is the point: the Web Unlocker means a fetch doesn't silently fail and leave a hole in your evidence. The freshness badge is what tells the loop whether a cached finding in research.md needs re-pulling.

Why a report, not raw output

Raw scrape output is dozens of pages of markdown. The loop (and the human reading the log) needs the shape of the evidence at a glance: how many sources, how fresh, anything blocked. Status is never colour-only — every freshness pill pairs the hue with a text label and a dot.
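
The "one model, two views" idea can be sketched as a single list that feeds both a rollup and the table rows. Everything about the data shape here is an assumption: the field names (`name`, `mode`, `fetchMs`, `ageHours`, `blocked`, `records`), the sample values, and the 24-hour freshness threshold are invented for the illustration and do not describe the real dashboard.

```javascript
// Hypothetical source objects; field names and values are assumptions for this sketch.
const sources = [
  { name: "acme.dev/releases", mode: "scrape", fetchMs: 1200, ageHours: 0.1, blocked: false, records: 1 },
  { name: "r/commandline thread", mode: "pipelines", fetchMs: 1600, ageHours: 30, blocked: false, records: 12 },
];

// A text label for freshness, so status never depends on colour alone.
function freshness(ageHours, maxAgeHours = 24) {
  return ageHours <= maxAgeHours ? "fresh" : "stale: re-pull";
}

// View 1: the rollup numbers, derived from the list.
function rollup(list) {
  const total = list.length;
  const avgFetchMs = total === 0 ? 0 : list.reduce((sum, s) => sum + s.fetchMs, 0) / total;
  return {
    total,
    avgFetchMs,
    blocked: list.filter((s) => s.blocked).length,
    records: list.reduce((sum, s) => sum + s.records, 0),
  };
}

// View 2: the table rows, derived from the same list.
const rows = sources.map((s) => `${s.name} | ${freshness(s.ageHours)} | ${s.fetchMs} ms | ${s.mode}`);

console.log(rollup(sources)); // { total: 2, avgFetchMs: 1400, blocked: 0, records: 13 }
console.log(rows.join("\n"));
```

Since both views read the same list, the table and the rollup cannot disagree, and a non-zero `blocked` count is visible the moment a fetch fails.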

10

Choosing which mode to reach for


You need a fact. Which of the four modes do you call? There's rarely one "right" answer: each makes a different bargain. Here they are side by side.

A

search

You don't have a URL yet. Run a query and get back a results page of candidate links.

brightdata search "acme-lib releases" \
  --json --pretty

Pros

  • Finds the page when you have no URL.

Cons

  • Returns titles, links, and snippets, not the page itself.
Pick this when: you need to find the source before you can read it.

B

scrape

You know the exact URL. Read it now — past any bot-wall — as clean markdown.

brightdata scrape "https://site/pricing" \
  -f markdown

Pros

  • Real content from any page.
  • Beats CAPTCHAs and 403s.

Cons

  • You must already know the URL.
  • Prose, not named fields.
Pick this when: you have the page and just need what it says.
C

browser

The content only appears after you click, scroll, or wait. Drive a real session step by step.

brightdata browser "https://app/feed" \
  --interactive --full-page

Pros

  • Handles multi-step / logged-in flows.
  • Snapshot the live page tree.

Cons

  • Heaviest and slowest mode.
  • Overkill for a static page.
Pick this when: a one-shot scrape can't reach it because it's behind interaction.
D

pipelines

The source is a known platform. Get back named, typed fields instead of page soup.

brightdata pipelines x_posts \
  "https://x.com/…/status/…" --pretty

Pros

  • Clean fields, no parsing.
  • 40+ platforms; bulk-friendly.

Cons

  • Only for supported platforms.
  • Useless for an arbitrary page.
Pick this when: it's X / Reddit / YouTube / Amazon / LinkedIn / …
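
The trade-offs above can be condensed into a small decision function. This is a sketch of one reasonable ordering, not a rule from the suite: `chooseMode` and its `needsInteraction` option are hypothetical, and the host list is only a sample of the platforms named in this lesson (the authoritative catalog is brightdata pipelines list).

```javascript
// Hypothetical chooser. The platform list is a small sample taken from the lesson.
const PLATFORM_HOSTS = ["x.com", "reddit.com", "youtube.com", "amazon.com", "linkedin.com"];

function chooseMode({ url, needsInteraction = false } = {}) {
  if (!url) return "search";               // no page yet: find candidates first
  if (needsInteraction) return "browser";  // content sits behind a click or a scroll
  const host = new URL(url).hostname.replace(/^www\./, "");
  const isPlatform = PLATFORM_HOSTS.some((h) => host === h || host.endsWith(`.${h}`));
  return isPlatform ? "pipelines" : "scrape"; // named fields if available, else prose
}

console.log(chooseMode({}));                                               // search
console.log(chooseMode({ url: "https://example.com/pricing" }));           // scrape
console.log(chooseMode({ url: "https://www.reddit.com/r/commandline/" })); // pipelines
console.log(chooseMode({ url: "https://example.com/feed", needsInteraction: true })); // browser
```

Matching on the hostname is a deliberate simplification: a supported platform may still have page types with no dataset, in which case scrape remains the fallback.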


11

In the code


The rule isn't just advice in a lesson — it's written into the global config every agent loads. Here is the actual instruction, and where it lives.

~/.claude/CLAUDE.md · the "Tools" + "NUNCA" sections
## NUNCA  (NEVER)
- WebSearch/WebFetch -> CLI brightdata  # (NEVER the MCP mcp__Bright_Data__*)

## Tools
- Web search/scrape: ALWAYS the CLI brightdata
  (search / scrape / browser / pipelines)
  — NEVER WebSearch/WebFetch, NEVER the MCP mcp__Bright_Data__*

Access it yourself

The rule sits in the global instructions: grep -n "brightdata" ~/.claude/CLAUDE.md. The full toolbox is the skill: cat ~/.claude/skills/brightdata-cli/SKILL.md.

Confirm the binary and config: which brightdata (→ /opt/homebrew/bin/brightdata), then brightdata config and brightdata pipelines list for the live dataset catalog. See docs.brightdata.com for the underlying API reference.

12

Worked example: one doubt, grounded


Mid-loop, an Executor is about to write "the library's latest version is 3.2." It is not sure that's current. Instead of letting the claim stand, it grounds it — and that turns a guess into evidence cached for the rest of the run.

# the doubt: is 3.2 really the latest?  → don't guess, fetch.

# 1 · find the canonical source (SERP)
brightdata search "acme-lib releases" --json --pretty
# → top hit: the package's release page

# 2 · read it past any bot-wall (Web Unlocker)
brightdata scrape "https://acme.dev/releases" -f markdown
# → "Latest: 4.0.1 — released 2026-06-09"

# 3 · cache the grounded fact into research.md, with its source
#    acme-lib latest = 4.0.1 (acme.dev/releases, pulled 2026-06-14)

# the claim is now evidence, not memory — and the loop quotes 4.0.1, not 3.2.
That's the whole discipline in three steps: doubt → fetch → quote. The memory said 3.2; the live web said 4.0.1. The loop shipped the truth, and never had to find out the hard way that its memory was a year stale.
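
The comparison at the heart of this example, remembered value against fetched value, can be sketched in a few lines. The helper `groundVersion` is hypothetical, the page text is a made-up stand-in for what a scrape of the release page might return, and the "Latest: X" pattern fits that sample only; a real release page would need its own parsing.

```javascript
// Hypothetical check for the worked example: compare a remembered version with the
// one quoted on the fetched page. The "Latest: X" pattern matches the sample text only.
function groundVersion(remembered, pageText) {
  const match = /Latest:\s*(\d+(?:\.\d+)*)/.exec(pageText);
  if (!match) return { grounded: false, version: null, memoryWasStale: null, quote: null };
  const line = pageText.split("\n").find((l) => l.includes(match[0]));
  return {
    grounded: true,
    version: match[1],                       // what the live page says
    memoryWasStale: match[1] !== remembered, // did memory disagree with the page?
    quote: line.trim(),                      // the exact line to cite
  };
}

// Stand-in for the markdown a scrape might return (made up for this sketch).
const page = "# acme-lib releases\n\nLatest: 4.0.1\nReleased 2026-06-09\n";

console.log(groundVersion("3.2", page));
// { grounded: true, version: '4.0.1', memoryWasStale: true, quote: 'Latest: 4.0.1' }
```

When no version line is found the function reports `grounded: false` instead of falling back to the remembered value, which keeps a failed fetch from quietly turning back into a guess.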
13

Quick check: did the model land?


Four quick questions to check that the model landed.

An agent needs a price that may have changed since training. What should it do?

You have a Reddit thread URL and want upvotes and comment count. Which mode?

Which loop points does brightdata feed?

Why does the suite forbid WebSearch/WebFetch and the Bright Data MCP outright?

You are not done learning here — I am your teacher for this. Ask me to re-run the guess-vs-evidence demo with a different question, to walk a pipelines call for a platform you actually use, or to map which mode you'd reach for in your own workflow. Next up: Computer Use CLI — the non-blocking, accessibility-only way an agent drives native macOS apps.