
Cloudflare just made web crawling stupidly simple

A new /crawl endpoint for their Browser Rendering service


So Cloudflare dropped something quietly interesting yesterday — a new /crawl endpoint for their Browser Rendering service. And I think it's worth understanding what's actually happening here, because on the surface it sounds like a minor API addition, but if you think through the implications it's kind of a big deal.

Let me explain from scratch.


The old way was genuinely painful

If you've ever tried to scrape or crawl a website programmatically, you know the drill. You'd need to:

  1. Spin up a headless browser (Puppeteer, Playwright, whatever)

  2. Manage that browser process yourself — memory, crashes, timeouts

  3. Write logic to discover links, follow them, deduplicate visited URLs

  4. Handle JavaScript-rendered content (because most modern sites need a real browser to actually load)

  5. Parse the content into whatever format you actually need

  6. Scale all of this if you wanted more than one page at a time

That's a lot of infrastructure for what is conceptually a simple problem: "give me the content of this website."
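
For a sense of what that meant in practice, here's a minimal sketch of steps 1 through 3 with Puppeteer. It still punts entirely on crashes, timeouts, and scaling:

import puppeteer from "puppeteer";

// A bare-bones single-process crawler: launch a browser, follow
// same-origin links breadth-first, and deduplicate visited URLs.
// Crash recovery, timeout handling, and scaling are all still on you.
async function crawl(startUrl: string, maxPages = 20): Promise<Map<string, string>> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const visited = new Map<string, string>(); // url -> rendered HTML
  const queue = [startUrl];
  const origin = new URL(startUrl).origin;

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;

    await page.goto(url, { waitUntil: "networkidle0" });
    visited.set(url, await page.content());

    // Discover links on the page and enqueue unseen same-origin ones.
    const links = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    for (const link of links) {
      if (link.startsWith(origin) && !visited.has(link)) queue.push(link);
    }
  }

  await browser.close();
  return visited;
}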

People have been building companies around this problem for years. Firecrawl, Apify, and others exist specifically because getting content out of websites at scale is annoying enough that developers will pay someone else to deal with it.


What Cloudflare built

The new /crawl endpoint is dead simple in concept. You POST a URL, you get a job ID back. Then you poll (or wait) for the results. Cloudflare handles everything in between — discovering links, rendering pages in a headless browser, extracting content, and packaging it all up.

The output can be HTML, Markdown, or structured JSON. The JSON path uses Workers AI under the hood, which is interesting in itself.

Here's the whole thing to kick off a crawl:

curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https://example.com/" }'

You get a job ID. Then:

curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'

That's it. No browser management. No link discovery logic. No rendering pipeline.
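
If you'd rather stay in TypeScript, that first call is one fetch. This is a sketch: I'm assuming the job ID comes back under result.id, following Cloudflare's usual { success, result, errors } response envelope, so verify against the docs:

const ACCOUNT_ID = "your-account-id";
const API_TOKEN = process.env.CF_API_TOKEN!;
const BASE = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/browser-rendering/crawl`;

// Kick off a crawl job and return its ID.
async function startCrawl(url: string): Promise<string> {
  const res = await fetch(BASE, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url }),
  });
  const data = await res.json();
  // Assumption: the job ID lives at result.id in the standard
  // Cloudflare { success, result, errors } envelope.
  return data.result.id;
}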


Why async makes sense here

The crawl jobs run asynchronously, which is the right call. You're not going to crawl a hundred pages in a single HTTP request — that'd time out. Instead, you kick it off, get a job ID, and poll with ?limit=1 to check status without pulling down all the results at once. Once the job's done, you fetch everything.

Jobs can run for up to seven days before hitting a timeout ceiling, which covers even pretty large sites.

There's also pagination built in — if results exceed 10MB, you get a cursor to fetch the next page. Again, sensible defaults for a use case that can balloon in size quickly.
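
Put together, a polling loop might look like this, reusing BASE and API_TOKEN from the sketch above. The ?limit=1 trick and the pagination cursor come straight from the announcement; the exact status values and field names are my guesses, so treat them as placeholders:

// Poll a crawl job until it finishes, then page through the results.
async function waitForResults(jobId: string): Promise<unknown[]> {
  // Poll cheaply with ?limit=1: check status without downloading
  // the full result set on every tick.
  while (true) {
    const res = await fetch(`${BASE}/${jobId}?limit=1`, {
      headers: { Authorization: `Bearer ${API_TOKEN}` },
    });
    const data = await res.json();
    if (data.result.status === "completed") break; // assumed status value
    await new Promise((r) => setTimeout(r, 5_000)); // wait 5s between polls
  }

  // Fetch everything, following the cursor that shows up when a
  // result page exceeds the 10MB ceiling.
  const pages: unknown[] = [];
  let cursor: string | undefined;
  do {
    const url = `${BASE}/${jobId}` + (cursor ? `?cursor=${cursor}` : "");
    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${API_TOKEN}` },
    });
    const data = await res.json();
    pages.push(...data.result.pages); // assumed field name
    cursor = data.result.cursor;      // undefined once we hit the last page
  } while (cursor);

  return pages;
}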


The customization surface is solid

You're not stuck with a basic depth-first crawl either. There are parameters for:

  • Crawl depth — how many link hops to follow

  • Page limits — cap it at N pages

  • URL patterns — wildcard includes/excludes so you don't crawl every blog category archive if you only care about the actual posts

  • modifiedSince / maxAge — incremental crawling, so you're not re-fetching content you already have

  • Static mode — skip JavaScript rendering entirely for plain HTML sites, which is faster and cheaper

  • Custom user agents — set your own UA string, useful when the target site serves different content to bots than to regular browsers

The modifiedSince option is particularly useful for monitoring use cases. If you're watching a documentation site for changes, you can crawl just what's new rather than re-ingesting everything each time.
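
A request that uses several of these knobs might look like the sketch below. modifiedSince is named in the announcement; the other parameter spellings here are illustrative guesses, so check the API reference for the real schema:

// Reuses BASE and API_TOKEN from earlier. Parameter names other than
// url and modifiedSince are illustrative, not confirmed.
const body = {
  url: "https://example.com/docs/",
  depth: 3,                               // follow links up to 3 hops deep
  limit: 200,                             // cap the crawl at 200 pages
  include: ["https://example.com/docs/*"],
  exclude: ["https://example.com/docs/archive/*"],
  modifiedSince: "2025-01-01T00:00:00Z",  // incremental: skip unchanged pages
  static: false,                          // keep JS rendering on
  userAgent: "docs-monitor/1.0",
};

const res = await fetch(BASE, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(body),
});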


It respects robots.txt, for better or worse

The crawler respects robots.txt directives, including crawl delays. Pages that are disallowed show up in the response with "status": "disallowed" rather than just silently being skipped, which is helpful for debugging.

There's also an interesting catch: if the site you're crawling uses Cloudflare's own bot protection products — WAF, Bot Management, Turnstile — those rules apply to the Browser Rendering crawler too. You'd need to create a WAF skip rule to allow your own crawls through. Which is slightly funny if you're crawling your own site and forgot you turned on bot protection.


What this is actually for

The announcement specifically calls out three use cases: training models, RAG pipelines, and content research/monitoring.

That framing tells you something about where the demand is coming from. Right now a huge chunk of developer energy is going into building AI-powered things that need to consume web content — knowledge bases, research tools, competitive intelligence dashboards, documentation ingestion pipelines. All of that needs a reliable way to get content off websites in a clean, structured format.

Markdown output in particular is tailored for this. LLMs consume Markdown well. If you're building a RAG system on top of someone's documentation, you want Markdown, not a pile of raw HTML with nav menus and cookie banners mixed in.
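
Once you have Markdown per page, ingestion is mostly chunking. Here's a naive sketch that splits each page on top-level headings before you embed and store the pieces; real pipelines get fancier, but this is the shape of it:

// Split a crawled Markdown page into heading-delimited chunks,
// each tagged with a stable ID for embedding and retrieval.
function chunkMarkdown(markdown: string, sourceUrl: string) {
  return markdown
    .split(/^(?=#{1,2} )/m)               // break at each H1/H2 boundary
    .map((text) => text.trim())
    .filter((text) => text.length > 0)
    .map((text, i) => ({ id: `${sourceUrl}#${i}`, text }));
}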


The competitive angle worth noticing

Here's the thing that the Hacker News crowd immediately pointed out: Cloudflare is the company whose products many developers use to block scrapers. And now Cloudflare is selling a scraper.

It's a bit of a paradox. If you're behind Cloudflare's WAF and bot protection, unauthorized crawlers get blocked. But if you pay Cloudflare for Browser Rendering, your crawls go through. Some people read this as Cloudflare positioning itself as the gatekeeper — pay them to crawl, or your crawler gets blocked.

I think that's partially true but also slightly uncharitable. The more neutral reading is that Cloudflare already runs browsers at scale for other reasons, and adding a crawl endpoint on top is a natural extension. They have the infrastructure sitting there. Whether it creates a weird market dynamic is a separate question.


Bottom line

If you're building something that needs to consume website content — AI pipelines, research tools, monitoring systems — this is genuinely useful. The hard parts (browser management, link discovery, rendering, content extraction) are handled for you. The output formats are sensible. The customization options cover the real edge cases.

Whether Cloudflare's the right vendor for your specific situation depends on your trust model and existing stack. But as a piece of developer tooling, they've taken something that used to require real infrastructure effort and collapsed it into two API calls.

That's usually worth paying attention to.