----
url: https://webcrawlerapi.com/changelog/2025-04-07-webcrawler-trial
----

[WebCrawler API](https://webcrawlerapi.com)

[Pricing](/#pricing)[Docs](/docs/getting-started)[Blog](/blog)[Sign in](https://dash.webcrawlerapi.com/sign-in)[Sign Up](https://dash.webcrawlerapi.com/sign-up "WebcrawlerAPI Dashboard")

[Back to Changelog](/changelog)

April 7, 2025

✨ New: $10 Trial Balance for WebcrawlerAPI 💫

=============================================

We're excited to announce that all new WebcrawlerAPI accounts now receive a $10 evaluation balance for a 7-day trial period! This initiative allows new users to thoroughly test our API capabilities without any upfront commitment.

### What's included:

*   $10 trial funds automatically added to new accounts

*   Complete API access during 7-day evaluation period

*   Start immediately with no credit card required

*   Full access to all standard API features

The new trial balance makes it easier than ever to evaluate WebcrawlerAPI and test its capabilities for your projects.

----
url: https://webcrawlerapi.com/changelog/2025-03-28-proxy-management
----

March 28, 2025

Integrated Proxy Management System

==================================

Major Update 🚀

*   Integrated proxy management system:

    *   All proxies are now handled internally

    *   Included in the standard pricing

    *   Significantly improved success rates

    *   Enhanced protection against anti-bot measures

    *   **No additional setup required from users**

----
url: https://webcrawlerapi.com/blog/how-to-build-a-web-crawler
----

How to Build a Web Crawler

==========================

Learn the basics of building a web crawler from scratch. This guide covers key components, planning steps, common challenges, and best practices in simple terms.

Written by Andrew

Published on Feb 6, 2026

### Table of Contents

*   [How to Build a Web Crawler](#how-to-build-a-web-crawler)

*   [What is a Web Crawler?](#what-is-a-web-crawler)

*   [Key Parts of Every Web Crawler](#key-parts-of-every-web-crawler)

*   [Fetcher (The Scraper)](#fetcher-the-scraper)

*   [Parser (The Reader)](#parser-the-reader)

*   [URL Manager (The To-Do List)](#url-manager-the-to-do-list)

*   [Storage (The Memory)](#storage-the-memory)

*   [Planning Your Web Crawler](#planning-your-web-crawler)

*   [Set Clear Goals](#set-clear-goals)

*   [Know Your Target Websites](#know-your-target-websites)

*   [Decide on Depth and Scope](#decide-on-depth-and-scope)

*   [Common Challenges You Will Face](#common-challenges-you-will-face)

*   [Rate Limits and Being Polite](#rate-limits-and-being-polite)

*   [Pages That Need JavaScript](#pages-that-need-javascript)

*   [Anti-Bot Protection](#anti-bot-protection)

*   [Best Practices for a Good Web Crawler](#best-practices-for-a-good-web-crawler)

*   [Have a Dashboard](#have-a-dashboard)

*   [Respect robots.txt Rules](#respect-robotstxt-rules)

*   [Handle Errors Without Crashing](#handle-errors-without-crashing)

*   [Avoid Duplicate Pages](#avoid-duplicate-pages)

*   [Start simple](#start-simple)

*   [Scale Step by Step](#scale-step-by-step)

*   [Build It Yourself or Use an API?](#build-it-yourself-or-use-an-api)

*   [Summary](#summary)

How to Build a Web Crawler

==========================

Hi, I'm Andrew. I'm a software engineer with fifteen years of experience. I've been building WebCrawler API for more than two years, and I will show you the problems and the work involved if you want to build your own web crawler. By the end you will understand how much work it takes to build a crawler, and whether it is worth it or whether it is better to use an existing API.

What is a Web Crawler?

----------------------

A web crawler is a tool that automates the process of getting data from websites. It can be any kind of data: page content, headers, tags, SEO information, links, images, etc. The difference between a scraper and a crawler is that a scraper only gets data from a single page and returns the result. A web crawler starts from a seed page, then processes and follows all of its links to get information from the linked pages too. Like a spider crawling a web. **That's why it is called a web crawler.**

Key Parts of Every Web Crawler

------------------------------

Every web crawler contains the same vital parts.

1.  Fetcher. The part that makes requests and receives content.

2.  Parser (Reader). The part responsible for processing the content that the Fetcher retrieved. It extracts the required information, for example images or a blog post article. It also extracts links.

3.  URL manager. The part responsible for managing URLs. It prepares URLs for further fetching and parsing.

4.  Storage. The part that keeps crawl state and extracted data.

### Fetcher (The Scraper)

Fetcher is the part that actually does the network job: makes the request, downloads the response, and returns raw content to the next steps. If you think about a crawler as a small browser, the Fetcher is your browser tab.

Here is a tiny fetcher example with a few headers:

    // Node 18+

    // Idea: fetch HTML with a few headers.

    // Real crawlers also need timeouts, retries, and backoff.

    export async function fetchPage(url) {

      const res = await fetch(url, {

        redirect: "follow",

        headers: {

          "user-agent": "MyCrawler/1.0",

          "accept-language": "en-US,en;q=0.9",

        },

      });

      const html = await res.text();

      return { url: res.url, status: res.status, html };

    }

In real life Fetcher is not just fetch(url). You will need to decide a lot of things:

*   What HTTP client you use (Node fetch, axios, undici, etc.)

*   How you set headers (User-Agent, Accept-Language, cookies)

*   How you handle redirects (follow, stop, max redirect count)

*   How you handle timeouts and retries (and when you should NOT retry)

*   How you limit concurrency (so you don't DDoS the website or kill your own server)
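To make the timeout and retry decisions above more concrete, here is a minimal sketch. The names `backoffDelayMs`, `isRetryableStatus`, and `fetchWithRetry`, and the default numbers, are my assumptions for illustration and not part of any library. It assumes Node 18+ (global `fetch` and `AbortSignal.timeout`).

```javascript
// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30_000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Statuses worth retrying: rate limits and temporary server errors.
function isRetryableStatus(status) {
  return status === 429 || (status >= 500 && status < 600);
}

// fetchImpl is injectable so the logic can be exercised without a network.
async function fetchWithRetry(
  url,
  { fetchImpl = fetch, maxAttempts = 3, timeoutMs = 10_000 } = {}
) {
  let lastError = null;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetchImpl(url, {
        redirect: "follow",
        signal: AbortSignal.timeout(timeoutMs), // abort slow responses
      });
      // 2xx, 3xx, 404 and similar: return as-is, retrying will not help.
      if (!isRetryableStatus(res.status)) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network error or timeout
    }
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts - 1) {
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
  throw lastError;
}
```

A 404 comes back on the first attempt, while a 503 or a timeout is retried up to `maxAttempts` times with a growing delay.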

Also you will meet websites that work only with JavaScript. For simple HTML pages a normal HTTP request is enough. But if the content is rendered in the browser, you need a headless browser (Playwright/Puppeteer) or some rendering service. This is the moment when a crawler becomes expensive: CPU, memory, and time per page go up a lot.

One more thing: fetchers are the first place where you fight anti-bot. You will see blocks, CAPTCHAs, 403/429, weird redirects, and sometimes just empty HTML. So you need good logging (request id, status code, response size) and you need to store some debug info (headers, final URL) to understand what is going on.

### Parser (The Reader)

Parser is the part that takes the raw response from the Fetcher and turns it into structured data. For HTML it means: read the document, find important blocks, extract fields, and extract links for the next crawl.

A simple HTML parser example (extract title + links) using cheerio:

    import * as cheerio from "cheerio";

    export function parseHtml({ url, html }) {

      const $ = cheerio.load(html);

      const title = $("title").text().trim() || null;

      // Idea: collect links and resolve relative URLs.

      // Real code must skip junk links and handle invalid URLs.

      const links = $("a[href]")

        .map((_, a) => new URL($(a).attr("href"), url).toString())

        .get();

      return { title, links };

    }

The biggest mistake is to think that Parser is only “select some CSS selectors and done”. In practice pages are messy:

*   HTML is broken, tags not closed, weird nesting

*   Text is mixed with navigation, ads, cookie banners

*   Same site can have multiple templates for different pages

*   Encoding can be wrong (UTF-8 vs something else)

So you usually implement parsing in layers. First: detect content type (HTML, JSON, PDF, image) and choose parser. Second: clean the HTML (remove scripts/styles, normalize whitespace). Third: extract what you need (title, main content, meta tags, headings, images, etc.). Fourth: convert it to the format you need (markdown, JSON, text, etc.)

For a crawler, the most important output of the Parser is links. You need to find all <a href> (and sometimes src, like on an iframe), resolve relative URLs, remove junk (mailto, tel, javascript links), and normalize. If you don't normalize, you will crawl duplicates forever: /page, /page/, /page?utm=....
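The link cleanup described above could look roughly like this. It is a sketch, not a complete filter: `cleanLinks` is a hypothetical helper name, and it only keeps `http:` and `https:` links, which is how it drops mailto, tel, and javascript links.

```javascript
// hrefs: raw href attribute values from the page.
// baseUrl: the URL the page was fetched from (used to resolve relative links).
function cleanLinks(hrefs, baseUrl) {
  const out = new Set(); // Set removes exact duplicates on the same page
  for (const href of hrefs) {
    let u;
    try {
      u = new URL(href.trim(), baseUrl); // resolves relative links
    } catch {
      continue; // invalid URL: skip it instead of failing the whole page
    }
    // Keep only web links; this drops mailto:, tel:, javascript:, data:, etc.
    if (u.protocol !== "http:" && u.protocol !== "https:") continue;
    u.hash = ""; // the fragment never changes the fetched document
    out.add(u.toString());
  }
  return [...out];
}
```

With cheerio you could feed it the raw attribute values, for example `cleanLinks($("a[href]").map((_, a) => $(a).attr("href")).get(), url)`.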

A good Parser also produces signals for the URL manager and storage: canonical URL, robots meta tags, noindex/nofollow, content hash, detected language, and errors. This is how you keep the crawler stable when you scale from 100 pages to millions.

### URL Manager (The To-Do List)

URL Manager is the brain of the crawler. Fetcher and Parser can be perfect, but if you manage URLs wrong you will waste money and time. This part decides which URL you crawl next, what you skip, and how you avoid infinite loops.

In simplest version it's just a queue: push seed URL, pop URL, fetch, parse, push new links. But real websites will break this naive approach immediately.

Here is a tiny (but practical) URL normalization + dedupe loop you can build on:

    export function normalizeUrl(input) {

      const u = new URL(input);

      u.hash = "";

      return u.toString();

    }

    export async function crawl(seedUrl, { fetcher, parser, maxPages = 100 } = {}) {

      const seen = new Set();

      const queue = [normalizeUrl(seedUrl)];

      while (queue.length && seen.size < maxPages) {

        const url = queue.shift();

        if (seen.has(url)) continue;

        seen.add(url);

        const { html } = await fetcher(url);

        const { links } = parser({ url, html });

        for (const link of links) queue.push(normalizeUrl(link));

      }

      return { pagesCrawled: seen.size };

    }

Things that the URL Manager usually does:

*   Deduplication: don't crawl the same URL again and again

*   Redirects: many links redirect to the same page

*   Normalization: make URL consistent (/page vs /page/, remove #hash, sort query params)

*   Filtering: skip mailto:, tel:, javascript:, logout links, calendar pages, etc.

*   Scope rules: stay inside domain/subdomain/path, limit depth, limit total pages

*   Priorities: home page first, then important pages, then everything else

*   Depth: how deep to follow the links

A very important problem is canonicals and redirects. You can fetch URL A, it redirects to URL B, and the HTML says the canonical is URL C. If you don't unify this, you will store duplicates and your crawl graph becomes garbage. So the URL Manager usually keeps a mapping: requested URL -> final URL -> canonical URL.
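A small sketch of how that mapping can be collapsed into one storage key. `primaryUrl` is a hypothetical helper, and the rule "only trust a canonical on the same host" is my assumption; your policy can be different.

```javascript
// Decide under which URL a page is stored.
// requestedUrl: what was in the queue; finalUrl: where redirects ended;
// canonicalHref: value of <link rel="canonical"> if the page has one.
function primaryUrl({ requestedUrl, finalUrl, canonicalHref }) {
  const final = finalUrl || requestedUrl;
  if (!canonicalHref) return final;
  try {
    const canonical = new URL(canonicalHref, final); // canonical can be relative
    // Assumption: ignore canonicals that point to another host.
    if (canonical.host !== new URL(final).host) return final;
    canonical.hash = "";
    return canonical.toString();
  } catch {
    return final; // broken canonical value: fall back to the final URL
  }
}
```

Storing `requestedUrl`, `finalUrl`, and the result of `primaryUrl` together keeps the full chain available for debugging.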

Also you need strategy for failures. Some URLs should be retried (temporary network issue), some should be dropped (404), some should be paused (429 rate limit). This logic lives here too, because it controls the queue.
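One way to express this failure strategy is a small decision function that the queue calls after every attempt. The action names (`"done"`, `"retry"`, `"drop"`, `"pause_host"`) and the `nextAction` helper are illustrative assumptions.

```javascript
// attempts: how many attempts were already made for this URL, including the last one.
function nextAction({ status = null, networkError = false, attempts = 1, maxAttempts = 3 }) {
  if (status !== null && status >= 200 && status < 400) return "done";
  if (status === 429) return "pause_host"; // rate limited: pause the whole host
  // Temporary problems: network failures and 5xx responses.
  const temporary = networkError || (status !== null && status >= 500);
  if (temporary && attempts < maxAttempts) return "retry";
  return "drop"; // 404, 403, or retries exhausted
}
```

The URL Manager can then update the URL status and the attempts counter based on the returned action, so the retry policy lives in one place.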

When the crawler grows, the URL Manager becomes a real system: database table, indexes, statuses, attempts counter, and maybe a distributed queue. This is why people underestimate crawling: it's not “fetch pages”, it's “manage millions of URLs safely”.

### Storage (The Memory)

Storage is where a crawler stops being a script and becomes a product. Because if you can fetch and parse, but you can't store results correctly, you basically did nothing.

You usually store two types of data:

1.  Crawl state (operational data): URL statuses, attempts, next run time, last seen, redirect/canonical mapping.

2.  Extracted data (business data): page HTML snapshot (optional), cleaned text, markdown, metadata, links, images, SEO fields, etc.

The first one must be reliable and fast for updates. This is usually Postgres or Redis + Postgres. If you lose crawl state you will re-crawl same pages and burn resources.

The second one depends on your use case. If you need search, you probably want a search index. If you need analytics, you might want columnar storage later. If you only need “give me parsed content”, Postgres can be enough for a long time. But storing content only in the DB can become very expensive, so you have to think about moving it to file storage, like Cloudflare R2 or Amazon S3, and saving only a link in the main DB.

Important details that people miss:

*   Versioning: your parser will change, so store parser\_version to reprocess old data

*   Idempotency: same URL processed twice should not create duplicates

*   Raw vs processed: sometimes you need to keep raw HTML for debugging or re-parsing

*   Size limits: pages can be huge, don't save everything forever by default

*   Link graph: storing edges (from -> to) is expensive, but it's super useful

And last: storage is where you answer “how do I debug this?” When a user reports that “my crawler missed my page”, you need to open the record and see: request URL, final URL, status, response size, parse error, extracted links. If you don't store this, you will be blind.

Planning Your Web Crawler

-------------------------

Before you write code, do small planning. It will save you a lot of time because crawler problems are usually product decisions, not technical bugs.

### Set Clear Goals

Define what you want to extract and what “done” means. If you skip this step, you will build crawler that can do everything and it will cost you a lot.

Start from output. Do you need full HTML, cleaned text, markdown, screenshots, SEO fields, structured data (JSON-LD), only links, or something else? Every extra field increases complexity because you need parsing rules, storage, and support/debug later.

If change tracking is the goal (new posts, new changelog entries), an [RSS feed](/blog/convert-any-website-to-rss-feed) output can be a very practical “done” definition.

Then define constraints:

*   Freshness: one-time crawl or re-crawl every day/week?

*   Quality: is “good enough” OK or do you need 99% accuracy?

*   Performance: how many pages per minute you expect?

*   Budget: how much money per 1k pages is acceptable?

And define failure rules. For example: skip pages that require login, skip pages with CAPTCHA, stop at 429, or retry 3 times. This sounds boring, but it's exactly how you avoid endless edge cases.

### Know Your Target Websites

Check how websites behave (static vs JS, robots rules, rate limits) so you don't build wrong fetcher/parsing stack.

And here is the main difference in real life. You can have two very different use cases:

1.  Small known list of websites. Final list is fixed and rarely changes. This is easier because you can test every website in advance, understand if you need JavaScript rendering, and tune parsing rules per site. Usually this list comes from you, or from customer who knows exactly what they want.

2.  Unpredictable / user-provided websites. URLs come from users and list is changing all the time. This is much harder because you must predict everything: broken HTML, heavy JS apps, redirects chains, weird encodings, anti-bot, random downtime. In this case you plan more around safety: limits, fallbacks, retries, good logs, and clear error messages.

### Decide on Depth and Scope

Decide how far you go from seed URLs and where you stop, otherwise crawl will grow forever. This is not theory. If you crawl without limits you will hit crawl traps like calendars, endless filters, infinite pagination, and you will waste weeks.

Depth is about how many link hops you allow. For example:

*   Depth 0: only seed URL

*   Depth 1: all links from seed

*   Depth 2+: links from links, etc.

Scope is about what URLs are allowed. Typical scope rules:

*   Domain: only example.com (or allow subdomains)

*   Path prefix: only /blog/ and skip /admin/

*   Query params: allowlist important params, drop tracking (utm\_\*, fbclid)

*   Content types: HTML only, or also PDFs/images
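The domain and path rules above can be checked with one small function before a URL goes into the queue. This is a sketch: `inScope` and its option names are hypothetical, and query param rules are left out.

```javascript
// Returns true when the URL is allowed by the crawl scope.
function inScope(url, { host, allowSubdomains = false, pathPrefix = "/", blockedPrefixes = [] }) {
  const u = new URL(url);
  const hostOk =
    u.hostname === host ||
    (allowSubdomains && u.hostname.endsWith("." + host)); // blog.example.com, not notexample.com
  if (!hostOk) return false;
  if (!u.pathname.startsWith(pathPrefix)) return false; // e.g. only /blog/
  return !blockedPrefixes.some((p) => u.pathname.startsWith(p)); // e.g. skip /admin/
}
```

For example, `inScope(link, { host: "example.com", pathPrefix: "/blog/" })` keeps the crawl inside the blog section of one domain.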

Also set crawl budget. Even for “crawl entire site” you still need maximums:

*   Max pages per job (hard stop)

*   Max time per job (so it doesn't run forever)

*   Max depth (so it doesn't go too deep)

And define stop conditions. For example: stop when the queue is empty, stop after N consecutive errors, stop if too many pages look like duplicates, stop if you are rate limited for too long.
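Depth and page budget are easy to see in a breadth-first traversal. In this sketch `planCrawl` is a hypothetical function and `getLinks` is injected, so the logic can run against an in-memory link graph instead of real pages.

```javascript
// getLinks(url) returns the outgoing links of a page.
function planCrawl(seedUrl, getLinks, { maxDepth = 2, maxPages = 100 } = {}) {
  const seen = new Set([seedUrl]);
  const queue = [{ url: seedUrl, depth: 0 }];
  const visited = [];
  // Stop when the queue is empty or the page budget is used up.
  while (queue.length > 0 && visited.length < maxPages) {
    const { url, depth } = queue.shift();
    visited.push({ url, depth });
    if (depth >= maxDepth) continue; // do not expand links past the depth limit
    for (const link of getLinks(url)) {
      if (seen.has(link)) continue;
      seen.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return visited;
}
```

With `maxDepth: 0` only the seed is visited, and with `maxDepth: 1` the seed plus all links from it, which matches the depth levels listed earlier.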

In practice I recommend starting conservative. For the first run use depth 1-2, only the same domain, and drop most query params. Then look at the results and expand the rules slowly. This is exactly how you avoid “my crawler downloaded 2 million URLs and 95% is garbage”. It all depends on your use case, of course.

Common Challenges You Will Face

-------------------------------

If you build crawler once, you will meet all of this. It doesn't matter what language you use. Most problems are not “bug in code”, it's reality of the web: websites protect themselves, websites are slow, websites are inconsistent and buggy.

Below I list the most common challenges. You can solve all of them, but every solution adds cost and complexity, so it is better to know it early.

### Rate Limits and Being Polite

Rate limits are the first thing you will hit when you crawl something bigger than 50 pages. Websites don't want you to send 200 requests per second. Even if they don't have strict rules, their servers are not prepared for that.

Sometimes you will see open source crawlers that promise “10k pages per second”. If it was true for real websites, they would be blocked in seconds by anti-bot protections. High throughput is possible only in controlled environment (your own websites) or with a lot of expensive infrastructure.

This is why I implemented limits in WebcrawlerAPI and the notion “[politeness](https://webcrawlerapi.com/blog/web-scraping-ethics)” exists. It means you crawl in a way that looks like normal user traffic:

*   Limit concurrency per host (for example 1-5 requests at the same time)

*   Add delay between requests (random jitter is good)

*   Respect robots.txt crawl-delay (if present)

*   Use caching for repeated resources when possible

Also you need to handle responses like 429 (Too Many Requests). If you ignore it and continue, you will get blocked. So typical behavior is:

*   backoff and retry later

*   reduce concurrency

*   read Retry-After header when it exists
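Reading `Retry-After` needs a bit of care because the header can be either a number of seconds or an HTTP date. A minimal sketch (the helper name `retryAfterMs` and the 60 second fallback are my assumptions):

```javascript
// Convert a Retry-After header value into a wait time in milliseconds.
function retryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 60_000) {
  if (headerValue == null || headerValue.trim() === "") return fallbackMs;
  const value = headerValue.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000; // form 1: delay in seconds
  const dateMs = Date.parse(value); // form 2: HTTP date
  if (Number.isNaN(dateMs)) return fallbackMs; // unreadable value
  return Math.max(0, dateMs - nowMs); // never return a negative wait
}
```

After a 429 you can call `retryAfterMs(res.headers.get("retry-after"))` and pause that host for the returned time.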

Important detail: rate limits are not only about being nice. It's also stability for your own system. If you start 1000 requests, you will run out of sockets, memory, and CPU and your crawler will crash.

So even in perfect world with no anti-bot, you still need rate limiting. This is why URL manager and fetcher must work together: queue + per-host scheduler.

One simple pattern is “per-host queue + delay”. It is not perfect, but it prevents you from blasting a single domain:

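    const nextAt = new Map(); // host -> earliest start time (unix ms) of the next request

    const GAP_MS = 500; // minimum gap between requests to the same host

    export async function politeFetch(url) {
      const { host } = new URL(url);
      // Idea: keep a small gap between requests to the same host.
      // Reserve the slot before waiting, so concurrent calls to the same host
      // queue up one after another instead of all firing after the same delay.
      const startAt = Math.max(Date.now(), nextAt.get(host) ?? 0);
      nextAt.set(host, startAt + GAP_MS);
      const waitMs = startAt - Date.now();
      if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
      return fetch(url);
    }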

### Pages That Need JavaScript

This is the moment when many crawler projects die. Because HTML you get from normal HTTP request is not always the HTML user sees in the browser.

Modern websites often ship empty shell and then render content with JavaScript. So you request page, and response body looks like:

*   a <div id="root"></div>

*   a bunch of scripts

*   and no real content

You have a few options:

1.  Try to find API behind the page. Many sites load data from JSON endpoint. If you can call it directly it will be faster and cheaper than browser rendering.

2.  Use headless browser (Playwright/Puppeteer). It works for most cases because it executes JS like real browser. But it is expensive: each page needs CPU + memory, it is slower, and it is much easier to detect and block.

3.  Hybrid approach. Start with a normal fetch. If you detect “empty content” or you don't find required fields, fall back to browser rendering only for those pages. However, this can also be tricky: websites can load a content frame (template) first and then use JS to download the specific data.

In planning you should decide how much JS you can afford. If you render everything in browser, your 1k pages crawl becomes minutes/hours instead of seconds. Also your infra cost will be much higher.

And one more problem: JS pages are not deterministic. You will deal with timeouts, loading spinners, cookie banners, late-rendered content, and A/B tests. So you need time budgets and clear rules like “wait for selector X” or “wait network idle”, otherwise crawler will hang forever.

A common hybrid pattern: try plain fetch first, then fall back to Playwright only if the content looks empty:

    // Idea: if plain HTML looks empty, render in a browser.

    // Real code uses Playwright/Puppeteer and reuses browser instances.

    async function renderInBrowser(url) {

      // open headless browser, goto url, return rendered HTML

      return "";

    }

    export async function fetchHtmlSmart(url) {

      const html = await fetch(url).then((r) => r.text());

      if (html.length > 2_000) return html;

      return renderInBrowser(url);

    }
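Raw HTML length is a weak signal, because an empty shell can ship kilobytes of scripts. A slightly better heuristic is to measure the visible text. This is only a sketch: the helper names and the 200 character threshold are assumptions, and regex-based tag stripping is rough compared to a real HTML parser.

```javascript
// Rough count of visible characters: drop scripts/styles, then all tags.
function visibleTextLength(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim().length;
}

// True when the page has almost no readable text and probably needs JS rendering.
function looksLikeEmptyShell(html, minTextChars = 200) {
  return visibleTextLength(html) < minTextChars;
}
```

In the hybrid pattern this check can replace the length comparison: render in the browser only when `looksLikeEmptyShell(html)` is true.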

### Anti-Bot Protection

Sooner or later you will get blocked. Sometimes on first request. Sometimes after 1000 pages. It depends on website and how aggressive you crawl.

If you crawl from your local machine, your IP can end up on a blacklist and you will start seeing anti-bot checks very often. On the other hand, datacenter IPs are suspicious by default.

Anti-bot is a whole world. It can be simple (rate limit + IP ban) or very advanced (fingerprinting, behavior analysis, challenges). Common signals that you are blocked:

*   403/401 when it should be 200

*   429 even with low traffic

*   Redirect to “verify you are human”

*   HTML that looks normal, but content is empty (they serve you different page)

*   CAPTCHA or JavaScript challenge

The important thing: fighting anti-bot is not only technical. It is also legal and ethical area depending on what you crawl. So first rule is: crawl public pages, respect robots, and don't try to break protections on websites that clearly don't want you there.

From engineering side, you still need to handle it gracefully. Don't just retry forever. You need:

*   Detect block pages and mark URL as blocked

*   Backoff and slow down per host

*   Rotate IPs/proxies (if your use case allows it)

*   Keep consistent headers and cookies (session)

*   Use browser rendering for some sites (but it can be even more detectable)

Also remember that anti-bot is why “10k pages per second” promises are mostly marketing. If you crawl fast, you look like a bot. If you crawl slow, it can work, but now you need queue, scheduler, and good monitoring.

This is why observability is critical. Store response status, final URL, response size, and small HTML snippet for debug. Without it you will not even understand that you are blocked.

Best Practices for a Good Web Crawler

-------------------------------------

Best practices depend on many factors. If you crawl 5 known websites it is one thing. If you crawl user-provided URLs at scale it is completely different. It also depends on what content you want:

*   Only HTML status + headers

*   Full HTML snapshot

*   Cleaned text or markdown

*   Links graph

*   SEO fields (title, meta description, canonical, hreflang)

*   Images and files (PDFs)

*   Structured data (JSON)

*   Screenshots (real browser)

But there are a few general pieces of advice that work almost always.

### Have a Dashboard

Have a simple dashboard to track progress and understand what is happening now: queue size, processed pages, errors, slow domains, and current retries. Without it you will debug crawler by guessing and reading logs all day.

### Respect robots.txt Rules

robots.txt is a small text file on website root (like https://example.com/robots.txt) that describes what crawlers are allowed to access. It is not a security tool. But it is a rule of the web, and if you ignore it you will get blocked faster and you can create legal problems for yourself.

In crawler you should treat robots as first-class input. At the entry point, before you start crawling a domain, download and parse its robots.txt and check if your crawler (by User-Agent) is allowed. And it is not enough to check it once. For every single URL you are going to fetch you should check that path against robots rules again.

Why again? Because robots file can allow / but disallow /private/, or allow /blog/ but disallow /search. If you skip per-page checks your URL manager will happily enqueue forbidden URLs and you will waste requests.

So keep robots rules cached per host, refresh it sometimes, and make URL manager use it as filter.

In Node, it is easiest to use an existing parser (because robots syntax has edge cases):

    // Idea: download /robots.txt and check if URL is allowed.

    // Real code uses a parser library like `robots-parser` and caches per host.

    export async function isAllowedByRobots(url) {

      const robotsUrl = new URL("/robots.txt", url);

      const robotsTxt = await fetch(robotsUrl).then((r) => (r.ok ? r.text() : ""));

      // parse robotsTxt and check rules

      return true;

    }
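To show what the "parse and check rules" step involves, here is a deliberately simplified sketch. It handles only `User-agent`, `Allow`, and `Disallow` with plain prefix matching and longest-match-wins; it does not support `*` and `$` wildcards or `Crawl-delay`, so for real crawling a maintained parser library is still the safer choice. `parseRobots` and `isPathAllowed` are hypothetical names.

```javascript
// Parse robots.txt text into groups: [{ agents: [...], rules: [{ allow, path }] }]
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === "allow" || field === "disallow") && current) {
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if (value !== "") current.rules.push({ allow: field === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Longest matching rule wins; Allow wins a tie; no matching rule means allowed.
function isPathAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  const group =
    groups.find((g) => g.agents.some((a) => a && a !== "*" && ua.includes(a))) ||
    groups.find((g) => g.agents.includes("*"));
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}
```

The parsed groups are what you would cache per host, so that the per-URL check is just a call to `isPathAllowed` with the path of the URL.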

### Handle Errors Without Crashing

Crawler will fail all the time. Network fails, DNS fails, Puppeteer/Playwright fails, websites return 500, HTML is broken, parser throws exception. If one error crashes whole job, you will never crawl anything big.

So you need error handling on every layer. Fetcher should return structured error (timeout, DNS, status code, blocked) instead of throwing and killing process. Parser should fail per page, not per job. URL manager should mark URL as failed with attempts counter and reason.

Practical rules that help:

*   Timeouts everywhere (connect + response + overall)

*   Retries only for retryable errors (timeouts, temporary 5xx), not for 404

*   Max attempts per URL (otherwise you will retry forever)

*   Circuit breaker per host (if domain is down, pause it)

*   Store all error messages/statuses so you can debug later

Also make crawler idempotent. If job restarts in the middle, it should continue from stored state, not start from scratch. This is difference between demo crawler and production crawler.

One small thing that helps a lot is to make fetcher return structured outcomes instead of throwing:

    export async function safeFetch(url) {

      try {

        const res = await fetch(url);

        return { ok: res.ok, status: res.status, html: await res.text() };

      } catch (err) {

        return { ok: false, error: String(err) };

      }

    }

### Avoid Duplicate Pages

Duplicates are a silent killer. You think you crawl 100k pages, but in reality it can be 10k unique pages and 90k duplicates with different URLs. It happens because the web is full of aliases:

*   Trailing slash: /page vs /page/

*   Tracking params: ?utm\_source=...

*   Session params: ?sid=...

*   Sort/filter params that generate endless combinations

*   Same content on multiple paths

*   Redirects

First line of defense is URL normalization in URL manager. Strip #hash, normalize trailing slash, lower-case host, remove known tracking params, sort query params. Then dedupe by normalized URL.
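A sketch of these normalization steps in one function. `canonicalizeUrl` is a hypothetical name, the tracking param list is only a starting point, and treating `/page/` and `/page` as the same resource is an assumption that does not hold for every site.

```javascript
const TRACKING_PARAMS = [/^utm_/i, /^fbclid$/i, /^gclid$/i];

function canonicalizeUrl(input) {
  const u = new URL(input); // the URL parser already lower-cases scheme and host
  u.hash = ""; // strip #hash
  // Remove known tracking params, then sort the rest for a stable order.
  const kept = [...u.searchParams.entries()].filter(
    ([key]) => !TRACKING_PARAMS.some((re) => re.test(key))
  );
  kept.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  u.search = new URLSearchParams(kept).toString();
  // Assumption: a trailing slash does not change the page (except for the root).
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}
```

Deduping is then a `Set` lookup on the result, so `/page`, `/page/`, and `/page?utm_source=x` all map to one entry.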

Second line is canonical and redirects. After fetch, save mapping requested URL -> final URL (after redirects) and then read canonical from HTML. Use canonical as primary key when it exists.

Third line is content-level dedupe. Sometimes the URLs are different but the content is the same. So store a content hash (for example a hash of the cleaned text) and detect duplicates. It is very helpful for pagination traps.
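Content-level dedupe can be sketched with a fingerprint of the normalized text. To keep the example dependency-free it uses a small FNV-1a style hash over UTF-16 code units, which is my choice for illustration only; in production a cryptographic hash such as SHA-256 is the safer option because 32 bits will collide at scale. All names here are hypothetical.

```javascript
// FNV-1a style 32-bit hash (non-cryptographic), returned as 8 hex characters.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Normalize whitespace and case so trivial differences do not change the hash.
function contentFingerprint(cleanedText) {
  return fnv1a(cleanedText.replace(/\s+/g, " ").trim().toLowerCase());
}

// Returns true if the same content was already seen; otherwise remembers it.
function isDuplicateContent(seenHashes, cleanedText) {
  const fp = contentFingerprint(cleanedText);
  if (seenHashes.has(fp)) return true;
  seenHashes.add(fp);
  return false;
}
```

In a real crawler `seenHashes` would be a column or index in storage rather than an in-memory `Set`.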

If you do these three steps, your crawl becomes cheaper, faster, and your data is cleaner.

### Start simple

Start from small crawler that actually works end-to-end. One seed URL, fetch HTML, parse a few fields, extract links, store result. No proxies, no headless browser, no distributed queue.

When this simple version is stable, you will see real problems in logs and data. Then you can add features one by one: retries, better normalization, per-host limits, JS rendering, more parsers. If you start from an “enterprise crawler”, you will spend months and still not have something you can trust.

### Scale Step by Step

Scaling is not “add more servers”. First scale your correctness and observability.

Do it in steps:

*   Crawl 100 pages and verify output manually

*   Crawl 10k pages and watch error types, duplicates, and storage size

*   Only after that go to 1M pages with proper limits and monitoring

Every scale level will reveal new problems: memory leaks, slow parsers, database hot spots, and anti-bot blocks. If you jump directly to a big crawl, you will just burn money and still not know what is wrong.

Build It Yourself or Use an API?

--------------------------------

You can build crawler yourself. It is possible. But now you see what it really means: fetcher, parser, URL manager, storage, retries, rate limits, robots rules, JS rendering, anti-bot, monitoring.

If you have a fixed list of websites and you crawl them every day, building custom crawler can make sense. You can tune it, you can control costs, and you can get exactly the data format you need in a predictable way.

But if your use case is “users give me any URL and I should crawl it”, this is a different level. You will spend a lot of time on edge cases and infra, not on your product. In this case using an existing [crawling API](https://webcrawlerapi.com/) is often cheaper and faster.

My rule of thumb:

*   Build it if crawling is core feature and you have people to maintain it.

*   Use API if crawling is supporting feature and you just need reliable data.

Summary

-------

A web crawler is not hard to start, but it is hard to make reliable. Start with a simple version, define goals and scope, respect robots and rate limits, and build good visibility into what the crawler is doing. And before you invest months, be honest: maybe an existing crawler API is already good enough for your use case.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-did-not-expect-test-to-be-called-here
----

This occurs when test() is invoked outside a test file context, often in config/helpers imported by config.

Keep test() calls only in spec files and move shared logic to plain functions.

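    // helpers/auth.ts
    import type { Page } from '@playwright/test';

    // Plain helper: it contains no test() call, so importing it is always safe.
    export async function login(page: Page) {
      await page.goto('/login');
      await page.getByLabel('Email').fill('user@example.com'); // placeholder address
    }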

Then call helper functions from \*.spec.ts, not from playwright.config.ts.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-fix-puppeteer-extensiontransport-tasks-and-session-management
----

Puppeteer now dispatches each CDP message in its own JavaScript task by scheduling dispatch with setTimeout. This ensures a separate event loop turn for every message, which browsers require. It also fixes the page attach event to occur on the tab target session.

    function dispatchCDPMessage(msg) {

      // schedule the CDP message in its own task

      setTimeout(() => {

        sendCdpMessage(msg);

      }, 0);

    }

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-fix-fetch-enable-wasnt-found-error-for-workers
----

Fetch.enable wasn't found is raised when trying to enable the Fetch domain for a worker. The fix is to ignore this error for workers since the feature is not applicable there; the code path in WebWorker.ts already catches and handles this error. You can guard around the call and ignore errors that mention Fetch.enable wasn't found. Example:

    try {

      await session.send('Fetch.enable', { /* options */ });

    } catch (err) {

      if (err && typeof err.message === 'string' && err.message.includes("Fetch.enable wasn't found")) {

        // ignore for workers

      } else {

        throw err;

      }

    }

This behavior is implemented in the Puppeteer Core WebWorker handling, which catches the error appropriately and prevents it from failing the flow.

----
url: https://webcrawlerapi.com/glossary/scraping/what-are-ethical-web-scraping-practices
----

### Answer

Ethical scraping means minimizing harm and respecting site owners and users. Follow robots.txt, terms of service, and rate limits. Avoid collecting sensitive or personal data without a clear, legal basis. Be transparent in your user agent and provide contact information when possible. Ethical practices reduce conflicts and support long term access.
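As a rough sketch, two of these practices (a transparent user agent and a self-imposed rate limit) can be expressed in a few lines. The function names, the `(+URL)` contact format, and the delay value are illustrative assumptions, not requirements of any particular site or library.

```typescript
// Builds a user agent string that names the bot and links to a contact page.
// The "(+URL)" suffix is a common convention; the exact format is an assumption.
function buildUserAgent(botName: string, version: string, contactUrl: string): string {
  return `${botName}/${version} (+${contactUrl})`;
}

// Returns how many milliseconds to wait before the next request to a host,
// given when the previous request was sent and the minimum delay you chose.
function waitBeforeNextRequest(lastRequestAt: number, now: number, minDelayMs: number): number {
  const elapsed = now - lastRequestAt;
  return Math.max(0, minDelayMs - elapsed);
}
```

Send the resulting string in the `User-Agent` header of every request, and sleep for the returned number of milliseconds before contacting the same host again.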

----
url: https://webcrawlerapi.com/docs/errors
----

Errors

======

Complete guide to error codes in WebcrawlerAPI with job level and job item level errors

There are 2 levels of errors: job level and job item level.

**Job** level error codes:

*   [insufficient\_balance](#insufficient-balance) - Insufficient balance

*   [invalid\_request](#invalid-request) - Invalid request

*   [internal\_error](#internal-error) - Internal server error

**Job item** level error codes:

*   [host\_returned\_error](#host-returned-error) - Unsuccessful HTTP response from the host

*   [website\_access\_denied](#website-access-denied) - Website access denied

*   [blocked\_by\_robots\_txt](#blocked-by-robotstxt) - URL blocked by robots.txt

*   [name\_not\_resolved](#name-resolution-error) - Name resolution error

*   [ssl\_cert\_error](#ssl-cert-error) - SSL/TLS certificate error

*   [internal\_error](#internal-error-1) - Internal server error

*   [timeout\_error](#website-timeout) - Website timeout

*   [webpage\_non\_success](#webpage-non-success) - Crawling attempt unsuccessful

*   [llm\_max\_context\_length\_error](#llm-max-context-length-error) - AI request error: maximum context length 128k tokens exceeded

*   [duplicate\_item](#duplicate-content) - Duplicate content detected within the same job

[Job Level Errors](#job-level-errors)

-------------------------------------

A job level error means that the job failed to run, for example because the balance is insufficient or the service hit an internal error.

### [Insufficient Balance](#insufficient-balance)

This error occurs when the balance is not enough to run the job. Go to the [dashboard](https://dash.webcrawlerapi.com) to top up your balance.

API error response example:

    {

      "error_code": "insufficient_balance",

      "error_message": "Your balance is not enough to run this job"

    }

### [Invalid request](#invalid-request)

This error occurs when the request is invalid. For example, the URL is invalid or the parameters are invalid.

API error response example:

    {

      "error_code": "invalid_request",

      "error_message": "whitelist_regexp is invalid"

    }

### [Internal error](#internal-error)

This error means that something went wrong on our side. Please contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com) if you encounter this error.

API error response example:

    {

      "error_code": "internal_error",

      "error_message": "Internal server error"

    }
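A minimal sketch of how a client might recognize these job level errors in a parsed response body. The field names `error_code` and `error_message` come from the examples above; the helper names and the suggested next steps are assumptions for illustration.

```typescript
// Job level error codes listed in this guide.
type JobErrorCode = "insufficient_balance" | "invalid_request" | "internal_error";

interface JobLevelError {
  error_code: JobErrorCode;
  error_message: string;
}

const JOB_ERROR_CODES: readonly string[] = [
  "insufficient_balance",
  "invalid_request",
  "internal_error",
];

// Type guard: true when a parsed response body looks like a job level error.
function isJobLevelError(body: unknown): body is JobLevelError {
  if (typeof body !== "object" || body === null) return false;
  const candidate = body as Record<string, unknown>;
  const code = candidate["error_code"];
  const message = candidate["error_message"];
  return (
    typeof code === "string" &&
    JOB_ERROR_CODES.includes(code) &&
    typeof message === "string"
  );
}

// Suggested next step per code; the wording is an assumption based on the descriptions above.
function nextStep(error: JobLevelError): string {
  switch (error.error_code) {
    case "insufficient_balance":
      return "Top up the balance in the dashboard";
    case "invalid_request":
      return `Fix the request: ${error.error_message}`;
    case "internal_error":
      return "Contact support";
  }
}
```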

[Job Item Level Errors](#job-item-level-errors)

-----------------------------------------------

A job item level error means that an individual job item failed with a specific error, while the job itself still ran.

Job item level errors are returned in the `job_items` array. List of error codes:

### [Host returned error](#host-returned-error)

This is the most common error. It means that the HTTP status code of the response is outside the 200-299 range. The exception is the 403 status code, which has its own error code, `website_access_denied`.

API error response example:

    {
        "id": "60b7c4a5-aca7-4183-87db-017418218641",
        //...
        "status": "done",
        "job_items": [
            {
                //...
                "error_code": "host_returned_error",
                "status": "error",
                "last_error": "Webpage returned error status code: 404"
            }
        ]
    }

### [Website access denied](#website-access-denied)

This is a special case of the `host_returned_error` error. It means that the website returned a 403 status code.

API error response example:

    {
        //...
        "status": "done",
        "job_items": [
            {
                //...
                "error_code": "website_access_denied",
                "status": "error",
                "last_error": "Webpage returned access denied status code: 403"
            }
        ]
    }
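The relationship between the two codes above can be sketched as a small mapping from HTTP status to error code. This mirrors the rule as described in this guide; the function name is hypothetical, and how the service treats redirects before applying the rule is not covered here.

```typescript
type HttpItemErrorCode = "host_returned_error" | "website_access_denied";

// Maps an HTTP status code to the job item error code described above,
// or null when the status is in the 200-299 success range.
function classifyHttpStatus(status: number): HttpItemErrorCode | null {
  if (status >= 200 && status <= 299) return null;
  if (status === 403) return "website_access_denied";
  return "host_returned_error";
}
```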

### [Blocked by robots.txt](#blocked-by-robotstxt)

This error occurs when the `respect_robots_txt` parameter is set to `true` and the website's robots.txt file disallows access to the specific URL for crawlers. The robots.txt file is a standard used by websites to communicate with web crawlers about which parts of the site should not be crawled.

API error response example:

    {

        "error_code": "blocked_by_robots_txt",

        "error_message": "URL is blocked by robots.txt. The website's robots.txt file disallows access to this URL for crawlers. Respect robots.txt can be disabled in the request."

    }

### [Name resolution error](#name-resolution-error)

This error means that the website's host name could not be resolved. Most likely the website does not exist or there is a typo in the URL.

API error response example:

    {
        //...
        "job_items": [
            {
                //...
                "error_code": "name_not_resolved",
                "status": "error",
                "last_error": "Connection refused"
            }
        ]
    }

### [SSL Cert Error](#ssl-cert-error)

This error occurs when the SSL/TLS handshake with the website fails. This typically means the website has an invalid, expired, or self-signed certificate that cannot be verified. The request will not be retried as this is a permanent configuration issue on the website's side.

**Common causes:**

*   The website's SSL certificate has expired

*   The certificate is self-signed and not trusted

*   The certificate does not match the domain name

*   The website has misconfigured SSL/TLS settings

API error response example:

    {

        "success": false,

        "status": "error",

        "error_code": "ssl_cert_error",

        "error_message": "SSL/TLS handshake failed"

    }

### [Website timeout](#website-timeout)

This error occurs when we tried to reach the webpage several times with different proxies, but the website did not respond within a reasonable time. Use the steps below to narrow down the cause.

**Troubleshooting steps:**

1.  **Check website accessibility** - First, verify that the website and webpage are accessible by visiting the webpage manually in your browser

2.  **If it loads slowly in your browser** - The issue is likely on the website's side (slow server, high traffic, or downtime)

3.  **If it loads instantly in your browser but still times out in the API** - This indicates the website has sophisticated anti-bot protection that we cannot bypass

**Common causes:**

*   The website is slow to respond or experiencing high traffic

*   The website is temporarily down or experiencing server issues

*   The website has advanced anti-bot protection systems

*   The website has captcha or other interactive elements that weren't solved in time

We recommend retrying the request. If the problem persists and the website loads normally in your browser, please contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com).

API error response example:

    {
        //...
        "job_items": [
            {
                //...
                "error_code": "timeout_error",
                "status": "error",
                "last_error": "Website timeout. Please try again later or contact support at support@webcrawlerapi.com"
            }
        ]
    }

### [Webpage Non Success](#webpage-non-success)

We tried hard, but the crawling attempt was not successful. The content may be empty or blocked by anti-bot protection. This typically happens when the webpage either returns no useful content or has sophisticated protection mechanisms preventing access.

**Common causes:**

*   The webpage has advanced anti-bot protection systems

*   The page content is blocked or restricted

*   The page loaded but contained no extractable content

We recommend checking if the webpage is accessible normally and trying again. If the problem persists, please contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com).

API error response example:

    {
        //...
        "job_items": [
            {
                //...
                "error_code": "webpage_non_success",
                "status": "error",
                "last_error": "The crawling attempt was not successful. The content may be empty or blocked by anti-bot protection."
            }
        ]
    }

### [LLM Max Context Length Error](#llm-max-context-length-error)

This error occurs when the webpage content is too large and doesn't fit within the AI model's context window. The AI processing requires the entire webpage content to fit within its maximum context length limit. When a webpage has too much text, images, or other content, it exceeds this limit and cannot be processed.

A possible solution is to use the `clean_selectors` parameter which allows you to exclude unneeded content (like navigation, ads, footers) before sending it to the LLM. See the [cleaning documentation](/docs/guides/cleaning) for more details on how to use clean selectors.

API error response example:

    {
        //...
        "job_items": [
            {
                //...
                "error_code": "llm_max_context_length_error",
                "status": "error",
                "last_error": "AI request error: maximum context length 128k tokens exceeded for this page"
            }
        ]
    }

### [Duplicate Content](#duplicate-content)

This error occurs when the same content is detected multiple times within the same crawling job. The system uses content hashing to identify pages with identical content, even if they have different URLs. When a duplicate is found, the job item will fail with this error code and reference the URL where the content was first seen.

**Common causes:**

*   Multiple URLs pointing to the same content (e.g., with different query parameters or URL paths)

*   Mirror pages or duplicate content on the website

*   Pagination pages with identical content

*   URL variations that load the same content

When a duplicate is detected, you will not be charged for processing the duplicate item, as the balance is automatically refunded.

API error response example:

    {
        //...
        "job_items": [
            {
                //...
                "error_code": "duplicate_item",
                "status": "error",
                "last_error": "Duplicate content of: https://example.com/original-page"
            }
        ]
    }
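To illustrate the idea of content hashing, here is a minimal sketch of a per-job duplicate detector. It is not the service's implementation: the choice of SHA-256, the `DuplicateDetector` class, and hashing the raw content without normalization are all assumptions.

```typescript
import { createHash } from "node:crypto";

// SHA-256 is an assumption; the hash used by the service is not documented here.
function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

class DuplicateDetector {
  // Maps a content hash to the URL where that content was first seen.
  private readonly firstSeen = new Map<string, string>();

  // Returns the URL of the first page with identical content,
  // or null if this content has not been seen in the job yet.
  check(url: string, content: string): string | null {
    const hash = hashContent(content);
    const original = this.firstSeen.get(hash);
    if (original !== undefined) return original;
    this.firstSeen.set(hash, url);
    return null;
  }
}
```

Two URLs that differ only in query parameters but return the same body produce the same hash, so the second one is reported as a duplicate of the first.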

### [Internal error](#internal-error-1)

This error means that something went wrong on our side. Please contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com) if you encounter this error.

API error response example:

    {
        //...
        "job_items": [
            {
                //...
                "error_code": "internal_error",
                "status": "error",
                "last_error": "Internal server error"
            }
        ]
    }
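Putting the job item level codes together, a client could summarize a finished job roughly as follows. The field names follow the response examples in this guide. The retry policy is an assumption of this sketch: it retries only the codes for which this guide recommends trying again, and `reportItemErrors` is a hypothetical helper.

```typescript
// Field names follow the response examples in this guide.
interface JobItem {
  status: string;
  error_code?: string;
  last_error?: string;
}

// Codes for which this guide recommends trying again. Treating every other
// code as non-retryable is an assumption of this sketch, not documented behavior.
const RETRYABLE_CODES = new Set<string>(["timeout_error", "webpage_non_success"]);

interface ErrorReport {
  countsByCode: Record<string, number>;
  retryable: JobItem[];
}

// Counts failed items per error code and collects the ones worth retrying.
function reportItemErrors(jobItems: JobItem[]): ErrorReport {
  const countsByCode: Record<string, number> = {};
  const retryable: JobItem[] = [];

  for (const item of jobItems) {
    if (item.status !== "error") continue;
    const code = item.error_code ?? "unknown";
    countsByCode[code] = (countsByCode[code] ?? 0) + 1;
    if (RETRYABLE_CODES.has(code)) retryable.push(item);
  }

  return { countsByCode, retryable };
}
```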

----
url: https://webcrawlerapi.com/changelog/2025-12-27-website-feeds
----

December 27, 2025

Monitor Any Website with RSS/JSON Feeds

=======================================

Turn any website into a feed. Monitor websites for changes and get automatic updates via RSS, JSON feeds, or webhooks.

The new [Feeds API](/docs/feeds) lets you track content changes on any website without building custom monitoring infrastructure. Perfect for tracking blogs, news sites, documentation, or any web content that doesn't offer native feeds.

What's new?

-----------

*   **RSS/Atom Feeds**: Subscribe to any website in standard Atom 1.0 format compatible with all feed readers

*   **JSON Feed Format**: Get updates in JSON Feed format for easy integration with applications

*   **Webhook Notifications**: Receive instant POST requests when content changes are detected

*   **Automatic Monitoring**: Periodic crawling with smart change detection

*   **Flexible Configuration**: Control crawl depth, page limits, URL patterns, and output formats (markdown, cleaned, HTML)

*   **Error Resilience**: Automatic pause after 3 consecutive errors to prevent unnecessary charges

*   **Status Tracking**: Detailed metrics on pages crawled, changed, new, unavailable, and errors

How it works

------------

Create a feed with a single API call, then subscribe to updates via RSS/Atom, JSON Feed, or webhooks. The system automatically crawls your target website periodically and delivers only what changed.

See the [Feeds documentation](/docs/feeds) and [API Reference](/docs/api/feed/feed-create) for complete details.

----
url: https://webcrawlerapi.com/blog/python-vs-nodejs-which-is-better-for-web-crawling
----

Python · Comparison · Technical · 30 min read

Python vs Node.js: Which is Better for Web Crawling?

====================================================

Explore the strengths and weaknesses of Python and Node.js for web crawling, and find the best fit for your project needs.

Written by Andrew

Published on Jan 31, 2026

### Table of Contents

*   [Quick Comparison](#quick-comparison)

*   [Comparing Performance: Python vs Node.js](#comparing-performance-python-vs-nodejs)

*   [Speed and Efficiency Analysis](#speed-and-efficiency-analysis)

*   [Scalability and Resource Use](#scalability-and-resource-use)

*   [Library and Framework Overview](#library-and-framework-overview)

*   [Python Tools: Scrapy and BeautifulSoup](#python-tools-scrapy-and-beautifulsoup)

*   [Node.js Tools: Puppeteer and Cheerio](#nodejs-tools-puppeteer-and-cheerio)

*   [Library Comparison Table](#library-comparison-table)

*   [Use Cases for Python and Node.js](#use-cases-for-python-and-nodejs)

*   [When to Use Python](#when-to-use-python)

*   [When to Use Node.js](#when-to-use-nodejs)

*   [Using APIs for Web Crawling](#using-apis-for-web-crawling)

*   [WebCrawlerAPI Features](#webcrawlerapi-features)

*   [Why Use APIs?](#why-use-apis)

*   [Conclusion](#conclusion)

*   [Final Comparison: Python vs Node.js](#final-comparison-python-vs-nodejs)

*   [Key Takeaways](#key-takeaways)

**[Python](https://webcrawlerapi.com/blog/how-to-crawl-the-website-with-python) and [Node.js](https://nodejs.org/en) are both excellent tools for web crawling, but the right choice depends on your project needs.**

*   **Choose Python** for large-scale data extraction, static websites, and tasks involving heavy data processing. Libraries like **[Scrapy](https://scrapy.org/)** and **[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)** make it ideal for handling complex datasets.

*   **Opt for Node.js** for real-time scraping, dynamic content, and JavaScript-heavy sites. Tools like **[Puppeteer](https://pptr.dev/)** and **[Cheerio](https://cheerio.js.org/docs/intro)** excel in managing modern web applications.

### Quick Comparison

| Feature | Python | Node.js |
| --- | --- | --- |
| **Performance** | Slower execution; better for data-heavy tasks | Faster execution; great for real-time scraping |
| **Best Libraries** | Scrapy, BeautifulSoup | Puppeteer, Cheerio |
| **Use Cases** | Static websites, data analysis | JavaScript-heavy, dynamic websites |
| **Learning Curve** | Beginner-friendly | More complex for automation |
| **Scalability** | Easily scales across machines | Requires setup for scaling |

**Key takeaway**: Use Python for data-focused projects and Node.js for speed and dynamic content. APIs like **[WebCrawlerAPI](https://webcrawlerapi.com/)** can simplify tasks for both platforms.

Comparing Performance: Python vs Node.js

----------------------------------------

Performance differences between Python and Node.js play a key role in determining their suitability for web crawling tasks. Here's how they stack up.

### Speed and Efficiency Analysis

Node.js, driven by the [V8](https://en.wikipedia.org/wiki/V8_\(JavaScript_engine\)) engine, is faster in raw execution, making it ideal for quick data extraction and real-time tasks. Its event-driven architecture is great for managing multiple requests at once, making it perfect for lightweight, high-frequency crawls.

Python is slower in raw execution but strong at complex data processing. Because crawling is usually limited by network latency rather than CPU, its speed limitations often don't create significant issues in practical web crawling scenarios.
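As a rough sketch of the event-driven model described above, the helper below runs many asynchronous tasks on a single thread while capping how many are in flight at once. `mapWithConcurrency` is a hypothetical name, not part of Node.js or any library mentioned here.

```typescript
// Runs async tasks with at most `limit` in flight at once
// and returns the results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let nextIndex = 0;

  // Each runner claims the next unprocessed index until none are left.
  // Claiming is safe without locks because JavaScript runs this code on one thread.
  async function runner(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

For a crawl, the worker would typically fetch one URL and return its body. The limit keeps the crawler from opening too many connections at once.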

### Scalability and Resource Use

Both Python and Node.js bring unique strengths to large-scale web crawling, but their approaches differ:

| Aspect | Node.js Strength | Web Crawling Benefit |
| --- | --- | --- |
| Memory Use | Non-blocking I/O | Manages many requests at once with less memory |
| CPU Efficiency | V8 engine | Handles parallel crawling tasks effectively |

Python, on the other hand, emphasizes processing large datasets efficiently. Although it may initially use more resources, its libraries are designed for handling complex tasks like data analysis and manipulation. This makes Python particularly useful for crawling projects that require heavy data processing.

For dynamic content scraping, Node.js's Puppeteer library can drive a full browser. While effective, this approach demands more resources and time, especially for large-scale operations.

The choice between Python and Node.js ultimately depends on your project's needs. Node.js is better suited for real-time data extraction with minimal resource consumption, while Python is the go-to for projects involving detailed data manipulation and analysis tasks [\[1\]](https://rayobyte.com/blog/web-scraping-python-vs-nodejs/)[\[4\]](https://www.geeksforgeeks.org/node-js-vs-python/).

It's also important to consider the libraries and frameworks available for each language, as they significantly impact web crawling efficiency.

Library and Framework Overview

------------------------------

Choosing the right libraries and frameworks plays a crucial role in web crawling. These tools leverage Python's strong data processing abilities and Node.js's event-driven design.

### Python Tools: [Scrapy](https://scrapy.org/) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)

Scrapy is built for large-scale web crawling, thanks to its efficient pipeline for automating data processing and storage. BeautifulSoup, on the other hand, specializes in parsing HTML and XML with an easy-to-use API, making it a perfect partner for Scrapy when you need precise data extraction.

### Node.js Tools: [Puppeteer](https://pptr.dev/) and [Cheerio](https://cheerio.js.org/docs/intro)

Puppeteer controls a headless Chrome browser, making it ideal for handling JavaScript-heavy websites and automating tasks like form submissions. Cheerio, with its jQuery-like syntax, is perfect for simpler projects requiring lightweight HTML parsing.

### Library Comparison Table

| Feature | Python Libraries | Node.js Libraries |
| --- | --- | --- |
| Performance | Scrapy: Great for static content | Puppeteer: Ideal for dynamic content |
| Memory Usage | Scrapy: Optimized for large-scale use | Cheerio: Lightweight and efficient |
| Learning Curve | BeautifulSoup: Beginner-friendly | Puppeteer: More complex for automation |
| Best Use Case | Static or semi-dynamic sites | JavaScript-heavy, dynamic websites |
| Scalability | Easily scales across machines | Requires additional setup for scaling |

The right choice depends on your project needs. For example, Scrapy often outperforms when crawling static e-commerce sites due to its targeted approach. On the other hand, Puppeteer shines when working with modern single-page applications that rely heavily on JavaScript [\[2\]](https://www.restack.io/p/scrapy-answer-vs-puppeteer)[\[1\]](https://rayobyte.com/blog/web-scraping-python-vs-nodejs/).

Many developers combine these tools for better results. A popular strategy is using BeautifulSoup for parsing alongside Scrapy for crawling, or pairing Cheerio with Puppeteer to handle different tasks within the same project [\[1\]](https://rayobyte.com/blog/web-scraping-python-vs-nodejs/)[\[4\]](https://www.geeksforgeeks.org/node-js-vs-python/).

Knowing the strengths of these libraries helps you select the best tool for your web crawling requirements.

Use Cases for Python and Node.js

--------------------------------

Knowing when to pick Python or Node.js for web crawling can make or break your project. Each has its strengths, depending on the task at hand.

### When to Use Python

Python is a go-to choice for projects involving heavy data extraction or analysis. Its rich library ecosystem is perfect for tasks that demand high throughput and advanced data processing.

Scrapy, for example, is highly efficient for large-scale crawling, capable of handling thousands of pages per minute when set up correctly. Python works best for:

*   **Data-heavy tasks**: Combining web crawling with in-depth data analysis.

*   **Static websites**: Great for crawling e-commerce catalogs or content-focused platforms.

*   **Scientific projects**: Ideal for scenarios requiring both data collection and analysis.

*   **Large-scale crawling**: Distributed crawlers running across multiple machines.

### When to Use Node.js

Node.js stands out when dealing with modern, dynamic web applications. Its asynchronous nature makes it a strong contender for real-time data collection, especially on JavaScript-heavy websites.

Powered by the V8 JavaScript engine, Node.js performs exceptionally well in scenarios like:

*   **Dynamic content**: Crawling single-page applications or JavaScript-rendered sites.

*   **Real-time tracking**: Monitoring live price updates or inventory changes on e-commerce sites.

*   **Interactive scraping**: Handling websites that need user interaction or form submissions.

*   **API-heavy projects**: Managing multiple API integrations efficiently.

Your choice should align with your project's goals. Node.js is faster in certain tasks thanks to its V8 engine [\[3\]](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/python3-node.html), while Python's extensive libraries and ease of use make it better suited for large-scale data-focused projects [\[5\]](https://brightdata.com/blog/web-data/javascript-vs-python).

For example, if you're building a crawler to track real-time price updates on dynamic e-commerce sites, Node.js with Puppeteer is a solid pick. On the other hand, if you're pulling large datasets from static news websites for analysis, Python with Scrapy is the way to go.

Using APIs for Web Crawling

---------------------------

Python and Node.js come with powerful libraries for web crawling, but APIs offer a simpler way to handle complex tasks, making the process more efficient on both platforms.

### [WebCrawlerAPI](https://webcrawlerapi.com/) Features

WebCrawlerAPI works seamlessly with Python and Node.js, offering key features to optimize web crawling:

| Feature | Description |
| --- | --- |
| JavaScript Rendering | Automatically processes dynamic content loading |
| Anti-Bot Protection | Avoids CAPTCHAs and prevents IP blocking |
| Scalable Infrastructure | Handles large-scale crawling with ease |

You can integrate this API into Python projects using Scrapy or Node.js apps with Puppeteer without needing significant code changes [\[2\]](https://www.restack.io/p/scrapy-answer-vs-puppeteer).

### Why Use APIs?

APIs simplify web crawling by tackling infrastructure challenges that would otherwise require additional tools in Python or Node.js.

**Save Time and Resources**: APIs take care of proxies, CAPTCHAs, and scaling, cutting down on development time and costs [\[1\]](https://rayobyte.com/blog/web-scraping-python-vs-nodejs/).

**Simplified Data Handling**: APIs come with features like:

*   Automated JavaScript rendering and CAPTCHA solving

*   IP rotation and proxy management

*   Built-in tools for data parsing and cleaning

**Works Across Platforms**: APIs deliver consistent results in Python and Node.js, making them ideal for teams using both languages [\[2\]](https://www.restack.io/p/scrapy-answer-vs-puppeteer). This ensures smooth workflows regardless of the programming environment.

Whether you’re leveraging Python for data analysis or Node.js for its event-driven model, APIs help you focus on extracting and analyzing data rather than dealing with infrastructure [\[1\]](https://rayobyte.com/blog/web-scraping-python-vs-nodejs/).

Conclusion

----------

After comparing their performance, libraries, and use cases, here's how Python and Node.js stack up for web crawling.

### Final Comparison: Python vs Node.js

Python and Node.js each bring different advantages to web crawling. Python is a strong choice for handling data-heavy projects, especially those involving large-scale data extraction and analysis. Libraries like **Scrapy** and **BeautifulSoup** are excellent for managing complex data extraction tasks.

On the other hand, Node.js shines when it comes to speed. Thanks to its V8 engine, it performs exceptionally well with dynamic content. Benchmark tests frequently show Node.js outpacing Python in raw execution speed, making it a great pick for performance-driven tasks [\[3\]](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/python3-node.html).

### Key Takeaways

The best choice depends on the specific needs of your project:

*   **Go with Python if you need**: Tools for data-intensive tasks, large-scale data extraction, or seamless integration with data science workflows [\[1\]](https://rayobyte.com/blog/web-scraping-python-vs-nodejs/).

*   **Opt for Node.js if you need**: Real-time scraping, handling dynamic content, or event-driven workflows [\[2\]](https://www.restack.io/p/scrapy-answer-vs-puppeteer).

APIs like **WebCrawlerAPI** can help tackle common web crawling challenges. They offer features like JavaScript rendering and anti-bot protection, and they work with both Python and Node.js [\[4\]](https://www.geeksforgeeks.org/node-js-vs-python/).

Python's simple syntax makes it beginner-friendly, while Node.js offers more control for experienced developers. The right tool ultimately depends on your project's demands and goals.

----
url: https://webcrawlerapi.com/changelog/2025-03-21-error-handling
----

March 21, 2025

Comprehensive Error Handling System

===================================

Major WebcrawlerAPI update: Comprehensive [error handling](/docs/errors) system implementation

*   Added two-level error handling system: job level and job item level errors

*   New job level error codes:

    *   insufficient\_balance for balance-related issues

    *   invalid\_request for malformed requests

    *   internal\_error for system-level issues

*   New job item level error codes:

    *   host\_returned\_error for non-200 HTTP responses

    *   website\_access\_denied for 403 responses

    *   name\_not\_resolved for DNS resolution failures

    *   internal\_error for system-level issues

*   Each error now includes detailed error messages and specific error codes for better debugging
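
To illustrate how the two levels can be handled separately, here is a minimal sketch that splits a job payload into job-level and item-level errors. The error codes come from the list above; the payload shape (the `error_code`, `error_message`, `items`, and `url` keys) and the `summarize_errors` helper are assumptions made for illustration, not the documented response format.

```python
# Error codes listed in the changelog entry above.
JOB_LEVEL_CODES = {"insufficient_balance", "invalid_request", "internal_error"}
ITEM_LEVEL_CODES = {
    "host_returned_error",
    "website_access_denied",
    "name_not_resolved",
    "internal_error",
}


def summarize_errors(job):
    """Split a job payload into job-level and item-level errors.

    ASSUMPTION: the keys "error_code", "error_message", "items" and "url"
    are hypothetical; check the error handling docs for the real field names.
    """
    job_error = None
    if job.get("error_code"):
        job_error = {
            "code": job["error_code"],
            "message": job.get("error_message", ""),
            "known": job["error_code"] in JOB_LEVEL_CODES,
        }

    item_errors = []
    for item in job.get("items", []):
        code = item.get("error_code")
        if code:
            item_errors.append({
                "url": item.get("url"),
                "code": code,
                "message": item.get("error_message", ""),
                "known": code in ITEM_LEVEL_CODES,
            })
    return {"job_error": job_error, "item_errors": item_errors}


# Hypothetical payload: the job itself succeeded, two items failed.
sample_job = {
    "error_code": None,
    "items": [
        {"url": "https://example.com/", "error_code": None},
        {"url": "https://example.com/private",
         "error_code": "website_access_denied",
         "error_message": "403 Forbidden"},
        {"url": "https://no-such-host.invalid/",
         "error_code": "name_not_resolved"},
    ],
}
print(summarize_errors(sample_job))
```

A job-level error usually means the whole job should be fixed and resubmitted, while item-level errors can often be retried or skipped per URL.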

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-executable-does-not-exist
----

The Playwright package is installed, but the browser binaries are missing from the environment.

Install the browsers (and, in Linux containers, the system dependencies) as part of setup.

    npx playwright install

For CI images that need OS libraries:

    npx playwright install --with-deps

Pin the Playwright version and the container image tag together to avoid binary mismatches.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-click-timeout-not-visible
----

The element the locator points to exists, but it is hidden, covered by another element, or not yet in the expected UI state.

Wait for visibility and stable state before clicking.

    const menu = page.getByRole('button', { name: 'Open menu' });

    await menu.click();

    const item = page.getByRole('menuitem', { name: 'Settings' });

    await expect(item).toBeVisible();

    await item.click();

If overlays block clicks, close modal/toast layers first instead of forcing clicks.

----
url: https://webcrawlerapi.com/blog/clean-crawled-data-with-beautifulsoup-in-python
----

Python · Tutorial · 10 min read

Clean crawled or scraped data with BeautifulSoup in Python

==========================================================

After crawling or scraping the webpage, the data may need to be cleaned. In this article, we provide a solution and code for using BeautifulSoup to remove unneeded content.

Written by Andrew

Published on Feb 3, 2026

### Table of Contents

*   [Write clean-up function using BeautifulSoup in Python](#write-clean-up-function-using-beautifulsoup-in-python)

*   [Use BeautifulSoup in Docker](#use-beautifulsoup-in-docker)

Write clean-up function using BeautifulSoup in Python

-----------------------------------------------------

After crawling or scraping the webpage, the data may need to be cleaned. In this article, we provide a solution and code for using [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to remove unneeded content.

If you need a tiny crawler to collect the pages first, start here: [BeautifulSoup4 Web Crawler](/blog/beatifulsoup-webcrawler). Once you have the pages, BeautifulSoup lets you turn this:

    <html>

      <head>

        <title>Sample HTML</title>

      </head>

      <body>

        <p>This is a <b>sample</b> HTML content.</p>

        <p>It has multiple lines and <i>tags</i>.</p>

      </body>

      <script>

        some javascript

      </script>

    </html>

Into this:

    Sample HTML

    This is a sample HTML content.

    It has multiple lines and tags.

**Why clean data is important**

Dirty data can lead to incorrect analysis, false insights, and wasted time. Here are some reasons to clean crawled content:

1.  **Remove irrelevant information**: webpages usually contain HTML tags, styles, scripts, media, etc. You do not always need all this. Most likely, you need exactly the opposite: valuable data.

2.  **Reduce noise**: without cleaning, the data contains too much noise. This can reduce the accuracy of your analysis or of AI models trained on the data.

3.  **Reduce size**: cleaning can also significantly reduce the storage space the data requires.

**Cleaning Data with BeautifulSoup**

BeautifulSoup is a powerful and easy-to-use Python library for parsing and cleaning HTML and XML documents. It is particularly great for scraping and cleaning crawled data. Here's an example of how to clean crawled data with BeautifulSoup:

First, install BeautifulSoup4:

    pip install beautifulsoup4

Write the clean-up function:

    from bs4 import BeautifulSoup

    def clean_html():

        soup = BeautifulSoup(HTML_CONTENT, 'html.parser')

        clean_text = soup.get_text()

        clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])

        return clean_text

Add some test data:

    HTML_CONTENT = """

        <html>

        <head><title>Sample HTML</title></head>

        <body>

            <p>This is a <b>sample</b> HTML content.</p>

            <p>It has multiple lines and <i>tags</i>.</p>

        </body>

        <script>some javascript</script>

        </html>

        """

And a runner to test:

    def main():

        cleaned = clean_html()

        print(cleaned)

The final code looks like this:

    #clean.py

    from bs4 import BeautifulSoup

    def clean_html():

        HTML_CONTENT = """

        <html>

        <head><title>Sample HTML</title></head>

        <body>

            <p>This is a <b>sample</b> HTML content.</p>

            <p>It has multiple lines and <i>tags</i>.</p>

        </body>

        <script>some javascript</script>

        </html>

        """

        soup = BeautifulSoup(HTML_CONTENT, 'html.parser')

        clean_text = soup.get_text()

        clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])

        return clean_text

    def main():

        cleaned = clean_html()

        print(cleaned)

    if __name__ == '__main__':

        main()

Now you can run it and see the cleaned test data:

    python clean.py

    # Output:

    # Sample HTML

    # This is a sample HTML content.

    # It has multiple lines and tags.
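
If you want to control explicitly which elements are dropped, rather than relying on a library's defaults, the same idea can be sketched with only the Python standard library. Note that this swaps BeautifulSoup for `html.parser.HTMLParser`; the `TextExtractor` class and this parameterized `clean_html(html)` are illustrative names, not part of the tutorial's code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        joined = "".join(self._chunks)
        # Drop empty lines and surrounding whitespace, as in the tutorial.
        return "\n".join(
            line.strip() for line in joined.splitlines() if line.strip()
        )


def clean_html(html):
    """Return the visible text of an HTML string, one non-empty line per row."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE = """<html>
<head><title>Sample HTML</title></head>
<body>
    <p>This is a <b>sample</b> HTML content.</p>
    <p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>"""

print(clean_html(SAMPLE))
```

Passing the HTML in as an argument, instead of reading a module-level constant, also makes the function easier to reuse on real crawled pages.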

Use BeautifulSoup in Docker

---------------------------

If you don't want to write the code yourself, or you are working in a language other than Python, you can run a Docker container that wraps BeautifulSoup and send it HTTP requests.

Run docker:

    docker pull n10ty/beautifulsoup-api

    docker run -p5000:5000 n10ty/beautifulsoup-api

Make a request:

    curl --request POST \
      --url http://localhost:5000/clean \
      --data '<html>

        <head><title>Sample HTML</title></head>

        <body>

            <p>This is a <b>sample</b> HTML content.</p>

            <p>It has multiple lines and <i>tags</i>.</p>

        </body>

        <script>some javascript</script>

        </html>'

    # Output:

    # Sample HTML

    # This is a sample HTML content.

    # It has multiple lines and tags.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-target-page-context-or-browser-closed
----

This error appears when code uses page after page.close(), context.close(), or browser.close().

Ensure cleanup runs after all page actions, and avoid floating async tasks.

    import { chromium } from 'playwright';

    const browser = await chromium.launch();

    const context = await browser.newContext();

    const page = await context.newPage();

    await page.goto('https://example.com');

    await page.getByRole('link', { name: 'Docs' }).click();

    // Close only after awaited work is complete.

    await context.close();

    await browser.close();

If you use Promise.all, make sure none of the branches closes the context too early.

----
url: https://webcrawlerapi.com/blog/what-is-shadow-dom
----

Technical · 5 min read

What is Shadow DOM? (And How to Scrape It)

==========================================

Shadow DOM is a way to build encapsulated UI components on the web. Learn what Shadow DOM is, why it is hard to scrape, and how to scrape Shadow DOM in your browser or with a browser extension.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [What is Shadow DOM?](#what-is-shadow-dom)

*   [Why it is difficult to scrape Shadow DOM](#why-it-is-difficult-to-scrape-shadow-dom)

*   [How to scrape Shadow DOM in a local browser](#how-to-scrape-shadow-dom-in-a-local-browser)

*   [1) Manually access the shadow root](#1-manually-access-the-shadow-root)

*   [2) Use a helper that searches through nested Shadow DOM](#2-use-a-helper-that-searches-through-nested-shadow-dom)

*   [How to scrape Shadow DOM via browser extension](#how-to-scrape-shadow-dom-via-browser-extension)

*   [\`manifest.json\`](#manifestjson)

*   [\`background.js\`](#backgroundjs)

What is Shadow DOM?

-------------------

Shadow DOM is a browser feature that lets a web component keep its internal HTML and CSS “private” from the rest of the page. You can think of it like a mini DOM tree attached to an element (the **host**) that renders its own content.

This is used a lot in design systems and modern UI widgets (dropdowns, date pickers, chat widgets, cookie banners) because it avoids CSS conflicts and makes components easier to reuse. Styles inside a shadow root do not automatically leak out, and page styles do not automatically leak in.

There are two common terms you will see:

*   **Light DOM**: the regular DOM you see in document.documentElement and page HTML.

*   **Shadow DOM**: the hidden/encapsulated DOM inside a component, accessible through the element’s shadowRoot (only for “open” shadow roots).

Example (open shadow root) so you can see the idea:

    // Create a host element

    const host = document.createElement("div");

    host.id = "my-widget";

    document.body.appendChild(host);

    // Attach a shadow root (open = accessible via host.shadowRoot)

    const root = host.attachShadow({ mode: "open" });

    root.innerHTML = `

      <style>

        .title { color: rebeccapurple; font-weight: 600; }

      </style>

      <div class="title">Hello from Shadow DOM</div>

    `;

    // This works because the root is open

    console.log(host.shadowRoot.querySelector(".title").textContent);

Why it is difficult to scrape Shadow DOM

----------------------------------------

Shadow DOM is difficult to scrape because most web scrapers start from the page HTML string. When you do a simple HTTP request and parse the response, you usually only get the **light DOM HTML** that came from the server. But shadow roots are often created at runtime by JavaScript, and their content may not exist in the raw HTML at all.

Even if the browser has rendered the page, the element you want might be inside a shadow root, so a normal selector like document.querySelector(".price") will return null. You must first find the host element, then “enter” its shadow root and query inside it.

There is also an extra limitation:

*   **Open shadow root**: you can access it with element.shadowRoot.

*   **Closed shadow root**: element.shadowRoot is null by design, even though the UI is visible.

Closed shadow roots are intentionally harder to access from scripts. In practice, scraping closed Shadow DOM often requires a different strategy (for example, using the component’s public attributes, listening to network responses, reading accessible text, or automating user-visible interactions).

How to scrape Shadow DOM in a local browser

-------------------------------------------

If you are scraping on your own machine (Chrome/Edge/Firefox), the fastest way is to use DevTools and run JavaScript directly in the page.

### 1) Manually access the shadow root

If you know the host element, you can query it and then query inside its shadow root:

    // Example: <product-card> is the host custom element

    const host = document.querySelector("product-card");

    const price = host?.shadowRoot?.querySelector(".price")?.textContent?.trim();

    console.log({ price });

### 2) Use a helper that searches through nested Shadow DOM

Real pages often have shadow roots inside other shadow roots. This helper walks the DOM and any _open_ shadow roots to find the first match:

    function deepQuerySelector(selector, root = document) {

      const lightDomMatch = root.querySelector(selector);

      if (lightDomMatch) return lightDomMatch;

      const treeWalker = document.createTreeWalker(

        root instanceof Document ? root.documentElement : root,

        NodeFilter.SHOW_ELEMENT

      );

      for (let node = treeWalker.currentNode; node; node = treeWalker.nextNode()) {

        const el = /** @type {Element} */ (node);

        const shadowRoot = /** @type {any} */ (el).shadowRoot;

        if (!shadowRoot) continue; // closed shadow root (or no shadow root)

        const matchInShadow = shadowRoot.querySelector(selector);

        if (matchInShadow) return matchInShadow;

        const matchDeeper = deepQuerySelector(selector, shadowRoot);

        if (matchDeeper) return matchDeeper;

      }

      return null;

    }

    const titleEl = deepQuerySelector("h1");

    console.log(titleEl?.textContent?.trim());

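Shadow roots created by page scripts show up in the DevTools Elements panel as #shadow-root nodes under their host element. To also inspect the browser's built-in shadow trees (for example, inside input or video elements), enable user agent Shadow DOM inspection:

*   Chrome/Edge DevTools → Settings → Preferences → Elements → enable “Show user agent shadow DOM” (wording can vary).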

How to scrape Shadow DOM via browser extension

----------------------------------------------

A browser extension can scrape Shadow DOM by injecting a **content script** into the page. The content script runs in the context of the page (with some isolation) and can access the DOM, including **open shadow roots**, just like code you run in DevTools.

Below is a minimal Chrome/Edge Manifest V3 extension example that extracts text from a Shadow DOM selector and sends it back to the extension.

### manifest.json

    {

      "manifest_version": 3,

      "name": "Shadow DOM Scraper (Example)",

      "version": "1.0.0",

      "permissions": ["activeTab", "scripting"],

      "host_permissions": ["<all_urls>"],

      "action": { "default_title": "Scrape Shadow DOM" },

      "background": { "service_worker": "background.js" }

    }

### background.js

    chrome.action.onClicked.addListener(async (tab) => {

      if (!tab.id) return;

      const [{ result }] = await chrome.scripting.executeScript({

        target: { tabId: tab.id },

        func: () => {

          function deepQuerySelector(selector, root = document) {

            const lightDomMatch = root.querySelector(selector);

            if (lightDomMatch) return lightDomMatch;

            const treeWalker = document.createTreeWalker(

              root instanceof Document ? root.documentElement : root,

              NodeFilter.SHOW_ELEMENT

            );

            for (

              let node = treeWalker.currentNode;

              node;

              node = treeWalker.nextNode()

            ) {

              const el = node;

              const shadowRoot = el.shadowRoot;

              if (!shadowRoot) continue;

              const matchInShadow = shadowRoot.querySelector(selector);

              if (matchInShadow) return matchInShadow;

              const matchDeeper = deepQuerySelector(selector, shadowRoot);

              if (matchDeeper) return matchDeeper;

            }

            return null;

          }

          // Replace with your selector:

          const el = deepQuerySelector(".price");

          return el ? el.textContent.trim() : null;

        },

      });

      console.log("Scraped value:", result);

    });

This approach works well for “open” shadow roots. If the site uses “closed” shadow roots, you cannot access them through shadowRoot, even in an extension. In that case, a practical workaround is to scrape what the user can see (rendered text), intercept network calls (if your extension is allowed to), or use the site’s public data attributes and APIs.

If you are building a web scraper that must handle Shadow DOM reliably at scale, you usually want a real browser automation tool (Playwright/Puppeteer) instead of pure HTML parsing, because it can execute JavaScript and interact with the live DOM.

----
url: https://webcrawlerapi.com/blog/html-vs-markdown-choosing-the-right-output-format
----

Comparison · RAG · Markdown · 10 min read

HTML vs Markdown: Choosing the Right Output Format for AI

=========================================================

Explore the differences between HTML and Markdown to determine which format best suits your web development and data processing needs.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick Comparison](#quick-comparison)

*   [Overview of HTML and Markdown](#overview-of-html-and-markdown)

*   [HTML Basics](#html-basics)

*   [Markdown Basics](#markdown-basics)

*   [Comparing HTML and Markdown](#comparing-html-and-markdown)

*   [Ease of Use](#ease-of-use)

*   [Customization Options](#customization-options)

*   [Tool and Platform Compatibility](#tool-and-platform-compatibility)

*   [Use Cases for HTML and Markdown](#use-cases-for-html-and-markdown)

*   [Web Crawling and Data Extraction](#web-crawling-and-data-extraction)

*   [Using Web Crawling APIs](#using-web-crawling-apis)

*   [Preparing Data for AI/LLM](#preparing-data-for-aillm)

*   [HTML vs Markdown Comparison Table](#html-vs-markdown-comparison-table)

*   [Conclusion](#conclusion)

*   [FAQs](#faqs)

*   [How to export data from a web scraper?](#how-to-export-data-from-a-web-scraper)

Struggling to choose between HTML and Markdown? HTML is ideal for complex layouts, interactive features, and web development tasks. Markdown is the better fit for simplicity, fast content creation, and AI workflows.

### Quick Comparison

| Feature | HTML | Markdown |
| --- | --- | --- |
| **Ease of Use** | Complex, requires precision | Simple and beginner-friendly |
| **Customization** | Extensive with CSS/JavaScript | Limited to basic formatting |
| **Data Processing** | Harder to parse nested tags | Easy to parse plain text |
| **Best For** | Interactive layouts, dynamic content | Documentation, AI datasets, blogs |

**Key takeaway:** Choose HTML for visual and interactive projects. Opt for [Markdown](/blog/best-prompt-data) when simplicity and clean data are priorities. Both formats have their strengths - pick based on your workflow needs.

Overview of HTML and Markdown

-----------------------------

HTML and Markdown are two widely used tools for formatting and structuring digital content. Each serves a different purpose, catering to specific needs and workflows. Here's a closer look at their features and differences.

### HTML Basics

HTML (HyperText Markup Language) is the backbone of the web, providing detailed control over how content is structured and displayed. Its tag-based syntax allows for precise customization. For example:

    <h1>Main Title</h1>

    <p>

      This is a paragraph with <strong>bold text</strong> and <a href="#">links</a>.

    </p>

HTML is powerful because it supports complex layouts and interactive features. When paired with CSS and JavaScript, it becomes a versatile tool for creating dynamic web pages. It also handles multimedia, forms, and interactive elements, making it indispensable for web development and tasks like [web crawling](https://webcrawlerapi.com/scrapers/webcrawler/html) that require extracting specific elements [\[3\]](https://2markdown.com/blog/markdown-vs-html-content-creation).

### Markdown Basics

Markdown is a simpler alternative to HTML, designed for ease of use. Its plain text format is straightforward, making it a favorite among content creators for quick and efficient formatting. Here's an example of how Markdown achieves the same result as the HTML example:

    # Main Title

    This is a paragraph with **bold text** and [links](#).

Markdown is especially useful in web crawling and data workflows because of its:

*   **Plain Text Format**: Easier to parse and extract data from.

*   **Simplicity**: Readable by both humans and machines without extra processing.

*   **Metadata Support**: Features like front matter help organize content [\[4\]](https://codingnconcepts.com/markdown/markdown-vs-html/).

| Feature | HTML | Markdown |
| --- | --- | --- |
| Learning Curve | Requires understanding tags and attributes | Easy to pick up with intuitive syntax |
| Use Cases | Best for complex layouts and interactive features | Ideal for documentation and quick drafts |
| Customization | Highly flexible with CSS/JavaScript integration | Limited to basic formatting |
| Data Processing | Parsing requires handling nested tags | Plain text simplifies the process |
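
To make the "plain text simplifies the process" point concrete, here is a minimal sketch that pulls the heading outline out of a Markdown document with a single line-by-line scan. It only handles ATX-style headings (`#` prefixes) and skips fenced code blocks; the `extract_headings` helper is an illustrative name, not part of any library.

```python
import re

# ATX headings: one to six "#" characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def extract_headings(markdown):
    """Return (level, title) tuples for ATX headings outside code fences."""
    headings = []
    in_code_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code_fence = not in_code_fence
            continue
        if in_code_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


doc = "\n".join([
    "# Main Title",
    "",
    "This is a paragraph with **bold text** and [links](#).",
    "",
    "## Setup",
    FENCE,
    "# a comment inside code, not a heading",
    FENCE,
    "### Details",
])
print(extract_headings(doc))
```

Getting the same outline from HTML would normally require a real parser to cope with nesting, attributes, and malformed markup.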

Comparing HTML and Markdown

---------------------------

When deciding between HTML and Markdown for web crawling or data preparation, it's important to weigh their differences. Below, we break down the key aspects to consider.

### Ease of Use

HTML uses a tag-based syntax that offers a lot of power but can be tricky to master, requiring precision and attention to detail. On the other hand, Markdown relies on simple plain text formatting, making it easier to learn and less prone to errors. This simplicity allows users to create content quickly without much technical expertise.

That said, while Markdown is easier to use, HTML provides far more control for those who need detailed customization.

### Customization Options

HTML is ideal for creating complex layouts and adding interactivity to web pages. It offers extensive options for customization, making it indispensable for advanced web design and data extraction tasks. Markdown, however, focuses on simplicity and basic formatting. While this limits its flexibility, modern tools often enhance Markdown with plugins and extensions.

For instance, [WebCrawlerAPI](https://webcrawlerapi.com/) supports data extraction in both formats, giving users the freedom to choose based on their workflow requirements.

### Tool and Platform Compatibility

HTML is the backbone of all web content and works seamlessly across browsers and platforms, making it essential for projects that require precise control or intricate data extraction.

Markdown, though less flexible, shines in environments where simplicity and readability are priorities. It's especially popular on platforms like [GitHub](https://github.com/about) and content management systems. Here's a quick look at where Markdown is commonly used:

| Platform | Advantage |
| --- | --- |
| GitHub | Built-in rendering support |
| Content Management Systems | Streamlined content creation |
| Documentation Tools | Easy version control |
| AI/LLM Pipelines | Clean, parseable format |

Markdown's straightforward approach not only speeds up content creation but also minimizes errors [\[1\]](https://2markdown.com/blog/markdown-vs-html-comparison)[\[3\]](https://2markdown.com/blog/markdown-vs-html-content-creation).

Use Cases for HTML and Markdown

-------------------------------

### Web Crawling and Data Extraction

When it comes to web crawling, the choice between HTML and Markdown often depends on the type of content you're working with. HTML is perfect for handling detailed and interactive structures, such as product pages or dynamic web applications. It keeps all the intricate elements intact, making it a great fit for e-commerce and similar use cases.

Markdown, on the other hand, is ideal for extracting text-heavy content. It removes unnecessary styling but keeps the key formatting intact, which makes it especially useful for blogs, articles, and documentation.

Picking the right format is just the start. With modern APIs, you can easily extract content in your preferred format, no matter the complexity of your task.

### Using Web Crawling APIs

Today's web crawling tools are designed with flexibility in mind. Many, like WebCrawlerAPI, let you extract content in either HTML or Markdown, so you can choose the format that best suits your project without overhauling your setup.

Here’s a quick guide to how different formats work best in various scenarios:

| Scenario | Recommended Format | Key Benefit |
| --- | --- | --- |
| Content Aggregation | Markdown | Clean and easy-to-read output |
| Dynamic Web Apps | HTML | Retains complex structures |
| Documentation Sites | Markdown | Simplifies version control |
| E-commerce Data | HTML | Preserves product details |

The format you choose also plays a big role in how well the data fits into more advanced workflows, such as those involving AI or large language models (LLMs).

### Preparing Data for AI/LLM

Web crawling results are often the starting point for creating datasets for AI and LLM projects. Here, the format can make a real difference. Markdown works well for creating training datasets because it’s easier to parse and can include metadata. HTML, on the other hand, is better suited for content that relies on structural and semantic clarity.

Modern tools even offer direct conversion from HTML to Markdown, specifically tailored for AI applications like Retrieval-Augmented Generation (RAG) [\[2\]](https://scrapingant.com/blog/markdown-efficient-data-extraction)[\[6\]](https://docs.scrapingant.com/llm-markdown). This streamlines the process of preparing content while keeping its structure intact.
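
As a rough illustration of what such a conversion involves, the sketch below turns a tiny subset of HTML (headings, paragraphs, bold, italic, links) into Markdown using only the Python standard library. It is a simplified stand-in, not how any of the tools mentioned here work; `MiniMarkdownConverter` and `html_to_markdown` are hypothetical names, and real converters handle far more (lists, tables, code, nested markup).

```python
from html.parser import HTMLParser


class MiniMarkdownConverter(HTMLParser):
    """Convert a small HTML subset (h1-h3, p, strong/b, em/i, a) to Markdown."""

    def __init__(self):
        super().__init__()
        self._blocks = []   # finished Markdown blocks
        self._current = []  # text pieces of the block being built
        self._href = None   # href of the <a> currently open, if any

    def flush(self, prefix=""):
        # Collapse whitespace runs so source indentation does not leak through.
        text = " ".join("".join(self._current).split())
        if text:
            self._blocks.append(prefix + text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "p"):
            self.flush()
        elif tag in ("strong", "b"):
            self._current.append("**")
        elif tag in ("em", "i"):
            self._current.append("*")
        elif tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._current.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.flush(prefix="#" * int(tag[1]) + " ")
        elif tag == "p":
            self.flush()
        elif tag in ("strong", "b"):
            self._current.append("**")
        elif tag in ("em", "i"):
            self._current.append("*")
        elif tag == "a" and self._href is not None:
            self._current.append("](" + self._href + ")")
            self._href = None

    def handle_data(self, data):
        self._current.append(data)

    def result(self):
        self.flush()
        return "\n\n".join(self._blocks)


def html_to_markdown(html):
    converter = MiniMarkdownConverter()
    converter.feed(html)
    converter.close()
    return converter.result()


SAMPLE_HTML = """<h1>Main Title</h1>
<p>
  This is a paragraph with <strong>bold text</strong> and <a href="#">links</a>.
</p>"""

print(html_to_markdown(SAMPLE_HTML))
```

The input here is the HTML example from the "HTML Basics" section, and the output matches the Markdown example from "Markdown Basics".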

HTML vs Markdown Comparison Table

---------------------------------

Choosing between HTML and Markdown for tasks like web crawling, data preparation, and AI workflows can be tricky. Here's a side-by-side look at how they stack up in key areas:

| Feature | HTML | Markdown |
| --- | --- | --- |
| **Syntax & Readability** | Uses dense tags, making it harder to read | Clean and straightforward, very easy to follow [\[4\]](https://codingnconcepts.com/markdown/markdown-vs-html/) |
| **Ease of Use** | Requires more effort to learn and update | Quick to learn and simple to maintain [\[1\]](https://2markdown.com/blog/markdown-vs-html-comparison) |
| **Styling Control** | Offers full customization via CSS and inline styles | Limited to basic formatting capabilities [\[1\]](https://2markdown.com/blog/markdown-vs-html-comparison)[\[3\]](https://2markdown.com/blog/markdown-vs-html-content-creation) |
| **Platform Support** | Works seamlessly across all web browsers | Broad compatibility, but behavior can vary [\[1\]](https://2markdown.com/blog/markdown-vs-html-comparison) |
| **[Web Crawling Compatibility](https://webcrawlerapi.com/docs/crawling-types)** | Complex structure, harder to parse | Simplified structure for easier content extraction [\[2\]](https://scrapingant.com/blog/markdown-efficient-data-extraction)[\[3\]](https://2markdown.com/blog/markdown-vs-html-content-creation) |
| **AI/LLM Integration** | Often requires preprocessing steps | Works well for AI pipelines with metadata inclusion [\[2\]](https://scrapingant.com/blog/markdown-efficient-data-extraction) |
| **Use Case Strength** | Best for interactive applications and e-commerce sites | Ideal for blogs, documentation, and content management [\[1\]](https://2markdown.com/blog/markdown-vs-html-comparison)[\[3\]](https://2markdown.com/blog/markdown-vs-html-content-creation) |

HTML shines when you need precise styling and universal browser support, making it ideal for interactive or visually rich projects. Markdown, on the other hand, is perfect for tasks where simplicity, speed, and compatibility with AI workflows are priorities.

Your choice should depend on your specific needs - whether it's maintaining detailed structure with HTML or opting for Markdown’s ease of use and processing advantages. Up next, we'll tackle common questions to help refine your decision further.

Conclusion

----------

HTML and Markdown serve different roles in web development and data workflows. The right choice depends on what your project requires and any technical limitations you may face.

HTML, known as the backbone of web development, provides detailed customization and precise layout control. This makes it a go-to for projects that demand interactive and visually complex elements. However, its complexity can be a hurdle for simpler tasks.

Markdown, on the other hand, stands out for its simplicity, especially in data workflows and AI-related tasks. Tools like [ScrapingAnt](https://scrapingant.com/) make it easier to convert HTML into Markdown, facilitating seamless integration with text-based models [\[6\]](https://docs.scrapingant.com/llm-markdown). Similarly, tools like [Firecrawl](https://www.firecrawl.dev/) boost productivity by leveraging Markdown's straightforward structure [\[5\]](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl).

When deciding between the two, consider factors like:

*   The specific needs of your project

*   Compatibility with tools and platforms

*   The skill level of your team

Markdown is great for preprocessing text in machine learning workflows, while HTML is essential for creating visually engaging, interactive designs [\[2\]](https://scrapingant.com/blog/markdown-efficient-data-extraction). In many cases, combining the two formats can deliver optimal results. For instance, Markdown is often used for content creation, while HTML handles user-facing interfaces, allowing teams to play to the strengths of both formats [\[3\]](https://2markdown.com/blog/markdown-vs-html-content-creation).

Both formats are likely to keep evolving, each refining its core benefits. The key is to align your choice with the demands of your project and the goals you want to achieve.

FAQs

----

### How to export data from a web scraper?

Exporting data from a [web scraper](https://webcrawlerapi.com/scrapers) depends on the format you need and how you plan to use the data. Many [web crawling APIs](https://webcrawlerapi.com/scrapers/webcrawler/crawler/description) allow exports in formats like HTML, Markdown and [TXT](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format), catering to different workflows.

For **CSV exports**, you can use spreadsheet software to import the data. Make sure to use UTF-8 encoding and set the correct delimiters to avoid issues.
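
The encoding and delimiter advice above can be sketched with Python's standard `csv` module. This is a minimal illustration, not part of any scraper's API: the helper names `export_rows_to_csv` and `read_csv` and the sample rows are made up for the example.

```python
import csv
import os
import tempfile


def export_rows_to_csv(rows, path, delimiter=","):
    """Write a list of dicts to a UTF-8 CSV file with an explicit delimiter."""
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path, delimiter=","):
    """Read the file back with the same encoding and delimiter."""
    with open(path, encoding="utf-8", newline="") as fh:
        return [dict(row) for row in csv.DictReader(fh, delimiter=delimiter)]


# Sample scraped rows (invented). One title contains the delimiter itself.
rows = [
    {"url": "https://example.com/a", "title": "Café menu"},
    {"url": "https://example.com/b", "title": "Prices; updated"},
]

csv_path = os.path.join(tempfile.mkdtemp(), "export.csv")
export_rows_to_csv(rows, csv_path, delimiter=";")
loaded = read_csv(csv_path, delimiter=";")
```

Reading the file back with the same encoding and delimiter is a cheap check that non-ASCII characters and values containing the delimiter survive the round trip.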

For **HTML and Markdown exports**, modern web scraping APIs provide flexible options tailored to specific use cases:

| Output Format | Ideal For | Examples of Use |
| --- | --- | --- |
| Markdown | Text-based tasks, AI workflows | Documentation, content analysis |
| HTML | Visual and interactive content | Web development, complex layouts |
| TXT | Simple text extraction | Data cleaning, basic analysis |

When using web scraping APIs, tweaking settings like load times and delays can help improve results, especially when dealing with dynamic pages or large datasets.

Choose the export format based on your project’s needs. For instance, Markdown's clean structure is particularly helpful if you’re prepping data for AI or LLM pipelines, as it simplifies text processing. By selecting the right format, you can ensure smoother integration into tasks like content analysis, web development, or data preparation [\[2\]](https://scrapingant.com/blog/markdown-efficient-data-extraction)[\[5\]](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl).

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-expose-the-url-property-for-links
----

### How to expose the url property for links

If you need the full URL of a link in Puppeteer, read the href property of the link element. The browser resolves it to an absolute URL, even when the markup contains a relative path. You can read it from a link element handle like this:

    // using a link element handle

    const href = await linkHandle.getProperty('href');

    const url = await href.jsonValue();

    console.log(url); // full absolute URL

Or via evaluate:

    const url = await linkHandle.evaluate(el => el.href);

This exposes the link url for use in tests and automation.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-disable-xdg-open-popup-in-puppeteer
----

To stop the xdg-open popup in Puppeteer, configure a Chrome policy URLAllowlist and use a Chrome binary that reads that policy.

1.  Use a Chrome installation outside of npm or pnpm and point Puppeteer to it, for example executablePath: "/etc/opt/chrome/chrome".

2.  Create a managed policy file to allow URLs. Place it at /etc/opt/chrome/policies/managed/url\_allowlist.json with content:

    {

      "URLAllowlist": ["http://*", "https://*"]

    }

3.  Restart Chrome and run Puppeteer again.

Notes:

*   Do not install Chrome via npm or pnpm.

*   If you want to restrict to specific domains, adjust the URLAllowlist accordingly.

This approach addresses the issue without changing Puppeteer version.

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-is-the-correct-type-for-pageerror-event
----

### Summary

The pageerror event may emit not only Error objects but also values of unknown type. Treat the payload as unknown and log or stringify it instead of assuming it is an Error.

### How to handle

Use a guard to log the value you receive, and extract a message if possible.

    page.on('pageerror', (err) => {

      // err can be an Error or any other value

      if (err instanceof Error) {

        console.error('Page error:', err.message);

      } else {

        console.error('Page error:', String(err));

      }

    });

### Rationale

Puppeteer recently updated its pageerror event types to include unknown values in addition to Error objects. Handling both ensures you capture the error details without crashing.

----
url: https://webcrawlerapi.com/glossary/webcrawling/what-is-robots-txt
----

### Answer

robots.txt is a file at a site root that tells crawlers which paths they may or may not access. It uses a simple rule format that targets user agents and URL paths. Responsible crawlers check it before requesting pages. It is not a security mechanism, but a convention for crawl etiquette. Some sites also publish sitemap locations in robots.txt to help discovery. Following it helps reduce unnecessary load and avoids unwanted crawling.
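
As a rough illustration of how a crawler can check these rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `my-crawler` user agent, and the example.com URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())

can_fetch_blog = parser.can_fetch("my-crawler", "https://example.com/blog/post")
can_fetch_admin = parser.can_fetch("my-crawler", "https://example.com/admin/users")
sitemaps = parser.site_maps()  # available in Python 3.8+
```

In a real crawler you would call `set_url()` and `read()` to download the file from the site root once, then reuse the parser for every URL on that host.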

----
url: https://webcrawlerapi.com/tools/website-to-md
----

Convert website to Markdown for LLM and AI

==========================================

### Insert a URL below to convert up to 100 pages of the website to Markdown for free.

Frequently asked questions

--------------------------

### Can I use generated markdown content as llms.txt?

### Can I convert service API documentation to markdown?

### How many pages can I convert?

### Will I receive a file with the Markdown content?

### Can I use the markdown content in my AI model?

### Which URL should I insert?

### What if I need to convert more pages?

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-element-not-attached
----

The element was re-rendered between lookup and action, so the handle became stale.

Prefer locators (auto-retry) instead of storing ElementHandle values.

    // Inside a test; expect is imported from '@playwright/test'.

    const saveButton = page.getByRole('button', { name: 'Save' });

    await expect(saveButton).toBeVisible();

    await saveButton.click();

Common mistake: querying once with $ and clicking later after UI updates.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-net-err-connection-refused
----

net::ERR\_CONNECTION\_REFUSED means nothing is listening on the target host/port.

Start the web server before tests and verify baseURL or target URL.

    // playwright.config.ts

    export default {

      use: { baseURL: 'http://127.0.0.1:3000' },

      webServer: {

        command: 'npm run dev',

        url: 'http://127.0.0.1:3000',

        reuseExistingServer: true,

      },

    };

This removes race conditions where tests run before the app is ready.
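
To confirm whether anything is listening on the target host and port before blaming the tests, a small standalone check can help. The sketch below uses Python's standard `socket` module; `is_listening` is a hypothetical helper name, and the throwaway listener exists only to demonstrate both outcomes.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" as well as timeouts and unreachable hosts.
        return False


# Demo: open a throwaway listener on a free port, probe it, then close it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]

open_while_listening = is_listening("127.0.0.1", demo_port)
server.close()
open_after_close = is_listening("127.0.0.1", demo_port)
```

Running `is_listening("127.0.0.1", 3000)` against the dev server from the config above tells you whether the refusal comes from the server not being up or from a wrong URL.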

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-target-closed-crash
----

This error appears when the browser process exits unexpectedly or test cleanup closes context/page early.

Check for crashes, reduce parallel workers, and avoid using closed fixtures.

    import { test } from '@playwright/test';

    test.use({ trace: 'on-first-retry' });

    // In CI, try fewer workers if browser crashes under memory pressure.

    // npx playwright test --workers=2

If it only fails in CI, inspect memory limits and container sandbox settings.

----
url: https://webcrawlerapi.com/glossary/scraping/what-is-web-scraping
----

### Answer

Web scraping is the process of extracting specific data from web pages and converting it into structured formats. Instead of saving full pages, a scraper targets fields like titles, prices, or contact details. It typically parses HTML or API responses and maps them to a schema. Scraping is often paired with crawling to discover the pages to extract from. The result is usable data for analysis, monitoring, or automation.
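
A minimal sketch of that idea, using only Python's standard `html.parser`: the HTML snippet, the class names, and the `ProductParser` class are invented for illustration, and a real scraper would usually use a more robust selector library.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect text from elements whose class names map to schema fields."""

    # Assumed mapping from CSS class to output field.
    FIELDS = {"product-title": "title", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.FIELDS:
                self._current = self.FIELDS[cls]

    def handle_data(self, data):
        # Store the first non-empty text that follows a matching start tag.
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None


HTML = """
<div class="card">
  <h2 class="product-title">Blue Kettle</h2>
  <span class="product-price">$24.99</span>
  <p>Free shipping on orders over $50.</p>
</div>
"""

scraper = ProductParser()
scraper.feed(HTML)
record = scraper.record
# Map the raw string to the type the schema expects.
record["price"] = float(record["price"].lstrip("$"))
```

The rest of the page (the shipping note, the wrapper markup) is ignored; only the targeted fields end up in the record.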

----
url: https://webcrawlerapi.com/glossary/scraping/how-do-you-handle-pagination-when-scraping
----

### Answer

Handle pagination by identifying the next page link, page parameter, or API cursor. Start from the first page and iterate until there is no next page or results are empty. Keep a limit to avoid infinite loops or unexpected structures. Store the page number or cursor to resume if the job fails. Consistent pagination handling ensures full coverage without duplicates.
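
The loop described above can be sketched as follows. The in-memory `PAGES` data and the `fetch_page` function are stand-ins for a real HTTP call, and the cursor names are invented.

```python
# Fake paginated source: cursor -> page. None is the first page.
PAGES = {
    None: {"items": ["a", "b"], "next": "c2"},
    "c2": {"items": ["c", "d"], "next": "c3"},
    "c3": {"items": ["e"], "next": None},
}


def fetch_page(cursor):
    """Hypothetical fetcher; a real one would make an HTTP request."""
    return PAGES[cursor]


def crawl_all(start_cursor=None, max_pages=100):
    """Iterate pages until there is no next cursor, no items, or the limit hits."""
    items, seen = [], set()
    cursor = start_cursor
    last_cursor = start_cursor
    for _ in range(max_pages):  # hard limit guards against infinite loops
        page = fetch_page(cursor)
        if not page["items"]:
            break
        for item in page["items"]:
            if item not in seen:  # skip duplicates
                seen.add(item)
                items.append(item)
        last_cursor = cursor  # checkpoint: persist this to resume after a failure
        cursor = page["next"]
        if cursor is None:
            break
    return items, last_cursor


all_items, checkpoint = crawl_all()
resumed_items, _ = crawl_all(start_cursor="c2")
limited_items, _ = crawl_all(max_pages=1)
```

Storing `last_cursor` somewhere durable after each page is what makes the resume call with `start_cursor` possible.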

----
url: https://webcrawlerapi.com/docs/guides/cleaning
----

Advanced Cleaning on Crawled Data

=================================

It is possible to add extra cleaning options to your crawling job. There is a special parameter called `clean_selectors`.

[Cleaning Selectors](#cleaning-selectors)

-----------------------------------------

Cleaning selectors remove unwanted elements from crawled pages. They are applied after a page is crawled, and every element that matches one of the selectors is removed from the output.

The default value is:

    script, style, noscript, iframe, img, footer, header, nav, head

The format is a comma-separated list of CSS selectors.

### [API Example](#api-example)

    {

      "url": "https://books.toscrape.com/",

      "scrape_type": "markdown",

      "items_limit": 10,

      "clean_selectors": ".card, #main-header"

    }
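
If the selectors live in a list in your own code, the request body above can be assembled like this. It is a minimal sketch: `build_crawl_payload` is a hypothetical helper, and it only builds the JSON body without sending it anywhere.

```python
import json


def build_crawl_payload(url, selectors, items_limit=10, scrape_type="markdown"):
    """Build the request body, joining selectors into a comma-separated string."""
    cleaned = [s.strip() for s in selectors if s.strip()]  # drop empty entries
    return {
        "url": url,
        "scrape_type": scrape_type,
        "items_limit": items_limit,
        "clean_selectors": ", ".join(cleaned),
    }


payload = build_crawl_payload(
    "https://books.toscrape.com/",
    [".card", " #main-header ", ""],
)
body = json.dumps(payload, indent=2)
```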

----
url: https://webcrawlerapi.com/changelog/2025-07-16-n8n-integration
----

July 16, 2025

n8n Integration Available

=========================

----
url: https://webcrawlerapi.com/blog/how-dom-smoothie-rust-mozilla-readability-alternative-works
----

Rust, Technical

How dom\_smoothie Rust Mozilla Readability alternative works

===========================================================

A practical, step-by-step explanation of how dom\_smoothie (Rust) works as a Mozilla Readability alternative for main-content extraction.

Written by Andrew

Published on Feb 7, 2026

### Table of Contents

*   [The most important heuristics in plain English](#the-most-important-heuristics-in-plain-english)

*   [High-level end-to-end flow (\`parse()\` stages)](#high-level-end-to-end-flow-parse-stages)

*   [Core extraction model: candidates, scoring, ancestors, links, winners](#core-extraction-model-candidates-scoring-ancestors-links-winners)

*   [1) Candidate collection](#1-candidate-collection)

*   [2) Base scoring](#2-base-scoring)

*   [3) Ancestor propagation](#3-ancestor-propagation)

*   [4) Intrinsic score and class weighting](#4-intrinsic-score-and-class-weighting)

*   [5) Link density adjustment](#5-link-density-adjustment)

*   [6) Top candidates](#6-top-candidates)

*   [Candidate selection modes: Readability vs DomSmoothie](#candidate-selection-modes-readability-vs-domsmoothie)

*   [\`CandidateSelectMode::Readability\`](#candidateselectmodereadability)

*   [\`CandidateSelectMode::DomSmoothie\`](#candidateselectmodedomsmoothie)

*   [Sibling merge and why it matters](#sibling-merge-and-why-it-matters)

*   [Cleanup pipeline (\`prep\_article\`) and practical effect](#cleanup-pipeline-prep_article-and-practical-effect)

*   [Retry strategy and policies (\`Strict/Moderate/Clean/Raw\`)](#retry-strategy-and-policies-strictmoderatecleanraw)

*   [Preflight \`is\_probably\_readable()\` and when to use](#preflight-is_probably_readable-and-when-to-use)

*   [Tuning knobs and tradeoffs](#tuning-knobs-and-tradeoffs)

*   [Failure cases and debugging checklist](#failure-cases-and-debugging-checklist)

*   [Practical WebCrawlerAPI usage (\`main\_content\_only\`)](#practical-webcrawlerapi-usage-main_content_only)

Hi, I'm Andrew. I work on scraping and crawling systems every day in [WebCrawlerAPI](https://webcrawlerapi.com/).

If you have already used Mozilla Readability, dom\_smoothie will feel familiar. It uses the same core idea: score the DOM, find the main container, clean it, and return article content. On top of that, dom\_smoothie adds extra controls for retries, candidate selection, and output shaping.

If you want the Mozilla baseline explanation first, read: [Mozilla Readability Algorithm (Readability.js) explanation](/blog/mozilla-readability-algorithm-readabilityjs). If you want a code-first JavaScript integration guide, read: [Extracting article or blogpost content with Mozilla Readability](/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs).

In this post, I will explain how it works in real life, where it fails, and how to tune it without guessing.

The most important heuristics in plain English

----------------------------------------------

These heuristics do most of the heavy lifting:

1.  Unlikely blocks are removed early. Sidebars, banners, menus, modal UI, and hidden nodes are filtered out.

2.  Class and id names are weighted. Positive names get score bonus. Negative names get penalty.

3.  Text-like blocks are scored, not full pages. Paragraph-ish nodes are used as the signal source.

4.  Scores are propagated to ancestors. The real article is usually a parent container, not a single <p>.

5.  Link density is used as a penalty. Navigation-heavy blocks usually contain many links and little writing.

6.  Siblings around the winner are merged. This recovers paragraphs split by ads, widgets, or template wrappers.

7.  Conditional cleanup runs after extraction. Tables, forms, junk embeds, empty tags, and noisy blocks are removed.

That is the practical core. It is not magic. It is a layered heuristic pipeline.

High-level end-to-end flow (parse() stages)

-------------------------------------------

At a high level, Readability::parse() in dom\_smoothie does this:

1.  Check parse budget (max\_elements\_to\_parse).

2.  Parse metadata from JSON-LD (optional).

3.  Parse metadata from <meta> and <title>, then merge.

4.  Prepare DOM (remove scripts/styles/comments, normalize messy HTML).

5.  Pre-filter obvious noise (hidden/dialog/byline/duplicate title nodes).

6.  Run main content extraction and pick best candidate.

7.  Clean extracted content with prep\_article().

8.  Post-process URLs/links/classes.

9.  Build final Article object.

If extraction fails, GrabFailed is returned.

Real-life note: default parse() can retry with relaxed heuristics if output is too short. This costs CPU, but success rate improves on weird pages.

Core extraction model: candidates, scoring, ancestors, links, winners

---------------------------------------------------------------------

Inside each extraction attempt, the flow is simple and strict.

### 1) Candidate collection

The body is walked in document order.

*   Unlikely candidates can be removed

*   Empty structural nodes can be removed

*   Score-eligible tags are collected (section, h2-h6, p, td, pre)

*   div blocks are normalized into paragraph-like structure when needed

This normalization step is important because many pages use <div> for everything.

### 2) Base scoring

Very short text blocks are ignored (< 25 chars).

For valid text blocks, base score is built from:

*   constant base (2)

*   punctuation signal (comma-like count)

*   text length bonus (capped)
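
To make the base score concrete, here is a small Python sketch of the three components. The 25-character minimum and the base of 2 come from the text above. The length bonus of one point per 100 characters, capped at 3, is an assumption borrowed from Mozilla's Readability.js, and only the ASCII comma is counted here, so dom\_smoothie's exact numbers may differ.

```python
MIN_TEXT_LEN = 25            # from the text: shorter blocks are ignored
BASE_SCORE = 2               # from the text: constant base
CHARS_PER_BONUS_POINT = 100  # assumption, borrowed from Readability.js
MAX_LENGTH_BONUS = 3         # assumption, borrowed from Readability.js


def base_score(text):
    """Score one paragraph-like block from its text alone."""
    text = " ".join(text.split())  # collapse whitespace before measuring
    if len(text) < MIN_TEXT_LEN:
        return 0
    commas = text.count(",")       # punctuation signal (ASCII comma only here)
    length_bonus = min(len(text) // CHARS_PER_BONUS_POINT, MAX_LENGTH_BONUS)
    return BASE_SCORE + commas + length_bonus


short = base_score("Too short.")
plain = base_score("A sentence with no commas but enough length.")
rich = base_score("First, second, and third. " * 10)
```

The comma count rewards prose-like writing, which is why a long menu of single words scores lower than a paragraph of similar length.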

### 3) Ancestor propagation

That score is pushed up to ancestors up to depth 5:

*   parent gets full share

*   grandparent gets half

*   higher levels get smaller fraction

This is how wrapper containers win, not leaf text nodes.
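
A minimal sketch of the propagation step, with nodes represented as plain strings. The full share for the parent, the half share for the grandparent, and the depth of 5 come from the text above. The `level * 3` divider for higher levels is an assumption taken from Mozilla's Readability.js.

```python
from collections import defaultdict

MAX_DEPTH = 5  # from the text: scores travel up at most five ancestors


def divider_for(level):
    """Return the divisor applied at a given ancestor level (0 = parent)."""
    if level == 0:
        return 1        # parent gets the full score
    if level == 1:
        return 2        # grandparent gets half
    return level * 3    # assumption: Readability.js rule for higher levels


def propagate(score, ancestors, totals):
    """Add a block's score to each of its ancestors, nearest first."""
    for level, node in enumerate(ancestors[:MAX_DEPTH]):
        totals[node] += score / divider_for(level)


totals = defaultdict(float)
# Two paragraphs inside the article wrapper, one inside a sidebar.
propagate(12.0, ["div.article", "main", "body", "html"], totals)
propagate(6.0, ["div.article", "main", "body", "html"], totals)
propagate(6.0, ["aside", "main", "body", "html"], totals)

winner = max(totals, key=totals.get)
```

Because both article paragraphs feed the same parent at full weight, the wrapper ends up ahead of every other container.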

### 4) Intrinsic score and class weighting

Ancestors also get intrinsic score by tag type:

*   div gets positive prior

*   pre, td, blockquote get smaller positive prior

*   lists/forms/headings/tables can get penalties depending on type

If class/id weighting is enabled, positive patterns add points and negative patterns subtract points.

### 5) Link density adjustment

For candidates above a minimum score threshold, score is adjusted:

adjusted = score \* (1 - linkDensity)

This prevents nav-like blocks from winning when link text dominates.
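
The adjustment can be sketched in a few lines of Python. Link density is taken here as link text length divided by total text length, which is the usual Readability definition. The default of 10.0 for `min_score_to_adjust` is a placeholder for the example, not the library's default.

```python
def link_density(text_len, link_text_len):
    """Share of a block's text that sits inside links (0.0 to 1.0)."""
    if text_len == 0:
        return 0.0
    return link_text_len / text_len


def adjust(score, text_len, link_text_len, min_score_to_adjust=10.0):
    """Apply adjusted = score * (1 - linkDensity) above the score threshold."""
    if score < min_score_to_adjust:  # placeholder threshold, see lead-in
        return score
    return score * (1 - link_density(text_len, link_text_len))


article = adjust(40.0, text_len=2000, link_text_len=100)  # 5% link text
nav = adjust(40.0, text_len=200, link_text_len=180)       # 90% link text
weak = adjust(5.0, text_len=200, link_text_len=180)       # below the threshold
```

Two blocks that start with the same raw score end up far apart once link text is accounted for.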

### 6) Top candidates

Only top n\_top\_candidates are kept (default 5). The highest score starts as the top candidate, then selection mode logic can promote a better ancestor.

Candidate selection modes: Readability vs DomSmoothie

-----------------------------------------------------

dom\_smoothie supports two selection modes with slightly different behavior.

### CandidateSelectMode::Readability

*   Mozilla-like behavior

*   Looks for common ancestor among high-scoring alternatives

*   Uses overlap and relative strength checks

*   Usually safer for classic article templates

### CandidateSelectMode::DomSmoothie

*   Intersects ancestor sets of strong alternatives

*   Tries to choose the strongest meaningful common ancestor

*   Can be better on fragmented modern layouts with wrappers

In practice:

*   Start with Readability if you need conservative behavior

*   Switch to DomSmoothie if content is often split or wrapper-heavy

Sibling merge and why it matters

--------------------------------

After top candidate is chosen, extraction is not finished.

The algorithm also checks siblings under the same parent and appends the ones that look article-like. Without this step, many articles lose intro or trailing paragraphs.

Sibling inclusion uses:

*   threshold based on top score (max(10, topScore \* 0.2))

*   class-name bonus if sibling class matches top candidate class

*   paragraph heuristics for unscored siblings:

    *   enough length

    *   sentence-like text

    *   low link density

This single step fixes a lot of real pages that inject ads between paragraphs.
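
Here is a rough Python sketch of the sibling check, with siblings modeled as dicts. The threshold formula comes from the list above. The size of the class bonus (20% of the top score) and the paragraph limits (80 characters, 0.25 link density) are assumptions taken from Mozilla's Readability.js, so dom\_smoothie's values may differ.

```python
import re


def sibling_threshold(top_score):
    return max(10, top_score * 0.2)  # from the text


def should_append(sibling, top, min_len=80, max_link_density=0.25):
    """Decide whether a sibling of the top candidate joins the article."""
    score = sibling.get("score")
    if score is not None:
        same_class = bool(sibling.get("cls")) and sibling.get("cls") == top.get("cls")
        bonus = top["score"] * 0.2 if same_class else 0  # assumed bonus size
        return score + bonus >= sibling_threshold(top["score"])

    # Unscored siblings: only paragraphs are considered.
    if sibling.get("tag") != "p":
        return False
    text = sibling.get("text", "")
    density = sibling.get("link_density", 0.0)
    if len(text) > min_len and density < max_link_density:
        return True  # long enough and mostly plain text
    # Short paragraph: needs no links and a sentence-like ending.
    return (
        0 < len(text) <= min_len
        and density == 0
        and re.search(r"\.( |$)", text) is not None
    )


top = {"score": 100, "cls": "post-body"}
same_class = should_append({"score": 5, "cls": "post-body"}, top)
other_class = should_append({"score": 5, "cls": "sidebar"}, top)
intro = should_append({"tag": "p", "text": "Short intro sentence.", "link_density": 0.0}, top)
menu = should_append({"tag": "p", "text": "Home | About | Contact", "link_density": 1.0}, top)
```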

Cleanup pipeline (prep\_article) and practical effect

-----------------------------------------------------

prep\_article() is a cleanup pass over extracted content. Order matters.

Main actions:

1.  Remove tiny share/social blocks

2.  Mark data tables vs layout tables

3.  Repair lazy images (data-\* to real src/srcset)

4.  Conditionally clean forms/fieldsets

5.  Remove junk nodes (footer, aside, noisy embeds, inputs)

6.  Remove negative-weight headings

7.  Conditionally clean table, ul, div

8.  Rename h1 to h2

9.  Strip presentational attributes/styles

10.  Remove empty paragraphs and extra <br>

11.  Flatten single-cell tables

Practical effect:

*   Output becomes much more stable for markdown conversion

*   Placeholder images and layout artifacts are reduced

*   Link and text quality gets better for LLM/RAG pipelines

Retry strategy and policies (Strict/Moderate/Clean/Raw)

-------------------------------------------------------

dom\_smoothie exposes fixed policies:

*   Strict = StripUnlikelys + WeightClasses + CleanConditionally

*   Moderate = WeightClasses + CleanConditionally

*   Clean = CleanConditionally

*   Raw = no heuristic flags

Default parse() does staged fallback if extracted text is below char\_threshold:

1.  run strict

2.  disable StripUnlikelys

3.  disable WeightClasses

4.  disable CleanConditionally

If threshold is never reached, the longest attempt is returned.

Use parse\_with\_policy(policy) when you want one deterministic pass.
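
The staged fallback can be sketched in Python as a loop that drops one flag per attempt. This is an illustration of the control flow only: `parse_with_fallback` and `fake_extract` are hypothetical, and the `extract` callable stands in for one full extraction attempt.

```python
STRICT = ("StripUnlikelys", "WeightClasses", "CleanConditionally")


def parse_with_fallback(extract, char_threshold):
    """Run extract(flags) with fewer flags each time until the text is long enough."""
    flags = list(STRICT)
    attempts = []
    while True:
        text = extract(frozenset(flags))
        attempts.append(text)
        if len(text) >= char_threshold:
            return text
        if not flags:
            break
        flags.pop(0)  # disable the next flag in the documented order
    # Threshold never reached: return the longest attempt.
    return max(attempts, key=len)


calls = []


def fake_extract(flags):
    """Stand-in extractor: fewer active flags yield longer text."""
    calls.append(sorted(flags))
    return "x" * {3: 50, 2: 120, 1: 600, 0: 900}[len(flags)]


result = parse_with_fallback(fake_extract, char_threshold=500)
calls_until_success = len(calls)

calls.clear()
longest = parse_with_fallback(fake_extract, char_threshold=5000)
calls_when_exhausted = len(calls)
```

Each extra attempt repeats the extraction work, which is the CPU cost mentioned earlier.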

Preflight is\_probably\_readable() and when to use

--------------------------------------------------

is\_probably\_readable() is a cheap gate before full extraction.

It checks nodes like p, pre, article, and some div\-related patterns, then:

*   skips hidden/unlikely/list-like paragraph nodes

*   requires minimum text per node (default 140 chars)

*   accumulates score using sqrt(textLen - minContentLen)

*   returns true when threshold is reached (default 20)

Use it when:

*   you crawl many non-article pages

*   you need to save CPU on extraction

*   you want fast early rejection for nav/search/list pages

Do not use it as a quality guarantee. It is a preflight only.
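
The scoring part of the preflight can be sketched like this. The input is a list of text lengths for nodes that already passed the hidden/unlikely/list filters, which is a simplification. The defaults of 140 and 20 come from the list above. Whether the real comparison is `>` or `>=` at the exact threshold is not stated, so `>=` here is an assumption.

```python
import math


def is_probably_readable(node_text_lengths, min_content_len=140, min_score=20.0):
    """Cheap gate: accumulate sqrt(textLen - minContentLen) over eligible nodes."""
    score = 0.0
    for text_len in node_text_lengths:
        if text_len < min_content_len:
            continue  # node too short to count
        score += math.sqrt(text_len - min_content_len)
        if score >= min_score:  # assumed boundary handling, see lead-in
            return True         # early exit keeps the check cheap
    return False


long_article = is_probably_readable([600, 600])
two_medium_paragraphs = is_probably_readable([300, 300])
listing_page = is_probably_readable([150, 160, 100])
```

The square root means a few long paragraphs pass quickly, while many barely-long-enough snippets add up slowly.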

Tuning knobs and tradeoffs

--------------------------

Most useful options in practice:

*   char\_threshold Higher value reduces false positives, but can drop short valid articles.

*   n\_top\_candidates More candidates helps hard pages, but increases processing cost.

*   min\_score\_to\_adjust Changes when link-density penalty starts.

*   candidate\_select\_mode Readability is conservative, DomSmoothie can recover fragmented content better.

*   max\_elements\_to\_parse Protects CPU/memory on huge DOMs, but can fail giant pages if too low.

*   disable\_json\_ld Faster and simpler metadata path, but you may lose high-quality structured metadata.

*   keep\_classes and classes\_to\_preserve Better styling compatibility vs cleaner output.

*   text\_mode (Raw, Formatted, Markdown) Choose based on downstream pipeline, not personal preference.

Failure cases and debugging checklist

-------------------------------------

No extractor works on all pages. These are common failure modes.

*   Content loaded by JS after initial HTML (empty shell problem)

*   List/search pages that look text-heavy but are not articles

*   Very short pages that cannot pass thresholds

*   Pages with extreme link-heavy templates

*   Broken HTML where wrappers are malformed

*   Aggressive cleanup removing valid blocks

My quick checklist:

1.  Inspect fetched HTML first. Is real content present server-side?

2.  Check is\_probably\_readable() result before parse.

3.  Compare Strict vs Raw output lengths.

4.  Try alternative candidate\_select\_mode.

5.  Lower char\_threshold for short-form sources.

6.  Inspect sibling merge impact (lost intro/outro is a common signal).

7.  Verify lazy image normalization if media looks empty.

8.  Log top candidate score, link density, and final text length.

Practical WebCrawlerAPI usage (main\_content\_only)

---------------------------------------------------

If you do not want to run extraction infra yourself, use main\_content\_only in WebCrawlerAPI.

    // Node 18+

    const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {

      method: "POST",

      headers: {

        Authorization: "Bearer YOUR_API_KEY",

        "Content-Type": "application/json",

      },

      body: JSON.stringify({

        url: "https://example.com/article",

        main_content_only: true,

        scrape_type: "markdown",

      }),

    });

    const data = await response.json();

    console.log(data);
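
For Python pipelines, the same request can be assembled with the standard library. This sketch mirrors the endpoint, headers, and body of the JavaScript example above and only builds the request. `build_scrape_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_scrape_request(api_key, url):
    """Build (but do not send) the POST request from the example above."""
    payload = {
        "url": url,
        "main_content_only": True,
        "scrape_type": "markdown",
    }
    return urllib.request.Request(
        "https://api.webcrawlerapi.com/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scrape_request("YOUR_API_KEY", "https://example.com/article")

# To send it:
# with urllib.request.urlopen(request) as resp:
#     data = json.load(resp)
```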

For quick experiments, you can also use the free tool: [HTML Main Content Readability](https://webcrawlerapi.com/tools/html-main-content-readability).

If you want the Mozilla baseline first, read this guide: [Mozilla Readability Algorithm (Readability.js) explanation](/blog/mozilla-readability-algorithm-readabilityjs). If you want the JS implementation tutorial, read: [Extracting article or blogpost content with Mozilla Readability](/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs).

----
url: https://webcrawlerapi.com/blog/how-to-crawl-the-website-with-python
----

Python, Tutorial, API, 10 min read

How to crawl the website with Python

====================================

There are several ways to crawl a website's content with Python, and each has its pros and cons. Let's look at them in more detail.

Written by Andrew

Published on Feb 6, 2026

### Table of Contents

*   [How to Crawl a Website with Python: Complete Guide with Code Examples](#how-to-crawl-a-website-with-python-complete-guide-with-code-examples)

*   [Simplest copy-paste working Python crawling example](#simplest-copy-paste-working-python-crawling-example)

*   [What is Web Crawling (and How It Differs from Scraping)](#what-is-web-crawling-and-how-it-differs-from-scraping)

*   [Simple Python Website Crawler with Requests and BeautifulSoup](#simple-python-website-crawler-with-requests-and-beautifulsoup)

*   [Installing the Required Libraries](#installing-the-required-libraries)

*   [Crawling a Single Page](#crawling-a-single-page)

*   [Following Links to Crawl Multiple Pages](#following-links-to-crawl-multiple-pages)

*   [Extracting and Storing Data](#extracting-and-storing-data)

*   [Building a Production Web Crawler with Scrapy](#building-a-production-web-crawler-with-scrapy)

*   [Why Scrapy for Larger Projects](#why-scrapy-for-larger-projects)

*   [Creating Your First Scrapy Spider](#creating-your-first-scrapy-spider)

*   [Scrapy Crawling Rules and Link Extraction](#scrapy-crawling-rules-and-link-extraction)

*   [Processing and Exporting Scraped Data](#processing-and-exporting-scraped-data)

*   [Crawling JavaScript-Heavy Websites with Python](#crawling-javascript-heavy-websites-with-python)

*   [The JavaScript Problem](#the-javascript-problem)

*   [Using Selenium for JavaScript Rendering](#using-selenium-for-javascript-rendering)

*   [Playwright as a Selenium Alternative](#playwright-as-a-selenium-alternative)

*   [Crawling All Links on a Website (Full Site Crawl)](#crawling-all-links-on-a-website-full-site-crawl)

*   [Method 1: Start with \`sitemap.xml\`](#method-1-start-with-sitemapxml)

*   [Method 2: Breadth-first link crawling (BFS)](#method-2-breadth-first-link-crawling-bfs)

*   [Best Practices for Python Web Crawlers](#best-practices-for-python-web-crawlers)

*   [Respecting \`robots.txt\` and Rate Limiting](#respecting-robotstxt-and-rate-limiting)

*   [Handling Errors and Retries](#handling-errors-and-retries)

*   [Using User Agents and Headers](#using-user-agents-and-headers)

*   [Avoiding Blocks and CAPTCHA](#avoiding-blocks-and-captcha)

*   [Scaling Your Python Crawler (When DIY Gets Hard)](#scaling-your-python-crawler-when-diy-gets-hard)

*   [Common Use Cases for Python Web Crawlers](#common-use-cases-for-python-web-crawlers)

*   [Price monitoring](#price-monitoring)

*   [Lead generation (contact discovery)](#lead-generation-contact-discovery)

*   [SEO audits and competitor research](#seo-audits-and-competitor-research)

*   [Content aggregation](#content-aggregation)

*   [Market research](#market-research)

*   [Troubleshooting Common Crawling Problems](#troubleshooting-common-crawling-problems)

*   [Frequently Asked Questions](#frequently-asked-questions)

*   [Is web crawling legal?](#is-web-crawling-legal)

*   [What is the difference between crawling and scraping?](#what-is-the-difference-between-crawling-and-scraping)

*   [How fast should a crawler run?](#how-fast-should-a-crawler-run)

*   [Should \`robots.txt\` be respected?](#should-robotstxt-be-respected)

*   [What is the best Python library for crawling?](#what-is-the-best-python-library-for-crawling)

*   [How should pagination be handled?](#how-should-pagination-be-handled)

*   [How should duplicates be handled?](#how-should-duplicates-be-handled)

*   [Crawl data from the website with an API in Python.](#crawl-data-from-the-website-with-an-api-in-python)

*   [When should a crawling API be used?](#when-should-a-crawling-api-be-used)

*   [Start crawling job in Python.](#start-crawling-job-in-python)


How to Crawl a Website with Python: Complete Guide with Code Examples

=====================================================================

Possible ways to crawl a website in Python:

*   [Simplest copy-paste working Python crawling example](#simplest-copy-paste-working-python-crawling-example)

*   [Simple Python Website Crawler with Requests and BeautifulSoup](#simple-python-website-crawler-with-requests-and-beautifulsoup)

*   [Building a Production Web Crawler with Scrapy](#building-a-production-web-crawler-with-scrapy)

*   [Crawling JavaScript-Heavy Websites with Python](#crawling-javascript-heavy-websites-with-python)

*   [Crawl data from the website with an API in Python](#crawl-data-from-the-website-with-an-api-in-python)

### Simplest copy-paste working Python crawling example

Before we dive into different approaches, here's a minimal working crawler you can run right now:

    #!/usr/bin/env python3

    # Install dependencies first

    # pip install requests beautifulsoup4

    import requests

    from bs4 import BeautifulSoup

    from urllib.parse import urljoin, urlparse

    from collections import deque

    def crawl(start_url, max_pages=10):

        """Simple web crawler - just copy and run!"""

        visited = []

        queue = deque([start_url])

        domain = urlparse(start_url).netloc

        while queue and len(visited) < max_pages:

            url = queue.popleft()

            if url in visited:

                continue

            try:

                # Fetch the page

                response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

                response.raise_for_status()

                visited.append(url)

                print(f"✓ Crawled: {url}")

                # Parse HTML and find all links

                soup = BeautifulSoup(response.text, "html.parser")

                for link in soup.find_all("a", href=True):

                    full_url = urljoin(url, link["href"])

                    # Only crawl same domain

                    if urlparse(full_url).netloc == domain and full_url not in visited:

                        queue.append(full_url)

            except Exception as e:

                print(f"✗ Failed: {url} - {e}")

        return visited

    # Run the crawler

    if __name__ == "__main__":

        urls = crawl("https://quotes.toscrape.com/", max_pages=5)

        print(f"\nTotal pages crawled: {len(urls)}")

Save this as crawler.py and run with python3 crawler.py. It will crawl up to 5 pages from quotes.toscrape.com.

**What this does:** It starts at one URL, fetches the HTML with requests, parses it with BeautifulSoup to find all links, and follows links within the same domain. Perfect for learning, but for production use (handling JavaScript, rate limits, proxies, etc.) - keep reading.

If you want to crawl a website with Python, you can get surprisingly far with a small script - as long as you know what will break in real life. I will start with a copy-paste crawler that actually runs, then build it up step by step (from Requests + BeautifulSoup to Scrapy, and to browser tools for JavaScript-heavy sites). Along the way I will show the boring but important parts: robots.txt, rate limits, and what to do when you hit 403/429 blocks. **If you only need a few hundred pages, DIY is fine** - if you need thousands, retries, proxies, and scheduling become the real work, and that is where a service like [WebCrawlerAPI](https://webcrawlerapi.com) can make sense later.

* * *

What is Web Crawling (and How It Differs from Scraping)

-------------------------------------------------------

People mix these terms up constantly, so let me clear it up.

**Crawling** is about discovering pages. You start at one URL, grab all the links on that page, then visit those links, grab more links, and keep going. Think of it like exploring a maze - you're mapping out what exists, not necessarily reading every sign on the wall.

**Scraping** is about extracting specific data from pages you already found. You grab product prices, article titles, contact info, reviews - whatever data you actually need from the HTML.

Here's the real difference in practice:

*   A **crawler** hits 100 pages and returns a list of URLs

*   A **scraper** hits those same 100 pages and returns structured data (JSON, CSV, database rows)

Most real projects do both. You crawl to find all product pages on an e-commerce site, then scrape each page to extract the price, title, and specs. The crawler discovers, the scraper extracts.

Scraping has its own problems (parsing messy HTML, handling JavaScript, dealing with rate limits), but crawling adds the complexity of navigation logic on top.

If you just need data from 5 specific URLs you already know? Skip the crawler, just scrape those pages directly. If you need to discover everything on a site first? You need a proper crawler.

* * *

Simple Python Website Crawler with Requests and BeautifulSoup

-------------------------------------------------------------

This section builds up a working crawler step by step. If you want to see the complete final version first, check out [this gist](https://gist.github.com/n10ty/57e1379ba608f0bda369c5825b27f4fa) - it's a production-ready crawler with robots.txt handling, proper delays, and CSV export. We'll break down the key parts below.

If you want the smallest possible one-file crawler (dedupe + same-site scope + URL normalization), see: [BeautifulSoup4 Web Crawler](/blog/beatifulsoup-webcrawler).

### Installing the Required Libraries

You need two packages: requests for fetching web pages, and beautifulsoup4 for parsing HTML.

    pip install requests beautifulsoup4

**What each library does:**

*   **requests** - Makes HTTP requests to fetch web pages. It handles all the low-level networking, headers, cookies, timeouts. Much easier than Python's built-in urllib.

*   **beautifulsoup4** - Parses messy HTML into a tree you can navigate with simple Python code. Handles broken HTML that would crash a strict parser.

**Python version:** The examples in this guide need Python 3.9 or higher, because they use built-in generic annotations such as list[str]. Both libraries work fine on Python 3.12+.

If you're in a virtual environment (you should be):

    python3 -m venv .venv

    source .venv/bin/activate  # On Windows: .venv\Scripts\activate

    pip install requests beautifulsoup4

That's it. No browser drivers, no headless Chrome, no Docker - just two pure Python packages.

### Crawling a Single Page

Let's start simple. Fetch one page, grab its title and all the links on it.

    import requests

    from bs4 import BeautifulSoup

    from urllib.parse import urljoin

    def crawl_single_page(url: str):

        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

        r.raise_for_status()

        soup = BeautifulSoup(r.text, "html.parser")

        title = soup.title.get_text(strip=True) if soup.title else ""

        links = [urljoin(url, a["href"]) for a in soup.select("a[href]")]

        return title, links

**What happens here:**

1.  **requests.get()** fetches the page. The User-Agent header makes us look like a browser instead of Python (some sites block the default Python user agent).

2.  **BeautifulSoup** parses the HTML. The html.parser is built into Python - no extra install needed.

3.  **soup.title.get\_text()** extracts the text from the <title> tag.

4.  **soup.select("a\[href\]")** finds every <a> tag that has an href attribute.

5.  **urljoin()** converts relative URLs like /page/2 into absolute URLs like https://quotes.toscrape.com/page/2.
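
To make step 5 concrete, here is a small sketch of how urljoin() resolves different kinds of href values. The resolve\_links helper and the sample hrefs are made up for illustration; only urljoin() itself comes from the code above.

```python
from urllib.parse import urljoin

def resolve_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the URL of the page they came from."""
    return [urljoin(page_url, href) for href in hrefs]

# Hypothetical hrefs as they might appear on a tag page.
resolved = resolve_links(
    "https://quotes.toscrape.com/tag/love/",
    [
        "/page/2/",               # root-relative: replaces the whole path
        "page/2/",                # relative: appended to the current directory
        "https://example.com/x",  # already absolute: returned unchanged
    ],
)
print(resolved)
```

Root-relative and directory-relative links resolve differently, which is why the page URL (not the site root) should be passed as the base.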

This works fine for one page. But if you try to run this on 100 pages, you'll hit problems: no retry logic, no delay between requests, no way to avoid visiting the same page twice.

### Following Links to Crawl Multiple Pages

Now we scale it up. Visit multiple pages by following links, but stay on the same domain and avoid infinite loops.

    import requests

    from bs4 import BeautifulSoup

    from urllib.parse import urljoin, urlparse

    from collections import deque

    def crawl_multiple_pages(seed_url, max_pages=10):

        domain = urlparse(seed_url).netloc

        visited = set()

        queue = deque([seed_url])

        while queue and len(visited) < max_pages:

            url = queue.popleft()

            if url in visited:

                continue

            visited.add(url)

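            try:

                resp = requests.get(url, timeout=10)

                resp.raise_for_status()

            except requests.RequestException:

                continue  # skip pages that fail (timeout, 404, connection error)

            soup = BeautifulSoup(resp.text, "html.parser")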

            for a in soup.select("a[href]"):

                next_url = urljoin(url, a["href"])

                if urlparse(next_url).netloc == domain:

                    queue.append(next_url)

        return list(visited)

**Key parts:**

*   **deque** - A queue for breadth-first crawling. We add links to the right, pop URLs from the left. This crawls level by level instead of diving deep into one branch.

*   **visited set** - Prevents visiting the same URL twice. Crucial for avoiding infinite loops.

*   **domain check** - urlparse(next\_url).netloc == domain keeps us on the same site. Without this, we'd crawl the entire internet.

*   **try/except** - If one page fails (timeout, 404, connection error), we skip it and keep going.

**What's still missing:** This doesn't respect robots.txt, doesn't add delays between requests (will get you blocked fast), and doesn't handle redirects properly. The full gist example fixes all of this.

### Extracting and Storing Data

Now let's extract real data and save it somewhere useful. The main idea is that one crawl loop will produce structured rows, and then those rows will be exported.

    import csv

    from dataclasses import dataclass

    @dataclass

    class PageResult:

        url: str

        status: int

        title: str

    def save_to_csv(rows: list[PageResult], output_path: str) -> None:

        with open(output_path, "w", newline="", encoding="utf-8") as f:

            w = csv.DictWriter(f, fieldnames=["url", "status", "title"])

            w.writeheader()

            for r in rows:

                w.writerow({"url": r.url, "status": r.status, "title": r.title})

    # In the crawl loop, PageResult objects will be appended and then exported:

    # results.append(PageResult(url=final_url, status=resp.status_code, title=title))

    # save_to_csv(results, "out/crawl_results.csv")

**What the complete crawler adds** (only the dataclass and the CSV export appear in the snippet above):

*   **dataclass** - Clean way to store structured data. Better than dicts for type safety.

*   **Session object** - Reuses the same HTTP connection. Faster than creating a new connection for every request.

*   **normalize\_url()** - Removes URL fragments (#section) so page.html and page.html#top count as the same page.

*   **Content-Type check** - Skips PDFs, images, and other non-HTML files. Prevents trying to parse binary data with BeautifulSoup.

*   **time.sleep()** - Adds a 0.5 second delay between requests. This is critical. Without delays, many sites will ban your IP after 10-20 requests.

*   **CSV export** - Saves data in a format you can open in Excel, import into a database, or process with pandas.
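
Two of the helpers in this list are small enough to sketch with the standard library alone. This is a minimal sketch, not the code from the gist: normalize\_url() follows the description above, while is\_html() is a hypothetical name for the Content-Type check.

```python
from urllib.parse import urldefrag

def normalize_url(url: str) -> str:
    """Drop the #fragment so page.html and page.html#top count as one URL."""
    clean, _fragment = urldefrag(url)
    return clean

def is_html(content_type: str) -> bool:
    """True when a Content-Type header value describes an HTML document."""
    # Ignore parameters such as "; charset=utf-8" and compare the media type only.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")

print(normalize_url("https://example.com/page.html#top"))
print(is_html("text/html; charset=utf-8"), is_html("application/pdf"))
```

In a crawl loop, normalize\_url() would be applied before the visited check, and is\_html() to the response's Content-Type header before handing the body to BeautifulSoup.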

**Alternative: JSON output**

If you prefer JSON instead of CSV:

    import json

    def save_to_json(rows, output_path: str) -> None:

        with open(output_path, "w", encoding="utf-8") as f:

            json.dump([r.__dict__ for r in rows], f, indent=2)

**Real-world extraction:**

For production use, you'd extract more fields:

    # Examples of common extractions

    description = (soup.find("meta", attrs={"name": "description"}) or {}).get("content", "")

    heading = (soup.find("h1").get_text(strip=True) if soup.find("h1") else "")

    price_el = soup.select_one("span.price")

    product_price = price_el.get_text(strip=True) if price_el else None

This basic crawler will get you surprisingly far for small-scale projects. For the complete version with robots.txt handling and better error handling, see the [full gist example](https://gist.github.com/n10ty/57e1379ba608f0bda369c5825b27f4fa).

**When this breaks:** JavaScript-heavy sites, aggressive rate limiting, CAPTCHAs, login-required pages. We'll cover those problems in the next sections.

* * *

Building a Production Web Crawler with Scrapy

---------------------------------------------

If you're crawling hundreds of pages and the Requests+BeautifulSoup approach starts to feel like you're duct-taping features together (retry logic here, rate limiting there, duplicate detection somewhere else), it's time to switch to Scrapy. **Scrapy is a production web crawling framework** - not just a library. It handles all the annoying infrastructure stuff so you can focus on extracting data.

All examples are in the [Scrapy Website Crawler Examples](https://github.com/WebCrawlerAPI/crawl-scrapy-examples) GitHub repo.

### Why Scrapy for Larger Projects

Scrapy gives you features that would take weeks to build yourself:

*   **Built-in concurrency** - Scrapy handles multiple requests in parallel automatically. You write single-threaded code, Scrapy runs it concurrently using Twisted. No threading, no async/await complexity. Set CONCURRENT\_REQUESTS = 16 and you're crawling 16 pages at once.

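*   **Automatic retries** - Connection failures, timeouts, and server errors such as 500 and 503 are retried automatically by Scrapy's retry middleware. The retry count (RETRY\_TIMES) and the retryable status codes (RETRY\_HTTP\_CODES) are configurable. There is no built-in exponential backoff; enable AutoThrottle or add a custom middleware if you need the crawl to slow down.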

*   **robots.txt handling** - Set ROBOTSTXT\_OBEY = True and Scrapy fetches each site's robots.txt and checks every request against it. No manual parsing needed.

*   **Request prioritization** - Scrapy uses a priority queue. You can mark certain URLs as high priority and they'll get crawled first.

*   **Middlewares and pipelines** - Clean separation between fetching data (spider), processing data (pipeline), and request handling (middleware). Add logging, duplicate filtering, database saving without touching your spider code.

*   **Response caching** - Enable HTTP cache middleware and Scrapy stores responses on disk. Run the same crawl 100 times while testing your parser without hitting the server once.
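
As a rough sketch, the features above map to Scrapy settings like the ones below. The setting names are real Scrapy settings; the CRAWL\_SETTINGS name and the chosen values are illustrative assumptions, not recommendations for any particular site.

```python
# Illustrative values only; tune them per site.
CRAWL_SETTINGS = {
    "CONCURRENT_REQUESTS": 16,      # requests in flight at once
    "DOWNLOAD_DELAY": 0.5,          # seconds between requests to the same site
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 2,               # retries per failed request
    "ROBOTSTXT_OBEY": True,         # check requests against robots.txt
    "HTTPCACHE_ENABLED": True,      # cache responses on disk while developing
    "HTTPCACHE_EXPIRATION_SECS": 0, # 0 = cached responses never expire
}

# In a spider, this dict could be assigned as:  custom_settings = CRAWL_SETTINGS
print(sorted(CRAWL_SETTINGS))
```

The same keys can live in a project's settings.py instead, where they apply to every spider.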

#### When to use Scrapy instead of BeautifulSoup in Python

*   Crawling more than 50 pages

*   Need to crawl regularly (daily/weekly jobs)

*   Multi-step crawling (list pages → detail pages → pagination)

*   Need structured data output (JSON, CSV, database)

*   Care about politeness (delays, robots.txt)

#### When to stick with Requests

*   One-off script for 5-10 pages

*   Simple proof of concept

*   Already embedded in a larger codebase

The initial setup cost is higher with Scrapy, but for any serious crawling work, it pays off fast.

### Creating Your First Scrapy Spider

First, install Scrapy:

    pip install scrapy

With Scrapy you don't need BeautifulSoup - it ships its own selector engine (parsel, built on lxml), which is typically faster than BeautifulSoup with html.parser.

**Simple spider example:**

    import scrapy

    class QuotesSpider(scrapy.Spider):

        name = "quotes"

        allowed_domains = ["quotes.toscrape.com"]

        start_urls = ["https://quotes.toscrape.com/"]

        custom_settings = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 0.5}

        def parse(self, response):

            for quote in response.css("div.quote"):

                yield {

                    "text": quote.css("span.text::text").get(),

                    "author": quote.css("small.author::text").get(),

                }

            next_page = response.css("li.next a::attr(href)").get()

            if next_page:

                yield response.follow(next_page, callback=self.parse)

**Run it:**

    scrapy runspider quotes_spider.py -o output.json

Full example in [quotes\_spider.py](https://github.com/WebCrawlerAPI/crawl-scrapy-examples/blob/master/quotes_spider.py).

That's it. Scrapy handles everything: fetches pages, calls parse() for each response, follows the links you yield, exports results to JSON.

**Key parts explained:**

*   **name** - Spider identifier. Required. Used when running via scrapy crawl quotes.

*   **allowed\_domains** - Scrapy won't follow links outside these domains. Safety feature to prevent runaway crawls.

*   **start\_urls** - List of URLs to start crawling. Scrapy fetches these first.

*   **custom\_settings** - Spider-specific settings. Override global config without editing files.

*   **parse(response)** - Called for every successful response. Return/yield dictionaries (data) or Request objects (more pages to crawl).

*   **response.css()** - CSS selector API. ::text extracts text, ::attr(href) extracts attributes, .get() returns first match, .getall() returns all matches.

*   **response.follow()** - Creates a new Request. Handles relative URLs automatically. You specify the callback method.

**CSS selectors vs XPath:**

Scrapy supports both. CSS is more readable for simple cases:

    # CSS

    response.css("div.quote span.text::text").get()

    # XPath (equivalent here; @class='quote' only matches when the attribute is exactly "quote")

    response.xpath("//div[@class='quote']//span[@class='text']/text()").get()

Use CSS for 90% of cases. Switch to XPath when you need complex logic like "find the table cell in the same row as the one containing 'Price'".

**Complete working example:**

The full spider code including author page parsing is available in [crawl-scrapy-examples](https://github.com/WebCrawlerAPI/crawl-scrapy-examples). It demonstrates multiple callback methods and structured data extraction.

### Scrapy Crawling Rules and Link Extraction

For complex crawling patterns (follow all pagination, follow all category pages, but don't follow external links), use **CrawlSpider** instead of the basic Spider. You define rules, Scrapy does the rest.

    from scrapy.linkextractors import LinkExtractor

    from scrapy.spiders import CrawlSpider, Rule

    class QuotesCrawlSpider(CrawlSpider):

        name = "quotes_crawl"

        allowed_domains = ["quotes.toscrape.com"]

        start_urls = ["https://quotes.toscrape.com/"]

        rules = (

            Rule(LinkExtractor(restrict_css="li.next a"), callback="parse_quotes", follow=True),

        )

        def parse_quotes(self, response):

            for quote in response.css("div.quote"):

                yield {

                    "text": quote.css("span.text::text").get(),

                    "author": quote.css("small.author::text").get(),

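                }

        def parse_start_url(self, response, **kwargs):

            # Rule callbacks do not run on start_urls, so parse the first page here.

            return self.parse_quotes(response)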

Full example in [quotes\_crawlspider.py](https://github.com/WebCrawlerAPI/crawl-scrapy-examples/blob/master/quotes_crawlspider.py). Run it and pagination will be followed automatically based on the rule.

**How rules work:**

Each Rule tells Scrapy:

1.  **What links to extract** - LinkExtractor() finds matching links

2.  **What to do with them** - Call a callback method to parse the page

3.  **Whether to follow** - If follow=True, Scrapy extracts links from those pages too

**LinkExtractor options:**

    # Common patterns

    LinkExtractor(restrict_css="a.product-link")

    LinkExtractor(allow=r"/product/\d+", deny=r"/admin/")
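
To get a feel for how allow and deny patterns interact, here is a standard-library approximation. It is not Scrapy's implementation: link\_allowed() is a hypothetical helper that assumes both patterns are regular expressions searched against the absolute URL, with deny taking precedence.

```python
import re

def link_allowed(url: str, allow: str = "", deny: str = "") -> bool:
    """Approximate allow/deny filtering: deny wins, an empty allow means 'allow all'."""
    if deny and re.search(deny, url):
        return False
    if allow and not re.search(allow, url):
        return False
    return True

# Hypothetical URLs checked against the patterns from the example above.
for candidate in (
    "https://shop.example/product/42",
    "https://shop.example/admin/product/42",
    "https://shop.example/about",
):
    print(candidate, link_allowed(candidate, allow=r"/product/\d+", deny=r"/admin/"))
```

Testing patterns this way before a crawl is cheaper than discovering a too-broad regex after a few thousand requests.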

**Depth limiting:**

Set DEPTH\_LIMIT to prevent crawling too deep. Depth 0 is start URLs, depth 1 is pages linked from start URLs, depth 2 is pages linked from those, etc.

    custom_settings = {

        "DEPTH_LIMIT": 2,  # Only crawl 2 levels deep

    }
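
The depth numbering can be illustrated without Scrapy. The sketch below walks a tiny in-memory link graph (the site dict and the crawl\_depths() helper are invented for this example) and records the depth at which each page is first reached; it mirrors the idea behind DEPTH\_LIMIT rather than Scrapy's own code.

```python
from collections import deque

def crawl_depths(links: dict[str, list[str]], start: str, depth_limit: int) -> dict[str, int]:
    """Return {url: depth} for every page reachable within depth_limit hops."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depths[url] >= depth_limit:
            continue  # pages at the limit are visited, but their links are not followed
        for nxt in links.get(url, []):
            if nxt not in depths:
                depths[nxt] = depths[url] + 1
                queue.append(nxt)
    return depths

# Hypothetical site: "/" links to "/a" and "/b", "/a" links deeper.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"]}
reachable = crawl_depths(site, "/", depth_limit=2)
print(reachable)
```

With a limit of 2, the page at depth 3 ("/a/1/x") is never requested.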

### Processing and Exporting Scraped Data

For production crawlers, you want structured data, not just dictionaries. Scrapy provides **Items** for data structures with declared fields and **Pipelines** for processing.

**Define data structures with Items:**

    from scrapy import Item, Field

    class QuoteItem(Item):

        text = Field()

        author = Field()

**Use Items in your spider:**

    import scrapy

    class QuotesItemSpider(scrapy.Spider):

        name = "quotes_items"

        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):

            for quote in response.css("div.quote"):

                item = QuoteItem()

                item["text"] = quote.css("span.text::text").get()

                item["author"] = quote.css("small.author::text").get()

                yield item

**Why use Items instead of dicts?**

*   **Declared fields** - Assigning to a field that was not declared raises a KeyError, so typos are caught early

*   **Validation** - Add field processors to clean data

*   **IDE autocomplete** - Better developer experience

*   **Pipeline compatibility** - Pipelines can check item types

**Data processing with Pipelines:**

Full example in [quotes\_items.py](https://github.com/WebCrawlerAPI/crawl-scrapy-examples/blob/master/quotes_items.py).

Pipelines receive items after extraction and before export. Use them to clean, validate, deduplicate, or save to databases.

    from scrapy.exceptions import DropItem

    class QuotesPipeline:

        def __init__(self):

            self.seen_quotes = set()

        def process_item(self, item, spider):

            if isinstance(item, QuoteItem):

                text = (item.get("text") or "").strip()

                if text in self.seen_quotes:

                    raise DropItem("duplicate")

                self.seen_quotes.add(text)

                item["text"] = text

            return item

**Enable pipelines in settings:**

    custom_settings = {

        "ITEM_PIPELINES": {

            "myproject.pipelines.QuotesPipeline": 300,

            "myproject.pipelines.DatabasePipeline": 400,

        },

    }

The number (300, 400) is the priority - lower numbers run first.

**Export formats:**

Scrapy exports to multiple formats out of the box:

    scrapy runspider spider.py -o output.json

    scrapy runspider spider.py -o output.csv

**Custom export to database:**

For database export, use a pipeline:

    import sqlite3

    class DatabasePipeline:

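        def open_spider(self, spider):

            self.conn = sqlite3.connect("quotes.db")

            self.cursor = self.conn.cursor()

            # Column names are assumed here; the third column stays empty, as in the INSERT below.

            self.cursor.execute("CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, extra TEXT)")

        def process_item(self, item, spider):

            if isinstance(item, QuoteItem):

                self.cursor.execute(

                    "INSERT INTO quotes VALUES (?, ?, ?)",

                    (item.get("text"), item.get("author"), "")

                )

            return item

        def close_spider(self, spider):

            self.conn.commit()  # without a commit, sqlite3 discards the inserted rows

            self.conn.close()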

**Complete working example** with Items and Pipeline: [quotes\_items.py](https://github.com/WebCrawlerAPI/crawl-scrapy-examples/blob/master/quotes_items.py). It demonstrates structured data extraction, duplicate detection, and data cleaning.

**Real production pattern:**

In production, you'd have multiple pipelines:

1.  **ValidationPipeline** (priority 100) - Check required fields, validate formats

2.  **CleaningPipeline** (priority 200) - Clean text, normalize data

3.  **DuplicatePipeline** (priority 300) - Filter duplicates

4.  **DatabasePipeline** (priority 400) - Save to database

5.  **ImagesPipeline** (priority 500) - Download and process images (built-in)

Each pipeline does one thing. Easy to test, easy to debug, easy to reorder.

Scrapy transforms web crawling from "writing networking code" to "writing extraction logic." You focus on what data to extract, Scrapy handles how to fetch it reliably at scale.

* * *

Crawling JavaScript-Heavy Websites with Python

----------------------------------------------

If your crawler keeps returning empty pages, you are probably not doing anything wrong. You are just fetching the wrong thing.

Modern sites often ship a tiny HTML shell and then render the real content in the browser with JavaScript. requests, BeautifulSoup, and vanilla Scrapy will happily download the shell. Then you parse it. And you get... nothing.

This section is about fixing that without turning your crawler into a fragile, slow headless-browser monster.

### The JavaScript Problem

The core problem is simple:

*   requests.get(url).text returns the initial HTML document.

*   The stuff you actually want (products, posts, quotes, etc.) gets loaded later via XHR/fetch and rendered by the browser.

**Quick reality check (30 seconds):**

1.  Open the page in Chrome.

2.  Right click -> View Page Source.

3.  Search for the thing you want (a product title, a quote, a price).

If it is not in "View Page Source" but it is visible in the normal page, you are looking at a JavaScript-rendered site.

Before you reach for a headless browser, try the cheap wins first:

*   **Look for an underlying JSON API.** Open DevTools -> Network -> Fetch/XHR, refresh, and watch what endpoints return the data. If the data is already in JSON, scraping the JSON is faster and more reliable than rendering HTML.

*   **Check for embedded state.** Next.js pages often have data inside \_\_NEXT\_DATA\_\_. Many apps ship a big JSON blob in a <script> tag.

*   **Sitemaps still work.** Even JS-heavy sites often expose URLs in sitemap.xml. Discovery can stay "static" while only a subset of pages gets rendered.
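
For the embedded-state case, a minimal sketch of pulling the \_\_NEXT\_DATA\_\_ blob out of already-downloaded HTML could look like this. It assumes the usual Next.js layout (a script tag with id="\_\_NEXT\_DATA\_\_" containing JSON); extract\_next\_data() and the sample HTML are made up for illustration.

```python
import json
import re

# Matches <script id="__NEXT_DATA__" ...>{...}</script> and captures the JSON body.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ JSON, or None if it is missing or invalid."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

# Hypothetical page source for demonstration.
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Hello"}}}'
    "</script></body></html>"
)
data = extract_next_data(sample_html)
print(data["props"]["pageProps"]["title"])
```

The structure under props differs per site, so inspect the blob once by hand before writing extraction code against it.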

If those options fail (or you need the fully rendered DOM), you need a browser renderer: [Selenium](https://www.selenium.dev/) or [Playwright](https://playwright.dev/).

### Using Selenium for JavaScript Rendering

Selenium drives a real browser (Chrome, Firefox, etc.) and gives you the rendered DOM. That is the whole point.

The catch is that browsers are heavier than HTTP requests. So the workflow is usually kept simple:

1.  Open a page.

2.  Wait for a selector that proves content rendered.

3.  Extract HTML (or the specific fields) and pass it back to your Python parser.

The full Selenium example is kept as a public gist, so it can be copied into any project:

[https://gist.github.com/n10ty/988fe84ee2bb0722e2e14303ba36d3b7](https://gist.github.com/n10ty/988fe84ee2bb0722e2e14303ba36d3b7)

Here is the core Selenium flow in a few lines (open -> wait -> grab rendered HTML):

    from selenium import webdriver

    from selenium.webdriver.common.by import By

    from selenium.webdriver.support.ui import WebDriverWait

    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()

    try:

        driver.get("https://quotes.toscrape.com/js/")

        WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".quote")))

        html = driver.page_source

    finally:

        driver.quit()  # always close the browser, even if the wait times out

What I would do in real crawls:

*   **Render only when you must.** Rendering every page is slow and expensive.

*   **Wait for a specific element.** Waiting for "page load" is not enough on many sites.

*   **Set timeouts** and treat rendering as flaky (retries, backoff).

*   **Disable images/fonts** to speed up loads.
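
Treating rendering as flaky usually means wrapping it in a retry loop. Here is one possible shape for that, independent of Selenium or Playwright: render\_with\_retries() is a hypothetical helper that takes any render(url) callable and retries it with exponential backoff. The sleep parameter exists only so the delay can be stubbed out in tests.

```python
import time
from typing import Callable

def render_with_retries(
    render: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call render(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return render(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, then 4x, ...
    raise ValueError("attempts must be at least 1")
```

Any function that takes a URL and returns rendered HTML can be passed as render, for example a small wrapper around the Selenium flow shown above.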

### Playwright as a Selenium Alternative

Playwright does the same job (browser automation), but it is usually more predictable than Selenium for crawling work. It also has a first-class Python library, so the whole pipeline can stay in Python.

The full working script is kept as a public gist:

[Playwright renderer (Python) gist](https://gist.github.com/n10ty/cd545791de009a033f944e568e3eb8be)

Here is the core Playwright flow in a few lines (open -> wait -> grab rendered HTML):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:

        browser = p.chromium.launch(headless=True)

        page = browser.new_page()

        page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

        page.wait_for_selector(".quote")

        html = page.content()

        browser.close()

**When to use which:**

*   If you want the most "it just works" Python browser renderer, Playwright is usually the cleanest path.

*   If you already have Selenium infrastructure (grid, browser profiles, existing scripts), stick with Selenium.

*   If you only need data that already exists in JSON, skip browsers entirely and hit the JSON endpoint.

* * *

Crawling All Links on a Website (Full Site Crawl)

-------------------------------------------------

There are two ways to discover pages on a domain:

1.  **Ask the site for a list** (sitemaps).

2.  **Walk the site like a user** (follow links).

In real crawls, you will usually combine both.

*   Sitemaps give you coverage fast, including orphan pages that nothing links to.

*   Link-following finds parameterized URLs and pages that never made it into the sitemap.

### Method 1: Start with sitemap.xml

This is the highest-ROI move on many sites.

*   Discovery is fast.

*   You do not get trapped in infinite calendars.

*   You do not accidentally hammer the same nav pages 10,000 times.

The catch: sitemaps do not always exist, and they are not always complete.

Here is a minimal sitemap URL collector:

    import xml.etree.ElementTree as ET

    from urllib.parse import urljoin

    import requests

    def get_sitemap_urls(base_url: str) -> list[str]:

        resp = requests.get(urljoin(base_url, "/sitemap.xml"), timeout=20)

        resp.raise_for_status()  # a missing sitemap (404) should not be parsed as XML

        root = ET.fromstring(resp.content)

        return [loc.text.strip() for loc in root.findall(".//{*}loc") if loc.text]

If the sitemap index pattern is used (<sitemapindex> pointing to multiple sitemaps), it is the same approach: parse <loc> and recurse.
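
A sketch of that recursion, written iteratively with a stack, might look like the following. To keep it testable offline, the HTTP call is passed in as a fetch callable (an assumption of this example); collect\_sitemap\_urls() and max\_sitemaps are hypothetical names, and the {\*} namespace wildcard needs Python 3.8+.

```python
import xml.etree.ElementTree as ET
from typing import Callable

def collect_sitemap_urls(
    fetch: Callable[[str], str],
    sitemap_url: str,
    max_sitemaps: int = 50,
) -> list[str]:
    """Walk a sitemap or sitemap index and return every page URL found."""
    pages: list[str] = []
    pending = [sitemap_url]
    seen: set[str] = set()
    while pending and len(seen) < max_sitemaps:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        root = ET.fromstring(fetch(current))
        locs = [loc.text.strip() for loc in root.findall(".//{*}loc") if loc.text]
        if root.tag.endswith("sitemapindex"):
            pending.extend(locs)  # each <loc> points to another sitemap
        else:
            pages.extend(locs)    # each <loc> is a page URL
    return pages
```

With Requests, fetch could be as small as `lambda u: requests.get(u, timeout=20).text`. The max\_sitemaps cap is a guard against index files that reference each other.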

### Method 2: Breadth-first link crawling (BFS)

This is the classic crawler loop: queue -> fetch -> extract links -> enqueue.

What matters in practice is scope control. A full-site crawl can be derailed by:

*   infinite query strings (?page=1, ?page=2, ...)

*   faceted navigation (?color=red&size=m&brand=...)

*   calendars

*   internal search pages

*   duplicate pages (same content, different URLs)

If you want a crawl you can trust, add these controls:

*   **Domain allowlist** (stay on one domain)

*   **URL normalization** (remove fragments, normalize trailing slashes)

*   **Query strategy** (drop all query params, or allow a small allowlist)

*   **Depth/page limits** (hard stops)

*   **Content-type filters** (HTML only)

Here is a compact BFS crawler skeleton with most of those guardrails (a page limit stands in for depth limits):

    from collections import deque
    from urllib.parse import urljoin, urlparse, urldefrag

    import requests
    from bs4 import BeautifulSoup

    def normalize(url: str) -> str:
        # Remove the fragment and any trailing slash.
        url, _frag = urldefrag(url)
        return url.rstrip("/")

    def crawl_site(seed_url: str, max_pages: int = 200) -> list[str]:
        domain = urlparse(seed_url).netloc
        seen = set()
        queue = deque([seed_url])
        while queue and len(seen) < max_pages:
            url = normalize(queue.popleft())
            # Domain allowlist and dedupe.
            if url in seen or urlparse(url).netloc != domain:
                continue
            if urlparse(url).query:  # Drop query params by default.
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=20)
            except requests.RequestException:
                continue  # Skip pages that fail to load.
            # Content-type filter: only parse HTML.
            if "text/html" not in resp.headers.get("Content-Type", ""):
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            queue.extend(normalize(urljoin(url, a["href"])) for a in soup.select("a[href]"))
        return list(seen)
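The skeleton drops every URL that has a query string. The other query strategy from the list, a small allowlist, could look roughly like this. `normalize_with_allowlist` and `ALLOWED_PARAMS` are hypothetical names, and the choice of `page` as the only allowed parameter is an assumption you would adjust per site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ALLOWED_PARAMS = frozenset({"page"})  # hypothetical allowlist; adjust per site


def normalize_with_allowlist(url: str, allowed: frozenset = ALLOWED_PARAMS) -> str:
    """Drop the fragment and every query parameter not in the allowlist."""
    parts = urlsplit(url)
    # Sort the kept parameters so ?a=1&b=2 and ?b=2&a=1 normalize identically.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in allowed)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(kept), ""))
```

With this in place, faceted URLs such as `?color=red&size=m` collapse to the bare path, while paginated URLs stay distinct.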

If you copy only one thing from this section, make it this idea:

**Separate discovery from fetching.** Discover URLs first (sitemaps + BFS), then fetch and extract with the right tool (Requests/Scrapy/Playwright), depending on what each URL needs.

* * *

Best Practices for Python Web Crawlers

--------------------------------------

The crawler that works on a demo site is not the crawler that survives a real site.

This is where the boring parts will save you.

### Respecting robots.txt and Rate Limiting

robots.txt is not a law. It is a policy file.

If a site says "do not crawl /private", it should not be crawled.

At a minimum, check:

*   whether the URL is allowed for your crawler user agent

*   whether a crawl delay is specified

Python has a standard library parser:

    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    def robots_allows(url: str, user_agent: str = "*") -> bool:
        rp = RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        rp.read()
        return rp.can_fetch(user_agent, url)
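The second check, the crawl delay, is available from the same parser through `crawl_delay()`. A minimal sketch, assuming a fallback of one second when the site does not specify a delay (the helper name `crawl_delay_seconds` and the default are illustrative choices, not a standard):

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_seconds(rp: RobotFileParser, user_agent: str = "*", default: float = 1.0) -> float:
    """Return the Crawl-delay for this user agent, or a default if none is set."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Parse an in-memory robots.txt so the example runs without network access.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private",
])
print(crawl_delay_seconds(rp))
print(rp.can_fetch("*", "https://example.com/private/page"))
```

In a real crawler you would build the parser once per host and reuse it, rather than fetching robots.txt for every URL.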

Treat rate limiting as part of correctness.

*   A crawler that gets blocked at page 50 is not "fast".

*   It is just wrong.

The simple rule: **go slower than you think.** Then speed up with concurrency only after error rates and blocks are under control.
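One way to make "go slower" concrete is a per-host throttle that enforces a minimum gap between requests. This is a sketch under assumptions: `HostThrottle` is a hypothetical helper, the one-second default is arbitrary, and the clock and sleep functions are injectable only so the behavior can be tested without waiting.

```python
import time
from typing import Callable
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(
        self,
        min_delay: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the host may be hit again; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        pause = 0.0 if last is None else max(0.0, self.min_delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + pause
        return pause
```

Call `throttle.wait(url)` right before each fetch. If robots.txt specifies a crawl delay, that value is a natural choice for `min_delay`.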

### Handling Errors and Retries

Failures will happen. You will see:

*   timeouts

*   temporary 5xx responses

*   429 rate limits

*   random connection resets

So add retries, and make them polite.

    import random
    import time

    import requests

    def get_with_backoff(session: requests.Session, url: str, tries: int = 5) -> requests.Response:
        resp = None
        for attempt in range(tries):
            try:
                resp = session.get(url, timeout=20)
                if resp.status_code != 429 and resp.status_code < 500:
                    break  # Success, or a client error (e.g. 404) that a retry will not fix.
            except requests.RequestException:
                resp = None  # Timeout, connection reset, etc.
            if attempt < tries - 1:
                # Exponential backoff capped at 30s, plus jitter. No sleep after the last try.
                time.sleep(min(30, 2 ** attempt) + random.random())
        if resp is None:
            raise RuntimeError("request failed")
        resp.raise_for_status()
        return resp

In production, log every failure with:

*   URL

*   status code

*   exception type

*   retry count

*   timestamp

If you cannot answer "how many URLs failed and why", you do not have a crawler yet.
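A minimal sketch of such a failure log, written as JSON lines so it can be counted later. `FetchFailure`, `record_failure`, and `summarize` are hypothetical names, and where the lines are stored (file, queue, database) is left open.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass
class FetchFailure:
    url: str
    status_code: Optional[int]      # None when no response was received
    exception_type: Optional[str]   # None for plain HTTP errors
    retry_count: int
    timestamp: str                  # ISO 8601, UTC


def record_failure(url: str, retry_count: int, status_code: Optional[int] = None,
                   exc: Optional[BaseException] = None) -> str:
    """Build one JSON line describing a failed fetch."""
    failure = FetchFailure(
        url=url,
        status_code=status_code,
        exception_type=type(exc).__name__ if exc is not None else None,
        retry_count=retry_count,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(failure))


def summarize(lines: Iterable[str]) -> Counter:
    """Answer 'how many URLs failed and why' from the logged lines."""
    counts: Counter = Counter()
    for line in lines:
        rec = json.loads(line)
        counts[rec["exception_type"] or f"HTTP {rec['status_code']}"] += 1
    return counts
```

Because each line is self-contained JSON, the same log works for a quick `summarize` call and for loading into whatever analysis tool you already use.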

### Using User Agents and Headers

The default Python user agent is a red flag for some sites.

This does not mean you should pretend to be Chrome 124 with 40 headers.

It means:

*   a realistic User-Agent

*   basic Accept headers

*   consistent behavior (timeouts, redirects)

    import requests

    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; WebCrawlerAPI/1.0)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    })

If sessions/cookies are required (login flows), things get harder fast. At that point, browser automation or an API-based approach is usually the better choice.

### Avoiding Blocks and CAPTCHA

This is the part people try to skip.

Blocking happens when:

*   too many requests are sent from one IP

*   patterns look too bot-like (same path cadence, no cookies, no JS)

*   the target has aggressive bot protection

The early warning signs:

*   403 spikes

*   429 spikes

*   HTML that suddenly becomes a challenge page

*   response sizes that drop to a tiny constant

What helps before anything fancy:

1.  **Slow down.**

2.  **Respect robots.txt.**

3.  **Cache responses while developing parsers.**

4.  **Stop crawling when blocks start.** (Backoff, rotate targets, retry later.)

Proxies can help, but they will not fix a broken crawler design.
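The early warning signs can be turned into a simple stop condition. The sketch below is a heuristic, not a reliable bot-protection detector: `BlockDetector`, the window size, the 30% threshold, and the 2 KB "tiny page" cutoff are all assumptions to tune per target.

```python
from collections import deque


class BlockDetector:
    """Flag a likely block from recent status codes and response sizes."""

    def __init__(self, window: int = 20, max_block_ratio: float = 0.3, tiny_bytes: int = 2048) -> None:
        self.window = window
        self.max_block_ratio = max_block_ratio
        self.tiny_bytes = tiny_bytes
        self._recent = deque(maxlen=window)  # (status_code, body_size) pairs

    def observe(self, status_code: int, body_size: int) -> None:
        self._recent.append((status_code, body_size))

    def looks_blocked(self) -> bool:
        if len(self._recent) < self.window:
            return False  # not enough data yet
        # Sign 1: a spike of 403/429 responses in the recent window.
        blocked = sum(1 for status, _ in self._recent if status in (403, 429))
        if blocked / self.window > self.max_block_ratio:
            return True
        # Sign 2: every response is a 200 with the same tiny size,
        # which often means the same challenge page is served each time.
        ok_sizes = [size for status, size in self._recent if status == 200]
        return (
            len(ok_sizes) == self.window
            and len(set(ok_sizes)) == 1
            and ok_sizes[0] < self.tiny_bytes
        )
```

Call `observe()` after every response and pause the crawl when `looks_blocked()` returns true, then retry later at a lower rate.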

* * *

Scaling Your Python Crawler (When DIY Gets Hard)

------------------------------------------------

DIY crawlers break in predictable ways:

*   A laptop will not like thousands of browser sessions.

*   IP blocks will show up as soon as the crawl is big enough.

*   Retrying and scheduling will become the real work.

At this point, three paths are usually taken:

1.  **Stay DIY, but invest in infrastructure.** Queues, storage, distributed workers, observability.

2.  **Move more of the work to Scrapy.** Concurrency and retry logic will be managed for you.

3.  **Use a crawling API** when rendering, proxies, retries, and job scheduling are not where you want to spend your time.

This is where a service like WebCrawlerAPI can make sense.

The tradeoff is simple:

*   You spend money.

*   You save engineering time.

If your crawl is a one-off for 50 pages, it will not be worth it. If you are [running jobs daily](/blog/convert-any-website-to-rss-feed) across thousands of URLs, it often will be.

* * *

Common Use Cases for Python Web Crawlers

----------------------------------------

The crawler is just a tool. The value comes from what is built on top.

### Price monitoring

Crawl product pages on a schedule, extract the price fields, and store the diffs.

Real-life caveats:

*   prices may be personalized

*   currencies change by region

*   stock may be hidden behind JS

### Lead generation (contact discovery)

This usually means crawling:

*   team pages

*   directory pages

*   "contact" pages

Then extracting emails, phone numbers, or forms.

Be careful here. Legal rules differ by country, and the site's terms of service may restrict this kind of collection.

### SEO audits and competitor research

This is a classic crawl job:

*   find all internal URLs

*   check status codes (404/500)

*   identify redirect chains

*   detect thin pages (low word count)

*   map internal links and depth

### Content aggregation

Blogs, docs, and knowledge bases are crawled to:

*   build internal search

*   create datasets

*   keep local mirrors

If the goal is just “tell me when this page changes”, a feed is often simpler than a full crawl pipeline. See: [convert any website to an RSS feed](/blog/convert-any-website-to-rss-feed).

### Market research

This is the messy one.

You will be dealing with:

*   inconsistent HTML

*   JS-heavy listing pages

*   rate limits

*   pagination patterns that change without warning

This is where crawlers become products.

* * *

Troubleshooting Common Crawling Problems

----------------------------------------

### Problem: "My crawler returns empty content"

Likely cause: JavaScript rendering.

Fix:

*   check View Page Source

*   find the JSON endpoint in DevTools

*   use Selenium/Playwright only where needed

### Problem: "I keep getting blocked (403/429)"

Likely cause: rate limiting or bot protection.

Fix:

*   slow down

*   add backoff and retries

*   reduce concurrency

*   respect robots.txt

*   stop the crawl when blocks spike and retry later

### Problem: "My crawl never ends"

Likely cause: infinite URL space.

Fix:

*   drop query params by default

*   add max\_pages and/or depth limits

*   add allow/deny patterns

### Problem: "My output has duplicates"

Likely cause: URL variants and redirects.

Fix:

*   normalize URLs (remove fragments, consistent trailing slashes)

*   store and dedupe final URLs after redirects

*   consider canonical URLs if provided
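A rough sketch of the last two fixes, using only the standard library. `dedupe_key` is a hypothetical helper: it assumes you pass in the final URL after redirects (with Requests, that is `resp.url`) and the page HTML, and it prefers a `<link rel="canonical">` when one is present.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.href is None:
            self.href = attr.get("href")


def dedupe_key(final_url: str, html: str) -> str:
    """Prefer the page's canonical URL; fall back to the post-redirect URL."""
    finder = _CanonicalFinder()
    finder.feed(html)
    if finder.href:
        return urljoin(final_url, finder.href)  # resolve relative canonicals
    return final_url
```

Store results keyed by `dedupe_key(...)` instead of the requested URL, and variants that redirect or canonicalize to the same page collapse into one entry.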

### Problem: "It is too slow"

Fix:

*   cache responses during development

*   use a requests.Session()

*   add controlled concurrency (Scrapy will do this well)

*   avoid rendering unless needed

* * *

Frequently Asked Questions

--------------------------

### Is web crawling legal?

It depends.

Public pages can be crawled, but terms of service, robots.txt, and local laws can apply. This is not legal advice.

If you are crawling anything sensitive (accounts, paywalls, personal data), talk to a lawyer.

### What is the difference between crawling and scraping?

Crawling discovers pages. Scraping extracts data from those pages.

Most projects will do both.

### How fast should a crawler run?

As slow as needed to avoid being blocked and to avoid hurting the target site.

If you cannot get stable results at 1 request per second, going faster will not help.

### Should robots.txt be respected?

Yes.

If the goal is a reliable crawl, you do not want to fight the target site.

### What is the best Python library for crawling?

*   For small scripts: requests + BeautifulSoup.

*   For real crawling: Scrapy.

*   For JS-heavy pages: Playwright or Selenium.

*   For when you don't have time to build your own crawler and just need the data: a crawling API.

### How should pagination be handled?

If there is a clear "next" link, follow it.

If pagination is query-based (?page=2), add an allowlist and a hard cap.
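For the query-based case, the allowlist and hard cap could be sketched like this. The parameter name `page`, the cap of 50, and the helper name `allow_paginated_url` are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

MAX_PAGE = 50  # hypothetical hard cap


def allow_paginated_url(url: str, max_page: int = MAX_PAGE) -> bool:
    """Allow URLs with no query, or only a numeric ?page=N within the cap."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if not query:
        return True
    if set(query) != {"page"}:
        return False  # any parameter other than "page" is rejected
    value = query["page"][0]
    return value.isdecimal() and 1 <= int(value) <= max_page
```

The cap matters even on well-behaved sites: some paginated listings keep returning pages long after the real content has run out.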

### How should duplicates be handled?

Normalize URLs. Store the final URL after redirects. Prefer canonical URLs when available.

Crawl data from the website with an API in Python.

--------------------------------------------------

Using an API is the shortcut when crawling stops being "just a script" and starts turning into infrastructure. Browser rendering, proxies, retries, and fingerprinting are already handled on the other side, so you do not spend time rebuilding them. The crawl becomes more predictable too: one request starts a job, results come back in a consistent format, and failures are handled with retries instead of manual babysitting.

### When should a crawling API be used?

When the crawl becomes infrastructure:

*   proxies

*   browser rendering

*   job scheduling

*   retries at scale

If those are not your focus, using an API can be the right move.

### Start a crawling job in Python

Assuming you have your access key, here is the code with the basic parameters to crawl any site with Python:

    #!/usr/bin/env python3
    from webcrawlerapi import WebCrawlerAPI

    API_KEY = "Your API KEY from https://dash.webcrawlerapi.com/access"

    crawler = WebCrawlerAPI(api_key=API_KEY)

    job = crawler.crawl(
        url="https://books.toscrape.com",
        scrape_type="markdown",
        items_limit=10,
    )

    print(job.status)
    print(len(job.job_items))

Conclusion: Start Crawling Websites with Python Today

-----------------------------------------------------

If you only need a few pages, the copy-paste Requests crawler will be enough.

If you need hundreds, Scrapy will save you time.

If the site is JavaScript-heavy, use a browser renderer for the pages that need it, not for everything.

And if the crawl becomes a recurring job with retries, proxies, and rendering, that is when a managed crawler like [WebCrawlerAPI](https://webcrawlerapi.com) will start to look reasonable.

----
url: https://webcrawlerapi.com/blog/best-webcrawler-api-in-2025
----

API · Comparison · 10 min read

Best Web Crawler API in 2025

============================

Top Web Crawler APIs in 2025. Most popular web scraping tools for AI, e-commerce, and SEO.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [1\. WebCrawlerAPI](#1-webcrawlerapi)

*   [2\. ScrapingBee](#2-scrapingbee)

*   [3\. ScraperAPI](#3-scraperapi)

*   [4\. WebScrapingAPI](#4-webscrapingapi)

*   [Key Features](#key-features)

*   [Drawbacks](#drawbacks)

*   [Comparison of Pros and Cons](#comparison-of-pros-and-cons)

*   [Conclusion](#conclusion)

Check the [top-5 website crawlers for AI and RAG](https://webcrawlerapi.com/blog/top-5-best-firecrawl-alternatives) article if you want to train your model on website data.

Web crawling APIs in 2025 are essential for extracting data from websites efficiently. Whether you're working on AI model training, e-commerce, or SEO, choosing the right tool can save time and resources. Here's a quick comparison of four top web crawling APIs:

*   **[WebCrawlerAPI](https://webcrawlerapi.com/)**: Handles large-scale data extraction with advanced parsing tools and pay-as-you-go pricing.

*   **[ScrapingBee](https://www.scrapingbee.com/)**: Ideal for dynamic websites, offering JavaScript rendering, proxy rotation, and geotargeting.

*   **[ScraperAPI](https://www.scraperapi.com/)**: Scraper with anti-bot features, geotargeting, and JavaScript rendering for e-commerce scraping.

*   **[WebScrapingAPI](https://www.webscrapingapi.com/)**: Focused on privacy compliance and enterprise-grade data solutions with custom formatting options.

**Quick Comparison Table:**

| Tool | Key Features | Best For | Starting Price |
| --- | --- | --- | --- |
| WebCrawlerAPI | Scalable, advanced parsing, pay-as-you-go | Small to big projects | $0/month |
| ScrapingBee | JavaScript rendering, proxy rotation | Big and Enterprise projects | $49/month |
| ScraperAPI | Anti-bot, geotargeting, e-commerce scraping | Big and Enterprise projects | $49/month |
| WebScrapingAPI | Privacy compliance, custom solutions | Regulated industries, Enterprise | $499/month |

Each tool offers distinct advantages depending on your needs, from handling dynamic content to privacy-focused enterprise solutions. Read on to find the best fit for your project.

1\. [WebCrawlerAPI](https://webcrawlerapi.com/)

-----------------------------------------------

WebCrawlerAPI simplifies [web crawling](https://webcrawlerapi.com/scrapers/webcrawler/html) and data extraction with a clear, usage-based pricing model - no hidden fees or subscriptions. Its distributed system processes millions of pages effortlessly, while advanced parsing tools transform HTML into clean text or Markdown. This makes it an excellent choice for AI and machine learning projects that require well-structured data [\[1\]](https://spider.cloud/guides/spider-api).

Integrating WebCrawlerAPI is straightforward, with support for multiple programming languages like JavaScript, Python, PHP, and .NET [\[3\]](https://webcrawlerapi.com). Its combination of ease of use and powerful features ensures it can handle large-scale data extraction tasks.

| Feature | Description |
| --- | --- |
| Distributed System | Handles millions of pages without issues |
| Flexible Output | Converts data into HTML, Text, or Markdown |
| Anti-Bot Measures | Automatically bypasses CAPTCHAs and IP blocks |
| Easy Integration | Works seamlessly with popular coding languages |
| Pay-As-You-Go Pricing | No subscriptions, only pay for what you use |

Thanks to its focus on clean data extraction and flexible output formats, WebCrawlerAPI is a go-to tool for machine learning and AI applications. When compared to competitors like ScraperAPI, its straightforward pricing model and advanced parsing tools make it a practical and budget-friendly option.

Though WebCrawlerAPI excels in scalability and efficiency, other tools may offer alternative features worth considering.

2\. [ScrapingBee](https://www.scrapingbee.com/)

-----------------------------------------------

ScrapingBee is a web scraping tool built to handle dynamic websites. It offers features like JavaScript rendering, premium proxies, and anti-bot defenses. Pricing starts at $49/month for smaller projects and goes up to $599/month for enterprise-level needs. The tool uses a credit-based system, allowing users to customize features such as stealth mode and CAPTCHA bypass to suit their specific requirements [\[4\]](https://www.scraperapi.com/blog/scrapingbee-alternatives-for-automated-web-scraping/).

This credit-based approach lets users adjust their usage based on project demands. Features like JavaScript rendering and premium proxies consume credits differently, so efficient planning can help keep costs under control [\[5\]](https://blog.apify.com/scrapingbee-review/).

Here's what ScrapingBee brings to the table:

*   **Proxy rotation** with built-in CAPTCHA bypass

*   **Geotargeting** for location-specific scraping

*   **Custom headers** for tailored requests

*   **Browser fingerprint rotation** for better anonymity

*   **Concurrent request handling** to manage multiple tasks at once

Non-technical users can benefit from ScrapingBee's no-code integration with Make, making it easier to extract data without writing code [\[4\]](https://www.scraperapi.com/blog/scrapingbee-alternatives-for-automated-web-scraping/). Its strong proxy network and ability to handle complex, dynamic websites set it apart.

For businesses looking to scale, higher-tier plans provide dedicated support for custom solutions [\[6\]](https://www.trustradius.com/products/scrapingbee/pricing). However, using advanced features can increase API credit usage, so careful planning is essential for large-scale operations.

While ScrapingBee is excellent for dynamic websites, tools like ScraperAPI might be better suited for other specific tasks.

3\. [ScraperAPI](https://www.scraperapi.com/)

---------------------------------------------

ScraperAPI delivers a simple web scraping solution, charging per successful request to help keep costs predictable.

Here’s what it brings to the table for handling complex scraping tasks:

*   **JavaScript rendering** to manage dynamic content

*   **Geotargeting** to access location-specific data

*   **Advanced anti-bot bypassing** techniques

*   **Automatic proxy rotation** for seamless operation

For e-commerce platforms like Amazon, ScraperAPI uses a 5-credit-per-request model, offering clear pricing for retail data projects [\[8\]](https://www.scraperapi.com/pricing/).

Pricing starts at $49/month for Hobby(!) projects, with custom options available for high-volume users. The service is rated 4.3/5 on G2 and 4.6/5 on [Capterra](https://www.capterra.com/) [\[8\]](https://www.scraperapi.com/pricing/). Developers will appreciate its detailed documentation, which is designed to be accessible for all skill levels [\[7\]](https://www.scraperapi.com/web-scraping/best-web-scraping-apis/).

For Enterprise users, premium support includes perks like a dedicated account manager, Slack-based assistance, and tailored solutions.

**Drawbacks to consider:**

*   A smaller proxy pool compared to some competitors

*   Fewer advanced features compared to other platforms [\[7\]](https://www.scraperapi.com/web-scraping/best-web-scraping-apis/)

ScraperAPI offers a 7-day free trial with 5,000 API credits [\[8\]](https://www.scraperapi.com/pricing/).

That said, while ScraperAPI’s affordability and simplicity are appealing, tools like [WebScrapingAPI](https://www.webscrapingapi.com/) might offer features better suited to specific needs.

4\. [WebScrapingAPI](https://www.webscrapingapi.com/)

-----------------------------------------------------

If you're looking for a tool with enterprise-level features and a strong focus on privacy, **WebScrapingAPI** stands out as a solid option. It's designed for advanced data extraction while ensuring compliance with privacy regulations [\[9\]](https://www.webscrapingapi.com/pricing/web-scraping-services).

The platform offers two pricing plans to suit different business needs:

| Feature | Standard Plan ($449/mo) | Custom Plan ($999/mo) |
| --- | --- | --- |
| Data Structure | Unified | Custom |
| Delivery Format | JSON ([Amazon S3](https://aws.amazon.com/s3/)) | Multiple formats (JSON, CSV) |
| SLA | Standard | Enterprise-grade |
| Compliance & Support | Privacy regulation adherence, personalized assistance | Privacy regulation adherence, personalized assistance |

WebScrapingAPI focuses on delivering high-quality data with minimal effort. The platform automates the entire process - from extraction to cleaning and delivery - in formats like JSON or CSV [\[9\]](https://www.webscrapingapi.com/pricing/web-scraping-services). This automation saves time and eliminates the hassle of manual data preparation.

For businesses handling sensitive information or operating in regulated sectors, WebScrapingAPI ensures compliance with privacy laws throughout the data extraction process [\[9\]](https://www.webscrapingapi.com/pricing/web-scraping-services)[\[10\]](https://www.scraperapi.com/blog/web-scraping-pricing-and-choosing-the-right-solution/).

### Key Features

*   Consistent outputs with a unified data structure

*   Adjustable crawl intervals for flexibility

*   Seamless integration with Amazon S3

*   Enterprise-grade SLA for high-volume users

*   Adherence to privacy regulations

### Drawbacks

*   Higher starting price compared to some competitors

*   Limited geo-targeting options

*   No mention of premium proxy support [\[10\]](https://www.scraperapi.com/blog/web-scraping-pricing-and-choosing-the-right-solution/)

WebScrapingAPI is a great fit for enterprises needing reliable, scalable data solutions with custom formatting options [\[9\]](https://www.webscrapingapi.com/pricing/web-scraping-services).

For those seeking a no-code, versatile alternative, **Scrapestorm** might be worth exploring.

Comparison of Pros and Cons

---------------------------

Here's a breakdown of how these web scraping tools compare, focusing on the features that matter most to users in 2025.

| Tool | Key Strengths | Limitations | Best For | Starting Price |
| --- | --- | --- | --- | --- |
| WebCrawlerAPI | • Enterprise-level automation and scalability • Various output formats • Extra scrapers | • Lower customization, perfect for simple use cases | Small-medium projects | $0/month |
| ScrapingBee | • Easy to use with multi-language support • JavaScript rendering | • Lacks custom JavaScript functions | Big and Enterprise projects | $49/month |
| ScraperAPI | • Handles CAPTCHA • Offers multi-country proxies | • Limited customization • Basic JavaScript capabilities | Big and Enterprise projects | $49/month |
| WebScrapingAPI | • GDPR/CCPA compliance • Multiple output formats • Enterprise support | • Higher starting cost • Limited free tier | Regulated industries with strict compliance needs | $499/month |

When choosing a tool, think about your specific needs. For example, **WebCrawlerAPI** is ideal for small and medium-sized users, offering a solid infrastructure capable of handling large-scale operations. Its advanced features cater to businesses that need reliable, scalable data extraction [\[9\]](https://www.webscrapingapi.com/pricing/web-scraping-services).

For smaller projects, **ScrapingBee** and **ScraperAPI** are excellent starting points. ScrapingBee, with its beginner-friendly interface, is perfect for teams just entering the web scraping space [\[15\]](https://iamrizwan.me/web-scraping-api/). Meanwhile, WebScrapingAPI's emphasis on legal compliance makes it a strong contender for industries with strict regulatory requirements [\[9\]](https://www.webscrapingapi.com/pricing/web-scraping-services).

Here’s a quick guide based on use cases:

*   **Enterprise-scale operations**: WebCrawlerAPI is the go-to option with its powerful feature set.

*   **AI/ML data extraction**: ScraperAPI offers tailored tools for machine learning projects.

*   **Compliance-focused tasks**: WebScrapingAPI stands out for its adherence to legal standards.

Conclusion

----------

After evaluating the top web crawling tools, **WebCrawlerAPI stands out as the leading option for 2025**. Its infrastructure is designed to handle large-scale operations without hiccups [\[16\]](https://www.geeksforgeeks.org/design-web-crawler-system-design/). Starting at just $20/month for 10,000 pages, its pay-as-you-go pricing model keeps costs manageable while adapting to business growth [\[2\]](https://webcrawlerapi.com/#pricing).

With a strong emphasis on delivering clean, structured data, WebCrawlerAPI is particularly appealing to AI developers and researchers [\[14\]](https://webcrawlerapi.com/blog/top-5-best-firecrawl-alternatives). Its ability to manage dynamic content, navigate anti-bot systems, and efficiently render JavaScript sets it apart in terms of technical capabilities.

The platform's reliable infrastructure and dedicated support ensure smooth performance, even during high-demand periods [\[16\]](https://webcrawlerapi.com/docs/getting-started). While it may require some technical know-how, its advanced features and budget-friendly pricing make WebCrawlerAPI the top choice for crawling websites with an API in 2025.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-can-devtools-windows-be-treated-as-a-page
----

DevTools windows can be treated as regular pages by enabling the handleDevToolsAsPage option when launching or connecting. This option makes DevTools targets appear as normal pages in the browser context.

Code:

    import puppeteer from "puppeteer";

    const browser = await puppeteer.launch({ handleDevToolsAsPage: true });

To also apply when connecting to an existing browser:

    const browser = await puppeteer.connect({ browserWSEndpoint, handleDevToolsAsPage: true });

----
url: https://webcrawlerapi.com/glossary/scraping/is-web-scraping-legal
----

### Answer

Web scraping legality depends on the site terms, the data collected, and local laws. Public data may be allowed, but many sites restrict automated access in their terms of service. Scraping personal data can trigger privacy rules and compliance obligations. Bypassing paywalls or authentication can be illegal or a breach of contract. When the stakes are high, get permission or legal review.

Read more about this in our [Web Scraping Ethics: What is legal and what is not?](/blog/web-scraping-ethics) blog post.

----
url: https://webcrawlerapi.com/glossary/webcrawling/what-data-does-a-web-crawler-collect
----

### Answer

Common crawler data includes URLs, status codes, headers, page content, metadata, links, and timestamps. Many systems also store canonical URLs, redirect chains, and content hashes for deduplication. If rendering is needed, crawlers can capture the final DOM or even screenshots. Some pipelines also attach extracted fields or structured data for downstream use. The exact fields depend on the purpose of the crawl. Keeping a consistent schema makes analysis and monitoring easier.
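As an illustration of a consistent schema, the fields listed above could be captured in one record type. `CrawlRecord` and its field names are hypothetical, chosen here for the example rather than taken from any particular crawler.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CrawlRecord:
    url: str
    status_code: int
    headers: dict[str, str]
    content: str
    links: list[str] = field(default_factory=list)
    canonical_url: Optional[str] = None
    redirect_chain: list[str] = field(default_factory=list)
    fetched_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def content_hash(self) -> str:
        """SHA-256 of the page content, usable as a deduplication key."""
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()
```

Two records with different URLs but the same `content_hash` are candidates for deduplication.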

----
url: https://webcrawlerapi.com/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts
----

Comparison · Markdown · CSV · RAG

Markdown vs CSV: Choosing the Right Format for LLM Prompts

==========================================================

Markdown vs CSV for scraped data and prompt inputs: when tables help, when they break, and what works best for RAG and pipelines.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What Markdown is good at](#what-markdown-is-good-at)

*   [What CSV is good at](#what-csv-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When Markdown should be used](#when-markdown-should-be-used)

*   [When CSV should be used](#when-csv-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [Markdown tables are not a contract](#markdown-tables-are-not-a-contract)

*   [CSV breaks on "real world" text](#csv-breaks-on-real-world-text)

*   [Node.js snippet: Convert a small CSV into JSON records](#nodejs-snippet-convert-a-small-csv-into-json-records)

*   [Conclusion](#conclusion)

Markdown is used for readable documents. CSV is used for rows and columns. Confusion is usually created when a Markdown table is expected to behave like a CSV file.

A full format overview is provided in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | Markdown | CSV |
| --- | --- | --- |
| Best for | Narrative text with light structure | Flat tabular data |
| Parsing reliability | Medium | High (when quoting is correct) |
| Human readability | High | Medium |
| Nested data | Awkward | Not supported |
| Common failure | Tables drift in formatting | Commas, quotes, newlines in fields |

What Markdown is good at

------------------------

Markdown is usually selected for:

*   Summaries, notes, extraction explanations

*   Long text that should remain readable

*   Mixed content: headings, bullets, code blocks

Markdown as an output format is compared in [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format).

What CSV is good at

-------------------

CSV is usually selected for:

*   One row per page (or per product, per listing)

*   Easy export to spreadsheets and BI tools

*   Simple ingestion into databases

If structured objects are needed, [CSV vs Plain Text](/blog/csv-vs-plain-text-choosing-the-right-format-for-llm-prompts) and [JSON vs CSV](/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts) are worth reading.

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When Markdown should be used

Markdown is usually preferred when:

*   The output is a report, not a dataset

*   Evidence and quotes should be preserved in a readable way

*   The model is expected to explain edge cases

### When CSV should be used

CSV is usually preferred when:

*   A flat dataset is being produced (price list, directory, catalog)

*   A predictable schema is needed (columns)

*   Rows will be deduped, filtered, or joined downstream

For RAG ingestion, CSV is usually not used as-is. The content is often converted into text chunks and metadata. If chunking is the main goal, [Markdown vs Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts) is usually more relevant.

Practical tradeoffs

-------------------

### Markdown tables are not a contract

Markdown tables are often reformatted by models. Column alignment, escaped pipes, and wrapped text can be changed. If the output must be parsed, CSV or JSON is usually safer.

### CSV breaks on "real world" text

CSV stays simple until commas, quotes, and newlines appear inside fields. That is common in scraped content (descriptions, addresses). Quoting rules must be enforced.
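
One way to enforce quoting when writing CSV is sketched below. It follows the common convention (as in RFC 4180) of wrapping a field in double quotes when it contains a comma, quote, or line break, and doubling any embedded quotes. The helper names `escapeCsvField` and `toCsvRow` are illustrative, not part of any library.

```javascript
// Quote a single value: wrap it in double quotes when it contains
// a comma, a double quote, CR, or LF, and double any embedded quotes.
function escapeCsvField(value) {
  const text = String(value ?? "");
  if (/[",\r\n]/.test(text)) {
    return `"${text.replace(/"/g, '""')}"`;
  }
  return text;
}

// Join one record into a single CSV line.
function toCsvRow(values) {
  return values.map(escapeCsvField).join(",");
}

console.log(toCsvRow(["Desk, oak", 'The "classic" model', "12 Main St\nSpringfield"]));
```

Fields that need no quoting are passed through unchanged, so simple data stays readable.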

Node.js snippet: Convert a small CSV into JSON records

------------------------------------------------------

A minimal CSV parser is shown. It is safe only for simple CSV without escaped quotes inside quoted fields. For production parsing, a dedicated CSV parser is usually used.

    // Node 18+

    // Minimal CSV to JSON for simple data (no escaped quotes support).

    import { readFile } from "node:fs/promises";

    const csv = (await readFile("data.csv", "utf8")).trimEnd();

    const lines = csv.split("\n");

    const headers = lines[0].split(",").map((s) => s.trim());

    const rows = [];

    for (const line of lines.slice(1)) {

      const cols = line.split(",").map((s) => s.trim());

      const obj = {};

      for (let i = 0; i < headers.length; i++) obj[headers[i]] = cols[i] ?? "";

      rows.push(obj);

    }

    console.log(JSON.stringify(rows.slice(0, 3), null, 2));
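
For comparison, a quote-aware variant is sketched here. It is still a small hand-written state machine rather than a replacement for a dedicated parser, but it shows what the minimal version above skips: quoted fields, doubled quotes, and line breaks inside fields. The function names are illustrative.

```javascript
// Parse CSV text into an array of rows (arrays of strings).
// Handles quoted fields, "" as an escaped quote, and newlines inside quotes.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          field += '"'; // doubled quote inside a quoted field
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch;
    }
  }
  // Flush the last record when the input does not end with a newline.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Use the first row as headers and map the rest to objects.
function toRecords(rows) {
  const [headers, ...data] = rows;
  return data.map((cols) =>
    Object.fromEntries(headers.map((h, i) => [h, cols[i] ?? ""]))
  );
}

const sample = 'name,note\n"Desk, oak","said ""hi"""\n';
console.log(toRecords(parseCsv(sample)));
```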

Conclusion

----------

*   Markdown is usually used for readable reports and explanations.

*   CSV is usually used for flat datasets with predictable columns.

*   If strict structure is required and nesting is needed, JSON is usually preferred over CSV.

If CSV is being considered mainly for readability, YAML can be evaluated too in [YAML vs CSV](/blog/yaml-vs-csv-choosing-the-right-format-for-llm-prompts).

----
url: https://webcrawlerapi.com/legal/Webcrawlerapi%20DPA.pdf
----

Data Processing Agreement (DPA)

This Data Processing Agreement ("Agreement") is entered into by and between:

Controller: Any customer of WebcrawlerAPI using the service to process data (the "Controller")

Processor: WebcrawlerAPI, operated by 103labs (the "Processor")

Together referred to as the "Parties."

1. Subject Matter

This Agreement governs the processing of personal data performed by the Processor on behalf of the Controller as required for using the WebcrawlerAPI service.

2. Duration

This Agreement remains in effect for as long as the Controller uses the services of the Processor.

3. Nature and Purpose of Processing

The Processor provides web crawling and data extraction services. The Controller may submit URLs or other inputs to be processed. The purpose of the processing is to retrieve and analyze publicly available web content as instructed by the Controller.

4. Type of Personal Data

The Controller determines the nature of the data processed. This may include IP addresses, metadata, or other personal data contained in public websites.

5. Obligations of the Processor

- Process personal data only on documented instructions from the Controller.
- Ensure persons authorized to process personal data are under confidentiality obligations.
- Implement appropriate technical and organizational measures to ensure data security.
- Assist the Controller in fulfilling obligations under GDPR, including data subject rights, security, and breach notifications.
- At the Controller's choice, delete or return personal data after the end of service provision.
- Make available all information necessary to demonstrate compliance and allow for audits.

6. Sub-Processors

The Controller authorizes the use of sub-processors listed at: https://webcrawlerapi.com/legal/subprocessors

The Processor will inform the Controller of any intended changes.

7. International Data Transfers

The Processor ensures that data transferred outside the EU/EEA is protected by Standard Contractual Clauses (SCCs) or equivalent safeguards in line with GDPR requirements.

8. Liability

Each Party shall be liable for damages resulting from violations of this Agreement or applicable data protection laws to the extent it is responsible.

9. Governing Law

This Agreement shall be governed by and interpreted in accordance with the laws of the Netherlands.

10. Termination

This Agreement terminates automatically when the Controller no longer uses the services of the Processor.

IN WITNESS WHEREOF, this Agreement is made available by the Processor as of the Effective Date of service use. For questions or a signed copy, contact: support@webcrawlerapi.com

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-timeouterror-action-timeout
----

TimeoutError usually means Playwright could not find an actionable element in time.

Use a more specific locator, wait for UI readiness, and set a realistic timeout.

    import { test, expect } from '@playwright/test';

    test('click when ready', async ({ page }) => {

      await page.goto('https://example.com/login');

      const submit = page.getByRole('button', { name: 'Sign in' });

      await expect(submit).toBeVisible({ timeout: 10000 });

      await submit.click({ timeout: 10000 });

    });

If this happens often, set defaults once:

    test.use({ actionTimeout: 10000, navigationTimeout: 15000 });

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-protocol-error-invalid-argument
----

A protocol invalid-argument error means a browser command received unsupported or malformed input.

Validate option shapes and browser-specific support before calling the API.

    // Example: permission must be valid and context must have a proper origin.

    await context.grantPermissions(['geolocation'], {

      origin: 'https://example.com',

    });

If it is browser-specific, reproduce in Chromium/WebKit/Firefox separately and gate behavior by project.

----
url: https://webcrawlerapi.com/glossary/webcrawling/is-web-crawling-legal
----

### Answer

Web crawling legality depends on the website, the data you collect, and the laws in your jurisdiction. Many sites allow crawling of public pages but restrict use through terms of service. You should respect robots.txt and avoid bypassing access controls. Collecting personal or copyrighted data can introduce privacy and IP risks. If you plan to resell or publish results, the requirements are usually stricter. When in doubt, get permission or legal guidance.

Read more about this in our [Web Scraping Ethics: What is legal and what is not?](/blog/web-scraping-ethics) blog post.

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-does-httprequestpostdata-deprecated-in-puppeteer
----

The deprecation note indicates that the HTTPRequest.postData API is deprecated in Puppeteer. This means you should avoid using it in new code and plan to migrate away from it, as it may be removed in a future Puppeteer release. The issue updates the docs to reflect this deprecation and does not remove the API immediately.

If you need to inspect request payloads, rely on other aspects of the Request data (such as URL and headers) and handle payloads via alternative flows. For example:

    page.on('request', req => {

      console.log(req.url(), req.method(), req.headers());

      // do not call req.postData()

    });

This keeps your code future-proof while you adapt to the deprecation.

----
url: https://webcrawlerapi.com/glossary/webcrawling/how-often-should-you-crawl-a-site
----

### Answer

Match crawl frequency to how often content changes and how quickly you need updates. High‑change sites may need multiple crawls per day, while static sites can be crawled far less often. Start with a conservative schedule and adjust based on observed update rates. Pay attention to server responses and slow down if you see rate limits or errors. Your own resource constraints and data freshness goals also matter. The right schedule is a balance of coverage, freshness, and politeness.
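
The adjustment loop described above can be sketched as a small function that picks the next crawl interval from what the last crawl observed. The multipliers and the 1 to 168 hour limits are illustrative assumptions, not recommended values, and `nextIntervalHours` is a hypothetical helper.

```javascript
// Choose the next crawl interval (in hours) from the last crawl's outcome.
// - throttled: the server answered with rate limits or errors, so back off
// - changed: content differed from the previous crawl, so crawl sooner
// - otherwise: nothing changed, so crawl less often
// All multipliers and limits are assumptions made for this example.
function nextIntervalHours(currentHours, { changed, throttled }, limits = { min: 1, max: 168 }) {
  let next;
  if (throttled) {
    next = currentHours * 2;
  } else if (changed) {
    next = currentHours / 2;
  } else {
    next = currentHours * 1.5;
  }
  return Math.min(limits.max, Math.max(limits.min, next));
}

console.log(nextIntervalHours(24, { changed: true, throttled: false }));
console.log(nextIntervalHours(24, { changed: false, throttled: false }));
```

Checking `throttled` first means politeness wins over freshness when both signals are present.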

----
url: https://webcrawlerapi.com/blog/top-web-scraping-apis-in-2025
----

API · Comparison · 10 min read

Top 6 Web Scraping APIs in 2025

===============================

The top 6 web scraping APIs in 2025. Get content or structured data with a single API call.

Written by Andrew

Published on Jan 31, 2026

### Table of Contents

*   [What Is a Web Scraping API?](#what-is-a-web-scraping-api)

*   [1\. Bright Data](#1-bright-data)

*   [2\. ScrapingBee](#2-scrapingbee)

*   [3\. Decodo (formerly Smartproxy)](#3-decodo-formerly-smartproxy)

*   [4\. ScraperAPI](#4scraperapi)

*   [5\. ScrapingDog](#5-scrapingdog)

*   [6\. WebcrawlerAPI](#6-webcrawlerapi)

What Is a Web Scraping API?

---------------------------

A web scraping API collects data from websites so that you do not have to build your own scraper. You send a website URL to the API, and the data is returned in an easy-to-use format such as JSON or CSV (if you need Markdown, you can use the [URL to Markdown API](https://webcrawlerapi.com/scrapers/webcrawler/url-to-md/description)). Proxies, browser headers, and CAPTCHAs are handled for you.

However, a scraping API returns the content of a single page only. If you need content from all pages of a website, use [Webcrawler API](https://webcrawlerapi.com/blog/what-is-a-web-crawling-api).

Here is the top list:

1.  Bright Data

2.  ScrapingBee

3.  Decodo (formerly Smartproxy)

4.  ScraperAPI

5.  ScrapingDog

6.  WebcrawlerAPI

### 1\. [Bright Data](https://brightdata.com)

Bright Data is known for offering powerful tools for scraping complex websites. Its main APIs are the Web Unlocker and SERP API. The Web Unlocker API is used for general web scraping. It supports JavaScript rendering and advanced anti-bot protections. A large residential proxy network is included.

The SERP API is designed for search engine scraping. It offers a 99% success rate and can solve CAPTCHAs.

Data is returned in JSON, CSV, and other formats. Delivery can be done by API, email, cloud storage, or other methods.

Pricing: Starts at $1.5 per 1,000 requests. A pay-as-you-go plan is also available. A 7-day free trial is offered.

### [2\. ScrapingBee](https://www.scrapingbee.com/)

ScrapingBee is used to scrape complex websites without writing much code. It handles rotating proxies, JavaScript rendering, and CAPTCHA solving for you. A Stealth Proxy feature is being tested to improve scraping of hard-to-access sites.

The API was built for simplicity. Minimal setup is needed, and it works well even for beginners. Developers can quickly connect the API using tutorials and client libraries in different languages.

ScrapingBee is often chosen for scraping e-commerce, booking, and real estate platforms. Data is returned in structured formats like JSON.

Pricing: Starts at $49 per month. Business plans are available from $599. A free trial is not clearly listed, but flexible plans are provided.

### [3\. Decodo (formerly Smartproxy)](https://decodo.com/)

Decodo is used to scrape web, SERP, e-commerce, and social media platforms with high success and performance. It offers dedicated APIs for each use case and supports proxy rotation, JavaScript rendering, and anti-detection by default. Developers can integrate using Postman, GitHub code samples, or an API playground for live testing.

Results are returned in JSON, and support is available 24/7. Decodo is often chosen for its balance between quality and pricing, especially with Core and Advanced plans offering flexibility for different levels of usage.

Pricing: Starts at $29/month for 100K requests ($0.29/1K). Advanced plans include more features and geotargeting. A 7-day free trial or a 14-day money-back guarantee is available.

### [4\. ScraperAPI](https://www.scraperapi.com/)

ScraperAPI is built for scraping simpler websites at a low cost. It supports multiple programming languages like Python, PHP, and Java.

Headers and sessions can be customized.

Google and Amazon data can be extracted by changing parameters.

Proxy rotation and CAPTCHA solving are available. It is a good choice for developers looking for a simple solution.

Pricing: A free plan includes 1,000 credits. Paid plans start at $49 per month for 100,000 credits. A 7-day trial with 5,000 credits is offered.

### [5\. ScrapingDog](https://www.scrapingdog.com/)

Scrapingdog is used to collect data from almost any website with minimal setup. It provides a simple Web Scraping API that can be integrated quickly into any development environment. Clear documentation and regular tutorials are offered to help developers get started.

Each new user is given 1,000 free credits for testing. The cost per request starts at just $0.0002 and drops to below $0.000063 with higher volume, making it one of the most affordable options available.

Customer support is available 24/7 to assist with any issues related to the service.

Key Features:

*   Easy integration with clean documentation

*   Video tutorials and blog support

*   1,000 free credits on sign-up

*   Scales to very low cost per scrape with volume

*   24/7 customer support

Scrapingdog is often chosen by developers looking for a low-cost, developer-friendly API that works out-of-the-box for a wide range of websites.

### [6\. WebcrawlerAPI](https://webcrawlerapi.com/)

WebcrawlerAPI is used to extract complete website content in formats like Markdown, HTML, or plain text—perfect for developers building AI or LLM applications. With a simple setup, internal links are followed, JavaScript is rendered, and CAPTCHAs are bypassed automatically. No subscription is required: pricing is based only on successful requests. Developers can start crawling with just a few lines of code in Node.js, [Python](https://webcrawlerapi.com/blog/how-to-crawl-the-website-with-python), or PHP. WebcrawlerAPI is often chosen for projects needing clean, ready-to-use content for training models or powering search.

WebcrawlerAPI uses a pay-as-you-go model. Pricing starts at $0.002 per page.

Key Features:

*   Full website crawling with smart link handling

*   Supports JS rendering and anti-bot bypass

*   Markdown, HTML, or plain text output

*   Pay-per-use with no hidden fees

*   SDKs for Node.js, Python, PHP, C#

*   LLM framework integration (Langchain)

*   S3 compatible storage upload

*   Ideal for LLM training and RAG use cases

----
url: https://webcrawlerapi.com/changelog/2025-03-14-headless-browser-improvements
----

March 14, 2025

Headless Browser Improvements

=============================

Major improvements to our headless browser implementation for enhanced web scraping capabilities:

*   Improved anti-bot protection bypass mechanisms

*   Enhanced blocking of non-essential content:

    *   Advertisement content filtering

    *   Cookie consent banner removal

    *   Other non-page-content elements blocking

*   These updates result in cleaner data extraction and improved scraping reliability

----
url: https://webcrawlerapi.com/blog/best-web-crawler-api-2024
----

15 min read

What is the best crawling API in 2024?

======================================

How do you choose a crawler API that fits your needs? What are the best web crawling APIs in 2024?

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Oxylab](#oxylab)

*   [Crawlbase](#crawlbase)

*   [Usescraper](#usescraper)

*   [Apify](#apify)

*   [WebcrawlerAPI](#webcrawlerapi)

*   [Conclusion](#conclusion)

**If you are one of the competitors mentioned in the comparison and you see a typo or a wrong description of your API, please [contact us](mailto:hello@webcrawlerapi.com) as soon as possible. We want to provide the most objective comparison to help you choose the best web crawler API.**

The choice of a crawler API is not to be taken lightly. It's a vital base service for AI chatbots, SEO tools, and other applications. In this context, a web crawler API is one of the fundamental building blocks of all businesses. A wrong choice can have

We searched for web crawler APIs, and here is the list with basic descriptions, features, and prices.

These are the best web crawler APIs:

*   [Oxylab](#oxylab)

*   [Crawlbase](#crawlbase)

*   [Usescraper](#usescraper)

*   [Apify](#apify)

*   [WebcrawlerAPI](#webcrawlerapi)

Oxylab

------

URL: [https://developers.oxylabs.io/scraper-apis/web-crawler](https://developers.oxylabs.io/scraper-apis/web-crawler)

As of the date of this article, Oxylabs is the top result on Google. Oxylabs is one of the largest scraping providers, offering various services, including scraping solutions, residential proxies, and crawled datasets. While web crawling is relatively new and not a central feature for them, they do offer basic crawling capabilities:

#### Features

Read more about the Oxylabs web crawler in the [Oxylabs web crawler documentation](https://developers.oxylabs.io/scraper-apis/web-crawler).

*   JavaScript rendering

*   filtering by the max depth and regular expressions

*   custom user agent type

*   custom geolocation

*   output format: raw html or parsed json

*   upload to custom storage, like S3

*   scheduling

#### Pricing

Pricing is subscription-based:

*   from 2.8$ per 1k pages on the 49$/month plan

*   to 1.9$ per 1k pages on the 2000$/month plan

Crawlbase

---------

URL: [https://crawlbase.com/crawling-api-avoid-captchas-blocks](https://crawlbase.com/crawling-api-avoid-captchas-blocks)

Crawlbase is a significant scraping and crawling provider. Its core feature is data scraping: powerful templates for all big websites, like Amazon and eBay, allow the extraction of formatted data. If you choose crawling, Crawlbase provides a custom-configured crawler running on a dedicated virtual machine.

#### Features

*   javascript rendering

*   scraping params: custom user agent

*   custom javascript execution

*   custom request headers

*   custom request cookies

*   custom geolocation

*   page screenshot (if you want to take a screenshot of any page, use the best screenshot API for this: [https://screenshotone.com](https://screenshotone.com))

*   upload to Crawlbase cloud storage

*   output format: raw html or parsed json

*   variety of scrapers to extract and format data from crawled pages

#### Pricing

Crawlbase uses a pay-as-you-go pricing model:

*   from 6$/1k pages (with javascript rendering)

*   to 0.08$/1k pages if you want to crawl more than 1B pages per month

Usescraper

----------

URL: [https://usescraper.com/](https://usescraper.com/)

This is a good project made by an indie hacker. It has a scraping and crawling API, plus a no-code UI for crawling a full website.

#### Features

*   Comprehensive UI to manage crawling and scraping

*   javascript rendering

*   multi-site crawling per job

*   webhook update

*   exclude page by list

*   exclude elements from the page by the css selector

*   output formats: raw html, text, markdown

*   crawl data expiration

*   page limit per job

*   skip pages by content size

*   crawl pages from the sitemap

*   block resources

*   include linked files (e.g. PDFs, images)

#### Pricing

Simple pay-as-you-go pricing:

*   1$ per 1k pages

Apify

-----

URL: [https://apify.com/](https://apify.com/)

Apify is a platform for building and deploying custom scrapers and crawlers. Although it is not a classical API, you still get a programmatic interface for the crawlers hosted there. Apify is the best option if you are comfortable with coding and want to build a highly customized crawler on your own. To crawl a website, you have to write your own crawler code or deploy one of the ready-made templates.

#### Features

*   code your own crawler

*   all the essential building blocks for coding your own crawler are available on the platform

#### Pricing

*   based on the resources used by the virtual machine that hosts your crawler

WebcrawlerAPI

-------------

URL: [https://webcrawlerapi.com/](https://webcrawlerapi.com/)

Although it can also scrape data, WebcrawlerAPI focuses mainly on crawling content. WebcrawlerAPI is a new solution on the market and is actively adding features. You can even request a feature yourself.

#### Features

*   comprehensive UI to manage jobs

*   javascript rendering

*   page limit per job

*   white and black lists using regular expressions

*   webhook update

*   output format: raw html

*   clean content (remove all html tags and useless data from content)

*   built-in residential proxies

#### Pricing

Pricing is simple pay-as-you-go:

*   2$ per 1k pages
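
To compare the entry-level per-1k prices quoted in this article for a given crawl size, a quick calculation helps. The sketch below is plain arithmetic on the figures listed in the pricing sections; it ignores subscription minimums and volume discounts, and the function and variable names are illustrative.

```javascript
// Entry-level price per 1,000 pages, as quoted in this comparison (USD).
const pricePer1k = {
  Oxylabs: 2.8, // rate on the 49$/month plan
  Crawlbase: 6, // with JavaScript rendering
  Usescraper: 1,
  WebcrawlerAPI: 2,
};

// Cost of crawling `pages` pages at a per-1k price, rounded to cents.
function estimateCost(pages, per1k) {
  return Math.round((pages / 1000) * per1k * 100) / 100;
}

const pages = 50000;
for (const [name, per1k] of Object.entries(pricePer1k)) {
  console.log(`${name}: $${estimateCost(pages, per1k)} for ${pages} pages`);
}
```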

### Conclusion

We tried to make this analysis as impartial and independent as possible. Please get in touch [with us](mailto:hello@webcrawlerapi.com) if any information is incorrect.

We are working hard on WebcrawlerAPI and want to help you solve your problem the fastest way. If you plan to use WebcrawlerAPI and are missing a feature, don't hesitate to contact us via email: [hello@webcrawlerapi.com](mailto:hello@webcrawlerapi.com).

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-can-i-get-detailed-initiator-data-from-cdp-in-puppeteer
----

Use the capability to retrieve detailed initiator data from CDP when available, and filter out goog: data from events by enabling bidi mode.

    # enable filtering of goog: data in bidi only mode

    export PUPPETEER_WEBDRIVER_BIDI_ONLY=1

    const puppeteer = require('puppeteer');

    (async () => {

      const browser = await puppeteer.launch();

      const page = await browser.newPage();

      // access detailed initiator data via CDP when CDP data is available

      const client = await page.target().createCDPSession();

      await client.send('Network.enable');

      client.on('Network.requestWillBeSent', (params) => {

        // detailed initiator information is available here

        console.log(params.initiator);

      });

      await page.goto('https://example.com');

      await browser.close();

    })();

Notes: This approach uses CDP data for initiator details when available and filters out internal goog: data when PUPPETEER\_WEBDRIVER\_BIDI\_ONLY is enabled.

----
url: https://webcrawlerapi.com/glossary/webcrawling/what-are-common-web-crawling-tools
----

### Answer

Common web crawling tools include Scrapy, Apache Nutch, Playwright, Puppeteer, and managed crawler platforms. Scrapy and Nutch are strong for large‑scale HTML crawling and scheduling. Playwright and Puppeteer are better when you need JavaScript rendering. Managed platforms reduce infrastructure work but add usage costs. Your choice depends on scale, dynamic content, and operational complexity. Many teams mix tools to cover different page types.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-execution-context-was-destroyed
----

This happens when evaluation starts on one document and the page navigates before it finishes.

Coordinate the click and navigation in one Promise.all so Playwright tracks both.

    await Promise.all([

      page.waitForNavigation({ waitUntil: 'domcontentloaded' }),

      page.getByRole('link', { name: 'Next page' }).click(),

    ]);

    await page.evaluate(() => document.title);

Also avoid long-running page.evaluate() calls right before known navigations.

----
url: https://webcrawlerapi.com/blog/json-vs-plain-text-choosing-the-right-format-for-llm-prompts
----

Comparison · JSON · RAG

JSON vs Plain Text: Choosing the Right Format for LLM Prompts

=============================================================

JSON vs plain text for scraping and RAG pipelines: when strict fields are needed, when raw text is enough, and how to choose safely.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What JSON is good at](#what-json-is-good-at)

*   [What plain text is good at](#what-plain-text-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When JSON should be used](#when-json-should-be-used)

*   [When plain text should be used](#when-plain-text-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [Plain text makes QA harder](#plain-text-makes-qa-harder)

*   [JSON can lose nuance](#json-can-lose-nuance)

*   [Node.js snippet: Attach metadata to plain text for RAG](#nodejs-snippet-attach-metadata-to-plain-text-for-rag)

*   [Conclusion](#conclusion)

JSON and plain text usually serve different goals. JSON is used when fields must be extracted and parsed. Plain text is used when content must be read, embedded, or searched without strict structure.

A broader overview is available in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | JSON | Plain Text |
| --- | --- | --- |
| Best for | Structured extraction | Raw content and simple inputs |
| Parsing reliability | High | Low |
| Human readability | Medium | High |
| RAG embeddings | Good (metadata) | Good (content) |
| Common failure | Invalid JSON | Ambiguous boundaries and missing fields |

What JSON is good at

--------------------

JSON is usually selected when:

*   Product, article, or directory fields must be extracted

*   Downstream systems expect predictable keys

*   Validation and schema constraints are required

If a readable report is needed, [Markdown vs JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts) can be a better fit.
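
When validation is required, the check can be as small as parsing the output and verifying the expected keys and types. The following is a minimal sketch under stated assumptions: the field names (`url`, `title`, `price`) and the helper `parseRecord` are hypothetical, and a full JSON Schema validator would usually replace the hand-written loop.

```javascript
// Expected keys and their JavaScript types (hypothetical example schema).
const requiredFields = { url: "string", title: "string", price: "number" };

// Parse a JSON string and check it against requiredFields.
// Returns { ok: true, data } or { ok: false, errors }.
function parseRecord(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    return { ok: false, errors: [`invalid JSON: ${err.message}`] };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, errors: ["expected a JSON object"] };
  }
  const errors = [];
  for (const [key, type] of Object.entries(requiredFields)) {
    if (typeof data[key] !== type) errors.push(`${key}: expected ${type}`);
  }
  return errors.length > 0 ? { ok: false, errors } : { ok: true, data };
}

console.log(parseRecord('{"url":"https://example.com","title":"Example","price":9.5}'));
console.log(parseRecord('{"url":"https://example.com"}'));
```

Returning a list of errors instead of throwing makes it easier to log bad records and continue with the rest of a crawl.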

What plain text is good at

--------------------------

Plain text is usually selected when:

*   Source content is being fed into embeddings

*   Formatting is unnecessary or harmful

*   A later step will perform extraction

If the source is HTML, output choices are covered in [HTML vs Cleaned Text](/blog/html-vs-cleaned-text-choosing-the-right-output-format) and [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When JSON should be used

JSON is usually preferred when:

*   A database insert will happen

*   Deduping is done by keys (sku, url, canonical\_url)

*   Multiple fields must be extracted per page

### When plain text should be used

Plain text is usually preferred when:

*   The goal is semantic search over page content

*   Chunking and embedding are the next steps

*   "Good enough" extraction is acceptable, or extraction is deferred

If headings are useful for chunking, Markdown can be used instead, as covered in [Markdown vs Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts).
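
For the plain-text path, the chunking step can be sketched as a paragraph-based splitter. The 500-character default and the name `chunkText` are assumptions for illustration; real pipelines often size chunks by tokens instead of characters.

```javascript
// Split plain text into chunks of at most maxChars characters,
// packing whole paragraphs together. A single paragraph longer than
// maxChars is kept intact as its own chunk rather than being cut.
function chunkText(text, maxChars = 500) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText("First paragraph.\n\nSecond paragraph.\n\nThird paragraph.", 40));
```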

Practical tradeoffs

-------------------

### Plain text makes QA harder

Without fields, it becomes harder to check if "price" or "author" was extracted correctly. Everything becomes a text search problem.

### JSON can lose nuance

If the entire page is forced into JSON fields, nuance can be lost unless a raw text field is included too.

A common compromise is:

*   Plain text (or Markdown) is stored as content

*   JSON metadata is stored as meta

Node.js snippet: Attach metadata to plain text for RAG

------------------------------------------------------

This pattern keeps the chunk text clean while keeping metadata separate.

    // Node 18+

    // Wrap plain text content with a JSON metadata envelope.

    import { readFile } from "node:fs/promises";

    const content = await readFile("content.txt", "utf8");

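    const record = {

      // Chunk text stays clean for embedding.

      content,

      // Metadata is kept separate under its own key.

      meta: {

        url: "https://example.com/page",

        title: "Example Page",

      },

    };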

    console.log(JSON.stringify(record, null, 2));

Conclusion

----------

*   JSON is usually selected for extraction and reliable parsing.

*   Plain text is usually selected for content-first RAG ingestion and low overhead.

*   A hybrid is often used: plain text for content and JSON for metadata.

If the decision is between human-friendly structure and raw text, [Markdown vs Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts) should be compared next.

----
url: https://webcrawlerapi.com/changelog/2025-03-31-dashboard-improvements
----

March 31, 2025

Major Dashboard Improvements

============================

Major dashboard improvements 💫

*   Enhanced login with email form:

    *   Implemented rate limiting for magic link emails

    *   Improved user experience and security

*   Dashboard page enhancements:

    *   Added time period toggles (24h, 7d, 15d, 30d)

    *   Implemented total counter for each period

    *   Enhanced graphs for funds spent and crawled pages

*   New dedicated billing page:

    *   Comprehensive payment history

    *   Detailed payment usage tracking for all time

----
url: https://webcrawlerapi.com/docs/structured-outputs
----

Structured Outputs with Prompts

===============================

Define JSON schemas to structure AI responses when using prompts for data extraction

Structured Outputs ensure that AI-generated responses adhere to a JSON schema you define. This feature eliminates the need to validate or retry incorrectly formatted responses, making it perfect for extracting structured data from web pages.

[Benefits](#benefits)

---------------------

*   **Reliable type-safety**: No need to validate or retry incorrectly formatted responses

*   **Consistent formatting**: The AI output will always match your defined structure

*   **Simpler implementation**: Define your schema once and get predictable results every time

[How It Works](#how-it-works)

-----------------------------

When you provide a `prompt`, the `/v2/scrape` endpoint returns a JSON object in `structured_data` instead of markdown or HTML. Add an optional `response_schema` to enforce a strict JSON schema for the response. The schema follows the [JSON Schema](https://json-schema.org/) format used by OpenAI Structured Outputs.

[Basic Example](#basic-example)

-------------------------------

Extract product information with a guaranteed structure:

    curl --request POST \

      --url https://api.webcrawlerapi.com/v2/scrape \

      --header 'Authorization: Bearer YOUR_API_KEY' \

      --header 'Content-Type: application/json' \

      --data '{

        "url": "https://example.com/product/widget",

        "prompt": "Extract product details from this page",

        "response_schema": {

          "type": "object",

          "properties": {

            "product_name": {"type": "string"},

            "price": {"type": "number"},

            "in_stock": {"type": "boolean"},

            "description": {"type": "string"}

          },

          "required": ["product_name", "price", "in_stock"],

          "additionalProperties": false

        }

      }'

Response:

    {

      "success": true,

      "status": "done",

      "page_status_code": 200,

      "page_title": "Premium Widget",

      "structured_data": {

        "product_name": "Premium Widget",

        "price": 29.99,

        "in_stock": true,

        "description": "A high-quality widget for all your needs"

      }

    }
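
On the client side, the response can be checked before the extracted fields are used. A minimal sketch using only the standard library; the helper name is hypothetical and the sample body is copied from the response above:

```python
import json


def extract_structured_data(response_text: str) -> dict:
    """Parse a scrape response body and return its structured_data object."""
    payload = json.loads(response_text)
    if not payload.get("success") or payload.get("status") != "done":
        raise ValueError(f"scrape not finished: status={payload.get('status')!r}")
    return payload["structured_data"]


sample = """
{
  "success": true,
  "status": "done",
  "page_status_code": 200,
  "page_title": "Premium Widget",
  "structured_data": {
    "product_name": "Premium Widget",
    "price": 29.99,
    "in_stock": true,
    "description": "A high-quality widget for all your needs"
  }
}
"""
product = extract_structured_data(sample)
print(product["product_name"], product["price"], product["in_stock"])
```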

[Schema Format](#schema-format)

-------------------------------

Your `response_schema` must be a valid JSON Schema object. OpenAI structured outputs are strict, so we recommend following these conventions to avoid schema validation errors:

### [Recommended Fields](#recommended-fields)

*   `type`: Use `"object"` at the root level

*   `properties`: Define the structure of your data

*   `required`: Include required property names for predictable output

*   `additionalProperties`: Set to `false` to keep the output strict
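
These conventions can be checked locally before a request is sent. The helper below is a hypothetical sketch that only inspects the root of the schema; it is not the validation the API itself performs.

```python
def lint_schema(schema: dict) -> list[str]:
    """Return warnings for a response_schema that ignores the conventions above."""
    warnings = []
    if schema.get("type") != "object":
        warnings.append('root "type" should be "object"')
    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        warnings.append('"properties" should define at least one field')
        properties = {}
    unknown = [name for name in schema.get("required", []) if name not in properties]
    if unknown:
        warnings.append(f'"required" lists undefined properties: {unknown}')
    if schema.get("additionalProperties") is not False:
        warnings.append('"additionalProperties" should be false')
    return warnings


schema = {
    "type": "object",
    "properties": {"product_name": {"type": "string"}},
    "required": ["product_name"],
    "additionalProperties": False,
}
print(lint_schema(schema))
```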

### [Supported Types](#supported-types)

*   `string` - Text data

*   `number` - Numeric values (integers or decimals)

*   `boolean` - True/false values

*   `object` - Nested objects

*   `array` - Lists of items

*   `enum` - Predefined set of values

[Advanced Examples](#advanced-examples)

---------------------------------------

### [Nested Objects](#nested-objects)

Extract business information with address details:

    {

      "type": "object",

      "properties": {

        "business_name": {"type": "string"},

        "phone": {"type": "string"},

        "address": {

          "type": "object",

          "properties": {

            "street": {"type": "string"},

            "city": {"type": "string"},

            "state": {"type": "string"},

            "postal_code": {"type": "string"}

          },

          "required": ["street", "city"],

          "additionalProperties": false

        }

      },

      "required": ["business_name", "address"],

      "additionalProperties": false

    }

### [Arrays of Objects](#arrays-of-objects)

Extract multiple products from a listing page:

    {

      "type": "object",

      "properties": {

        "products": {

          "type": "array",

          "items": {

            "type": "object",

            "properties": {

              "name": {"type": "string"},

              "price": {"type": "number"},

              "rating": {"type": "number"}

            },

            "required": ["name", "price"],

            "additionalProperties": false

          }

        }

      },

      "required": ["products"],

      "additionalProperties": false

    }

### [Enum Constraints](#enum-constraints)

Restrict values to predefined options:

    {

      "type": "object",

      "properties": {

        "product_name": {"type": "string"},

        "category": {

          "type": "string",

          "enum": ["electronics", "clothing", "books", "home"]

        },

        "condition": {

          "type": "string",

          "enum": ["new", "used", "refurbished"]

        }

      },

      "required": ["product_name", "category", "condition"],

      "additionalProperties": false

    }

### [Optional Fields](#optional-fields)

Use null union types for optional fields:

    {

      "type": "object",

      "properties": {

        "name": {"type": "string"},

        "email": {"type": ["string", "null"]},

        "phone": {"type": ["string", "null"]}

      },

      "required": ["name", "email", "phone"],

      "additionalProperties": false

    }

Even though all fields are in the `required` array, `email` and `phone` can be `null` if the information isn't available.
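
When schemas are generated in code, the null-union pattern can be applied mechanically. A small sketch, assuming flat schemas with primitive field types; `build_schema` is an illustrative helper, not an SDK function:

```python
def build_schema(required_fields: dict[str, str], optional_fields: dict[str, str]) -> dict:
    """Build a flat object schema where optional fields become null unions."""
    properties = {name: {"type": type_name} for name, type_name in required_fields.items()}
    for name, type_name in optional_fields.items():
        properties[name] = {"type": [type_name, "null"]}
    return {
        "type": "object",
        "properties": properties,
        # Every property is listed as required; optional ones may come back as null.
        "required": list(properties),
        "additionalProperties": False,
    }


schema = build_schema({"name": "string"}, {"email": "string", "phone": "string"})
print(schema["required"])
```

Consumers of the result should then treat `None` as "not found on the page" instead of as a missing key.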

[Schema Constraints](#schema-constraints)

-----------------------------------------

To ensure performance and reliability, structured outputs have these limitations:

*   **Maximum properties**: 5,000 object properties total

*   **Nesting depth**: Maximum 10 levels of nested objects

*   **Enum values**: Maximum 1,000 enum values across all enum properties

*   **String length**: Total string length of all property names, enum values, and const values cannot exceed 120,000 characters
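
A rough pre-flight check against the first three limits might look like the sketch below. How the service counts properties and nesting levels is an assumption here (the root object is treated as level 1, and arrays do not add a level), and the string-length limit is not checked.

```python
LIMITS = {"properties": 5000, "depth": 10, "enum_values": 1000}


def schema_stats(node: dict, depth: int = 0) -> dict:
    """Count properties, enum values, and object nesting depth in a schema."""
    if "properties" in node:
        depth += 1  # each object with properties adds one nesting level
    stats = {
        "properties": len(node.get("properties", {})),
        "enum_values": len(node.get("enum", [])),
        "depth": depth,
    }
    children = list(node.get("properties", {}).values())
    if isinstance(node.get("items"), dict):
        children.append(node["items"])  # descend into array item schemas
    for child in children:
        child_stats = schema_stats(child, depth)
        stats["properties"] += child_stats["properties"]
        stats["enum_values"] += child_stats["enum_values"]
        stats["depth"] = max(stats["depth"], child_stats["depth"])
    return stats


def exceeded_limits(schema: dict) -> list[str]:
    """Return the names of the limits that the schema exceeds."""
    stats = schema_stats(schema)
    return [name for name, limit in LIMITS.items() if stats[name] > limit]


sample = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "condition": {"type": "string", "enum": ["new", "used", "refurbished"]},
        "address": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}
print(schema_stats(sample), exceeded_limits(sample))
```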

[Error Handling](#error-handling)

---------------------------------

### [Invalid Schema](#invalid-schema)

If your schema is invalid, you'll receive an error from the AI model:

    {

      "success": false,

      "error_code": "invalid_schema",

      "error_message": "Invalid response schema format"

    }
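
Client code can turn this error body into an exception so that failures are not silently treated as data. A minimal sketch; `ScrapeError` and `raise_for_error` are hypothetical names, and only the `success`, `error_code`, and `error_message` fields shown above are assumed:

```python
class ScrapeError(Exception):
    """Raised when the API reports success=false."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(payload: dict) -> dict:
    """Return the payload unchanged, or raise ScrapeError on a failed response."""
    if payload.get("success") is False:
        raise ScrapeError(
            payload.get("error_code", "unknown"),
            payload.get("error_message", ""),
        )
    return payload


try:
    raise_for_error({
        "success": False,
        "error_code": "invalid_schema",
        "error_message": "Invalid response schema format",
    })
except ScrapeError as exc:
    print("schema rejected:", exc.code)
```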

### [No Prompt Provided](#no-prompt-provided)

The `response_schema` parameter only works when a `prompt` is also provided. If you include a schema without a prompt, it will be ignored.

### [Prompt Without a Schema](#prompt-without-a-schema)

If you send a `prompt` without `response_schema`, the API still returns `structured_data`, but uses JSON-object mode instead of strict schema validation.

### [LLM Refusal](#llm-refusal)

In rare cases, the AI may refuse to process content for safety reasons. You'll receive a refusal message explaining why.

[Pricing](#pricing)

-------------------

Structured outputs cost the same as regular prompts: **$0.002 per request** with prompt (in addition to the base crawling cost).
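
As a worked example of the arithmetic, the prompt fee is added on top of the base cost of each request. The base crawling cost used below is a placeholder assumption, not a published price; check the pricing page for the real figure.

```python
from decimal import Decimal

PROMPT_FEE = Decimal("0.002")  # per request with a prompt, from the note above


def estimate_cost(request_count: int, base_cost_per_request: Decimal) -> Decimal:
    """Total cost = requests x (base crawling cost + prompt fee)."""
    return request_count * (base_cost_per_request + PROMPT_FEE)


# Hypothetical base cost of $0.001 per page, for illustration only.
print(estimate_cost(1000, Decimal("0.001")))
```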

[SDK Support](#sdk-support)

---------------------------

### [JavaScript/TypeScript](#javascripttypescript)

    import WebcrawlerAPI from 'webcrawlerapi';

    const client = new WebcrawlerAPI({ apiKey: 'YOUR_API_KEY' });

    const response = await client.scrapeUrl({

      url: 'https://example.com/product',

      prompt: 'Extract product details',

      response_schema: {

        type: 'object',

        properties: {

          name: { type: 'string' },

          price: { type: 'number' },

          in_stock: { type: 'boolean' }

        },

        required: ['name', 'price', 'in_stock'],

        additionalProperties: false

      }

    });

    console.log(response.structured_data);

### [Python](#python)

    from webcrawlerapi import WebcrawlerAPI

    client = WebcrawlerAPI(api_key='YOUR_API_KEY')

    response = client.scrape_url(

        url='https://example.com/product',

        prompt='Extract product details',

        response_schema={

            'type': 'object',

            'properties': {

                'name': {'type': 'string'},

                'price': {'type': 'number'},

                'in_stock': {'type': 'boolean'}

            },

            'required': ['name', 'price', 'in_stock'],

            'additionalProperties': False

        }

    )

    print(response['structured_data'])

[Best Practices](#best-practices)

---------------------------------

1.  **Clear property names**: Use descriptive, self-documenting property names

2.  **Specific prompts**: Combine schemas with clear, specific prompts for best results

3.  **Start simple**: Begin with basic schemas and add complexity as needed

4.  **Test iteratively**: Test your schemas with sample pages to refine the structure

5.  **Handle nulls**: Use null unions for optional data that may not always be present

[Related Documentation](#related-documentation)

-----------------------------------------------

*   [Async Requests](/docs/async-requests) - Process multiple pages with structured output

*   [Crawling Types](/docs/crawling-types) - Different output formats available

*   [Rate Limits](/docs/rate-limits) - API usage limits and best practices

----
url: https://webcrawlerapi.com/blog/what-is-the-difference-between-web-crawling-and-scraping
----

Technical · Start here · 2 min read

What is the difference between web crawling and scraping?

=========================================================

Scraping and crawling are techniques used to automate data retrieval from the Web. They are closely related, but they have different goals and processes.

Written by Andrew

Published on Feb 1, 2026

Scraping and crawling are techniques used to automate data retrieval from the Web. Key differences between the two include their goals and processes.

**Web crawling** is the process of discovering and fetching pages by following links. It aims to cover many pages (sometimes an entire site) and collect their content and metadata. In real life, good crawling is also about limits: scope rules, deduplication, and being polite with rate limits.

**Scraping** is the process of extracting specific data from web pages. It is more targeted and aims to obtain particular information from a page, such as prices, product descriptions, event dates, or user emails. Unlike web crawling, scraping often relies on techniques for avoiding blocks, for example rotating proxies, changing the browser's User-Agent, and emulating user behaviour.
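
The two steps can be shown side by side on one small in-memory page. This is a sketch using only the standard library; the HTML, the `price` class name, and both parser classes are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Crawling step: discover links that could be fetched next."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class PriceExtractor(HTMLParser):
    """Scraping step: pull one specific field out of the page."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and self.price is None:
            self.price = data.strip()
            self._in_price = False


html = '<a href="/p/1">One</a><a href="/p/2">Two</a><span class="price">$29.99</span>'

collector = LinkCollector("https://example.com")
collector.feed(html)
extractor = PriceExtractor()
extractor.feed(html)

print(collector.links)
print(extractor.price)
```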

----
url: https://webcrawlerapi.com/changelog/2025-02-19-webpage-to-markdown
----

February 19, 2025

Webpage to Markdown Tool Launch

===============================

A new tool [Webpage to Markdown](https://webcrawlerapi.com/tools/website-to-md) has been added. This tool converts any documentation or website into a beautiful Markdown file. It is free and does not require an API key. It can crawl up to 100 pages.

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-is-the-difference-between-browser-close-and-browser-disconnect
----

browser.close() and browser.disconnect() both end your current control flow, but they affect the browser lifecycle differently.

*   `browser.close()` closes the entire browser process.

    *   It closes all pages/tabs and releases browser resources.

    *   Use it when your automation run is fully finished.

*   `browser.disconnect()` only detaches your Puppeteer client.

    *   The browser process keeps running with its pages still open.

    *   Use it when you want the browser to continue running (for example, shared or remote sessions).

Quick rule:

*   Use close() to stop the browser.

*   Use disconnect() to stop only your connection.

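    import puppeteer from "puppeteer";

    // browserWSEndpoint: WebSocket URL of an already running browser (assumed to be defined elsewhere)

    const browser = await puppeteer.connect({ browserWSEndpoint });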

    // Keep browser running, just detach this script

    await browser.disconnect();

    // Fully shut down browser process

    // await browser.close();

----
url: https://webcrawlerapi.com/glossary?category=scraping
----

Glossary

Web Scraping & API Glossary

===========================

Comprehensive glossary of web scraping, crawling, and API terms. Learn the essential concepts and terminology used in web data extraction.

----
url: https://webcrawlerapi.com/docs/sdk/langchain
----

LangChain Integration

=====================

Seamlessly convert websites and webpages into markdown format for LLM data processing pipelines

The WebCrawlerAPI LangChain integration allows you to seamlessly convert websites and webpages into markdown or cleaned content format, making it perfect for LLM data processing pipelines. This integration requires no subscription and provides a straightforward way to incorporate web crawling capabilities into your LangChain document processing workflow.

[Installation](#installation)

-----------------------------

First, obtain your API key from WebCrawlerAPI, then install the package using pip:

    pip install webcrawlerapi-langchain

[Usage](#usage)

---------------

### [Basic Loading](#basic-loading)

The simplest way to use the WebCrawlerAPI loader is through the basic loading method:

    from webcrawlerapi_langchain import WebCrawlerAPILoader

    # Initialize the loader

    loader = WebCrawlerAPILoader(

        url="https://example.com",

        api_key="your-api-key",

        scrape_type="markdown",

        items_limit=10

    )

    # Load documents

    documents = loader.load()

    # Use documents in your LangChain pipeline

    for doc in documents:

        print(doc.page_content[:100])

        print(doc.metadata)

### [Advanced Loading Methods](#advanced-loading-methods)

The SDK supports multiple loading patterns to suit different use cases:

#### [Async Loading](#async-loading)

For asynchronous operations:

    # Async loading

    documents = await loader.aload()

#### [Lazy Loading](#lazy-loading)

When dealing with large datasets:

    # Lazy loading

    for doc in loader.lazy_load():

        print(doc.page_content[:100])

#### [Async Lazy Loading](#async-lazy-loading)

Combining asynchronous and lazy loading:

    # Async lazy loading

    async for doc in loader.alazy_load():

        print(doc.page_content[:100])

[Configuration Options](#configuration-options)

-----------------------------------------------

The WebCrawlerAPILoader accepts the following configuration parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | string | The target URL to crawl |
| `api_key` | string | Your WebCrawlerAPI API key |
| `scrape_type` | string | Type of scraping (options: "html", "cleaned", "markdown") |
| `items_limit` | integer | Maximum number of pages to crawl |
| `whitelist_regexp` | string | Regex pattern for URL whitelist |
| `blacklist_regexp` | string | Regex pattern for URL blacklist |
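
To build intuition for the two regex parameters, URL filtering can be sketched locally. The semantics below are assumptions (substring-style `re.search` matching, with the blacklist taking precedence); how the service applies the patterns may differ.

```python
import re
from typing import Optional


def allowed(
    url: str,
    whitelist_regexp: Optional[str] = None,
    blacklist_regexp: Optional[str] = None,
) -> bool:
    """Return True if the URL passes the (assumed) whitelist/blacklist rules."""
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/news",
    "https://example.com/about",
]
print([u for u in urls if allowed(u, whitelist_regexp=r"/blog/", blacklist_regexp=r"/tag/")])
```

Testing patterns this way against a handful of known URLs is cheaper than discovering a too-broad pattern after a crawl has run.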

[Best Practices](#best-practices)

---------------------------------

1.  **Rate Limiting**: Be mindful of API rate limits when crawling multiple pages.

2.  **Error Handling**: Always implement proper error handling for network issues and API responses.

3.  **Content Type**: Choose the appropriate `scrape_type` based on your LLM's requirements:

    *   Use "markdown" for structured content

    *   Use "cleaned" for plain text

    *   Use "html" for raw HTML content
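
The first two practices can be combined in a small retry wrapper with exponential backoff. This is a generic sketch, not part of the loader: `load_with_retry` is a hypothetical helper, and catching every `Exception` is a simplification that real code would narrow to transient errors.

```python
import time


def load_with_retry(load, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call `load()` and retry on failure, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return load()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With a loader instance, it could be used as `documents = load_with_retry(loader.load)`.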

[Example Use Cases](#example-use-cases)

---------------------------------------

### [Document QA System](#document-qa-system)

A working code example is available on GitHub: [Book Information Extractor with LangChain and WebcrawlerAPI](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/langchain-basic).

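    from langchain.chains import RetrievalQA

    # OpenAI() is used as the LLM below (legacy import path, matching the other imports)

    from langchain.llms import OpenAI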

    from langchain.embeddings import OpenAIEmbeddings

    from langchain.vectorstores import Chroma

    from webcrawlerapi_langchain import WebCrawlerAPILoader

    # Load documents from a website

    loader = WebCrawlerAPILoader(

        url="https://docs.example.com",

        api_key="your-api-key",

        scrape_type="markdown"

    )

    documents = loader.load()

    # Create vector store

    embeddings = OpenAIEmbeddings()

    vectorstore = Chroma.from_documents(documents, embeddings)

    # Create QA chain

    qa_chain = RetrievalQA.from_chain_type(

        llm=OpenAI(),

        chain_type="stuff",

        retriever=vectorstore.as_retriever()

    )

[Error Handling](#error-handling)

---------------------------------

The loader implements robust error handling for common scenarios:

    try:

        loader = WebCrawlerAPILoader(

            url="https://example.com",

            api_key="your-api-key"

        )

        documents = loader.load()

    except WebCrawlerAPIError as e:

        print(f"API Error: {e}")

    except ValidationError as e:

        print(f"Configuration Error: {e}")

    except Exception as e:

        print(f"Unexpected Error: {e}")

[Support](#support)

-------------------

For additional support or to report issues, please visit the [WebCrawlerAPI documentation](https://webcrawlerapi.com/docs) or the [GitHub repository](https://github.com/webcrawlerapi/langchain-integration).

----
url: https://webcrawlerapi.com/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts
----

Comparison · Markdown · RAG

Markdown vs Plain Text: Choosing the Right Format for LLM Prompts

=================================================================

Markdown vs plain text for prompts and scraped content: structure, readability, chunking for RAG, and practical tradeoffs.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What Markdown is good at](#what-markdown-is-good-at)

*   [What plain text is good at](#what-plain-text-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When Markdown should be used](#when-markdown-should-be-used)

*   [When plain text should be used](#when-plain-text-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [Markdown can inflate tokens](#markdown-can-inflate-tokens)

*   [Plain text can hide hierarchy](#plain-text-can-hide-hierarchy)

*   [Node.js snippet: Create simple RAG chunks from Markdown headings](#nodejs-snippet-create-simple-rag-chunks-from-markdown-headings)

*   [Conclusion](#conclusion)

Markdown and plain text can look similar, but they set different expectations. Markdown implies structure (headings, lists). Plain text implies that structure is not needed and should not be relied on.

A broader guide to prompt data formats is provided in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | Markdown | Plain Text |
| --- | --- | --- |
| Best for | Readable structured docs | Raw content and simple prompts |
| Parsing reliability | Medium | Low (no explicit structure) |
| Human readability | High | High (but less scannable) |
| RAG chunking | Good (headings help) | Good (simpler, fewer tokens) |
| Common failure | Inconsistent formatting | Missing boundaries, ambiguous sections |

What Markdown is good at

------------------------

Markdown is usually selected when:

*   Sections should be clear (H2/H3 headings)

*   Lists should remain lists

*   Code examples should be fenced and preserved

Markdown output tradeoffs are covered in [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format).

What plain text is good at

--------------------------

Plain text is usually selected when:

*   A minimum surface area is wanted (no markup)

*   The content is already clean and should not be restructured

*   Prompt tokens should be reduced by removing formatting

If the source is HTML, the output decision is covered in [HTML vs Cleaned Text](/blog/html-vs-cleaned-text-choosing-the-right-output-format).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When Markdown should be used

Markdown is usually preferred when:

*   The output will be read by humans

*   Chunk boundaries should follow headings

*   Quotes, bullet points, and code blocks matter for meaning

### When plain text should be used

Plain text is usually preferred when:

*   The text is being embedded and retrieved by similarity search

*   Formatting noise should be removed

*   Simple extraction is being done with a second pass later

For strict extraction into fields, plain text is usually not enough. JSON is usually chosen, as covered in [Markdown vs JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts).

Practical tradeoffs

-------------------

### Markdown can inflate tokens

Headings and bullet syntax add tokens. That cost can matter when large crawls are processed. Plain text can be cheaper to store and embed.
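
The overhead can be estimated by stripping common markup and comparing sizes. The sketch below counts characters, which is only a rough proxy for tokens, and its regexes cover just headings, bullets, bold, and inline code markers.

```python
import re


def strip_markdown(md: str) -> str:
    """Remove a few common Markdown markers (headings, bullets, bold, inline code)."""
    text = re.sub(r"^#{1,6}[ \t]+", "", md, flags=re.MULTILINE)
    text = re.sub(r"^[ \t]*[*+-][ \t]+", "", text, flags=re.MULTILINE)
    text = re.sub(r"\*\*|__|`", "", text)
    return text


md = "## Pricing\n\n* **Basic**: $10\n* **Pro**: $20\n"
plain = strip_markdown(md)
overhead = len(md) - len(plain)
print(f"{overhead} of {len(md)} characters are markup")
```

For an exact figure, the same comparison can be run with the tokenizer of the target model.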

### Plain text can hide hierarchy

If multiple sections exist (pricing, terms, specs), headings can be valuable. Without them, chunking and retrieval can get worse.

Node.js snippet: Create simple RAG chunks from Markdown headings

----------------------------------------------------------------

This chunker is intentionally simple. It splits on `##` (H2) headings and keeps each heading with its chunk.

    // Node 18+

    // Split Markdown into chunks by H2 headings.

    import { readFile } from "node:fs/promises";

    const md = await readFile("page.md", "utf8");

    const parts = md.split(/\n##\s+/);

    const chunks = [];

    for (let i = 0; i < parts.length; i++) {

      const text = i === 0 ? parts[i] : "## " + parts[i];

      const trimmed = text.trim();

      if (trimmed) chunks.push(trimmed);

    }

    console.log("Chunks:", chunks.length);

    console.log("First chunk preview:\n", chunks[0]?.slice(0, 300));

Conclusion

----------

*   Markdown is usually selected when readable structure helps.

*   Plain text is usually selected when simplicity and lower overhead are more important than structure.

*   For many RAG pipelines, plain text is used for embeddings and Markdown is used for human review outputs.

If the decision is really about tables, CSV should be compared in [Markdown vs CSV](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts).

----
url: https://webcrawlerapi.com/changelog/2025-09-25-september-updates
----

September 25, 2025

Sep '25 Updates

===============

*   **Google News feed scraping**: Fixed and now working reliably

*   **New Integration Page**: Added to dashboard showing all available WebCrawlerAPI integrations including code, no-code, and storage options

*   **Webhook status tracking**: Now visible in dashboard with "Resend" button to retry failed webhook deliveries

*   **Infrastructure optimizations**: Enhanced crawling and scraping performance

*   **WWW handling**: Websites with and without www subdomain are now processed correctly

*   **Model upgrade**: Switched to google/gemini-2.5-flash-lite for prompt processing - significantly faster than OpenAI models

*   **Main content extraction**: New parameter to extract only useful text from webpages, perfect for blog posts and articles

*   **Milestone achieved**: WebCrawlerAPI crossed 750K total crawled pages this month

----
url: https://webcrawlerapi.com/glossary/webcrawling/what-is-crawl-budget
----

### Answer

Crawl budget is the number of pages a crawler can fetch within time and resource constraints. It is limited by your crawler capacity and by how much load the target site can handle. Budgets help keep crawling predictable and prevent overloading servers. They also prioritize which pages matter most for your use case. A good budget considers page importance, update frequency, and error rates. Tuning the budget improves coverage without causing blocks.
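
One way to spend a fixed budget is to score pages and fetch the highest-scoring ones first. The scoring formula and the field names (`importance`, `update_frequency`, `error_rate`, each in the range 0 to 1) are assumptions made for this sketch.

```python
import heapq


def plan_crawl(pages: list[dict], budget: int) -> list[str]:
    """Return up to `budget` URLs, highest priority score first."""
    scored = (
        (
            page["importance"] * page["update_frequency"] * (1 - page["error_rate"]),
            page["url"],
        )
        for page in pages
    )
    return [url for _, url in heapq.nlargest(budget, scored)]


pages = [
    {"url": "/pricing", "importance": 0.9, "update_frequency": 0.5, "error_rate": 0.0},
    {"url": "/blog/old-post", "importance": 0.2, "update_frequency": 0.1, "error_rate": 0.0},
    {"url": "/docs", "importance": 0.8, "update_frequency": 0.8, "error_rate": 0.1},
]
print(plan_crawl(pages, budget=2))
```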

----
url: https://webcrawlerapi.com/changelog/2025-11-25-november-updates
----

November 25, 2025

Nov '25 Updates

===============

*   **Max depth parameter**: New max\_depth parameter for [crawling](https://webcrawlerapi.com/docs/api/crawl) to control link following depth

*   **Usage endpoint**: Admin endpoint to retrieve usage statistics and metrics with date range filtering and optional daily breakdown ([documentation](https://webcrawlerapi.com/docs/api/usage))

*   **Microsoft login fix**: Fixed authentication issues with Microsoft/Outlook email accounts

*   **Documentation redesign**: New [documentation design](https://webcrawlerapi.com/docs) with copy-to-markdown feature for easy sharing with AI assistants

*   **PDF parsing improvements**: Enhanced PDF content extraction and parsing

*   **CSV/TSV file parsing**: Added support for parsing CSV and TSV files directly from URLs

----
url: https://webcrawlerapi.com/docs/async-requests
----

Async and Webhooks

==================

How to do async requests with Webcrawler API

All requests in Webcrawler API are asynchronous by default.

[What is an asynchronous request?](#what-is-asynchronous-request)

--------------------------------------------------------------

Asynchronous communication means that the client (you) does not have to wait for the server to finish processing a request. Instead, there are two ways to get the result:

1.  The server will notify the client once the task is completed **via Webhook**.

2.  The client can check the status of the task **via API call**.

### [Why use asynchronous requests?](#why-use-asynchronous-requests)

Crawling and scraping jobs can take a long time to complete. With asynchronous requests, the client can continue with other tasks without waiting for the server's response.

[Using webhooks](#using-webhooks)

---------------------------------

Webhooks let WebcrawlerAPI deliver the results of a job to your URL as a POST body. To use a webhook, add a `webhook_url` parameter to the request. The server sends a POST request to that URL once the task is completed.

Request example:

    {

        "url": "https://stripe.com/",

        "webhook_url": "https://yourserver.com/webhook"

    }

Once the job is completed, the server will send a POST request to the provided URL with the payload:

    {

        "id": "b1b1b1b1-b1b1-b1b1-b1b1-b1b1b1b1b1b1",

        ...

    }
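
On the receiving side, a webhook endpoint only needs to accept the POST and read the job `id`. A minimal standard-library sketch: the handler and helper names are illustrative, no fields beyond `id` are assumed, and the server is not started here.

```python
import json
from http.server import BaseHTTPRequestHandler


def parse_webhook(body: bytes) -> str:
    """Return the job id from a webhook POST body."""
    payload = json.loads(body)
    return payload["id"]


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            job_id = parse_webhook(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a body without an "id" field.
            self.send_response(400)
            self.end_headers()
            return
        print("job finished:", job_id)
        self.send_response(200)
        self.end_headers()


# To serve it locally:
# from http.server import HTTPServer
# HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()
```

Responding quickly and doing heavy processing afterwards keeps the sender from timing out.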

[Using API calls](#using-api-calls)

-----------------------------------

To check the status of the **crawling** job, you can use the following API call:

    curl --request GET \

      --url https://api.webcrawlerapi.com/v1/job/b1b1b1b1-b1b1-b1b1-b1b1-b1b1b1b1b1b1 \

      --header 'Authorization: Bearer <YOUR API TOKEN HERE>'

The response contains the job info and the job status:

    {

        "id": "b1b1b1b1-b1b1-b1b1-b1b1-b1b1b1b1b1b1",

        "status": "done",

        ...

    }
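The same status check can be wrapped in a polling loop. The sketch below uses the endpoint and the `"done"` status shown above; the function name `waitForJob`, the polling interval, the attempt limit, and the injectable `fetchImpl` are assumptions made for illustration. Other status values (for example, failure states) are not listed here, so the loop simply keeps polling until it sees `"done"` or runs out of attempts.

```javascript
// Poll GET /v1/job/{id} until the job reports "done" (Node 18+ for global fetch).
// `intervalMs`, `maxAttempts`, and `fetchImpl` are illustrative parameters.
async function waitForJob(jobId, apiToken, options = {}) {
  const { fetchImpl = fetch, intervalMs = 2000, maxAttempts = 30 } = options;
  const url = `https://api.webcrawlerapi.com/v1/job/${jobId}`;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(url, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    if (!response.ok) {
      throw new Error(`Status check failed with HTTP ${response.status}`);
    }
    const job = await response.json();
    if (job.status === "done") return job;

    // Not finished yet: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} checks`);
}
```

Polling is the simpler option when you cannot expose a public URL for webhooks; the fixed interval trades a little latency for fewer API calls.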

[Asynchronous Scraping](#asynchronous-scraping)

-----------------------------------------------

For single page scraping, you can also use asynchronous mode with the `/v2/scrape` endpoint by adding the `async=true` query parameter. This allows you to get a job ID immediately and then retrieve the result using the GET scrape API method.

### [Making an async scrape request](#making-an-async-scrape-request)

    curl --request POST \

      --url 'https://api.webcrawlerapi.com/v2/scrape?async=true' \

      --header 'Authorization: Bearer <YOUR API TOKEN HERE>' \

      --header 'Content-Type: application/json' \

      --data '{

        "url": "https://webcrawlerapi.com"

      }'

This will immediately return a job ID:

    {

        "id": "b1b1b1b1-b1b1-b1b1-b1b1-b1b1b1b1b1b1"

    }

### [Retrieving the scrape result](#retrieving-the-scrape-result)

Once you have the job ID, you can check the status and retrieve the result using the GET scrape endpoint:

    curl --request GET \

      --url https://api.webcrawlerapi.com/v2/scrape/b1b1b1b1-b1b1-b1b1-b1b1-b1b1b1b1b1b1 \

      --header 'Authorization: Bearer <YOUR API TOKEN HERE>'

The response will contain the scraping result or status:

    {

        "success": true,

        "status": "done",

        "markdown": "# Example Page\n\nContent here...",

        "page_status_code": 200,

        "page_title": "Example Page Title"

    }

If the job is still processing, you'll get:

    {

        "status": "pending"

    }
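Both steps can be combined in one helper. This is a sketch under a few assumptions: it relies only on the `id` and `status` fields shown above, treats any status other than `"pending"` as final, and uses made-up names (`scrapeAsync`, `fetchImpl`, `intervalMs`, `maxAttempts`).

```javascript
// Submit an async scrape, then poll GET /v2/scrape/{id} while it is "pending".
// Node 18+ for global fetch; all option names are illustrative.
async function scrapeAsync(pageUrl, apiToken, options = {}) {
  const { fetchImpl = fetch, intervalMs = 1000, maxAttempts = 60 } = options;
  const base = "https://api.webcrawlerapi.com/v2/scrape";
  const headers = {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  };

  // Step 1: submit the job; the response carries the job ID.
  const submitted = await fetchImpl(`${base}?async=true`, {
    method: "POST",
    headers,
    body: JSON.stringify({ url: pageUrl }),
  });
  if (!submitted.ok) {
    throw new Error(`Submit failed with HTTP ${submitted.status}`);
  }
  const { id } = await submitted.json();

  // Step 2: poll until the status is no longer "pending".
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(`${base}/${id}`, { headers });
    if (!response.ok) {
      throw new Error(`Result check failed with HTTP ${response.status}`);
    }
    const result = await response.json();
    if (result.status !== "pending") return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Scrape ${id} is still pending after ${maxAttempts} checks`);
}
```

A call such as `scrapeAsync("https://webcrawlerapi.com", token)` would resolve with the result object, from which `markdown` and `page_status_code` can be read.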


----
url: https://webcrawlerapi.com/blog/5-famous-web-scraping-court-cases-where-scrapers-won
----

Legal, Web Scraping, Web Crawling

5 Famous Web Scraping Court Cases Where Scrapers Won

====================================================

Five well-known court cases that favored scraping/crawling (or narrowed anti-scraping theories), plus practical takeaways on public data, CFAA, copyright, and EU database rights.

Written by Andrew

Published on Feb 8, 2026

### Table of Contents

*   [5 Famous Web Scraping Court Cases Where Scrapers Won](#5-famous-web-scraping-court-cases-where-scrapers-won)

*   [1) hiQ Labs, Inc. v. LinkedIn Corp. (9th Cir., 2019; reaffirmed 2022)](#1-hiq-labs-inc-v-linkedin-corp-9th-cir-2019-reaffirmed-2022)

*   [2) Van Buren v. United States (U.S. Supreme Court, 2021)](#2-van-buren-v-united-states-us-supreme-court-2021)

*   [3) Perfect 10, Inc. v. Amazon.com, Inc. (Google Image Search) (9th Cir., 2007)](#3-perfect-10-inc-v-amazoncom-inc-google-image-search-9th-cir-2007)

*   [4) British Horseracing Board Ltd v. William Hill (CJEU, 2004) (C-203/02)](#4-british-horseracing-board-ltd-v-william-hill-cjeu-2004-c-20302)

*   [5) Fixtures Marketing Ltd v. OPAP (CJEU, 2004) (C-444/02)](#5-fixtures-marketing-ltd-v-opap-cjeu-2004-c-44402)


5 Famous Web Scraping Court Cases Where Scrapers Won

====================================================

Web scraping is not automatically legal or illegal.

Legality depends on what you access (public vs. behind login), what you copy (facts vs. creative expression), how you access it (respecting access controls and technical measures), what the site's Terms of Service say (contract), and which jurisdiction applies (US vs. EU/UK rules can differ a lot).

If you want a practical baseline before reading case law, start with ethics and operational "be polite" behavior:

*   [Web Scraping Ethics: What is legal and what is not?](/blog/web-scraping-ethics)

*   [How to Build a Web Crawler](/blog/how-to-build-a-web-crawler)

This post is not legal advice.

1) hiQ Labs, Inc. v. LinkedIn Corp. (9th Cir., 2019; reaffirmed 2022)

---------------------------------------------------------------------

**Case:** _hiQ Labs, Inc. v. LinkedIn Corp._, U.S. Court of Appeals for the Ninth Circuit.

*   2019 opinion (PDF): [https://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf](https://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf)

*   2022 opinion (PDF): [https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf](https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf)

**What was being scraped:** hiQ collected data from public LinkedIn profile pages to power analytics products.

**Claims raised:** The fight centered on the Computer Fraud and Abuse Act (CFAA) and whether scraping public pages becomes "without authorization" after LinkedIn objects and tries to block access.

**What the court held (plain English):** Public web pages are treated differently from breaking into a protected system. The Ninth Circuit was skeptical of turning "we sent you a cease-and-desist" into a CFAA "hacking" claim for genuinely public pages.

**Why it mattered:** In the US, CFAA threats are a common anti-scraping strategy. This case made that strategy harder (at least for public pages, in this circuit, on these facts).

**Practical takeaway:** Public vs. gated access is the first branch in your legal risk tree. The moment you scrape behind login/paywalls, or bypass technical controls, the risk profile changes sharply.

2) Van Buren v. United States (U.S. Supreme Court, 2021)

--------------------------------------------------------

**Case:** _Van Buren v. United States_, Supreme Court of the United States.

*   Opinion (PDF): [https://www.supremecourt.gov/opinions/20pdf/19-783\_k53l.pdf](https://www.supremecourt.gov/opinions/20pdf/19-783_k53l.pdf)

**Why it is scraping-relevant:** This is not a "web scraping" fact pattern, but it is a major CFAA interpretation decision. Many scraping disputes try to reframe policy or purpose violations as CFAA claims.

**What the Court held (plain English):** The Court read "exceeds authorized access" narrowly. Having legitimate access and then using it for an improper purpose is not automatically a CFAA violation; the concept focuses more on crossing access boundaries (getting into parts you are not entitled to access).

**Why it mattered:** It reduced the risk that "you violated a policy/ToS" turns into a federal computer crime theory by itself.

**Practical takeaway:** A ToS violation may still create contract risk, but it is less likely (by itself) to be treated as CFAA "hacking" in many contexts. Authentication bypass and defeating access controls remain high risk.

3) Perfect 10, Inc. v. Amazon.com, Inc. (Google Image Search) (9th Cir., 2007)

------------------------------------------------------------------------------

**Case:** _Perfect 10, Inc. v. Amazon.com, Inc._, U.S. Court of Appeals for the Ninth Circuit.

*   Opinion (PDF): [https://cases.justia.com/federal/appellate-courts/ca9/06-55405/0655405-2011-02-25.pdf](https://cases.justia.com/federal/appellate-courts/ca9/06-55405/0655405-2011-02-25.pdf)

**What was being crawled:** Google's systems crawled and indexed images, and displayed thumbnails in image search results.

**Claims raised:** Copyright infringement claims against a large-scale crawler/search product.

**What the court held (plain English):** The court treated thumbnails in a search/discovery context as strongly transformative and found fair use on key points. (This is not a blank check to republish full-size images or entire works.)

**Why it mattered:** It validated an important crawler pattern: copy only what is necessary to index/search, show reduced representations (snippets/thumbnails), and route users to the source.

**Practical takeaway:** The closer your scraped output is to being a substitute for the original (full articles, full-resolution images, full databases), the higher your IP risk. Snippets and indexing use cases are easier to defend than wholesale republication.

4) British Horseracing Board Ltd v. William Hill (CJEU, 2004) (C-203/02)

------------------------------------------------------------------------

**Case:** _The British Horseracing Board Ltd and Others v William Hill Organization Ltd_, Court of Justice of the European Union (Grand Chamber).

*   Judgment (EUR-Lex): [https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:62002CJ0203](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:62002CJ0203)

**What was being "extracted":** Racing data was used by a betting operator; the dispute focused on the EU Database Directive's "sui generis" database right.

**Claims raised:** "Extraction" / "re-utilisation" of substantial parts of a database, and repeated extraction of insubstantial parts.

**What the court held (plain English):** Database-right protection was narrowed in important ways. Investment in creating the underlying data was not automatically treated as investment in obtaining the contents for database-right protection.

**Why it mattered:** In the EU/UK, a lot of scraping conflicts are really database-right conflicts. This line of cases limited how far database-right claims can reach for certain fact-heavy datasets.

**Practical takeaway:** In EU/UK, treat database right as a first-class risk category. Avoid rebuilding someone else's database, minimize volume, and be explicit about what fields you take and why.

5) Fixtures Marketing Ltd v. OPAP (CJEU, 2004) (C-444/02)

---------------------------------------------------------

**Case:** _Fixtures Marketing Ltd v Organismos prognostikon agonon podosfairou AE (OPAP)_, Court of Justice of the European Union (Grand Chamber).

*   Judgment (EUR-Lex): [https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:62002CJ0444](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:62002CJ0444)

**What was being "extracted":** Football fixture lists (structured facts: dates/teams) used for betting products.

**Claims raised:** EU Database Directive "sui generis" database right.

**What the court held (plain English):** A fixtures list can qualify as a "database," but protection is not automatic. Effort spent creating the underlying schedule is not the same as protected investment in obtaining/verifying/presenting existing independent materials.

**Why it mattered:** It reinforced that database right is not simply a reward for creating facts/events.

**Practical takeaway:** Even where database right is weak, contract/ToS and technical access restrictions can still create real risk. Legal safety is a bundle, not a single doctrine.

Closing: what these cases do (and do not) mean

----------------------------------------------

These cases do not mean "scraping is always legal." They show that legality depends on context.

Key factors that tend to decide outcomes:

*   Public vs. gated access (login, paywall, auth boundaries)

*   What is copied (facts/fields vs. protected expression)

*   How much is copied (substantial parts; systematic rebuilding)

*   Technical measures (blocks, challenges, circumvention)

*   Terms/permissions (ToS, robots.txt, licenses, notices)

*   Jurisdiction (US CFAA/copyright/contract vs. EU/UK database right and related doctrines)

If you want scraping to be sustainable, build compliance into your crawler from day one: scope limits, rate limiting, backoff, and clear rules about what content you store and republish.
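As a rough illustration of those operational controls, the sketch below combines a host allow-list (scope limit) with exponential backoff on HTTP 429 and 5xx responses. Every name and number in it (`inScope`, `backoffDelayMs`, `politeFetch`, the 1-second base delay, the retry limit) is an assumption chosen for the example, not a recommendation from the cases above.

```javascript
// Scope limit: only fetch URLs whose host is on an explicit allow-list.
function inScope(url, allowedHosts) {
  return allowedHosts.includes(new URL(url).hostname);
}

// Exponential backoff: 1s, 2s, 4s, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch a URL, backing off and retrying when the server signals overload.
async function politeFetch(url, options = {}) {
  const { fetchImpl = fetch, allowedHosts = [], maxRetries = 4, baseMs = 1000, sleep = defaultSleep } = options;
  if (!inScope(url, allowedHosts)) {
    throw new Error(`Out of scope: ${url}`);
  }
  for (let attempt = 0; ; attempt++) {
    const response = await fetchImpl(url);
    const retryable = response.status === 429 || response.status >= 500;
    // Return on success, on non-retryable errors, or when retries are used up.
    if (!retryable || attempt >= maxRetries) return response;
    await sleep(backoffDelayMs(attempt, baseMs));
  }
}
```

Keeping the allow-list explicit makes the crawl's scope reviewable, which is useful when someone later asks exactly what was accessed.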

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-typescript-target-should-i-use-to-compile-puppeteer-core-types
----

To fix the TS18028 error, set the TypeScript target to ES2015 or higher. The error occurs because private identifiers (#) require ES2015+ emitted code. If esnext alone doesn't solve it, ensure you're using the right tsconfig and a recent TypeScript version.

Example tsconfig.json:

    {

      "compilerOptions": {

        "target": "ES2015",

        "module": "ESNext",

        "lib": ["ES2015", "DOM"],

        "strict": true,

        "skipLibCheck": true

      }

    }

Notes:

*   Some projects document the required target in tsconfig.base.json; ensure your project uses at least ES2015.

*   Install a recent TypeScript version locally and regenerate the config if needed:

    npm install -D typescript

    npx tsc --init

----
url: https://webcrawlerapi.com/glossary/webcrawling/how-is-web-crawling-different-from-web-scraping
----

### Answer

Web crawling focuses on discovering and retrieving pages, while web scraping extracts specific data from those pages. Crawling answers “what pages exist,” and scraping answers “what information is on those pages.” In practice, you often crawl first to build the URL list, then scrape the fields you care about. Crawling deals with link discovery, deduplication, and scheduling. Scraping deals with page parsing, selectors, and data validation. Both are usually combined in a full data pipeline.
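The split can be shown with a toy example. The sketch below uses a small in-memory `site` object in place of real HTTP requests (an assumption made to keep it self-contained): `crawl` discovers URLs by following links and deduplicating them, and `scrape` then extracts one field from each discovered page.

```javascript
// A tiny in-memory "site" stands in for real pages (illustrative data only).
const site = {
  "/": { links: ["/a", "/b"], title: "Home" },
  "/a": { links: ["/", "/b"], title: "Page A" },
  "/b": { links: ["/a"], title: "Page B" },
};

// Crawling: breadth-first link discovery with deduplication via a Set.
function crawl(startUrl, pages) {
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  while (queue.length > 0) {
    const url = queue.shift();
    for (const link of pages[url]?.links ?? []) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return [...seen];
}

// Scraping: pull a specific field out of each page that was discovered.
function scrape(urls, pages) {
  return urls.map((url) => ({ url, title: pages[url]?.title ?? null }));
}

const urls = crawl("/", site);
const records = scrape(urls, site);
```

Here `urls` answers "what pages exist" and `records` answers "what information is on those pages".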

----
url: https://webcrawlerapi.com/glossary/scraping/what-is-the-best-data-format-for-scraped-data
----

### Answer

The best format depends on how you plan to use the data. CSV is simple and works well for tabular data and quick analysis. JSON handles nested structures and is easy to pass between services. Parquet is efficient for large datasets and analytics. Choose a format that fits your storage, query, and downstream tooling.
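To make the CSV versus JSON trade-off concrete, the sketch below serializes the same records both ways. The `toCsv` helper and the sample record are made up for illustration; Parquet is left out because it needs a dedicated library.

```javascript
// Serialize flat columns to CSV, quoting values that contain commas,
// quotes, or newlines.
function toCsv(records, columns) {
  const escape = (value) => {
    const text = String(value ?? "");
    return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const rows = records.map((record) => columns.map((c) => escape(record[c])).join(","));
  return [columns.join(","), ...rows].join("\n");
}

const scraped = [
  { title: "Widget, large", price: 19.99, tags: ["tools", "sale"] },
];

// CSV keeps the flat, tabular columns only.
const asCsv = toCsv(scraped, ["title", "price"]);

// JSON keeps the nested `tags` array as well.
const asJson = JSON.stringify(scraped);
```

The nested `tags` field survives in JSON but has no natural home in a CSV row, which is usually the deciding factor between the two.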

----
url: https://webcrawlerapi.com/glossary?category=webcrawling
----

Glossary

Web Scraping & API Glossary

===========================

Comprehensive glossary of web scraping, crawling, and API terms. Learn the essential concepts and terminology used in web data extraction.


----
url: https://webcrawlerapi.com
----

Turn websites into LLM data

---------------------------

Webcrawling and data scraping API for RAG and LLM

=================================================

[Start crawling](https://dash.webcrawlerapi.com)

> Web crawling has never been so easy

> 

> Make a simple API call with URL and receive every page content formatted and ready for RAG or LLM context

Trusted by 1000+ developers at...

---------------------------------

*   [aiflowchat.com](https://aiflowchat.com)

*   [controlhippo.com](https://controlhippo.com)

*   [unicap.ai](https://unicap.ai)

*   [usefulai.co.uk](https://usefulai.co.uk)

*   [workmatte.ai](https://workmatte.ai)

*   [grugnotes.com](https://grugnotes.com)

### Integrate in 60 seconds


    // npm i webcrawlerapi-js
    import webcrawlerapi from "webcrawlerapi-js";

    async function main() {
        const client = new webcrawlerapi.WebcrawlerClient(
            "YOUR API ACCESS KEY HERE",
        );
        const syncJob = await client.crawl({
            "items_limit": 10,
            "url": "https://stripe.com/",
            "scrape_type": "markdown"
        });
        console.log(syncJob);
    }

    main().catch(console.error);

[Get Your API Access Key](https://dash.webcrawlerapi.com)

Make your RAG better

Everything you need to build your RAG

-------------------------------------

You give us a link, and we give you content for your RAG or LLM context.

### Main content in Markdown

Extract the main content only from any website or page in Markdown or Text format. Perfect for RAG or LLM.

### Production-ready crawling

We handle everything for you: proxies, unblockers, retries, browsers, CAPTCHAs, anti-bot protection, JS and more.

### Accurate content parsing

Accurate content parsing that just works. Focus on your product, not web crawling and scraping.

### Fast support

No AI chatbots. Real humans, real engineers, real support. We are here to help you.

*   **100+** developers using it every day

*   **91%** success rate

*   **9s** average crawling time

Without writing a line of code

No-code integrations

--------------------

Quickly integrate web crawling into your workflows using popular no-code platforms.

*   Zapier

*   Make

*   n8n.io

*   Integrately

[Integrate Without Code](https://dash.webcrawlerapi.com)

Simple, transparent pricing

---------------------------

Start with pay-per-request, or save with a monthly subscription. Top-up credits are always available when your included allowance runs out.

### Pay As You Go

No commitment

From $0.002 / page

*   Unlimited proxy included

*   Up to 5 parallel requests

*   Pay only for successful requests

*   Content cleaning included

*   Run prompts over content for an extra $0.002

[Try for free](https://dash.webcrawlerapi.com)

### Standard

Best for growing teams

Save 25%

$99/month

*   From $0.0015 / page

*   Unlimited proxy included

*   Up to 50 parallel requests

*   Pay only for successful requests

*   Content cleaning included

*   Run prompts over content for an extra $0.002

[Try for free](https://dash.webcrawlerapi.com/billing)

### Scale

For high-volume crawling

Save 50%

$499/month

*   From $0.001 / page

*   Unlimited proxy included

*   Up to 50 parallel requests

*   Pay only for successful requests

*   Content cleaning included

*   Run prompts over content for an extra $0.002

[Try for free](https://dash.webcrawlerapi.com/billing)

Need more than 1M pages/month? Contact us for enterprise pricing.

Frequently Asked Questions

--------------------------

Everything you need to know about our web crawling service

### What is WebcrawlerAPI?

### Can I crawl only specific pages, or the whole website?

### Can I use crawled data in RAG or train my own AI model?

### Do I need to pay a subscription to use WebcrawlerAPI?

### Can I try WebcrawlerAPI before purchasing?

### What if I need help with integration?

----
url: https://webcrawlerapi.com/blog/yaml-vs-plain-text-choosing-the-right-format-for-llm-prompts
----

Comparison, YAML, RAG

YAML vs Plain Text: Choosing the Right Format for LLM Prompts

=============================================================

YAML vs plain text for prompt data and scraping workflows: when structured manifests help and when raw text is the safer choice.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What YAML is good at](#what-yaml-is-good-at)

*   [What plain text is good at](#what-plain-text-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When YAML should be used](#when-yaml-should-be-used)

*   [When plain text should be used](#when-plain-text-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [YAML is not ideal for large generated datasets](#yaml-is-not-ideal-for-large-generated-datasets)

*   [Plain text makes structured QA difficult](#plain-text-makes-structured-qa-difficult)

*   [Node.js snippet: Combine YAML-like config with plain text content](#nodejs-snippet-combine-yaml-like-config-with-plain-text-content)

*   [Conclusion](#conclusion)


YAML and plain text are often used at different stages. YAML is usually used for structured manifests and small records. Plain text is usually used for page content and embeddings.

A broader overview is available in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | YAML | Plain Text |
| --- | --- | --- |
| Best for | Config-like data and manifests | Raw content and simple outputs |
| Parsing reliability | Medium (indentation matters) | Low (no structure) |
| Human readability | High | High |
| RAG fit | Good for metadata | Good for content |
| Common failure | Indentation and implicit types | Missing boundaries and ambiguity |

What YAML is good at

--------------------

YAML is usually selected when:

*   A job manifest is being created (rules, filters, selectors)

*   Humans will tweak values

*   Nested config is needed and comments matter

If strict parsing is required, JSON can be preferred, as covered in [JSON vs YAML](/blog/json-vs-yaml-choosing-the-right-format-for-llm-prompts).

What plain text is good at

--------------------------

Plain text is usually selected when:

*   The focus is on content, not fields

*   Embeddings will be created for RAG

*   Formatting should be minimized

If structure is helpful for chunking, Markdown can be compared in [Markdown vs Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When YAML should be used

YAML is usually preferred when:

*   Extraction rules are being passed between humans

*   A small record is being stored, and a schema is not enforced

*   Comments are needed to explain choices

### When plain text should be used

Plain text is usually preferred when:

*   The goal is search and retrieval over page content

*   Chunking will be done later

*   The output must be resilient to minor formatting issues

If the output is coming from HTML, the "raw vs cleaned" decision is covered in [HTML vs Cleaned Text](/blog/html-vs-cleaned-text-choosing-the-right-output-format).

Practical tradeoffs

-------------------

### YAML is not ideal for large generated datasets

If thousands of YAML records are emitted by a model, indentation mistakes and typing surprises become frequent. JSON or CSV is usually safer at that scale.

### Plain text makes structured QA difficult

If a "price" field is required, plain text alone can make validation hard. JSON can be compared in [JSON vs Plain Text](/blog/json-vs-plain-text-choosing-the-right-format-for-llm-prompts).
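The difference shows up as soon as validation is attempted. In the sketch below, `validateRecord` checks a JSON record's `price` field directly, while `guessPriceFromText` has to fall back on a regular-expression heuristic; both function names and the sample strings are assumptions made for the example.

```javascript
// Structured record: the field either exists with the right type or it does not.
function validateRecord(jsonText) {
  let record;
  try {
    record = JSON.parse(jsonText);
  } catch {
    return { ok: false, reason: "not valid JSON" };
  }
  if (record === null || typeof record !== "object") {
    return { ok: false, reason: "not an object" };
  }
  const { price } = record;
  if (typeof price !== "number" || !Number.isFinite(price) || price < 0) {
    return { ok: false, reason: "price missing or not a non-negative number" };
  }
  return { ok: true, price };
}

// Plain text: a heuristic that takes the first "$<number>" it finds,
// which is ambiguous when several prices appear in the text.
function guessPriceFromText(text) {
  const match = text.match(/\$(\d+(?:\.\d{1,2})?)/);
  return match ? Number(match[1]) : null;
}
```

For the text "Was $25.00, now $19.99" the heuristic picks 25, the old price, whereas a record with an explicit `price` field leaves no room for that kind of mistake.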

Node.js snippet: Combine YAML-like config with plain text content

-----------------------------------------------------------------

A common pattern is: a config is kept in YAML and content is kept as plain text, then both are wrapped into a JSON record for ingestion.

    // Node 18+

    // Wrap plain text content with a config object.

    const config = {

      extract: ["title", "author", "date"],

      language: "en",

    };

    const content = "Long page text goes here...";

    const record = { config, content };

    console.log(JSON.stringify(record, null, 2));

Conclusion

----------

*   YAML is usually selected for human-edited manifests and config-like data.

*   Plain text is usually selected for content-first outputs and embeddings.

*   In crawling and RAG pipelines, YAML often describes what should be extracted, while plain text carries the actual page content.

If a tabular export is needed, [YAML vs CSV](/blog/yaml-vs-csv-choosing-the-right-format-for-llm-prompts) can be compared too.

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-target-should-i-set-for-typescript-to-build-puppeteer-types
----

If you run into TS18028 (private identifiers) errors when compiling Puppeteer types with TypeScript, set the TypeScript target to ES2015 or newer; the error is raised only for targets below ES2015. The snippet below uses ES2020. Example tsconfig snippet:

    {

      "compilerOptions": {

        "target": "ES2020",

        "module": "ESNext",

        "lib": ["dom", "es2020"],

        "strict": true

      }

    }

Notes:

*   Use a TypeScript version that supports private identifiers (TypeScript 3.8 or later).

*   If your environment requires older targets, you may need to avoid using private class fields or adjust downlevel.

----
url: https://webcrawlerapi.com/glossary/puppeteer/why-is-page-goto-never-awaited-for-firefox-addons-page
----

### Summary

Firefox addon pages navigated via moz-extension:// are treated as webextension contexts. Puppeteer currently does not emit the usual navigation events (like domContentLoaded or load) for such pages, so page.goto may not resolve as expected in this context.

### Why this happens

*   Firefox treats moz-extension pages as webextension contexts. In this scenario, Puppeteer does not start modules that emit the load/domContentLoaded events for these pages.

*   While some navigation events may surface (e.g., navigationCommitted), the key DOM load events are not reliably emitted for addon pages, which leads to the navigation promise not resolving in the expected way.

*   This is a Firefox behavior/limitation rather than a straightforward Puppeteer bug fix.

### Practical guidance

*   Do not rely on domContentLoaded or load to determine that a moz-extension navigation is complete.

*   Instead, proceed after a short fixed delay or wait for a known UI element within the addon page to become available before interacting.

*   If you need to interact with UI, target known selectors directly rather than waiting for navigation lifecycle events.

### Minimal workaround (avoid waiting for dom loaded)

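    // Skip waiting for DOM load events when testing Firefox addon pages.
    // Helper: resolve after the given number of seconds.
    const sleep = (seconds) => new Promise((resolve) => setTimeout(resolve, seconds * 1000))

    const internalUUID = await readInternalUuidSomehow() // your logic to obtain the addon's internal UUID

    // Deliberately not awaited: the navigation promise may never resolve here.
    page.goto(`moz-extension://${internalUUID}/static/app.html`).catch(() => {})

    await sleep(0.5) // short fixed delay instead of relying on domContentLoaded

    await page.focus('#my-input')

    // Typing works once the input is focused.
    await page.keyboard.type('foobar')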
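The other option from the practical guidance, waiting for a known UI element instead of a fixed delay, could be sketched as follows. `openAddonPage` is a hypothetical helper, `page` is assumed to be a Puppeteer `Page`, and the sketch assumes `page.waitForSelector` works in the addon context, which the text above suggests but does not guarantee.

```javascript
// Open an addon page and decide readiness from the page's own UI rather than
// from load events. `readySelector` and `timeoutMs` are illustrative.
async function openAddonPage(page, addonUrl, readySelector, timeoutMs = 5000) {
  // Start navigation without awaiting it: the promise may never resolve for
  // moz-extension:// pages. Swallow a late rejection so it is not unhandled.
  page.goto(addonUrl).catch(() => {});

  // Wait until a known element exists; this throws if it does not appear
  // within timeoutMs.
  await page.waitForSelector(readySelector, { timeout: timeoutMs });
}
```

Compared with a fixed sleep, this returns as soon as the element exists and fails loudly if it never appears.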

### Notes

*   This limitation is tied to how Firefox handles addon contexts and is not a Puppeteer feature toggle. See discussions around webextension contexts and navigation events for moz-extension pages for context.

----
url: https://webcrawlerapi.com/glossary/scraping/how-do-you-scrape-javascript-heavy-sites
----

### Answer

Use a headless browser to render the page before extracting data. Wait for key selectors to appear or for network activity to settle. When possible, intercept API calls and parse JSON directly. Rendering is slower, so keep concurrency low and cache results. This approach captures the same data a user sees in the browser.
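A minimal sketch of the render-then-extract step is shown below. It takes a `page` object that is assumed to behave like a Puppeteer `Page` (`goto`, `waitForSelector`, `$eval`); the function name `scrapeRendered`, the selector, and the timeout are illustrative.

```javascript
// Navigate, wait for a key selector to appear, then read its text content.
async function scrapeRendered(page, url, selector, timeoutMs = 15000) {
  await page.goto(url, { waitUntil: "domcontentloaded", timeout: timeoutMs });

  // Client-side rendering may finish after the initial load, so wait for the
  // element that carries the data rather than for the load event alone.
  await page.waitForSelector(selector, { timeout: timeoutMs });

  return page.$eval(selector, (el) => el.textContent.trim());
}

// With Puppeteer installed, usage could look like this:
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const title = await scrapeRendered(page, "https://example.com", "h1");
//   await browser.close();
```

Waiting on a selector is usually more predictable than waiting for network idle, since pages with polling or analytics may never go fully idle.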

----
url: https://webcrawlerapi.com/blog/what-is-llm
----

Technical, 3 min read

What is an llms.txt File?

=========================

Learn about llms.txt files, a standard way to document AI models used in your projects, promoting transparency and trust in AI-powered applications.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Why Use an llms.txt File?](#why-use-an-llmstxt-file)

*   [How Does It Work?](#how-does-it-work)

*   [Format of an llms.txt File](#format-of-an-llmstxt-file)

*   [Origin of the llms.txt Format](#origin-of-the-llmstxt-format)

*   [Easily Generate Your Own llms.txt File](#easily-generate-your-own-llmstxt-file)

*   [Who Should Use llms.txt?](#who-should-use-llmstxt)

*   [Conclusion](#conclusion)


An llms.txt file is a simple text file that lists information about Large Language Models (LLMs) used in projects or applications. It is similar to other common files you might have seen, such as robots.txt, but its purpose is to clearly show details about the AI models involved.

Why Use an llms.txt File?

-------------------------

When you're building an app or a service that uses artificial intelligence, it is important to be clear about what AI technology is being used. The llms.txt file helps users, developers, and even regulators easily see:

*   Which LLMs (like ChatGPT, GPT-4, or any other model) your app uses.

*   The provider or company behind each AI model.

*   Basic details like version numbers, licenses, or usage guidelines.

This transparency makes your project more credible and trustworthy, as users can better understand the technology behind your software.

How Does It Work?

-----------------

An llms.txt file is placed at the root of your domain, typically accessible through a URL like https://example.com/llms.txt. Users and developers can easily view this file to quickly understand the AI models your project uses. It serves as a standard way of providing clear, structured information about your AI resources.
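
Because the file always sits at the root of the domain, its location can be derived from any page URL on the same site. A minimal sketch using only the Python standard library; `llms_txt_url` is a hypothetical helper name chosen for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def llms_txt_url(page_url: str) -> str:
    """Return the root-level llms.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/llms.txt", "", ""))


print(llms_txt_url("https://example.com/docs/page?x=1#top"))
```

Fetching the resulting URL with any HTTP client then returns the Markdown text, if the site publishes one.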

Format of an llms.txt File

--------------------------

According to [llmstxt.org](https://llmstxt.org/), an llms.txt file uses Markdown format instead of traditional structured formats like XML or JSON. The reason is that Markdown is easy for both humans and language models to read.

The llms.txt Markdown file should include these specific sections, in this order:

1.  **H1 Heading**: This is the title and the only mandatory section.

2.  **Blockquote**: A brief description of the project, summarizing key points.

3.  **Optional Detailed Sections**: These can include paragraphs, lists, or other markdown content providing more details about the project.

4.  **File Lists (optional)**: Defined by H2 headers, these sections contain lists of markdown links. Each link includes:

    *   `[Link Title](https://link_url)` format, optionally followed by `: Additional details`.

5.  **Optional Section**: Marked explicitly as "Optional", containing secondary information that can be skipped when less context is needed.

Here's a basic example:

    # Web crawling and data extraction | Webcrawlerapi

    > This is a collection of pages from webcrawlerapi.com, formatted for language models.

    ## Available Pages

    - [Web crawling and data extraction | Webcrawlerapi](https://webcrawlerapi.com)
    - [Scrapers Marketplace | WebcrawlerAPI](https://webcrawlerapi.com/scrapers)
    - [WebCrawling | WebcrawlerAPI docs](https://webcrawlerapi.com/docs/getting-started)
    - [Blog | Webcrawlerapi](https://webcrawlerapi.com/blog)
    - [Website to Markdown Free Tool | WebcrawlerAPI](https://webcrawlerapi.com/tools/website-to-md)
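
The section order described above is simple enough to assemble by hand. The sketch below is one possible way to do it in Python; the function name `build_llms_txt` and its argument shapes are assumptions made for illustration, not part of any tool mentioned in this article:

```python
def build_llms_txt(title, summary, sections, details=""):
    """Assemble llms.txt text: H1 title, blockquote summary, optional
    details, then one H2 section per entry in `sections`.

    `sections` maps a heading to a list of (link_title, url, note) tuples;
    an empty note omits the ": note" suffix.
    """
    lines = [f"# {title}", ""]
    if summary:
        lines += [f"> {summary}", ""]
    if details:
        lines += [details, ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for link_title, url, note in links:
            entry = f"- [{link_title}]({url})"
            if note:
                entry += f": {note}"
            lines.append(entry)
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


print(build_llms_txt(
    "Demo",
    "A demo site.",
    {"Docs": [("Start", "https://example.com/start", "intro"),
              ("API", "https://example.com/api", "")]},
))
```

Only the H1 title is mandatory, so passing an empty summary and an empty `sections` dict still produces a valid file.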

Origin of the llms.txt Format

-----------------------------

The llms.txt format is inspired by other root-level convention files like robots.txt. The specification documented at llmstxt.org was proposed in September 2024 by Jeremy Howard of Answer.AI as a way to give language models a concise, Markdown-formatted guide to a website. It is a community proposal rather than a formal standard, so adoption is voluntary.

Easily Generate Your Own llms.txt File

--------------------------------------

You don't have to create an llms.txt file by hand. For example, you can generate one with the free tool at [webcrawlerapi.com](/tools/llmstxt-generator). The tool prepares a properly formatted text file that you can download and use immediately.

Who Should Use llms.txt?

------------------------

Anyone who builds or hosts software using AI should consider adding an llms.txt file. It helps with transparency, building trust with your users, and clearly communicating the AI technology you depend on. Developers, companies, researchers, and even hobbyists can benefit from clearly listing their AI technologies.

Conclusion

----------

An llms.txt file is a simple, transparent way to share important information about the AI models your website or application uses. It promotes trust, clarity, and openness, benefiting both creators and users of AI-powered technologies. With tools available online, setting up your own llms.txt file is quick and easy.

----
url: https://webcrawlerapi.com/docs/sdk/js
----

JavaScript and TypeScript (Node.js)

===================================


Learn how to use the WebCrawler API JavaScript SDK to crawl websites and extract data.

[Installation](#installation)

-----------------------------

    npm i webcrawlerapi-js

[Usage](#usage)

---------------

### [Synchronous Crawling](#synchronous-crawling)

The synchronous method waits for the crawl to complete and returns all data at once.

    import webcrawlerapi from "webcrawlerapi-js";

    const client = new webcrawlerapi.WebcrawlerClient("YOUR_API_KEY");

    // Synchronous crawling: resolves once the whole job has finished
    const result = await client.crawl({
        "url": "https://stripe.com/",
        "scrape_type": "markdown",
        "items_limit": 10
    });

    // Print the first 100 characters of each crawled page
    for (const item of result.job_items) {
        const content = await item.getContent();
        console.log(content.slice(0, 100));
    }

    console.log(result);

### [Asynchronous Crawling](#asynchronous-crawling)

The asynchronous method returns a job ID immediately and allows you to check the status later.

    import webcrawlerapi from "webcrawlerapi-js";

    const client = new webcrawlerapi.WebcrawlerClient("YOUR_API_KEY");

    // Start the async crawl job
    const job = await client.crawlAsync({
        "url": "https://stripe.com/",
        "scrape_type": "markdown",
        "items_limit": 10
    });

    // Get the job ID
    const jobId = job.id;

    // Check job status
    let jobStatus = await client.getJob(jobId);
    console.log(jobStatus);

    // Poll until the job leaves the pending states ('new' or 'in_progress')
    while (jobStatus.status === 'new' || jobStatus.status === 'in_progress') {
        await new Promise(resolve => setTimeout(resolve, jobStatus.recommended_pull_delay_ms));
        jobStatus = await client.getJob(jobId);
    }

    console.log('Final result:', jobStatus);

### [Options](#options)

Both methods support the following options:

*   `url`: The target URL to crawl

*   `scrape_type`: Type of content to extract ('markdown', 'html', etc.)

*   `items_limit`: Maximum number of pages to crawl

*   `whitelist_regexp`: Regular expression for allowed URLs

*   `blacklist_regexp`: Regular expression for blocked URLs

*   `webhook_url`: URL to receive notifications when the job completes

### [GetContent](#getcontent)

The job item contains a link to its content. For convenience, there is a `getContent()` method that allows you to easily access this content. Here's an example:

    const result = await client.crawl({
        "url": "https://stripe.com/",
        "scrape_type": "markdown",
        "items_limit": 10
    });

    for (const item of result.job_items) {
        const content = await item.getContent();
        console.log(content.slice(0, 100));
    }

This method retrieves the full content associated with a job item, which can be useful for processing or displaying the crawled data.


----
url: https://webcrawlerapi.com/docs/actions/s3-upload
----

S3 Upload

=========


How to upload crawled data directly to Amazon S3 or compatible storage

The S3 Upload action allows you to automatically upload the crawled data to your Amazon S3 bucket or any S3-compatible storage service. This is particularly useful for integrating the crawl results directly into your data pipeline without requiring an additional step to download and then upload the data.

[Usage](#usage)

---------------

**Security Warning**: We temporarily store your S3 credentials while the job is processing. All credentials are automatically removed immediately after the job completes.

To use the S3 upload action, include an `actions` array in your request with an action of type `upload_s3`. This action requires several parameters to authenticate and specify the destination in your S3 bucket.

### [Required Parameters](#required-parameters)

| Parameter | Type | Description |
| --- | --- | --- |
| `type` | string | Must be set to `upload_s3` |
| `path` | string | The file path/key where the data will be stored in your bucket |
| `access_key_id` | string | Your S3 access key ID |
| `secret_access_key` | string | Your S3 secret access key |
| `bucket` | string | The name of your S3 bucket |
| `endpoint` | string | The S3 endpoint URL (especially needed for S3-compatible services) |

If your bucket is not in the `us-east-1` AWS region, specify the bucket region through the endpoint, in the format `https://s3.{your-region}.amazonaws.com`.
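
The required parameters and the regional endpoint format can be put together as a plain request body. The sketch below only builds and validates the JSON payload and does not send it; `s3_endpoint` and `build_crawl_request` are hypothetical helper names, and the field names are the ones listed in the table above:

```python
import json

# Parameters the upload_s3 action requires in addition to "type".
REQUIRED_S3_FIELDS = ("path", "access_key_id", "secret_access_key", "bucket", "endpoint")


def s3_endpoint(region: str) -> str:
    """Build the regional AWS endpoint in the format the docs describe."""
    return f"https://s3.{region}.amazonaws.com"


def build_crawl_request(url, s3, scrape_type="markdown", items_limit=10):
    """Return a crawl request body with one upload_s3 action attached."""
    missing = [field for field in REQUIRED_S3_FIELDS if not s3.get(field)]
    if missing:
        raise ValueError("missing S3 parameters: " + ", ".join(missing))
    action = {"type": "upload_s3"}
    action.update({field: s3[field] for field in REQUIRED_S3_FIELDS})
    return {
        "url": url,
        "scrape_type": scrape_type,
        "items_limit": items_limit,
        "actions": [action],
    }


body = build_crawl_request(
    "https://books.toscrape.com/",
    {
        "path": "/testupload",
        "access_key_id": "<ACCESS_KEY>",
        "secret_access_key": "<SECRET_KEY>",
        "bucket": "mybucket",
        "endpoint": s3_endpoint("eu-west-1"),
    },
    items_limit=20,
)
print(json.dumps(body, indent=2))
```

Validating locally catches a missing credential before the request is sent, which avoids starting a crawl job whose upload step is bound to fail.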

[Example Request](#example-request)

-----------------------------------

cURL:

    curl -i --request POST \
      --url https://api.webcrawlerapi.com/v1/crawl \
      --header 'Authorization: Bearer YOUR_API_KEY' \
      --header 'Content-Type: application/json' \
      --data '{
      "url": "https://books.toscrape.com/",
      "scrape_type": "markdown",
      "items_limit": 20,
      "actions": [
        {
          "type": "upload_s3",
          "path": "/testupload",
          "access_key_id": "<ACCESS_KEY>",
          "secret_access_key": "<SECRET_KEY>",
          "bucket": "mybucket",
          "endpoint": "https://s3.eu-west-1.amazonaws.com"
        }
      ]
    }'

JavaScript:

    // `client` is a WebcrawlerClient instance created as shown in the JavaScript SDK docs
    const s3Upload = {
      "type": "upload_s3",
      "path": "/testupload",
      "access_key_id": "<ACCESS_KEY>",
      "secret_access_key": "<SECRET_KEY>",
      "bucket": "mybucket",
      "endpoint": "https://s3.eu-west-1.amazonaws.com"
    };

    try {
      // Synchronous crawl: the promise resolves with all the data once the job is done
      const syncJob = await client.crawl({
        "url": "https://books.toscrape.com/",
        "scrape_type": "markdown",
        "items_limit": 20,
      }, s3Upload);
      console.log(`Job ID: ${syncJob.id}`);
    } catch (error) {
      console.error("Error uploading to S3:", error);
    }

Python:

    # `crawler` is a WebCrawlerAPI client created as shown in the Python SDK docs;
    # UploadS3Action is provided by the Python SDK
    s3_action = UploadS3Action(
        path="/testupload",
        access_key_id="<ACCESS_KEY>",
        secret_access_key="<SECRET_KEY>",
        bucket="mybucket",
        endpoint="https://s3.eu-west-1.amazonaws.com"
    )

    # Start a synchronous crawling job (blocks until completion)
    print("Starting crawling job...")
    job = crawler.crawl(
        url="https://books.toscrape.com/",
        scrape_type="markdown",
        items_limit=20,
        actions=s3_action,  # Add the S3 upload action
        max_polls=100  # Maximum number of status checks
    )
    print(f"Job completed with ID: {job.id}")

[Response](#response)

---------------------

When the S3 upload action is successfully executed, the response will include information about the upload:

    {
      "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b",
      "actions": [
        {
          "type": "upload_s3",
          "status": "success",
          "path": "/testupload"
        }
      ]
    }

[Compatible Storage Services](#compatible-storage-services)

-----------------------------------------------------------

This action works with:

*   Amazon S3

*   Cloudflare R2

*   DigitalOcean Spaces

*   Backblaze B2

*   Any other S3-compatible storage service

[Error Handling](#error-handling)

---------------------------------

If there's an error with the S3 upload, the action's status will be set to `error` with a message explaining the issue:

    {
      "error_code": "invalid_request",
      "error_message": "invalid S3 credentials: operation error S3: PutObject, https response error StatusCode: 403, api error InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records."
    }

If you upload files to a private, non-accessible bucket, subsequent attempts to retrieve the content using the file URL might fail. Ensure that you have proper permissions set up for accessing the uploaded files if you need to retrieve them later.
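
The success and error shapes shown on this page can be told apart with a small check on the decoded JSON. This is a sketch under the assumption that responses look exactly like the two examples above; `upload_outcome` is a hypothetical helper, not an SDK function:

```python
def upload_outcome(response: dict) -> tuple[bool, str]:
    """Return (ok, detail) for a decoded API response.

    On success, detail is the uploaded path. On failure, it is the error
    text or the action status reported by the API.
    """
    # Error responses carry top-level error_code / error_message fields.
    if "error_code" in response:
        message = response.get("error_message", "")
        return False, f"{response['error_code']}: {message}"
    # Successful responses list each action together with its status.
    for action in response.get("actions", []):
        if action.get("type") == "upload_s3":
            if action.get("status") == "success":
                return True, action.get("path", "")
            return False, action.get("status", "unknown")
    return False, "no upload_s3 action in response"


ok, detail = upload_outcome({
    "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b",
    "actions": [{"type": "upload_s3", "status": "success", "path": "/testupload"}],
})
print(ok, detail)
```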


----
url: https://webcrawlerapi.com/docs/sdk/python
----

Python

======


Learn how to use the WebCrawler API Python SDK to crawl websites and extract data.

[Installation](#installation)

-----------------------------

    pip install webcrawlerapi

[Usage](#usage)

---------------

### [Synchronous Crawling](#synchronous-crawling)

The synchronous method waits for the crawl to complete and returns all data at once.

    from webcrawlerapi import WebCrawlerAPI

    # Initialize the client
    crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")

    # Synchronous crawling
    result = crawler.crawl(
        url="https://example.com",
        scrape_type="markdown",
        items_limit=10
    )

    print(f"Job completed with status: {result.status}")
    print(f"Number of items crawled: {len(result.job_items)}")

### [Asynchronous Crawling](#asynchronous-crawling)

The asynchronous method returns a job ID immediately and allows you to check the status later.

    from webcrawlerapi import WebCrawlerAPI
    import time

    # Initialize the client
    crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")

    # Start async crawl job
    job = crawler.crawl_async(
        url="https://example.com",
        scrape_type="markdown",
        items_limit=10
    )

    # Get the job ID
    job_id = job.id

    # Check job status
    job_status = crawler.get_job(job_id)

    # Poll until the job leaves the pending states ('new' or 'in_progress')
    while job_status.status in ('new', 'in_progress'):
        time.sleep(job_status.recommended_pull_delay_ms / 1000)  # Convert ms to seconds
        job_status = crawler.get_job(job_id)

    # Process results
    if job_status.status == 'done':
        for item in job_status.job_items:
            print(f"Page title: {item.title}")
            print(f"Original URL: {item.original_url}")
            print(f"Markdown content URL: {item.markdown_content_url}")
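
The polling loop above runs until the job finishes, however long that takes. A bounded variant is sketched below, assuming only the `status` and `recommended_pull_delay_ms` attributes documented on this page. `wait_for_job` is a hypothetical helper, and the demo uses a fake `get_job` so it runs without an API key:

```python
import time
from types import SimpleNamespace


def wait_for_job(get_job, job_id, max_polls=100, sleep=time.sleep):
    """Poll get_job(job_id) until the status is no longer pending.

    Raises TimeoutError after max_polls status checks.
    """
    job = get_job(job_id)
    polls = 1
    while job.status in ("new", "in_progress"):
        if polls >= max_polls:
            raise TimeoutError(f"job {job_id} still {job.status} after {polls} polls")
        sleep(job.recommended_pull_delay_ms / 1000)  # ms to seconds
        job = get_job(job_id)
        polls += 1
    return job


# Demo with a fake client that reports three successive states.
_states = iter(["new", "in_progress", "done"])


def fake_get_job(job_id):
    return SimpleNamespace(id=job_id, status=next(_states), recommended_pull_delay_ms=10)


job = wait_for_job(fake_get_job, "demo-job", sleep=lambda seconds: None)
print(job.status)
```

With the real SDK, `crawler.get_job` would be passed in place of `fake_get_job`.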

[Available Parameters](#available-parameters)

---------------------------------------------

Both crawling methods support these parameters:

*   `url` (required): The target URL to crawl

*   `scrape_type`: Type of content to extract ('markdown', 'html', 'cleaned')

*   `items_limit`: Maximum number of pages to crawl (default: 10)

*   `whitelist_regexp`: Regular expression for allowed URLs

*   `blacklist_regexp`: Regular expression for blocked URLs

*   `webhook_url`: URL to receive notifications when the job completes

*   `max_polls`: Maximum number of status checks (sync only, default: 100)
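
Before starting a crawl, it can help to preview locally which URLs a whitelist and blacklist pair would let through. The sketch below assumes that patterns match anywhere in the URL and that the blacklist takes precedence. Neither behaviour is stated on this page, so treat it as an approximation of the server-side filter, with `url_allowed` as a hypothetical name:

```python
import re


def url_allowed(url, whitelist_regexp=None, blacklist_regexp=None):
    """Approximate the URL filter: blacklist first, then whitelist.

    Assumption: patterns are applied with search semantics (match anywhere).
    """
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/news",
    "https://example.com/pricing",
]
kept = [u for u in urls if url_allowed(u, whitelist_regexp=r"/blog/", blacklist_regexp=r"/tag/")]
print(kept)
```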

[Response Objects](#response-objects)

-------------------------------------

### [Job Object](#job-object)

    job.id                         # Unique job identifier
    job.status                     # Job status (new, in_progress, done, error)
    job.url                        # Original crawl URL
    job.created_at                 # Job creation timestamp
    job.finished_at                # Job completion timestamp
    job.job_items                  # List of crawled items
    job.recommended_pull_delay_ms  # Recommended delay between status checks

### [JobItem Object](#jobitem-object)

    item.id                    # Unique item identifier
    item.original_url          # URL of the crawled page
    item.title                 # Page title
    item.status                # Item status
    item.page_status_code      # HTTP status code
    item.markdown_content_url  # URL to markdown content (if applicable)
    item.raw_content_url       # URL to raw content
    item.cleaned_content_url   # URL to cleaned content
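
Each item exposes one content URL per output type, so code that handles several scrape types needs to pick the matching attribute. A minimal sketch, assuming the `html` scrape type corresponds to `raw_content_url`; `content_url_for` is a hypothetical helper, and the demo item is a stand-in object rather than a real SDK `JobItem`:

```python
from types import SimpleNamespace

# Scrape type -> JobItem attribute holding the matching content URL.
CONTENT_URL_FIELDS = {
    "markdown": "markdown_content_url",
    "html": "raw_content_url",  # assumption: raw content is the page HTML
    "cleaned": "cleaned_content_url",
}


def content_url_for(item, scrape_type):
    """Return the content URL on `item` for the given scrape type, or None."""
    if scrape_type not in CONTENT_URL_FIELDS:
        raise ValueError(f"unknown scrape type: {scrape_type!r}")
    return getattr(item, CONTENT_URL_FIELDS[scrape_type], None)


demo_item = SimpleNamespace(
    markdown_content_url="https://cdn.example.com/item.md",
    raw_content_url="https://cdn.example.com/item.html",
    cleaned_content_url=None,
)
print(content_url_for(demo_item, "markdown"))
```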


----
url: https://webcrawlerapi.com/docs/sdk/java
----

Java

====


Learn how to use the WebCrawler API Java SDK to crawl websites and extract data.

[Obtain an API Key](#obtain-an-api-key)

---------------------------------------

To use the WebCrawler API, you need to obtain an API key. You can do this by [signing up for a free account](https://dash.webcrawlerapi.com/access).

[Installation](#installation)

-----------------------------

The Java SDK is a standalone, single-file implementation that requires **no external dependencies**. Simply copy the `WebCrawlerAPI.java` file into your project.

### [Download the SDK](#download-the-sdk)

Get the SDK from the [GitHub repository](https://github.com/WebCrawlerAPI/java-sdk):

    # Download directly
    curl -O https://raw.githubusercontent.com/WebCrawlerAPI/java-sdk/main/WebCrawlerAPI.java

    # Or clone the repository
    git clone https://github.com/WebCrawlerAPI/java-sdk.git

### [Add to Your Project](#add-to-your-project)

Copy `WebCrawlerAPI.java` into your project's source directory:

    # For a standalone project
    cp WebCrawlerAPI.java /path/to/your/project/src/

    # For Maven projects
    cp WebCrawlerAPI.java /path/to/your/project/src/main/java/

    # For Gradle projects
    cp WebCrawlerAPI.java /path/to/your/project/src/main/java/

[Requirements](#requirements)

-----------------------------

*   Java 17 or higher

*   No external dependencies or build tools required

[Usage](#usage)

---------------

### [Quick Start](#quick-start)

Here's a simple example to get you started:

    // Initialize the client
    WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");

    try {
        // Scrape a single page
        WebCrawlerAPI.ScrapeResult result = client.scrape(
            "https://example.com",
            "markdown"
        );

        if ("done".equals(result.status)) {
            System.out.println("Content: " + result.content);
        }
    } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
        System.err.println("Error: " + e.getMessage());
    }

### [Synchronous Crawling](#synchronous-crawling)

The synchronous method waits for the crawl to complete and returns all data at once.

    // Initialize the client
    WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");

    try {
        // Crawl a website
        WebCrawlerAPI.CrawlResult result = client.crawl(
            "https://example.com",  // URL to crawl
            "markdown",             // Scrape type: html, cleaned, or markdown
            10                      // Maximum number of pages to crawl
        );

        System.out.println("Crawl completed with status: " + result.status);
        System.out.println("Number of items crawled: " + result.items.size());

        // Access crawled items
        for (WebCrawlerAPI.CrawlItem item : result.items) {
            System.out.println("URL: " + item.url);
            System.out.println("Status: " + item.status);
            System.out.println("Content URL: " + item.getContentUrl("markdown"));
        }
    } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
        System.err.println("Error: " + e.getMessage());
    }

### [Asynchronous Scraping](#asynchronous-scraping)

Start a scrape job and check status later:

    // Initialize the client
    WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");

    try {
        // Start scrape job asynchronously
        String scrapeId = client.scrapeAsync("https://example.com", "html");
        System.out.println("Scrape started with ID: " + scrapeId);

        // Do other work here...

        // Check status later
        WebCrawlerAPI.ScrapeResult result = client.getScrape(scrapeId);
        System.out.println("Status: " + result.status);

        // Poll until complete
        while (!"done".equals(result.status) && !"error".equals(result.status)) {
            Thread.sleep(2000);
            result = client.getScrape(scrapeId);
        }

        if ("done".equals(result.status)) {
            System.out.println("Content: " + result.html);
        }
    } catch (Exception e) {
        System.err.println("Error: " + e.getMessage());
    }

### [Scraping Single Pages](#scraping-single-pages)

For scraping a single page without crawling (synchronous):

    // Initialize the client
    WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");

    try {
        // Scrape a single page
        WebCrawlerAPI.ScrapeResult result = client.scrape(
            "https://example.com",
            "markdown"
        );

        System.out.println("Scrape status: " + result.status);

        if ("done".equals(result.status)) {
            System.out.println("Content: " + result.markdown);
        }
    } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
        System.err.println("Error: " + e.getMessage());
    }

[API Methods](#api-methods)

---------------------------

### [crawl()](#crawl)

Crawl a website and return all discovered pages.

    CrawlResult crawl(String url, String scrapeType, int itemsLimit)
    CrawlResult crawl(String url, String scrapeType, int itemsLimit, int maxPolls)

**Parameters:**

*   `url` (String, required): The target URL to crawl

*   `scrapeType` (String): Type of content to extract: `"markdown"`, `"html"`, or `"cleaned"`

*   `itemsLimit` (int): Maximum number of pages to crawl

*   `maxPolls` (int, optional): Maximum polling attempts (default: 100)

### [scrape()](#scrape)

Scrape a single page synchronously (waits for completion).

    ScrapeResult scrape(String url, String scrapeType)
    ScrapeResult scrape(String url, String scrapeType, int maxPolls)

**Parameters:**

*   `url` (String, required): The target URL to scrape

*   `scrapeType` (String): Type of content to extract: `"markdown"`, `"html"`, or `"cleaned"`

*   `maxPolls` (int, optional): Maximum polling attempts (default: 100)

### [scrapeAsync()](#scrapeasync)

Start a scrape job asynchronously (returns immediately).

    String scrapeAsync(String url, String scrapeType)

**Parameters:**

*   `url` (String, required): The target URL to scrape

*   `scrapeType` (String): Type of content to extract: `"markdown"`, `"html"`, or `"cleaned"`

**Returns:** Scrape ID (String) that can be used with `getScrape()`

### [getScrape()](#getscrape)

Get the status and result of a scrape job.

    ScrapeResult getScrape(String scrapeId)

**Parameters:**

*   `scrapeId` (String, required): The scrape ID returned from `scrapeAsync()`

[Response Objects](#response-objects)

-------------------------------------

### [CrawlResult](#crawlresult)

Contains the result of a crawl operation.

    public class CrawlResult {
        public String id;                      // Job ID
        public String status;                  // Job status: "new", "in_progress", "done", "error"
        public String url;                     // Original URL
        public String scrapeType;              // Scrape type used
        public int recommendedPullDelayMs;     // Recommended delay between polls
        public List<CrawlItem> items;          // List of crawled items
    }

### [CrawlItem](#crawlitem)

Individual crawled page item.

    public class CrawlItem {
        public String url;                     // Page URL
        public String status;                  // Item status
        public String rawContentUrl;           // URL to raw HTML content
        public String cleanedContentUrl;       // URL to cleaned content
        public String markdownContentUrl;      // URL to markdown content

        // Helper method to get content URL based on scrape type
        public String getContentUrl(String scrapeType);
    }

### [ScrapeResult](#scraperesult)

Contains the result of a scrape operation.

    public class ScrapeResult {
        public String status;                  // Scrape status: "in_progress", "done", "error"
        public String content;                 // Scraped content (based on scrape_type)
        public String html;                    // Raw HTML content
        public String markdown;                // Markdown content
        public String cleaned;                 // Cleaned text content
        public String url;                     // Page URL
        public int pageStatusCode;             // HTTP status code
    }

[Error Handling](#error-handling)

---------------------------------

The SDK throws `WebCrawlerAPIException` for API errors:

    try {
        WebCrawlerAPI.CrawlResult result = client.crawl(url, "markdown", 10);
        // Process result...
    } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
        System.err.println("Error code: " + e.getErrorCode());
        System.err.println("Error message: " + e.getMessage());
    }

Common error codes:

*   `network_error` - Network/connection error

*   `invalid_response` - Invalid API response

*   `interrupted` - Operation was interrupted

*   `unknown_error` - Unknown error occurred

[Advanced Usage](#advanced-usage)

---------------------------------

### [Custom Base URL](#custom-base-url)

For testing or custom endpoints:

    // Use custom API endpoint (e.g., for local development)
    WebCrawlerAPI client = new WebCrawlerAPI(
        "YOUR_API_KEY",
        "http://localhost:8080"  // Custom base URL
    );

### [Control Polling Behavior](#control-polling-behavior)

Customize the maximum number of polling attempts:

    // Crawl with custom max polls
    WebCrawlerAPI.CrawlResult result = client.crawl(
        "https://example.com",
        "markdown",
        10,
        50  // Max 50 polling attempts
    );

    // Scrape with custom max polls
    WebCrawlerAPI.ScrapeResult scrape = client.scrape(
        "https://example.com",
        "markdown",
        30  // Max 30 polling attempts
    );

[Complete Example](#complete-example)

-------------------------------------

Here's a complete example showing compilation and execution:

    public class MyApp {
        public static void main(String[] args) {
            // Get API key from environment variable
            String apiKey = System.getenv("API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                System.err.println("Error: API_KEY environment variable not set");
                System.exit(1);
            }

            // Create client
            WebCrawlerAPI client = new WebCrawlerAPI(apiKey);

            try {
                // Crawl a website
                WebCrawlerAPI.CrawlResult result = client.crawl(
                    "https://books.toscrape.com",
                    "markdown",
                    5
                );

                System.out.println("Found " + result.items.size() + " items");

                // Display results
                for (WebCrawlerAPI.CrawlItem item : result.items) {
                    System.out.println("URL: " + item.url);
                    System.out.println("Status: " + item.status);
                }
            } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
                System.err.println("Error: " + e.getMessage());
                System.exit(1);
            }
        }
    }

### [Compile and Run](#compile-and-run)

    # Compile (make sure WebCrawlerAPI.java is in the same directory)
    javac MyApp.java WebCrawlerAPI.java

    # Run with your API key
    API_KEY=your-api-key java MyApp

    # Or for local testing
    API_KEY=test-api-key API_BASE_URL=http://localhost:8080 java MyApp

[More Information](#more-information)

-------------------------------------

For more examples and the complete source code, visit the [GitHub repository](https://github.com/WebCrawlerAPI/java-sdk).


----
url: https://webcrawlerapi.com/docs/sdk/make
----

Make (formerly Integromat) WebcrawlerAPI integration

====================================================


How to get website content for LLM training using Make and WebCrawlerAPI.

Make (formerly Integromat) is a powerful automation platform that allows you to connect various apps and services to automate tasks. You can use Make to integrate WebCrawlerAPI for crawling websites and extracting data, which can then be used for training large language models (LLMs) or other purposes.

1.  First, you need to create a WebCrawlerAPI account and get your [API key](https://dash.webcrawlerapi.com/access).

2.  Next, log in to your Make account and search for the WebcrawlerAPI app in the app directory.

3.  Click on the WebcrawlerAPI app to open its details page, then click "Create a connection".

4.  In the connection settings, enter your WebCrawlerAPI API key and click "Save".

5.  Now you can use the WebcrawlerAPI app in your Make scenarios. To scrape a webpage, add the WebcrawlerAPI module to your scenario and configure it with the URL you want to scrape and any additional parameters you need.


----
url: https://webcrawlerapi.com/blog/beatifulsoup-webcrawler
----

Python · Tutorial · Web Crawling

BeautifulSoup4 Web Crawler

==========================

A tiny BeautifulSoup4 + requests crawler that stays on one site, normalizes URLs, and deduplicates links.

Written by Andrew

Published on Feb 3, 2026

### Table of Contents

*   [BeautifulSoup4 web crawler in one file](#beautifulsoup4-web-crawler-in-one-file)

*   [How to run it](#how-to-run-it)

*   [What it does (and why it works)](#what-it-does-and-why-it-works)

*   [1) Deduplication](#1-deduplication)

*   [2) Scope control (same site)](#2-scope-control-same-site)

*   [3) URL normalization](#3-url-normalization)

*   [Politeness (delay + timeout)](#politeness-delay-timeout)

*   [What this script does not do](#what-this-script-does-not-do)

If you want a tiny “just works” crawler, start here. Copy-paste this file, run it, and you will get a deduped list of internal URLs.

    #!/usr/bin/env python3

    import argparse

    import collections

    import re

    import sys

    import time

    import urllib.parse

    import requests

    from bs4 import BeautifulSoup

    SKIP_SCHEMES = {"mailto", "tel", "javascript", "data"}

    DROP_QUERY_PREFIXES = {"utm_"}

    DROP_QUERY_KEYS = {"fbclid", "gclid", "igshid"}

    def normalize_url(raw: str, *, base: str | None = None) -> str | None:

        try:

            u = urllib.parse.urljoin(base, raw) if base else raw

            p = urllib.parse.urlsplit(u)

        except Exception:

            return None

        if not p.scheme or p.scheme.lower() in SKIP_SCHEMES:

            return None

        scheme = p.scheme.lower()

        netloc = p.netloc.lower()

        # drop default ports

        if (scheme == "http" and netloc.endswith(":80")) or (scheme == "https" and netloc.endswith(":443")):

            netloc = netloc.rsplit(":", 1)[0]

        # normalize path: strip trailing slashes (keeping "/" for the root) and collapse repeated slashes

        path = re.sub(r"/+$", "", p.path or "/") or "/"

        path = re.sub(r"/{2,}", "/", path)

        # normalize query: drop common tracking params and sort remaining

        q = urllib.parse.parse_qsl(p.query, keep_blank_values=True)

        q2: list[tuple[str, str]] = []

        for k, v in q:

            lk = k.lower()

            if lk in DROP_QUERY_KEYS:

                continue

            if any(lk.startswith(prefix) for prefix in DROP_QUERY_PREFIXES):

                continue

            q2.append((k, v))

        q2.sort(key=lambda kv: (kv[0], kv[1]))

        query = urllib.parse.urlencode(q2, doseq=True)

        # drop fragments

        return urllib.parse.urlunsplit((scheme, netloc, path, query, ""))

    def same_site(url: str, root: str) -> bool:

        try:

            a = urllib.parse.urlsplit(url)

            b = urllib.parse.urlsplit(root)

            return a.scheme == b.scheme and a.netloc == b.netloc

        except Exception:

            return False

    def extract_links(html: str, *, base_url: str) -> list[str]:

        soup = BeautifulSoup(html, "html.parser")

        out: list[str] = []

        for a in soup.select("a[href]"):

            href = a.get("href")

            if not href or not isinstance(href, str):

                continue

            n = normalize_url(href, base=base_url)

            if n:

                out.append(n)

        return out

    def crawl(start_url: str, *, max_pages: int, delay_s: float, timeout_s: float) -> list[str]:

        start = normalize_url(start_url)

        if not start:

            raise ValueError("Invalid start URL")

        session = requests.Session()

        session.headers.update({"User-Agent": "BeautifulSoupCrawler/1.0 (+https://webcrawlerapi.com)"})

        seen: set[str] = set()

        q: collections.deque[str] = collections.deque([start])

        crawled: list[str] = []

        while q and len(crawled) < max_pages:

            url = q.popleft()

            if url in seen:

                continue

            if not same_site(url, start):

                continue

            seen.add(url)

            try:

                res = session.get(url, timeout=timeout_s, allow_redirects=True)

            except requests.RequestException:

                continue

            final_url = normalize_url(res.url) or url

            if final_url not in seen:

                seen.add(final_url)

            crawled.append(final_url)

            ctype = (res.headers.get("content-type") or "").lower()

            if res.ok and "text/html" in ctype:

                for link in extract_links(res.text, base_url=final_url):

                    if link not in seen:

                        q.append(link)

            if delay_s:

                time.sleep(delay_s)

        return crawled

    def main(argv: list[str]) -> int:

        ap = argparse.ArgumentParser(description="Tiny site crawler using BeautifulSoup4")

        ap.add_argument("start_url", help="Seed URL, e.g. https://example.com")

        ap.add_argument("--max-pages", type=int, default=100, help="Hard stop")

        ap.add_argument("--delay", type=float, default=0.2, help="Delay between requests (seconds)")

        ap.add_argument("--timeout", type=float, default=10.0, help="Request timeout (seconds)")

        args = ap.parse_args(argv)

        urls = crawl(args.start_url, max_pages=args.max_pages, delay_s=args.delay, timeout_s=args.timeout)

        for u in urls:

            print(u)

        print(f"\nCrawled {len(urls)} pages", file=sys.stderr)

        return 0

    if __name__ == "__main__":

        raise SystemExit(main(sys.argv[1:]))

BeautifulSoup4 web crawler in one file

======================================

Hi, I'm Andrew. This is a tiny crawler that is built for one job: start from a URL, follow links, and keep going.

It is not a production crawler. It is a learning script that shows the core loop: fetch -> parse -> enqueue -> dedupe.

How to run it

-------------

The virtualenv is already created in this repo at content/blog/beatifulsoup-webcrawler/extra/code/.venv.

    cd content/blog/beatifulsoup-webcrawler/extra/code

    source .venv/bin/activate

    python -m pip install -r requirements.txt

    python crawler.py https://example.com --max-pages 50 --delay 0.2

URLs are printed to stdout. The summary line (Crawled N pages) is printed to stderr, so stdout can be redirected to a file cleanly.

If you want the longer, step-by-step version (from copy-paste crawler to more production-ish concerns), read: [How to crawl the website with Python](/blog/how-to-crawl-the-website-with-python).

What it does (and why it works)

-------------------------------

This script is small, but it is not naive. Three guardrails are doing most of the work.

### 1) Deduplication

Without dedupe, the crawl becomes infinite. Navigation menus alone can re-add the same URLs forever.

In the script, seen is the simplest correct version:

    seen: set[str] = set()

    if url in seen:

        continue

    seen.add(url)
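To see why the check matters, here is a minimal sketch of the same loop run over a made-up in-memory link graph instead of real HTTP. The `LINKS` dictionary and the `crawl_order` name are assumptions for illustration; they are not part of the script above.

```python
from collections import deque

# Hypothetical in-memory "site": every page links back to the others,
# the way a navigation menu does on a real site.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/blog"],
    "/blog": ["/", "/about", "/blog/post-1"],
    "/blog/post-1": ["/", "/blog"],
}


def crawl_order(start: str) -> list[str]:
    """Breadth-first walk that visits every reachable page exactly once."""
    seen: set[str] = set()
    queue: deque[str] = deque([start])
    order: list[str] = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # already visited: skip instead of looping forever
        seen.add(url)
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                queue.append(link)
    return order


print(crawl_order("/"))
# ['/', '/about', '/blog', '/blog/post-1']
```

The graph is full of cycles, yet the walk ends after four pages. Remove the `seen` check and the queue never drains.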

### 2) Scope control (same site)

If scope is not defined, the crawler leaves the site. It follows socials, auth providers, CDNs, random third-party links.

This crawler stays strict: same scheme + same host.

    from urllib.parse import urlsplit

    def same_site(url: str, root: str) -> bool:

        a = urlsplit(url)

        b = urlsplit(root)

        return a.scheme == b.scheme and a.netloc == b.netloc

It is boring. It is also safe.
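A quick sketch of what "strict" means in practice. The URLs below are made-up examples; the function is the same check as above, repeated so the snippet runs on its own.

```python
from urllib.parse import urlsplit


def same_site(url: str, root: str) -> bool:
    a = urlsplit(url)
    b = urlsplit(root)
    return a.scheme == b.scheme and a.netloc == b.netloc


root = "https://example.com/"

print(same_site("https://example.com/docs", root))      # True: same scheme and host
print(same_site("http://example.com/docs", root))       # False: scheme differs
print(same_site("https://blog.example.com/", root))     # False: subdomain is a different host
print(same_site("https://twitter.com/example", root))   # False: third-party site
```

Note the tradeoff: `www.example.com` and `example.com` count as different sites here. If a site mixes both, start the crawl from the host it actually links to.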

### 3) URL normalization

Duplicates are not only caused by re-visiting the same link. They are also caused by URL variants.

*   /page vs /page/

*   #fragment variants

*   tracking params like utm\_\*, fbclid, gclid

So normalization is used before URLs are added to the queue. In this script it is done in normalize\_url().

Here is the small idea, without all the extra rules:

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def normalize(raw: str, base: str) -> str:

        u = urljoin(base, raw)

        p = urlsplit(u)

        return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/") or "/", p.query, ""))

The tradeoff is real. Aggressive normalization can merge pages that are actually different. That is why only known tracking params are dropped, not the whole query string.
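Here is a small sketch of that tradeoff, using only the standard library. The `canonical` helper and the example URLs are assumptions for illustration; it is a cut-down cousin of normalize\_url(), not a replacement for it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}


def canonical(url: str) -> str:
    """Drop fragments, trailing slashes and tracking params; keep everything else."""
    p = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(p.query, keep_blank_values=True)
        if k.lower() not in TRACKING_KEYS and not k.lower().startswith("utm_")
    ]
    kept.sort()  # stable order so ?a=1&b=2 and ?b=2&a=1 compare equal
    path = p.path.rstrip("/") or "/"
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), path, urlencode(kept), ""))


variants = [
    "https://example.com/page",
    "https://example.com/page/",
    "https://example.com/page#section",
    "https://example.com/page?utm_campaign=x&fbclid=abc",
]
print({canonical(u) for u in variants})
# {'https://example.com/page'}

# A meaningful param survives, so this stays a separate URL:
print(canonical("https://example.com/page?page=2"))
# https://example.com/page?page=2
```

All four variants collapse into one URL, but `?page=2` is kept. Dropping the whole query string would have merged page 2 into page 1.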

Politeness (delay + timeout)

----------------------------

Two settings are used so the crawl does not hang and the target is not hammered.

    import time

    import requests

    session = requests.Session()

    timeout_s = 10.0

    delay_s = 0.2

    url = "https://example.com"  # placeholder target

    res = session.get(url, timeout=timeout_s, allow_redirects=True)

    time.sleep(delay_s)

0.2s can still be too fast for many sites. If you crawl something you do not control, go slower.
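One way to "go slower" without wasting time is to enforce a minimum gap between request starts, so time already spent fetching counts toward the delay. This is a sketch under that assumption; the `Throttle` class is hypothetical and is not part of the script above, which uses a plain fixed sleep.

```python
import time
from typing import Optional


class Throttle:
    """Enforce a minimum gap between the starts of consecutive requests."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self._last: Optional[float] = None  # start time of the previous request

    def remaining(self, now: float) -> float:
        """Seconds still to wait before the next request may start."""
        if self._last is None:
            return 0.0
        return max(0.0, self.min_interval_s - (now - self._last))

    def wait(self) -> None:
        """Sleep if needed, then record the start time of the new request."""
        delay = self.remaining(time.monotonic())
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()


throttle = Throttle(min_interval_s=1.0)
# In the crawl loop, call throttle.wait() right before each session.get(...).
```

With a 1.0s interval, a request that took 0.4s is followed by a 0.6s pause, not a full second.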

What this script does not do

----------------------------

This is where real crawling work starts.

*   robots.txt parsing and per-URL allow checks

*   backoff on 429 and retries on flaky networks

*   JavaScript rendering (SPA pages that ship empty HTML)

*   anti-bot handling

*   storage (results, crawl state, link graph)
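To give a feel for the first two items, here is a minimal sketch using only the standard library. The robots.txt content is made up and parsed from a string, so nothing is fetched. The `backoff_delay` helper is a hypothetical name, and exponential backoff with a cap is one common choice, not the only one.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt, parsed in memory (a real crawler would fetch /robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

USER_AGENT = "BeautifulSoupCrawler/1.0"

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(robots.can_fetch(USER_AGENT, "https://example.com/blog/post"))    # True
print(robots.can_fetch(USER_AGENT, "https://example.com/admin/users"))  # False
print(robots.crawl_delay(USER_AGENT))                                   # 2


def backoff_delay(attempt: int, *, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... capped at cap_s."""
    return min(cap_s, base_s * (2 ** attempt))


# After a 429 or a network error, wait backoff_delay(attempt) before retrying.
print([backoff_delay(n) for n in range(4)])  # [1.0, 2.0, 4.0, 8.0]
```

Even this sketch leaves questions open: how long to cache robots.txt, whether to honor a `Retry-After` header, and when to give up on a URL.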

Also: crawling is only step one. After you fetch pages, you usually need to clean the HTML (remove scripts, nav, boilerplate) before you can use the text. See: [Clean crawled or scraped data with BeautifulSoup in Python](/blog/clean-crawled-data-with-beautifulsoup-in-python).

If those features are needed, a script stops being a script. Crawling infrastructure is being built.

When that point is reached, a managed service like [WebCrawlerAPI](https://webcrawlerapi.com) is usually cheaper than rebuilding everything.

----
url: https://webcrawlerapi.com/changelog/2025-04-13-langchain-integration
----

April 13, 2025

🦜🔗 Introducing WebcrawlerAPI LangChain Integration 🤖

=======================================================

We're thrilled to announce the release of our official LangChain integration! The new webcrawlerapi-langchain package makes it seamless to incorporate WebcrawlerAPI's powerful web crawling capabilities into your LangChain document processing pipelines.

### Key Features:

*   🚀 Simple integration with LangChain's document loaders

*   📄 Multiple content formats (markdown, cleaned text, HTML)

*   ⚡️ Async and lazy loading support

*   🔄 Built-in retry mechanisms, proxies and error handling

*   🎯 Configurable URL filtering with regex patterns

### Quick Start:

    pip install webcrawlerapi-langchain

    from webcrawlerapi_langchain import WebCrawlerAPILoader

    loader = WebCrawlerAPILoader(

        url="https://example.com",

        api_key="your-api-key",

        scrape_type="markdown"

    )

    documents = loader.load()

### Perfect for:

*   Building AI-powered knowledge bases

*   Creating document QA systems

*   Training custom language models

*   Processing web content for LLM applications

### Need an integration example?

Check our [WebcrawlerAPI examples](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/langchain-basic)

Check out our [LangChain SDK documentation](/docs/sdk/langchain) for detailed usage instructions and examples. Start building powerful AI applications with web data today!

----
url: https://webcrawlerapi.com/blog/javascript-rendering-in-web-crawling-complete-guide
----

JS, Technical, 10 min read

JavaScript Rendering in Web Crawling

====================================

Explore essential tools and strategies for effective JavaScript rendering in web crawling, overcoming challenges in dynamic websites.

Written by Andrew

Published on Jan 31, 2026

### Table of Contents

*   [JavaScript Rendering in Web Crawling: Complete Guide](#javascript-rendering-in-web-crawling-complete-guide)

*   [Tools for Handling JavaScript Rendering](#tools-for-handling-javascript-rendering)

*   [Using Puppeteer for Chrome-Based Rendering](#using-puppeteer-for-chrome-based-rendering)

*   [Playwright: Multi-Browser Compatibility](#playwright-multi-browser-compatibility)

*   [Selenium: Broad Support and Flexibility](#selenium-broad-support-and-flexibility)

*   [Tips for Efficient JavaScript Rendering](#tips-for-efficient-javascript-rendering)

*   [Optimizing Rendering Settings](#optimizing-rendering-settings)

*   [Overcoming Anti-Bot Protections](#overcoming-anti-bot-protections)

*   [Advanced Tools and Services for JavaScript Rendering](#advanced-tools-and-services-for-javascript-rendering)

*   [WebCrawlerAPI: Built for High-Volume Crawling](#webcrawlerapi-built-for-high-volume-crawling)

*   [Firecrawl: Tailored for Dynamic Content](#firecrawl-tailored-for-dynamic-content)

*   [Comparing JavaScript Rendering Tools and Services](#comparing-javascript-rendering-tools-and-services)

*   [Key Takeaways](#key-takeaways)

*   [Choosing the Right Tool](#choosing-the-right-tool)

*   [Final Recommendations](#final-recommendations)

JavaScript Rendering in Web Crawling: Complete Guide

====================================================

JavaScript rendering is crucial for extracting data from modern dynamic websites that rely on frameworks like [React](https://opensource.fb.com/projects/react/), [Angular](https://angular.io/), or [Vue.js](https://vuejs.org/). Crawlers often face challenges with delayed content loading, anti-bot measures, and high resource demands. Here's a quick summary of tools and strategies to handle JavaScript-heavy sites:

*   **Tools for JavaScript Rendering**:

    *   **[Puppeteer](https://developer.chrome.com/docs/puppeteer)**: Chrome-based automation for dynamic content.

    *   **[Playwright](https://playwright.dev/)**: Multi-browser support with fast execution.

    *   **[Selenium](https://www.selenium.dev/)**: Cross-browser compatibility for enterprise needs.

    *   **[WebCrawlerAPI](https://webcrawlerapi.com/)**: Cloud-based, scalable crawling solution.

*   **Key Tips**:

    *   Use server-side rendering (SSR) for better crawling efficiency.

    *   Optimize rendering settings (e.g., 1-5 second timeouts, selective resource loading).

    *   Handle anti-bot measures with proxy rotation and randomized delays.

*   **Quick Comparison**:

| Feature | Puppeteer | Playwright | Selenium | WebCrawlerAPI |
| --- | --- | --- | --- | --- |
| **Browser Support** | Chrome/Chromium | Multi-browser | All major | Cloud-based |
| **Setup** | Moderate | Easy | Complex | No setup |
| **Best Use Case** | Chrome tasks | Flexibility | Enterprise-level | High-volume |
| **Pricing** | Free | Free | Free | $20/10,000 pages |

Choose the right tool based on your project's size, browser needs, and team expertise. With these strategies, you can efficiently handle JavaScript-rendered content for [web crawling](https://webcrawlerapi.com/scrapers/webcrawler/html).

Tools for Handling JavaScript Rendering

---------------------------------------

Modern web crawlers face challenges like delayed content loading and anti-bot protections. Thankfully, several tools are available to tackle these issues effectively. Below, we break down three of the top solutions in 2025.

### Using [Puppeteer](https://developer.chrome.com/docs/puppeteer) for Chrome-Based Rendering

Puppeteer, a Node.js library from Google, is built for Chrome-based rendering and offers precise browser automation via its high-level API. Its integration with Chrome/Chromium makes it a go-to choice for handling complex dynamic content.

Here's a quick comparison of Puppeteer's standout features:

| Feature | How It Works | Why It Matters |
| --- | --- | --- |
| Headless Mode | Automates Chrome without UI | Saves resources during processing |
| JavaScript Execution | Leverages Chrome's V8 engine | Handles dynamic content seamlessly |
| Memory Management | Built-in garbage collection | Efficient for long-running crawls |

### [Playwright](https://playwright.dev/): Multi-Browser Compatibility

Playwright stands out for its speed: the comparison cited here reports an average execution time of 4.513 seconds [\[2\]](https://betterstack.com/community/comparisons/playwright-cypress-puppeteer-selenium-comparison/). It supports Chromium, Firefox, and WebKit through a single API, making it highly versatile.

Some of its key features include:

*   Shadow DOM traversal to handle hidden elements in web components

*   Network interception for managing requests and responses

*   Geolocation mocking for testing location-based features

*   Support for multiple browser contexts in parallel

### [Selenium](https://www.selenium.dev/): Broad Support and Flexibility

Selenium remains a trusted option for complex crawling tasks, with an average execution time of 4.590 seconds in the same comparison [\[2\]](https://betterstack.com/community/comparisons/playwright-cypress-puppeteer-selenium-comparison/). Its cross-browser and multi-language support make it ideal for enterprise-level operations.

> "Selenium's language and browser support make it indispensable for enterprise-level crawling requiring cross-browser compatibility."

Selenium works with all major browsers, including Chrome, Firefox, Edge, and Safari, and supports languages like Java, [Python](https://webcrawlerapi.com/blog/how-to-crawl-the-website-with-python), C#, Ruby, and JavaScript. This flexibility is especially useful for teams managing diverse tech stacks or older systems.

To get the best results, fine-tune your chosen tool's settings, such as timeouts and rendering configurations. This is especially important for single-page applications (SPAs) or sites with heavy JavaScript dependencies. With these tools, you’ll be better equipped to handle JavaScript rendering challenges efficiently.

Tips for Efficient JavaScript Rendering

---------------------------------------

### Optimizing Rendering Settings

Getting JavaScript rendering right means fine-tuning your [crawling tools](https://webcrawlerapi.com/scrapers) to balance speed and thoroughness. Start by enabling JavaScript mode and setting a render timeout between **1-5 seconds** to effectively capture dynamic content.

Here are some key settings to focus on:

| Setting | Recommended Value | Why It Matters |
| --- | --- | --- |
| Window Size | 1366x768 | Matches standard desktop resolution for consistent rendering. |
| Resource Loading | Selective | Loads only essential resources, cutting down unnecessary overhead. |

If you're dealing with sites that rely heavily on JavaScript, you might need longer timeouts. Just keep in mind that this can slow down crawling, especially on larger websites.
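As a rough illustration, the settings above could be applied with Playwright's Python API along these lines. This is a sketch, not a recommended configuration: the `render` and `should_block` names are made up, the set of blocked resource types is an assumption about what counts as non-essential, and the render timeout is mapped onto the navigation timeout for simplicity.

```python
# Resource types treated as non-essential here (an assumption; adjust per site).
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Selective loading: skip heavy assets that do not affect the DOM text."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def render(url: str, *, timeout_ms: int = 5000) -> str:
    """Return the rendered HTML of url using headless Chromium."""
    # Imported lazily so should_block() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # 1366x768 window, as in the table above.
            context = browser.new_context(viewport={"width": 1366, "height": 768})
            page = context.new_page()
            page.route("**/*", handle)
            # Raises a timeout error if the page has not loaded within timeout_ms.
            page.goto(url, wait_until="load", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Calling `render("https://example.com")` requires `pip install playwright` and `playwright install chromium`. Raise `timeout_ms` for heavier pages and expect the crawl to slow down accordingly.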

### Overcoming Anti-Bot Protections

Dynamic websites often use anti-bot measures to block crawlers. To keep your access uninterrupted, you’ll need to employ some advanced techniques.

*   Use tools like **Puppeteer** or **Playwright** to randomize browser fingerprints (e.g., screen resolution, plugins) and mimic human behavior.

*   Rotate proxies to avoid IP-based blocks during high-volume crawling.

*   Add randomized delays of **2-5 seconds** between requests to reduce the chances of detection.
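The last two points can be sketched with the standard library alone. The proxy addresses below are placeholders, and the helper names are hypothetical; the returned dictionary follows the `proxies=` format that the `requests` library accepts.

```python
import itertools
import random

# Placeholder proxy endpoints (assumptions; substitute your own pool).
PROXIES = [
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
    "http://proxy-3.example:8080",
]

_proxy_pool = itertools.cycle(PROXIES)


def next_proxy_config() -> dict[str, str]:
    """Round-robin over the pool, in the shape requests expects for proxies=."""
    proxy = next(_proxy_pool)
    return {"http": proxy, "https": proxy}


def next_delay(rng: random.Random, low_s: float = 2.0, high_s: float = 5.0) -> float:
    """Pick a random pause between low_s and high_s seconds."""
    return rng.uniform(low_s, high_s)


rng = random.Random()
# Before each request:
#   time.sleep(next_delay(rng))
#   session.get(url, proxies=next_proxy_config(), timeout=10)
```

Round-robin is the simplest rotation policy. Real pools usually also need health checks so that a dead proxy is skipped.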

For websites with stricter protections, services like **WebCrawlerAPI** can handle JavaScript rendering and bypass anti-bot measures using their infrastructure. These strategies are especially useful for large-scale operations where consistent access to dynamic content is critical.

Advanced Tools and Services for JavaScript Rendering

----------------------------------------------------

When basic JavaScript rendering options fall short, specialized tools and APIs step in to handle more complex web crawling tasks. These tools manage JavaScript rendering and data extraction while offering features tailored for large-scale or intricate projects.

### [WebCrawlerAPI](https://webcrawlerapi.com/): Built for High-Volume Crawling

WebCrawlerAPI is designed to process JavaScript-rendered content efficiently, even at scale. Its cloud-based system can process a page in an average of **5 seconds**, making it a solid choice for projects with tight deadlines.

| Feature | Capability | Benefit |
| --- | --- | --- |
| Content Formats | HTML, Markdown, Text | Works seamlessly with various data types |
| Infrastructure | Cloud-based, distributed | Handles large volumes without delays |
| Pricing Model | Pay-per-use ($20/10,000 pages) | Budget-friendly for flexible needs |
| Integration | NodeJS, Python, PHP, .NET | Compatible with popular programming languages |

Thanks to its distributed setup, WebCrawlerAPI maintains consistent performance, even during high-demand periods. Additionally, its anti-bot features ensure uninterrupted access to target sites.
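For budgeting, the pay-per-use figure in the table works out to $0.002 per page. Here is a back-of-the-envelope sketch, assuming the price scales linearly with page count (no tiers, minimums, or discounts, which this article does not describe):

```python
PRICE_PER_10K_PAGES_USD = 20.0  # figure quoted in the table above


def estimate_cost_usd(pages: int) -> float:
    """Linear estimate: pages * ($20 / 10,000)."""
    return pages * PRICE_PER_10K_PAGES_USD / 10_000


print(estimate_cost_usd(2_500))    # 5.0
print(estimate_cost_usd(10_000))   # 20.0
print(estimate_cost_usd(250_000))  # 500.0
```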

### [Firecrawl](https://www.firecrawl.dev/): Tailored for Dynamic Content

Firecrawl is crafted for extracting data from JavaScript-heavy websites. It automates complex rendering tasks and outputs structured data in formats that suit your needs.

Key features include:

*   Smart algorithms for parsing complex web apps

*   Efficient JavaScript execution management

*   Automated data cleaning and formatting

*   Reliable performance for ongoing operations

Both WebCrawlerAPI and Firecrawl tackle the challenges posed by JavaScript-rendered sites, offering scalable and efficient solutions. Deciding between them depends on your specific needs, such as data format preferences, speed requirements, or integration ease.

With these tools in your arsenal, you can confidently choose the one that aligns best with your project's demands.

Comparing JavaScript Rendering Tools and Services

-------------------------------------------------

This section breaks down the features of Puppeteer, Playwright, Selenium, and WebCrawlerAPI, highlighting how they stack up for web crawling projects. Here's a quick comparison of their capabilities:

| Feature | Puppeteer | Playwright | Selenium | WebCrawlerAPI |
| --- | --- | --- | --- | --- |
| **Browser Support** | Chrome, Chromium | Chromium (Chrome, Edge), Firefox, WebKit | All major browsers | Multiple browsers via cloud |
| **Speed Performance** | Optimized for lightweight tasks | Fast with parallel execution | Moderate with some overhead | ~5 seconds per page |
| **Language Support** | Node.js | JavaScript, TypeScript, Python, C#, Java | Java, Python, C#, Ruby, JavaScript | Multiple via REST API |
| **Setup Complexity** | Moderate; best for Node.js developers | Easy, with detailed documentation | High; requires more configuration | No setup; fully managed cloud solution |
| **Infrastructure Needs** | Self-hosted | Self-hosted | Self-hosted | Cloud-based |
| **Pricing Model** | Free, open-source | Free, open-source | Free, open-source | $20/10,000 pages |

### Key Takeaways

*   **Puppeteer**: Perfect for Chrome-based automation in Node.js environments. It’s a solid pick for handling JavaScript-heavy tasks that need quick rendering.

*   **Playwright**: Offers speed, flexibility, and multi-browser support. Its debugging tools and clear documentation make it beginner-friendly for web crawling teams.

*   **Selenium**: A go-to choice for enterprise-level projects, thanks to its long-standing reputation and broad language support. However, it requires more effort to configure.

*   **WebCrawlerAPI**: A cloud-based service that skips setup entirely. It’s ideal for high-volume projects needing consistent and hassle-free performance.

### Choosing the Right Tool

When deciding which tool to use, think about these factors:

*   **Project Size**: Open-source options work well for smaller projects, while cloud solutions like WebCrawlerAPI are better for large-scale operations.

*   **Team Expertise**: If your team lacks DevOps skills, a cloud-based option is easier to manage.

*   **Browser Compatibility**: Make sure the tool supports the browsers you need for your project.

The best choice depends on your specific needs, whether it’s simplicity, scalability, or advanced browser support.

Conclusion: Key Points to Remember

----------------------------------

Handling JavaScript rendering effectively is crucial for extracting data from dynamic websites and maintaining strong SEO performance. Research shows that issues with JavaScript rendering can severely affect a website's visibility and ranking potential [\[1\]](https://www.1into2.com/javascript-rendering-challenges-in-seo/).

Each tool offers unique benefits: **Puppeteer** excels at Chrome-specific tasks, **Playwright** supports multiple browsers, **Selenium** suits enterprise-level projects, and **WebCrawlerAPI** specializes in scalable, cloud-based crawling. These tools cater to different needs, from self-hosted solutions to managed services, making it essential to align your choice with your project's requirements.

With the right tools and strategies, you can tackle challenges like delayed content loading and anti-bot measures, ensuring smooth and efficient data extraction from JavaScript-heavy websites.

### Final Recommendations

When it comes to efficient web crawling, consider these tips:

**Technical Implementation**:

*   Use server-side rendering (SSR) whenever possible to enhance crawling efficiency [\[1\]](https://www.1into2.com/javascript-rendering-challenges-in-seo/).

*   Set rendering timeouts between 1–5 seconds to properly capture dynamic content.

*   Utilize headless browsers to streamline content extraction [\[3\]](https://www.scrapehero.com/web-crawling-challenges-and-solutions/).

**Tool Selection**: Match tools to your project's size, browser needs, and team expertise:

*   Cloud-based options like **WebCrawlerAPI** are great for quick setups.

*   For flexibility and multi-browser support, go with **Playwright**.

*   Use **Puppeteer** for Chrome-focused tasks.

*   Choose **Selenium** for enterprise-grade compatibility.

The key to success is selecting tools that balance performance, scalability, and ease of use. By applying these strategies and staying updated on new technologies, you can effectively manage JavaScript-rendered content in your web crawling projects.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-open-devtools-for-a-page-in-puppeteer
----

To open DevTools for a page in Puppeteer, use the new Page.openDevTools() method. It calls the DevTools interface for the target page and returns a Page that points to the DevTools instance.

    const devtools = await page.openDevTools();

    // devtools is a Page instance representing the DevTools UI for the page

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-execution-context-destroyed
----

The "Execution context was destroyed" error happens when the page navigates or reloads while your script is still evaluating in the old document.

Wait for navigation intentionally and then re-query locators.

    await Promise.all([

      page.waitForURL('**/dashboard'),

      page.getByRole('button', { name: 'Sign in' }).click(),

    ]);

    await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();

Do not keep stale ElementHandle references across navigations.

----
url: https://webcrawlerapi.com/docs/sdk/php
----

PHP

===

Learn how to use the WebCrawler API PHP SDK to crawl websites and extract data.

[Obtain an API Key](#obtain-an-api-key)

---------------------------------------

To use the WebCrawler API, you need to obtain an API key. You can do this by [signing up for a free account](https://dash.webcrawlerapi.com/access).

[Installation](#installation)

-----------------------------

    composer require webcrawlerapi/sdk

[Requirements](#requirements)

-----------------------------

*   PHP 8.1 or higher

*   Composer

*   ext-json PHP extension

*   Guzzle HTTP Client 7.0 or higher

[Usage](#usage)

---------------

### [Synchronous Crawling](#synchronous-crawling)

The synchronous method waits for the crawl to complete and returns all data at once.

    <?php

    require_once('vendor/autoload.php');

    use WebCrawlerAPI\WebCrawlerAPI;

    // Initialize the client

    $crawler = new WebCrawlerAPI('YOUR_API_KEY');

    // Synchronous crawling

    $job = $crawler->crawl(

        url: 'https://example.com',

        scrapeType: 'markdown',

        itemsLimit: 10

    );

    echo "Job completed with status: {$job->status}\n";

    // Access job items and their content

    foreach ($job->jobItems as $item) {

        echo "Page title: {$item->title}\n";

        echo "Original URL: {$item->originalUrl}\n";

        // Get the content based on job's scrape_type

        // Returns null if item is not in "done" status

        $content = $item->getContent();

        if ($content) {

            echo "Content preview: " . substr($content, 0, 200) . "...\n";

        } else {

            echo "Content not available (item not done)\n";

        }

    }

### [Asynchronous Crawling](#asynchronous-crawling)

The asynchronous method returns a job ID immediately and allows you to check the status later.

    <?php

    require_once('vendor/autoload.php');

    use WebCrawlerAPI\WebCrawlerAPI;

    // Initialize the client

    $crawler = new WebCrawlerAPI('YOUR_API_KEY');

    // Start async crawl job

    $response = $crawler->crawlAsync(

        url: 'https://example.com',

        scrapeType: 'markdown',

        itemsLimit: 10

    );

    // Get the job ID

    $jobId = $response->id;

    echo "Crawling job started with ID: {$jobId}\n";

    // Check job status

    $job = $crawler->getJob($jobId);

    echo "Job status: {$job->status}\n";

    // Poll until the job leaves the "new" / "in_progress" states

    while (in_array($job->status, ['new', 'in_progress'], true)) {

        sleep((int) ceil($job->recommendedPullDelayMs / 1000)); // Convert ms to whole seconds, rounding up

        $job = $crawler->getJob($jobId);

    }

    // Process results

    if ($job->status === 'done') {

        foreach ($job->jobItems as $item) {

            echo "Page title: {$item->title}\n";

            echo "Original URL: {$item->originalUrl}\n";

        }

    }

[Available Parameters](#available-parameters)

---------------------------------------------

[Response Objects](#response-objects)

-------------------------------------

### [Job Object](#job-object)

    $job->id                         // Unique job identifier

    $job->status                     // Job status (new, in_progress, done, error)

    $job->url                        // Original crawl URL

    $job->createdAt                  // Job creation timestamp

    $job->finishedAt                 // Job completion timestamp

    $job->jobItems                   // Array of crawled items

    $job->recommendedPullDelayMs     // Recommended delay between status checks

### [JobItem Object](#jobitem-object)

    $item->id                    // Unique item identifier

    $item->originalUrl           // URL of the crawled page

    $item->title                 // Page title

    $item->status                // Item status

    $item->pageStatusCode        // HTTP status code

    $item->markdownContentUrl    // URL to markdown content (if applicable)

    $item->rawContentUrl         // URL to raw content

    $item->cleanedContentUrl     // URL to cleaned content

    $item->getContent()          // Method to get content based on scrape_type (returns null if not "done")

----
url: https://webcrawlerapi.com/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts
----

Comparison, Markdown, JSON, RAG

Markdown vs JSON: Choosing the Right Format for LLM Prompts

===========================================================

A practical comparison of Markdown and JSON for LLM prompt inputs, scraping outputs, and RAG ingestion, with clear tradeoffs and examples.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What Markdown is good at](#what-markdown-is-good-at)

*   [What JSON is good at](#what-json-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When Markdown should be used](#when-markdown-should-be-used)

*   [When JSON should be used](#when-json-should-be-used)

*   [Practical prompt patterns](#practical-prompt-patterns)

*   [Pattern 1: Markdown instructions + JSON output](#pattern-1-markdown-instructions-json-output)

*   [Pattern 2: Markdown report with embedded JSON blocks](#pattern-2-markdown-report-with-embedded-json-blocks)

*   [Node.js snippet: Extract a JSON code block from Markdown](#nodejs-snippet-extract-a-json-code-block-from-markdown)

*   [Conclusion](#conclusion)

Markdown and JSON are both used as "prompt data", but each triggers different failure modes. Markdown is usually chosen when humans are expected to read or edit the content. JSON is usually chosen when machines are expected to parse it reliably.

For a broader map of formats, read [Best Prompt Data](/blog/best-prompt-data) first.

Quick comparison

----------------

| Topic | Markdown | JSON |
| --- | --- | --- |
| Best for | Mixed text + structure | Strict structure + validation |
| Parsing reliability | Medium | High (when schema is used) |
| Human readability | High | Medium |
| LLM output stability | Medium | High (when keys are constrained) |
| Common failure | Broken structure in long docs | Trailing commas, quoting, schema drift |

What Markdown is good at

------------------------

Markdown mixes narrative text with lightweight structure (headings, bullet lists, code blocks). It is usually used when a human is expected to iterate on the prompt.

Typical uses:

*   Instructions and constraints that should be seen at a glance

*   A "report" style output that is expected to be read by a person

*   Small embedded JSON snippets inside fenced code blocks

Markdown output comparisons are covered in [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format) and [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format).

What JSON is good at

--------------------

JSON is a strict data format. It is usually used when a downstream step is going to parse the result and store it, validate it, or feed it into another system.

Typical uses:

*   Extracted fields from crawled pages (title, price, author, date)

*   RAG ingestion where chunk metadata is expected to be consistent

*   Pipelines where schema validation is needed

A related format tradeoff is covered in [JSON vs YAML](/blog/json-vs-yaml-choosing-the-right-format-for-llm-prompts).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When Markdown should be used

Markdown is usually preferred when:

*   The output is expected to be read by a human (audits, summaries, notes)

*   The result includes long text where strict structure is not required

*   The model is expected to quote passages and keep them readable

A common pattern is: JSON is used for extracted fields, while Markdown is used for a human-facing explanation.

### When JSON should be used

JSON is usually preferred when:

*   The output must be parsed without ambiguity

*   A contract is needed (schema, required keys, value types)

*   Records are expected to be stored in a database as objects

*   RAG metadata (url, title, headings, chunk\_id) must be consistent

If the content is tabular, [JSON vs CSV](/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts) can be a better comparison to read next.
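For the RAG metadata case above, a minimal Python sketch of a consistent chunk record may help. The `ChunkMetadata` class, the `make_chunk_metadata` helper, and the hash-based `chunk_id` scheme are illustrative assumptions, not part of any WebcrawlerAPI SDK; only the four field names come from the list above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    """One record per chunk; every chunk carries the same four keys."""
    url: str
    title: str
    headings: tuple
    chunk_id: str


def make_chunk_metadata(url: str, title: str, headings: list, index: int) -> ChunkMetadata:
    """Derive a stable chunk_id from the URL and chunk position (illustrative scheme)."""
    digest = hashlib.sha256(f"{url}#{index}".encode("utf-8")).hexdigest()[:12]
    return ChunkMetadata(url=url, title=title, headings=tuple(headings), chunk_id=digest)


def to_json(meta: ChunkMetadata) -> str:
    """Serialize with sorted keys so identical records produce identical text."""
    return json.dumps(asdict(meta), sort_keys=True)
```

Because the record is a fixed dataclass rather than a free-form dictionary, a missing or misspelled key fails when the record is constructed instead of surfacing later during retrieval.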

Practical prompt patterns

-------------------------

### Pattern 1: Markdown instructions + JSON output

This pattern is often used to keep instructions readable while forcing the model to emit parseable data.

*   Instructions are written in Markdown

*   Output is required as JSON only, with an example object

*   A validator is used in the pipeline
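The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the key set (`title`, `price`, `url`), the `build_prompt` and `validate_output` helpers, and the prompt wording are hypothetical, and no particular LLM SDK is implied.

```python
import json

# Hypothetical contract for illustration: required keys and their types.
REQUIRED_KEYS = {"title": str, "price": (int, float), "url": str}

# Instructions stay in Markdown so they are easy for a human to read and edit.
PROMPT_TEMPLATE = """\
## Task
Extract product data from the page below.

## Rules
- Return JSON only, with no prose and no code fences.
- Use exactly these keys: title, price, url.

## Example output
{example}

## Page
{page}
"""


def build_prompt(page_text: str) -> str:
    """Fill the Markdown template with an example object and the page text."""
    example = json.dumps({"title": "Example", "price": 9.99, "url": "https://example.com"})
    return PROMPT_TEMPLATE.format(example=example, page=page_text)


def validate_output(raw: str) -> dict:
    """Parse the model reply and enforce the key/type contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data
```

The validator is deliberately small; a pipeline that needs nested structures or value ranges would likely swap it for a schema-based validator.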

### Pattern 2: Markdown report with embedded JSON blocks

This pattern is often used when both humans and machines are involved.

*   A short JSON block is embedded in a fenced code block

*   The rest is written as narrative Markdown

Node.js snippet: Extract a JSON code block from Markdown

--------------------------------------------------------

This snippet is intentionally simple. If multiple JSON blocks are expected, iteration should be added.

    // Node 18+

    // Extract the first ```json ... ``` block from Markdown and parse it.

    import { readFile } from "node:fs/promises";

    const md = await readFile("output.md", "utf8");

    const match = md.match(/```json\s*([\s\S]*?)\s*```/i);

    if (!match) {

      throw new Error("No ```json``` block found");

    }

    const jsonText = match[1];

    const data = JSON.parse(jsonText);

    console.log("Parsed keys:", Object.keys(data));
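When several JSON blocks are expected, the same idea extends to iteration. The Python sketch below is one possible way to do it; the `extract_json_blocks` name is an assumption made for illustration, and the fence marker is built programmatically only to avoid writing a literal fence inside the example.

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled here instead of written literally

# Matches every fenced block tagged "json"; DOTALL lets "." span newlines.
JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json_blocks(markdown: str) -> list:
    """Return the parsed value of every fenced JSON block, in document order."""
    return [json.loads(body) for body in JSON_BLOCK.findall(markdown)]
```

A block that contains invalid JSON raises `json.JSONDecodeError`, which stops the whole extraction; a more tolerant variant could collect errors per block instead.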

Conclusion

----------

*   Markdown is usually chosen for human readability and mixed narrative content.

*   JSON is usually chosen for strict extraction, validation, and reliable downstream parsing.

*   For many crawling and RAG pipelines, a hybrid approach is used: Markdown for instructions and JSON for results.

If a plain narrative output is being considered, [Markdown vs Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts) should be compared too.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-expose-connection-in-cdpbrowsercontext
----

### Summary

This change adds a public getter to CdpBrowserContext to expose the internal Connection object. It returns the private #connection while preserving encapsulation.

### How to use

    // Assuming you have a CdpBrowserContext instance

    const connection = cdpBrowserContext.connection; // Connection

    // Example: send a direct CDP command

    await connection.send('Target.setAutoAttach', { /* options */ });

### Details

*   This is an additive change that does not modify existing APIs.

*   No external documentation updates are required for this internal API addition.

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-is-the-correct-event-to-create-a-response-in-puppeteer-webdriver
----

To align with the protocol behavior, create the Response when the responseStarted event fires, rather than after the response is fully received. This makes the Response object available earlier and mirrors how the network start is reported by the underlying CDP. Implement by listening for the responseStarted event in your transport layer and constructing the Response at that moment. Example:

    // concept example

    client.on('Network.responseStarted', (params) => {

      // instantiate and fill a response object immediately

      const response = new Response(params.requestId);

      response.url = params.response.url;

      response.status = params.response.status;

      // ... fill other properties as needed

    });

This approach ensures the response exists as soon as the network response begins, matching the intended behavior.

----
url: https://webcrawlerapi.com/glossary/scraping/how-is-web-scraping-different-from-web-crawling
----

### Answer

Web crawling is about discovering and fetching pages, while web scraping is about extracting data from those pages. Crawlers focus on link traversal, deduplication, and scheduling. Scrapers focus on selectors, parsing logic, and data quality. You often crawl to build a URL list, then scrape those pages. Both can be part of the same pipeline, but they solve different problems.
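The split between the two concerns can be sketched with the Python standard library. This is an illustrative example only: `discover_links` stands in for the crawling step (link traversal and deduplication) and `extract_record` for the scraping step (field extraction); both names, and the choice of the page title as the extracted field, are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects hrefs (a crawling concern) and the <title> text (a scraping concern)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def discover_links(base_url: str, html: str) -> list:
    """Crawling step: resolve links to absolute URLs and drop duplicates."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    seen, ordered = set(), []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


def extract_record(url: str, html: str) -> dict:
    """Scraping step: pull structured fields out of one fetched page."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}
```

In a pipeline, the output of `discover_links` would feed a fetch queue, while `extract_record` would run on each fetched page.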

----
url: https://webcrawlerapi.com/glossary/scraping/what-are-common-web-scraping-tools
----

### Answer

Common tools include Beautiful Soup, Scrapy, Playwright, Puppeteer, and Selenium. Lightweight parsers are great for static HTML and simple extraction. Browser automation tools help when content is rendered by JavaScript. Managed scraping platforms reduce infrastructure work but can be more expensive. The best tool depends on scale, complexity, and maintenance needs.

----
url: https://webcrawlerapi.com/glossary?category=playwright
----

Web Scraping & API Glossary

===========================

Comprehensive glossary of web scraping, crawling, and API terms. Learn the essential concepts and terminology used in web data extraction.


----
url: https://webcrawlerapi.com/docs/crawling-types
----

Crawling output format types

============================

There are several types of scraping you can perform with WebcrawlerAPI. You can control it by setting the `scrape_type` parameter in the request.

Currently supported types are:

*   `markdown` - returns the content of the page in markdown format.

*   `cleaned` - returns the cleaned content of the page.

*   `html` - returns the raw HTML of the page.
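As a rough illustration, a client could validate the `scrape_type` value before sending a request. The Python sketch below only builds a JSON body and does not call the API; the `build_request_body` helper and the `url` field name are assumptions, while the three accepted values come from the list above.

```python
import json

# The three values listed above.
SUPPORTED_SCRAPE_TYPES = ("markdown", "cleaned", "html")


def build_request_body(url: str, scrape_type: str = "html") -> str:
    """Return a JSON request body, rejecting unsupported scrape_type values early."""
    if scrape_type not in SUPPORTED_SCRAPE_TYPES:
        raise ValueError(f"unsupported scrape_type: {scrape_type!r}")
    # Assumption: the target page is passed in a field named "url".
    return json.dumps({"url": url, "scrape_type": scrape_type})
```

Checking the value locally means a typo such as `"markdwon"` fails before a request is spent on it.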

### [Markdown format](#markdown-format)

The markdown type returns the page content with lightweight Markdown formatting such as headings, links, and lists.

Markdown is often more useful than plain text as reference data for LLMs, because the formatting gives extra signals about the "weight" of the words. For example, headings indicate what the following text is about. This can help the model understand the context and produce better results.

Example of markdown formatted text:

    # Example Domain

    This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

    [More information...](https://www.iana.org/domains/example)

(See also the [Scrape API](https://webcrawlerapi.com/docs/api/scrape).)

### [Cleaned scraping](#cleaned-scraping)

Cleaned scraping removes unnecessary elements from the page and returns the cleaned content. [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/) is used to clean the data.

To use it, set the `scrape_type` option to `cleaned`.

Example of cleaned content:

    Example Domain

    Example Domain

        This domain is for use in illustrative examples in documents. You may use this

        domain in literature without prior coordination or asking for permission.

    More information...

### [HTML scraping](#html-scraping)

HTML scraping is the most basic type of scraping. It returns the raw HTML of the page. No manipulation is done on the content.

This is the default scrape option. To use it, omit the `scrape_type` option in the request or set it to `html`.

Example of the content:

    <!doctype html>

    <html>

    <head>

        <title>Example Domain</title>

        <meta charset="utf-8" />

        <meta http-equiv="Content-type" content="text/html; charset=utf-8" />

        <meta name="viewport" content="width=device-width, initial-scale=1" />

    </head>

    <body>

    <div>

        <h1>Example Domain</h1>

        <p>This domain is for use in illustrative examples in documents. You may use this

        domain in literature without prior coordination or asking for permission.</p>

        <p><a href="https://www.iana.org/domains/example">More information...</a></p>

    </div>

    </body>

    </html>


----
url: https://webcrawlerapi.com/blog/json-vs-yaml-choosing-the-right-format-for-llm-prompts
----

Comparison, JSON, YAML, RAG

JSON vs YAML: Choosing the Right Format for LLM Prompts

=======================================================

JSON vs YAML for prompt data and scraped outputs: schema, validation, typing, and what breaks in real pipelines.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What JSON is good at](#what-json-is-good-at)

*   [What YAML is good at](#what-yaml-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When JSON should be used](#when-json-should-be-used)

*   [When YAML should be used](#when-yaml-should-be-used)

*   [Practical failure modes](#practical-failure-modes)

*   [YAML implicit types](#yaml-implicit-types)

*   [JSON strictness is a feature](#json-strictness-is-a-feature)

*   [Node.js snippet: Enforce "JSON only" output in a pipeline](#nodejs-snippet-enforce-json-only-output-in-a-pipeline)

*   [Conclusion](#conclusion)

JSON and YAML solve the same general problem: structured data. The difference is that JSON is strict, while YAML is flexible and human-friendly. That flexibility is where most surprises are introduced.

A broader guide is available in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | JSON | YAML |
| --- | --- | --- |
| Best for | Machine parsing, APIs, validation | Human-edited config and small manifests |
| Schema validation | Strong | Possible, but less common in practice |
| Comments | Not supported | Supported |
| Typing surprises | Fewer | More (implicit types) |
| Common failure | Trailing commas, quoting | Indentation, implicit booleans/dates |

What JSON is good at

--------------------

JSON is usually preferred when:

*   A downstream parser must not guess

*   A contract is required (keys, types, required fields)

*   Data is being stored as objects or sent over APIs

JSON paired with Markdown is covered in [Markdown vs JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts).

What YAML is good at

--------------------

YAML is usually preferred when:

*   Humans will edit the output

*   Comments are useful

*   Config-like nesting is needed, but strictness is not

If a readable document is needed instead of config, Markdown is often selected, as covered in [Markdown vs YAML](/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When JSON should be used

JSON is usually the safer choice when:

*   Page extractions will be inserted into a database

*   A batch crawl produces many records that must be merged or deduped

*   RAG metadata must be consistent across all chunks

### When YAML should be used

YAML is usually a fit when:

*   Extraction rules are being generated and edited manually

*   A "job spec" is being passed around by humans

*   Small manifests are being produced where a strict validator is not needed

For tabular datasets, CSV can be compared in [JSON vs CSV](/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts).

Practical failure modes

-----------------------

### YAML implicit types

In YAML, unquoted values such as `on` and `yes` can be interpreted as booleans, `2026-02-01` as a date, and `123` as a number, depending on the parser. In scraping, that can silently change the meaning of an extracted string.
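One defensive option is to quote risky scalars before writing YAML. The Python sketch below is a heuristic under stated assumptions, not a YAML parser: the `needs_quoting` and `yaml_safe_scalar` helpers are hypothetical, and the patterns only approximate the boolean, null, number, and date forms that some parsers resolve implicitly.

```python
import re

# Heuristic patterns only; they do not cover every form a YAML parser may resolve.
_BOOL_LIKE = {"y", "n", "yes", "no", "on", "off", "true", "false"}
_NULL_LIKE = {"", "~", "null"}
_NUMBER_LIKE = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?")
_DATE_LIKE = re.compile(r"\d{4}-\d{2}-\d{2}")


def needs_quoting(value: str) -> bool:
    """Return True if a plain scalar might be read as a non-string type."""
    text = value.strip()
    lowered = text.lower()
    return (
        lowered in _BOOL_LIKE
        or lowered in _NULL_LIKE
        or _NUMBER_LIKE.fullmatch(text) is not None
        or _DATE_LIKE.fullmatch(text) is not None
    )


def yaml_safe_scalar(value: str) -> str:
    """Double-quote risky scalars so they stay strings; leave others unchanged."""
    if needs_quoting(value):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return value
```

This only addresses type surprises. Strings containing YAML syntax characters such as `:` or `#` need quoting for other reasons, which a real YAML emitter handles.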

### JSON strictness is a feature

The strictness is usually annoying for humans, but it is valuable for pipelines. If the model emits invalid JSON, the failure is immediate and detectable.
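How detectable the failure is can be shown with Python's standard `json` module, whose `JSONDecodeError` carries the line and column of the problem. The `parse_or_report` helper is a hypothetical name used for illustration.

```python
import json


def parse_or_report(text: str):
    """Return (data, None) on success or (None, message) locating the failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based positions provided by the standard library.
        return None, f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
```

Logging the location alongside the raw model output makes it easier to tell a truncated reply from a quoting mistake.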

Node.js snippet: Enforce "JSON only" output in a pipeline

---------------------------------------------------------

The simplest enforcement is to attempt parsing and fail the job if parsing fails. Failing fast makes invalid output visible immediately, so the prompt can be tightened before bad data reaches later stages.

    // Node 18+

    // Fail fast if JSON is invalid.

    import { readFile } from "node:fs/promises";

    const text = await readFile("output.json", "utf8");

    let data;

    try {

      data = JSON.parse(text);

    } catch (e) {

      console.error("Invalid JSON output:", e.message);

      process.exit(1);

    }

    console.log("OK:", Array.isArray(data) ? "array" : "object");

Conclusion

----------

*   JSON is usually selected for reliability, validation, and downstream parsing.

*   YAML is usually selected for human-edited config-like content and comments.

*   For most scraping and RAG ingestion pipelines, JSON is usually the default unless human editing is a core requirement.

If minimal text is desired instead of structured data, [JSON vs Plain Text](/blog/json-vs-plain-text-choosing-the-right-format-for-llm-prompts) should be read next.

----
url: https://webcrawlerapi.com/changelog/2025-07-27-mcp-integration
----

July 27, 2025

MCP Integration Available

=========================

----
url: https://webcrawlerapi.com/blog/what-is-a-web-crawling-api
----

API, Technical, Start here, 1 min read

What is a web crawling API?

===========================

A web crawling API allows developers to retrieve web data efficiently and programmatically, enabling the extraction of content from a website.

Written by Andrew

Published on Jan 31, 2026

### Table of Contents

*   [Key Benefits:](#key-benefits)

*   [Understanding Web Crawling Basics](#understanding-web-crawling-basics)

*   [What is Web Crawling and Its Challenges?](#what-is-web-crawling-and-its-challenges)

*   [How a Web Crawling API Simplifies Data Extraction](#how-a-web-crawling-api-simplifies-data-extraction)

*   [Features of a Web Crawling API](#features-of-a-web-crawling-api)

*   [Scalability and Reliability of APIs](#scalability-and-reliability-of-apis)

*   [Using WebCrawlerAPI: A Practical Example](#using-webcrawlerapi-a-practical-example)

*   [Code Example: Data Extraction with WebCrawlerAPI](#code-example-data-extraction-with-webcrawlerapi)

*   [Customizing API Requests](#customizing-api-requests)

*   [Benefits of Using a Web Crawling API](#benefits-of-using-a-web-crawling-api)

*   [Efficiency and Scalability](#efficiency-and-scalability)

*   [Reliability and Accuracy](#reliability-and-accuracy)

*   [Flexibility and Customization](#flexibility-and-customization)

*   [FAQs](#faqs)

*   [What is the difference between API and web scraping?](#what-is-the-difference-between-api-and-web-scraping)

A **[web crawling API](https://webcrawlerapi.com/scrapers/webcrawler/crawler/api)** is a tool that automates the process of collecting data from websites, saving you the effort of writing complex code. It handles challenges like bypassing anti-bot systems, managing large-scale data, and rendering JavaScript-heavy pages.

### Key Benefits:

*   **Simplifies data extraction**: Extracts data in formats like JSON, HTML, Text, or Markdown.

*   **Handles technical hurdles**: Bypasses CAPTCHAs, rotates proxies, and mimics browser behavior.

*   **Scalable and reliable**: Processes large volumes of data efficiently with error handling.

For example, APIs like [WebCrawlerAPI](https://webcrawlerapi.com/) let you focus on analyzing data instead of worrying about infrastructure or anti-scraping techniques. Whether you're tracking competitor pricing or aggregating data for AI models, these APIs make it faster and easier.

Understanding Web Crawling Basics

---------------------------------

[Web crawling](https://webcrawlerapi.com/scrapers/webcrawler/crawler/description) is the automated process of scanning and indexing web pages by systematically browsing them and following links. Think of it as a tool that maps out the web, downloads pages, and organizes the data into an interconnected structure.

### What is Web Crawling and Its Challenges?

Web crawling involves scanning and processing web content, but it comes with its fair share of hurdles:

*   **Anti-bot defenses**: Tools like CAPTCHAs and IP blocking are designed to stop automated access.

*   **Dynamic content**: Pages that load content through JavaScript require more advanced handling.

*   **Data volume**: Extracting and managing large-scale data responsibly can be overwhelming.

*   **Frequent updates**: Websites often change content, making it tricky to keep data current.

These challenges are why many developers rely on [web crawling APIs](https://webcrawlerapi.com/blog/what-is-a-web-crawling-api). These APIs handle the heavy lifting - like bypassing technical barriers and managing infrastructure - so developers can focus on analyzing the data rather than worrying about how it’s collected.

Specialized web crawling APIs have become essential for tackling these issues. They offer scalable, efficient solutions while adhering to website policies and technical standards, simplifying the process of extracting and using web data.

How a Web Crawling API Simplifies Data Extraction

-------------------------------------------------

Web crawling APIs make extracting data much easier by offering pre-built tools and solutions. These tools take care of the technical challenges, providing organized, ready-to-use data for various applications.

### Features of a Web Crawling API

WebCrawlerAPI, for example, simplifies the process with features like anti-bot measures and JavaScript rendering. It also includes automated proxy rotation and optimized requests to ensure smooth access to even the most complex websites.

The API supports various output formats, catering to different requirements:

| Output Format | Benefits |
| --- | --- |
| JSON | Simple to parse and manipulate data |
| HTML | Maintains the original structure and styling |
| Text | Provides clean content without any markup |
| Markdown | Keeps structure intact with lightweight formatting |

### Scalability and Reliability of APIs

Scalability and reliability are key when dealing with large-scale data extraction. Web crawling APIs use distributed systems to handle high volumes of requests and ensure uptime with automated error handling and backup systems.

If a request fails, the API retries using alternative methods to retrieve the data. WebCrawlerAPI’s setup supports thousands of simultaneous requests while maintaining quality and speed. Plus, its pay-per-use model eliminates the need for businesses to worry about maintaining their own infrastructure.

Using [WebCrawlerAPI](https://webcrawlerapi.com/): A Practical Example

----------------------------------------------------------------------

Let's dive into how WebCrawlerAPI works with a straightforward example.

### Code Example: Data Extraction with WebCrawlerAPI

Below is a [Python](https://webcrawlerapi.com/blog/how-to-crawl-the-website-with-python) script that sends a crawl request with optional proxy and wait-time settings:

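    import json

    import requests

    api_key = "your_api_key"
    target_url = "https://example.com"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0",
    }

    # requests cannot encode nested dicts as query parameters, so the proxy
    # settings are serialized to a JSON string (an assumption about the format
    # the API expects). Passing the target URL via params gets it URL-encoded.
    params = {
        "url": target_url,
        "proxy": json.dumps({"type": "residential", "country": "US"}),
        "wait_time": 2000,
    }

    response = requests.get(
        "https://api.webcrawlerapi.com/v1/crawl",
        headers=headers,
        params=params,
        timeout=30,  # avoid hanging forever on a stalled connection
    )

    if response.status_code == 200:
        print(response.json())
    else:
        print(f"Failed to retrieve data (HTTP {response.status_code})")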

This script sends a GET request to WebCrawlerAPI, and the response is returned in JSON format, making it simple to process and integrate.

### Customizing API Requests

WebCrawlerAPI offers several options to tailor requests for specific needs, like using proxies or modifying headers. Here's a breakdown of the main parameters:

| Parameter | Purpose | Example Value |
| --- | --- | --- |
| `proxy` | Define proxy type | `{"type": "residential"}` |
| `headers` | Mimic browser headers | `{"User-Agent": "Mozilla/5.0"}` |
| `cookies` | Manage sessions | `{"sessionId": "abc123"}` |
| `wait_time` | Add delay between requests | `2000` (milliseconds) |

These parameters are especially useful when dealing with complex websites. The API takes care of proxy management and request optimization, freeing developers to focus on analyzing the extracted data.

Fine-tuning these settings helps keep data gathering precise and efficient.

Benefits of Using a Web Crawling API

------------------------------------

Understanding how WebCrawlerAPI functions is one thing, but grasping the broader advantages of web crawling APIs can help developers, data scientists, and AI professionals make the most of these tools for reliable data extraction.

### Efficiency and Scalability

Web crawling APIs make large-scale data extraction much faster and easier. Research by [Oxylabs](https://oxylabs.io/) highlights that these tools can cut data extraction time by up to 90% compared to older scraping methods. They also allow users to manage multiple websites and process large amounts of data at the same time, all without overloading servers.

### Reliability and Accuracy

These APIs rely on advanced algorithms to ensure consistent and precise results. With features like JavaScript rendering, anti-bot measures, and error handling, they provide dependable data extraction. Plus, their ability to meet specific project requirements adds an extra layer of usefulness.

### Flexibility and Customization

Web crawling APIs offer options to customize requests, such as selecting output formats, configuring proxies, or adding custom headers. This makes them suitable for a wide range of applications, including gathering e-commerce data or aggregating content for AI training models, all while staying compliant with the terms of service of target websites.

Conclusion: Why Use a Web Crawling API?

---------------------------------------

Web crawling APIs have simplified data extraction, making it faster and more accessible for businesses and developers. They’ve changed how large-scale data collection is handled, offering a practical solution for complex tasks.

The key advantage of a web crawling API is its ability to **automate and simplify data extraction**. This reduces the need for manual work, allowing teams to focus on analyzing data and building applications instead [\[1\]](https://en.wikipedia.org/wiki/Web_spider)[\[3\]](https://www.elastic.co/what-is/web-crawler).

For example, APIs like WebCrawlerAPI help businesses collect data efficiently. E-commerce platforms can gather competitor pricing information, and content aggregators can pull data from multiple sources seamlessly [\[5\]](https://oxylabs.io/blog/crawling-vs-scraping). What once took weeks can now be done in just hours.

These APIs also tackle technical hurdles like JavaScript rendering and anti-bot defenses, all while staying compliant with website policies. They handle infrastructure management, so you don’t have to.

Web crawling APIs are scalable, adapting to your growing data needs without adding infrastructure headaches [\[3\]](https://www.elastic.co/what-is/web-crawler)[\[2\]](https://www.techtarget.com/whatis/definition/crawler). With automated and customizable data collection options, they’re an essential tool for any data-driven project.

FAQs

----

### What is the difference between a web crawling API and web scraping?

Web crawling APIs and web scraping both extract data from websites, but they do so in different ways. Here's a side-by-side comparison:

| Feature | Web Crawling API | Web Scraping |
| --- | --- | --- |
| **Access Method** | Uses a structured interface | Extracts data directly from websites |
| **Reliability** | High, thanks to managed infrastructure | Can vary, depending on site changes |
| **Compliance** | Automatically follows website rules | Requires manual compliance setup |
| **Scalability** | Automatically scales as needed | Limited by your own infrastructure |
| **Maintenance** | Handled by the API provider | Requires frequent updates to keep up with changes |

APIs tackle challenges like rate limits, JavaScript rendering, and anti-bot defenses without requiring additional effort on your part. They’re designed to simplify the process for users, offering a more seamless experience.

On the other hand, traditional web scraping provides more flexibility but demands a high level of technical expertise. It also requires constant updates to adapt to changes in website structures [\[1\]](https://en.wikipedia.org/wiki/Web_spider)[\[3\]](https://www.elastic.co/what-is/web-crawler). Think of it like this: using an API is like taking a ride-sharing service - it saves time and effort compared to maintaining and driving your own car.

While [web scraping tools](https://webcrawlerapi.com/scrapers) can pull data from almost any publicly available site, APIs may have restrictions on what data they can access. However, APIs often provide a more stable and efficient way to get the information you need [\[4\]](https://brightdata.com/blog/web-data/what-is-a-web-crawler)[\[5\]](https://oxylabs.io/blog/crawling-vs-scraping). Understanding these differences can help you choose the right tool for your data extraction needs.

----
url: https://webcrawlerapi.com/blog/5-famous-web-scraping-court-cases-where-scrapers-lost
----

Legal, Web Scraping, Web Crawling

5 Famous Web Scraping Court Cases Where Scrapers Lost

=====================================================

Five well-cited scraping cases where courts sided with the target site or publisher, including what claims stuck, what was ordered, and what scrapers should learn.

Written by Andrew

Published on Feb 8, 2026

### Table of Contents

*   [5 Famous Web Scraping Court Cases Where Scrapers Lost](#5-famous-web-scraping-court-cases-where-scrapers-lost)

*   [1) eBay v. Bidder's Edge (N.D. Cal. 2000) - injunction against an auction-aggregator crawler](#1-ebay-v-bidders-edge-nd-cal-2000---injunction-against-an-auction-aggregator-crawler)

*   [2) Register.com v. Verio (2d Cir. 2004) - injunction upheld for automated WHOIS harvesting](#2-registercom-v-verio-2d-cir-2004---injunction-upheld-for-automated-whois-harvesting)

*   [3) Facebook v. Power Ventures (9th Cir. 2016) - post-notice automation treated as unauthorized access](#3-facebook-v-power-ventures-9th-cir-2016---post-notice-automation-treated-as-unauthorized-access)

*   [4) Associated Press v. Meltwater (S.D.N.Y. 2013) - news scraping + excerpts held not fair use](#4-associated-press-v-meltwater-sdny-2013---news-scraping-excerpts-held-not-fair-use)

*   [5) Ryanair DAC v. Flightbox (Ireland, 2023) - default judgment + injunction over screen scraping](#5-ryanair-dac-v-flightbox-ireland-2023---default-judgment-injunction-over-screen-scraping)

Web scraping is not automatically legal or illegal.

Legality depends on what you access (public vs. behind login), what you copy (facts vs. creative expression), how you access it (respecting access controls and technical measures), what the Terms of Service say (contract), and which jurisdiction applies (US vs. EU/UK rules can differ a lot).

If you want a practical baseline before reading case law, start with the boring-but-important ethics and operational behavior:

*   [Web Scraping Ethics: What is legal and what is not?](/blog/web-scraping-ethics)

*   [How to Build a Web Crawler](/blog/how-to-build-a-web-crawler)

*   [What is the difference between web crawling and scraping?](/blog/what-is-the-difference-between-web-crawling-and-scraping)

This post is not legal advice.

1) eBay v. Bidder's Edge (N.D. Cal. 2000) - injunction against an auction-aggregator crawler

--------------------------------------------------------------------------------------------

**Who scraped whom:** Bidder's Edge scraped eBay.

**What data:** Auction listings and bidding status.

**Claims that mattered:** The case is widely cited for applying **trespass to chattels** reasoning to automated access.

**What the court did:** A **preliminary injunction** was granted, barring automated query programs/robots/crawlers from accessing eBay's systems without written authorization.

**Practical takeaway:** Even when the load looks "small," ignoring objections and scaling an automation pattern can get framed as unauthorized interference. If blocks and explicit objections are being pushed through, it is easier for a court to treat it as wrongful access rather than normal browsing.

Sources:

*   eBay, Inc. v. Bidder's Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000): [http://www.tomwbell.com/NetLaw/Ch06/eBay.html](http://www.tomwbell.com/NetLaw/Ch06/eBay.html)

2) Register.com v. Verio (2d Cir. 2004) - injunction upheld for automated WHOIS harvesting

------------------------------------------------------------------------------------------

**Who scraped whom:** Verio scraped Register.com's WHOIS service.

**What data:** WHOIS registrant contact details harvested via automated queries and used for solicitation.

**Claims that mattered:** The Second Circuit credited **contract / ToS-based** arguments (with repeated use after notice), and also addressed related **trespass to chattels** and marketing/confusion concerns.

**What the court did:** A **preliminary injunction** was affirmed, restricting automated access and certain downstream uses of harvested WHOIS data.

**Practical takeaway:** "No click = no contract" is not a safe assumption when the terms are repeatedly encountered and the scraper has notice. Pairing scraping with unsolicited marketing tends to make the whole case worse.

Sources:

*   Register.com, Inc. v. Verio, Inc., 356 F.3d 393 (2d Cir. 2004) (summary + PDF link): [http://www.internetlibrary.com/cases/lib\_case392.cfm](http://www.internetlibrary.com/cases/lib_case392.cfm)

*   Opinion PDF: [http://www.internetlibrary.com/pdf/register-verio-2d-cir.pdf](http://www.internetlibrary.com/pdf/register-verio-2d-cir.pdf)

3) Facebook v. Power Ventures (9th Cir. 2016) - post-notice automation treated as unauthorized access

-----------------------------------------------------------------------------------------------------

**Who accessed whom:** Power Ventures accessed Facebook.

**What data:** Facebook user data and user-initiated messaging flows were accessed/triggered through automation; access continued after Facebook objected and technical blocking measures were involved.

**Claims that mattered:** The opinion is widely cited for how "authorization" was analyzed in the **CFAA** context when access continued after notice and blocks.

**What the court did:** The Ninth Circuit affirmed key parts of Facebook's win (and reversed/vacated other pieces), leaving Power Ventures in a losing posture on core theories tied to post-notice, post-block access.

**Practical takeaway:** When scraping/automation relies on user credentials or tokens and is continued after explicit objection and blocking steps, it stops looking like "public web scraping" and starts looking like access-control evasion.

Sources:

*   Facebook, Inc. v. Power Ventures, Inc., 844 F.3d 1058 (9th Cir. 2016) (PDF): [https://cdn.ca9.uscourts.gov/datastore/opinions/2016/07/12/13-17102.pdf](https://cdn.ca9.uscourts.gov/datastore/opinions/2016/07/12/13-17102.pdf)

4) Associated Press v. Meltwater (S.D.N.Y. 2013) - news scraping + excerpts held not fair use

---------------------------------------------------------------------------------------------

**Who scraped whom:** Meltwater scraped AP's news content and sold a monitoring product that included excerpts.

**What data:** Headlines/ledes/excerpts/snippets from AP articles (copyrighted expression).

**Claims that mattered:** **Copyright infringement** was found and Meltwater's **fair use** arguments were rejected in the decision.

**What the court did:** Summary judgment was granted against Meltwater on core copyright issues, limiting the "scrape + redistribute snippets" pattern when it looks like a paid substitute for the original.

**Practical takeaway:** Scraping facts is one thing; republishing expressive text at scale is another. If the output competes with the publisher (or replaces reading the source), fair use becomes hard to defend.

Sources:

*   Opinion (archived): [https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?article=1321&context=historical](https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?article=1321&context=historical)

*   U.S. Copyright Office Fair Use Index summary: [https://www.copyright.gov/fair-use/summaries/ap-meltwater-sdny2013.pdf](https://www.copyright.gov/fair-use/summaries/ap-meltwater-sdny2013.pdf)

5) Ryanair DAC v. Flightbox (Ireland, 2023) - default judgment + injunction over screen scraping

------------------------------------------------------------------------------------------------

**Who scraped whom:** Flightbox (a tech provider) was alleged to facilitate screen scraping of Ryanair.

**What data:** Flight/price/timetable data extracted via automated software and surfaced in third-party flows.

**Claims that mattered:** The case turned heavily on **website terms of use**, procedural/jurisdiction mechanics, and breach-related theories. (This was a default judgment; the defendant did not enter an appearance.)

**What the court did:** Judgment in default was allowed and **injunctive and declaratory relief** was granted, including a **prohibitory injunction** and a **quia timet injunction** (aimed at preventing likely future breach).

**Practical takeaway:** Outside the US, scraping fights often turn on contract/terms + injunction mechanics. If access is structured around assent-based terms, injunctive relief can follow.

Sources:

*   Law firm analysis (citing _\[2023\] IEHC 689_): [https://www.mhc.ie/latest/insights/screen-scraping-latest-jurisdiction-and-default-judgment-confirmed](https://www.mhc.ie/latest/insights/screen-scraping-latest-jurisdiction-and-default-judgment-confirmed)

*   Case summary: [https://www.casemine.com/judgement/uk/6581ef6c86838c62d2fc4551](https://www.casemine.com/judgement/uk/6581ef6c86838c62d2fc4551)

Closing: what these losses do (and do not) mean

-----------------------------------------------

These cases do not mean "scraping is always illegal." They show that legality depends on context.

Patterns that show up again and again:

*   A public page is not the same as permission to automate at scale.

*   Notice + continued access (especially with block evasion) tends to be a turning point.

*   Copying expressive content (not just facts) escalates risk quickly.

*   Courts often reach for **injunction-shaped** remedies when ongoing automation is framed as interference, breach, or substitution.

If scraping must be done, risk is usually reduced by:

*   Getting permission or using an official API/license when it exists.

*   Reading (and following) ToS; re-checking it periodically.

*   Avoiding auth bypass and avoiding block-evasion patterns.

*   Implementing strict rate limits, backoff, and respecting 429/Retry-After.

*   Respecting robots.txt and honoring opt-outs.

*   Minimizing what is copied (fields only; avoid republishing expressive text).

*   Being careful with personal data (privacy rules still apply even to "public" pages).
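
The rate-limit item in the list above (backoff, 429, `Retry-After`) can be sketched roughly as follows. This is a minimal illustration, not a complete crawler policy: the function names `parseRetryAfterMs` and `retryDelayMs`, the 1-second base delay, and the 60-second cap are assumptions chosen for the example. It treats a server-provided `Retry-After` value (seconds or an HTTP date) as authoritative and falls back to capped exponential backoff when the header is missing.

```typescript
// Parse a Retry-After header value, which is either a number of seconds
// or an HTTP-date. Returns the wait in milliseconds, or null if unusable.
function parseRetryAfterMs(value: string | null, nowMs: number): number | null {
  if (value === null) return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return null;
  // A date in the past means "retry now", never a negative wait.
  return Math.max(0, dateMs - nowMs);
}

// Decide how long to wait before retrying a request.
// Returns null when the status does not call for a delayed retry.
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  attempt: number, // 0 for the first retry
  nowMs: number,
  baseMs: number = 1000, // hypothetical default
  maxMs: number = 60000, // hypothetical default
): number | null {
  if (status !== 429 && status !== 503) return null;
  const serverDelay = parseRetryAfterMs(retryAfter, nowMs);
  // The server's explicit instruction takes priority over our own schedule.
  if (serverDelay !== null) return serverDelay;
  // Otherwise back off exponentially, capped at maxMs.
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

A caller would sleep for the returned number of milliseconds before the next attempt and give up after a fixed number of attempts. Checking robots.txt and honoring opt-outs are separate steps that this sketch does not cover.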

----
url: https://webcrawlerapi.com/glossary/puppeteer/why-does-http-request-headers-not-allow-mutating-data-in-puppeteer
----

`HttpRequest.headers()` no longer allows mutating the returned headers object. The change was made to prevent accidental mutations of the underlying headers. If you need to modify headers, create a new headers object and pass it when continuing the request, or apply default headers globally via `setExtraHTTPHeaders`.
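
A minimal sketch of the "create a new headers object" approach is shown below. To keep it self-contained, `RequestLike` is a hypothetical structural stand-in that models only the `headers()` and `continue()` methods of an intercepted Puppeteer request; `withExtraHeaders` and `continueWithHeaders` are illustrative helper names, not part of Puppeteer.

```typescript
// Hypothetical stand-in for an intercepted Puppeteer request: only the two
// methods used in this sketch are modeled.
interface RequestLike {
  headers(): Record<string, string>;
  continue(overrides?: { headers?: Record<string, string> }): Promise<void>;
}

// Build a new headers object instead of mutating the one returned by headers().
function withExtraHeaders(
  original: Readonly<Record<string, string>>,
  extra: Readonly<Record<string, string>>,
): Record<string, string> {
  const merged: Record<string, string> = { ...original };
  for (const name of Object.keys(extra)) {
    // Lower-case the name so an override replaces an existing lower-case entry.
    merged[name.toLowerCase()] = extra[name];
  }
  return merged;
}

// Continue the request with the copied-and-extended headers.
function continueWithHeaders(
  request: RequestLike,
  extra: Record<string, string>,
): Promise<void> {
  return request.continue({ headers: withExtraHeaders(request.headers(), extra) });
}
```

With Puppeteer itself, a handler like this would typically be registered via `page.on('request', ...)` after enabling `page.setRequestInterception(true)`. For headers that should apply to every request, `page.setExtraHTTPHeaders` avoids interception altogether.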

----
url: https://webcrawlerapi.com/blog
----

Start here

----------

New to crawling or WebCrawlerAPI? These posts give you the right mental model.

### [What is RAG (Retrieval-Augmented Generation)?](/blog/what-is-rag)

Start here, RAG

Learn about RAG, a powerful technique that improves AI responses by combining language models with real-time information retrieval, making AI answers more accurate and up-to-date.

Apr 12, 2025 • 3 min read

### [What is webcrawling API?](/blog/what-is-a-web-crawling-api)

Start here, API

A web crawling API allows developers to retrieve web data efficiently and programmatically, enabling the extraction of content from a website.

Apr 16, 2024 • 1 min read

### [What is the difference between web crawling and scraping?](/blog/what-is-the-difference-between-web-crawling-and-scraping)

Start here, Technical

Scraping and crawling are techniques used to automate data retrieval from the Web. Though closely related, they have different goals and processes.

Apr 10, 2024 • 2 min read

### Want the code to match the articles?

Crawl websites into Markdown, cleaned text, or HTML with WebCrawlerAPI — no crawler maintenance.

[Try WebCrawlerAPI](https://dash.webcrawlerapi.com)

All articles

------------

Everything else — tutorials, comparisons, and technical deep dives.

### [5 Famous Web Scraping Court Cases Where Scrapers Won](/blog/5-famous-web-scraping-court-cases-where-scrapers-won)

Legal, Web Scraping

Five well-known court cases that favored scraping/crawling (or narrowed anti-scraping theories), plus practical takeaways on public data, CFAA, copyright, and EU database rights.

Feb 8, 2026

### [5 Famous Web Scraping Court Cases Where Scrapers Lost](/blog/5-famous-web-scraping-court-cases-where-scrapers-lost)

Legal, Web Scraping

Five well-cited scraping cases where courts sided with the target site or publisher, including what claims stuck, what was ordered, and what scrapers should learn.

Feb 8, 2026

### [How dom\_smoothie Rust Mozilla Readability alternative works](/blog/how-dom-smoothie-rust-mozilla-readability-alternative-works)

Rust, Technical

A practical, step-by-step explanation of how dom\_smoothie (Rust) works as a Mozilla Readability alternative for main-content extraction.

Feb 7, 2026

### [How to Convert Any Website to an RSS Feed](/blog/convert-any-website-to-rss-feed)

RSS, Atom

Need updates from a site you do not control? Create a WebCrawlerAPI feed for any URL, then read changes as JSON Feed or Atom (RSS-style) from simple endpoints.

Feb 6, 2026

### [BeautifulSoup4 Web Crawler](/blog/beatifulsoup-webcrawler)

Python, Tutorial

A tiny BeautifulSoup4 + requests crawler that stays on one site, normalizes URLs, and deduplicates links.

Feb 3, 2026

### [YAML vs Plain Text: Choosing the Right Format for LLM Prompts](/blog/yaml-vs-plain-text-choosing-the-right-format-for-llm-prompts)

Comparison, YAML

YAML vs plain text for prompt data and scraping workflows: when structured manifests help and when raw text is the safer choice.

Feb 1, 2026

### [YAML vs CSV: Choosing the Right Format for LLM Prompts](/blog/yaml-vs-csv-choosing-the-right-format-for-llm-prompts)

Comparison, YAML

YAML vs CSV for prompt data and scraping outputs: config manifests vs flat tables, with practical crawling and RAG examples.

Feb 1, 2026

### [Markdown vs YAML: Choosing the Right Format for LLM Prompts](/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts)

Comparison, Markdown

Markdown vs YAML for prompt inputs and scraped outputs: readability, parsing risk, and practical patterns for crawling and RAG ingestion.

Feb 1, 2026

### [Markdown vs Plain Text: Choosing the Right Format for LLM Prompts](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts)

Comparison, Markdown

Markdown vs plain text for prompts and scraped content: structure, readability, chunking for RAG, and practical tradeoffs.

Feb 1, 2026

### [Markdown vs JSON: Choosing the Right Format for LLM Prompts](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts)

Comparison, Markdown

A practical comparison of Markdown and JSON for LLM prompt inputs, scraping outputs, and RAG ingestion, with clear tradeoffs and examples.

Feb 1, 2026

### [Markdown vs CSV: Choosing the Right Format for LLM Prompts](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts)

Comparison, Markdown

Markdown vs CSV for scraped data and prompt inputs: when tables help, when they break, and what works best for RAG and pipelines.

Feb 1, 2026

### [JSON vs YAML: Choosing the Right Format for LLM Prompts](/blog/json-vs-yaml-choosing-the-right-format-for-llm-prompts)

Comparison, JSON

JSON vs YAML for prompt data and scraped outputs: schema, validation, typing, and what breaks in real pipelines.

Feb 1, 2026

### [JSON vs Plain Text: Choosing the Right Format for LLM Prompts](/blog/json-vs-plain-text-choosing-the-right-format-for-llm-prompts)

Comparison, JSON

JSON vs plain text for scraping and RAG pipelines: when strict fields are needed, when raw text is enough, and how to choose safely.

Feb 1, 2026

### [JSON vs CSV: Choosing the Right Format for LLM Prompts](/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts)

Comparison, JSON

JSON vs CSV for scraped datasets and LLM prompt outputs: structure, nesting, parsing, and what works best for pipelines and RAG.

Feb 1, 2026

### [HTML vs Cleaned Text vs Markdown: Which Should Be Used for RAG?](/blog/html-vs-cleaned-text-vs-markdown-which-should-be-used-for-rag)

Comparison, RAG

A practical guide to choosing HTML, cleaned text, or Markdown for RAG ingestion from crawled pages, including tradeoffs and a simple decision flow.

Feb 1, 2026

### [HTML vs Cleaned Text: Choosing the Right Output Format](/blog/html-vs-cleaned-text-choosing-the-right-output-format)

Comparison, HTML

HTML vs cleaned text for web crawling and RAG: what is preserved, what is lost, and which output format is safer for real pipelines.

Feb 1, 2026

### [CSV vs Plain Text: Choosing the Right Format for LLM Prompts](/blog/csv-vs-plain-text-choosing-the-right-format-for-llm-prompts)

Comparison, CSV

CSV vs plain text for scraped outputs and prompt data: when a dataset is needed, when narrative text is enough, and what to avoid.

Feb 1, 2026

### [How to crawl the website with Python](/blog/how-to-crawl-the-website-with-python)

Python, Tutorial

There are several ways to crawl the content of a website using Python. Each method has its pros and cons. Let's take a closer look.

Jan 31, 2026 • 10 min read

### [Web Scraping Ethics: What is legal and what is not?](/blog/web-scraping-ethics)

web scraping, ethics

Learn the ethical principles, legal considerations, and best practices for responsible web scraping. Understand how to respect website owners while collecting data legally and ethically.

Jan 30, 2026

### [Mozilla Readability Algorithm (Readability.js) explanation](/blog/mozilla-readability-algorithm-readabilityjs)

JS, Technical

A simple, step-by-step breakdown of the Mozilla Readability.js algorithm: how it scores the DOM and extracts the main article content.

Jan 26, 2026

### [What is Shadow DOM? (And How to Scrape It)](/blog/what-is-shadow-dom)

Technical

Shadow DOM is a way to build encapsulated UI components on the web. Learn what Shadow DOM is, why it is hard to scrape, and how to scrape Shadow DOM in your browser or with a browser extension.

Jan 4, 2026 • 5 mins to read

### [Extracting article or blogpost content with Mozilla Readability](/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs)

JS, Tutorial

Extract clean article content from any web page using Mozilla's Readability library, the same algorithm that powers Firefox Reader View. Complete JavaScript code examples with HTML cleaning and error handling.

Sep 10, 2025

### [How AI FlowChat uses WebCrawlerAPI to add context to users' flows](/blog/how-ai-flowchat-uses-webcrawlerapi-to-add-context-to-users-flows)

Customer story

I recently talked to Alex, founder of AI Flow Chat. Read the customer story about how AI Flow Chat is using WebCrawlerAPI in their user flows.

Jul 25, 2025

### [What is Cloudflare Web Crawler?](/blog/what-is-cloudflare-web-crawler)

Technical

Learn what the Cloudflare Web Crawler is, when to use it, and when it is better to look for another solution.

May 17, 2025 • 10 min read

### [Top 6 Web Scraping APIs in 2025](/blog/top-web-scraping-apis-in-2025)

API, Comparison

Top 6 scraping APIs in 2025. Get content or structured data with a single API call.

May 11, 2025 • 10 min read

### [The Top 3 Best Screenshot APIs to Use in 2026](/blog/the-top-3-best-screenshot-apis-to-use-in-2025)

API, Comparison

See the top 3 screenshot APIs to try in 2025, with easy comparisons of prices, features, and free plans.

May 8, 2025 • 5 mins to read

### [What is an llms.txt File?](/blog/what-is-llm)

Technical

Learn about llms.txt files, a standard way to document AI models used in your projects, promoting transparency and trust in AI-powered applications.

Mar 24, 2025 • 3 min read

### [JavaScript Rendering in Web Crawling](/blog/javascript-rendering-in-web-crawling-complete-guide)

JS, Technical

Explore essential tools and strategies for effective JavaScript rendering in web crawling, overcoming challenges in dynamic websites.

Jan 19, 2025 • 10 min read

### [How to Build a Web Crawler](/blog/how-to-build-a-web-crawler)

Learn the basics of building a web crawler from scratch. This guide covers key components, planning steps, common challenges, and best practices in simple terms.

Jan 19, 2025

### [Top 6 best Firecrawl alternatives](/blog/top-5-best-firecrawl-alternatives)

Comparison, API

Explore five web scraping tools that serve as alternatives to Firecrawl, each offering unique features for diverse data extraction needs.

Jan 17, 2025 • 10 min read

### [Cleaned text vs Markdown: Choosing the Right Output Format for AI](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format)

Comparison, Markdown

Explore the differences between cleaned text and Markdown to determine the best format for your data processing and content management needs.

Jan 17, 2025 • 10 min read

### [HTML vs Markdown: Choosing the Right Output Format for AI](/blog/html-vs-markdown-choosing-the-right-output-format)

Comparison, RAG

Explore the differences between HTML and Markdown to determine which format best suits your web development and data processing needs.

Jan 15, 2025 • 10 min read

### [How to crawl website with PHP](/blog/how-to-crawl-website-with-php)

Tutorial, Technical

Learn how to effectively crawl websites using PHP with frameworks like Goutte and Spatie/Crawler, or opt for the simplicity of WebCrawlerAPI.

Jan 12, 2025 • 15 min read

### [What is webcrawling?](/blog/what-is-a-web-crawling)

Technical

Explore the automated process of web crawling, its essential functions, and the tools that simplify data collection from the vast web.

Jan 12, 2025 • 1 min read

### [How to Crawl Website with .NET and C#](/blog/how-to-crawl-website-with-net-and-c)

Tutorial, Technical

Learn how to effectively crawl websites with .NET and C#, exploring frameworks and APIs for both simple and complex tasks.

Jan 11, 2025 • 15 min read

### [Best Web Crawler API in 2025](/blog/best-webcrawler-api-in-2025)

API, Comparison

Top Web Crawler APIs in 2025. Most popular web scraping tools for AI, e-commerce, and SEO.

Jan 9, 2025 • 10 min read

### [Python vs Node.js: Which is Better for Web Crawling?](/blog/python-vs-nodejs-which-is-better-for-web-crawling)

Python, Comparison

Explore the strengths and weaknesses of Python and Node.js for web crawling, and find the best fit for your project needs.

Jan 6, 2025 • 30 min read

### [The Best Data Format for Your Prompt](/blog/best-prompt-data)

RAG, Technical

Learn which data format is best for your prompt. Markdown, JSON, CSV, Plain Text, and YAML each have their strengths and weaknesses.

Nov 23, 2024 • 6 min read

### [How to upload website content to ChatGPT](/blog/upload-website-content-to-chatgpt)

RAG, Tutorial

Learn how to upload website content to ChatGPT to generate human-like text based on the scraped content of your website.

Jun 23, 2024 • 15 min read

### [How to extract XPath in Golang](/blog/extract-xpath-golang)

Technical, Tutorial

XPath is a powerful tool for selecting nodes in an XML document. In this article, we will show you how to extract XPath in Golang.

Jun 16, 2024 • 6 min read

### [What is XPath?](/blog/what-is-xpath)

Technical

XPath is a powerful query language for selecting nodes in an HTML document. Learn about the key features and aspects of XPath.

Jun 1, 2024 • 1 min

### [Clean crawled or scraped data with BeautifulSoup in Python](/blog/clean-crawled-data-with-beautifulsoup-in-python)

Python, Tutorial

After crawling or scraping the webpage, the data may need to be cleaned. In this article, we provide a solution and code for using BeautifulSoup to remove unneeded content.

May 27, 2024 • 10 min read

### [How to build a web crawler with Scrapy in Python](/blog/how-to-build-a-web-crawler-with-scrapy-in-python)

Python, Tutorial

Scrapy is a powerful tool for crawling and scraping websites. In this tutorial, you will learn how to build a crawler using this framework, render JavaScript, and save the content of the website page by page.

May 25, 2024 • 7 min read

### [What is the best crawling API in 2024?](/blog/best-web-crawler-api-2024)

How do you choose a crawler API that fits your needs? What are the best web crawling APIs in 2024?

Apr 26, 2024 • 15 min read

----
url: https://webcrawlerapi.com/changelog/2025-03-06-monitoring-server-incident
----

March 6, 2025

Monitoring Server Incident Resolution

=====================================

The issue lasted for 9 hours but was not related to crawling. The root cause was a network issue affecting the monitoring server. Because the monitoring server was unavailable to the main job manager, each job report had to wait several minutes for a timeout response from the monitoring server.

As a result, the processing time for each job increased, and the job queue grew to several thousand jobs.

The incident has now been resolved. We are continuously working on improving our monitoring system to prevent similar issues in the future.

----
url: https://webcrawlerapi.com/blog/html-vs-cleaned-text-vs-markdown-which-should-be-used-for-rag
----

Comparison, RAG, HTML, Markdown

HTML vs Cleaned Text vs Markdown: Which Should Be Used for RAG?

===============================================================

A practical guide to choosing HTML, cleaned text, or Markdown for RAG ingestion from crawled pages, including tradeoffs and a simple decision flow.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What should be optimized for in RAG](#what-should-be-optimized-for-in-rag)

*   [Recommended decision flow](#recommended-decision-flow)

*   [Step 1: Is re-processing expected?](#step-1-is-re-processing-expected)

*   [Step 2: Is retrieval being done over full content?](#step-2-is-retrieval-being-done-over-full-content)

*   [Step 3: Is human review part of the workflow?](#step-3-is-human-review-part-of-the-workflow)

*   [Practical patterns that tend to work](#practical-patterns-that-tend-to-work)

*   [Pattern A: Store HTML, embed cleaned text](#pattern-a-store-html-embed-cleaned-text)

*   [Pattern B: Convert to Markdown, then chunk by headings](#pattern-b-convert-to-markdown-then-chunk-by-headings)

*   [Pattern C: Cleaned text only (fast path)](#pattern-c-cleaned-text-only-fast-path)

*   [Common RAG edge cases](#common-rag-edge-cases)

*   [Tables](#tables)

*   [Link directories](#link-directories)

*   [Boilerplate-heavy pages](#boilerplate-heavy-pages)

*   [Node.js snippet: A simple "store HTML + embed cleaned text" record](#nodejs-snippet-a-simple-store-html-embed-cleaned-text-record)

*   [Conclusion](#conclusion)

When RAG is built on top of crawled pages, the choice of output format tends to shape the whole pipeline. HTML, cleaned text, and Markdown can all work, but each comes with a different cost.

Pairwise guides are available in:

*   [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format)

*   [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format)

*   [HTML vs Cleaned Text](/blog/html-vs-cleaned-text-choosing-the-right-output-format)

Quick comparison

----------------

| Topic | HTML | Cleaned Text | Markdown |
| --- | --- | --- | --- |
| Best for | Fidelity and re-processing | Embeddings and retrieval | Readable structure for humans |
| Keeps links (targets) | Yes | Usually no | Sometimes (depends on conversion) |
| Keeps structure | High (DOM) | Low | Medium |
| Token cost | High | Low | Medium |
| RAG chunking | Harder (needs parsing) | Simple | Simple (headings help) |

What should be optimized for in RAG

-----------------------------------

In real pipelines, three goals are usually competing:

1.  Retrieval quality (what gets found)

2.  Answer quality (what gets used)

3.  Traceability (what was the source and where)

Those goals are affected by how much structure is preserved and how much noise is carried.

If extracted structured fields are required too, prompt data formats are covered in [Best Prompt Data](/blog/best-prompt-data).

Recommended decision flow

-------------------------

### Step 1: Is re-processing expected?

If parsing rules are expected to change, HTML is often stored as the source of truth. Cleaned text and Markdown can be re-generated later.

### Step 2: Is retrieval being done over full content?

If embeddings are the core, cleaned text is usually the default. It reduces noise and token cost.

### Step 3: Is human review part of the workflow?

If humans must read chunks, Markdown is often used because headings and lists remain scannable.
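The three steps above can be sketched as a small helper. This is only an illustration: `chooseFormats` and its option names are hypothetical, and letting human review override the cleaned-text default is an assumption, not a rule stated here.

```javascript
// Node 18+
// Hypothetical helper that maps the decision flow to a storage/embedding choice.
function chooseFormats({ reprocessingExpected = false, humanReview = false } = {}) {
  // Steps 2 and 3: cleaned text is the default for embeddings,
  // Markdown is used when humans must read the chunks (assumed priority).
  const embed = humanReview ? "markdown" : "cleaned_text";
  // Step 1: keep raw HTML as the source of truth when parsing rules may change.
  const store = reprocessingExpected ? "html" : embed;
  return { store, embed };
}

console.log(chooseFormats({ reprocessingExpected: true }));
```

A real pipeline would likely take more signals into account (table density, link density, cost limits).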

Practical patterns that tend to work

------------------------------------

### Pattern A: Store HTML, embed cleaned text

This pattern is common because both traceability and retrieval are supported.

*   HTML is stored for evidence and re-processing.

*   Cleaned text is chunked and embedded.

*   URLs and titles are stored as metadata.
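A minimal sketch of the "chunk and attach metadata" part of Pattern A is shown below. The function name, the character-based sizes, and the overlap value are assumptions chosen for illustration; token-based or sentence-aware splitting is often preferred in practice.

```javascript
// Node 18+
// Split cleaned text into overlapping chunks and carry URL/title as metadata.
function chunkCleanedText(text, { url, title, maxChars = 1000, overlap = 100 } = {}) {
  if (overlap >= maxChars) throw new Error("overlap must be smaller than maxChars");
  const chunks = [];
  for (let start = 0; start < text.length; start += maxChars - overlap) {
    chunks.push({ url, title, start, text: text.slice(start, start + maxChars) });
    // Stop once the current chunk already reaches the end of the text.
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}

const demo = chunkCleanedText("abcdefghijklmnopqrstuvwxy", {
  url: "https://example.com/page",
  title: "Example",
  maxChars: 10,
  overlap: 2,
});
console.log(demo.map((c) => c.text));
```

The `start` offset is stored so that a retrieved chunk can be traced back to its position in the cleaned text.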

### Pattern B: Convert to Markdown, then chunk by headings

This pattern is common for docs and knowledge bases.

*   HTML is converted to Markdown.

*   Level-2 headings (\#\#) are used as chunk boundaries.

*   Lists and code blocks are preserved.

Markdown conversion tradeoffs are covered in [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format).
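Heading-based chunking from Pattern B can be sketched as follows. It is a simplified illustration: `chunkByH2` is a hypothetical name, only ATX-style level-2 headings are recognized, and fenced code blocks are skipped so that a line starting with two hash marks inside code is not treated as a boundary.

```javascript
// Node 18+
const FENCE = "`".repeat(3); // fence marker for Markdown code blocks

// Split Markdown into chunks at level-2 headings, ignoring headings inside code fences.
function chunkByH2(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };
  let inFence = false;
  const flush = () => {
    if (current.heading !== null || current.lines.some((l) => l.trim())) chunks.push(current);
  };
  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    if (!inFence && /^## +\S/.test(line)) {
      flush();
      current = { heading: line.replace(/^## +/, "").trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return chunks.map((c) => ({ heading: c.heading, text: c.lines.join("\n").trim() }));
}

console.log(chunkByH2("intro\n## Setup\nInstall it.\n## Usage\nRun it."));
```

Content before the first heading is kept as a chunk with a `null` heading, so nothing is dropped silently.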

### Pattern C: Cleaned text only (fast path)

This pattern is used when:

*   The site is mostly prose

*   Links and tables are not critical

*   Cost and simplicity are prioritized

The downside is that structure and link targets can be lost.

Common RAG edge cases

---------------------

### Tables

If tables carry meaning (specs, pricing), cleaned text can flatten them into nonsense. HTML can preserve them, but additional parsing is required. Markdown tables can work, but generation is not always stable.
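One way to keep Markdown tables stable is to generate them from already-extracted cells instead of relying on a generic converter. The sketch below assumes the header and rows were pulled from the HTML by a DOM parser (not shown); `rowsToMarkdownTable` is a hypothetical helper.

```javascript
// Node 18+
// Render a header and rows as a Markdown table, escaping pipes and flattening newlines.
function rowsToMarkdownTable(header, rows) {
  const esc = (cell) =>
    String(cell ?? "").replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim();
  const line = (cells) => `| ${cells.map(esc).join(" | ")} |`;
  // Pad or truncate every row to the header width so the column count stays stable.
  const body = rows.map((row) => line(header.map((_, i) => row[i])));
  return [line(header), line(header.map(() => "---")), ...body].join("\n");
}

console.log(rowsToMarkdownTable(["Plan", "Price"], [["Basic", "$10"], ["Pro", "$30"]]));
```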

### Link directories

If a page is mostly links, cleaned text can lose targets. HTML keeps them. Markdown can keep them if links are preserved as \[text\](url).
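If Markdown is used for link-heavy pages, the targets can be pulled back out and stored as metadata. The following is a rough sketch with a regular expression, not a Markdown parser: nested brackets and URLs containing parentheses are not handled, and `extractMarkdownLinks` is a hypothetical name.

```javascript
// Node 18+
// Collect inline links of the form [text](url), skipping images such as ![alt](src).
function extractMarkdownLinks(markdown) {
  const links = [];
  const pattern = /(!?)\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)/g;
  for (const m of markdown.matchAll(pattern)) {
    if (m[1] === "!") continue; // image, not a link
    links.push({ text: m[2], url: m[3] });
  }
  return links;
}

console.log(extractMarkdownLinks("See [Docs](/docs/getting-started) and [Blog](/blog)."));
```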

### Boilerplate-heavy pages

HTML often includes repeated headers, footers, cookie banners, and navigation. If not removed, embeddings can be polluted. Cleaned text usually reduces this problem.
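When a dedicated content extractor is not available, a crude cross-page heuristic can reduce repeated boilerplate: lines that appear on most pages of the same site are dropped before embedding. The sketch below is an assumption-laden illustration (`stripRepeatedLines` and the `minShare` threshold are hypothetical), and it can also remove legitimately repeated content.

```javascript
// Node 18+
// Drop lines that occur on at least `minShare` of the pages (and on at least 2 pages).
function stripRepeatedLines(pages, { minShare = 0.8 } = {}) {
  const counts = new Map(); // trimmed line -> number of pages containing it
  for (const page of pages) {
    const unique = new Set(page.split("\n").map((l) => l.trim()).filter(Boolean));
    for (const line of unique) counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) =>
    page
      .split("\n")
      .filter((l) => !l.trim() || (counts.get(l.trim()) ?? 0) < threshold)
      .join("\n")
  );
}

console.log(
  stripRepeatedLines(["Pricing | Docs\nArticle one", "Pricing | Docs\nArticle two"])
);
```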

Node.js snippet: A simple "store HTML + embed cleaned text" record

------------------------------------------------------------------

This example shows a practical envelope for storage. No product-specific features are implied.

    // Node 18+

    // Create an ingestion record that keeps HTML for traceability

    // and keeps cleaned text for embedding.

    const record = {

      url: "https://example.com/page",

      fetched_at: new Date().toISOString(),

      html: "<html>...</html>",

      cleaned_text: "Readable content goes here...",

    };

    console.log(JSON.stringify(record, null, 2));

Conclusion

----------

*   HTML is usually selected for fidelity and re-processing.

*   Cleaned text is usually selected for embeddings and retrieval.

*   Markdown is usually selected when readable structure is valuable, especially for docs.

*   A mixed approach is often used: HTML for storage, cleaned text (or Markdown) for RAG.

If prompt input formats are being chosen too, [Best Prompt Data](/blog/best-prompt-data) should be read alongside these output guides.

----
url: https://webcrawlerapi.com/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts
----

Comparison, Markdown, YAML, RAG

Markdown vs YAML: Choosing the Right Format for LLM Prompts

===========================================================

Markdown vs YAML for prompt inputs and scraped outputs: readability, parsing risk, and practical patterns for crawling and RAG ingestion.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What Markdown is good at](#what-markdown-is-good-at)

*   [What YAML is good at](#what-yaml-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When Markdown should be used](#when-markdown-should-be-used)

*   [When YAML should be used](#when-yaml-should-be-used)

*   [Practical tradeoffs and failure modes](#practical-tradeoffs-and-failure-modes)

*   [YAML typing surprises](#yaml-typing-surprises)

*   [Markdown "looks structured" but is not strict](#markdown-looks-structured-but-is-not-strict)

*   [Node.js snippet: Guard YAML-like output by forcing strings](#nodejs-snippet-guard-yaml-like-output-by-forcing-strings)

*   [Conclusion](#conclusion)

Markdown and YAML are both selected for readability, but different kinds of ambiguity are introduced. Markdown is usually used for documents. YAML is usually used for configuration-like data with keys and values.

A bigger overview of formats is provided in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | Markdown | YAML |
| --- | --- | --- |
| Best for | Narrative docs and reports | Config-shaped data and small records |
| Parsing reliability | Medium | Medium to High (but indentation mistakes hurt) |
| Human editing | Easy | Easy (until nesting gets deep) |
| Common failure | Structure drifts in long outputs | Indentation and implicit types surprise |
| RAG fit | Good for readable chunks | Good for metadata and small manifests |

What Markdown is good at

------------------------

Markdown is usually used when:

*   A long answer is expected to be read by a human

*   Sections, headings, and lists are useful

*   Code blocks and examples must remain readable

Markdown as an output format is compared in [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format).

What YAML is good at

--------------------

YAML is usually used when:

*   Key-value structure is needed, but it should remain human-friendly

*   Config files or small manifests are being produced

*   Comments are helpful (YAML supports comments, JSON does not)

A close alternative is JSON, and the tradeoffs are covered in [JSON vs YAML](/blog/json-vs-yaml-choosing-the-right-format-for-llm-prompts).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When Markdown should be used

Markdown is usually preferred when:

*   Page content is being summarized for a human review step

*   A "what was found" report is being generated (headings, bullets, quotes)

*   The primary value is the readable text, not strict fields

### When YAML should be used

YAML is usually preferred when:

*   A small extraction manifest is being produced (selectors, flags, rules)

*   A batch job definition is being generated and edited by hand

*   A compact record per page is enough, and strict validation is not required

If the output must be parsed and stored reliably, JSON should usually be chosen over YAML; the tradeoffs are covered in [Markdown vs JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts).

Practical tradeoffs and failure modes

-------------------------------------

### YAML typing surprises

YAML parsers can treat unquoted values as booleans, numbers, or dates. That behavior can be helpful, but it can also be surprising in scraping where strings are expected.
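To make the surprise concrete, the sketch below flags unquoted scalars that a YAML 1.1-style parser is likely to coerce. It is a heuristic and not a YAML implementation: `yamlTypeRisk` is a hypothetical name, the patterns are approximations, and parsers following the YAML 1.2 core schema coerce fewer values (for example, only `true` and `false` as booleans).

```javascript
// Node 18+
// Return the type an unquoted scalar may be coerced to, or null if it looks like a plain string.
function yamlTypeRisk(value) {
  const v = value.trim();
  if (v === "" || /^(~|null)$/i.test(v)) return "null";
  if (/^(y|n|yes|no|on|off|true|false)$/i.test(v)) return "boolean";
  if (/^[-+]?\d[\d_]*(\.\d*)?([eE][-+]?\d+)?$/.test(v) || /^0x[0-9a-fA-F]+$/.test(v)) return "number";
  if (/^\d{4}-\d{2}-\d{2}([Tt ].*)?$/.test(v)) return "date";
  return null;
}

for (const sample of ["NO", "1.10", "2026-02-01", "hello"]) {
  console.log(sample, "->", yamlTypeRisk(sample));
}
```

Typical scraping casualties are country codes such as `NO`, version strings such as `1.10`, and IDs with leading zeros, which is why quoting every value is a common guardrail.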

### Markdown "looks structured" but is not strict

A table in Markdown looks like a table, but it is not guaranteed to be parseable as a table. If a database insert is planned, JSON or CSV is usually safer.
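If a Markdown table has to be accepted anyway, a strict check before any insert is a reasonable guardrail. The sketch below is a minimal validator under stated assumptions: pipe-delimited rows, a delimiter row of dashes, and a hypothetical function name `parseStrictMarkdownTable`. It rejects any row whose cell count differs from the header.

```javascript
// Node 18+
// Parse a pipe table into objects, throwing on structural drift instead of guessing.
function parseStrictMarkdownTable(text) {
  const split = (line) =>
    line.trim().replace(/^\|/, "").replace(/\|$/, "").split(/(?<!\\)\|/).map((c) => c.trim());
  const lines = text.split("\n").filter((l) => l.trim());
  if (lines.length < 2) throw new Error("table needs a header and a delimiter row");
  const header = split(lines[0]);
  const delimiter = split(lines[1]);
  if (delimiter.length !== header.length || !delimiter.every((c) => /^:?-+:?$/.test(c))) {
    throw new Error("invalid delimiter row");
  }
  return lines.slice(2).map((line, i) => {
    const cells = split(line);
    if (cells.length !== header.length) {
      throw new Error(`row ${i + 1} has ${cells.length} cells, expected ${header.length}`);
    }
    return Object.fromEntries(header.map((h, j) => [h, cells[j]]));
  });
}

console.log(parseStrictMarkdownTable("| sku | price |\n| --- | --- |\n| A1 | 10 |"));
```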

Node.js snippet: Guard YAML-like output by forcing strings

----------------------------------------------------------

No YAML parser is used here on purpose. A common mitigation is: YAML is requested, but values are required to be quoted strings for predictable typing.

    // Node 18+

    // Simple check: ensure every ":" value is quoted.

    // This is not a YAML parser. It is a guardrail.

    import { readFile } from "node:fs/promises";

    const text = await readFile("output.yml", "utf8");

    const badLines = [];

    for (const [i, line] of text.split("\n").entries()) {

      const trimmed = line.trim();

      if (!trimmed || trimmed.startsWith("#") || !trimmed.includes(":")) continue;

      const idx = trimmed.indexOf(":");

      const value = trimmed.slice(idx + 1).trim();

      if (value && !value.startsWith('"')) {

        badLines.push({ line: i + 1, value });

      }

    }

    if (badLines.length) {

      console.error("Unquoted YAML values found:", badLines.slice(0, 10));

      process.exit(1);

    }

    console.log("OK: values look quoted");

Conclusion

----------

*   Markdown is usually selected for long, readable documents.

*   YAML is usually selected for config-like key-value data that is edited by humans.

*   For machine-parsed pipelines, JSON is usually more reliable than YAML.

If a flat dataset is being extracted, [YAML vs CSV](/blog/yaml-vs-csv-choosing-the-right-format-for-llm-prompts) should be compared too.

----
url: https://webcrawlerapi.com/blog/web-scraping-ethics
----

web scraping, ethics, best practices

Web Scraping Ethics: What is legal and what is not?

===================================================

Learn the ethical principles, legal considerations, and best practices for responsible web scraping. Understand how to respect website owners while collecting data legally and ethically.

Written by Andrew

Published on Feb 8, 2026

### Table of Contents

*   [Is Web Scraping Legal?](#is-web-scraping-legal)

*   [What Makes Web Scraping Ethical vs Unethical](#what-makes-web-scraping-ethical-vs-unethical)

*   [robots.txt: What It Means and Why You Must Respect It](#robotstxt-what-it-means-and-why-you-must-respect-it)

*   [Terms of Service and Contract Law](#terms-of-service-and-contract-law)

*   [Personal Data and Privacy Laws (GDPR, CCPA)](#personal-data-and-privacy-laws-gdpr-ccpa)

*   [Rate Limiting: Don't Break the Website](#rate-limiting-dont-break-the-website)

*   [When Web Scraping Is NOT Allowed](#when-web-scraping-is-not-allowed)

*   [Public vs Private Data: Know the Difference](#public-vs-private-data-know-the-difference)

*   [How to Identify Yourself (User-Agent Best Practices)](#how-to-identify-yourself-user-agent-best-practices)

*   [Should You Ask Permission First?](#should-you-ask-permission-first)

*   [APIs vs Scraping: When to Choose What](#apis-vs-scraping-when-to-choose-what)

*   [What Happens If You Scrape Unethically](#what-happens-if-you-scrape-unethically)

*   [Web Scraping Code of Conduct](#web-scraping-code-of-conduct)

Is Web Scraping Legal?

----------------------

Web scraping itself isn't illegal, but legality depends on what you scrape and how you do it.

If you want real examples (and not just theory), these court-case roundups can help:

*   [5 Famous Web Scraping Court Cases Where Scrapers Won](/blog/5-famous-web-scraping-court-cases-where-scrapers-won)

*   [5 Famous Web Scraping Court Cases Where Scrapers Lost](/blog/5-famous-web-scraping-court-cases-where-scrapers-lost)

In most countries, scraping publicly available pages is often fine, but you can run into trouble if you bypass authentication, ignore robots.txt, violate Terms of Service, or scrape personal data without consent.

The biggest risks usually come from copyright issues, computer access laws (for example, CFAA in the US), and privacy regulations (GDPR/CCPA).

Before you scrape, check the site's Terms of Service, respect robots.txt, and make sure you're not collecting or using data in ways that can create legal or privacy problems (not legal advice).

What Makes Web Scraping Ethical vs Unethical

--------------------------------------------

Ethical scraping is mostly about respecting signals and reducing harm.

I start with permission: I read the site Terms of Service and check robots.txt, and if it's unclear, I ask or I don't scrape.

Privacy comes next. I don't collect personal data (PII) unless there is a strong, legitimate reason and real safeguards (not legal advice).

Then there is content. I avoid copying protected expression like full articles or other creative text wholesale; in real life, the safer path is extracting facts and structured fields, and adding attribution when it helps users understand the source.

And finally, behavior. Rate limit, back off on 429/5xx, keep requests light, and identify your crawler with a real User-Agent so you're not sneaking around.

Unethical scraping is the opposite pattern: bypassing logins or paywalls, evading blocks, hammering servers until they fall over, or republishing someone else's work in a way that undercuts the original.

robots.txt: What It Means and Why You Must Respect It

-----------------------------------------------------

robots.txt is a small text file at the site root (usually https://example.com/robots.txt) that tells crawlers what paths are allowed or disallowed for a given User-agent, and sometimes how fast they should crawl (Crawl-delay). It is not a password and it is not a security control. But it is a very clear permission signal, and ignoring it is one of the fastest ways to get blocked (and to create a bad relationship with the site owner).

In real life, it also has edge cases: rules can differ per bot, the file can change, and it is easy to accidentally enqueue forbidden URLs if you only check once. So it should be treated as input to your crawler: fetch it, parse it with a real parser, cache it per host, and check every URL before you request it.

If a page is disallowed, do not try to be clever. Either skip it, ask for permission, or use an official API if one exists.

A typical file looks like this:

    User-agent: *

    Disallow: /private/

    Allow: /blog/

    Crawl-delay: 5

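Which means: any bot without its own, more specific User-agent group should not touch /private/, can crawl /blog/, and should wait around 5 seconds between requests. Crawl-delay is a non-standard directive, so not every crawler or parser honors it.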

You can implement this from scratch, but the format has enough edge cases that it is usually better to use a parser library.

    // Node 18+

    // Idea: fetch /robots.txt once per host and reuse it.

    import robotsParser from "robots-parser";

    export async function getRobotsForHost(origin) {

      const robotsUrl = new URL("/robots.txt", origin).toString();

      const res = await fetch(robotsUrl);

      const txt = res.ok ? await res.text() : "";

      return robotsParser(robotsUrl, txt);

    }

    export async function isAllowed(url, { robots }) {

      return robots.isAllowed(url, "MyCrawler/1.0");

    }
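The "cache it per host" advice can be sketched as a small wrapper around any loader, for example the `getRobotsForHost` function above. The wrapper name, the one-hour TTL, and the injectable clock are assumptions made for illustration.

```javascript
// Node 18+
// Cache one loaded value (or promise) per origin and reload it after `ttlMs`.
function createPerHostCache(load, { ttlMs = 60 * 60 * 1000, now = Date.now } = {}) {
  const entries = new Map(); // origin -> { value, expiresAt }
  return function getForUrl(url) {
    const origin = new URL(url).origin;
    const hit = entries.get(origin);
    if (hit && hit.expiresAt > now()) return hit.value;
    const value = load(origin); // may be a promise; concurrent callers share it
    entries.set(origin, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Usage sketch with a stand-in loader:
const getRules = createPerHostCache(async (origin) => `rules for ${origin}`);
getRules("https://example.com/a").then(console.log);
```

With the real loader, a check before each request would look like `const robots = await getRobots(url)` followed by `robots.isAllowed(url, userAgent)`. One caveat of this sketch: a rejected promise stays cached until the TTL expires, so a production version should evict failed loads.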

Terms of Service and Contract Law

---------------------------------

Terms of Service (ToS) is where "can I scrape this" is often answered.

If a site says "no automated access" and you scrape anyway, you may be breaking their contract rules (and sometimes that is enough to create a problem even if the pages are public).

Two practical rules that save time:

1.  If you need to log in, assume the rules are stricter. You're not just reading a public page anymore.

2.  If the ToS is explicit and you can't comply, stop. Don't build your scraper around "maybe they won't notice".

Also: ToS can change. If you're scraping at scale, treat it like a dependency and re-check it periodically.

Copyright: What You Can and Cannot Copy

---------------------------------------

Copyright is not about "data" in general. It's about creative expression.

In practice, a safe mental model is:

*   Facts and raw numbers are usually fine.

*   The way those facts are written, selected, and presented can be protected.

So scraping product prices, SKUs, and availability is very different from scraping and republishing full product descriptions or blog posts.

If your output looks like a copy of the original page, you're probably too close.

What I'd do instead:

*   Extract only the fields you need.

*   Store the source URL next to each record.

*   If you show any text back to users, keep it short and link to the source, or use an official API/license.

And one obvious-but-common mistake: "it is behind a paywall" doesn't mean "it is fair game if I can technically fetch it".

If your goal is to extract "main article text" (for search, summaries, RAG), this is where tooling matters. See: [Extracting article or blogpost content with Mozilla Readability](https://webcrawlerapi.com/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs).

Personal Data and Privacy Laws (GDPR, CCPA)

-------------------------------------------

Privacy is where scraping goes from "kinda grey" to "danger" fast.

If you scrape anything that can identify a person (names + emails, phone numbers, user IDs, profiles, addresses, photos, IPs), you're in PII territory. Even if it is visible on a public page.

Practical rules:

*   Avoid PII unless you truly need it.

*   If you need it, define a legal basis and document it (not legal advice).

*   Minimize: collect the smallest possible set of fields.

*   Secure it: encryption at rest, access controls, audit logs.

*   Retention: delete it when it is no longer needed.
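Two of these rules, minimization and retention, are easy to enforce in code. The sketch below is illustrative only: the function names, the `fetched_at` field, and the email pattern are assumptions, and the pattern will not catch every address or any other kind of PII.

```javascript
// Node 18+
// Minimize: replace email addresses in free text before it is stored.
function redactEmails(text) {
  return text.replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[redacted-email]");
}

// Retention: keep only records fetched within the last `maxAgeDays`.
function dropExpired(records, { maxAgeDays, now = new Date() }) {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return records.filter((r) => new Date(r.fetched_at).getTime() >= cutoff);
}

console.log(redactEmails("Contact jane.doe@example.com for details"));
```

Redaction at ingestion time is usually safer than redaction at display time, because the raw value never reaches storage.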

Also be careful with "public" sources like forums, social media, and review sites. People post publicly, but they still expect context. Bulk collection and republishing changes that context.

Rate Limiting: Don't Break the Website

--------------------------------------

Most "ethical" problems in scraping are operational.

If you send 50 requests per second to a small site, you're not doing research. You're doing a small DDoS.

Good defaults:

*   Concurrency per host: 1-3.

*   Delay between requests: 500ms-3000ms (add jitter).

*   Respect Retry-After on 429.

*   Back off on 5xx and timeouts.

*   Stop if errors keep rising.

Here is a tiny Node 18+ pattern that behaves like a polite visitor. It is not a full crawler, but the idea is the important part:

    // Node 18+

    // Idea: per-host delay + basic 429 backoff.

    const nextAt = new Map(); // host -> unix ms

    function sleep(ms) {

      return new Promise((r) => setTimeout(r, ms));

    }

    export async function politeFetch(url, { minDelayMs = 800 } = {}) {

      const u = new URL(url);

      const host = u.host;

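      // Reserve the next slot before sleeping so concurrent calls

      // to the same host queue up instead of firing together.

      const now = Date.now();

      const startAt = Math.max(now, nextAt.get(host) ?? 0);

      // Add jitter so you don't look like a metronome.

      const jitter = Math.floor(Math.random() * 400);

      nextAt.set(host, startAt + minDelayMs + jitter);

      if (startAt > now) await sleep(startAt - now);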

      const res = await fetch(url, {

        headers: {

          // Honest UA with contact is a good practice.

          "user-agent": "MyCrawler/1.0 (+mailto:crawler@example.com)",

        },

      });

      // Back off if the site is telling you to slow down.

      if (res.status === 429) {

        const retryAfter = Number(res.headers.get("retry-after") ?? "0");

        const backoffMs = retryAfter > 0 ? retryAfter * 1000 : 10_000;

        nextAt.set(host, Date.now() + backoffMs);

      }

      return res;

    }
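Two details from the defaults list are not covered by the snippet above: Retry-After may be an HTTP date instead of a number of seconds, and 5xx responses or timeouts need a growing delay. A possible sketch follows; the function names, base delay, and cap are assumptions.

```javascript
// Node 18+
// Retry-After is either delay-seconds or an HTTP date; return milliseconds or null.
function parseRetryAfterMs(headerValue, nowMs = Date.now()) {
  if (!headerValue) return null;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Exponential backoff with full jitter for 5xx responses and timeouts.
function backoffMs(attempt, { baseMs = 1000, maxMs = 60_000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

console.log(parseRetryAfterMs("120"), backoffMs(3, { random: () => 0.5 }));
```

A caller would typically prefer the server's Retry-After value when present and fall back to `backoffMs(attempt)` otherwise, giving up after a fixed number of attempts.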

If you can't afford to crawl slowly, you probably can't afford the ethics and stability problems that come with crawling fast.

If you're building your own crawler, politeness and scheduling will be a big chunk of the work. This is covered in: [How to Build a Web Crawler](https://webcrawlerapi.com/blog/how-to-build-a-web-crawler).

When Web Scraping Is NOT Allowed

--------------------------------

Some cases are simple.

If you have to do any of the following, you're already past the "ethical" line:

*   Bypass authentication, scrape behind login, or reuse someone else's session.

*   Break or evade access controls (CAPTCHA solving, block evasion, paywall bypass).

*   Ignore explicit disallow rules in ToS or robots.txt.

*   Collect personal data at scale without a strong, defensible reason.

There are also categories that should be treated as high risk by default: health records, financial accounts, student data, and anything involving minors.

If you want to see what "not allowed" looks like in practice, this post is the companion: [5 Famous Web Scraping Court Cases Where Scrapers Lost](/blog/5-famous-web-scraping-court-cases-where-scrapers-lost)

Yes, you can technically scrape many of these.

That doesn't mean you should.

Public vs Private Data: Know the Difference

-------------------------------------------

"Public" means a page can be loaded without logging in.

It does not mean:

*   You're allowed to automate it.

*   You're allowed to republish it.

*   You're allowed to build a competing dataset from it.

"Private" is more than "behind a login". It can also mean:

*   Pages that are accessible but intentionally not indexed.

*   URLs that are meant for browsers, not bulk collection.

*   Data that is about individuals, even if visible.

If your project depends on the assumption that "public = free", it will break. First ethically. Then legally. Then operationally.

If crawling vs scraping terms are mixed up (it happens all the time), read: [What is the difference between web crawling and scraping?](https://webcrawlerapi.com/blog/what-is-the-difference-between-web-crawling-and-scraping)

How to Identify Yourself (User-Agent Best Practices)

----------------------------------------------------

If you want to be treated like a good actor, behave like one.

That starts with identification:

*   Set a clear User-Agent.

*   Include a way to contact you (email or URL).

*   Keep it consistent so site admins can understand what they're seeing.

Bad practice is pretending to be Chrome. It's not only shady. It also makes debugging harder when something goes wrong.

Also: don't leak secrets in headers. Never put API keys, auth tokens, or private URLs into a User-Agent string.
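A small builder can keep the User-Agent consistent and guard against the mistakes above. This is a sketch, not a standard: `buildUserAgent`, the `Name/Version (+contact)` layout, and the list of suspicious query parameters are assumptions.

```javascript
// Node 18+
// Build "Name/Version (+contact)" and refuse values that look malformed or secret-bearing.
function buildUserAgent({ name, version, contact }) {
  if (!/^[A-Za-z0-9._-]+$/.test(name)) throw new Error("name must be a simple product token");
  if (!/^[A-Za-z0-9._-]+$/.test(version)) throw new Error("version must be a simple token");
  if (!/^(https?:\/\/|mailto:)\S+$/.test(contact)) {
    throw new Error("contact must be an http(s) or mailto: URL");
  }
  if (/[?&](key|token|secret|api_key)=/i.test(contact)) {
    throw new Error("contact URL must not carry credentials");
  }
  return `${name}/${version} (+${contact})`;
}

console.log(buildUserAgent({ name: "MyCrawler", version: "1.0", contact: "https://example.com/bot" }));
```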

Should You Ask Permission First?

--------------------------------

If you're scraping a few pages for a personal script, you probably won't email anyone.

If you're scraping a site at scale, you should seriously consider asking.

I'd ask when:

*   The ToS is strict or unclear.

*   robots.txt disallows the paths you need.

*   You need high volume or frequent re-crawls.

*   The data is sensitive (PII) or business-critical.

In many cases, the answer you get is "use our API" or "here is a dump". That is a win. It's faster, cheaper, and more stable than fighting the site.

APIs vs Scraping: When to Choose What

-------------------------------------

If a site provides an API, use it.

APIs exist for a reason:

*   Clear rules and rate limits.

*   Stable structure.

*   Lower chance of breaking next week.

*   Explicit permission.

Scraping is what you do when there is no official way to get the data, or when the API doesn't cover your needs.

Before scraping, also look for alternatives that are often overlooked:

*   RSS feeds

*   sitemaps

*   bulk exports

*   public datasets
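As an example of the sitemap route, page URLs can be read from the `<loc>` elements of a sitemap file instead of being discovered by crawling. The sketch below uses a regular expression for brevity and assumes a small, well-formed sitemap; an XML parser is the safer choice for real use, and `extractSitemapLocs` is a hypothetical name.

```javascript
// Node 18+
// Pull URLs out of <loc> elements and decode the XML entities that URLs commonly contain.
function extractSitemapLocs(xml) {
  const decode = (s) =>
    s.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&quot;/g, '"').replace(/&apos;/g, "'").replace(/&amp;/g, "&");
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => decode(m[1]));
}

const xml = "<urlset><url><loc>https://example.com/a</loc></url></urlset>";
console.log(extractSitemapLocs(xml));
```

The URLs found this way still have to pass the robots.txt and rate-limit checks described earlier.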

Scraping is a tool. It should not be your first choice by default.

If you want a deeper overview of the "use an API" route, start here: [What is webcrawling API?](https://webcrawlerapi.com/blog/what-is-a-web-crawling-api)

And if you're comparing vendors, this list can save time: [Top Web Scraping APIs in 2025](https://webcrawlerapi.com/blog/top-web-scraping-apis-in-2025)

What Happens If You Scrape Unethically

--------------------------------------

Unethical scraping usually fails in boring ways:

*   Your IPs get blocked.

*   You spend money on proxies and retries.

*   Your data quality gets worse (block pages, CAPTCHAs, partial HTML).

*   Your system becomes a pile of hacks that only works on Tuesdays.

And then there are real consequences:

*   Legal threats (letters, takedowns, lawsuits).

*   Compliance problems if PII is involved.

*   Reputation damage if you get called out.

The irony is that "aggressive scraping" is often slower long-term. It creates churn: blocks -> workaround -> blocks -> rewrite.

Web Scraping Code of Conduct

----------------------------

If you want a simple code of conduct, here is mine:

1.  Permission signals are respected (robots.txt, ToS, auth boundaries).

2.  Only necessary data is collected (minimize fields, minimize volume).

3.  Sites are not harmed (rate limits, backoff, stop on stress).

4.  Privacy is treated as a first-class constraint (no casual PII scraping).

5.  Identity is honest (User-Agent + contact).

6.  Data is used in context (no republishing that undercuts creators).

If you break any of these, you should have a very good reason. Most projects don't.

Checklist: Is Your Scraping Project Legal and Ethical?

------------------------------------------------------

Use this before you run a scraper overnight:

*   [ ] Have the Terms of Service been read?

*   [ ] Has robots.txt been checked (and is it being enforced per URL)?

*   [ ] Is the target page accessible without login (and are auth boundaries being respected)?

*   [ ] Is the data free of PII? If not, is there a documented legal basis and a retention plan (not legal advice)?

*   [ ] Is only the minimum necessary data being collected?

*   [ ] Is the scraper rate limited per host with backoff on 429/5xx?

*   [ ] Is there an honest User-Agent with contact info?

*   [ ] Is the data stored securely (access control, encryption, audit logs)?

*   [ ] Is there a clear use policy (no republishing or copying creative text wholesale)?

*   [ ] Is there a kill switch (stop conditions when errors spike)?

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-can-landmarks-improve-accessibility-testing-in-puppeteer
----

### Overview

Landmarks such as header, nav, main, aside, and footer provide semantic regions that assist screen readers and keyboard navigation. The feature adds consideration of these landmarks in Puppeteer accessibility checks, making it easier to verify that landmark regions are present and navigable.

### What changed

*   Landmark regions are now surfaced in accessibility-related checks and can be targeted in tests.

### How to use

    // 1) Check that landmark regions exist on the page

    const landmarkCount = await page.$$eval('header, nav, main, aside, footer', els => els.length);

    console.log('landmarks', landmarkCount);

    // 2) List landmark elements by tag name

    const landmarks = await page.$$eval('header, nav, main, aside, footer', els => els.map(el => el.tagName.toLowerCase()));

    console.log(landmarks);

    // 3) Inspect the accessibility tree to verify landmark roles

    const snapshot = await page.accessibility.snapshot();

    console.log(JSON.stringify(snapshot, null, 2));
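
Reading a full JSON dump by eye gets tedious on larger pages. As a minimal sketch, the snapshot tree can be walked to collect only landmark nodes. The `collectLandmarks` helper is hypothetical, and it assumes that snapshot nodes expose ARIA landmark role names (such as `banner` or `navigation`) in their `role` field. The sample tree below is hand-built and stands in for a real snapshot.

```javascript
// ARIA landmark roles. Assumption: snapshot nodes use these names in `role`.
const LANDMARK_ROLES = new Set([
  "banner", "complementary", "contentinfo", "form",
  "main", "navigation", "region", "search",
]);

// Depth-first walk that collects every node whose role is a landmark.
function collectLandmarks(node, found = []) {
  if (!node) return found;
  if (LANDMARK_ROLES.has(node.role)) {
    found.push({ role: node.role, name: node.name ?? "" });
  }
  for (const child of node.children ?? []) {
    collectLandmarks(child, found);
  }
  return found;
}

// Hand-built tree standing in for the result of page.accessibility.snapshot().
const sampleSnapshot = {
  role: "RootWebArea",
  name: "Demo",
  children: [
    { role: "banner", name: "", children: [{ role: "navigation", name: "Primary" }] },
    { role: "main", name: "", children: [{ role: "heading", name: "Title" }] },
    { role: "contentinfo", name: "" },
  ],
};

const landmarkRoles = collectLandmarks(sampleSnapshot).map((l) => l.role);
console.log(landmarkRoles);
```

In a test, the same helper would be called as `collectLandmarks(await page.accessibility.snapshot())`, and the result compared against the landmarks the page is expected to have.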

### Why it matters

Using landmarks lets tests model more closely how real users navigate content, which improves test reliability for keyboard-only and screen reader users.

----
url: https://webcrawlerapi.com/postman
----

Collections

WebcrawlerAPI

*   POST Start Crawl Job

*   GET Get Job Status

*   PUT Cancel Job

*   GET Get Job URLs

*   POST Start Scrape v2

# WebcrawlerAPI

[WebcrawlerAPI](https://www.postman.com/webcrawlerapi)

04:28 AM, June 09, 2025

Collection for WebcrawlerAPI endpoints including crawl, job status, job cancellation, and scraping endpoints

[View complete documentation](https://www.postman.com/webcrawlerapi/documentation/34110378-939796dc-53da-417a-85b1-b5d1c66a49b5)

----
url: https://webcrawlerapi.com/blog/what-is-a-web-crawling
----

Technical · 1 min read

What is webcrawling?

====================

Explore the automated process of web crawling, its essential functions, and the tools that simplify data collection from the vast web.

Written by Andrew

Published on Jan 31, 2026

### Table of Contents

*   [What is Webcrawling](#what-is-webcrawling)

*   [How It Works:](#how-it-works)

*   [Web Crawling vs. Web Scraping:](#web-crawling-vs-web-scraping)

*   [How Webcrawling Works](#how-webcrawling-works)

*   [1\. Starting with a Seed URL](#1-starting-with-a-seed-url)

*   [2\. Fetching and Parsing Web Pages](#2-fetching-and-parsing-web-pages)

*   [3\. Repetition and Link Following](#3-repetition-and-link-following)

*   [1\. Starting with example.com](#1-starting-with-examplecom)

*   [2\. Crawling Subpages](#2-crawling-subpages)

*   [3\. Continuing the Process](#3-continuing-the-process)

*   [Webcrawling Tools and Technologies](#webcrawling-tools-and-technologies)

*   [1\. Overview of WebCrawlerAPI](#1-overview-of-webcrawlerapi)

*   [2\. Features of WebCrawlerAPI](#2-features-of-webcrawlerapi)

*   [Webcrawling vs. Web Scraping](#webcrawling-vs-web-scraping)

*   [Use Cases](#use-cases)

*   [Applications](#applications)

*   [Conclusion](#conclusion)

*   [FAQs](#faqs)

*   [What is meant by web crawling?](#what-is-meant-by-web-crawling)

*   [How are websites crawled?](#how-are-websites-crawled)

What is Webcrawling

===================

Web crawling is the automated process of discovering, navigating, and indexing web pages using programs called web crawlers. These crawlers are essential for tasks like:

*   **Indexing web content** for search engines to keep results updated.

*   **Collecting data** for research or business insights.

*   **Analyzing links** to understand website relationships.

### How It Works:

1.  **Start with a Seed URL**: Crawlers begin at a starting webpage.

2.  **Fetch and Parse Pages**: They extract content, metadata, and links.

3.  **Follow Links**: Crawlers navigate to new pages while avoiding duplicates and respecting website rules like robots.txt.

**Key Tools** like [WebCrawlerAPI](https://webcrawlerapi.com/) simplify this process by automating technical challenges like handling JavaScript and bypassing anti-bot mechanisms.

### Web Crawling vs. Web Scraping:

*   **Web Crawling**: Systematically discovers and indexes web pages.

*   **Web Scraping**: Extracts specific data from selected pages.

| Use Cases | Web Crawling | Web Scraping |
| --- | --- | --- |
| **Search Engines** | Index and update content | Not applicable |
| **E-commerce** | Identify product pages | Extract prices and inventory |
| **Market Research** | Map competitor websites | Gather specific metrics |

Web crawling is the backbone of search engines and data collection, enabling businesses and researchers to navigate the vast web efficiently.

How Webcrawling Works

---------------------

Web crawling follows a structured process that allows search engines and other tools to systematically discover and index web pages. Let’s break it down step by step.

### 1\. Starting with a Seed URL

Crawling begins with a **seed URL** - the first URL that acts as the starting point for finding other pages. For example, if a crawler starts at example.com, this becomes the base for discovering additional links [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[3\]](https://en.wikipedia.org/wiki/Web_spider).

To manage its tasks, the crawler uses a **URL Frontier**, a prioritized queue that determines which URLs to visit next [\[7\]](https://www.promptcloud.com/blog/how-does-a-web-crawler-work/).

### 2\. Fetching and Parsing Web Pages

Once the crawler selects a URL, it sends a request to fetch the page's content. After downloading the page, it processes the HTML to extract key information like text, metadata, and links to other pages [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[3\]](https://en.wikipedia.org/wiki/Web_spider).

The **Parser component** plays a crucial role here, analyzing the page to extract content and identify links for further crawling [\[7\]](https://www.promptcloud.com/blog/how-does-a-web-crawler-work/).
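
As a rough illustration of the link-identification step, the sketch below pulls `href` values out of an HTML string and resolves them against the page URL. The `extractLinks` name is hypothetical, and the regular expression is a deliberate simplification: a production parser would use a real HTML parser rather than pattern matching.

```javascript
// Minimal link extraction, assuming simple <a href="..."> markup.
function extractLinks(html, baseUrl) {
  const links = new Set(); // a Set drops duplicate URLs
  const hrefPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(hrefPattern)) {
    try {
      // Resolve relative links such as "/blog" against the page URL.
      const url = new URL(match[1], baseUrl);
      if (url.protocol !== "http:" && url.protocol !== "https:") continue;
      url.hash = ""; // "#section" fragments point at the same page
      links.add(url.href);
    } catch {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return [...links];
}

const sampleHtml = `
  <a href="/blog">Blog</a>
  <a href="about#team">About</a>
  <a href="mailto:hello@example.com">Mail</a>
  <a href="https://example.com/blog">Blog again</a>
`;
const extracted = extractLinks(sampleHtml, "https://example.com/");
console.log(extracted);
// → [ 'https://example.com/blog', 'https://example.com/about' ]
```

Normalizing URLs at this stage (resolving relative paths, dropping fragments) keeps later duplicate checks simple.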

### 3\. Repetition and Link Following

Crawling is a continuous process. For example, starting at example.com, the crawler might discover links to example.com/blog and example.com/about. It then follows these links to find even more pages, like example.com/blog/how-to-choose-a-book [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[3\]](https://en.wikipedia.org/wiki/Web_spider).

To operate effectively, crawlers:

*   **Track visited URLs** to avoid processing the same page multiple times.

*   **Prioritize URLs** based on relevance or importance.

*   **Limit request rates** to prevent overloading servers.

*   **Follow website rules**, such as those specified in robots.txt.

This systematic approach forms the core of web crawling. Tools like WebCrawlerAPI make the process easier by automating these steps [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[3\]](https://en.wikipedia.org/wiki/Web_spider).
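
The loop described in this section can be sketched in a few lines. This is a simplified model under stated assumptions: the `site` object is a hand-written link graph standing in for real HTTP fetches, the frontier is a plain first-in-first-out queue rather than a prioritized one, and rate limiting and robots.txt checks are left out.

```javascript
// In-memory link graph standing in for fetched pages (illustrative data).
const site = {
  "https://example.com/": ["https://example.com/blog", "https://example.com/about"],
  "https://example.com/blog": ["https://example.com/blog/how-to-choose-a-book", "https://example.com/"],
  "https://example.com/about": ["https://example.com/"],
  "https://example.com/blog/how-to-choose-a-book": ["https://example.com/blog"],
};

// Breadth-first crawl: a frontier queue plus a visited set to avoid repeats.
function crawl(seedUrl, getLinks, maxPages = 100) {
  const frontier = [seedUrl];
  const visited = new Set();
  const order = [];
  while (frontier.length > 0 && order.length < maxPages) {
    const url = frontier.shift();
    if (visited.has(url)) continue; // already processed
    visited.add(url);
    order.push(url);
    for (const link of getLinks(url)) {
      if (!visited.has(link)) frontier.push(link);
    }
  }
  return order;
}

const crawlOrder = crawl("https://example.com/", (url) => site[url] ?? []);
console.log(crawlOrder);
// → /, /blog, /about, /blog/how-to-choose-a-book (as full URLs)
```

The `maxPages` cap is a simple guard against runaway crawls. Swapping the queue for a priority queue would turn this into the prioritized frontier described earlier.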

Example: Web Crawling in Practice

---------------------------------

Here's a practical look at how web crawling works, following a crawler as it navigates through a website's structure.

### 1\. Starting with example.com

The crawler begins at example.com, identifying links in the navigation menu like:

*   example.com/blog

*   example.com/about

*   example.com/products

These URLs are added to the crawler's URL Frontier - a queue that determines the order of page visits [\[7\]](https://www.promptcloud.com/blog/how-does-a-web-crawler-work/).

### 2\. Crawling Subpages

When the crawler visits example.com/blog, it finds blog post URLs such as:

*   example.com/blog/how-to-choose-a-book

*   example.com/blog/top-10-books

*   example.com/blog/reading-tips

The crawler indexes the content of each post, extracts new links, and analyzes how the pages connect [\[6\]](https://hikeseo.co/learn/onsite/technical/crawling/).

### 3\. Continuing the Process

On pages like example.com/blog/how-to-choose-a-book, the crawler uncovers more links to:

*   Related articles

*   Category pages

*   Author profiles

*   Resource pages

This step-by-step process creates a detailed map of the website [\[5\]](https://www.elastic.co/what-is/web-crawler). Tools like WebCrawlerAPI make this easier by automating tasks like JavaScript rendering and bypassing anti-bot measures. This allows developers to focus on using the data rather than handling technical hurdles.

Next, let's dive into tools like WebCrawlerAPI that streamline and optimize web crawling.

Webcrawling Tools and Technologies

----------------------------------

Modern web crawling often involves navigating complex websites, handling JavaScript, and overcoming anti-bot defenses. Tools like **WebCrawlerAPI** make this process much easier by automating tasks such as link discovery and data extraction.

### 1\. Overview of [WebCrawlerAPI](https://webcrawlerapi.com/)

**WebCrawlerAPI** streamlines web crawling by tackling technical hurdles like JavaScript rendering and anti-bot mechanisms. This allows developers to focus on analyzing the data rather than dealing with maintenance issues. It can extract data in various formats, including Markdown, HTML, and plain text, making it adaptable to different projects.

### 2\. Features of WebCrawlerAPI

WebCrawlerAPI is designed to handle projects of any scale with ease. Some of its key features include:

*   **Compatibility with** [**Python**](https://webcrawlerapi.com/blog/how-to-crawl-the-website-with-python)**,** [**Node.js**](https://nodejs.org/en)**, and** [**PHP**](https://www.php.net/)

*   **Built-in algorithms for cleaning and validating data**

*   **A scalable cloud-based infrastructure**

The platform automatically manages challenges like CAPTCHAs and IP blocks, ensuring accurate data extraction while saving time. This automation makes it a more efficient alternative to building custom solutions.

Webcrawling vs. Web Scraping

----------------------------

Web crawling and web scraping are two distinct processes with different goals. **Web crawling** is about discovering and indexing web pages systematically, while **web scraping** focuses on pulling specific data from selected pages. Knowing the difference can help developers pick the right method for their data collection tasks.

### Use Cases

Web crawling is commonly used for:

*   Indexing and monitoring web content for search engines or organizations [\[2\]](https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/).

*   Preserving digital content through web archiving services.

*   Analyzing links and mapping site structures.

Web scraping, on the other hand, is ideal for:

*   Tracking prices in e-commerce.

*   Aggregating content from specific sources.

*   Collecting research data for analysis [\[4\]](https://www.scraperapi.com/web-scraping/crawling-vs-scraping/).

### Applications

In many industries, these two methods work together to achieve different goals. Here's how they complement each other:

| Industry | Web Crawling Role | Web Scraping Role |
| --- | --- | --- |
| E-commerce | Identify product pages | Extract prices and inventory |
| Market Research | Map competitor landscapes | Gather metrics and sentiment |
| Academic Research | Index research publications | Extract citations and data |
| Digital Marketing | Monitor site structure | Collect marketing metrics |

For instance, market researchers might use web crawling to locate competitor pages, then apply web scraping to pull specific details like pricing or product features [\[2\]](https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/)[\[4\]](https://www.scraperapi.com/web-scraping/crawling-vs-scraping/).

Conclusion

----------

Web crawling is the backbone of modern data collection, systematically navigating and indexing websites to power countless internet applications.

More than just gathering data, web crawling is essential for search engines, helping them keep search results up-to-date and wide-ranging [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[2\]](https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/). Over time, this technology has adapted to tackle increasingly complex challenges [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[3\]](https://en.wikipedia.org/wiki/Web_spider).

Tools like WebCrawlerAPI have streamlined the process, offering automated solutions that make web data collection faster and more scalable. These tools ensure high-quality, accurate data while simplifying workflows for developers and data professionals.

Web crawling is making an impact across various industries:

| Industry | Role of Web Crawling |
| --- | --- |
| Search Engines | Keeps content indexed and updated in real time |
| Digital Marketing | Enables market research and competitor analysis |
| Academic Research | Assists in large-scale data collection for studies |
| E-commerce | Supports price tracking and product catalog updates |

As web technologies continue to advance, tools like WebCrawlerAPI will further refine the crawling process, improving how developers and organizations handle dynamic content and algorithms. This ensures web crawling remains a cornerstone of data-driven solutions.

FAQs

----

Now that we've covered the basics of web crawling, let's dive into some common questions about how it works.

### What is meant by web crawling?

Web crawling refers to the automated process of navigating and mapping interconnected web pages. It's essential for search engines and data collection systems, helping them maintain updated indexes of online content [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[2\]](https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/). When search engines like Google crawl new pages, they add the discovered content to their index, making it searchable for users.

### How are websites crawled?

Crawling starts with seed URLs, which act as the starting points for discovering links [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[3\]](https://en.wikipedia.org/wiki/Web_spider). Here's a breakdown of the process:

| Stage | Description | Example |
| --- | --- | --- |
| Initial Access | Begins with a seed URL | example.com |
| Link Discovery | Identifies and queues new links | Finds example.com/blog |
| Organizing URLs | Prepares new URLs for crawling | Queues URLs for analysis |
| Content Processing | Downloads and analyzes content | Extracts text and metadata |

> "Web crawling indexes websites by systematically following links and mapping their structure" [\[1\]](https://www.promptcloud.com/blog/web-crawlers-a-complete-guide/)[\[2\]](https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/).

To ensure smooth crawling, organizations need to factor in:

*   Server capacity and timing of crawl requests

*   Website structure and navigation

*   Frequency of content updates

*   Any technical constraints

----
url: https://webcrawlerapi.com/blog/yaml-vs-csv-choosing-the-right-format-for-llm-prompts
----

Comparison · YAML · CSV · RAG

YAML vs CSV: Choosing the Right Format for LLM Prompts

======================================================

YAML vs CSV for prompt data and scraping outputs: config manifests vs flat tables, with practical crawling and RAG examples.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What YAML is good at](#what-yaml-is-good-at)

*   [What CSV is good at](#what-csv-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When YAML should be used](#when-yaml-should-be-used)

*   [When CSV should be used](#when-csv-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [YAML becomes fragile at scale](#yaml-becomes-fragile-at-scale)

*   [CSV forces flattening](#csv-forces-flattening)

*   [Node.js snippet: Generate CSV from a simple YAML-like manifest](#nodejs-snippet-generate-csv-from-a-simple-yaml-like-manifest)

*   [Conclusion](#conclusion)

YAML and CSV are often picked for "human friendliness", but they represent different shapes. YAML is key-value and nested. CSV is flat rows and columns.

A full format overview is available in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | YAML | CSV |
| --- | --- | --- |
| Best for | Config-like manifests | Flat tabular datasets |
| Human editing | High | High (spreadsheets) |
| Nesting | Supported | Not supported |
| Parsing reliability | Medium to High | High (with correct quoting) |
| Common failure | Indentation and implicit types | Commas/quotes/newlines in fields |

What YAML is good at

--------------------

YAML is usually selected for:

*   Crawl/extraction manifests (rules, flags, selectors)

*   Small per-page records that humans will tweak

*   Files where inline comments are useful

YAML compared to JSON is covered in [JSON vs YAML](/blog/json-vs-yaml-choosing-the-right-format-for-llm-prompts).

What CSV is good at

-------------------

CSV is usually selected for:

*   Exports of extracted data

*   One row per page/product

*   Quick review in spreadsheets

If objects and metadata are needed, [JSON vs CSV](/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts) is often the better comparison.

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When YAML should be used

YAML is usually preferred when:

*   Humans will edit the output before it is used

*   A manifest is needed (what to extract, how to filter)

*   Nesting is useful (per-domain rules, per-section options)

### When CSV should be used

CSV is usually preferred when:

*   A stable set of columns exists

*   Export and reporting is the primary goal

*   The data is already flat (directory listings, price tables)

For readable narrative outputs, Markdown is often selected instead, as covered in [Markdown vs YAML](/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts) and [Markdown vs CSV](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts).

Practical tradeoffs

-------------------

### YAML becomes fragile at scale

As nesting grows, small indentation errors can break parsing. That risk grows when thousands of records are generated.

### CSV forces flattening

When nested data exists, flattening decisions must be made. Those decisions are often the real problem, not the format.
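
One common flattening decision is to turn nested keys into dot-separated column names and to join arrays into a single delimited cell. The sketch below shows that choice. The `flattenRecord` name, the `.` path separator, and the `|` array delimiter are illustrative assumptions, not a standard.

```javascript
// Flatten a nested record into a single-level object suitable for CSV columns.
function flattenRecord(value, prefix = "", out = {}) {
  for (const [key, child] of Object.entries(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (Array.isArray(child)) {
      out[path] = child.join("|"); // decision: one cell, pipe-delimited
    } else if (child !== null && typeof child === "object") {
      flattenRecord(child, path, out); // decision: nested keys become "a.b"
    } else {
      out[path] = child;
    }
  }
  return out;
}

const flat = flattenRecord({
  url: "https://example.com/a",
  meta: { title: "A", tags: ["news", "docs"] },
});
console.log(flat);
// → { url: 'https://example.com/a', 'meta.title': 'A', 'meta.tags': 'news|docs' }
```

Each branch encodes a decision that readers of the CSV must know about, for example that a `|` inside a tag value would now be ambiguous. That is the kind of tradeoff that matters more than the file format itself.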

Node.js snippet: Generate CSV from a simple YAML-like manifest

--------------------------------------------------------------

No YAML parser is used here; a JSON array stands in for the parsed manifest. A common approach is to keep YAML for job manifests and generate CSV only for extracted tabular outputs.

    // Node 18+

    // Create a tiny CSV from a JSON array (standing in for parsed YAML).

    const items = [

      { url: "https://example.com/a", category: "news" },

      { url: "https://example.com/b", category: "docs" },

    ];

    const headers = ["url", "category"];

    const lines = [headers.join(",")];

    for (const item of items) {

      lines.push(

        headers

          .map((h) => `"${String(item[h] ?? "").replaceAll('"', '""')}"`)

          .join(",")

      );

    }

    console.log(lines.join("\n"));

Conclusion

----------

*   YAML is usually selected for human-edited manifests and nested config-like data.

*   CSV is usually selected for flat datasets and exports.

*   In many crawling pipelines, YAML (or JSON) is used for configuration and CSV is used only as an export format.

If minimal text is desired, [YAML vs Plain Text](/blog/yaml-vs-plain-text-choosing-the-right-format-for-llm-prompts) can be compared next.

----
url: https://webcrawlerapi.com/blog/upload-website-content-to-chatgpt
----

RAG · Tutorial · 15 min read

How to upload website content to ChatGPT

========================================

Learn how to upload website content to ChatGPT to generate human-like text based on the scraped content of your website.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Download website content](#download-website-content)

*   [Add website content to a custom GPT](#add-website-content-to-customgtp)

Sometimes you want to download a website's content and use it as the knowledge base for a custom GPT. No technical knowledge is required. Here are the basic steps to build a simple chatbot from any website.

Download website content

------------------------

1.  First, go to [WebcrawlerAPI Dashboard Start new crawl job](https://dash.webcrawlerapi.com/jobs/new).

2.  Paste your website URL, set the webpage limit, and choose the "cleaned content" option (read more about the "cleaned content" option [here](https://webcrawlerapi.com/docs/crawling/crawling-types/#cleaned-scraping)).

Here is the setup with [Postman Learning Center](https://learning.postman.com/) as an example:

3.  Wait until all pages are crawled. This can take from a couple of seconds to several minutes, depending on the number of pages on the website.

4.  When the job status is done, you will see the "Download" button. Tap the "Cleaned CSV" option and wait until your file is prepared.

You will see the download icon once it is ready.

Tap it to save the file locally with the website content.

### Add website content to a custom GPT

1.  Go to [New GPT](https://chatgpt.com/gpts/editor).

2.  Write the name and description of your website chatbot.

3.  Tap "Upload" and choose the file with website-crawled content.

4.  Write "Instructions", mentioning that GPT should search the knowledge base when answering (make sure the "Code Interpreter & Data Analysis" option is enabled).

5.  Tap the "Create" button.

That's it. You can now use this custom GPT, and it will first try to answer using the crawled content you attached.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-configure-cdp-message-id-generator-in-puppeteer
----

The CDP message ID generator can be configured by passing a custom idGenerator to the Connection constructor. This enables sharing the same transport across multiple connections without ID conflicts. The idGenerator option is optional; if omitted, the default generator is used.

Example:

    // example optional id generator

    const myIdGenerator = (() => {

      let id = 0;

      return () => ++id;

    })();

    const connection = new Connection(transport, { idGenerator: myIdGenerator });
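
To show why a shared generator avoids ID conflicts, here is a self-contained sketch that does not use Puppeteer at all. `FakeConnection` is a hypothetical stand-in written for this illustration (it is not Puppeteer's `Connection` class), and the transport is just an array that records outgoing messages.

```javascript
// Counter-based generator, same idea as the example above.
function createIdGenerator() {
  let id = 0;
  return () => ++id;
}

// Hypothetical stand-in for a CDP connection; NOT Puppeteer's Connection.
class FakeConnection {
  constructor(transport, { idGenerator = createIdGenerator() } = {}) {
    this.transport = transport;
    this.nextId = idGenerator;
  }

  // Tag each outgoing message with the next ID and record it on the transport.
  send(method) {
    const message = { id: this.nextId(), method };
    this.transport.sent.push(message);
    return message.id;
  }
}

const transport = { sent: [] };
const sharedIds = createIdGenerator();

// Both connections draw from one counter, so IDs never collide on the transport.
const first = new FakeConnection(transport, { idGenerator: sharedIds });
const second = new FakeConnection(transport, { idGenerator: sharedIds });

first.send("Page.enable");
second.send("Runtime.enable");
first.send("DOM.enable");

const sentIds = transport.sent.map((message) => message.id);
console.log(sentIds); // → [ 1, 2, 3 ]
```

If each connection used its own default generator instead, both would start at 1, and responses arriving on the shared transport could not be matched to the right sender.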

----
url: https://webcrawlerapi.com/docs/access-key
----

API access key

==============

How to sign up and get an API key for Webcrawler API

To use the Webcrawler API you need to obtain an API key. The API key is used to authenticate your requests to the API.

To get an API key, sign up and copy your key from the [API key section](https://dash.webcrawlerapi.com/access).

The API key is a secret and should be kept confidential. Do not share your API key in publicly accessible places such as GitHub or client-side code. If you suspect that your API key has been compromised, you can regenerate it at any time.

Webcrawler API uses the Bearer token authentication scheme. Include the API key in the `Authorization` header of your requests, so the header looks like this:

    Authorization: Bearer <PASTE YOUR API KEY HERE>

For example, to make your first request you can use the following curl command:

    curl -i --request POST \

      --url https://api.webcrawlerapi.com/v1/crawl \

      --header 'Authorization: Bearer 009b131d7080df28640e' \

      --header 'Content-Type: application/json' \

      --data '{

        "items_limit": 5,

        "url": "https://example.com",

        "scrape_type": "cleaned"

    }'

If you have done everything correctly, you will receive a response with a `200` status code and a JSON object with an `id` field. Example:

    {

        "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b"

    }
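
The same request can be sketched in Node.js. The endpoint, header format, and body fields come from the curl example above. The function names, the `WEBCRAWLERAPI_KEY` environment variable, and the error messages are assumptions made for illustration.

```javascript
// Node 18+ (uses the built-in fetch). Names below are illustrative.
const CRAWL_ENDPOINT = "https://api.webcrawlerapi.com/v1/crawl";

// Build the fetch options: Bearer token header plus a JSON body.
function buildCrawlRequest(apiKey, targetUrl, itemsLimit = 5) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // JSON.stringify never emits a trailing comma, so the body is valid JSON.
    body: JSON.stringify({
      items_limit: itemsLimit,
      url: targetUrl,
      scrape_type: "cleaned",
    }),
  };
}

// Send the request and return the job id from the response.
async function startCrawl(apiKey, targetUrl) {
  const response = await fetch(CRAWL_ENDPOINT, buildCrawlRequest(apiKey, targetUrl));
  if (response.status === 401) {
    throw new Error("Unauthorized: check that the API key was copied correctly");
  }
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const { id } = await response.json();
  return id;
}
```

A call might look like `startCrawl(process.env.WEBCRAWLERAPI_KEY, "https://example.com")`. Reading the key from an environment variable keeps it out of source control.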

### [Unauthorized error](#unauthorized-error)

If you receive a `401 Unauthorized` error with a body like this:

    {

        "error": "Unauthorized"

    }

it means that you have provided an invalid API key. Please double-check that you have copied the key correctly.

If you still have any questions or need help, feel free to contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com). We are always happy to help you!

----
url: https://webcrawlerapi.com/blog/convert-any-website-to-rss-feed
----

RSS · Atom · JSON Feed · Tutorial · Web Scraping

How to Convert Any Website to an RSS Feed

=========================================

Need updates from a site you do not control? Create a WebCrawlerAPI feed for any URL, then read changes as JSON Feed or Atom (RSS-style) from simple endpoints.

Written by Andrew

Published on Feb 6, 2026

### Table of Contents

*   [How to Convert Any Website to an RSS Feed](#how-to-convert-any-website-to-an-rss-feed)

*   [What is being built](#what-is-being-built)

*   [Choose the right source URL (this matters more than tools)](#choose-the-right-source-url-this-matters-more-than-tools)

*   [Create the feed (POST /v2/feed)](#create-the-feed-post-v2feed)

*   [Receive updates as JSON Feed and RSS](#receive-updates-as-json-feed-and-rss)

*   [Scope the crawl so it does not run away](#scope-the-crawl-so-it-does-not-run-away)

*   [Why query params can break everything](#why-query-params-can-break-everything)

*   [Pick the content format you actually need](#pick-the-content-format-you-actually-need)

*   [Polling vs webhooks](#polling-vs-webhooks)

*   [What the webhook sends](#what-the-webhook-sends)

*   [Resending a webhook](#resending-a-webhook)

*   [Common failure modes (and what usually fixes them)](#common-failure-modes-and-what-usually-fixes-them)

*   [Related reading](#related-reading)



Sometimes updates are needed from a website you do not own. A vendor blog. A changelog. A docs page that quietly changes. New posts and changelog updates are still wanted, but there is no official feed. In real life, that is when a site is converted to an RSS feed so updates can be pulled programmatically via an API.

It can be done with WebCrawlerAPI feeds: a feed is created with POST https://api.webcrawlerapi.com/v2/feed, then changes are read as JSON Feed 1.1 from GET https://api.webcrawlerapi.com/v2/feed/:id/json, or as Atom 1.0 (RSS-style) from GET https://api.webcrawlerapi.com/v2/feed/:id/rss.

    # 1) Create feed

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed" \

      -H "Authorization: Bearer <YOUR_API_KEY>" \

      -H "Content-Type: application/json" \

      -d '{

        "url": "https://example.com/changelog",

        "name": "Example Changelog",

        "scrape_type": "markdown",

        "items_limit": 10,

        "max_depth": 1,

        "respect_robots_txt": false,

        "main_content_only": true

      }'

    # 2) Receive updates as JSON Feed 1.1

    # Content-Type: application/feed+json; charset=utf-8

    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/json?page=1&page_size=50" \

      -H "Authorization: Bearer <YOUR_API_KEY>"

    # 3) Receive updates as Atom 1.0 (RSS-style)

    # Content-Type: application/atom+xml; charset=utf-8

    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/rss?page=1&page_size=50" \

      -H "Authorization: Bearer <YOUR_API_KEY>"

Tiny response examples:

    {

      "version": "https://jsonfeed.org/version/1.1",

      "title": "WebCrawlerAPI Feed: example.com",

      "items": [

        {

          "id": "item123",

          "url": "https://example.com/changelog/v1-2-3",

          "title": "v1.2.3",

          "summary": "New page discovered",

          "date_modified": "2026-02-06T10:00:00Z",

          "_webcrawlerapi": {

            "change_type": "new",

            "content_url": "https://cdn.webcrawlerapi.com/content/..."

          }

        }

      ]

    }

    <feed xmlns="http://www.w3.org/2005/Atom">

      <title>WebCrawlerAPI Feed: example.com</title>

      <entry>

        <id>urn:webcrawlerapi:feeditem:item123</id>

        <title>New: v1.2.3</title>

        <updated>2026-02-06T10:00:00Z</updated>

        <link href="https://example.com/changelog/v1-2-3" rel="alternate" />

        <summary type="text">New page discovered</summary>

      </entry>

    </feed>
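As a minimal sketch of consuming the JSON Feed shape shown above, the following Python flattens each item into a small record. The sample payload mirrors the example response; `summarize_items` is a hypothetical helper name for illustration, not part of any WebCrawlerAPI SDK.

```python
import json

# Sample payload copied from the JSON Feed example response.
SAMPLE_FEED = json.loads("""
{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "WebCrawlerAPI Feed: example.com",
  "items": [
    {
      "id": "item123",
      "url": "https://example.com/changelog/v1-2-3",
      "title": "v1.2.3",
      "summary": "New page discovered",
      "date_modified": "2026-02-06T10:00:00Z",
      "_webcrawlerapi": {
        "change_type": "new",
        "content_url": "https://cdn.webcrawlerapi.com/content/..."
      }
    }
  ]
}
""")


def summarize_items(feed: dict) -> list[dict]:
    """Return one flat record per feed item (hypothetical helper)."""
    records = []
    for item in feed.get("items", []):
        # The vendor extension block may be absent, so default to an empty dict.
        extra = item.get("_webcrawlerapi", {})
        records.append(
            {
                "id": item["id"],
                "url": item.get("url"),
                "title": item.get("title"),
                "change_type": extra.get("change_type"),
            }
        )
    return records


if __name__ == "__main__":
    for record in summarize_items(SAMPLE_FEED):
        print(record["change_type"], record["url"])
```

Keeping the parser tolerant of missing optional fields means the same function can be reused if the payload shape varies between items.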

What is being built

-------------------

Once a feed is set up, a stable output is produced that can be plugged into:

*   an RSS reader (Atom endpoint)

*   a Slack bot (webhook)

*   a cron job (poll JSON feed)

*   a database sync (store item IDs and change types)

This is the practical way to get RSS-style output without relying on the site owner.

Choose the right source URL (this matters more than tools)

----------------------------------------------------------

Most failures are caused by choosing the wrong URL.

If the goal is to convert a web page to an RSS feed, the “page” should be a listing page that changes over time, not the homepage.

What usually works:

*   Blog updates: /blog, not /

*   Changelog updates: /changelog, /releases, /updates

*   Docs notes: a “What’s new” index page

*   Security advisories: advisory index page, not a single CVE page

What should be avoided:

*   pages with infinite filters and sorts (faceted navigation)

*   internal search pages that change per user/session

*   URLs with tracking parameters (utm\_\*, fbclid, etc.)

In other words: start from the cleanest “index of updates” page you can find.
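One way to clean a candidate source URL before using it as a seed is to drop tracking parameters locally. This is a sketch using only the Python standard library; the helper name and the exact list of tracking keys (including `gclid`) are assumptions for illustration, and the fragment is dropped as well.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking keys; extend to match what the target site emits.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_tracking_params(url: str) -> str:
    """Return the URL without tracking query parameters or fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(strip_tracking_params("https://example.com/blog?utm_source=x&fbclid=abc"))
```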

Create the feed (POST /v2/feed)

-------------------------------

Only one field is required: url.

Useful optional fields:

*   scrape\_type: markdown (default), cleaned, or html

*   items\_limit: max pages crawled per run (default: 10)

*   max\_depth: link-follow depth from the seed URL (0-10)

*   whitelist\_regexp: only URLs that match are crawled

*   blacklist\_regexp: URLs that match are skipped

*   respect\_robots\_txt: [robots.txt](/blog/how-to-crawl-the-website-with-python#respecting-robots-txt-and-rate-limiting) is respected when set to true (default: false)

*   main\_content\_only: boilerplate is removed when set to true (default: false)

*   webhook\_url: changes are pushed to your server when set

This is a practical starting payload for a changelog feed:

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed" \

      -H "Authorization: Bearer <YOUR_API_KEY>" \

      -H "Content-Type: application/json" \

      -d '{

        "url": "https://example.com/changelog",

        "name": "Example Changelog",

        "scrape_type": "markdown",

        "items_limit": 20,

        "max_depth": 1,

        "respect_robots_txt": true,

        "main_content_only": true,

        "webhook_url": "https://yourserver.com/webhook"

      }'

The returned id is the only thing needed for the read endpoints.
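The same payload can be assembled in code before it is sent. The sketch below validates only the constraints stated in the field list above (a required url, the three scrape types, and a depth of 0-10); `build_feed_payload` is a hypothetical helper, and the API may enforce further rules that are not modeled here.

```python
import json

ALLOWED_SCRAPE_TYPES = {"markdown", "cleaned", "html"}


def build_feed_payload(url: str, **options) -> dict:
    """Build a POST /v2/feed body, checking the documented constraints."""
    if not url:
        raise ValueError("url is required")

    scrape_type = options.get("scrape_type")
    if scrape_type is not None and scrape_type not in ALLOWED_SCRAPE_TYPES:
        raise ValueError(f"unsupported scrape_type: {scrape_type!r}")

    max_depth = options.get("max_depth")
    if max_depth is not None and not 0 <= max_depth <= 10:
        raise ValueError("max_depth must be between 0 and 10")

    # Only explicitly set options are included, so server defaults still apply.
    return {"url": url, **options}


if __name__ == "__main__":
    payload = build_feed_payload(
        "https://example.com/changelog",
        name="Example Changelog",
        items_limit=20,
        max_depth=1,
        respect_robots_txt=True,
        main_content_only=True,
    )
    print(json.dumps(payload, indent=2))
```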

Receive updates as JSON Feed and RSS

------------------------------------

Two formats are supported:

*   JSON Feed 1.1: easiest for code

*   Atom 1.0 (RSS-style): easiest for RSS readers

    # JSON Feed

    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/json" \

      -H "Authorization: Bearer <YOUR_API_KEY>"

    # Atom (RSS-style)

    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/rss" \

      -H "Authorization: Bearer <YOUR_API_KEY>"
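For polling from code, the read URLs can be built and fetched with the standard library. This is a sketch under stated assumptions: the `page` and `page_size` parameters are taken from the curl example earlier in this article, and `feed_read_url` and `fetch_json_feed` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.webcrawlerapi.com/v2/feed"


def feed_read_url(feed_id: str, fmt: str = "json", page=None, page_size=None) -> str:
    """Build the read endpoint URL for a feed in 'json' or 'rss' format."""
    if fmt not in ("json", "rss"):
        raise ValueError("fmt must be 'json' or 'rss'")
    query = {
        key: value
        for key, value in (("page", page), ("page_size", page_size))
        if value is not None
    }
    url = f"{API_BASE}/{feed_id}/{fmt}"
    return f"{url}?{urlencode(query)}" if query else url


def fetch_json_feed(feed_id: str, api_key: str, **paging) -> dict:
    """Fetch the JSON Feed for a feed id (performs a real HTTP request)."""
    request = urllib.request.Request(
        feed_read_url(feed_id, "json", **paging),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    print(feed_read_url("<FEED_ID>", "rss", page=1, page_size=50))
```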

Scope the crawl so it does not run away

---------------------------------------

In practice, this usually means: “[watch a small slice of the site](/blog/how-to-build-a-web-crawler#decide-on-depth-and-scope), and only the pages that matter”.

Scoping is where a feed succeeds or fails:

*   [max\_depth](/blog/how-to-build-a-web-crawler#decide-on-depth-and-scope) should be kept low (often 0 or 1)

*   whitelist\_regexp should be used to keep only the pages that matter

*   blacklist\_regexp should be used to block traps (search, tags, query strings)

### Why query params can break everything

Many sites generate endless URL variants:

*   ?page=2

*   ?sort=newest

*   ?tag=security

*   ?price\_min=...&price\_max=...

That creates an [infinite URL space](/blog/how-to-crawl-the-website-with-python#crawling-all-links-on-a-website-full-site-crawl). Budget is wasted. Duplicates appear. “New item” detection gets noisy.

Practical approach:

1.  Start with max\_depth: 0 or 1.

2.  Add a whitelist\_regexp that matches only the pages you want.

3.  Add a blacklist\_regexp that blocks obvious traps.
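Candidate patterns can be checked locally before they are sent as whitelist\_regexp and blacklist\_regexp. The sketch below uses Python's `re.search`; the service's regex dialect and matching rules are not specified here, so treat it as an approximation. The patterns and the `is_in_scope` helper are illustrative assumptions.

```python
import re

# Illustrative patterns for a changelog feed on example.com.
WHITELIST = r"^https://example\.com/changelog(/[^?#]*)?$"
BLACKLIST = r"[?&](page|sort|tag)="


def is_in_scope(url: str, whitelist_regexp=None, blacklist_regexp=None) -> bool:
    """Return True if the URL passes the blacklist and then the whitelist."""
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True


if __name__ == "__main__":
    for candidate in (
        "https://example.com/changelog/v1-2-3",
        "https://example.com/changelog?sort=newest",
        "https://example.com/blog/post",
    ):
        print(candidate, is_in_scope(candidate, WHITELIST, BLACKLIST))
```

Running a handful of known good and known bad URLs through the patterns first makes it easier to spot a whitelist that is too loose before a crawl spends budget on it.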

Pick the content format you actually need

-----------------------------------------

scrape\_type controls the stored content format:

*   markdown (default): good for reading and diffing

*   cleaned: good when “just the text” is needed

*   html: good when structure is needed (tables, code blocks, rich layout)

main\_content\_only can be enabled when nav/footers are too noisy. It is helpful, but it is not magic.

If you are building a crawler yourself, this decision usually lives in the [Parser](/blog/how-to-build-a-web-crawler#parser-the-reader) stage.

Polling vs webhooks

-------------------

Updates can be consumed by polling the feed endpoints. Or they can be pushed via a webhook.

webhook\_url is useful when:

*   latency matters (alerts should arrive quickly)

*   many feeds are tracked and fewer cron jobs are desired

*   a webhook receiver already exists

Even with webhooks, item processing should be made [idempotent](/blog/how-to-crawl-the-website-with-python#handling-errors-and-retries) by storing the JSON Feed item id, then reconciling by id on every fetch.

### What the webhook sends

When webhook\_url is set, an HTTP POST request is sent after a feed run completes. The request body is JSON Feed 1.1 (the same shape as GET /v2/feed/:id/json), so the same parser can be reused.

Two practical details should be known:

*   Only new and changed items are pushed to the webhook.

*   unavailable items are tracked in the feed, but are not pushed to the webhook.

This is why the webhook should be treated as a trigger, not as a database. When the webhook fires, the JSON feed can be fetched and reconciled by item.id.
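Reconciling by item id can be sketched as follows. The in-memory dict stands in for a real database, and using date\_modified as the version marker is an assumption for illustration; `reconcile` is a hypothetical helper name.

```python
def reconcile(seen: dict, feed: dict) -> list:
    """Return items that are new or changed, and record them in `seen`.

    `seen` maps item id -> last processed date_modified value.
    Calling this twice with the same feed yields nothing the second time,
    which is what makes the processing idempotent.
    """
    fresh = []
    for item in feed.get("items", []):
        item_id = item["id"]
        modified = item.get("date_modified")
        if item_id in seen and seen[item_id] == modified:
            continue  # already processed this version of the item
        seen[item_id] = modified
        fresh.append(item)
    return fresh


if __name__ == "__main__":
    store = {}
    feed = {"items": [{"id": "item123", "date_modified": "2026-02-06T10:00:00Z"}]}
    print(len(reconcile(store, feed)))  # first delivery
    print(len(reconcile(store, feed)))  # replayed delivery
```

The same function works whether the feed arrived in a webhook body or was fetched by a poller, so a resent webhook does not produce duplicate work.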

### Resending a webhook

If your endpoint was down, the last completed feed run can be replayed:

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/webhook/resend" \

      -H "Authorization: Bearer <YOUR_API_KEY>"

Common failure modes (and what usually fixes them)

--------------------------------------------------

*   [JavaScript rendering](/blog/how-to-crawl-the-website-with-python#crawling-javascript-heavy-websites-with-python): a different source URL is chosen if possible; otherwise a rendering approach is required

*   Missing dates: list pages with visible dates are preferred; detail pages may need to be crawled

*   Duplicates: query params are blocked; canonical paths are whitelisted

*   Pagination: a low depth is used and a whitelist is applied; full historical mirrors are avoided

*   403/429 blocks: scope is reduced and crawling is slowed; robots.txt is respected when needed

Related reading

---------------

*   [How to crawl the website with Python](/blog/how-to-crawl-the-website-with-python)

*   [How to Build a Web Crawler](/blog/how-to-build-a-web-crawler)

----
url: https://webcrawlerapi.com/changelog/2025-06-07-scrape-v2
----


June 7, 2025

Scrape API v2 Released

======================

Scrape API v2 is now live.

The new version is simpler and more straightforward. It lets you run a prompt on the page. You can get results in markdown, cleaned text, or HTML. Scraping now runs synchronously, with a single API call.

The new endpoint is at https://api.webcrawlerapi.com/v2/scrape. See the [API Reference](/docs/api/scrape) for details.

What's new?

-----------

*   Scraping is now sync, with a single API call.

*   You can remove parts of the page using CSS selectors.

*   You can get results in markdown, cleaned text, or HTML.

*   You can run a prompt on the page.

*   Error handling is improved.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-can-i-expose-backendnodeid-in-the-a11y-snapshot
----


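Puppeteer's `page.accessibility.snapshot()` returns a tree of serialized nodes (a root object with `children`), not a flat `nodes` array, and a `backendNodeId` field on those nodes is not something to rely on across Puppeteer versions. To map accessibility nodes to their corresponding DOM nodes, one option is to read the raw accessibility tree over a Chrome DevTools Protocol (CDP) session, where each node carries an optional `backendDOMNodeId`. Example:

    // Open a CDP session for the page and fetch the full accessibility tree.

    const client = await page.createCDPSession();

    const { nodes } = await client.send('Accessibility.getFullAXTree');

    console.log(nodes[0].backendDOMNodeId);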

----
url: https://webcrawlerapi.com/legal/subprocessor-list
----


Sub-Processor List

==================

_Last updated: 20 June 2025_

WebcrawlerAPI uses a limited number of trusted third-party service providers ("sub-processors") to support the delivery and operation of our services. These sub-processors may process certain personal data on our behalf, as defined in our [Data Processing Agreement (DPA)](/legal/Webcrawlerapi%20DPA.pdf).

This page is intended to provide transparency about the identities and locations of these third parties, as required under Article 28 of the General Data Protection Regulation (GDPR).

* * *

### General Information

*   Sub-processors are engaged to perform specific services such as cloud hosting, data storage, analytics, content delivery, and email delivery.

*   All sub-processors are bound by written agreements to comply with GDPR requirements.

*   Data transfers outside the EU/EEA are protected using Standard Contractual Clauses (SCCs) or participation in the EU-U.S. Data Privacy Framework.

We review and update this list regularly.

* * *

### Current Sub-Processors

| Sub-Processor | Data Location |
| --- | --- |
| Hetzner | EU |
| Scaleway | EU |
| Google Cloud Run | Global |
| Cloudflare Workers | Global |
| Cloudflare | Global |
| Brightdata | Global |
| PostHog (EU) | EU |
| Resend | US |

* * *

### Updates and Notifications

To receive updates about sub-processor changes, contact: **[\[email protected\]](/cdn-cgi/l/email-protection#cbb8bebbbba4b9bf8bbcaea9a8b9aabca7aeb9aabba2e5a8a4a6)**. We will provide at least 30 days' notice before authorizing any new sub-processor that may have access to customer personal data, except where such engagement is required on an emergency basis.

If you object to the use of a new sub-processor on reasonable grounds, please notify us within the 30-day period.

----
url: https://webcrawlerapi.com/changelog/2025-07-03-make-integration
----


July 3, 2025

Make Integration Available

==========================

WebcrawlerAPI is now available on Make (formerly Integromat).

You can now automate web scraping workflows with other apps through Make. The integration uses the Scrape API v2 endpoint to scrape single webpages.

For now, the Make app includes a single action: **Scrape a single webpage**. It supports all v2 features, including running prompts on scraped content to extract specific information or format the output. You can also specify the output format (markdown, cleaned text, or HTML) and use CSS selectors to remove unwanted elements.

To use WebcrawlerAPI in Make, search for "WebcrawlerAPI" in the app store.

[WebcrawlerAPI on Make](https://www.make.com/en/integrations/webcrawlerapi)

Read our documentation on how to [integrate WebcrawlerAPI in Make](/docs/sdk/make).

----
url: https://webcrawlerapi.com/blog/how-to-crawl-website-with-php
----


Tags: Tutorial, Technical. 15 min read

How to crawl website with PHP

=============================

Learn how to effectively crawl websites using PHP with frameworks like Goutte and Spatie/Crawler, or opt for the simplicity of WebCrawlerAPI.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Selecting a PHP Crawler Framework](#selecting-a-php-crawler-framework)

*   [Goutte](#goutte)

*   [Spatie/Crawler](#spatiecrawler)

*   [Choosing the Right Framework](#choosing-the-right-framework)

*   [Guide to Using Goutte](#guide-to-using-goutte)

*   [Setup](#setup)

*   [Crawling a Website](#crawling-a-website)

*   [Data Extraction](#data-extraction)

*   [How to Use WebCrawlerAPI](#how-to-use-webcrawlerapi)

*   [Why Choose WebCrawlerAPI?](#why-choose-webcrawlerapi)

*   [Comparing WebCrawlerAPI to Traditional PHP Frameworks](#comparing-webcrawlerapi-to-traditional-php-frameworks)

*   [Summary](#summary)

*   [FAQs](#faqs)

*   [Is PHP good for web scraping?](#is-php-good-for-web-scraping)


**Want to scrape website data using PHP?** Here’s a quick guide to get started. PHP offers two main paths for web crawling:

1.  **Frameworks like** [**Goutte**](https://github.com/FriendsOfPHP/Goutte) **and** [**Spatie/Crawler**](https://github.com/spatie/crawler):

    *   Goutte: Simple and great for static websites.

    *   Spatie/Crawler: Handles dynamic, JavaScript-heavy sites with advanced features.

2.  [**WebCrawlerAPI**](https://webcrawlerapi.com/):

    *   A cloud-based service for effortless, scalable web scraping without manual setup.

**Quick Comparison**:

| Feature | Goutte | Spatie/Crawler | WebCrawlerAPI |
| --- | --- | --- | --- |
| JavaScript Support | No | Yes | Yes |
| Scalability | Manual setup | Manual setup | Automatic (cloud) |
| Ease of Use | Simple | Moderate | Very simple |
| Best For | Static websites | Dynamic websites | Large-scale projects |

Whether you need full control through frameworks or a hassle-free API solution, PHP has you covered. Let’s dive into the details!

Selecting a PHP Crawler Framework

---------------------------------

When working on web crawling projects with PHP, **Goutte** and **Spatie/Crawler** are two popular options. Each has its own strengths, making them suitable for different types of tasks.

### [Goutte](https://github.com/FriendsOfPHP/Goutte)

Goutte is built on [Symfony](https://symfony.com/) components and is well-suited for large-scale crawling. Its object-oriented design and efficient handling of HTML/XML make it a great choice for straightforward data extraction. Beginners often find its intuitive DOM crawler easy to use [\[1\]](https://dzone.com/articles/8-awesome-php-web-scraping-libraries-and-tools).

Here’s a quick example of how Goutte works:

    require_once 'vendor/autoload.php';

    use Goutte\Client;

    $client = new Client();

    $crawler = $client->request('GET', 'https://www.example.com');

    $crawler->filter('h1')->each(function ($node) {

        echo $node->text();

    });

### [Spatie/Crawler](https://github.com/spatie/crawler)

Spatie/Crawler is better equipped for modern web applications, especially those with dynamic, client-side rendering. Pairing it with tools like [Puppeteer](https://github.com/puppeteer/puppeteer) allows it to handle JavaScript-heavy websites effectively [\[3\]](https://github.com/spekulatius/awesome-php-scrapers-and-crawlers).

Some of its standout features include:

*   Asynchronous crawling

*   Compliance with robots.txt

*   The ability to index PDFs [\[3\]](https://github.com/spekulatius/awesome-php-scrapers-and-crawlers)

| Feature | Goutte | Spatie/Crawler |
| --- | --- | --- |
| JavaScript Support | No | Yes |
| Memory Efficiency | Moderate | High |
| Ease of Use | Simple | Moderate |
| Best For | Static websites | Dynamic websites |
| Error Handling | Basic | Advanced |

### Choosing the Right Framework

The best framework depends on your project's needs. If you’re working with static websites and simple structures, Goutte is a solid option. On the other hand, for dynamic sites or projects requiring features like asynchronous crawling, Spatie/Crawler is the better fit [\[1\]](https://dzone.com/articles/8-awesome-php-web-scraping-libraries-and-tools)[\[3\]](https://github.com/spekulatius/awesome-php-scrapers-and-crawlers).

Next, we’ll dive into how to set up and use Goutte for website crawling.

Guide to Using Goutte

---------------------

### Setup

Start by installing Goutte through [Composer](https://getcomposer.org/):

    composer require fabpot/goutte

Next, include the library in your project:

    require_once 'vendor/autoload.php';

    use Goutte\Client;

### Crawling a Website

Here's how you can crawl a single webpage:

    $client = new Client();

    $crawler = $client->request('GET', 'https://books.toscrape.com');

    // Check if the response is successful

    if ($client->getResponse()->getStatusCode() === 200) {

        $pageTitle = $crawler->filter('title')->text();

        echo "Page Title: " . $pageTitle;

    }

To handle multiple pages, you can use the following approach:

    $baseUrl = 'https://books.toscrape.com/catalogue/page-';

    $page = 1;

    try {

        do {

            $crawler = $client->request('GET', $baseUrl . $page . '.html');

            // Process the data here

            $page++;

        } while ($crawler->filter('.next > a')->count() > 0);

    } catch (Exception $e) {

        echo "Error: " . $e->getMessage();

    }

### Data Extraction

To extract links, prices, or table data, use these examples:

    // Extract links and their text

    $links = $crawler->filter('a')->each(function ($node) {

        return [

            'text' => $node->text(),

            'href' => $node->attr('href')

        ];

    });

    // Extract prices

    $prices = $crawler->filter('.price_color')->each(function ($node) {

        return $node->text();

    });

    // Extract table data

    $tableData = $crawler->filter('table tr')->each(function ($row) {

        return $row->filter('td')->each(function ($cell) {

            return $cell->text();

        });

    });

Below are some common CSS selectors and their purposes:

| Selector | Purpose |
| --- | --- |
| .class-name | Selects elements by class name |
| a | Retrieves all hyperlinks |
| table tr td | Extracts content from table cells |
| #id-name | Targets elements with a specific ID |
| .parent .child | Finds nested elements within a parent container |

With Goutte, you can efficiently scrape and extract data from websites. But if you're looking for another option, you might want to explore WebCrawlerAPI.


Alternative: [WebCrawlerAPI](https://webcrawlerapi.com/)

--------------------------------------------------------

If you're looking for a simpler, managed option for web crawling, **WebCrawlerAPI** might be the solution. Unlike tools like Goutte or Spatie/Crawler that require custom setups, WebCrawlerAPI is a cloud-based service designed to let developers focus on their core tasks without worrying about infrastructure.

### How to Use WebCrawlerAPI

Here’s a quick example of how you can use WebCrawlerAPI in your PHP project:

Installation

    composer require webcrawlerapi/sdk

Usage

    use WebCrawlerAPI\WebCrawlerAPI;

    // Initialize the client

    $crawler = new WebCrawlerAPI('your_api_key');

    // Synchronous crawling (blocks until completion)

    $job = $crawler->crawl(

        url: 'https://example.com',

        scrapeType: 'markdown',

        itemsLimit: 10

    );

### Why Choose WebCrawlerAPI?

WebCrawlerAPI brings several advantages to the table for developers:

*   **Simple Integration**: You can get started with just a few lines of code.

*   **Cloud Scalability**: Handles large-scale crawling without manual effort.

*   **Anti-Bot Features**: Includes tools like CAPTCHA bypassing.

*   **Multiple Output Formats**: Supports Markdown, Text and raw HTML.

*   **Pay-As-You-Go Pricing**: You only pay for the resources you use.

### Comparing WebCrawlerAPI to Traditional PHP Frameworks

| Feature | WebCrawlerAPI | Traditional PHP Frameworks |
| --- | --- | --- |
| Scalability | Automatically scales in the cloud | Requires manual configuration |
| Data Formats | HTML, Text, Markdown | Limited to framework capabilities |
| Maintenance | Fully managed service | Requires ongoing self-maintenance |

This makes WebCrawlerAPI a great choice for developers who want a hassle-free, scalable solution for web crawling.

Summary

-------

Web crawling with PHP can be done using either traditional frameworks or modern API-based solutions. Tools like **Goutte** and **Spatie/Crawler** provide detailed control over the crawling process, while services like **WebCrawlerAPI** offer a managed, scalable option with minimal configuration.

Using PHP frameworks gives you more control over how the crawling works but requires manual setup and ongoing maintenance. On the other hand, **WebCrawlerAPI** handles challenges like anti-bot measures and scalability for you, making it a great choice for larger or more resource-intensive projects.

Frameworks like Goutte and Spatie/Crawler work well for smaller projects or when custom crawling behavior is essential [\[1\]](https://dzone.com/articles/8-awesome-php-web-scraping-libraries-and-tools)[\[3\]](https://github.com/spekulatius/awesome-php-scrapers-and-crawlers). If you're looking for a simpler solution that minimizes effort, **WebCrawlerAPI** takes care of the heavy lifting.

The right choice depends on your project's needs, including its size, technical complexity, and how much control you want over the crawling process.

Next, let’s dive into some common questions about using PHP for web crawling.

FAQs

----

### Is PHP good for web scraping?

PHP is a practical choice for web scraping, thanks to its array of libraries and tools that simplify the process of extracting and processing web data [\[1\]](https://dzone.com/articles/8-awesome-php-web-scraping-libraries-and-tools)[\[2\]](https://dev.to/oxylabs-io/web-scraping-with-php-ultimate-tutorial-35n).

With features like HTML parsing, HTTP request handling, and effective error management, PHP makes web scraping tasks smoother. Its design, tailored for web development, also aids in handling sessions and text encoding, making it well-suited for these projects [\[1\]](https://dzone.com/articles/8-awesome-php-web-scraping-libraries-and-tools)[\[3\]](https://github.com/spekulatius/awesome-php-scrapers-and-crawlers).

It's important to follow ethical scraping practices, such as respecting robots.txt guidelines and including request delays. Whether using frameworks or APIs, PHP offers a dependable solution for web scraping tasks [\[4\]](https://reintech.io/blog/building-web-scrapers-php-goutte-library).

----
url: https://webcrawlerapi.com/changelog/2025-02-18-pdf-content-rendering
----


February 18, 2025

PDF Content Rendering Implementation

====================================

PDF content rendering has been implemented. Text content can now be extracted from PDF files. When a website contains a PDF file, its content will be extracted and returned in the response as page content.

----
url: https://webcrawlerapi.com/changelog/2025-03-03-status-page-link
----


March 3, 2025

Status Page Link Added

======================

A status page link has been added to the website footer. The current status of WebCrawlerAPI services can now be checked at [status.webcrawlerapi.com](https://status.webcrawlerapi.com/status/main).

----
url: https://webcrawlerapi.com/glossary/webcrawling/what-is-web-crawling
----


### Answer

Web crawling is the automated process of discovering and fetching web pages by following links so you can build an index or dataset. A crawler usually starts from a seed list of URLs and expands as it finds new links. It tracks which pages it has visited to avoid loops and duplicates. Many crawlers also revisit pages on a schedule to detect updates. The output is typically a list of URLs plus page content and metadata. This data can power search, monitoring, or analytics workflows.
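The loop described above (seed list, link expansion, visited tracking) can be sketched in a few lines. To keep the example self-contained, an in-memory link graph stands in for real HTTP fetching and HTML parsing; `crawl` and `LINK_GRAPH` are illustrative names, not part of any library.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
LINK_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],  # links back to "a", which would loop without tracking
    "c": ["d"],
    "d": [],
}


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl that visits each URL at most once."""
    visited = set()
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # skip duplicates and loops
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


if __name__ == "__main__":
    print(crawl(["a"], lambda url: LINK_GRAPH.get(url, [])))
```

In a real crawler, `get_links` would fetch the page and extract its links, and `max_pages` plays the role of a crawl budget.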

----
url: https://webcrawlerapi.com/tools/html-main-content-readability
----


HTML Main Content Readability

=============================

### Extract clean article content using Mozilla Readability.js


----
url: https://webcrawlerapi.com/docs/feeds
----

Any website to feed

===================


Monitor websites for changes and get updates via RSS, JSON, or webhooks

[What is a feed?](#what-is-a-feed)

----------------------------------

A feed monitors a website for changes and delivers updates automatically. The system crawls your target website periodically and notifies you when content changes.

[Creating a feed](#creating-a-feed)

-----------------------------------

Create a feed by sending a POST request to `/v1/feeds`:

    curl --request POST \

      --url https://api.webcrawlerapi.com/v1/feeds \

      --header 'Authorization: Bearer <YOUR API KEY>' \

      --header 'Content-Type: application/json' \

      --data '{

        "url": "https://example.com",

        "name": "Example Blog",

        "scrape_type": "markdown",

        "max_depth": 1,

        "webhook_url": "https://webhook.site/287700f1-c94e-4ccb-839f-a7dc4b0992b1"

      }'

**Response:**

    {

      "id": "550e8400-e29b-41d4-a716-446655440000",

      "url": "https://example.com",

      "status": "active",

      "next_run_at": "2024-01-15T14:30:00Z",

      "webhook_url": "https://webhook.site/287700f1-c94e-4ccb-839f-a7dc4b0992b1"

    }

### [Parameters](#parameters)

These are similar to the parameters for a [Job](/docs/job):

*   `url` (required) - The website URL to monitor

*   `name` (optional) - A friendly name for your feed

*   `scrape_type` (optional) - Output format: `markdown`, `cleaned`, or `html` (default: `markdown`)

*   `max_depth` (optional) - Maximum crawl depth (0-10)

*   `items_limit` (optional) - Maximum pages to crawl (default: 10)

*   `webhook_url` (optional) - URL to receive change notifications

*   `whitelist_regexp` (optional) - Regex pattern to include URLs

*   `blacklist_regexp` (optional) - Regex pattern to exclude URLs

*   `respect_robots_txt` (optional) - Follow robots.txt rules

*   `main_content_only` (optional) - Extract main content only

See the [API Reference](/docs/api/feed/feed-create) for complete parameter details.
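
As a rough illustration, the create-feed request could be assembled in Python using only the standard library. This is a sketch rather than an official client: the helper name `build_create_feed_request` is hypothetical, and only the endpoint, headers, and parameter names are taken from this page.

```python
import json
import urllib.request

API_BASE = "https://api.webcrawlerapi.com"  # base URL taken from the curl example


def build_create_feed_request(api_key, url, **options):
    """Build (but do not send) a POST /v1/feeds request.

    `options` may carry the optional parameters listed above,
    e.g. name, scrape_type, max_depth, items_limit, webhook_url.
    """
    payload = {"url": url, **options}
    return urllib.request.Request(
        f"{API_BASE}/v1/feeds",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_feed_request(
    "<YOUR API KEY>",
    "https://example.com",
    name="Example Blog",
    scrape_type="markdown",
    max_depth=1,
)
# Sending it takes one more call: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch easy to inspect and test without network access.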

[Getting feed updates](#getting-feed-updates)

---------------------------------------------

There are three ways to receive feed updates:

### [1\. RSS/Atom Feed](#1-rssatom-feed)

Get updates in standard Atom 1.0 format for use with feed readers:

    curl --request GET \

      --url https://api.webcrawlerapi.com/v1/feeds/{feed_id}/rss \

      --header 'Authorization: Bearer <YOUR API KEY>'

Subscribe to this URL in any RSS/Atom reader.

See the [API Reference](/docs/api/feed/feed-rss) for details.

### [2\. JSON Feed](#2-json-feed)

Get updates in JSON Feed format:

    curl --request GET \

      --url https://api.webcrawlerapi.com/v1/feeds/{feed_id}/json \

      --header 'Authorization: Bearer <YOUR API KEY>'

**Response:**

    {

      "version": "https://jsonfeed.org/version/1",

      "title": "Example Blog",

      "home_page_url": "https://example.com",

      "feed_url": "https://api.webcrawlerapi.com/v1/feeds/{feed_id}/json",

      "items": [

        {

          "id": "https://example.com/article-1",

          "url": "https://example.com/article-1",

          "title": "Article Title",

          "content_text": "Article content...",

          "date_published": "2024-01-15T14:30:00Z"

        }

      ]

    }

See the [API Reference](/docs/api/feed/feed-json) for details.
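
To show how the JSON Feed response might be consumed, here is a small sketch that keeps only items published after a given timestamp. The function names are hypothetical, and the sample data mirrors the response shown above; it assumes `date_published` is always an RFC 3339 timestamp ending in `Z`, as in the example.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse a timestamp such as 2024-01-15T14:30:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def items_published_after(feed, cutoff):
    """Return feed items whose date_published is later than `cutoff`."""
    threshold = parse_timestamp(cutoff)
    return [
        item
        for item in feed.get("items", [])
        if "date_published" in item
        and parse_timestamp(item["date_published"]) > threshold
    ]


feed = {
    "title": "Example Blog",
    "items": [
        {
            "id": "https://example.com/article-1",
            "url": "https://example.com/article-1",
            "title": "Article Title",
            "date_published": "2024-01-15T14:30:00Z",
        }
    ],
}

for item in items_published_after(feed, "2024-01-01T00:00:00Z"):
    print(item["title"], item["url"])
```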

### [3\. Webhooks](#3-webhooks)

Receive a POST request when changes are detected. Add `webhook_url` when creating your feed:

    {

      "url": "https://example.com",

      "webhook_url": "https://yourserver.com/webhook"

    }

When changes are detected, you'll receive a POST request with the feed run details including change information. Each changed/new page will include a `content_url` field pointing to the page content in the format specified by your feed's `scrape_type` setting (markdown, cleaned, or html).
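
The full webhook payload layout is not shown on this page, so the following sketch makes no assumption about it beyond the documented `content_url` field: it walks whatever JSON arrives and collects every `content_url` it finds. The function name and the sample payload (including `feed_id` and `pages`) are hypothetical.

```python
def collect_content_urls(payload):
    """Recursively gather every `content_url` value found in a webhook payload."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key == "content_url" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_content_urls(value))
    elif isinstance(payload, list):
        for element in payload:
            found.extend(collect_content_urls(element))
    return found


# Hypothetical payload: only the `content_url` field is documented on this page.
example_payload = {
    "feed_id": "550e8400-e29b-41d4-a716-446655440000",
    "pages": [
        {"url": "https://example.com/a", "content_url": "https://data.example/a.md"},
        {"url": "https://example.com/b", "content_url": "https://data.example/b.md"},
    ],
}

print(collect_content_urls(example_payload))
```

Once the real payload structure is known, a direct lookup would be simpler than the recursive search.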

[Feed status](#feed-status)

---------------------------

Get information about a feed and its recent runs:

    curl --request GET \

      --url https://api.webcrawlerapi.com/v1/feeds/{feed_id} \

      --header 'Authorization: Bearer <YOUR API KEY>'

**Response:**

    {

      "id": "550e8400-e29b-41d4-a716-446655440000",

      "url": "https://example.com",

      "name": "Example Blog",

      "scrape_type": "markdown",

      "items_limit": 10,

      "status": "active",

      "next_run_at": "2024-01-16T14:30:00Z",

      "last_run_at": "2024-01-15T14:30:00Z",

      "created_at": "2024-01-01T10:00:00Z",

      "recent_runs": [

        {

          "id": "run-123",

          "status": "completed",

          "pages_crawled": 10,

          "pages_changed": 2,

          "pages_new": 1,

          "pages_unavailable": 0,

          "pages_errors": 0,

          "cost_usd": 0.002,

          "started_at": "2024-01-15T14:30:00Z",

          "finished_at": "2024-01-15T14:32:00Z"

        }

      ]

    }

### [Response fields](#response-fields)

*   `status` - Feed status: `active`, `paused`, or `canceled`

*   `next_run_at` - When the next crawl will run

*   `last_run_at` - When the last crawl completed

*   `recent_runs` - Array of recent feed runs

    *   `pages_crawled` - Total pages processed

    *   `pages_changed` - Pages with content changes

    *   `pages_new` - Newly discovered pages

    *   `pages_unavailable` - Pages that returned 404 or similar

    *   `pages_errors` - Pages that failed to load

    *   `cost_usd` - Cost in USD for this run

See the [API Reference](/docs/api/feed/feed-get) for all available fields.
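
One possible use of these fields is a quick summary of recent activity. The sketch below totals the page counters and cost across `recent_runs`; the helper name `summarize_runs` is hypothetical, and the sample data is a trimmed copy of the response above.

```python
def summarize_runs(feed_status):
    """Total up page counts and cost across the `recent_runs` array."""
    totals = {"pages_crawled": 0, "pages_changed": 0, "pages_new": 0, "cost_usd": 0.0}
    for run in feed_status.get("recent_runs", []):
        for field in totals:
            totals[field] += run.get(field, 0)
    return totals


status = {
    "status": "active",
    "recent_runs": [
        {"id": "run-123", "status": "completed", "pages_crawled": 10,
         "pages_changed": 2, "pages_new": 1, "cost_usd": 0.002},
    ],
}

summary = summarize_runs(status)
print(f"{summary['pages_changed']} changed, {summary['pages_new']} new, "
      f"${summary['cost_usd']:.3f} spent")
```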

[Error handling](#error-handling)

---------------------------------

Feeds are automatically paused after 3 consecutive errors. This prevents unnecessary charges when a website becomes unreachable. You can resume the feed manually once the issue is resolved.

[Managing feeds](#managing-feeds)

---------------------------------

*   **List all feeds**: `GET /v1/feeds` - [API Reference](/docs/api/feed/feed-list)

*   **Pause a feed**: `POST /v1/feeds/{id}/pause` - [API Reference](/docs/api/feed/feed-manage)

*   **Resume a feed**: `POST /v1/feeds/{id}/resume` - [API Reference](/docs/api/feed/feed-manage)

*   **Delete a feed**: `DELETE /v1/feeds/{id}` - [API Reference](/docs/api/feed/feed-manage)

*   **Force run**: `POST /v1/feeds/{id}/run` - [API Reference](/docs/api/feed/feed-manage)
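
The management endpoints above share one URL pattern, so a single helper can cover them. This is a sketch under that assumption; `FEED_ACTIONS` and `build_feed_action_request` are hypothetical names, while the methods and paths come from the list above.

```python
import urllib.request

API_BASE = "https://api.webcrawlerapi.com"

# Action name -> (HTTP method, path suffix), as listed above.
FEED_ACTIONS = {
    "pause": ("POST", "/pause"),
    "resume": ("POST", "/resume"),
    "run": ("POST", "/run"),
    "delete": ("DELETE", ""),
}


def build_feed_action_request(api_key, feed_id, action):
    """Build (but do not send) a management request for one feed."""
    method, suffix = FEED_ACTIONS[action]
    return urllib.request.Request(
        f"{API_BASE}/v1/feeds/{feed_id}{suffix}",
        headers={"Authorization": f"Bearer {api_key}"},
        method=method,
    )


# Example: resume a feed that was paused after repeated errors.
feed_id = "550e8400-e29b-41d4-a716-446655440000"
resume = build_feed_action_request("<YOUR API KEY>", feed_id, "resume")
print(resume.get_method(), resume.full_url)
```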


----
url: https://webcrawlerapi.com/glossary/puppeteer/what-is-the-reason-for-removing-the-test-server-from-release-please
----


The test server was removed from the release-please workflow to simplify the release process and remove an unnecessary external dependency. This change eliminates the need to run a separate test server during release automation, leading to a cleaner and more reliable release flow.

----
url: https://webcrawlerapi.com/scrapers/webcrawler/crawler/description
----

API

POST /scrape

============


Endpoint to scrape a single webpage.

    https://api.webcrawlerapi.com/v2/scrape

Format: JSON  

Method: POST

[Request](#request)

-------------------

Available request params:

*   `url` - (required) The URL of the webpage to scrape.

*   `prompt` - (optional) A prompt to run on the scraped content. This can be used to extract specific information or to format the output (costs an extra $0.002 per prompt).

*   `output_format` - (optional) The format of the output. Can be `markdown`, `cleaned` or `html`. Default is `markdown`.

*   `main_content_only` - (optional) Extract only the main content of an article or blog post. When set to `true`, the scraper focuses on the primary article content and filters out navigation, sidebars, ads, and other non-essential elements. Default is `false`.

*   `clean_selectors` - (optional) CSS selectors to clean from the output. Read more about advanced cleaning in [clean selectors](/docs/guides/cleaning).

*   `respect_robots_txt` - (optional) If set to `true`, the scraper respects the website's robots.txt file and returns an error if the URL is disallowed. Default is `false`.

Example:

    {

        "url": "https://www.example.com",

        "output_format": "markdown",

        "main_content_only": true,

        "clean_selectors": ".advertisement,.footer",

        "respect_robots_txt": true

    }

### [curl example](#curl-example)

    curl --request POST \

      --url https://api.webcrawlerapi.com/v2/scrape \

      --header 'Authorization: Bearer <YOUR_API_KEY>' \

      --header 'Content-Type: application/json' \

      --data '{

        "url": "https://www.example.com",

        "output_format": "markdown",

        "main_content_only": true,

        "clean_selectors": ".advertisement,.footer",

        "respect_robots_txt": true

      }'

[Response](#response)

---------------------

The response will contain a status and the output in the requested format.

    {

        "status": "done",

        "markdown": "## Example Product\n\nThis is an example product page. It has a title, a price, and a description.",

        "page_status_code": 200,

        "page_title": "Example Product"

    }

### [Scrape errors](#scrape-errors)

If the scrape fails, the response still has a 200 status code, but `success` is `false` and `error_code` and `error_message` are set.

For example:

    {

        "success": false,

        "error_code": "name_not_resolved",

        "error_message": "Unable to resolve domain name"

    }

Read more about error codes in the [Errors](/docs/errors) section.
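
Because a failed scrape arrives with a 200 status code, client code has to inspect the body. The sketch below shows one way to do that. `ScrapeError` and `parse_scrape_response` are hypothetical names, and it assumes the content key matches the requested `output_format` (the page only shows this for `markdown`).

```python
import json


class ScrapeError(Exception):
    """Raised when the API reports a failed scrape in the response body."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def parse_scrape_response(body, output_format="markdown"):
    """Return the scraped content, or raise ScrapeError on a failed scrape."""
    data = json.loads(body)
    # A failed scrape is signalled in the body, not by the HTTP status code.
    if data.get("success") is False:
        raise ScrapeError(data.get("error_code"), data.get("error_message"))
    return data.get(output_format)


ok = '{"status": "done", "markdown": "## Example Product", "page_status_code": 200}'
failed = ('{"success": false, "error_code": "name_not_resolved", '
          '"error_message": "Unable to resolve domain name"}')

print(parse_scrape_response(ok))
try:
    parse_scrape_response(failed)
except ScrapeError as error:
    print("scrape failed:", error.code)
```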

[Error Responses](#error-responses)

-----------------------------------

*   `400 Bad Request` - Invalid parameters or missing required fields

*   `401 Unauthorized` - Invalid or missing API key

*   `402 Payment Required` - Insufficient account balance

*   `500 Internal Server Error` - Server-side error

Refer to [Async Requests](/docs/async-requests) for more information about handling asynchronous scraping jobs.


----
url: https://webcrawlerapi.com/docs/getting-started
----

Getting Started

===============


A guide to getting started with Webcrawler API

Webcrawler API helps you to extract data from websites. It is a powerful tool that can be used to extract data from websites that do not provide an API. Read more about it here: [Webcrawler API](https://webcrawlerapi.com/blog/what-is-a-web-crawling/)

[Prerequisites](#prerequisites)

-------------------------------

To use Webcrawler API, you first need to obtain an API key:

1.  Register on [Webcrawler API Dashboard](https://dash.webcrawlerapi.com/)

2.  Navigate to the [API key section](https://dash.webcrawlerapi.com/access)

3.  Copy your API key

[Request](#request)

-------------------

To start using WebcrawlerAPI, make an HTTP POST request to the API endpoint:

    https://api.webcrawlerapi.com/v1/crawl

with a JSON body that contains the request parameters.

**Note:** You must use the [API key](/docs/access-key) to authenticate requests to the API.

### [First request](#first-request)

To make your first request you can use the following curl command:

    curl --request POST \

      --url https://api.webcrawlerapi.com/v1/crawl \

      --header 'Authorization: Bearer <PASTE YOUR API KEY HERE>' \

      --data '{

    	"items_limit": 5,

    	"url": "https://stripe.com/",

    	"scrape_type": "markdown"

    }'

This command starts a new crawl [Job](/docs/job) that extracts data from the Stripe website. The `items_limit` parameter specifies how many pages to crawl. The `scrape_type` parameter specifies that you want `markdown` formatted data (read more about [Crawling Types](/docs/crawling-types)).

Result (the `id` value is your `<CRAWL_JOB_ID>`):

    {

        "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b"

    }

Crawl requests are asynchronous: the response contains a job ID, which you can use to check the status of the crawl job (read more about [Async Requests](/docs/async-requests)).
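
For readers who prefer Python to curl, the first request might look like the sketch below. The helper names are hypothetical. The `Content-Type` header is an assumption borrowed from other examples in these docs; the curl command above does not set it.

```python
import json
import urllib.request


def build_crawl_request(api_key, url, items_limit=5, scrape_type="markdown"):
    """Build (but do not send) the POST /v1/crawl request from the curl example."""
    body = {"url": url, "items_limit": items_limit, "scrape_type": scrape_type}
    return urllib.request.Request(
        "https://api.webcrawlerapi.com/v1/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed, see note above
        },
        method="POST",
    )


def extract_job_id(response_body):
    """Pull the job id out of the JSON response."""
    return json.loads(response_body)["id"]


request = build_crawl_request("<PASTE YOUR API KEY HERE>", "https://stripe.com/")
job_id = extract_job_id('{"id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b"}')
print(request.full_url, job_id)
```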

### [Get crawling result](#get-crawling-result)

To get the crawling result you can use the following curl command:

    curl --request GET \

      --url https://api.webcrawlerapi.com/v1/job/<CRAWL_JOB_ID> \

      --header 'Authorization: Bearer <PASTE YOUR API KEY HERE>'

Result:

    {

    	"id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b",

    	"url": "https://stripe.com/",

    	...

    	"status": "done",

    	"job_items": [

    		{

    			"id": "be0c2ae2-8545-4c4a-8728-5dd122878098",

    			"job_id": "be0c2ae2-8545-4c4a-8728-5dd122878098",

    			"original_url": "https://stripe.com",

    			"page_status_code": 200,

    			"raw_content_url": "https://data.webcrawlerapi.com/raw/clrgcx48g0001ozloz9ficivc/be0c2ae2-8545-4c4a-8728-5dd122878098/https:__stripe_com",

    			"clean_content_url": "https://data.webcrawlerapi.com/clean/clrgcx48g0001ozloz9ficivc/be0c2ae2-8545-4c4a-8728-5dd122878098/https:__stripe_com",

    			...

    		}

    	],

    	...

    }
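
Since the job runs asynchronously, a client typically polls this endpoint until the job finishes. The sketch below shows one way to structure that loop. `wait_for_job` and `fetch_job` are hypothetical names, the interval and attempt limit are arbitrary, and the terminal statuses `done` and `error` are the ones listed in the Job documentation.

```python
import time


def wait_for_job(fetch_job, job_id, interval_seconds=2.0, max_attempts=30):
    """Poll until the job status is `done` or `error`, then return the job.

    `fetch_job` is any callable that takes a job id and returns the parsed
    JSON of GET /v1/job/<CRAWL_JOB_ID>.
    """
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") in ("done", "error"):
            return job
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} checks")


# Stand-in for the real HTTP call so the sketch runs offline.
responses = iter([{"status": "in_progress"}, {"status": "done", "job_items": []}])
job = wait_for_job(lambda _job_id: next(responses), "example-job-id", interval_seconds=0)
print(job["status"])
```

Passing the fetch function in as a parameter keeps the polling logic independent of the HTTP library. A webhook avoids polling altogether.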


----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-random-browser-crashes-in-ci
----


Random exits are commonly resource-related: low memory, too many workers, or unstable shared CI agents.

Reduce concurrency, enable retries/traces, and isolate heavy suites.

    // playwright.config.ts

    import { defineConfig } from '@playwright/test';

    export default defineConfig({

      retries: 2,

      workers: process.env.CI ? 2 : undefined,

      use: { trace: 'on-first-retry' },

    });

When crashes persist, compare pass/fail runs with traces and browser stderr logs.

----
url: https://webcrawlerapi.com/glossary?category=puppeteer
----


Glossary

Web Scraping & API Glossary

===========================

Comprehensive glossary of web scraping, crawling, and API terms. Learn the essential concepts and terminology used in web data extraction.


----
url: https://webcrawlerapi.com/glossary/webcrawling/how-to-crawl-javascript-heavy-sites
----


### Answer

To crawl JavaScript‑heavy sites, use a headless browser to render pages before extracting content. Wait for critical elements to appear or for network activity to settle. You can also intercept API calls and collect JSON responses directly. Rendering is more resource‑intensive, so keep concurrency low. Cache rendered results when possible to reduce repeated work. This approach captures the same content a user sees in the browser.
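
Two of the points above, low concurrency and caching of rendered results, can be sketched independently of any particular browser. In the sketch below the `render` callable stands in for a headless-browser call, and `CachedRenderer` is a hypothetical name; nothing here is specific to a real rendering library.

```python
from concurrent.futures import ThreadPoolExecutor


class CachedRenderer:
    """Render pages with a small worker pool and remember the results."""

    def __init__(self, render, max_workers=2):
        self._render = render            # callable: url -> rendered HTML
        self._max_workers = max_workers  # kept low because rendering is expensive
        self._cache = {}

    def fetch_all(self, urls):
        """Return {url: html}, rendering only URLs not seen before."""
        unique = list(dict.fromkeys(urls))  # de-duplicate, keep order
        missing = [url for url in unique if url not in self._cache]
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            for url, html in zip(missing, pool.map(self._render, missing)):
                self._cache[url] = html
        return {url: self._cache[url] for url in unique}


# Placeholder renderer; a real one would drive a headless browser.
def fake_render(url):
    return f"<html><body>rendered {url}</body></html>"


renderer = CachedRenderer(fake_render)
pages = renderer.fetch_all(["https://example.com/", "https://example.com/about"])
print(sorted(pages))
```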

----
url: https://webcrawlerapi.com/docs/job
----

What is a Crawling Job?

=======================

A job is a task that you run on the Webcrawler API. Jobs are asynchronous: you get a notification when a job is done ([read more about async requests](/docs/async-requests)).

### [Job request parameters](#job-request-parameters)

*   `url` - (required) the seed URL where the crawler starts. Can be any valid URL.

*   `scrape_type` - (default: `html`) the [type](/docs/crawling-types) of scraping you want to perform. Can be `html`, `cleaned`, or `markdown`.

*   `items_limit` - (default: `10`) the crawler stops when it reaches this limit of pages for the job.

*   `webhook_url` - (optional) the URL where the server will send a POST request once the task is completed ([read more](/docs/async-requests) about webhooks and async requests).

*   `whitelist_regexp` - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.

*   `blacklist_regexp` - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.

*   `max_depth` - (optional) maximum depth of crawling from the starting URL. A value of `0` means only the starting page, `1` means the starting page plus pages directly linked from it, `2` adds one more level of depth, and so on. By default, there is no depth limit.

Example:

    {

        "url": "https://stripe.com/",

        "webhook_url": "https://yourserver.com/webhook",

        "items_limit": 10,

        "scrape_type": "markdown",

        "max_depth": 2

    }

The same request using curl:

    curl --request POST \

      --url https://api.webcrawlerapi.com/v1/crawl \

      --header 'Authorization: Bearer <PASTE YOUR API KEY HERE>' \

      --data '{

        "url": "https://stripe.com/",

        "webhook_url": "https://yourserver.com/webhook",

        "items_limit": 10,

        "scrape_type": "markdown",

        "max_depth": 2

     }'
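
To make the meaning of `max_depth`, `whitelist_regexp`, and `blacklist_regexp` concrete, here is a local illustration of the rules as described above. It is not the service's implementation: the function name is hypothetical, and both the use of `re.search` and the order in which the two patterns are checked are assumptions.

```python
import re


def should_crawl(url, depth, max_depth=None, whitelist_regexp=None, blacklist_regexp=None):
    """Decide whether a discovered URL would be crawled under the given settings.

    `depth` is 0 for the seed URL, 1 for pages linked from it, and so on.
    """
    if max_depth is not None and depth > max_depth:
        return False  # deeper than allowed from the seed URL
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False  # matches the exclusion pattern
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False  # a whitelist is set and the URL does not match it
    return True


category_page = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"
print(should_crawl(category_page, depth=1, max_depth=2, whitelist_regexp=".*category.*"))
print(should_crawl(category_page, depth=3, max_depth=2))
```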

### [Job response](#job-response)

*   `id` - the unique identifier of the job.

*   `org_id` - your organization identifier.

*   `url` - the seed URL where the crawler started.

*   `status` - the status of the job. Can be `new`, `in_progress`, `done`, `error`.

*   `scrape_type` - the type of scraping you want to perform (`html`, `cleaned` or `markdown`).

*   `whitelist_regexp` - a regular expression to whitelist URLs.

*   `blacklist_regexp` - a regular expression to blacklist URLs.

*   `items_limit` - the limit of pages for this job.

*   `max_depth` - maximum depth of crawling from the starting URL (if specified in the request).

*   `created_at` - the date when the job was created.

*   `finished_at` - the date when the job was finished.

*   `webhook_url` - the URL where the server will send a POST request once the task is completed.

*   `webhook_status` - the status of the webhook request.

*   `webhook_error` - the error message if the webhook request failed.

*   `job_items` - an array of items that were extracted from the pages.

    Job Item:

    *   `id` - the unique identifier of the item.

    *   `status` - the status of the item. Can be `new`, `in_progress`, `done`, `error`.

    *   `job_id` - the job identifier.

    *   `original_url` - the URL of the page.

    *   `page_status_code` - the status code of the page request.

    *   `raw_content_url` - the URL to the raw content of the page.

    *   `cleaned_content_url` - the URL to the cleaned content of the page (if `scrape_type` is `cleaned`. Check [Crawling Types](https://webcrawlerapi.com/docs/crawling-types#cleaned-scraping)).

    *   `markdown_content_url` - the URL to the markdown content of the page (if `scrape_type` is `markdown`. Check [Crawling Types](https://webcrawlerapi.com/docs/crawling-types#cleaned-scraping)).

    *   `title` - the title of the page (`<title>` tag content).

    *   `created_at` - the date when the item was created.

    *   `cost` - the cost of the item in $.

    *   `referred_url` - the URL where the page was referred from.

    *   `last_error` - the last error message if the item failed.

Example:

    {

    	"id": "abb39f29-087e-4714-aa05-15537be12f90",

    	"org_id": "cm48ww9kw00019rv7bsyfko1d",

    	"url": "https://books.toscrape.com/",

    	"status": "done",

    	"scrape_type": "markdown",

    	"whitelist_regexp": ".*category.*",

    	"blacklist_regexp": "",

    	"items_limit": 10,

    	"max_depth": 2,

    	"created_at": "2024-12-15T10:26:13.893Z",

    	"finished_at": "2024-12-15T10:26:37.118Z",

    	"updated_at": "2024-12-15T10:26:37.118Z",

    	"webhook_url": "",

    	"job_items": [

    		{

    			"id": "a46f3117-f97a-4ca2-a434-6cfdcd022b72",

    			"job_id": "abb39f29-087e-4714-aa05-15537be12f90",

    			"original_url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",

    			"page_status_code": 200,

    			"markdown_content_url": "https://data.webcrawlerapi.com/markdown/books.toscrape.com/https___books_toscrape_com_catalogue_category_books_travel_2_index_html",

    			"status": "done",

    			"title": "All products | Books to Scrape - Sandbox",

    			"last_error": "",

    			"created_at": "2024-12-15T10:26:17.941Z",

    			"updated_at": "2024-12-15T10:26:23.915Z",

    			"cost": 2000,

    			"referred_url": "https://books.toscrape.com/"

    		}

        ]

    }
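
Once a job is done, the content URL to read depends on the job's `scrape_type`. The sketch below picks the matching field for each finished item. The helper name is hypothetical, and the mapping of `html` to `raw_content_url` is an assumption based on the field descriptions above.

```python
# scrape_type -> job item field that holds the content URL
CONTENT_URL_FIELDS = {
    "html": "raw_content_url",  # assumed mapping
    "cleaned": "cleaned_content_url",
    "markdown": "markdown_content_url",
}


def finished_content_urls(job):
    """Return (original_url, content_url) pairs for items with status `done`."""
    field = CONTENT_URL_FIELDS[job["scrape_type"]]
    return [
        (item["original_url"], item[field])
        for item in job.get("job_items", [])
        if item.get("status") == "done" and item.get(field)
    ]


job = {
    "status": "done",
    "scrape_type": "markdown",
    "job_items": [
        {
            "original_url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
            "markdown_content_url": "https://data.webcrawlerapi.com/markdown/books.toscrape.com/https___books_toscrape_com_catalogue_category_books_travel_2_index_html",
            "status": "done",
        }
    ],
}

for original_url, content_url in finished_content_urls(job):
    print(original_url, "->", content_url)
```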


----
url: https://webcrawlerapi.com/changelog/2026-03-09-subscriptions-structured-outputs-markdown-export
----


March 9, 2026

Subscriptions, Structured Outputs, and Markdown Export

======================================================

This March we shipped three customer-facing improvements across billing and extraction workflows.

Monthly subscriptions and plan management

-----------------------------------------

You can now switch to a monthly subscription directly from the dashboard and manage your plan without contacting support.

This includes:

*   **Subscription plans** with included monthly credits

*   **Self-serve upgrades and downgrades**

*   **Subscription management** for payment method and billing changes

*   **Renewal visibility** with current plan, next billing date, and included credit usage

*   **Cancellation controls** while keeping top-up credits available after subscription end

Structured outputs with JSON Schema

-----------------------------------

Prompt-based scraping on `/v2/scrape` now supports **structured outputs**.

When you send a prompt, you can also provide a `response_schema` so the response follows a JSON Schema-defined structure.

Useful for:

*   **Reliable extraction** of fields like prices, contacts, or metadata

*   **Typed responses** that fit directly into apps, automations, and databases

*   **Nested data structures** including arrays, objects, enums, and optional fields
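
As an illustration, a request body combining a prompt with a schema might be built as below. This is a sketch: the helper name and the product schema are hypothetical, and placing `response_schema` at the top level next to `prompt` is an assumption based on the description above.

```python
import json


def build_structured_scrape_body(url, prompt, response_schema):
    """Assemble a /v2/scrape request body with a prompt and a JSON Schema."""
    return {"url": url, "prompt": prompt, "response_schema": response_schema}


# Hypothetical schema for a product page: a name, a price, and optional extras.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "price"],
}

body = build_structured_scrape_body(
    "https://www.example.com",
    "Extract the product name, price, currency, and tags.",
    product_schema,
)
print(json.dumps(body, indent=2))
```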

Combined markdown export for crawl jobs

---------------------------------------

Completed markdown crawl jobs can now be exported as a **single combined markdown file**.

Instead of fetching each page result one by one, you can download one consolidated output for the full job.

Useful for:

*   **RAG and knowledge base workflows**

*   **Content archiving** of full crawl runs

*   **Bulk processing** of complete jobs in one step

----
url: https://webcrawlerapi.com/changelog/2025-06-23-zapier-integration
----


June 23, 2025

Zapier Integration Available

============================

WebcrawlerAPI is now available on Zapier.

You can now automate web scraping workflows with other apps through Zapier. The integration uses the Scrape API v2 endpoint to scrape single webpages.

The Zapier app includes actions for:

*   Scrape a single webpage

The integration supports all v2 features including running prompts on scraped content to extract specific information or format the output. You can also specify output formats (markdown, cleaned text, or HTML) and use CSS selectors to clean unwanted elements.

Read [How to integrate no-code webcrawler in Zapier](/docs/sdk/zapier)

Connect your WebcrawlerAPI account and start building automated workflows.

[WebcrawlerAPI on Zapier](https://zapier.com/developer/public-invite/206901/b015f18eaa55e0545d9219f2942e94d1/)

----
url: https://webcrawlerapi.com/blog/how-to-crawl-website-with-net-and-c
----


Tutorial · Technical · 15 min read

How to Crawl Website with .NET and C#

=====================================

Learn how to effectively crawl websites with .NET and C#, exploring frameworks and APIs for both simple and complex tasks.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick Comparison](#quick-comparison)

*   [Using Open-Source C# and .NET Frameworks for Web Crawling](#using-open-source-c-and-net-frameworks-for-web-crawling)

*   [Abot Framework Overview](#abot-framework-overview)

*   [Setting Up and Configuring Abot](#setting-up-and-configuring-abot)

*   [SkyScraper Framework Overview](#skyscraper-framework-overview)

*   [Setting Up and Using SkyScraper](#setting-up-and-using-skyscraper)

*   [Why Choose WebCrawlerAPI?](#why-choose-webcrawlerapi)

*   [How to Integrate WebCrawlerAPI with C#](#how-to-integrate-webcrawlerapi-with-c)

*   [Summary and Key Takeaways](#summary-and-key-takeaways)

*   [FAQs](#faqs)

*   [What is the best web scraping library for C#?](#what-is-the-best-web-scraping-library-for-c)


**Want to crawl websites efficiently using .NET and C#?** Here's everything you need to know to get started, from choosing the right tools to writing your first crawler. Whether you're extracting data from static pages or handling JavaScript-heavy sites, this guide covers:

*   **Top Tools:** Use open-source frameworks like [Abot](https://github.com/sjdirect/abot) for multithreaded crawling or SkyScraper for handling dynamic content with async/await.

*   **Code Examples:** Learn how to set up and configure crawlers for both frameworks.

*   **Simpler Options:** Explore [WebCrawlerAPI](https://webcrawlerapi.com/) for scalable, hassle-free crawling with features like proxy rotation and JavaScript rendering.

### Quick Comparison

| Tool | Best For | Key Features | Setup Complexity |
| --- | --- | --- | --- |
| **Abot** | Custom crawling | Event-driven, respects robots.txt | Moderate |
| **SkyScraper** | Dynamic content | Async/await support, AJAX handling | Moderate |
| **WebCrawlerAPI** | Large-scale projects | JavaScript rendering, proxy management | Easy |

**In short:** Use Abot for flexibility, SkyScraper for modern web content, or WebCrawlerAPI for simplicity and scale. Ready to dive in? Let’s explore these tools step-by-step!

Using Open-Source C# and .NET Frameworks for Web Crawling

---------------------------------------------------------

Now that we've looked at why C# and .NET are great for [web crawling](https://webcrawlerapi.com/scrapers/webcrawler/html), let's dive into two open-source frameworks that make the process easier: **Abot** and **SkyScraper**.

### [Abot](https://github.com/sjdirect/abot) Framework Overview

Abot is designed for high-performance, multithreaded crawling and offers features like configurable crawl depth, an event-driven structure, and respect for robots.txt and crawl delays.

| Feature | Description |
| --- | --- |
| Event-Driven Architecture | Lets you add custom handlers for each stage of crawling |
| Configurable Crawl Depth | Control how deep the crawler explores a website |
| Polite Crawling | Automatically respects robots.txt and crawl delays |

### Setting Up and Configuring Abot

Here's an example of how to use Abot for crawling:

    using System;

    using Abot2.Crawler;

    using Abot2.Poco;

    // Limits on total pages and on links followed per page

    var config = new CrawlConfiguration

    {

        MaxPagesToCrawl = 100,

        MaxLinksPerPage = 50

    };

    var crawler = new PoliteWebCrawler(config);

    // Fires once for every page the crawler finishes

    crawler.PageCrawlCompleted += (sender, e) =>

    {

        var node = e.CrawledPage.AngleSharpHtmlDocument?.QuerySelector("div.data");

        if (node != null)

        {

            Console.WriteLine(node.TextContent);

        }

    };

    await crawler.CrawlAsync(new Uri("https://example.com"));

This script, written against the Abot 2.x API (the `Abot2` namespaces), sets up a crawler with limits on pages (100) and links per page (50). The `PageCrawlCompleted` event fires for each crawled page, and the handler prints the text of the first `div` with the `data` class, if the page has one.

### SkyScraper Framework Overview

SkyScraper leverages C#'s async/await features and Reactive Extensions for efficient handling of modern web content, including AJAX-loaded pages.

| Feature | Description |
| --- | --- |
| Asynchronous Processing | Handles multiple requests at the same time |
| Dynamic Content Support | Works well with AJAX-loaded content |
| Data Flow Management | Simplifies processing of asynchronous data streams |

### Setting Up and Using SkyScraper

Here's how to get started with SkyScraper:

    using SkyScraper;

    using SkyScraper.Poco;

    var crawler = new WebCrawler();

    var config = new CrawlConfiguration

    {

        StartUrl = "https://example.com/dynamic-page",

        MaxDepth = 3,

        DelayBetweenRequests = TimeSpan.FromSeconds(1)

    };

    await crawler.CrawlAsync(config);  // Start the crawl asynchronously

    foreach (var page in crawler.CrawledPages)  // Process each crawled page

    {

        var data = page.HtmlDocument.DocumentNode

            .SelectSingleNode("//div[@class='data']").InnerText;

        Console.WriteLine(data);

    }

This example sets a starting URL, limits the crawl depth to 3 levels, and adds a 1-second delay between requests. The CrawlAsync method handles the crawling, while the loop extracts and processes page data.

The right framework depends on your project's needs. Both Abot and SkyScraper are excellent for .NET-based web crawling, but simpler projects might benefit from API-based tools like WebCrawlerAPI, which we'll discuss next.

Alternative: Using [WebCrawlerAPI](https://webcrawlerapi.com/) for Crawling

---------------------------------------------------------------------------

If open-source frameworks like Abot and SkyScraper feel too complex or don't meet your needs, WebCrawlerAPI is a simpler and scalable option for web crawling in C# applications.

### Why Choose WebCrawlerAPI?

WebCrawlerAPI stands out by offering features that streamline [modern web crawling tasks](https://webcrawlerapi.com/docs/job). Here's a quick breakdown:

| Feature | What It Does | Why It Matters |
| --- | --- | --- |
| Automated JavaScript Rendering | Handles dynamic content seamlessly | Extracts data from JavaScript-heavy websites like SPAs |
| Infrastructure & Protection | Includes proxy rotation and cloud support | Ensures uninterrupted crawling at scale |
| Data Cleaning | Processes content automatically | Provides clean, structured data for immediate use |

### How to Integrate WebCrawlerAPI with C#

[Setting up WebCrawlerAPI](https://webcrawlerapi.com/docs/getting-started) is straightforward, especially compared to traditional frameworks. Here's a sample implementation.

#### Installation

    dotnet add package WebCrawlerApi

#### Basic example

    using WebCrawlerApi;

    using WebCrawlerApi.Models;

    // Initialize the client

    var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

    // Start a crawl and wait for it to complete

    var job = await crawler.CrawlAndWaitAsync(

        url: "https://example.com",

        scrapeType: "markdown",

        itemsLimit: 10

    );

    Console.WriteLine($"Job completed with status: {job.Status}");

    // Access job items and their content

    foreach (var item in job.JobItems)

    {

        var content = await item.GetContentAsync();

        if (content != null)

        {

            Console.WriteLine($"Content length: {content.Length}");

            Console.WriteLine($"Content preview: {content[..Math.Min(200, content.Length)]}...");

        }

    }

Starting at just $20 per month for 10,000 pages, WebCrawlerAPI offers a budget-friendly solution that balances simplicity with enterprise-grade features. It’s an excellent choice for handling modern, complex, or large-scale web crawling projects.

Summary and Key Takeaways

-------------------------

Different tools suit different needs, and understanding their strengths can help you make the right choice.

| Tool | Ideal For | Key Benefits |
| --- | --- | --- |
| Abot Framework | [Custom crawling needs](https://webcrawlerapi.com/docs/api/crawl) | Flexible configuration, event-driven processing, plugin options |
| WebCrawlerAPI | Large-scale projects | Automatic JavaScript rendering, proxy management, data cleaning |

**Abot Framework** is perfect for developers who need to fine-tune their crawling processes, while **WebCrawlerAPI** is a great option for enterprise-level projects, offering plans starting at $20/month for up to 10,000 pages. Its automated setup and ability to handle complex web technologies make it a dependable choice.

Here’s a quick breakdown of what each tool offers:

*   **Abot Framework:**

    *   Full control over the crawling process

    *   Seamless integration with existing systems

    *   Budget-friendly for smaller-scale projects

*   **WebCrawlerAPI:**

    *   Easy setup with minimal effort

    *   Handles modern web technologies effectively

    *   Scales effortlessly for large-volume crawling tasks

Pick Abot if you need customization and control, or go with WebCrawlerAPI for ease of use and scalability. Both tools bring unique strengths to the table.

FAQs

----

### What is the best web scraping library for C#?

Picking the right library can make web scraping much smoother. Here's a quick comparison of popular options and their strengths:

| Tool | Primary Use Case | Key Strength |
| --- | --- | --- |
| **[HtmlAgilityPack](https://html-agility-pack.net/)** | HTML parsing | Excellent for XPath-based data extraction |
| **HttpClient** | Page downloading | Supports asynchronous tasks and modern HTTP |
| **Abot** | Full crawling framework | Event-driven design with plugin capabilities |

When deciding on a library, think about these factors:

*   **Project complexity**: For straightforward tasks, HtmlAgilityPack might be enough. For more advanced needs, combining tools could work better.

*   **Performance demands**: HttpClient is ideal for handling multiple requests efficiently with its asynchronous features.

*   **Long-term support**: Check for active community involvement and comprehensive documentation.

If you're working on a large-scale project, WebCrawlerAPI is worth exploring for its built-in proxy rotation and handling of anti-bot measures, as mentioned earlier.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-net-err-internet-disconnected-or-err-failed
----

These errors indicate network failure (offline state, blocked DNS, proxy/VPN issues, or request blocking).

Stabilize tests by mocking external dependencies or waiting for required routes.

    await page.route('**/api/profile', route =>

      route.fulfill({

        status: 200,

        contentType: 'application/json',

        body: JSON.stringify({ id: 1, name: 'Ada' }),

      })

    );

    await page.goto('https://example.com/app');

For CI, verify outbound network access and proxy configuration.
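The "waiting for required routes" option mentioned above can be sketched as a small helper. This is a minimal sketch, not part of Playwright itself: the helper name `gotoAndWaitForRoute`, the `routeFragment` parameter, and the 15-second default timeout are assumptions chosen for illustration. It relies only on `page.goto` and `page.waitForResponse`.

```javascript
// Hypothetical helper: navigate and wait until a required API route has
// answered successfully, so later assertions do not race the network.
async function gotoAndWaitForRoute(page, url, routeFragment, timeout = 15000) {
  const [response] = await Promise.all([
    // Register the wait before navigation starts so the response is not missed.
    page.waitForResponse(
      (res) => res.url().includes(routeFragment) && res.ok(),
      { timeout }
    ),
    page.goto(url),
  ]);
  return response;
}
```

Inside a test this could be called as `await gotoAndWaitForRoute(page, 'https://example.com/app', '/api/profile')`. If the route never responds, the wait times out with an error that names the missing dependency rather than a generic failure further down the test.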

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-cannot-read-properties-undefined
----

This is a test code bug, usually from missing fixture args or wrong variable scope.

Always destructure fixtures in the test signature.

    import { test } from '@playwright/test';

    test('profile opens', async ({ page }) => {

      await page.goto('https://example.com');

      await page.getByRole('link', { name: 'Profile' }).click();

    });

Common mistake: writing `async () => { await page.goto(...) }` without destructuring `{ page }`, which leaves `page` undefined inside the test body.

----
url: https://webcrawlerapi.com/privacy
----

Privacy Policy

--------------

We, at WebcrawlerAPI, respect your privacy and are committed to protecting it through this Policy.

### 1\. Collection of Information

When you use our services, we may collect information related to your device, your location, and the way you interact with our services. This might include your IP address, browser type, access times, and referring website addresses.

### 2\. Use of Your Information

We use this information to provide and improve our services, understand how you use our services, develop new products or services, and communicate with you, our users.

### 3\. Sharing of Information

We do not sell your personal information to third parties or share it with them for marketing purposes without your explicit consent. We might share your information with third parties who perform services on our behalf.

### 4\. Data Processing and Sub-Processors

For detailed information about how we process your data and the third-party service providers we use, please refer to our:

*   [Data Processing Agreement (DPA)](/legal/Webcrawlerapi%20DPA.pdf)

*   [Sub-Processor List](/legal/subprocessor-list)

### 5\. Protection of Personal Information

We take security seriously. We take several measures to protect your personal information from unauthorized access.

### 6\. Changes to This Policy

We might modify this Privacy Policy from time to time. When we do, we will let you know by changing the date at the top of the policy.

### 7\. Contact us

If you have any questions or suggestions about this Privacy Policy, you can contact us at [\[email protected\]](/cdn-cgi/l/email-protection#23505653534c515763544641405142544f465142534a0d404c4e).

----
url: https://webcrawlerapi.com/docs/sdk/n8n
----

n8n WebcrawlerAPI integration

=============================

How to get website content for LLM training using n8n and WebCrawlerAPI.

n8n is a workflow automation tool that lets you connect services and automate tasks. You can use n8n with WebCrawlerAPI to crawl websites and extract data, which can then be used for training large language models (LLMs) or other purposes.

There are 2 ways to integrate WebcrawlerAPI with n8n: using the official WebcrawlerAPI node or the HTTP Request node. Both methods allow you to scrape webpages and extract data.

[Using the official WebcrawlerAPI node (recommended)](#using-the-official-webcrawlerapi-node-recommended)

---------------------------------------------------------------------------------------------------------

1.  Go to `Settings` and select `Community Nodes`. Search for `n8n-nodes-webcrawlerapi` and click "Install". After installation, you will see the WebcrawlerAPI node in the Community Nodes list.

2.  In your workflow, add a new node and search for the **WebcrawlerAPI** node. You will see the WebcrawlerAPI node in the results.

3.  Click on the node to configure it. You will need to add your [API key](https://dash.webcrawlerapi.com/access) to connect your WebcrawlerAPI account.

4.  After entering your API key, click "Connect" to verify your credentials.

5.  Now you can configure the node to scrape a webpage. Enter the URL you want to scrape in the "URL" field and select the output format (Markdown, Cleaned Text, or HTML).

[Using HTTP Request node](#using-http-request-node)

---------------------------------------------------

1.  In your workflow add a new node and select the **HTTP Request** node.

2.  Then tap "Import cURL" and paste the following snippet, using your [API key](https://dash.webcrawlerapi.com/access) and the URL you want to crawl:

    curl --request POST \

      --url https://api.webcrawlerapi.com/v2/scrape \

      --header 'Authorization: Bearer <YOUR API KEY>' \

      --header 'Content-Type: application/json' \

      --data '{

    	"url": "https://webcrawlerapi.com"

    }'

Tap "Test step" to make sure everything is working correctly. You should see the response with markdown from the API in the output panel.
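For checking the same request outside n8n (for example before pasting it into a workflow), the cURL call above can be expressed with `fetch`. This is a minimal sketch: the endpoint, headers, and body mirror the cURL snippet, while the helper name `buildScrapeRequest` and the `WEBCRAWLERAPI_KEY` environment variable are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the same POST request as the cURL snippet.
function buildScrapeRequest(apiKey, targetUrl) {
  return {
    endpoint: "https://api.webcrawlerapi.com/v2/scrape",
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: targetUrl }),
    },
  };
}

// Usage (performs a real network call, so it is left commented out):
// const { endpoint, options } = buildScrapeRequest(
//   process.env.WEBCRAWLERAPI_KEY, // assumed variable name
//   "https://webcrawlerapi.com"
// );
// const response = await fetch(endpoint, options);
// console.log(response.status, await response.text());
```

Separating request construction from the network call keeps the request easy to inspect and compare against what the HTTP Request node sends.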

----
url: https://webcrawlerapi.com/changelog/2025-05-07-s3-r2-integration
----

[Back to Changelog](/changelog)

May 7, 2025

📦 New: S3 Compatible Storage Integration

========================================

WebcrawlerAPI now supports direct export to any S3 compatible storage.

### What's new:

*   Export crawl results directly to Amazon S3 buckets (or any S3 compatible storage: Cloudflare R2, DigitalOcean Spaces, Wasabi, Backblaze B2, etc.)

*   Simple setup with API keys and bucket information

*   We don't store your keys after the job ends

### How it works:

When starting a job via API, just add several parameters, like access\_key\_id, secret\_access\_key and a few others. Crawled data will be placed under the specified path. Your keys will be deleted after the job ends. Read [Upload to S3](/docs/actions/s3-upload) docs for detailed information.

----
url: https://webcrawlerapi.com/blog/the-top-3-best-screenshot-apis-to-use-in-2025
----

API · Comparison · 5 min read

The Top 3 Best Screenshot APIs to Use in 2026

=============================================

See the top 3 screenshot APIs to try in 2025, with easy comparisons of prices, features, and free plans.

Written by Andrew

Published on Feb 3, 2026

### Table of Contents

*   [Why Screenshot APIs Matter](#why-screenshot-apis-matter)

*   [Screenshot API Use Cases](#screenshot-api-use-cases)

*   [1\. ScreenshotOne](#1-screenshotone)

*   [2\. CaptureKit](#2-capturekit)

*   [3\. Scrapingdog Screenshot API](#3-scrapingdog-screenshot-api)

Why Screenshot APIs Matter

--------------------------

If you're involved in compliance archiving, monitoring UI changes, or tracking competitor pricing over time, a screenshot API can streamline your workflow significantly. With so many tools available, we've highlighted three of the newest and most promising APIs to consider in 2025.

Each of these has been tested using their free plans, and we provide a side-by-side comparison of features and pricing.

Screenshot API Use Cases

------------------------

Screenshot APIs offer flexible automation options for a wide range of business and development needs. Below are some of the most practical use cases where screenshot APIs like ScreenshotOne excel:

*   **Investor and Board Reporting Decks**. Automatically generate up-to-date screenshots for reports shared with investors and board members.

*   **Legal Evidence and Discovery Snapshots**. Capture and preserve webpages exactly as they appear for use in legal proceedings.

*   **Marketing Content Creation**. Turn dynamic, live web content into reusable assets for marketing campaigns and presentations.

*   **Automate Open Graph Image Generation**. Automatically generate social media preview images (OG images) for shared links and pages.

*   **SEO Performance Tracking**. Capture search result pages to track keyword rankings and ad placements over time.

*   **User Experience Documentation**. Document visual flows and interactions on your website for UX audits and improvements.

*   **Visual Regression Testing**. Detect changes or inconsistencies in your UI by comparing periodic screenshots across deployments.

*   **Website Archiving for Compliance**. Create regular snapshots of your website to meet regulatory and archival requirements.

1\. [ScreenshotOne](https://screenshotone.com/)

-----------------------------------------------

Our top pick is ScreenshotOne. It is a stable, feature-rich screenshot API that is growing rapidly.

Pricing: Starts at $17/month for 2,000 screenshots.

Integration: Works with any programming language and No-Code tools.

Features: Rolling screenshots, dozens of options, cookie banner blocker, and an intuitive playground for testing.

Docs: Clean, good [documentation](https://screenshotone.com/docs/getting-started/).

2\. [CaptureKit](https://www.capturekit.dev/)

---------------------------------------------

CaptureKit stands out for its built-in device-specific screenshot capabilities, ideal for testing responsive designs.

Free Trial: 100 credits included.

Pricing: Starts at $7/month for 1,000 screenshots.

Docs: [documentation](https://docs.capturekit.dev/).

3\. [Scrapingdog Screenshot API](https://www.scrapingdog.com/screenshot-api/)

-----------------------------------------------------------------------------

Scrapingdog Screenshot API is a new addition to Scrapingdog's suite of web scraping tools. It allows you to capture full-page screenshots at scale.

Pricing: Starts at $40/month for 4,000 screenshots.

Documentation: great [documentation](https://docs.scrapingdog.com/screenshot-api)
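To compare the three entry plans on equal terms, the listed prices can be converted to a cost per screenshot. The sketch below uses only the starting prices quoted above; the plan objects and the `costPerScreenshot` helper are illustrative names, and higher tiers or free credits are not taken into account.

```javascript
// Entry-level plans as quoted in this article.
const plans = [
  { name: "ScreenshotOne", monthlyUsd: 17, screenshots: 2000 },
  { name: "CaptureKit", monthlyUsd: 7, screenshots: 1000 },
  { name: "Scrapingdog", monthlyUsd: 40, screenshots: 4000 },
];

// Price of a single screenshot when the full monthly quota is used.
function costPerScreenshot(plan) {
  return plan.monthlyUsd / plan.screenshots;
}

// Sort cheapest first.
const ranked = plans
  .map((plan) => ({ name: plan.name, usdPerScreenshot: costPerScreenshot(plan) }))
  .sort((a, b) => a.usdPerScreenshot - b.usdPerScreenshot);

for (const { name, usdPerScreenshot } of ranked) {
  console.log(`${name}: $${usdPerScreenshot.toFixed(4)} per screenshot`);
}
```

At full quota usage this works out to roughly $0.007 for CaptureKit, $0.0085 for ScreenshotOne, and $0.01 for Scrapingdog per screenshot, so the differences between entry plans are small compared with differences in features.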

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-fix-duplicate-response-headers-puppeteer
----

Summary

*   Duplicate header values should normally be merged into a single header value separated by a comma and a space. The Set-Cookie header is an exception and remains as separate values.

*   A fix was implemented to normalize newline-separated duplicate headers into comma-separated format, aligning with common HTTP and Fetch API expectations.

What changed

*   Headers that appear multiple times are now merged into one header with comma-separated values.

*   Set-Cookie headers remain distinct per their semantics.

How to use / post-process headers (example)

    function normalizeHeaders(headers) {

      const out = {};

      for (const [key, value] of Object.entries(headers)) {

        const k = key.toLowerCase();

        if (k === 'set-cookie') {

          out[key] = Array.isArray(value) ? value : [value];

        } else {

          if (Array.isArray(value)) {

            out[key] = value.join(', ');

          } else {

            out[key] = value;

          }

        }

      }

      return out;

    }

Example result after normalization

    {

      "server-timing": "sis; desc=0, geo; desc=IN, ak_p; desc=\"1760514732612_1750730254_2156868_116_7771_4_9_255\";dur=1",

      "set-cookie": ["AKA_A2=A; expires=Wed, 15-Oct-2025 08:59:55 GMT; path=/; domain=adobe.com; secure; HttpOnly"],

      "content-length": "3876",

      ...

    }

Notes

*   This change resolves issues caused by attempting to treat newline-separated headers as separate fields in header maps.

*   If you are consuming header data programmatically, prefer the normalized form where duplicates are merged with comma-space separators, except for Set-Cookie which remains as individual values.
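If you are still on a version that returns duplicate headers as a single newline-separated string, a small post-processing step can produce the merged form described above. This is a sketch under that assumption about the input shape; the function name `mergeNewlineSeparatedHeaders` is hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper: converts newline-separated duplicate header values
// into the comma-space form, keeping Set-Cookie values separate.
function mergeNewlineSeparatedHeaders(headers) {
  const out = {};
  for (const [key, value] of Object.entries(headers)) {
    const name = key.toLowerCase();
    // Split on newlines, trim each part, and drop empty fragments.
    const parts = String(value)
      .split("\n")
      .map((part) => part.trim())
      .filter(Boolean);
    // Set-Cookie must stay as individual values; everything else is merged.
    out[name] = name === "set-cookie" ? parts : parts.join(", ");
  }
  return out;
}
```

Set-Cookie is kept as an array because cookie values can themselves contain commas (for example in `expires` dates), so a comma-joined string could not be split back reliably.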

----
url: https://webcrawlerapi.com/tools/llmstxt-generator
----

Generate LLMs.txt for your website

==================================

### Insert a URL below to generate LLMs.txt from up to 100 pages of the website for free.

Frequently asked questions

--------------------------

### What is LLMs.txt?

### What does LLMs.txt look like?

### Can I use the generated LLMs.txt for my AI model?

### How many pages can I convert?

### Which URL should I use?

### What if I need to convert more pages?

----
url: https://webcrawlerapi.com/blog/what-is-rag
----

RAG · Technical · Start here · 3 min read

What is RAG (Retrieval-Augmented Generation)?

=============================================

Learn about RAG, a powerful technique that improves AI responses by combining language models with real-time information retrieval, making AI answers more accurate and up-to-date.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Why is RAG needed?](#why-is-rag-needed)

*   [What is the typical RAG architecture?](#what-is-the-typical-rag-architecture)

*   [Where can I find RAG providers?](#where-can-i-find-rag-providers)

RAG, short for Retrieval-Augmented Generation, is a technique that improves how AI models answer questions or generate content.

Traditional AI models, called large language models (LLMs), are trained on large amounts of data, but that data is frozen at a certain point in time. This means the models can miss new or detailed information.

RAG fixes this by letting the AI search for information before giving an answer. Instead of only using what it "already knows," the AI fetches relevant facts from external sources - like documents, websites, or databases - and includes them in its response. Think of it like giving the AI a library to check before it speaks.

Why is RAG needed?

------------------

LLMs are great at writing and answering questions, but they sometimes "hallucinate" - which means they make things up. They can also give outdated answers because their training data doesn't include the latest events.

RAG helps solve this problem in several ways:

*   **Up-to-date answers**: RAG lets AI access current information, even after its training is done.

*   **Better accuracy**: It grounds responses in real documents or data, which means answers are based on facts, not guesses.

*   **Higher trust**: Users can trace the sources the AI used, which helps them trust the output more.

*   **Flexible and cost-effective**: Instead of retraining an AI model (which is expensive), RAG just adds new data on the fly.

In short, RAG makes AI smarter, more reliable, and more useful - especially for tasks that depend on facts.

What is the typical RAG architecture?

-------------------------------------

RAG usually follows a three-step process:

1.  **Retrieve**: First, the user's question is turned into a vector (a numeric format), and the system searches a special database (called a vector database) for the most relevant pieces of information. These databases store data in a way that allows fast, accurate matching based on meaning, not just keywords.

2.  **Augment**: The relevant information that was found is added to the original question. This new combination (called an augmented prompt) gives the AI extra context.

3.  **Generate**: The AI model uses the augmented prompt to generate a response. Since it has fresh, relevant info, the answer is usually more accurate and detailed.
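The three steps above can be sketched in a few lines. This is a toy illustration, not a production design: the word-count "embedding" stands in for a real embedding model, the in-memory array stands in for a vector database, and `callModel` is a placeholder for whichever LLM client you use. All function names are assumptions made for this example.

```javascript
// Toy embedding: a word-count vector (real systems use a learned model).
function embed(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Step 1: Retrieve the topK documents most similar to the question.
function retrieve(question, documents, topK = 2) {
  const queryVector = embed(question);
  return documents
    .map((doc) => ({ doc, score: cosineSimilarity(queryVector, embed(doc.text)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((entry) => entry.doc);
}

// Step 2: Augment the question with the retrieved context.
function augment(question, contextDocs) {
  const context = contextDocs.map((d) => `- (${d.source}) ${d.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}

// Step 3: Generate an answer. `callModel` is a placeholder for an LLM call.
async function answer(question, documents, callModel) {
  const prompt = augment(question, retrieve(question, documents));
  return callModel(prompt);
}
```

Including the `source` of each retrieved passage in the prompt is what makes the "users can trace the sources" benefit possible: the model can cite where each fact came from.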

In the background, RAG systems also update their sources regularly to keep content fresh, and they use advanced search techniques to ensure results are relevant and high quality.

Where can I find RAG providers?

-------------------------------

Many cloud and AI platforms now offer RAG tools or services. These providers help businesses and developers build RAG-powered applications more easily. While many companies offer these tools, here are some types of tools or platforms to look for:

*   RAG engines: These tools manage retrieval and generation for you. Some include vector databases, search engines, and LLM integrations all in one. Most big cloud providers have their own RAG engines, for example [Cloudflare RAG](https://developers.cloudflare.com/workers-ai/guides/tutorials/build-a-retrieval-augmented-generation-ai/) or [Google Vertex AI RAG engine](https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview).

*   Search services: Modern search engines now support vector search, semantic search, and re-ranking to improve accuracy. Check [Exa.AI](https://exa.ai/)

*   Embeddings and vector databases: These store your content in a searchable format that AI can understand ([Pinecone](https://www.pinecone.io/), [Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [Chroma](https://www.trychroma.com/), [Milvus](https://milvus.io/en/), [Redis](https://redis.com/solutions/vector-database/) and [PostgreSQL vector extension](https://github.com/pgvector/pgvector) are some examples)

*   LLM platforms: Many AI model hosting platforms allow you to add retrieval steps to your generative models ([Langchain](https://www.langchain.com/), [Mirascope](https://github.com/mirascope/mirascope))

*   RAG builders and APIs: Some platforms offer drag-and-drop or low-code tools to quickly create chatbots and AI apps that use RAG ([Chatbase](https://www.chatbase.co/), [SiteGPT](https://sitegpt.ai/))

You can also build your own RAG system using open-source tools, such as LangChain, or combine different services like [WebcrawlerAPI](https://webcrawlerapi.com) for data extraction.

⸻

In summary, Retrieval-Augmented Generation is a powerful way to improve AI by combining smart search with language generation. It gives more accurate, up-to-date, and trustworthy answers - perfect for businesses, chatbots, internal tools, or any app that depends on real information.

----
url: https://webcrawlerapi.com/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts
----

Comparison · JSON · CSV · RAG

JSON vs CSV: Choosing the Right Format for LLM Prompts

======================================================

JSON vs CSV for scraped datasets and LLM prompt outputs: structure, nesting, parsing, and what works best for pipelines and RAG.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What JSON is good at](#what-json-is-good-at)

*   [What CSV is good at](#what-csv-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When JSON should be used](#when-json-should-be-used)

*   [When CSV should be used](#when-csv-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [CSV forces decisions early](#csv-forces-decisions-early)

*   [JSON makes "optional" fields easy](#json-makes-optional-fields-easy)

*   [Node.js snippet: Flatten JSON records for CSV export](#nodejs-snippet-flatten-json-records-for-csv-export)

*   [Conclusion](#conclusion)

JSON and CSV are both used for structured outputs, but different data shapes are assumed. JSON is used for objects and nested structures. CSV is used for flat rows.

A broader overview is covered in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | JSON | CSV |
| --- | --- | --- |
| Best for | Nested objects, metadata, APIs | Flat tables and exports |
| Parsing reliability | High | High (with correct quoting) |
| Human editing | Medium | Medium to High (spreadsheets) |
| Nested data | Supported | Not supported |
| Common failure | Schema drift | Commas/quotes/newlines in cells |

What JSON is good at

--------------------

JSON is usually selected when:

*   Each record contains nested fields (offers, variants, breadcrumbs)

*   Metadata is required for RAG (url, section, chunk\_id)

*   Validation and type checking are needed

JSON paired with readable docs is covered in [Markdown vs JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts).

What CSV is good at

-------------------

CSV is usually selected when:

*   A table is desired (one row per page/product)

*   Data must be used in spreadsheets

*   Simple imports are planned

If the data is not tabular, CSV is often the wrong tool. Plain narrative output is covered in [CSV vs Plain Text](/blog/csv-vs-plain-text-choosing-the-right-format-for-llm-prompts).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When JSON should be used

JSON is usually preferred when:

*   Crawled pages produce different optional fields

*   Arrays are expected (multiple images, multiple prices, multiple authors)

*   Downstream systems expect objects

### When CSV should be used

CSV is usually preferred when:

*   A stable schema exists (same columns every time)

*   Data will be filtered and reviewed in spreadsheets

*   A quick export is more important than perfect expressiveness

For readability-first outputs, Markdown is often used, as covered in [Markdown vs CSV](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts).

Practical tradeoffs

-------------------

### CSV forces decisions early

If arrays or nested objects exist, flattening rules must be invented (join with ;, create repeated columns, or explode rows). Those rules can be correct, but they must be maintained.
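The three flattening rules named above can be sketched as follows. The sample record and the helper names `joinArrays`, `repeatColumns`, and `explodeRows` are made up for illustration; each helper handles only top-level arrays.

```javascript
// Sample scraped record with one array field (illustrative data).
const record = { url: "https://example.com/p/1", title: "Widget", prices: [9.99, 12.5] };

// Rule 1: join array values into a single cell with a separator.
function joinArrays(rec, sep = ";") {
  return Object.fromEntries(
    Object.entries(rec).map(([k, v]) => [k, Array.isArray(v) ? v.join(sep) : v])
  );
}

// Rule 2: create repeated columns (prices_1, prices_2, ...).
function repeatColumns(rec) {
  const out = {};
  for (const [k, v] of Object.entries(rec)) {
    if (Array.isArray(v)) {
      v.forEach((item, i) => {
        out[`${k}_${i + 1}`] = item;
      });
    } else {
      out[k] = v;
    }
  }
  return out;
}

// Rule 3: explode one array field into one row per element.
function explodeRows(rec, field) {
  const values = Array.isArray(rec[field]) ? rec[field] : [rec[field]];
  return values.map((value) => ({ ...rec, [field]: value }));
}

console.log(joinArrays(record));
console.log(repeatColumns(record));
console.log(explodeRows(record, "prices"));
```

Each rule has a maintenance cost: joined cells need a separator that never appears in the data, repeated columns change the header whenever a record has more elements, and exploded rows duplicate the other fields (and drop the record entirely if the array is empty).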

### JSON makes "optional" fields easy

Fields can be omitted or set to null. That flexibility works well for scraped pages where data is inconsistent.

Node.js snippet: Flatten JSON records for CSV export

----------------------------------------------------

A simple flattening approach is shown: nested values are serialized as JSON strings. That is not pretty, but it is predictable.

    // Node 18+

    // Convert an array of JSON objects into a CSV with stable columns.

    import { readFile } from "node:fs/promises";

    const items = JSON.parse(await readFile("items.json", "utf8"));

    const keys = new Set();

    for (const item of items) for (const k of Object.keys(item)) keys.add(k);

    const headers = [...keys];

    function cell(v) {

      const s = typeof v === "string" ? v : JSON.stringify(v);

      return `"${String(s).replaceAll('"', '""')}"`;

    }

    const lines = [];

    lines.push(headers.join(","));

    for (const item of items) {

      lines.push(headers.map((h) => cell(item[h] ?? "")).join(","));

    }

    console.log(lines.slice(0, 5).join("\n"));

Conclusion

----------

*   JSON is usually preferred for nested data, metadata, and reliable ingestion.

*   CSV is usually preferred for flat datasets and spreadsheet-friendly exports.

*   In many scraping pipelines, JSON is used internally and CSV is generated only as an export.

If human-edited configs are needed, YAML can be compared in [YAML vs CSV](/blog/yaml-vs-csv-choosing-the-right-format-for-llm-prompts).

----
url: https://webcrawlerapi.com/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs
----

JS · Tutorial

Extracting article or blogpost content with Mozilla Readability

===============================================================

Extract clean article content from any web page using Mozilla's Readability library—the same algorithm that powers Firefox Reader View. Complete JavaScript code examples with HTML cleaning and error handling.

Written by Andrew

Published on Feb 7, 2026

### Table of Contents

*   [Complete article content extraction code](#complete-article-content-extraction-code)

*   [Removing unwanted HTML elements](#removing-unwanted-html-elements)

*   [Using the extraction function](#using-the-extraction-function)

*   [What Readability extracts](#what-readability-extracts)

*   [Handling edge cases](#handling-edge-cases)

*   [WebCrawlerAPI's main\_content\_only parameter](#webcrawlerapis-main_content_only-parameter)

Complete article content extraction code

----------------------------------------

If you want to understand how Readability works internally (scoring, candidates, cleanup), read: [Mozilla Readability Algorithm (Readability.js), Step by Step](/blog/mozilla-readability-algorithm-readabilityjs). If you want the Rust alternative with extra policy and candidate-selection controls, read: [How dom\_smoothie Rust Mozilla Readability alternative works](/blog/how-dom-smoothie-rust-mozilla-readability-alternative-works).

Here's a standalone JavaScript function that combines HTML cleaning with Mozilla's Readability parser:

    //npm install @mozilla/readability jsdom

    import { JSDOM } from "jsdom";

    import { Readability } from "@mozilla/readability";

    function extractArticleContent(url, html) {

      try {

        // Create a JSDOM document from the HTML

        const dom = new JSDOM(html, {

          url: url,

          contentType: "text/html",

        });

        const document = dom.window.document;

        // Optional: Clean unwanted elements first

        const unwantedElements = document.querySelectorAll(

          "script, style, noscript, iframe, footer, header, nav, .advertisement, .sidebar, .menu"

        );

        unwantedElements.forEach((element) => element.remove());

        // Use Readability to extract article content

        const reader = new Readability(document);

        const article = reader.parse();

        if (!article) {

          return null;

        }

        return {

          title: article.title || "",

          content: article.content || "",

          textContent: article.textContent || "",

          length: article.length || 0,

          excerpt: article.excerpt || "",

          byline: article.byline || "",

          dir: article.dir || "",

          siteName: article.siteName || "",

          lang: article.lang || "",

        };

      } catch (error) {

        console.error("Error extracting article content:", error.message);

        return null;

      }

    }

Want to try it without coding? Use the [Readability tool](https://webcrawlerapi.com/tools/html-main-content-readability) to extract main content from any HTML. If you're extracting content from third-party sites, read: [Web Scraping Ethics: A Complete Guide to Responsible Data Collection](https://webcrawlerapi.com/blog/web-scraping-ethics).

Removing unwanted HTML elements

-------------------------------

Before using Readability, it's often helpful to clean up the HTML by removing elements that are definitely not article content. This improves extraction accuracy and reduces false positives.

Here's how to remove common unwanted elements using a simple cleaning function:

    import { JSDOM } from "jsdom";

    function cleanHtml(

      html,

      unwantedTags = "script, style, noscript, iframe, img, footer, header, nav, head"

    ) {

      const dom = new JSDOM(html);

      const document = dom.window.document;

      // Remove unwanted elements

      const elementsToRemove = document.querySelectorAll(unwantedTags);

      elementsToRemove.forEach((element) => element.remove());

      return dom.serialize();

    }

This removes:

*   **Scripts and styles**: JavaScript code and CSS that aren't content

*   **Navigation elements**: Headers, footers, and navigation menus

*   **Media**: Images and iframes that might interfere with text extraction

*   **Metadata**: Head elements and other non-visible content

Using the extraction function

-----------------------------

Here's how to use the function with a complete example:

    import fetch from "node-fetch";

    async function scrapeArticleContent(url) {

      try {

        // Fetch the webpage

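        const response = await fetch(url);

        // Stop early on HTTP errors instead of parsing an error page

        if (!response.ok) {

          throw new Error(`Request failed with status ${response.status}`);

        }

        const html = await response.text();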

        // Extract article content

        const articleContent = extractArticleContent(url, html);

        if (articleContent) {

          console.log("Title:", articleContent.title);

          console.log("Author:", articleContent.byline);

          console.log("Content length:", articleContent.length);

          console.log("Excerpt:", articleContent.excerpt);

          console.log(

            "\nArticle content:\n",

            articleContent.textContent.substring(0, 500) + "..."

          );

        } else {

          console.log("Could not extract article content from this page");

        }

      } catch (error) {

        console.error("Error scraping content:", error.message);

      }

    }

    // Example usage

    scrapeArticleContent("https://example-blog.com/article");

What Readability extracts

-------------------------

The Readability parser returns several useful properties:

*   **title**: The article's main title

*   **content**: Clean HTML content without navigation and ads

*   **textContent**: Plain text version of the article content

*   **length**: Character count of the article content

*   **excerpt**: Short summary or first few sentences

*   **byline**: Author information if found

*   **dir**: Text direction (ltr/rtl)

*   **siteName**: Name of the website

*   **lang**: Language of the content
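As a quick illustration of how these fields might be consumed, the sketch below condenses a result object into a small summary record. The `summarizeArticle` helper and its output field names are hypothetical, and the reading speed of 200 words per minute is an assumption, not something Readability provides.

```javascript
// Hypothetical helper: condense a Readability-style result into a summary.
// `article` is expected to have the shape returned by extractArticleContent().
function summarizeArticle(article) {
  if (!article) {
    return null; // extraction failed, nothing to summarize
  }
  const text = (article.textContent || "").trim();
  // Split on runs of whitespace to approximate a word count.
  const wordCount = text === "" ? 0 : text.split(/\s+/).length;
  return {
    title: article.title || "(untitled)",
    author: article.byline || "unknown",
    isRightToLeft: article.dir === "rtl",
    wordCount,
    // Assumed reading speed of 200 words per minute, with a 1-minute floor.
    readingMinutes: Math.max(1, Math.ceil(wordCount / 200)),
  };
}
```

Because the helper only reads plain properties, it should work equally well on the fallback object from a selector-based extraction, where fields such as `byline` are simply missing.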

Handling edge cases

-------------------

Not all pages will work perfectly with Readability. The wrapper below falls back to common content selectors when Readability returns nothing or very little text:

    function extractArticleContentRobust(url, html) {

      const result = extractArticleContent(url, html);

      // Fallback if Readability fails

      if (!result || result.length < 100) {

        console.log("Readability extraction failed, trying fallback...");

        // Simple fallback: extract text from article content areas

        const dom = new JSDOM(html);

        const document = dom.window.document;

        const contentSelectors = [

          "article",

          "main",

          '[role="main"]',

          ".post",

          ".article-body",

          ".content",

        ];

        for (const selector of contentSelectors) {

          const element = document.querySelector(selector);

          if (element && element.textContent.length > 100) {

            return {

              title: document.title || "",

              textContent: element.textContent.trim(),

              content: element.innerHTML,

              length: element.textContent.length,

            };

          }

        }

      }

      return result;

    }

WebCrawlerAPI's main\_content\_only parameter

---------------------------------------------

If you're looking for a ready-made solution that handles all the complexity of article content extraction, WebCrawlerAPI provides a simple parameter that does this automatically.

Just set main\_content\_only to true in the body of your API request:

    const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {

      method: "POST",

      headers: {

        Authorization: "Bearer YOUR_API_KEY",

        "Content-Type": "application/json",

      },

      body: JSON.stringify({

        url: "https://example.com/article",

        main_content_only: true,

        scrape_type: "markdown",

      }),

    });

This automatically extracts only the main article content using advanced algorithms, saving you from having to implement and maintain the extraction logic yourself.

Learn more about the main\_content\_only parameter in the [WebCrawlerAPI documentation](https://webcrawlerapi.com/docs/api/scrape).

----
url: https://webcrawlerapi.com/blog/what-is-xpath
----

Technical · 1 min read

What is Xpath?

==============

XPath is a powerful query language for selecting nodes in an XML or HTML document. Learn about the key features and aspects of XPath.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [What is Xpath?](#what-is-xpath)

What is Xpath?

--------------

XPath, short for XML Path Language, is a query language designed for selecting nodes from an XML (or HTML) document. In addition to its primary function, XPath can also be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML/HTML document. It is a crucial part of technologies like XSLT, XQuery, and XPointer.

Here are some key features and aspects of XPath:

1.  **Path Expressions**: XPath uses path expressions to navigate through elements and attributes in an XML document. These expressions are similar to file paths in a filesystem.

    *   / selects from the root node.

    *   // selects nodes in the document from the current node that match the selection, no matter where they are.

    *   . selects the current node.

    *   .. selects the parent of the current node.

    *   @ selects attributes.

2.  **Node Selection**: XPath defines several types of nodes, including element nodes, attribute nodes, text nodes, and more. Path expressions can be used to select these nodes.

    *   Example: /bookstore/book selects all book elements under the bookstore element.

3.  **Predicates**: Predicates are used to find a specific node or a node that contains a specific value. They are enclosed in square brackets.

    *   Example: //book\[price>35.00\] selects all book elements with a price element greater than 35.00.

4.  **Functions**: XPath 2.0 and later include over 100 built-in functions for string values, numeric values, date and time comparison, node and QName manipulation, sequence manipulation, and more. XPath 1.0 has a much smaller core library.

    *   Example: contains(@lang, 'en') checks if the lang attribute contains the string 'en'.

5.  **Axes**: Axes define a node-set relative to the current node. There are many axes available, such as child, parent, ancestor, descendant, following, and more.

    *   Example: child::book selects all book children of the current node.

6.  **Operators**: XPath supports standard arithmetic operators, Boolean operators, and comparison operators.
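To make the path and predicate examples above concrete, here is a deliberately tiny sketch that mimics two of them over a plain JavaScript object tree. It is not an XPath engine: the node shape, the helper names (`selectAbsolute`, `booksPricedAbove`, and so on), and the sample data are all hypothetical, and only absolute child-step paths plus one hand-written predicate are covered.

```javascript
// Toy document model: every node is { name, text?, children? }.
// The sample data is illustrative only.
const bookstore = {
  name: "bookstore",
  children: [
    {
      name: "book",
      children: [
        { name: "title", text: "Book A" },
        { name: "price", text: "39.95" },
      ],
    },
    {
      name: "book",
      children: [
        { name: "title", text: "Book B" },
        { name: "price", text: "29.99" },
      ],
    },
  ],
};

// Children of a node with a given name (one child:: step).
function childrenNamed(node, name) {
  return (node.children || []).filter((child) => child.name === name);
}

// The node itself plus all of its descendants, in document order.
// This is the set of nodes that the // shorthand searches.
function descendantsOrSelf(node) {
  return [node, ...(node.children || []).flatMap(descendantsOrSelf)];
}

// Evaluate an absolute path made only of child steps, e.g. "/bookstore/book".
function selectAbsolute(rootElement, path) {
  // XPath starts from a document node that sits above the root element.
  let current = [{ name: "#document", children: [rootElement] }];
  for (const step of path.split("/").filter(Boolean)) {
    current = current.flatMap((node) => childrenNamed(node, step));
  }
  return current;
}

// Hand-written equivalent of //book[price>35.00].
function booksPricedAbove(rootElement, limit) {
  return descendantsOrSelf(rootElement).filter(
    (node) =>
      node.name === "book" &&
      childrenNamed(node, "price").some((price) => Number(price.text) > limit)
  );
}

// Text of a book's first title child.
function titleOf(book) {
  return childrenNamed(book, "title")[0].text;
}

// Both books match the path; only the more expensive one passes the predicate.
console.log(selectAbsolute(bookstore, "/bookstore/book").map(titleOf));
console.log(booksPricedAbove(bookstore, 35.0).map(titleOf));
```

In real code you would hand the expression string to an actual XPath implementation, such as `document.evaluate` in browsers, rather than re-implementing the steps by hand.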

XPath is widely used in various applications such as:

*   Extracting data from XML/HTML documents.

*   Navigating through an XML/HTML document's structure.

*   Transforming XML/HTML documents using XSLT.

*   Defining parts of an XML/HTML document to be processed by XQuery.

----
url: https://webcrawlerapi.com/blog/cleaned-text-vs-markdown-choosing-the-right-output-format
----

Comparison · Markdown · RAG · 10 min read

Cleaned text vs Markdown: Choosing the Right Output Format for AI

=================================================================

Explore the differences between cleaned text and Markdown to determine the best format for your data processing and content management needs.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick Comparison](#quick-comparison)

*   [1\. Cleaned Text Overview](#1-cleaned-text-overview)

*   [Key Characteristics of Cleaned Text](#key-characteristics-of-cleaned-text)

*   [Trade-Offs of Cleaned Text](#trade-offs-of-cleaned-text)

*   [2\. Markdown Overview](#2-markdown-overview)

*   [Core Features and Applications](#core-features-and-applications)

*   [Implementation in Modern Workflows](#implementation-in-modern-workflows)

*   [Technical Considerations](#technical-considerations)

*   [Optimization Strategies](#optimization-strategies)

*   [Advantages and Disadvantages](#advantages-and-disadvantages)

*   [Comparative Analysis](#comparative-analysis)

*   [Format-Specific Considerations](#format-specific-considerations)

*   [Real-World Applications and Technical Impact](#real-world-applications-and-technical-impact)

*   [Conclusion](#conclusion)

[Cleaned text](/blog/html-vs-cleaned-text-choosing-the-right-output-format) works best for **AI training**, while [Markdown](/blog/best-prompt-data) is ideal for maintaining **content structure and hierarchy**.

The choice between cleaned text and Markdown depends on your project's needs. Here's a quick breakdown to help you decide:

*   **Cleaned Text**: Raw, unformatted text. Best for AI training, NLP, and large-scale data analysis where speed and simplicity matter most.

*   **Markdown**: Text with lightweight formatting (headers, lists, tables). Ideal for documentation, content management, and tasks requiring structure.

### Quick Comparison

| **Aspect** | **Cleaned Text** | **Markdown** |
| --- | --- | --- |
| **Primary Use Case** | AI training, data analysis | Content storage, documentation |
| **Processing Speed** | Faster (minimal overhead) | Slower (includes formatting) |
| **Human Readability** | Basic | Enhanced with formatting |
| **Structure Preservation** | None | Retains headers, lists, and tables |
| **Storage Efficiency** | Minimal space required | Requires more storage due to formatting |

### Summary:

*   Use **cleaned text** for faster processing and AI workflows.

*   Choose **Markdown** when you need structured, human-readable content.

Read on for a deeper dive into their strengths, limitations, and real-world applications.

1\. Cleaned Text Overview

-------------------------

Cleaned text refers to raw data that has been stripped of formatting and unnecessary characters. It’s a key component in preparing data for AI training and large-scale analysis.

### Key Characteristics of Cleaned Text

| Characteristic | Description | Processing Impact |
| --- | --- | --- |
| **Structure** | Free of formatting or markup | Speeds up processing by 50% |
| **Storage** | Requires minimal space | Lowers storage costs |
| **Consistency** | Standardized and uniform | Boosts accuracy by 30% |
| **Scalability** | Suitable for batch processing | Optimized for large-scale tasks |

Cleaned text ensures consistent and reliable data, making it especially useful for:

*   **AI Training**: Provides standardized datasets for better model performance.

*   **Natural Language Processing (NLP)**: Reduces noise and inconsistencies for clearer results.

*   **Large-Scale Analytics**: Handles vast datasets efficiently.

It’s particularly effective in tasks like web scraping for sentiment analysis, where extra formatting can distort findings. However, preprocessing is essential to address inconsistencies and special characters.

> "Studies have shown that cleaned text can reduce data processing time by up to 50% and improve data accuracy by 30%, underscoring its importance in data extraction workflows."

### Trade-Offs of Cleaned Text

The main downside of cleaned text is the loss of structural information and metadata, which can be critical for projects where content formatting matters. For tasks prioritizing speed and consistency, cleaned text is ideal. But if preserving structure is crucial, formats like Markdown might be a better fit.

Choosing between cleaned text and other formats depends on your project’s needs, particularly the balance between efficiency and structural detail.

2\. Markdown Overview

---------------------

Markdown is a lightweight markup language designed to balance simplicity and structure, making it ideal for tasks that require hierarchy and formatting. Unlike plain text, Markdown preserves structural elements, which is especially useful for data extraction and content organization.

### Core Features and Applications

| **Feature** | **Advantage** | **Use Case** |
| --- | --- | --- |
| **Simple Syntax** | Cuts down parsing complexity by 40% | Writing, documentation |
| **Structured Format** | Supports efficient data organization | RAG systems, AI training |
| **Format Flexibility** | Converts easily to [HTML](/blog/html-vs-markdown-choosing-the-right-output-format), PDF, DOCX | Multi-platform publishing |
| **Clean Structure** | Simplifies automated processing | Web scraping, content analysis |

Markdown's use of headers, lists, and code blocks makes it an essential tool for AI training workflows and content management. Its structured format bridges the gap between human-readable content and machine-friendly data.

### Implementation in Modern Workflows

A great example of Markdown in action is [WebCrawlerAPI](https://webcrawlerapi.com/), which uses it to automate the conversion of web content into clean, structured data. This highlights Markdown's ability to streamline web scraping workflows.

> "Markdown's simplicity makes it a favorite for writers and content creators." - 2Markdown.com [\[1\]](https://2markdown.com/blog/markdown-vs-html-content-creation)

### Technical Considerations

Several factors influence Markdown's effectiveness in data extraction:

*   **Content Organization**: Features like headers and lists make targeting specific data easier.

*   **Processing Efficiency**: Its plain text nature minimizes computational demands.

*   **Format Consistency**: Standardized syntax ensures accurate parsing and reliable results.

### Optimization Strategies

To get the most out of Markdown in data extraction workflows, focus on consistent formatting and proper indentation. Regular checks for syntax errors can prevent issues and maintain data quality. This is especially critical for projects involving structured content management or preparing training data for AI systems [\[2\]](https://app.studyraid.com/en/read/11460/359239/writing-clean-and-maintainable-markdown).

Advantages and Disadvantages

----------------------------

Choosing between cleaned text and Markdown comes down to understanding their pros and cons. Here's a quick comparison to help you decide.

### Comparative Analysis

| **Aspect** | **Cleaned Text** | **Markdown** |
| --- | --- | --- |
| **Storage Efficiency** | Requires minimal space | Needs more storage due to formatting |
| **Processing Speed** | Quick to process, ideal for AI tasks | Slower due to extra parsing |
| **Structural Information** | None | Retains headers, lists, and tables |
| **Data Integration** | Easy to integrate with ML pipelines | May need conversion for certain systems |
| **Content Preservation** | Loses formatting and structure | Keeps document hierarchy intact |

These differences become more apparent when applied to specific workflows and technical needs.

### Format-Specific Considerations

According to WebCrawlerAPI's data, cleaned text processes about 40% faster than Markdown when handling large datasets for AI training. However, this speed comes at the expense of losing structural details, which can be crucial for content management systems.

> "Cleaned text is particularly effective in AI data preparation and LLM training, where large volumes of raw text data are needed. The simplicity of the format significantly reduces processing overhead." - 2Markdown.com [\[1\]](https://2markdown.com/blog/markdown-vs-html-content-creation)

### Real-World Applications and Technical Impact

In RAG systems, Markdown's structured format is essential for preserving hierarchy and context, something cleaned text cannot achieve. This makes Markdown critical for workflows that rely on detailed formatting.

**Why Choose Cleaned Text?**

*   Easy integration with AI workflows

*   Requires little to no preprocessing

**Why Choose Markdown?**

*   Supports inline code and tables

*   Offers flexible conversion options

*   Retains document structure and hierarchy

Ultimately, your choice should match your project's goals. [Cleaned text](/blog/html-vs-cleaned-text-choosing-the-right-output-format) works best for AI and data-heavy tasks, where speed and simplicity matter most. On the other hand, Markdown is ideal for projects that need to maintain structure and formatting, such as documentation or content management systems.

Conclusion

----------

Choosing between cleaned text and Markdown depends on your project's needs. Cleaned text works best for AI training, while Markdown is ideal for maintaining content structure and hierarchy. Understanding these differences can help teams align their workflows with their goals.

> "Markdown provides semantic meaning for content in a relatively simple way." - HackerNoon [\[3\]](https://hackernoon.com/pros-and-cons-of-using-markdown-for-technical-writing-34f277418a8a)

Each format has its own strengths. Here's how to decide:

*   **For AI/ML Projects**: Use cleaned text when processing large datasets, especially for AI training, where extra formatting can cause issues.

*   **For Content Management**: Markdown is better for preserving structure, making it essential for systems like RAG that need context for accurate results.

*   **For Hybrid Systems**: Save content in Markdown for flexibility and convert to cleaned text when AI processing is required.
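The hybrid approach can be sketched as a small conversion step that strips Markdown syntax on demand. The `markdownToCleanedText` function below is a hypothetical, regex-based illustration that assumes simple documents; it does not handle tables, nested emphasis, underscore-style italics, or code fences, and a real pipeline would more likely use a proper Markdown parser.

```javascript
// Naive Markdown-to-text conversion for simple documents.
// This is a regex sketch, not a full Markdown parser.
function markdownToCleanedText(markdown) {
  return markdown
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1") // images: keep the alt text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // links: keep the link text
    .replace(/^#{1,6}[ \t]+/gm, "") // heading markers at line start
    .replace(/^[ \t]*(?:[-*+]|\d+\.)[ \t]+/gm, "") // bullet and numbered list markers
    .replace(/(\*\*|__)(.+?)\1/g, "$2") // bold
    .replace(/\*(.+?)\*/g, "$1") // asterisk italics only, so snake_case survives
    .replace(/`([^`]+)`/g, "$1") // inline code
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
}

const sample = [
  "# Title",
  "",
  "Some **bold** and *italic* text with a [link](https://example.com).",
  "",
  "- first item",
  "- second `code` item",
].join("\n");

console.log(markdownToCleanedText(sample));
```

Storing the Markdown and deriving cleaned text this way keeps the conversion one-directional: structure can be dropped when needed but never has to be reconstructed.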

Success depends on matching the format to your project's technical needs. Cleaned text offers faster processing and simpler integration, while Markdown's structure allows for flexibility and diverse outputs.

Tools for converting between these formats are widely available. WebCrawlerAPI's multi-format support ensures development teams can stay flexible and optimize for their specific use cases.

----
url: https://webcrawlerapi.com/tos
----

Terms of service

----------------

These terms of use (the “Terms”) apply to your access and use of our service (the “Service”). By accessing or using the Service, you agree to be bound by these Terms. If you do not agree to these Terms, do not use the Service.

### Privacy

We are committed to protecting your privacy and personal data. We will only use your personal data in accordance with our privacy policy, which is available on our website. Please review our **[Privacy Policy](/privacy)** carefully before using the Service.

### Restrictions

You agree not to (a) access the Service in a manner that could damage, disable, overburden, or impair the Service or interfere with any other party’s use of the Service; or (b) use the Service for any illegal or unauthorized purpose.

### Termination

We reserve the right to terminate or suspend your access to the Service at any time, for any reason, and without notice.

### Scraped data misuse

This web scraping tool is designed to collect only publicly available data from websites, ensuring that no private or sensitive information, such as emails, phone numbers, or other confidential details, is accessed or gathered. The responsibility for how the collected data is used lies solely with the user of the tool. Users must ensure that their actions comply with all applicable laws, regulations, and the terms of service of the websites they scrape.

WebCrawlerAPI does not control or influence the use of the data and is not liable for any misuse or illegal activity by the user. Misuse of the data, violation of intellectual property rights, or any inappropriate handling of the data will result in immediate termination of the user’s account. We reserve the right to block any account that violates these rules without prior notice. In such cases, a full refund will be issued.

### Disclaimer of Warranties

The Service is provided “as is” and “as available” without warranties of any kind. We do not guarantee that the Service will be available at all times or that it will be error-free.

### Limitation of Liability

In no event shall we be liable for any damages arising out of or in connection with the use of the Service.

### Free Trial Balance

Free trial access is intended for evaluation purposes only. Creating multiple accounts to circumvent usage limits or obtain additional free trial credits is strictly prohibited. We reserve the right to limit, suspend, or terminate access to our services, including trial access, in cases of suspected abuse or misuse.

### Changes to These Terms

We reserve the right to modify these Terms at any time. We will post any changes on this page and encourage you to review the Terms regularly. Your continued use of the Service after any changes have been made will constitute your acceptance of the revised Terms.

### Entire Agreement

These Terms constitute the entire agreement between you and us regarding the use of the Service and supersede any prior agreements.

### Contact us

If you have any questions or suggestions about this Privacy Policy, you can contact us at [\[email protected\]](/cdn-cgi/l/email-protection#55262025253a2721152230373627342239302734253c7b363a38).

----
url: https://webcrawlerapi.com/changelog/2026-01-05-markdown-endpoint
----

January 5, 2026

Combined Markdown Export Endpoint

=================================

The new **GET /job/:id/markdown** endpoint returns the full job content as a single concatenated Markdown file.

Instead of downloading individual page results, you can now get all crawled markdown content combined into a single file. Each page's content is separated with URL headers for easy parsing.

Perfect for:

*   **RAG Applications**: Feed combined documentation into vector databases or AI models

*   **Batch Processing**: Process entire website content at once for analysis or indexing

*   **Documentation Extraction**: Extract and combine docs from multiple pages

*   **Backup**: Archive complete crawl results in a single file

See the [API documentation](/docs/api/markdown) for usage examples and complete reference.

----
url: https://webcrawlerapi.com/docs/sdk/dotnet
----

.NET

====

Learn how to use the WebCrawler API .NET SDK to crawl websites and extract data.

[Installation](#installation)

-----------------------------

    dotnet add package WebCrawlerApi

[Requirements](#requirements)

-----------------------------

*   .NET 7.0 or higher

[Usage](#usage)

---------------

### [Synchronous Crawling](#synchronous-crawling)

The synchronous method waits for the crawl to complete and returns all data at once.

    using WebCrawlerApi;

    using WebCrawlerApi.Models;

    // Initialize the client

    var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

    // Synchronous crawling

    var job = await crawler.CrawlAndWaitAsync(

        url: "https://example.com",

        scrapeType: "markdown",

        itemsLimit: 10

    );

    Console.WriteLine($"Job completed with status: {job.Status}");

    // Access job items and their content

    foreach (var item in job.JobItems)

    {

        Console.WriteLine($"Page title: {item.Title}");

        Console.WriteLine($"Original URL: {item.OriginalUrl}");

        var content = await item.GetContentAsync();

        if (content != null)

        {

            Console.WriteLine($"Content preview: {content[..Math.Min(200, content.Length)]}...");

        }

    }

### [Asynchronous Crawling](#asynchronous-crawling)

The asynchronous method returns a job ID immediately and allows you to check the status later.

    using WebCrawlerApi;

    using WebCrawlerApi.Models;

    // Initialize the client

    var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

    // Start async crawl job

    var response = await crawler.CrawlAsync(

        url: "https://example.com",

        scrapeType: "markdown",

        itemsLimit: 10

    );

    // Get the job ID

    var jobId = response.Id;

    Console.WriteLine($"Crawling job started with ID: {jobId}");

    // Check job status

    var job = await crawler.GetJobAsync(jobId);

    while (job.Status == "new" || job.Status == "in_progress")

    {

        await Task.Delay(job.RecommendedPullDelayMs);

        job = await crawler.GetJobAsync(jobId);

    }

    // Process results

    if (job.Status == "done")

    {

        foreach (var item in job.JobItems)

        {

            Console.WriteLine($"Page title: {item.Title}");

            Console.WriteLine($"Original URL: {item.OriginalUrl}");

        }

    }

[Available Parameters](#available-parameters)

---------------------------------------------

[Response Objects](#response-objects)

-------------------------------------

### [Job Object](#job-object)

    job.Id                         // Unique job identifier

    job.Status                     // Job status (new, in_progress, done, error)

    job.Url                        // Original crawl URL

    job.CreatedAt                  // Job creation timestamp

    job.FinishedAt                 // Job completion timestamp

    job.JobItems                   // List of crawled items

    job.RecommendedPullDelayMs     // Recommended delay between status checks

### [JobItem Object](#jobitem-object)

    item.Id                    // Unique item identifier

    item.OriginalUrl           // URL of the crawled page

    item.Title                 // Page title

    item.Status               // Item status

    item.PageStatusCode       // HTTP status code

    item.MarkdownContentUrl   // URL to markdown content (if applicable)

    item.RawContentUrl        // URL to raw content

    item.CleanedContentUrl    // URL to cleaned content

    item.GetContentAsync()    // Method to get content based on scrape_type

----
url: https://webcrawlerapi.com/docs/rate-limits
----

WebcrawlerAPI Rate Limits

=========================

Rate limiting mechanisms to ensure ethical web scraping practices and prevent excessive load on target websites

To ensure ethical web scraping practices and prevent excessive load on target websites, we implement rate limiting mechanisms. These limits help maintain a balance between efficient data collection and responsible website interaction.

[Rate Limit Rules](#rate-limit-rules)

-------------------------------------

### [Per-Account Limits](#per-account-limits)

*   Each account is limited to **5 parallel threads** for concurrent scraping operations

*   This limit applies regardless of the target website

### [Per-Website Limits](#per-website-limits)

*   Each website has a shared limit of **5 parallel threads** across all accounts

*   This limit is shared among different accounts scraping the same website

[Example Scenario](#example-scenario)

-------------------------------------

Let's consider a situation with two different accounts scraping the same website:

1.  Account A starts scraping `example.com` using 3 parallel threads

2.  Account B attempts to scrape `example.com` at the same time

3.  Since `example.com` already has 3 threads in use:

    *   Account B can only use up to 2 additional threads (5 total - 3 in use = 2 available)

    *   This is true even if Account B's personal thread limit hasn't been reached

This shared website limit ensures that no single website receives excessive load, even when multiple accounts are scraping simultaneously.
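The arithmetic in this scenario can be written as a short sketch. The two constants come from the limits stated above; the `availableThreads` helper is hypothetical, and taking the minimum of the two remaining budgets is an assumption about how the limits combine, not a description of the service's internals.

```javascript
// Limits stated on this page.
const ACCOUNT_THREAD_LIMIT = 5;
const WEBSITE_THREAD_LIMIT = 5;

// Threads an account could still start against one website.
// accountInUse: threads this account already runs (on any website).
// websiteInUse: threads all accounts together already run against the target website.
function availableThreads(accountInUse, websiteInUse) {
  const accountRemaining = Math.max(0, ACCOUNT_THREAD_LIMIT - accountInUse);
  const websiteRemaining = Math.max(0, WEBSITE_THREAD_LIMIT - websiteInUse);
  // Both limits apply at once, so the tighter one decides (assumption).
  return Math.min(accountRemaining, websiteRemaining);
}

// Account B is idle while Account A already uses 3 threads on example.com.
console.log(availableThreads(0, 3));
```

For Account B in the scenario above, the account budget is untouched (5 remaining) but the website budget has only 2 left, so the website limit is the binding one.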

----
url: https://webcrawlerapi.com/changelog/2026-01-23-structured-outputs-prompts
----

January 23, 2026

Structured Outputs with JSON Schema

===================================

Structured Outputs ensure AI-generated responses adhere to a JSON schema you define, eliminating the need to validate or retry incorrectly formatted responses.

When you provide a prompt to the /v2/scrape endpoint, you can now add a response\_schema parameter to enforce strict JSON Schema validation. The API will return guaranteed structured data that matches your schema every time.

Perfect for:

*   **Data Extraction**: Extract product info, business details, or any structured data with type safety

*   **Reliable Parsing**: No more validating or handling malformed JSON responses

*   **Complex Structures**: Support for nested objects, arrays, enums, and optional fields

*   **API Integration**: Feed predictable data directly into your application or database

What's new?

-----------

*   **JSON Schema Validation**: Define exact structure using standard JSON Schema format

*   **Type Safety**: Guaranteed response format with string, number, boolean, object, array, and enum types

*   **Nested Structures**: Support for complex nested objects and arrays of objects

*   **Enum Constraints**: Restrict values to predefined options

*   **Optional Fields**: Use null union types for fields that may not always be present

*   **Clear Error Handling**: Validation errors returned immediately if schema is invalid
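A request using these features might look like the sketch below. Only the `/v2/scrape` endpoint and the `response_schema` parameter come from this announcement; the `url` and `prompt` field names, the bearer-token header, and the example schema are assumptions, so check the Structured Outputs documentation for the exact request shape. The request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.webcrawlerapi.com/v2/scrape"

# Example schema: an object with an enum and a nullable (optional) field.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "string", "enum": ["in_stock", "out_of_stock"]},
        "rating": {"type": ["number", "null"]},  # null union: value may be missing
    },
    "required": ["name", "price", "availability", "rating"],
}


def build_request(page_url: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a response_schema."""
    body = {
        "url": page_url,  # assumed field name
        "prompt": prompt,  # assumed field name
        "response_schema": PRODUCT_SCHEMA,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    "https://example.com/product", "Extract the product details.", "your-api-key"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a key.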

Pricing

-------

Structured outputs cost the same as regular prompts: **$0.002 per request** (in addition to the base crawling cost).

See the [Structured Outputs documentation](/docs/structured-outputs) for schema examples and complete usage guide.

----
url: https://webcrawlerapi.com/changelog/2025-04-28-organizations-feature
----


April 28, 2025

🔐 New: Organizations Support with Role-Based Access

====================================================

Organizations support has been added to WebcrawlerAPI. This feature lets multiple team members use the same API account with different access levels.

### What's new:

*   Organizations are now automatically created for all accounts

*   All existing users have been assigned the "OWNER" role

*   New "DEVELOPER" role with limited access:

    *   Can use API and see usage statistics

    *   Cannot access billing information

    *   Cannot add or manage team members
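The role split above can be pictured as a simple permission table. This is a sketch only: the permission names are hypothetical labels for the capabilities listed, and full access for "OWNER" is an assumption, since the announcement only spells out the "DEVELOPER" restrictions.

```python
# Permission names are hypothetical labels for the capabilities listed above.
ROLE_PERMISSIONS = {
    "OWNER": {"use_api", "view_usage", "view_billing", "manage_members"},  # assumed full access
    "DEVELOPER": {"use_api", "view_usage"},
}


def can(role: str, permission: str) -> bool:
    """Return True if the role grants the permission; unknown roles grant nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(can("DEVELOPER", "use_api"))       # True
print(can("DEVELOPER", "view_billing"))  # False
```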

### How it works:

To add team members, go to your dashboard and click the "Invite member" button. You can assign roles based on what each person needs to do. This lets developers use the API without seeing billing details or changing the team.

----
url: https://webcrawlerapi.com/changelog/2025-06-26-integrately-integration
----


June 26, 2025

WebcrawlerAPI is now available on Integrately

=============================================

You can now automate web scraping workflows with 1200+ apps through Integrately. The integration provides seamless connectivity with popular business tools and platforms.

The Integrately app includes actions for:

*   Scrape a single webpage

*   Start crawling job

*   Get crawling job result

The integration supports all WebcrawlerAPI features including running prompts on scraped content to extract specific information or format the output. You can also specify output formats (markdown, cleaned text, or HTML) and use CSS selectors to clean unwanted elements.

Connect your WebcrawlerAPI account and start building automated workflows with Integrately's powerful automation platform.

[WebcrawlerAPI on Integrately](https://integrately.com/integrations/webcrawler-api)

----
url: https://webcrawlerapi.com/blog/best-prompt-data
----


RAG | Technical | Markdown | 6 min read

The Best Data Format for Your Prompt

====================================

Learn which data format is best for your prompt. Markdown, JSON, CSV, Plain Text, and YAML each have their strengths and weaknesses.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [The Best Data Format for Your Prompt](#the-best-data-format-for-your-prompt)

*   [Quick Comparison](#quick-comparison)

*   [Markdown](#markdown)

*   [JSON](#json)

*   [CSV](#csv)

*   [Plain Text](#plain-text)

*   [YAML](#yaml)

*   [Conclusion](#conclusion)


The Best Data Format for Your Prompt

====================================

The data format you pick depends on what you need. If your prompt needs to be easy for people to read, **[Markdown](/blog/html-vs-markdown-choosing-the-right-output-format)** is the best. For other tasks, like storing detailed data, you might want to use **[JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts)**, **[CSV](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts)**, or **[YAML](/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts)**. Let’s go over Markdown and other formats to see when to use each one.

Quick Comparison

----------------

| Format | Best For | Key Advantages | Limitations |
| --- | --- | --- | --- |
| [Markdown](/blog/html-vs-markdown-choosing-the-right-output-format) | Easy-to-read prompts | Simple to read and write; flexible | Needs special tools for structure |
| [JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts) | Complex, detailed data | Standard for APIs; works with code | Harder for people to read |
| [CSV](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts) | Table-like data | Compact; great for rows and columns | No advanced features |
| [Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts) | Simple and small data | Very easy to write | Can’t handle complex data |
| [YAML](/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts) | Settings and nested data | Easy for people to read; supports comments | Easy to mess up with wrong spaces |

* * *

Markdown

--------

Markdown is the best if your prompt needs to be clear and easy for humans to read. It’s great for mixing instructions and examples. You can even add other formats, like [JSON](/blog/markdown-vs-json-choosing-the-right-format-for-llm-prompts) or [YAML](/blog/markdown-vs-yaml-choosing-the-right-format-for-llm-prompts), inside Markdown for more complex needs.

**Example:**

    # Instruction

    Summarize the text below.

    ## Examples

    - **Input:** The cat sat on the mat.
      **Output:** A cat sat.
    - **Input:** The dog barked at the mailman.
      **Output:** A dog barked.

    ### Extra Data (in JSON)

    ```json
    {
      "articles": [
        {
          "id": 1,
          "title": "AI in Healthcare",
          "content": "AI is transforming healthcare by..."
        },
        {
          "id": 2,
          "title": "Climate Change Impact",
          "content": "The effects of climate change are..."
        }
      ]
    }
    ```

**Why pick Markdown?**

*   It’s easy to read and write.

*   Perfect for mixing instructions and examples.

*   Lets you include other formats if needed.
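If you assemble prompts in code, the Markdown-with-embedded-JSON layout above can be generated rather than written by hand. The helper below is a minimal sketch using only the standard library; the function name `build_prompt` is made up for this example.

```python
import json

FENCE = "`" * 3  # built at runtime so this example nests cleanly inside Markdown


def build_prompt(instruction: str, articles: list[dict]) -> str:
    """Return a Markdown prompt with the articles embedded as a JSON block."""
    data = json.dumps({"articles": articles}, indent=2)
    return "\n".join([
        "# Instruction",
        "",
        instruction,
        "",
        "### Extra Data (in JSON)",
        "",
        f"{FENCE}json",
        data,
        FENCE,
    ])


articles = [
    {"id": 1, "title": "AI in Healthcare", "content": "AI is transforming healthcare by..."},
    {"id": 2, "title": "Climate Change Impact", "content": "The effects of climate change are..."},
]
print(build_prompt("Summarize the text below.", articles))
```

Keeping the instruction in Markdown and the data in JSON means people can skim the prompt while the structured part stays machine-readable.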

* * *

JSON

----

JSON is great when you need to store or share complex and detailed data. It’s often used for APIs and works well with computers.

**Example:**

    {
      "instruction": "Summarize the text below.",
      "data_reference": {
        "articles": [
          {
            "id": 1,
            "title": "AI in Healthcare",
            "content": "AI is transforming healthcare by..."
          },
          {
            "id": 2,
            "title": "Climate Change Impact",
            "content": "The effects of climate change are..."
          }
        ]
      }
    }

**Why pick JSON?**

*   Works for complex and nested data.

*   Computers can read it easily.

*   It’s the standard for many APIs.

* * *

CSV

---

CSV is the go-to for simple table-like data. It’s great for rows and columns but doesn’t handle more complex stuff.

**Example:**

    id,title,content
    1,AI in Healthcare,"AI is transforming healthcare by..."
    2,Climate Change Impact,"The effects of climate change are..."

**Why pick CSV?**

*   It’s small and easy to use.

*   Perfect for data in table format.
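When the rows come from code, Python's standard `csv` module can produce this format and handle quoting for you. This is a small sketch; the `to_csv` helper is a name chosen for this example.

```python
import csv
import io


def to_csv(rows: list[dict]) -> str:
    """Serialize a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


articles = [
    {"id": 1, "title": "AI in Healthcare", "content": "AI is transforming healthcare by..."},
    {"id": 2, "title": "Climate Change Impact", "content": "The effects of climate change are..."},
]
print(to_csv(articles))
```

By default the writer quotes a field only when it needs to, for example when the value contains a comma, so the output stays compact.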

* * *

Plain Text

----------

Plain text is the simplest format. It’s great for small prompts but doesn’t work well for complex data.

**Example:**

    Referenced Articles:

    1. AI in Healthcare: AI is transforming healthcare by...
    2. Climate Change Impact: The effects of climate change are...

**Why pick Plain Text?**

*   Super simple to write.

*   Best for small and easy prompts.

* * *

YAML

----

YAML is easy for humans to read and great for nested data, like settings. However, you need to be careful with spaces since it’s sensitive to indentation.

**Example:**

    instruction: Summarize the text below.
    data_reference:
      articles:
        - id: 1
          title: "AI in Healthcare"
          content: "AI is transforming healthcare by..."
        - id: 2
          title: "Climate Change Impact"
          content: "The effects of climate change are..."

**Why pick YAML?**

*   Very readable for humans.

*   Lets you add comments (unlike JSON).

* * *

Conclusion

----------

Markdown is the best for prompts that need to be easy for humans to read. For extra data, use JSON for complex details, CSV for tables, and YAML for settings. Each format has its strengths, so pick the one that fits your project!

In WebcrawlerAPI you can get website data in [Markdown](/blog/html-vs-markdown-choosing-the-right-output-format) and [simple text](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format). Check out our [API documentation](/docs) to see how you can use these formats in your projects.

----
url: https://webcrawlerapi.com/blog/how-ai-flowchat-uses-webcrawlerapi-to-add-context-to-users-flows
----


Customer story

How AI Flow Chat uses WebCrawlerAPI to add context to users' flows

=================================================================

I recently talked to Alex, founder of AI Flow Chat. Read the customer story about how AI Flow Chat uses WebCrawlerAPI in its user flows.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [AI Flow Chat](#ai-flow-chat)

*   [Real-world example: Content Rewriting Flow](#real-world-example-content-rewriting-flow)

*   [Why use WebcrawlerAPI instead of building their own scraping?](#why-use-webcrawlerapi-instead-of-building-their-own-scraping)


I (Andrew, the founder of WebCrawlerAPI) recently talked with [Alex](https://x.com/qwertyu_alex) (the founder of AI Flow Chat) about what AI Flow Chat is and how they are using WebCrawlerAPI in their flows.

AI Flow Chat

------------

[AI Flow Chat](https://aiflowchat.com/) is an automation tool that helps you chain multiple AI prompts together to create complex workflows.

You can set up these flows using their visual drag-and-drop builder, which makes it really intuitive to use.

Here's an example: let's say you want to create a social media post based on a webpage. Instead of manually doing this every time, you can build a chain of prompts that does it automatically. You just change the input - the webpage URL - and the flow handles the rest.

This is where [WebCrawlerAPI](https://webcrawlerapi.com/) helps. You can simply add it as a component in your flow and use it as a data source. It grabs the webpage content and feeds it into your AI prompts, giving them the context they need to create much better, more relevant content.

Real-world example: Content Rewriting Flow

------------------------------------------

Here's another practical example: [content rewriting workflows](https://aiflowchat.com/s/edb52ac5-093b-45bc-8871-baadb7ee687a). This flow helps you transform existing web content into fresh articles with specific hooks and titles.

The process flows naturally from input to output. You start by providing a webpage URL to any article or content you want to rewrite. WebCrawlerAPI automatically crawls the page and extracts the actual webpage content as source material. Then, multiple AI prompts work together to transform the material with your desired angle, tone, and structure, ultimately delivering a completely rewritten article that maintains the original's key information while presenting it in a fresh way.

This type of automated content transformation is powerful for content creators, marketers, and writers who need to repurpose existing material at scale. Instead of manually reading through articles and rewriting them, you can set up this flow once and then simply feed it different URLs to get professionally rewritten content every time.

Why use WebcrawlerAPI instead of building their own scraping?

-------------------------------------------------------------

Building a reliable web scraper from scratch involves handling JavaScript rendering, anti-bot measures, proxy rotation, and compatibility issues across different sites - quickly becoming a full-time engineering challenge. By integrating WebCrawlerAPI, AI Flow Chat can focus entirely on what they do best: creating intuitive AI automation tools while offering reliable webpage data extraction as a seamless component.

----
url: https://webcrawlerapi.com/changelog
----


WebcrawlerAPI product updates

=============================

Keep track of updates and improvements to our platform

March 9, 2026

### [Subscriptions, Structured Outputs, and Markdown Export](/changelog/2026-03-09-subscriptions-structured-outputs-markdown-export)

This March we shipped three customer-facing improvements across billing and extraction workflows.

Monthly subscriptions and plan management

-----------------------------------------

You can now switch to a monthly subscription directly from the dashboard and manage your plan without contacting support.

This includes:

*   **Subscription plans** with included monthly credits

*   **Self-serve upgrades and downgrades**

*   **Subscription management** for payment method and billing changes

*   **Renewal visibility** with current plan, next billing date, and included credit usage

*   **Cancellation controls** while keeping top-up credits available after subscription end

Structured outputs with JSON Schema

-----------------------------------

Prompt-based scraping on `/v2/scrape` now supports **structured outputs**.

When you send a prompt, you can also provide a `response_schema` so the response follows a JSON Schema-defined structure.

Useful for:

*   **Reliable extraction** of fields like prices, contacts, or metadata

*   **Typed responses** that fit directly into apps, automations, and databases

*   **Nested data structures** including arrays, objects, enums, and optional fields

Combined markdown export for crawl jobs

---------------------------------------

Completed markdown crawl jobs can now be exported as a **single combined markdown file**.

Instead of fetching each page result one by one, you can download one consolidated output for the full job.

Useful for:

*   **RAG and knowledge base workflows**

*   **Content archiving** of full crawl runs

*   **Bulk processing** of complete jobs in one step

January 23, 2026

### [Structured Outputs with JSON Schema](/changelog/2026-01-23-structured-outputs-prompts)

Structured Outputs ensure AI-generated responses adhere to a JSON schema you define, eliminating the need to validate or retry incorrectly formatted responses.

When you provide a prompt to the `/v2/scrape` endpoint, you can now add a `response_schema` parameter to enforce strict JSON Schema validation. The API returns structured data that is guaranteed to match your schema.

Perfect for:

*   **Data Extraction**: Extract product info, business details, or any structured data with type safety

*   **Reliable Parsing**: No more validating or handling malformed JSON responses

*   **Complex Structures**: Support for nested objects, arrays, enums, and optional fields

*   **API Integration**: Feed predictable data directly into your application or database

What's new?

-----------

*   **JSON Schema Validation**: Define exact structure using standard JSON Schema format

*   **Type Safety**: Guaranteed response format with string, number, boolean, object, array, and enum types

*   **Nested Structures**: Support for complex nested objects and arrays of objects

*   **Enum Constraints**: Restrict values to predefined options

*   **Optional Fields**: Use null union types for fields that may not always be present

*   **Clear Error Handling**: Validation errors returned immediately if schema is invalid

Pricing

-------

Structured outputs cost the same as regular prompts: **$0.002 per request** (in addition to the base crawling cost).

See the [Structured Outputs documentation](/docs/structured-outputs) for schema examples and complete usage guide.

January 5, 2026

### [Combined Markdown Export Endpoint](/changelog/2026-01-05-markdown-endpoint)

The new **GET /job/:id/markdown** endpoint returns the full job content as a single concatenated markdown file.

Instead of downloading individual page results, you can now get all crawled markdown content combined into a single file. Each page's content is separated with URL headers for easy parsing.

Perfect for:

*   **RAG Applications**: Feed combined documentation into vector databases or AI models

*   **Batch Processing**: Process entire website content at once for analysis or indexing

*   **Documentation Extraction**: Extract and combine docs from multiple pages

*   **Backup**: Archive complete crawl results in a single file

See the [API documentation](/docs/api/markdown) for usage examples and complete reference.

December 27, 2025

### [Monitor Any Website with RSS/JSON Feeds](/changelog/2025-12-27-website-feeds)

Turn any website into a feed. Monitor websites for changes and get automatic updates via RSS, JSON feeds, or webhooks.

The new [Feeds API](/docs/feeds) lets you track content changes on any website without building custom monitoring infrastructure. Perfect for tracking blogs, news sites, documentation, or any web content that doesn't offer native feeds.

What's new?

-----------

*   **RSS/Atom Feeds**: Subscribe to any website in standard Atom 1.0 format compatible with all feed readers

*   **JSON Feed Format**: Get updates in JSON Feed format for easy integration with applications

*   **Webhook Notifications**: Receive instant POST requests when content changes are detected

*   **Automatic Monitoring**: Periodic crawling with smart change detection

*   **Flexible Configuration**: Control crawl depth, page limits, URL patterns, and output formats (markdown, cleaned, HTML)

*   **Error Resilience**: Automatic pause after 3 consecutive errors to prevent unnecessary charges

*   **Status Tracking**: Detailed metrics on pages crawled, changed, new, unavailable, and errors
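The "pause after 3 consecutive errors" rule can be illustrated with a small counter. This is a sketch of the idea only, not the Feeds API's implementation; the class and method names are hypothetical, and resetting the streak on success is inferred from the word "consecutive".

```python
class FeedErrorTracker:
    """Track consecutive crawl errors and pause after a threshold."""

    def __init__(self, max_consecutive_errors: int = 3) -> None:
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors = 0
        self.paused = False

    def record(self, success: bool) -> None:
        """Record one crawl attempt; a success resets the error streak."""
        if self.paused:
            return  # a paused feed ignores further attempts
        if success:
            self.consecutive_errors = 0
            return
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.max_consecutive_errors:
            self.paused = True


tracker = FeedErrorTracker()
for outcome in [False, False, True, False, False, False]:
    tracker.record(outcome)
print(tracker.paused)  # True: the last three attempts failed in a row
```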

How it works

------------

Create a feed with a single API call, then subscribe to updates via RSS/Atom, JSON Feed, or webhooks. The system automatically crawls your target website periodically and delivers only what changed.

See the [Feeds documentation](/docs/feeds) and [API Reference](/docs/api/feed/feed-create) for complete details.

November 25, 2025

### [Nov '25 Updates](/changelog/2025-11-25-november-updates)

*   **Max depth parameter**: New `max_depth` parameter for [crawling](https://webcrawlerapi.com/docs/api/crawl) to control link following depth

*   **Usage endpoint**: Admin endpoint to retrieve usage statistics and metrics with date range filtering and optional daily breakdown ([documentation](https://webcrawlerapi.com/docs/api/usage))

*   **Microsoft login fix**: Fixed authentication issues with Microsoft/Outlook email accounts

*   **Documentation redesign**: New [documentation design](https://webcrawlerapi.com/docs) with copy-to-markdown feature for easy sharing with AI assistants

*   **PDF parsing improvements**: Enhanced PDF content extraction and parsing

*   **CSV/TSV file parsing**: Added support for parsing CSV and TSV files directly from URLs
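To make the depth limit concrete, here is a sketch of a depth-limited breadth-first crawl over an in-memory link graph. It illustrates the general technique only; how the API itself counts depth is not specified here, so treating the start page as depth 0 is an assumption, and `crawl_order` and the sample `site` graph are made up for this example.

```python
from collections import deque


def crawl_order(links: dict[str, list[str]], start: str, max_depth: int) -> list[str]:
    """Breadth-first traversal that stops following links beyond max_depth.

    Depth 0 is the start page (an assumption for this sketch).
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/docs/api": ["/docs/api/crawl"],
}
print(crawl_order(site, "/", max_depth=1))  # ['/', '/docs', '/blog']
```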

September 25, 2025

### [Sep '25 Updates](/changelog/2025-09-25-september-updates)

*   **Google News feed scraping**: Fixed and now working reliably

*   **New Integration Page**: Added to dashboard showing all available WebCrawlerAPI integrations including code, no-code, and storage options

*   **Webhook status tracking**: Now visible in dashboard with "Resend" button to retry failed webhook deliveries

*   **Infrastructure optimizations**: Enhanced crawling and scraping performance

*   **WWW handling**: Websites with and without www subdomain are now processed correctly

*   **Model upgrade**: Switched to google/gemini-2.5-flash-lite for prompt processing - significantly faster than OpenAI models

*   **Main content extraction**: New parameter to extract only useful text from webpages, perfect for blog posts and articles

*   **Milestone achieved**: WebCrawlerAPI crossed 750K total crawled pages this month

July 27, 2025

### [MCP Integration Available](/changelog/2025-07-27-mcp-integration)

July 16, 2025

### [n8n Integration Available](/changelog/2025-07-16-n8n-integration)

July 3, 2025

### [Make Integration Available](/changelog/2025-07-03-make-integration)

WebcrawlerAPI is now available on Make (formerly Integromat).

You can now automate web scraping workflows with other apps through Make. The integration uses the Scrape API v2 endpoint to scrape single webpages.

For now, the Make app includes a single action, **Scrape a single webpage**. It supports all v2 features including running prompts on scraped content to extract specific information or format the output. You can also specify output formats (markdown, cleaned text, or HTML) and use CSS selectors to clean unwanted elements.

To use WebcrawlerAPI in Make, search for "WebcrawlerAPI" in the app store.

[WebcrawlerAPI on Make](https://www.make.com/en/integrations/webcrawlerapi)

Read our documentation on how to [integrate WebcrawlerAPI in Make](/docs/sdk/make).

June 26, 2025

### [WebcrawlerAPI is now available on Integrately](/changelog/2025-06-26-integrately-integration)

You can now automate web scraping workflows with 1200+ apps through Integrately. The integration provides seamless connectivity with popular business tools and platforms.

The Integrately app includes actions for:

*   Scrape a single webpage

*   Start crawling job

*   Get crawling job result

The integration supports all WebcrawlerAPI features including running prompts on scraped content to extract specific information or format the output. You can also specify output formats (markdown, cleaned text, or HTML) and use CSS selectors to clean unwanted elements.

Connect your WebcrawlerAPI account and start building automated workflows with Integrately's powerful automation platform.

[WebcrawlerAPI on Integrately](https://integrately.com/integrations/webcrawler-api)

June 23, 2025

### [Zapier Integration Available](/changelog/2025-06-23-zapier-integration)

WebcrawlerAPI is now available on Zapier.

You can now automate web scraping workflows with other apps through Zapier. The integration uses the Scrape API v2 endpoint to scrape single webpages.

The Zapier app includes actions for:

*   Scrape a single webpage

The integration supports all v2 features including running prompts on scraped content to extract specific information or format the output. You can also specify output formats (markdown, cleaned text, or HTML) and use CSS selectors to clean unwanted elements.

Read [How to integrate no-code webcrawler in Zapier](/docs/sdk/zapier)

Connect your WebcrawlerAPI account and start building automated workflows.

[WebcrawlerAPI on Zapier](https://zapier.com/developer/public-invite/206901/b015f18eaa55e0545d9219f2942e94d1/)

June 7, 2025

### [Scrape API v2 Released](/changelog/2025-06-07-scrape-v2)

Scrape API v2 is now live.

The new version is simpler and more straightforward. It lets you run a prompt on the page and get results in markdown, cleaned text, or HTML. Scraping is now synchronous, with a single API call.

The new endpoint is at https://api.webcrawlerapi.com/v2/scrape. See the [API Reference](/docs/api/scrape) for details.

What's new?

-----------

*   Scraping is now sync, with a single API call.

*   You can remove parts of the page using CSS selectors.

*   You can get results in markdown, cleaned text, or HTML.

*   You can run a prompt on the page.

*   Error handling is improved.

May 19, 2025

### [Postman Collection is here](/changelog/2025-05-19-postman-collection)

May 7, 2025

### [📦 New: S3 Compatible Storage Integration](/changelog/2025-05-07-s3-r2-integration)

WebcrawlerAPI now supports direct export to any S3 compatible storage.

### What's new:

*   Export crawl results directly to Amazon S3 buckets (or any S3 compatible storage: Cloudflare R2, DigitalOcean Spaces, Wasabi, Backblaze B2, etc.)

*   Simple setup with API keys and bucket information

*   We don't store your keys after the job ends

### How it works:

When starting a job via the API, add a few parameters such as `access_key_id` and `secret_access_key`, among others. Crawled data will be placed under the specified path. Your keys will be deleted after the job ends. Read the [Upload to S3](/docs/actions/s3-upload) docs for detailed information.

April 28, 2025

### [🔐 New: Organizations Support with Role-Based Access](/changelog/2025-04-28-organizations-feature)

Organizations support has been added to WebcrawlerAPI. This feature lets multiple team members use the same API account with different access levels.

### What's new:

*   Organizations are now automatically created for all accounts

*   All existing users have been assigned the "OWNER" role

*   New "DEVELOPER" role with limited access:

    *   Can use API and see usage statistics

    *   Cannot access billing information

    *   Cannot add or manage team members

### How it works:

To add team members, go to your dashboard and click the "Invite member" button. You can assign roles based on what each person needs to do. This lets developers use the API without seeing billing details or changing the team.

April 13, 2025

### [🦜🔗 Introducing WebcrawlerAPI LangChain Integration 🤖](/changelog/2025-04-13-langchain-integration)

We're thrilled to announce the release of our official LangChain integration! The new webcrawlerapi-langchain package makes it seamless to incorporate WebcrawlerAPI's powerful web crawling capabilities into your LangChain document processing pipelines.

### Key Features:

*   🚀 Simple integration with LangChain's document loaders

*   📄 Multiple content formats (markdown, cleaned text, HTML)

*   ⚡️ Async and lazy loading support

*   🔄 Built-in retry mechanisms, proxies and error handling

*   🎯 Configurable URL filtering with regex patterns

### Quick Start:

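    pip install webcrawlerapi-langchain

Then create a loader in Python:

    from webcrawlerapi_langchain import WebCrawlerAPILoader

    # Configure the loader with the target URL, your API key, and the output format
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key",
        scrape_type="markdown"
    )

    # Load the crawled content as LangChain documents
    documents = loader.load()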

### Perfect for:

*   Building AI-powered knowledge bases

*   Creating document QA systems

*   Training custom language models

*   Processing web content for LLM applications

### Need an integration example?

Check our [WebcrawlerAPI examples](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/langchain-basic)

Check out our [LangChain SDK documentation](/docs/sdk/langchain) for detailed usage instructions and examples. Start building powerful AI applications with web data today!

April 7, 2025

### [✨ New: $10 Trial Balance for WebcrawlerAPI 💫](/changelog/2025-04-07-webcrawler-trial)

We're excited to announce that all new WebcrawlerAPI accounts now receive a $10 evaluation balance for a 7-day trial period! This initiative allows new users to thoroughly test our API capabilities without any upfront commitment.

### What's included:

*   $10 trial funds automatically added to new accounts

*   Complete API access during 7-day evaluation period

*   Start immediately with no credit card required

*   Full access to all standard API features

The new trial balance makes it easier than ever to evaluate WebcrawlerAPI and test its capabilities for your projects.

April 6, 2025

### [Additional dashboard improvements](/changelog/2025-04-06-dashboard-improvements)

*   Pagination for jobs and job items

*   Download button now shows progress and file size

*   Graphs are now more interactive

March 31, 2025

### [Major Dashboard Improvements](/changelog/2025-03-31-dashboard-improvements)

Major dashboard improvements 💫

*   Enhanced login with email form:

    *   Implemented rate limiting for magic link emails

    *   Improved user experience and security

*   Dashboard page enhancements:

    *   Added time period toggles (24h, 7d, 15d, 30d)

    *   Implemented total counter for each period

    *   Enhanced graphs for funds spent and crawled pages

*   New dedicated billing page:

    *   Comprehensive payment history

    *   Detailed payment usage tracking for all time

March 28, 2025

### [Integrated Proxy Management System](/changelog/2025-03-28-proxy-management)

Major Update 🚀

*   Integrated proxy management system:

    *   All proxies are now handled internally

    *   Included in the standard pricing

    *   Significantly improved success rates

    *   Enhanced protection against anti-bot measures

    *   **No additional setup required from users**

March 24, 2025

### [LLMStxt Generator Tool Launch](/changelog/2025-03-24-llmstxt-generator)

Launched free [llmstxt Generator Tool](/tools/llmstxt-generator) that helps create standardized llms.txt files for documenting AI models in your projects. You can learn more about the llms.txt standard in our [detailed guide](/blog/what-is-llm).

March 21, 2025

### [Comprehensive Error Handling System](/changelog/2025-03-21-error-handling)

Major WebcrawlerAPI update: Comprehensive [error handling](/docs/errors) system implementation

*   Added two-level error handling system: job level and job item level errors

*   New job level error codes:

    *   `insufficient_balance` for balance-related issues

    *   `invalid_request` for malformed requests

    *   `internal_error` for system-level issues

*   New job item level error codes:

    *   `host_returned_error` for non-200 HTTP responses

    *   `website_access_denied` for 403 responses

    *   `name_not_resolved` for DNS resolution failures

    *   `internal_error` for system-level issues

*   Each error now includes detailed error messages and specific error codes for better debugging
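Client code can branch on these codes. The lookup below is a minimal sketch: the codes and their meanings are taken from the list above, while the `describe_error` helper and the `"job"`/`"item"` level labels are hypothetical, and the shape of the actual API error payload is not shown here.

```python
JOB_ERRORS = {
    "insufficient_balance": "balance-related issue",
    "invalid_request": "malformed request",
    "internal_error": "system-level issue",
}
ITEM_ERRORS = {
    "host_returned_error": "non-200 HTTP response",
    "website_access_denied": "403 response",
    "name_not_resolved": "DNS resolution failure",
    "internal_error": "system-level issue",
}


def describe_error(code: str, level: str) -> str:
    """Return a short description for an error code at the given level ('job' or 'item')."""
    table = JOB_ERRORS if level == "job" else ITEM_ERRORS
    return table.get(code, "unknown error code")


print(describe_error("website_access_denied", "item"))  # 403 response
```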

March 14, 2025

### [Headless Browser Improvements](/changelog/2025-03-14-headless-browser-improvements)

Major improvements to our headless browser implementation for enhanced web scraping capabilities:

*   Improved anti-bot protection bypass mechanisms

*   Enhanced blocking of non-essential content:

    *   Advertisement content filtering

    *   Cookie consent banner removal

    *   Other non-page-content elements blocking

*   These updates result in cleaner data extraction and improved scraping reliability

March 6, 2025

### [Monitoring Server Incident Resolution](/changelog/2025-03-06-monitoring-server-incident)

The issue lasted for 9 hours but was not related to crawling. The root cause was a network issue affecting the monitoring server. Because the monitoring server was unavailable to the main job manager, each job report had to wait several minutes for a timeout response from the monitoring server.

As a result, the processing time for each job increased, and the job queue grew to several thousand jobs.

The incident has now been resolved. We are continuously working on improving our monitoring system to prevent similar issues in the future.

March 3, 2025

### [Status Page Link Added](/changelog/2025-03-03-status-page-link)

A status page link has been added to the website footer. The current status of WebCrawlerAPI services can now be checked at [status.webcrawlerapi.com](https://status.webcrawlerapi.com/status/main).

February 22, 2025

### [Changelog Page Added](/changelog/2025-02-22-changelog-page)

A [changelog page](https://webcrawlerapi.com/changelog) has been added to the website. This page tracks all the changes, improvements, and fixes to WebCrawlerAPI.

February 19, 2025

### [Webpage to Markdown Tool Launch](/changelog/2025-02-19-webpage-to-markdown)

A new tool [Webpage to Markdown](https://webcrawlerapi.com/tools/website-to-md) has been added. This tool converts any documentation or website into a beautiful Markdown file. It is free and does not require an API key. It can crawl up to 100 pages.

February 18, 2025

### [PDF Content Rendering Implementation](/changelog/2025-02-18-pdf-content-rendering)

PDF content rendering has been implemented. Text content can now be extracted from PDF files. When a website contains a PDF file, its content will be extracted and returned in the response as page content.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-err-name-not-resolved
----

ERR\_NAME\_NOT\_RESOLVED means DNS lookup failed for the URL host.

Verify baseURL, environment variables, and network access from the machine running tests.

    import { test } from '@playwright/test';

    test.use({ baseURL: 'https://staging.example.com' });

    test('home loads', async ({ page }) => {

      await page.goto('/');

    });

In CI, confirm the host is reachable from the runner network, not only locally.
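Independently of Playwright, a quick resolver check run on the CI runner can show whether the failure is a DNS problem at all. This is a small sketch in Python using only the standard library; `host_resolves` is a hypothetical helper name, and the hostname you pass in should be the host from your own baseURL.

```python
import socket


def host_resolves(hostname):
    """Return True if the system resolver can find an address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved (the same class of failure
        # the browser reports as ERR_NAME_NOT_RESOLVED).
        return False
```

Calling `host_resolves("staging.example.com")` on the runner and on a local machine makes it easy to see whether the two networks resolve the host differently.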

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-net-err-invalid-auth-credentials
----

net::ERR\_INVALID\_AUTH\_CREDENTIALS means HTTP auth credentials are missing or incorrect.

Set valid httpCredentials when creating the browser context.

    const context = await browser.newContext({

      httpCredentials: {

        username: process.env.BASIC_AUTH_USER,

        password: process.env.BASIC_AUTH_PASS,

      },

    });

    const page = await context.newPage();

    await page.goto('https://staging.example.com');

Confirm credentials match the environment (staging vs production) and are not empty.
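Empty or unset environment variables are a common cause, so it can help to fail early before the browser starts. A minimal sketch in Python, assuming the same variable names as the example above; `basic_auth_header` is a hypothetical helper, and it also shows the Basic auth header value those credentials produce.

```python
import base64
import os


def basic_auth_header(user_var="BASIC_AUTH_USER", pass_var="BASIC_AUTH_PASS"):
    """Build a Basic Authorization header value from environment variables.

    Raises ValueError when either variable is unset or empty.
    """
    username = os.environ.get(user_var, "")
    password = os.environ.get(pass_var, "")
    if not username or not password:
        raise ValueError(f"{user_var} and {pass_var} must both be set and non-empty")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```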

----
url: https://webcrawlerapi.com/blog/how-to-build-a-web-crawler-with-scrapy-in-python
----

Python · Tutorial · 7 min read

How to build a web crawler with Scrapy in Python

================================================

Scrapy is a powerful tool for crawling and scraping websites. In this tutorial, you will learn how to build a crawler using this framework, render JavaScript, and save the content of the website page by page.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Basic crawler with Python](#basic-crawler-with-python)

*   [Scrapy crawl delay](#scrapy-crawl-delay)

*   [Render Javascript in Scrapy](#render-javascript-in-scrapy)

*   [Final Scrapy crawler script](#final-scrapy-crawler-script)

*   [Summary](#summary)

Basic crawler with Python

-------------------------

To start crawling, first install [Scrapy](https://scrapy.org/):

    pip install scrapy

Then, create a basic script:

    import scrapy

    import os

    import hashlib

    class PageSaverSpider(scrapy.Spider):

        name = "page_saver"

        start_urls = [

            'https://books.toscrape.com/index.html',

        ]

        def parse(self, response):

            # Extracting the URL to use as a filename

            url = response.url

            # Using hashlib to create a unique hash for the URL to ensure filename is filesystem safe

            url_hash = hashlib.md5(url.encode()).hexdigest()

            # Create a safe filename from the URL

            safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')

            # Truncate only the readable part so the name stays within filesystem limits and keeps its hash

            filename = f'{safe_url[:200]}_{url_hash}.html'

            # Creating a directory to save files if it doesn't exist

            os.makedirs('saved_pages', exist_ok=True)

            file_path = os.path.join('saved_pages', filename)

            # Writing the response body to the file

            with open(file_path, 'wb') as f:

                f.write(response.body)

            self.log(f'Saved file {file_path}')

            # Following links to the next page

            next_pages = response.css('a::attr(href)').getall()

            for next_page in next_pages:

                next_page_url = response.urljoin(next_page)

                yield scrapy.Request(next_page_url, callback=self.parse)

This scraper script does the following:

*   Downloads the content of the initial page

*   Saves the content to a file in the saved\_pages directory; the filename is derived from the URL

*   Finds all <a href=""></a> elements and queues the extracted links for crawling

This basic Python Scrapy script helps to save the content of the website pages to files.
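The filename logic inside `parse` can be pulled out into a standalone function, which makes it easy to check on its own. This is a sketch rather than part of the original script: the name `page_filename` and the `max_length` parameter are assumptions, and it truncates only the readable part of the name so that the hash and the `.html` extension always survive.

```python
import hashlib


def page_filename(url, max_length=250):
    """Build a filesystem-safe, unique filename for a crawled URL."""
    url_hash = hashlib.md5(url.encode()).hexdigest()  # 32 hex characters
    safe_url = url.replace("://", "_").replace("/", "_").replace(":", "_")
    suffix = f"_{url_hash}.html"
    # Truncate only the readable part so the hash and extension are never cut off.
    return safe_url[: max_length - len(suffix)] + suffix
```

Because the hash is computed from the full URL, two very long URLs that share the same first characters still get different filenames.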

### Scrapy crawl delay

If you run the script above against a website, Scrapy requests pages as fast as it can, with no pause between requests. This can create unwanted load, which may lead to downtime or blocks. Website owners (unless that is you, of course) may respond by making crawler bots' lives harder with bot protection, CAPTCHAs, IP bans, and so on.

To be respectful of the websites you crawl, set a delay (in seconds) between requests using the DOWNLOAD\_DELAY custom setting:

       custom_settings = {

           'DOWNLOAD_DELAY': 1,

       }
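Note that with Scrapy's default `RANDOMIZE_DOWNLOAD_DELAY = True`, the wait is not exactly `DOWNLOAD_DELAY`: Scrapy picks a random value between 0.5 and 1.5 times the configured delay. The plain-Python sketch below only illustrates that arithmetic; `randomized_delay` is a hypothetical helper, not a Scrapy function.

```python
import random
from typing import Optional


def randomized_delay(download_delay: float, rng: Optional[random.Random] = None) -> float:
    """Pick a wait time in [0.5 * download_delay, 1.5 * download_delay]."""
    rng = rng or random.Random()
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)
```

With `DOWNLOAD_DELAY` set to 1, each wait therefore falls somewhere between 0.5 and 1.5 seconds, which makes the request pattern look less mechanical.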

### Render Javascript in Scrapy

By default, Scrapy doesn't render JavaScript. This is a significant limitation since, nowadays, more and more websites are using JS to render webpages.

Fortunately, this is possible with [Splash](https://splash.readthedocs.io/), a lightweight JavaScript rendering service.

Run Splash using Docker first:

    docker pull scrapinghub/splash

    docker run -p 8050:8050 scrapinghub/splash

Next, install the scrapy-splash package:

    pip install scrapy-splash

Add these settings to your scraper:

    # Enable splash middleware

    DOWNLOADER_MIDDLEWARES = {

        'scrapy_splash.SplashCookiesMiddleware': 723,

        'scrapy_splash.SplashMiddleware': 725,

        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,

    }

    # Enable splash deduplicate filter

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # Enable splash HTTP cache

    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

    # Splash server URL

    SPLASH_URL = 'http://localhost:8050'
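Before wiring Splash into Scrapy, it can be useful to confirm that the container renders pages at all. Splash exposes an HTTP endpoint, `render.html`, that takes the page address in a `url` parameter and an optional `wait` time in seconds. The sketch below only builds such a URL; `splash_render_url` is a hypothetical helper name, and the default Splash address is assumed to match the `SPLASH_URL` setting above.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=0.5, splash_url="http://localhost:8050"):
    """Build a URL for Splash's render.html endpoint for the given page."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the rendered HTML of the page if Splash is running.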

### Final Scrapy crawler script

    import hashlib
    import os

    import scrapy
    from scrapy_splash import SplashRequest


    class PageSaverSpider(scrapy.Spider):
        name = "page_saver"
        start_urls = [
            'https://books.toscrape.com/index.html',
        ]

        custom_settings = {
            'DOWNLOAD_DELAY': 1,  # Adding a delay of 1 second between requests
            'BOT_NAME': 'myproject',
            'DOWNLOADER_MIDDLEWARES': {
                'scrapy_splash.SplashCookiesMiddleware': 723,
                'scrapy_splash.SplashMiddleware': 725,
                'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            },
            'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
            'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
            'SPLASH_URL': 'http://localhost:8050',
        }

        def start_requests(self):
            # Send the initial requests through Splash so JavaScript is rendered;
            # plain scrapy.Request objects would bypass Splash entirely
            for url in self.start_urls:
                yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})

        def parse(self, response):
            # Extracting the URL to use as a filename
            url = response.url

            # Using hashlib to create a unique hash for the URL to ensure filename is filesystem safe
            url_hash = hashlib.md5(url.encode()).hexdigest()

            # Create a safe filename from the URL
            safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')

            # Truncate only the readable part so the name stays within filesystem limits and keeps its hash
            filename = f'{safe_url[:200]}_{url_hash}.html'

            # Creating a directory to save files if it doesn't exist
            os.makedirs('saved_pages', exist_ok=True)
            file_path = os.path.join('saved_pages', filename)

            # Writing the response body to the file
            with open(file_path, 'wb') as f:
                f.write(response.body)
            self.log(f'Saved file {file_path}')

            # Following links to the next page, again through Splash
            next_pages = response.css('a::attr(href)').getall()
            for next_page in next_pages:
                next_page_url = response.urljoin(next_page)
                yield SplashRequest(next_page_url, callback=self.parse, args={'wait': 0.5})

Summary

-------

Scrapy is a powerful Python crawling and scraping framework. It works great if you need to crawl or scrape a website. However, to use it you have to be familiar with the Python programming language and manage the infrastructure yourself.

If you don't have time for that and simply want to make an HTTP call and get the data, try [WebCrawler API](https://webcrawlerapi.com/), which handles all of this for you. Note that it is a paid service.

----
url: https://webcrawlerapi.com/glossary/puppeteer/why-are-headers-not-updated-when-using-page-setrequestinterception-true-on-firefox
----

On Firefox, changes made to the headers object obtained from request.headers() are currently not reflected in the headers read back from the response's request. In other words, copying the headers from request.headers(), modifying the copy, and then calling request.continue({ headers }) does not guarantee those changes will be visible in response.request().headers(). Chrome has a related fix (PR 14341) to align behavior, but Firefox behaves differently. A reliable pattern is to track the modified headers on your side and map them to the corresponding request, rather than reading them back from the response. This avoids relying on response.request().headers() to reflect modifications.

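    // TypeScript example; request interception must be enabled before req.continue() can be used

    await page.setRequestInterception(true);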

    const modifiedHeaders = new WeakMap<object, Record<string, string>>();

    page.on("request", req => {

      const headers = { ...(req.headers() || {}), testung: 'abceded' };

      console.log(`[request] ${JSON.stringify(headers)}`);

      modifiedHeaders.set(req, headers);

      req.continue({ headers });

    });

    page.on('response', resp => {

      const req = resp.request();

      const seen = modifiedHeaders.get(req) || {};

      console.log(`[response] tracked headers: ${JSON.stringify(seen)}`);

    });

What you can rely on is that request.headers() represents the headers before interception, and response.headers() reflects what the server actually sent back. If you need to correlate mutations, maintain your own mapping keyed by the intercepted request.

----
url: https://webcrawlerapi.com/glossary/puppeteer/how-to-use-emulation-setuseragentoverride-instead-of-network-interception
----

### Solution

Use the Emulation.setUserAgentOverride command via a CDP session to override the user agent instead of relying on network interception.

    // Puppeteer example using a CDP session

    const client = await page.target().createCDPSession();

    await client.send('Emulation.setUserAgentOverride', {

      userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',

      acceptLanguage: 'en-US,en;q=0.9'

    });

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-frame-was-detached
----

Frame was detached means the iframe was removed or reloaded while you were using it.

Re-acquire the frame right before acting on it and wait until it is attached.

    await page.waitForSelector('iframe[name="checkout"]');

    const frame = page.frame({ name: 'checkout' });

    if (!frame)

      throw new Error('Checkout frame is not available');

    await frame.getByLabel('Card number').fill('4242 4242 4242 4242');

If the app frequently remounts iframes, wait for the triggering network/UI event first.

----
url: https://webcrawlerapi.com/blog/html-vs-cleaned-text-choosing-the-right-output-format
----

Comparison · HTML · RAG

HTML vs Cleaned Text: Choosing the Right Output Format

======================================================

HTML vs cleaned text for web crawling and RAG: what is preserved, what is lost, and which output format is safer for real pipelines.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What HTML is good at](#what-html-is-good-at)

*   [What cleaned text is good at](#what-cleaned-text-is-good-at)

*   [Use cases for crawling and RAG ingestion](#use-cases-for-crawling-and-rag-ingestion)

*   [When HTML should be used](#when-html-should-be-used)

*   [When cleaned text should be used](#when-cleaned-text-should-be-used)

*   [Practical tradeoffs (what tends to break)](#practical-tradeoffs-what-tends-to-break)

*   [Link-heavy pages](#link-heavy-pages)

*   [Layout-heavy pages](#layout-heavy-pages)

*   [Node.js snippet: Strip HTML tags into rough cleaned text](#nodejs-snippet-strip-html-tags-into-rough-cleaned-text)

*   [Conclusion](#conclusion)

HTML and cleaned text sit at opposite ends of the output spectrum. HTML keeps almost everything (including markup). Cleaned text keeps only readable text (and usually drops most structure).

If Markdown is being considered too, [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format) and [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format) should be read.

Quick comparison

----------------

| Topic | HTML | Cleaned Text |
| --- | --- | --- |
| Best for | Fidelity and re-processing later | RAG, embeddings, fast reading |
| Keeps links | Yes (as <a href> etc.) | Usually no (or links are flattened) |
| Keeps structure | Yes (DOM) | Limited |
| Size | Larger | Smaller |
| Common failure | Noise: scripts, nav, repeated UI | Context loss: lists, tables, link targets |

What HTML is good at

--------------------

HTML is usually preferred when:

*   Maximum fidelity is needed

*   The page must be re-parsed later with different rules

*   Link targets, attributes, and DOM structure matter

Typical crawling cases:

*   Product pages where microdata or attributes are needed

*   Pages where selectors will be applied later

*   Audits where evidence must be preserved

If extracted fields are the goal, a structured format should be used after parsing, as covered in [Best Prompt Data](/blog/best-prompt-data).

What cleaned text is good at

----------------------------

Cleaned text is usually preferred when:

*   The content will be embedded for RAG

*   Token cost should be reduced

*   Navigation and boilerplate should be removed

Cleaned text vs Markdown is compared in [Cleaned Text vs Markdown](/blog/cleaned-text-vs-markdown-choosing-the-right-output-format).

Use cases for crawling and RAG ingestion

----------------------------------------

### When HTML should be used

HTML is usually the safer choice when:

*   Re-processing is expected (parsing rules will change)

*   Link URLs must be preserved exactly

*   Tables and lists must be reconstructed later

A practical downside is that HTML often includes a lot of noise. Boilerplate must be removed in a second step.

### When cleaned text should be used

Cleaned text is usually the safer choice when:

*   The primary goal is retrieval over the readable content

*   Chunking will be done without relying on DOM structure

*   Storage and token costs must be kept down

A practical downside is that important structure can be lost, especially:

*   Tables (column meaning is lost)

*   Lists (nesting can be flattened)

*   Links (anchor text remains but target URLs can be dropped)

If structure must be preserved for readability, Markdown can be considered in [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format).

Practical tradeoffs (what tends to break)

-----------------------------------------

### Link-heavy pages

If a page is mostly a set of links (directories, documentation sidebars), cleaned text can become hard to use because the URL targets are lost. HTML keeps that.
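One middle ground is to keep each link's target next to its anchor text while still dropping the markup. The following is a rough sketch using only Python's standard library `html.parser`; the class and function names are illustrative assumptions, and real pages will need more careful handling (relative URLs, nested markup, boilerplate removal).

```python
from html.parser import HTMLParser


class LinkKeepingTextExtractor(HTMLParser):
    """Collect readable text, appending each link's target after its anchor text."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self._hrefs = []  # stack of href values for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a" and self._hrefs:
            href = self._hrefs.pop()
            if href:
                self.parts.append(f"({href})")

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text_with_links(html):
    """Return cleaned text in which every link is rendered as 'anchor text (url)'."""
    parser = LinkKeepingTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The output is still plain text, so it stays cheap to embed, but a directory or sidebar page no longer loses the URLs that make it useful.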

### Layout-heavy pages

If a page is mostly layout (menus, cards, footers), HTML can be too noisy. Cleaned text usually performs better for RAG, because the noise is removed.

Node.js snippet: Strip HTML tags into rough cleaned text

--------------------------------------------------------

This is intentionally rough. It is only suitable as a fallback or a quick test.

    // Node 18+

    // Rough HTML to text conversion without external deps.

    import { readFile } from "node:fs/promises";

    const html = await readFile("page.html", "utf8");

    // Remove script/style blocks

    let text = html

      .replace(/<script[\s\S]*?<\/script>/gi, "")

      .replace(/<style[\s\S]*?<\/style>/gi, "");

    // Replace tags with spaces, then normalize whitespace

    text = text.replace(/<[^>]+>/g, " ");

    text = text.replace(/\s+/g, " ").trim();

    console.log(text.slice(0, 600));

Conclusion

----------

*   HTML is usually selected when fidelity and re-processing matter.

*   Cleaned text is usually selected when RAG and readable content are the goal.

*   A common pattern is: HTML is stored for traceability, and cleaned text is produced for embeddings.

If a single best default is being sought for RAG, [HTML vs Cleaned Text vs Markdown](/blog/html-vs-cleaned-text-vs-markdown-which-should-be-used-for-rag) can be used as the tie-breaker.

----
url: https://webcrawlerapi.com/blog/extract-xpath-golang
----

Technical · Tutorial · 6 min read

How to extract XPath in Golang

==============================

XPath is a powerful tool for selecting nodes in an XML document. In this article, we will show you how to extract XPath in Golang.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [How it works in Go](#how-it-works-in-go)

*   [Extracting multiple Xpath elements in Golang](#extracting-multiple-xpath-elements-in-golang)

Sometimes you need to extract data points from HTML using XPath (if you are new to it, read [what XPath is](/blog/what-is-xpath/) first). Start by installing the [htmlquery](https://github.com/antchfx/htmlquery) package:

    go get github.com/antchfx/htmlquery

Here is the code; an explanation follows:

    package main

    import (

    	"bytes"

    	"github.com/antchfx/htmlquery"

    	"strings"

    	"fmt"

    )

    // ExtractXPath function that accepts a single XPath expression and returns a single string

    func ExtractXPath(htmlStr string, xpathExpr string) (string, error) {

    	// Load the HTML document

    	var buffer bytes.Buffer

    	buffer.WriteString(htmlStr)

    	doc, err := htmlquery.Parse(&buffer)

    	if err != nil {

    		return "", err

    	}

    	// Find the nodes matching the XPath expression

    	nodes := htmlquery.Find(doc, xpathExpr)

    	var content []string

    	// Iterate over the nodes and extract the content

    	for _, node := range nodes {

    		content = append(content, htmlquery.InnerText(node))

    	}

    	// Join the extracted content if multiple nodes were found

    	result := strings.Join(content, " ")

    	return result, nil

    }

    func main() {

    	htmlStr := `

    		<html>

    			<head>

    				<title>Test Page</title>

    			</head>

    			<body>

    				<div class="content">

    					<p>Hello, World!</p>

    					<p>This is a test.</p>

    				</div>

    			</body>

    		</html>`

    	xpathExpr := "//div[@class='content']/p"

    	content, err := ExtractXPath(htmlStr, xpathExpr)

    	if err != nil {

    		fmt.Println("Error:", err)

    	} else {

    		fmt.Println("Extracted content:", content)

    	}

    }

You will receive the output:

    Extracted content: Hello, World! This is a test.

How it works in Go

------------------

Luckily, there is an open-source lib [htmlquery](https://github.com/antchfx/htmlquery) for that. Install it first:

    go get github.com/antchfx/htmlquery

Then, do a basic query against the document:

    nodes, err := htmlquery.QueryAll(doc, "//a")

    if err != nil {

    	panic(`not a valid XPath expression.`)

    }

See more examples in the htmlquery documentation.

Extracting multiple Xpath elements in Golang

--------------------------------------------

    package extract

    import (

    	"bytes"

    	"github.com/antchfx/htmlquery"

    	"strings"

    )

    type Rules = map[string]string

    type Content = map[string]string

    func XPath(htmlStr string, filter Rules) (Content, error) {

    	// Load the HTML document

    	var buffer bytes.Buffer

    	buffer.WriteString(htmlStr)

    	doc, err := htmlquery.Parse(&buffer)

    	if err != nil {

    		return nil, err

    	}

    	result := make(Content)

    	// Iterate over the filter to apply each XPath expression

    	for key, xpathExpr := range filter {

    		// Find the nodes matching the XPath expression

    		nodes := htmlquery.Find(doc, xpathExpr)

    		var content []string

    		// Iterate over the nodes and extract the content

    		for _, node := range nodes {

    			content = append(content, htmlquery.InnerText(node))

    		}

    		// Join the extracted content if multiple nodes were found

    		result[key] = strings.Join(content, " ")

    	}

    	return result, nil

    }

Extracting multiple XPath elements requires slightly more complicated code. First, define two map types: one for the extraction rules and one for the result. Each rule has its own key, which is reused as the key in the result map after extraction.

Then, iterate over filter rules and find elements for each rule. Extract it and put it into the resulting map under a certain key.

Here is the usage example:

    filter := Rules{
    	"Title":          "//title/text()",
    	"Header":         "//h1/text()",
    	"link_more_info": "//a[contains(text(),'More information')]/@href",
    	"link_fb":        "//a[contains(text(),'Another link fb')]/@href",
    }

    // html holds the HTML source of the page as a string
    content, err := XPath(html, filter)
    if err != nil {
    	fmt.Printf("Error: %s\n", err)
    }
    fmt.Printf("Extracted content: %v\n", content)

The result map will be:

    map[

    	Header:Example Domain

    	Title:Example Domain

    	link_fb:https://fb.com/test

    	link_more_info:https://www.iana.org/domains/example

    ]

----
url: https://webcrawlerapi.com/changelog/2025-03-24-llmstxt-generator
----

March 24, 2025

LLMStxt Generator Tool Launch

=============================

Launched a free [llmstxt Generator Tool](/tools/llmstxt-generator) that helps create standardized llms.txt files for documenting AI models in your projects. You can learn more about the llms.txt standard in our [detailed guide](/blog/what-is-llm).

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-strict-mode-violation-locator
----

Strict mode means a locator used for an action must match exactly one element.

Refine the locator with role/name/test id, or intentionally choose one element.

    // Better: unique locator

    await page.getByRole('button', { name: 'Save changes' }).click();

    // If needed: pick one explicitly

    await page.locator('.save-button').first().click();

Prefer unique semantic locators over .first() so tests stay stable.

----
url: https://webcrawlerapi.com/glossary/puppeteer/what-fix-do-not-wait-for-all-targets-when-connecting
----

This fix stops Puppeteer from waiting for all targets when connecting: child targets are now awaited only for tab targets.

*   When connecting, only tab targets are expected to have a child target attached, avoiding waiting for all targets.

*   iframe subtargets may not be fully initialized on connect, but this is generally fine.

----
url: https://webcrawlerapi.com/blog/top-5-best-firecrawl-alternatives
----

Comparison · API · 10 min read

Top 6 best Firecrawl alternatives

=================================

Explore six web scraping tools that serve as alternatives to Firecrawl, each offering unique features for diverse data extraction needs.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick Comparison](#quick-comparison)

*   [1\. WebCrawlerAPI](#1-webcrawlerapi)

*   [Key Features](#key-features)

*   [Pricing](#pricing)

*   [What Stands Out](#what-stands-out)

*   [Potential Drawbacks](#potential-drawbacks)

*   [2\. Spider](#2-spider)

*   [Key Features](#key-features)

*   [Pricing](#pricing)

*   [Pros and Cons](#pros-and-cons)

*   [3\. Skrape.ai](#3-skrapeai)

*   [Key Features](#key-features)

*   [Pricing](#pricing)

*   [Advantages and Disadvantages](#advantages-and-disadvantages)

*   [4\. LLM-Scraper](#4-llm-scraper)

*   [Key Features](#key-features)

*   [Pricing](#pricing)

*   [Pros and Cons](#pros-and-cons)

*   [6\. Crawlee](#6-crawlee)

*   [Features That Stand Out](#features-that-stand-out)

*   [Costs to Consider](#costs-to-consider)

*   [Example of How It Works](#example-of-how-it-works)

*   [Challenges to Keep in Mind](#challenges-to-keep-in-mind)

*   [7\. GPT-Crawler](#7-gpt-crawler)

*   [Key Features](#key-features)

*   [Pricing](#pricing)

*   [Advantages and Disadvantages](#advantages-and-disadvantages)

*   [Pros and Cons](#pros-and-cons)

*   [Conclusion](#conclusion)

**Looking for [Firecrawl](https://www.firecrawl.dev/) alternatives?** Here are six [web scraping tools](https://webcrawlerapi.com/scrapers) to consider, each with unique strengths and capabilities:

*   [**WebCrawlerAPI**](https://webcrawlerapi.com/): Best for AI and LLM use cases; supports multiple SDKs; low pay-as-you-go pricing; $10 trial credit.

*   [**Spider**](https://spider.cloud/): Open-source crawler for AI and LLM use cases; supports multiple AI framework integrations.

*   [**Skrape.ai**](https://skrape.ai/): Cloud-based, [AI-powered crawling](https://webcrawlerapi.com/scrapers/webcrawler/html); suitable for complex websites but costly.

*   [**LLM-Scraper**](https://github.com/mishushakov/llm-scraper): Open-source, designed for LLM integration; free but demands self-hosting.

*   [**Crawlee**](https://crawlee.dev/): Open-source, scalable, and versatile; great for developers with technical skills.

*   [**GPT-Crawler**](https://github.com/BuilderIO/gpt-crawler): Combines AI with [web crawling](https://webcrawlerapi.com/blog/what-is-a-web-crawling); open-source and ideal for advanced data workflows.

Quick Comparison

----------------

| API | Pricing | Pros | Cons |
| --- | --- | --- | --- |
| WebCrawlerAPI | Pay-per-usage, $2 per 1k requests | Scalable, multi-SDK, various output formats, easy integration | No AI framework integrations, lack of customization |
| Spider | Depends on your needs | Open source, scalable, multiple integrations | Poor documentation, complicated pricing |
| Skrape.ai | Subscription: $15-$250 | AI-driven, multi-format | Expensive for large-scale use |
| LLM-Scraper | Free | LLM integration, Python-based | Complex setup, self-hosting |
| Crawlee | Free | Anti-blocking, dual crawling | Resource-heavy, setup complexity |
| GPT-Crawler | Free | AI integration, customizable | Requires technical knowledge |

Each tool serves different needs. For AI-focused tasks, **WebCrawlerAPI** or **GPT-Crawler** are great choices. If you're looking for free, customizable options, try **Crawlee**. For managed services, **Skrape.ai** offers convenience but at a higher cost. Choose based on your budget, technical skills, and project requirements.

1\. [WebCrawlerAPI](https://webcrawlerapi.com/)

-----------------------------------------------

WebCrawlerAPI is a SaaS platform designed to simplify data extraction for AI and large language models (LLMs). It's built with a distributed system architecture to handle the demands of AI workflows, including training and analysis.

### Key Features

*   Get content from every page of a website with a single seed URL

*   Outputs optimized for AI workflows in **HTML**, **text**, and **Markdown** formats.

*   Handles complex JavaScript-heavy websites with advanced parsing capabilities.

*   Offers multi-language SDKs for **JavaScript/TypeScript**, **Python**, **PHP**, and **.NET**.

*   $10 try-out credit

Here's a basic example of [integrating WebCrawlerAPI](https://webcrawlerapi.com/docs/getting-started) using Node.js:

    const { WebCrawlerAPI } = require("webcrawlerapi");

    const api = new WebCrawlerAPI("YOUR_API_KEY");

    api

      .crawl("https://example.com")

      .then((data) => console.log(data))

      .catch((error) => console.error(error));

### Pricing

WebCrawlerAPI offers a pay-as-you-go model at just **$2 per 1k** pages, with a generous $10 trial credit to get started.
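As a rough illustration of the pricing above, the sketch below estimates a bill at $2 per 1,000 pages and subtracts the $10 trial credit. The helper name is hypothetical and not part of any SDK, and it assumes the credit behaves like a simple prepaid balance.

```javascript
// Hypothetical helper, not part of any SDK: estimates spend from the
// prices quoted in this article ($2 per 1,000 pages, $10 trial credit).
const PRICE_PER_1K_PAGES = 2;
const TRIAL_CREDIT = 10;

function estimateCost(pages, credit = TRIAL_CREDIT) {
  if (!Number.isFinite(pages) || pages < 0) {
    throw new RangeError("pages must be a non-negative number");
  }
  const gross = (pages / 1000) * PRICE_PER_1K_PAGES; // cost before credit
  const due = Math.max(0, gross - credit);           // never below zero
  return { gross, due };
}

console.log(estimateCost(5000));  // { gross: 10, due: 0 }
console.log(estimateCost(20000)); // { gross: 40, due: 30 }
```

Under these assumptions the trial credit covers roughly the first 5,000 pages.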

### What Stands Out

*   Low price

*   A strong SDK ecosystem for developers.

*   Scalable infrastructure suitable for enterprise needs.

*   Built-in solutions for anti-bot challenges.

*   Extra scrapers, like [Google Search Result Scraper](https://webcrawlerapi.com/scrapers/webcrawler/google-search-result/description), [AI Scraper](https://webcrawlerapi.com/scrapers/webcrawler/ai/description), or [Webpage Metadata Scraper](https://webcrawlerapi.com/scrapers/webcrawlerapi/webpage-metadata/description)

### Potential Drawbacks

*   No popular [RAG](https://webcrawlerapi.com/blog/what-is-rag) integrations

*   No sitemap crawling feature

WebCrawlerAPI is particularly well-suited for businesses focused on AI and machine learning. It is also a good fit for small businesses that don't want to spend effort on crawling and just want the full content of a website.

2\. [Spider](https://spider.cloud/)

-----------------------------------

Spider is a [web crawler](https://spider.cloud/docs/api#crawl-website) with a broad feature set, including scraping, [question answering](https://spider.cloud/docs/api#questions-and-answers) based on page content, and [lead extraction](https://spider.cloud/docs/api#extract-contacts-website). It works for businesses of any size.

### Key Features

*   **Performance**: Spider is written in Rust and runs fully concurrently, so it can crawl thousands of pages in seconds.

*   **Customization**: Write scripts if you need custom scraping.

### Pricing

Pricing is complicated and depends on your needs: you pay per GB of storage, per request using JS or Chrome, per endpoint, and so on.

### Pros and Cons

**Pros:**

*   Various output formats

*   [Screenshots](https://spider.cloud/docs/api#screenshot-website)

*   High performance (100k pages per second). Good if you want to DDoS a website.

*   Open Source

**Cons:**

*   Adds overhead if you only need a simple solution

*   No integrations with RAGs

*   Complicated pricing

*   Open source, but built with Rust, which is not that widely used.

Spider is a solid solution if you need high customization. It is also based on the open-source [spider-rs](https://github.com/spider-rs/spider) framework, so you can run it yourself. However, you need to be familiar with Rust if you want to dive deeper.

3\. [Skrape.ai](https://skrape.ai/)

-----------------------------------

Skrape.ai is a cloud-based platform designed for web crawling and data extraction. Using AI, it simplifies pulling data from even the most complex websites, making it a go-to tool for businesses in industries like e-commerce and digital analytics.

### Key Features

*   **AI-Powered Extraction**: Schema-based data extraction

*   **Cloud Infrastructure**: Scales easily to handle varying workloads without manual intervention.

*   **Multi-Format Support**: Exports data in formats like JSON and Markdown

*   **Actions**: Click buttons, scroll, and wait for content

### Pricing

A robust solution that extracts data ready to use in RAG, LLMs, and AI. Pricing is subscription-based and might not fit small businesses with ad-hoc demand. Plans range from $15 to $250, at a cost of $5 per 1k pages.

### Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| AI-driven schema-based data extraction | Can be costly for large-scale projects |
| Cloud-based scalability removes infrastructure headaches | No SDK, no AI framework integration |
| Supports multiple data formats for convenience | Poor documentation |
| Actions such as clicking buttons, scrolling, and waiting for content | Small trial plan (20 requests only) |

Skrape.ai shines when dealing with complex web applications, especially in fields like e-commerce, market research, and digital marketing. Its cloud-based setup removes the hassle of managing your own infrastructure while delivering dependable data extraction. However, it offers no integrations, poor documentation and high pricing with a small trial tier.

For those looking for a more customizable, hands-on solution, Crawlee might be a better fit, offering greater control over web crawling setups.

4\. [LLM-Scraper](https://github.com/mishushakov/llm-scraper)

-------------------------------------------------------------

LLM-Scraper blends traditional web scraping with AI-powered data processing, offering an open-source tool tailored for integrating large language models (LLMs) into data workflows. Unlike commercial tools, it focuses on meeting the demand for smooth LLM integration in data extraction tasks.

### Key Features

*   **Direct LLM Integration**: Works seamlessly with large language models to enable advanced AI-driven data processing.

*   **Flexible Python Framework**: Open-source and highly customizable, making it easy to integrate with data science tools.

*   **Handles Dynamic Content**: Efficiently processes both static pages and those rendered with JavaScript.

*   **Active Community Support**: Regular updates and contributions via its GitHub repository, with a focus on LLM-related improvements.

### Pricing

LLM-Scraper is free to use as an open-source tool. However, users should budget for related costs, such as:

| Cost Category | Description |
| --- | --- |
| LLM Usage | Charges for external API calls (e.g., OpenAI services). |
| Infrastructure | Costs for hosting on self-managed servers. |
| Maintenance | Resources needed for updates and technical fixes. |
| Development | Expenses for adding custom features. |

### Pros and Cons

| Pros | Cons |
| --- | --- |
| Free and open-source | Requires self-managed hosting and upkeep. |
| Fully customizable for AI workflows | Limited official support. |
| Smooth LLM integration | Steeper learning curve for beginners. |
| Backed by an active community | Can be resource-intensive to set up. |

LLM-Scraper is ideal for research and development environments where customization and LLM integration are key. Its design caters to data scientists and AI researchers who need a tool tailored to language model workflows. However, for organizations looking for a ready-to-use solution, the setup and maintenance demands might be a hurdle.

For those seeking a more streamlined and scalable option, **Crawlee** provides a strong alternative with its commercial-grade features.

6\. [Crawlee](https://crawlee.dev/)

-----------------------------------

Crawlee is an open-source tool designed for web scraping and browser automation. With 15.4K GitHub stars, it's widely recognized and works seamlessly in both Node.js and Python environments, catering to a variety of development needs.

### Features That Stand Out

*   Combines HTTP and browser crawling for versatility

*   Manages resources automatically with smart concurrency

*   Incorporates browser fingerprints and proxy rotation to avoid detection

*   Offers flexible storage options

*   Compatible with tools like Cheerio, Beautiful Soup, Puppeteer, and Playwright

*   Backed by an active GitHub community

*   Built for scalability with anti-blocking mechanisms

### Costs to Consider

| Cost Type | Details |
| --- | --- |
| Server Costs | Hosting and maintenance expenses |
| Proxy Services | Optional for handling large-scale tasks |
| Development Time | Time invested in setup and customization |
| Browser Resources | Costs related to headless browser usage |

### Example of How It Works

Here's a simple implementation example using Crawlee:

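    const { CheerioCrawler, Dataset } = require("crawlee");

    // CheerioCrawler fetches pages over plain HTTP and parses them with Cheerio.
    const crawler = new CheerioCrawler({
      maxRequestsPerCrawl: 20, // safety limit for this example
      async requestHandler({ request, $, enqueueLinks }) {
        // Results are saved to ./storage on the local filesystem by default.
        await Dataset.pushData({ url: request.loadedUrl, title: $("title").text() });
        // Follow links found on the page (same hostname by default).
        await enqueueLinks();
      },
    });

    crawler.run(["https://example.com"]).then(() => {
      console.log("Crawling completed");
    });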

### Challenges to Keep in Mind

*   Requires technical knowledge to get started

*   Initial setup can be complex

*   Limited official support available

*   Additional tools may be necessary for JavaScript rendering

Crawlee is particularly well-suited for large-scale projects thanks to its built-in concurrency management and unified interface. These features make it a strong contender among open-source tools. While Crawlee focuses on scalability and flexibility, GPT-Crawler takes a different approach by integrating AI for more advanced data extraction tasks.

7\. [GPT-Crawler](https://github.com/BuilderIO/gpt-crawler)

-----------------------------------------------------------

GPT-Crawler, developed by [BuilderIO](https://www.builder.io/), is an open-source tool that combines standard web crawling techniques with AI-driven data extraction. Tailored for workflows involving large language models (LLMs), it offers a cutting-edge solution for collecting and processing web data.

### Key Features

*   AI-driven data extraction designed for LLM workflows

*   Reliable URL queuing to ensure uninterrupted crawling

*   Headless browser support to handle dynamic content

*   Flexible deployment options with various storage and parsing configurations

*   Anti-blocking features, including proxy rotation

### Pricing

| Component | Cost |
| --- | --- |
| Core Software | Free (Open-source) |
| Infrastructure | Self-hosted costs |
| Proxy Services | Optional third-party expenses |
| Storage | Depends on chosen solution |

### Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Strong AI integration for data workflows | Requires advanced technical skills |
| Highly customizable for different needs | Initial setup can be complex |
| Active support from the GitHub community | Limited official support |
| Free and open-source | |

GPT-Crawler shines in situations where standard crawling methods fall short, especially when paired with AI-based data extraction. Its ability to integrate seamlessly with modern AI tools makes it a valuable resource for developers building advanced data pipelines. However, it does demand a solid technical foundation to set up and use effectively, making it best suited for teams with the necessary expertise.

Pros and Cons

-------------

Here's a breakdown of the strengths and limitations of the top web crawler APIs in 2025:

| API | Key Advantages | Limitations |
| --- | --- | --- |
| WebCrawlerAPI | • Optimized for AI/LLM • Supports multiple SDKs • Markdown, text, HTML output • $10 starting credit to try it out • Easy integration | • No RAG integrations |
| Spider | • Handles large datasets efficiently • Open Source | • Lacks advanced AI features • Documentation is basic |
| Skrape.ai | • High-end scraping capabilities • Built on modern frameworks | • Expensive • Limited options for customization |
| LLM-Scraper | • Free and open-source • Focused on AI integration • Deployable in various environments | • Setup is complex • Limited user support |
| Crawlee | • Scales well for large tasks • Strong anti-blocking measures • Supports dual crawling modes | • High resource consumption • Configuration can be challenging |
| GPT-Crawler | • AI-powered data extraction • Backed by an active community • Free to use | • Requires technical knowledge • May incur infrastructure costs |

The best choice depends on your project's requirements, available resources, and technical expertise. Open-source options like Firecrawl, LLM-Scraper, and GPT-Crawler offer great flexibility but demand more technical know-how. On the other hand, managed services like Skrape.ai or WebCrawlerAPI simplify deployment and provide infrastructure support, though at a higher cost.

For more complex workflows, Crawlee shines with its scalability and anti-blocking features, though it requires a skilled team to manage its setup [\[2\]](https://blog.apify.com/top-11-open-source-web-crawlers-and-one-powerful-web-scraper/). WebCrawlerAPI is particularly suited for AI-related tasks, thanks to its optimization for LLM workflows and support for multiple formats [\[1\]](https://oxylabs.io/blog/best-web-crawlers). These tools highlight the growing role of AI in data extraction, offering varied solutions for developers and businesses.

Ultimately, your decision should align with your project's goals, technical capacity, and budget. Open-source tools are ideal for teams with strong technical expertise, while SaaS solutions are better for those seeking faster deployment and ease of use.

Conclusion

----------

Whether you're replacing Firecrawl or just exploring new options, it's clear that each of these web scraping tools brings something unique to the table. For those seeking cost-effective, AI-friendly data extraction, WebCrawlerAPI is particularly compelling with its developer-focused SDKs, multi-format outputs, and pay-per-use pricing. That combination of flexibility and affordability makes it a standout choice if you need robust crawling without overhauling your budget or setup.

Of course, every project has its own priorities. LLM-Scraper and GPT-Crawler shine for AI-based workflows—especially when you're comfortable self-hosting—while DataFuel caters to large, enterprise-level data ops. If ease of use and a managed setup are key, Skrape.ai may fit better, though it comes at a premium. Lastly, Crawlee straddles the line between open-source freedom and enterprise-scale performance, requiring a bit more technical prowess to manage effectively.

Ultimately, the best choice depends on your project's size, budget, and complexity. By weighing scalability, integration needs, and total cost of ownership, you can select the most suitable Firecrawl alternative—whether that's the feature-rich WebCrawlerAPI or another platform ready to power your next data-driven venture.

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-playwright-strict-mode-violation
----


A strict mode violation means your locator matches multiple elements, but the action needs exactly one.

Narrow the locator with role/name filters or scoping. Use .first() only when the element order is stable.

    const dialog = page.getByRole('dialog', { name: 'Delete project' });

    await dialog.getByRole('button', { name: 'Delete' }).click();

Avoid broad selectors like .btn for click actions in larger pages.

----
url: https://webcrawlerapi.com/glossary/scraping/how-do-you-clean-and-validate-scraped-data
----


### Answer

Clean scraped data by trimming whitespace, normalizing formats, and removing duplicates. Validate fields with schemas to catch missing or invalid values. Use type conversions for dates, numbers, and currencies. Track extraction errors and store raw inputs for debugging. Good cleaning and validation improves reliability downstream.
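The steps in this answer can be sketched in a few lines of JavaScript. This is a minimal illustration, not a complete pipeline: the record shape (title, price, date), the function names, and the US-style price format are all assumptions made for the example.

```javascript
// Hypothetical record shape: { title, price, date }, all raw strings.
function cleanRecord(raw) {
  const errors = [];

  // Trim and collapse internal whitespace.
  const title = String(raw.title ?? "").replace(/\s+/g, " ").trim();
  if (!title) errors.push("title is missing");

  // Strip currency symbols and thousands separators: "$1,299.00" -> 1299.
  // Assumes "." is the decimal separator.
  const priceText = String(raw.price ?? "").replace(/[^0-9.-]/g, "");
  const price = priceText === "" ? NaN : Number(priceText);
  if (!Number.isFinite(price)) errors.push("price is not a number");

  // Normalize dates to YYYY-MM-DD, or null when unparseable.
  const time = Date.parse(raw.date);
  const date = Number.isNaN(time) ? null : new Date(time).toISOString().slice(0, 10);
  if (date === null) errors.push("date is invalid");

  return { record: { title, price, date }, errors };
}

function cleanAll(rawRecords) {
  const seen = new Set();
  const valid = [];
  const rejected = []; // keep raw input next to its errors for debugging
  for (const raw of rawRecords) {
    const { record, errors } = cleanRecord(raw);
    if (errors.length > 0) {
      rejected.push({ raw, errors });
      continue;
    }
    const key = `${record.title.toLowerCase()}|${record.price}`;
    if (seen.has(key)) continue; // drop duplicates
    seen.add(key);
    valid.push(record);
  }
  return { valid, rejected };
}

const result = cleanAll([
  { title: "  Blue   Mug ", price: "$1,299.00", date: "2025-04-07" },
  { title: "blue mug", price: "1299", date: "2025-04-07" },
  { title: "", price: "n/a", date: "2025-04-07" },
]);
console.log(result.valid.length, result.rejected.length);
```

Keeping the rejected records together with their raw input makes it possible to replay extraction after a fix instead of re-crawling.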

----
url: https://webcrawlerapi.com/blog/csv-vs-plain-text-choosing-the-right-format-for-llm-prompts
----


Comparison, CSV, RAG

CSV vs Plain Text: Choosing the Right Format for LLM Prompts

============================================================

CSV vs plain text for scraped outputs and prompt data: when a dataset is needed, when narrative text is enough, and what to avoid.

Written by Andrew

Published on Feb 1, 2026

### Table of Contents

*   [Quick comparison](#quick-comparison)

*   [What CSV is good at](#what-csv-is-good-at)

*   [What plain text is good at](#what-plain-text-is-good-at)

*   [Use cases in web crawling, scraping, and RAG](#use-cases-in-web-crawling-scraping-and-rag)

*   [When CSV should be used](#when-csv-should-be-used)

*   [When plain text should be used](#when-plain-text-should-be-used)

*   [Practical tradeoffs](#practical-tradeoffs)

*   [CSV is a poor container for long content](#csv-is-a-poor-container-for-long-content)

*   [Plain text does not provide a schema](#plain-text-does-not-provide-a-schema)

*   [Node.js snippet: Turn extracted lines into a simple CSV](#nodejs-snippet-turn-extracted-lines-into-a-simple-csv)

*   [Conclusion](#conclusion)

CSV and plain text are easy to confuse because both look "simple". The difference is that CSV implies a dataset with a schema (columns). Plain text implies that the content is the product.

A broader overview of formats is provided in [Best Prompt Data](/blog/best-prompt-data).

Quick comparison

----------------

| Topic | CSV | Plain Text |
| --- | --- | --- |
| Best for | Flat tabular datasets | Raw page content and simple outputs |
| Parsing reliability | High (with correct quoting) | Low |
| Human editing | High (spreadsheets) | High |
| RAG fit | Not great as-is | Good for embeddings and chunking |
| Common failure | Broken quoting with real-world text | Ambiguity and missing fields |

What CSV is good at

-------------------

CSV is usually selected when:

*   One row per page/product is needed

*   A stable set of columns exists

*   Export to spreadsheet tools is important

If nested structures are needed, JSON is often preferred, as covered in [JSON vs CSV](/blog/json-vs-csv-choosing-the-right-format-for-llm-prompts).

What plain text is good at

--------------------------

Plain text is usually selected when:

*   The main value is the content itself

*   Embeddings and retrieval are planned

*   Formatting noise should be minimized

If light structure is helpful, Markdown can be compared in [Markdown vs Plain Text](/blog/markdown-vs-plain-text-choosing-the-right-format-for-llm-prompts).

Use cases in web crawling, scraping, and RAG

--------------------------------------------

### When CSV should be used

CSV is usually preferred when:

*   A list, directory, or catalog is being extracted

*   Data will be filtered, sorted, and joined

*   Audits are being done in spreadsheet tools

### When plain text should be used

Plain text is usually preferred when:

*   Page content is being indexed for RAG

*   Summaries are being generated without strict fields

*   The pipeline is text-first and extraction is optional

If the output starts as HTML, conversion choices are covered in [HTML vs Cleaned Text](/blog/html-vs-cleaned-text-choosing-the-right-output-format) and [HTML vs Markdown](/blog/html-vs-markdown-choosing-the-right-output-format).

Practical tradeoffs

-------------------

### CSV is a poor container for long content

Long descriptions often contain commas, quotes, and newlines. That can be handled, but it must be enforced. If the primary goal is content, plain text is usually simpler.
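A minimal sketch of what "enforced" can look like: every field goes through one quoting function before it is written. The helper names are made up for this example; the quoting rule itself (wrap in double quotes when the field contains a comma, quote, or line break, and double any embedded quotes) follows RFC 4180.

```javascript
// Quote a field only when it contains a comma, a double quote, or a
// line break; embedded double quotes are doubled.
function csvField(value) {
  const s = String(value ?? "");
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Join already-quoted fields into one CSV row.
function csvRow(values) {
  return values.map(csvField).join(",");
}

// A long description with a comma, quotes, and a newline stays one field.
console.log(csvRow(["Mug", 'A "sturdy" mug, dishwasher-safe.\nShips in 2 days.']));
```

The point of routing everything through a single function is that no caller can forget to quote. Hand-built rows with string concatenation are where broken CSV usually comes from.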

### Plain text does not provide a schema

If a dataset is expected, plain text will require a second pass to extract fields. That can work, but the complexity is just shifted.

Node.js snippet: Turn extracted lines into a simple CSV

-------------------------------------------------------

This example turns "key: value" lines into a CSV with two columns.

    // Node 18+

    // Convert simple "key: value" lines into CSV.

    import { readFile } from "node:fs/promises";

    const text = await readFile("pairs.txt", "utf8");

    const rows = [];

    for (const line of text.split("\n")) {

      const trimmed = line.trim();

      if (!trimmed) continue;

      const idx = trimmed.indexOf(":");

      if (idx === -1) continue;

      const key = trimmed.slice(0, idx).trim();

      const value = trimmed.slice(idx + 1).trim();

      rows.push({ key, value });

    }

    const out = ["key,value"];

    for (const r of rows) {

      const k = `"${r.key.replaceAll('"', '""')}"`;

      const v = `"${r.value.replaceAll('"', '""')}"`;

      out.push(`${k},${v}`);

    }

    console.log(out.join("\n"));

Conclusion

----------

*   CSV is usually selected for flat datasets with stable columns.

*   Plain text is usually selected for content-first outputs and RAG ingestion.

*   If both are needed, a common approach is: plain text is stored for content, CSV is generated only for specific exports.

If a readable structured document is preferred over plain text, [Markdown vs CSV](/blog/markdown-vs-csv-choosing-the-right-format-for-llm-prompts) can be compared next.

----
url: https://webcrawlerapi.com/docs/sdk/zapier
----

Zapier WebcrawlerAPI integration

================================

How to get webpage content for LLM training using Zapier and WebCrawlerAPI.

Zapier is a powerful workflow automation tool that allows you to connect various services and automate tasks. You can use Zapier to integrate WebCrawlerAPI for crawling websites and extracting data, which can then be used for training large language models (LLMs) or other purposes.

The simplest way to use WebcrawlerAPI in Zapier is to use the WebcrawlerAPI action.

Follow the link to have access to the WebcrawlerAPI integration: [Zapier WebcrawlerAPI](https://zapier.com/developer/public-invite/206901/b015f18eaa55e0545d9219f2942e94d1/)

To add the WebcrawlerAPI action to your Zapier workflow, follow these steps:

1.  Create a new Zapier workflow.

2.  Add a new step and select the WebcrawlerAPI action.

3.  Create a new WebcrawlerAPI credential account in Zapier.

Get your API key from the [WebcrawlerAPI dashboard](https://dash.webcrawlerapi.com/access)

Your action setup step should look like this:

4.  Go to the "Configure" tab and fill in the URL and prompt (optional) fields.

5.  Test the action

WebcrawlerAPI action is ready to use!

If you have any questions, please contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com)

----
url: https://webcrawlerapi.com/changelog/2025-05-19-postman-collection
----


May 19, 2025

Postman Collection is here

==========================

----
url: https://webcrawlerapi.com/blog/mozilla-readability-algorithm-readabilityjs
----


JS, Technical

Mozilla Readability Algorithm (Readability.js) explanation

==========================================================

A simple, step-by-step breakdown of the Mozilla Readability.js algorithm: how it scores the DOM and extracts the main article content.

Written by Andrew

Published on Feb 7, 2026

### Table of Contents

*   [The most important heuristics (in plain English)](#the-most-important-heuristics-in-plain-english)

*   [High-level flow inside Readability.parse()](#high-level-flow-inside-readabilityparse)

*   [Core idea: score blocks, then pick a container](#core-idea-score-blocks-then-pick-a-container)

*   [How Readability.js thinks about “main content”](#how-readabilityjs-thinks-about-main-content)

*   [Post-processing: fixing URLs and simplifying the output](#post-processing-fixing-urls-and-simplifying-the-output)

*   [Why the algorithm fails (and what it usually means)](#why-the-algorithm-fails-and-what-it-usually-means)

*   [Final output: what Readability.parse() returns](#final-output-what-readabilityparse-returns)

*   [WebCrawlerAPI main\_content\_only (real life shortcut)](#webcrawlerapi-main_content_only-real-life-shortcut)

Hi, I’m Andrew. I’m a software engineer and a scraping browser expert. I have been using the Mozilla Readability library in [WebcrawlerAPI](https://webcrawlerapi.com/) for a long time now. In this post, I will share how the algorithm actually works. Readability.js looks simple from the outside: give it HTML, get back an article. Inside it is a bunch of small heuristics that try to survive real websites.

If you want a practical implementation tutorial first, read: [Extracting article or blogpost content with Mozilla Readability](/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs). If you want the Rust alternative version, read: [How dom\_smoothie Rust Mozilla Readability alternative works](/blog/how-dom-smoothie-rust-mozilla-readability-alternative-works).

The most important heuristics (in plain English)

------------------------------------------------

This is what makes the algorithm work most of the time:

1.  “Unlikely candidates” removal. Sidebars, ads, and share widgets often have obvious class/id names.

2.  Class/id weighting. Positive names get a bonus. Negative names get a penalty.

3.  Link density. Link-heavy blocks are usually navigation, not content.

4.  Text length and punctuation. Longer text with commas looks like real writing.

5.  Parent score propagation. Articles live inside containers.

6.  Sibling merging. Many sites split articles into multiple blocks.

7.  Conditional cleanup. Remove stuff like forms, embeds, and weird lists, but only when it looks like junk.
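To make heuristic 2 concrete, here is a small sketch of class/id weighting. The two patterns are shortened for illustration (the library's own lists are longer), and the step of 25 points per match is an assumption based on the library source, so treat the exact numbers as approximate.

```javascript
// Simplified patterns for illustration; the real lists are longer.
const POSITIVE = /article|body|content|entry|main|post|text|blog|story/i;
const NEGATIVE = /banner|comment|footer|promo|related|share|sidebar|sponsor|widget/i;

// `el` is any object with string `className` and `id` properties,
// for example a DOM element.
function classWeight(el) {
  let weight = 0;
  for (const name of [el.className, el.id]) {
    if (typeof name !== "string" || name === "") continue;
    if (NEGATIVE.test(name)) weight -= 25; // looks like page chrome
    if (POSITIVE.test(name)) weight += 25; // looks like article content
  }
  return weight;
}

console.log(classWeight({ className: "post-content", id: "" }));   // 25
console.log(classWeight({ className: "sidebar", id: "related" })); // -50
```

A name can match both lists (for example "article-footer"), in which case the bonus and the penalty cancel out.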

High-level flow inside Readability.parse()

------------------------------------------

Readability.parse() is a pipeline. It does a few passes over the DOM, and each pass has one job.

High level steps:

1.  Preprocess the document (remove obvious noise and normalize the DOM).

2.  Extract metadata (title, byline, excerpt, site name, publish time).

3.  Find the article container with \_grabArticle() (this is the core algorithm).

4.  Clean the chosen container with \_prepArticle() (remove junk, fix images, simplify).

5.  Post-process output (fix URLs, remove nested wrappers, strip classes if needed).

6.  Return the final article object (HTML + text + metadata).

Real life note: it can retry steps 3-4 with less aggressive cleanup if the result is too short. This is why it works on many weird pages, but it is also why it is not “free” CPU-wise.
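The retry behaviour can be sketched as a loop that relaxes one rule at a time. All names here are hypothetical: `attempt` stands in for steps 3-4 and returns the extracted text for a given set of flags, and the 500-character threshold is assumed to match the library default.

```javascript
// Hypothetical flag names; each one switches on a cleanup rule.
const ALL_FLAGS = ["stripUnlikelys", "weightClasses", "cleanConditionally"];

// `attempt(flags)` is a stand-in for "grab the article and clean it".
function extractWithRetries(attempt, charThreshold = 500) {
  const flags = new Set(ALL_FLAGS);
  let best = "";
  for (;;) {
    const text = attempt(flags);
    if (text.length >= charThreshold) return text; // long enough, done
    if (text.length > best.length) best = text;    // remember longest try
    // Relax one rule per retry, in a fixed order.
    const next = ALL_FLAGS.find((flag) => flags.has(flag));
    if (!next) return best; // nothing left to relax
    flags.delete(next);
  }
}
```

In the worst case this runs the expensive part four times (all rules on, then three relaxations), which is where the extra CPU cost comes from.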

Core idea: score blocks, then pick a container

----------------------------------------------

Readability works like this:

1.  Find many small blocks (mostly p, pre, td, headings, and some div after normalization).

2.  Give each block a score based on how “article-like” its text is.

3.  Add that score to parent containers.

4.  Pick the best container (the “top candidate”).

This is important. In real life, the article is not a single paragraph. It is usually a div or article that contains many paragraphs. So Readability scores the leaf blocks, but the winner is often a parent node.

The scoring rules are simple (and very practical):

*   Very short text is ignored.

*   More commas usually means more sentences (good signal).

*   Longer text gets a small bonus, but it is capped.

*   Link-heavy blocks get punished.

Here is a small scoring example, close to what Readability does:

    // Node 18+

    export function scoreText(text) {

      const t = (text || "").trim();

      if (t.length < 25) return 0;

      let score = 1;

      score += (t.match(/,/g) || []).length;

      score += Math.min(3, Math.floor(t.length / 100));

      return score;

    }

Then Readability pushes that score up the tree. Parent gets more, grandparents get less. This is how it finds the container that “owns” most of the real text.
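A rough sketch of that propagation, using plain objects instead of DOM nodes. The dividers (full score for the parent, half for the grandparent, then level times three) and the depth limit of 5 are assumptions based on the library source. The real code also gives each ancestor a starting score from its tag name and class weight, which is left out here.

```javascript
// Nodes are plain objects: { name, parent, score }. With a real DOM,
// `parent` would be `parentNode`; this shape is only for the sketch.
function propagateScore(node, score, maxDepth = 5) {
  let ancestor = node.parent;
  for (let level = 0; ancestor && level < maxDepth; level++) {
    // Parent: full score. Grandparent: half. Further up: score / (level * 3).
    const divider = level === 0 ? 1 : level === 1 ? 2 : level * 3;
    ancestor.score = (ancestor.score || 0) + score / divider;
    ancestor = ancestor.parent;
  }
}

const body = { name: "body", parent: null };
const article = { name: "article", parent: body };
const div = { name: "div", parent: article };
const p = { name: "p", parent: div };

propagateScore(p, 6);
console.log(div.score, article.score, body.score); // 6 3 1
```

Because every paragraph adds to the same ancestors, the container that holds the most well-scored paragraphs ends up with the highest total.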

How Readability.js thinks about “main content”

----------------------------------------------

When people say “main content”, they usually mean: the part you would copy-paste into a note. Not the header. Not the menu. Not “related posts”. Just the article.

Readability.js does not use one magic selector. In real life every website is different, so it uses a few simple signals together:

1.  Text blocks with real sentences get points.

2.  Blocks with too many links lose points (navigation is mostly links).

3.  Scores are pushed to parent containers (because the real article is usually a wrapper).

4.  It removes “unlikely” blocks early, but it can retry with softer rules if it gets a short result.

The key tradeoff is simple: if cleanup is too aggressive, you lose real text. If cleanup is too soft, you keep junk.

Here is a tiny example of one important signal: link density.

    // Node 18+

    // Rough idea: menus have high link density, articles usually don't.

    export function linkDensity(el) {

      const text = (el.textContent || "").trim();

      if (!text) return 0;

      let linkTextLen = 0;

      for (const a of el.querySelectorAll("a")) {

        linkTextLen += (a.textContent || "").trim().length;

      }

      return linkTextLen / text.length;

    }
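
To show how link density feeds into the final choice, here is a small sketch. As far as I can tell, Readability scales each candidate's score by (1 - link density) before comparing candidates. The candidate objects (`name`, `score`, `linkDensity`) are hypothetical and only there for illustration.

```javascript
// Sketch only: candidates are plain objects, not real DOM nodes.
function pickTopCandidate(candidates) {
  let best = null;
  for (const candidate of candidates) {
    // A link-heavy block keeps only a small share of its raw score.
    const adjusted = candidate.score * (1 - candidate.linkDensity);
    if (!best || adjusted > best.adjusted) {
      best = { ...candidate, adjusted };
    }
  }
  return best; // null when there are no candidates
}
```

A menu with a high raw score but 90% link text loses to an article block with a lower raw score and few links.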

Preflight: isProbablyReaderable (fast check before full parse)

--------------------------------------------------------------

Before running the full algorithm, Readability has a quick check: isProbablyReaderable(). It answers one boring but important question: “Is this page even worth parsing?”

In real life you crawl a lot of pages that are not articles:

*   home pages

*   category pages

*   search pages

*   login pages

Full parsing is heavier (DOM walk + scoring + cleanup), so this preflight tries to save time. It looks for a few candidate nodes (p, pre, article, and some div patterns), ignores hidden/unlikely blocks, and adds up a score based on text length. If the total score is high enough, the page is "probably readerable".
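
A minimal sketch of that preflight idea, working on pre-collected blocks instead of a real DOM. The block shape (`textLength`, `hidden`, `unlikely`) is hypothetical. The defaults (140 characters minimum, score threshold 20) and the square-root scoring are what I believe the library uses, so verify them against your version.

```javascript
// Sketch only: each block describes one candidate node (p, pre, article, ...).
function probablyReaderable(blocks, { minContentLength = 140, minScore = 20 } = {}) {
  let score = 0;
  for (const block of blocks) {
    if (block.hidden || block.unlikely) continue;        // skip hidden/unlikely nodes
    if (block.textLength < minContentLength) continue;   // too short to count
    score += Math.sqrt(block.textLength - minContentLength);
    if (score > minScore) return true;                   // stop early, page is good enough
  }
  return false;
}
```

The early return is the point: on a real article the check usually stops after a few paragraphs.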

Post-processing: fixing URLs and simplifying the output

-------------------------------------------------------

After Readability finds the article, it still does some cleanup to make the output stable. This part is easy to miss, but it matters in real life.

What it does:

*   Converts relative URLs to absolute URLs (links, images, video/audio sources, srcset).

*   Removes javascript: links (or turns them into plain text).

*   Removes pointless nested wrappers (div inside div inside div) when it can.

*   Strips class attributes unless you want to keep them.

*   Detects text direction (dir) and keeps language (lang) when it is available.

Why it matters:

*   If you save extracted HTML and render it later, relative URLs will break.

*   If you feed extracted HTML into a markdown converter, deep wrappers make output noisy.

*   If you keep classes from random websites, your CSS can get weird.
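
The URL part of this cleanup is easy to sketch with the standard `URL` class. This is not Readability's own code; `toAbsoluteUrl` is a hypothetical helper, and returning `null` for javascript: links is my simplification (the caller decides whether to drop the link or keep its text).

```javascript
// Node 18+
// Sketch only: resolve one href/src value against the page URL.
function toAbsoluteUrl(href, baseUrl) {
  if (/^\s*javascript:/i.test(href)) return null; // caller should drop or unwrap the link
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return href; // leave values that cannot be parsed untouched
  }
}

console.log(toAbsoluteUrl("/img/a.png", "https://example.com/blog/post"));
// prints "https://example.com/img/a.png"
```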

Why the algorithm fails (and what it usually means)

---------------------------------------------------

Readability fails when the page does not look like an article. Or when the “article” is there, but it is hidden behind layout tricks.

Common reasons:

*   Not enough text. The page is a list, or a short announcement.

*   Too many links. Navigation blocks can win if the page is mostly links.

*   Content is split into many small blocks with little text.

*   The DOM is broken (bad nesting) and the algorithm scores the wrong container.

*   Heavy templates: the same “chrome” (header/sidebar/footer) has more text than the real article.

Real life caveat: Readability works on the HTML you give it. If you fetch a JS app and you get an empty shell, the algorithm has nothing to score.

Final output: what Readability.parse() returns

----------------------------------------------

The result is a plain object. The important fields are:

*   title

*   byline

*   excerpt

*   siteName

*   publishedTime

*   content (clean HTML)

*   textContent (plain text)

*   length (characters)

*   lang

*   dir

In practice, I treat textContent + title as the “real payload”, and everything else as optional metadata.
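
A small sketch of that habit. `toPayload` is a hypothetical helper, not part of the library. It also guards against a missing result, since parsing can fail and give you nothing to work with.

```javascript
// Sketch only: `article` is whatever Readability.parse() gave you (may be null).
function toPayload(article) {
  if (!article || !article.textContent) return null; // nothing usable was extracted
  return {
    title: (article.title || "").trim(),
    text: article.textContent.trim(),
    // Everything else is optional metadata.
    meta: {
      byline: article.byline ?? null,
      siteName: article.siteName ?? null,
      publishedTime: article.publishedTime ?? null,
      lang: article.lang ?? null,
    },
  };
}
```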

WebCrawlerAPI main\_content\_only (real life shortcut)

------------------------------------------------------

If you just need the result and you don’t want to maintain your own extraction stack, WebCrawlerAPI has the main\_content\_only option. It runs the same kind of “find main content” logic for you and returns only the article.

If you want to test it fast, use this tool: [Readability tool](https://webcrawlerapi.com/tools/html-main-content-readability).

    const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {

      method: "POST",

      headers: {

        Authorization: "Bearer YOUR_API_KEY",

        "Content-Type": "application/json",

      },

      body: JSON.stringify({

        url: "https://example.com/article",

        main_content_only: true,

        scrape_type: "markdown",

      }),

    });

If you want to see the code-first tutorial, here is my practical guide: [Extracting article or blogpost content with Mozilla Readability](/blog/how-to-extract-article-or-blogpost-content-in-js-using-readabilityjs). If you want to compare it with a Rust implementation, read: [How dom\_smoothie Rust Mozilla Readability alternative works](/blog/how-dom-smoothie-rust-mozilla-readability-alternative-works).

----
url: https://webcrawlerapi.com/glossary/playwright/how-to-fix-net-err-aborted-page-goto
----

net::ERR\_ABORTED often appears when navigation is interrupted by another navigation, redirect, or page close.

Avoid overlapping goto calls and await the one you trigger.

    await page.goto('https://example.com');

    await Promise.all([

      page.waitForURL('**/dashboard'),

      page.getByRole('link', { name: 'Dashboard' }).click(),

    ]);

Do not close the page/context until all navigation-related promises settle.

----
url: https://webcrawlerapi.com/glossary
----

Glossary

Web Scraping & API Glossary

===========================

Comprehensive glossary of web scraping, crawling, and API terms. Learn the essential concepts and terminology used in web data extraction.

*   Playwright (20)

*   Puppeteer (26)

*   Scraping (10)

*   Webcrawling (10)

----
url: https://webcrawlerapi.com/docs/sdk/mcp
----

MCP WebcrawlerAPI integration

=============================

How to use WebCrawlerAPI with MCP (Model Context Protocol) for AI-powered web scraping.

MCP (Model Context Protocol) is a protocol that allows AI assistants to connect to external services and tools. The WebCrawlerAPI MCP server enables AI assistants like Claude Code to scrape web content directly during conversations.

[Installation](#installation)

-----------------------------

### [Using npm](#using-npm)

    npm install -g webcrawler-mcp

### [Using npx (no installation required)](#using-npx-no-installation-required)

    npx webcrawler-mcp

[Setup](#setup)

---------------

### [1\. Get your API key](#1-get-your-api-key)

Get your API key from the [WebCrawlerAPI dashboard](https://dash.webcrawlerapi.com/access).

### [2\. Set environment variable](#2-set-environment-variable)

    export WEBCRAWLER_API_KEY="your-api-key-here"

### [3\. Run the server](#3-run-the-server)

The MCP server can run in two modes:

**Standard mode (stdio):**

    npx webcrawler-mcp

**HTTP mode:**

    export USE_HTTP=true

    export PORT=8080

    npx webcrawler-mcp

[Configuration](#configuration)

-------------------------------

The MCP server supports the following environment variables:

*   `WEBCRAWLER_API_KEY` (required): Your WebCrawlerAPI access key

*   `USE_HTTP` (optional): Enable HTTP transport mode

*   `PORT` (optional): HTTP server port (default: 8080)

[Using with Claude Code](#using-with-claude-code)

-------------------------------------------------

Once the MCP server is running, Claude Code can automatically discover and use the `webcrawler-scrape` tool to scrape web content.

The tool accepts these parameters:

*   `url` (required): The webpage URL to scrape

*   `prompt` (optional): Specific information to extract from the page

Example usage in Claude Code:

    Please use webcrawler-scrape to scrape https://example.com and extract the main article content

Claude Code will automatically use the MCP server to scrape the webpage and return the content as markdown.

[Features](#features)

---------------------

*   **Single webpage scraping**: Extract content from any accessible webpage

*   **Content extraction**: Use prompts to extract specific information

*   **Markdown output**: Clean, structured content ready for AI processing

*   **Real-time scraping**: Dynamic content extraction during AI conversations

[Development](#development)

---------------------------

For development and testing:

    git clone https://github.com/WebCrawlerAPI/webcrawlerapi-mcp

    cd webcrawlerapi-mcp

    npm install

    npm run dev

Build for production:

    npm run build

    npm start

[GitHub Repository](#github-repository)

---------------------------------------

[WebCrawlerAPI MCP Server on GitHub](https://github.com/WebCrawlerAPI/webcrawlerapi-mcp)

If you have any questions, please contact us at [support@webcrawlerapi.com](mailto:support@webcrawlerapi.com).

----
url: https://webcrawlerapi.com/docs/guides/filters
----

Job URL filters

===============

Sometimes you don't need the content of every page on a website. You may want only specific blog posts, or you may want to filter out some pages. For this, use the `whitelist_regexp` and `blacklist_regexp` parameters in your job request.

[Whitelisting specific pages for website crawling](#whitelisting-specific-pages-for-website-crawling)

-----------------------------------------------------------------------------------------------------

Whitelisting is a way to **include only** specific pages in your crawling job. You can use a regular expression to match the URLs you want to include.

### [Whitelist example](#whitelist-example)

For example, if you want to include only blog posts from a specific category, you can use the following regular expression:

    {

        "url": "https://example.com/",

        "whitelist_regexp": "/blog/category/technology.*"

    }

Let's say you have a blog with the following URLs:

*   [https://example.com/blog/category/technology/post1](https://example.com/blog/category/technology/post1)

*   [https://example.com/blog/category/technology/post2](https://example.com/blog/category/technology/post2)

*   [https://example.com/blog/category/lifestyle/post1](https://example.com/blog/category/lifestyle/post1)

*   [https://example.com/blog/category/lifestyle/post2](https://example.com/blog/category/lifestyle/post2)

In this case, the crawler will only include the URLs that match the `whitelist_regexp` pattern. The URLs that do not match the pattern will be excluded from the crawling job.

So the resulting job will contain only the following URLs:

*   [https://example.com/blog/category/technology/post1](https://example.com/blog/category/technology/post1)

*   [https://example.com/blog/category/technology/post2](https://example.com/blog/category/technology/post2)

Excluded URLs:

*   [https://example.com/blog/category/lifestyle/post1](https://example.com/blog/category/lifestyle/post1)

*   [https://example.com/blog/category/lifestyle/post2](https://example.com/blog/category/lifestyle/post2)

[Blacklisting specific pages for website crawling](#blacklisting-specific-pages-for-website-crawling)

-----------------------------------------------------------------------------------------------------

Blacklisting is a way to **exclude** specific pages from your crawling job. You can use a regular expression to match the URLs you want to exclude.

### [Blacklist example](#blacklist-example)

For example, if you want to exclude all URLs that contain "/admin", you can use the following regular expression:

    {

        "url": "https://example.com/",

        "blacklist_regexp": "/admin.*"

    }

Let's say you have a website with the following URLs:

*   [https://example.com/admin/dashboard](https://example.com/admin/dashboard)

*   [https://example.com/admin/settings](https://example.com/admin/settings)

*   [https://example.com/blog/post1](https://example.com/blog/post1)

*   [https://example.com/blog/post2](https://example.com/blog/post2)

In this case, the crawler will exclude the URLs that match the `blacklist_regexp` pattern. The URLs that do not match the pattern will be included in the crawling job.

So the resulting job will contain only the following URLs:

*   [https://example.com/blog/post1](https://example.com/blog/post1)

*   [https://example.com/blog/post2](https://example.com/blog/post2)

Excluded URLs:

*   [https://example.com/admin/dashboard](https://example.com/admin/dashboard)

*   [https://example.com/admin/settings](https://example.com/admin/settings)

[Combining Whitelisting and Blacklisting](#combining-whitelisting-and-blacklisting)

-----------------------------------------------------------------------------------

You can combine both whitelisting and blacklisting in your crawling job. This allows you to include only specific pages while excluding others. First, the crawler will apply the `whitelist_regexp` to include only the URLs that match the pattern. Then, it will apply the `blacklist_regexp` to exclude any URLs that match that pattern.

### [Example](#example)

For example, if you want to include only blog posts from the `technology` category while explicitly excluding the `lifestyle` category, you can use the following regular expressions:

    {

        "url": "https://example.com/",

        "whitelist_regexp": "/blog/category/technology.*",

        "blacklist_regexp": "/blog/category/lifestyle.*"

    }

Let's say you have a website with the following URLs:

*   [https://example.com/blog/category/technology/post1](https://example.com/blog/category/technology/post1)

*   [https://example.com/blog/category/technology/post2](https://example.com/blog/category/technology/post2)

*   [https://example.com/blog/category/lifestyle/post1](https://example.com/blog/category/lifestyle/post1)

*   [https://example.com/blog/category/lifestyle/post2](https://example.com/blog/category/lifestyle/post2)

*   [https://example.com/blog/category/sports/post1](https://example.com/blog/category/sports/post1)

*   [https://example.com/blog/category/sports/post2](https://example.com/blog/category/sports/post2)

*   [https://example.com/admin/dashboard](https://example.com/admin/dashboard)

*   [https://example.com/admin/settings](https://example.com/admin/settings)

In this case, the crawler will include only the URLs that match the `whitelist_regexp` pattern and do not match the `blacklist_regexp` pattern.

So the resulting job will contain only the following URLs:

*   [https://example.com/blog/category/technology/post1](https://example.com/blog/category/technology/post1)

*   [https://example.com/blog/category/technology/post2](https://example.com/blog/category/technology/post2)

Excluded URLs:

*   [https://example.com/blog/category/lifestyle/post1](https://example.com/blog/category/lifestyle/post1)

*   [https://example.com/blog/category/lifestyle/post2](https://example.com/blog/category/lifestyle/post2)

*   [https://example.com/blog/category/sports/post1](https://example.com/blog/category/sports/post1)

*   [https://example.com/blog/category/sports/post2](https://example.com/blog/category/sports/post2)

*   [https://example.com/admin/dashboard](https://example.com/admin/dashboard)

*   [https://example.com/admin/settings](https://example.com/admin/settings)

The `lifestyle` URLs are excluded on both counts: they do not match the `whitelist_regexp`, and they match the `blacklist_regexp`. The `sports` and `admin` URLs are excluded because they do not match the `whitelist_regexp`.

[What if I need to whitelist by several patterns?](#what-if-i-need-to-whitelist-by-several-patterns)

----------------------------------------------------------------------------------------------------

You can use the `|` operator to combine multiple patterns in a single regular expression. This allows you to include URLs that match any of the specified patterns. For example:

    {

        "url": "https://example.com/",

        "whitelist_regexp": "/blog/category/(technology|lifestyle).*"

    }

[How to debug whitelist\_regexp and blacklist\_regexp](#how-to-debug-whitelist_regexp-and-blacklist_regexp)

-----------------------------------------------------------------------------------------------------------

When you are creating a job with `whitelist_regexp` or `blacklist_regexp`, it can be difficult to know if your regular expression is correct and what pages will be included or excluded from the crawling job.

In addition, the website's URL structure is usually not known in advance.

We recommend crawling without any filters first (with a small item limit, 100 for example) and then using the `urls` [API endpoint](/docs/api/urls) to get the list of URLs that were crawled.

You can then use tools like [regex101](https://regex101.com/) to debug your regexps. Paste the list of URLs into the `Find matches` section and write your regexp.
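
You can also check your patterns locally before creating a job. The sketch below applies the whitelist first and then the blacklist, as described on this page. `applyFilters` is a hypothetical helper, not part of any SDK, and it assumes a pattern may match anywhere in the URL; the crawler's regular expression engine may differ from JavaScript's in details, so treat the result as an approximation.

```javascript
// Sketch only: mimics the whitelist-then-blacklist rule on a list of URLs.
function applyFilters(urls, { whitelist_regexp, blacklist_regexp } = {}) {
  const allow = whitelist_regexp ? new RegExp(whitelist_regexp) : null;
  const deny = blacklist_regexp ? new RegExp(blacklist_regexp) : null;
  return urls.filter((url) => {
    if (allow && !allow.test(url)) return false; // not whitelisted
    if (deny && deny.test(url)) return false;    // blacklisted
    return true;
  });
}

const sampleUrls = [
  "https://example.com/blog/category/technology/post1",
  "https://example.com/blog/category/lifestyle/post1",
  "https://example.com/admin/dashboard",
];

console.log(applyFilters(sampleUrls, { blacklist_regexp: "/admin.*" }));
// prints the two blog URLs; the admin URL is filtered out
```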
