Engineering LLM-Assisted Web Scraping: From Agentic Discovery to Deterministic Extraction

· 4 min read
Frank Chen
Backend & Applied ML Engineer

The most robust scraping architecture uses an LLM Agent for the "Discovery Phase" to map endpoints, but relies on deterministic scripts for the actual data extraction.

The Problem

User's Messy Thought: Use an LLM agent (via MCP tools) to either reverse engineer a web page to write a scraping script, or agentically control a browser to scroll, intercept network endpoints, and save data iteratively.

Standard Industry Terminology:

  • Problem: Automated Extraction of Dynamically Loaded Web Content (Infinite Scroll/Pagination).
  • Core Technologies: LLM-Assisted Browser Automation, API Reverse Engineering, Network Interception (XHR/Fetch logging), Model Context Protocol (MCP), Autonomous Agents.
  • System Objectives: To design an architecture where an AI agent dynamically discovers data sources (Hidden APIs) and extracts target data (e.g., comments) from web applications that require user interaction (scrolling).

Possible Approaches

Based on industry standards, there are three primary architectural approaches to this problem.

Approach 1: Agent-Assisted Reverse Engineering & Code Generation (The "Script Builder")

  • How it works: The LLM agent acts as a developer. Using an MCP tool hooked into Playwright/Puppeteer, it navigates to the page and dumps the network logs (HAR file or XHR requests) after an initial scroll. The agent analyzes the network traffic, identifies the backend GraphQL/REST API returning the comments (in JSON format), and figures out the pagination mechanism (e.g., cursor tokens, page offsets). The agent then writes a standalone, deterministic Python or Node.js script that queries the API directly, bypassing the browser entirely.
  • Execution Phase: You run the generated script directly. The LLM is out of the loop.
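To make the pagination mechanism concrete, here is a minimal sketch of the kind of deterministic script the agent might emit after discovery. The `comments`/`next_cursor` field names and the injected `fetch_page` callable are assumptions; real APIs use their own field names, and `fetch_page` would wrap an HTTP call (e.g. `requests.get(API_URL, params={"cursor": cursor}).json()`) with whatever headers or tokens the agent observed in the network logs.

```python
from typing import Callable, Iterator, Optional

def paginate_comments(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Walk a cursor-paginated comments API until the cursor runs out.

    `fetch_page(cursor)` performs one HTTP request and returns the decoded
    JSON page, assumed here to look like {"comments": [...], "next_cursor": ...}.
    """
    cursor: Optional[str] = None
    for _ in range(max_pages):          # hard cap guards against cursor loops
        page = fetch_page(cursor)
        yield from page.get("comments", [])
        cursor = page.get("next_cursor")
        if not cursor:                  # a null/missing cursor signals the end
            break
```

Because the pagination logic takes `fetch_page` as a parameter, it can be unit-tested against canned pages before ever touching the live API.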

Approach 2: Interactive Browser Automation with Network Interception (The "Scrolling Listener")

  • How it works: The LLM agent actively controls a headless browser session. An MCP tool exposes browser actions (scroll_down(), wait()) and a network listener (get_intercepted_json()). The browser is configured to intercept and capture all responses matching a specific pattern (e.g., *api/v1/comments*).
  • Execution Phase: The agent operates in a continuous loop: command a scroll → the website's JS triggers the API → the MCP tool captures the raw JSON response → the agent saves the data → repeat until no more comments load.
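The interception side of this loop can be sketched with Playwright's sync API. The `*api/v1/comments*` pattern is the hypothetical example from above, and the fixed-count scroll loop stands in for agent-issued `scroll_down()` commands; the MCP tool layer is elided.

```python
import fnmatch

# Hypothetical endpoint pattern; discover the real one from the network logs.
COMMENT_API_PATTERN = "*api/v1/comments*"

def is_comment_response(url: str) -> bool:
    """True when a captured response URL matches the comments endpoint."""
    return fnmatch.fnmatch(url, COMMENT_API_PATTERN)

def collect_comments(url: str, scrolls: int = 20) -> list:
    """Scroll the page while a response listener captures matching JSON payloads.

    Requires `pip install playwright` and `playwright install chromium`.
    """
    from playwright.sync_api import sync_playwright  # lazy import: heavy optional dep

    captured = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on(
            "response",
            lambda r: captured.append(r.json()) if is_comment_response(r.url) else None,
        )
        page.goto(url)
        for _ in range(scrolls):
            page.mouse.wheel(0, 4000)    # trigger the site's infinite-scroll JS
            page.wait_for_timeout(1000)  # let the XHR fire and be captured
        browser.close()
    return captured
```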

Approach 3: DOM-Parsing Autonomous Agent (The "Human-Like Browser")

  • How it works: The agent does not look at network traffic at all. Instead, it relies on an Accessibility Tree (A11y), DOM snapshots, or Vision (screenshots) via MCP.
  • Execution Phase: The agent looks at the page, reads the text of the loaded comments, parses them into a JSON structure via prompt extraction, commands the browser to scroll down, waits for rendering, looks at the new DOM, deduplicates against old comments, and repeats.
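The deduplication step reduces to a pure function over successive DOM snapshots. The `id`/`author`/`text` field names are assumptions about what the prompt-extraction step emits; infinite scroll means consecutive snapshots overlap heavily, so keys must be tracked across iterations.

```python
from typing import Iterable

def dedupe_comments(snapshots: Iterable[list]) -> list:
    """Merge successive DOM snapshots into one list, keeping first occurrences.

    Keys on a comment's `id` when present, else on an (author, text) pair,
    since DOM-scraped comments may lack stable identifiers.
    """
    seen = set()
    merged = []
    for snapshot in snapshots:
        for comment in snapshot:
            key = comment.get("id") or (comment.get("author"), comment.get("text"))
            if key in seen:
                continue  # already collected in an earlier snapshot
            seen.add(key)
            merged.append(comment)
    return merged
```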

Compare Trade-offs

| Metric | Approach 1: Script Builder | Approach 2: Scrolling Listener | Approach 3: Human-Like Browser |
| --- | --- | --- | --- |
| Scalability | High: lightweight HTTP requests. | Low: heavy headless browser. | Very low: massive token context. |
| Latency | Low: bypasses UI rendering. | Medium: must wait for DOM rendering. | High: rendering + LLM inference. |
| Complexity | High: must decode auth/signatures. | Medium: native frontend handles auth. | Low (setup) / high (maintenance). |
| Robustness | Medium: deterministic but schema-sensitive. | High: resilient to UI changes. | Low: susceptible to layout shifts. |
| Cost | Low: LLM used once for discovery. | High: LLM in the execution loop. | Very high: continuous parsing. |

Industry Best Practices & Recommendations

As a senior systems engineer, I recommend a hybrid architecture: use Approach 1 for production, and fall back to Approach 2 for heavily protected targets.

  1. Decouple Discovery from Extraction: Do not use an LLM Agent to perform the actual scraping job on a continuous basis. Use the MCP-powered Agent exclusively for the Discovery Phase to map the endpoints and generate deterministic scraping code.
  2. Rely on Structured APIs over DOM: Always prioritize intercepting network traffic (JSON) over parsing the DOM/HTML. Fetching raw JSON from an intercepted API call ensures structured, clean data without needing complex regex or LLM parsing.
  3. The "Puppeteer/Playwright Route" for Anti-Bot (Approach 2 implementation): If the target site uses heavy protection (Cloudflare, Datadome), use Approach 2, but remove the LLM from the execution loop. Have the LLM write a Playwright script that sets up page.route(), scrolls the page using a simple while loop, and writes intercepted JSON to disk. This leverages the browser for auth while keeping execution fast and cheap.
  4. Handling Infinite Scroll in Code: Avoid arbitrary timeouts (sleep(2)). Instead, hook into network idle states or wait for DOM mutations (e.g., a "loading..." spinner disappearing) to minimize race conditions.
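Recommendations 3 and 4 combine into a single LLM-free Playwright sketch: intercept matching responses, scroll in a plain `while` loop, wait deterministically instead of sleeping, and stop once the capture count goes flat. The URL pattern, the `networkidle` wait, and the flat-count heuristic are assumptions to adapt per target.

```python
import fnmatch

def no_new_data(counts: list, patience: int = 3) -> bool:
    """Stop condition: the captured-comment count has been flat for `patience` scrolls."""
    return len(counts) >= patience and len(set(counts[-patience:])) == 1

def scrape(url: str, pattern: str = "*api/v1/comments*") -> list:
    """LLM-free execution loop: the browser handles auth, a while loop handles scrolling."""
    from playwright.sync_api import sync_playwright  # lazy import: needs playwright installed

    captured, counts = [], []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Capture raw JSON from the comments endpoint instead of parsing the DOM.
        page.on(
            "response",
            lambda r: captured.append(r.json()) if fnmatch.fnmatch(r.url, pattern) else None,
        )
        page.goto(url)
        while not no_new_data(counts):
            page.mouse.wheel(0, 4000)
            # Deterministic wait: let in-flight XHRs settle rather than sleep(2).
            page.wait_for_load_state("networkidle")
            counts.append(len(captured))
        browser.close()
    return captured
```

The flat-count check doubles as the termination test for infinite scroll: when several consecutive scrolls add nothing, the feed is exhausted (or blocked), and the script exits instead of spinning forever.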
  • [[mcp-model-context-protocol]]
  • [[browser-automation]]
  • [[reverse-engineering]]
  • [[deterministic-vs-agentic-execution]]
  • [[anti-bot-evasion]]

Source

Chat session — 2026-03-24