title: Strategy
description: Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance.

The strategy key defines how html2rss fetches a website's content.

  • faraday (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
  • browserless: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.

strategy is a top-level config key. Request-specific controls live under request.

Try faraday first for server-rendered pages such as newsrooms, listings, and changelogs. Switch to browserless when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links.
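As a starting point for the faraday case, the sketch below writes a minimal feed config to a path of your choosing. The function name `write_faraday_config`, the URL, and the CSS selectors are illustrative placeholders, not part of html2rss; only the config keys are taken from the examples on this page.

```shell
# Write a minimal faraday-based feed config to the path given as $1.
# The URL and selectors are placeholders; adapt them to the target site.
write_faraday_config() {
  cat > "$1" <<'YAML'
strategy: faraday
channel:
  url: "https://example.com/news"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"
YAML
}
```

After `write_faraday_config my_config.yml`, run `html2rss feed my_config.yml`. If the resulting feed is missing items that are visible in a browser, that is a hint the page may need the browserless strategy instead.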

browserless

To use the browserless strategy, you need a running instance of Browserless.io.

Docker

You can run a local Browserless.io instance using Docker:

docker run \
  --rm \
  -p 3000:3000 \
  -e "CONCURRENT=10" \
  -e "TOKEN=6R0W53R135510" \
  ghcr.io/browserless/chromium

Configuration

Set the strategy at the top level of your feed configuration and put request controls under request:

strategy: browserless
request:
  max_redirects: 5
  max_requests: 6
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"

Request Structure

Use this split consistently:

  • strategy: selects faraday or browserless
  • headers: top-level headers shared by all strategies
  • request.max_redirects: redirect limit for the request session
  • request.max_requests: total request budget for the whole feed build
  • request.browserless.*: Browserless-only options

Example:

strategy: browserless
headers:
  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
request:
  max_redirects: 5
  max_requests: 6
  browserless:
    preload:
      wait_after_ms: 5000
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"

Browserless Preload

Browserless can interact with the page before html2rss captures the final HTML. Configure preload steps under request.browserless.preload.

strategy: browserless
request:
  browserless:
    preload:
      wait_after_ms: 5000
      click_selectors:
        - selector: ".load-more"
          max_clicks: 3
          wait_after_ms: 250
      scroll_down:
        iterations: 5
        wait_after_ms: 200

  • wait_after_ms: inserts a fixed wait (in milliseconds) before or after preload steps
  • click_selectors: clicks matching elements until they disappear or max_clicks is reached
  • scroll_down: scrolls until the page height stops growing or iterations is reached

If preload triggers a real navigation or redirect, html2rss keeps the final document metadata. Relative links and follow-up pagination therefore resolve against the page that was actually rendered after preload completed.

Command-Line Usage

You can also specify the strategy on the command line:

# Set environment variables for your Browserless.io instance
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \
BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
  html2rss feed my_config.yml --strategy browserless

# Override request budgets at runtime
html2rss feed my_config.yml --max-redirects 5 --max-requests 6

# Or rely on the strategy stored in the YAML config
html2rss feed my_config.yml

Browserless Troubleshooting

If html2rss cannot connect to Browserless, it surfaces a "Browserless connection failed (...)" error with endpoint/token hints.

Check these first:

  • BROWSERLESS_IO_WEBSOCKET_URL is reachable from where html2rss runs
  • BROWSERLESS_IO_API_TOKEN matches your Browserless TOKEN
  • your Browserless service is running and accepting connections

For custom Browserless websocket endpoints, BROWSERLESS_IO_API_TOKEN is mandatory. The local default endpoint (ws://127.0.0.1:3000) can use the default local token 6R0W53R135510.
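The first check above can be scripted. The following is a rough sketch, not part of html2rss: `ws_host_port` and `check_browserless` are hypothetical helper names, and the probe relies on bash's `/dev/tcp` redirection, so it assumes a bash build with network redirections enabled. It only tests that the TCP port accepts connections; it does not validate the token.

```shell
# Split a ws:// or wss:// URL into "host port".
# Assumes a hostname or IPv4 address; bracketed IPv6 hosts are not handled.
ws_host_port() {
  local url="$1" rest hostport host port
  rest="${url#*://}"            # drop the scheme
  hostport="${rest%%/*}"        # drop any path
  hostport="${hostport%%\?*}"   # drop any query string
  host="${hostport%%:*}"
  if [ "$hostport" = "$host" ]; then
    # No explicit port: fall back to the scheme default.
    case "$url" in
      wss://*) port=443 ;;
      *) port=80 ;;
    esac
  else
    port="${hostport##*:}"
  fi
  printf '%s %s\n' "$host" "$port"
}

# Try to open a TCP connection to the configured Browserless endpoint.
# Falls back to the local default endpoint when the variable is unset.
check_browserless() {
  local url="${BROWSERLESS_IO_WEBSOCKET_URL:-ws://127.0.0.1:3000}"
  local hp host port
  hp="$(ws_host_port "$url")"
  host="${hp% *}"
  port="${hp#* }"
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

Run `check_browserless` from the same host or container that runs html2rss, since reachability depends on where the request originates. The connect attempt has no timeout in this sketch, so a firewalled port may block for a while before failing.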


For detailed documentation on the Ruby API, see the official YARD documentation.