| title | Strategy |
|---|---|
| description | Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance. |
The strategy key defines how html2rss fetches a website's content.
- `faraday` (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
- `browserless`: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
strategy is a top-level config key. Request-specific controls live under request.
Use faraday first for direct newsroom/listing/changelog pages. Prefer browserless when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links.
To use the browserless strategy, you need a running instance of Browserless.io.
You can run a local Browserless.io instance using Docker:
```bash
docker run \
  --rm \
  -p 3000:3000 \
  -e "CONCURRENT=10" \
  -e "TOKEN=6R0W53R135510" \
  ghcr.io/browserless/chromium
```

Set the strategy at the top level of your feed configuration and put request controls under `request`:
```yaml
strategy: browserless
request:
  max_redirects: 5
  max_requests: 6
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"
```

Use this split consistently:
- `strategy`: selects `faraday` or `browserless`
- `headers`: top-level headers shared by all strategies
- `request.max_redirects`: redirect limit for the request session
- `request.max_requests`: total request budget for the whole feed build
- `request.browserless.*`: Browserless-only options
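The split above can be checked mechanically before a feed is built. The following is a minimal sketch using only Ruby's standard `yaml` library; `misplaced_request_keys` is a hypothetical helper written for illustration and is not part of html2rss.

```ruby
require "yaml"

# Hypothetical helper, not part of html2rss: lists request controls that were
# written at the top level of the config instead of under `request`.
def misplaced_request_keys(config)
  request_only_keys = %w[max_redirects max_requests browserless]
  request_only_keys.select { |key| config.key?(key) }
end

config = YAML.safe_load(<<~CONFIG)
  strategy: browserless
  headers:
    User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
  max_requests: 6   # misplaced on purpose: belongs under request
  request:
    max_redirects: 5
CONFIG

misplaced = misplaced_request_keys(config)
puts "Move under request: #{misplaced.join(', ')}" unless misplaced.empty?
```

The check only looks at key placement. It does not validate values or tell you how html2rss itself reacts to a misplaced key.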
Example:
```yaml
strategy: browserless
headers:
  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
request:
  max_redirects: 5
  max_requests: 6
  browserless:
    preload:
      wait_after_ms: 5000
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"
```

Browserless can interact with the page before html2rss captures the final HTML. Configure preload steps under `request.browserless.preload`.
```yaml
strategy: browserless
request:
  browserless:
    preload:
      wait_after_ms: 5000
      click_selectors:
        - selector: ".load-more"
          max_clicks: 3
          wait_after_ms: 250
      scroll_down:
        iterations: 5
        wait_after_ms: 200
```

- `wait_after_ms`: inserts a fixed wait before or after preload steps
- `click_selectors`: clicks matching elements until they disappear or `max_clicks` is reached
- `scroll_down`: scrolls until the page height stops growing or `iterations` is reached
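The two stopping rules for `click_selectors` and `scroll_down` can be modeled on a fake page to show when each loop ends. This is an illustration only: `FakePage`, `click_until_gone`, and `scroll_until_stable` are hypothetical names, the waits are left out, and the code is not how html2rss or Browserless implements preload.

```ruby
# Hypothetical stand-in for a rendered page: a "load more" button that
# disappears after a number of clicks, and a height that grows for a
# limited number of scrolls.
FakePage = Struct.new(:buttons_left, :height, :growth_steps) do
  def button_visible?
    buttons_left.positive?
  end

  def click!
    self.buttons_left -= 1
  end

  def scroll!
    return if growth_steps.zero?

    self.growth_steps -= 1
    self.height += 1000
  end
end

# Click while the element is still present and the click budget is not spent.
def click_until_gone(page, max_clicks:)
  clicks = 0
  while page.button_visible? && clicks < max_clicks
    page.click!
    clicks += 1
  end
  clicks
end

# Scroll until the height stops growing or the iteration budget is spent.
def scroll_until_stable(page, iterations:)
  scrolls = 0
  iterations.times do
    before = page.height
    page.scroll!
    scrolls += 1
    break if page.height == before
  end
  scrolls
end

page = FakePage.new(5, 2000, 1)
puts "clicks: #{click_until_gone(page, max_clicks: 3)}"
puts "scrolls: #{scroll_until_stable(page, iterations: 5)}"
```

In this run the click loop ends on the `max_clicks` budget because the button is still visible, while the scroll loop ends early because the height stops changing.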
If preload triggers a real navigation or redirect, html2rss keeps the final document metadata. Relative links and follow-up pagination therefore resolve against the page that was actually rendered after preload completed.
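To see why the final document matters for relative links, the sketch below resolves the same relative link against two base URLs with Ruby's standard `URI.join`. Both URLs and the link are made up for illustration, and the snippet shows only standard URL resolution, not html2rss internals.

```ruby
require "uri"

# Hypothetical URLs: the configured channel URL and the page that was
# actually rendered after a preload step caused a navigation.
configured_url = "https://example.com/app"
final_url      = "https://example.com/news/latest/"

relative_link = "article-1"

against_configured = URI.join(configured_url, relative_link).to_s
against_final      = URI.join(final_url, relative_link).to_s

puts against_configured
puts against_final
```

The two results differ, so resolving against the configured URL would produce a wrong article link whenever preload ends on a different page.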
You can also specify the strategy on the command line:
```bash
# Set environment variables for your Browserless.io instance
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \
BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
  html2rss feed my_config.yml --strategy browserless

# Override request budgets at runtime
html2rss feed my_config.yml --max-redirects 5 --max-requests 6

# Or rely on the strategy stored in the YAML config
html2rss feed my_config.yml
```

If Browserless cannot connect, html2rss surfaces a `Browserless connection failed (...)` error with endpoint/token hints.
Check these first:
- `BROWSERLESS_IO_WEBSOCKET_URL` is reachable from where html2rss runs
- `BROWSERLESS_IO_API_TOKEN` matches your Browserless `TOKEN`
- your Browserless service is running and accepting connections
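The first and third checks can be approximated with a plain TCP probe from the machine that runs html2rss. The sketch below uses Ruby's standard `socket` and `uri` libraries; `endpoint_host_port` and `reachable?` are hypothetical helpers, and falling back to `ws://127.0.0.1:3000` when the variable is unset is an assumption made for this example.

```ruby
require "socket"
require "uri"

# Hypothetical helper: extracts host and port from the websocket URL.
# Include the port in the URL explicitly; older Ruby versions do not know a
# default port for the ws scheme.
def endpoint_host_port(websocket_url)
  uri = URI.parse(websocket_url)
  [uri.host, uri.port]
end

# Hypothetical helper: true if a TCP connection can be opened in time.
def reachable?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

url = ENV.fetch("BROWSERLESS_IO_WEBSOCKET_URL", "ws://127.0.0.1:3000")
host, port = endpoint_host_port(url)
puts "#{host}:#{port} reachable: #{reachable?(host, port)}"
```

A successful probe only shows that something accepts connections on that port. It does not perform a websocket handshake or verify the token.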
For custom Browserless websocket endpoints, BROWSERLESS_IO_API_TOKEN is mandatory. The local default endpoint (ws://127.0.0.1:3000) can use the default local token 6R0W53R135510.
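The token rule can be written down as a small decision function. This is a sketch of the rule as stated here, not html2rss's actual implementation: `resolve_browserless_token` is a hypothetical name, and treating an unset `BROWSERLESS_IO_WEBSOCKET_URL` as the local default endpoint is an assumption.

```ruby
# Values taken from the text above; the helper itself is hypothetical.
DEFAULT_ENDPOINT = "ws://127.0.0.1:3000"
DEFAULT_TOKEN    = "6R0W53R135510"

# Accepts any hash-like object, for example ENV or a plain Hash.
def resolve_browserless_token(env)
  endpoint = env.fetch("BROWSERLESS_IO_WEBSOCKET_URL", DEFAULT_ENDPOINT)
  token    = env["BROWSERLESS_IO_API_TOKEN"]

  # An explicitly configured token always wins.
  return token if token && !token.empty?
  # Only the local default endpoint may fall back to the default token.
  return DEFAULT_TOKEN if endpoint == DEFAULT_ENDPOINT

  raise ArgumentError, "BROWSERLESS_IO_API_TOKEN is required for #{endpoint}"
end

puts resolve_browserless_token({})
```

Passing `ENV` instead of `{}` applies the same rule to the current shell environment.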
For detailed documentation on the Ruby API, see the official YARD documentation.