docs: align strategy docs with botasaurus-first auto fallback

gildesmarais · gildesmarais · commit 7dedfbbb3abb · 2026-04-17T17:59:08.000+02:00
diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx
@@ -48,7 +48,7 @@ When auto-sourcing isn't enough, you can write your own configuration files to c
 3. **Validate the config** with `html2rss validate your-config.yml`
 4. **Render the feed** with `html2rss feed your-config.yml`
 5. **Add it to `html2rss-web`** so you can use it through your normal instance
-6. **Escalate to `browserless`** if the content is rendered by JavaScript
+6. **Escalate the strategy** to JavaScript/browser-based extraction if static fetching is insufficient
 
 This order keeps iteration fast and makes it easier to see whether the problem is the page structure, your
 selectors, or the fetch strategy.
@@ -210,7 +210,7 @@ there.
 - **No items found?** Check your selectors with browser tools (F12) - the `items.selector` might not match the page structure
 - **Invalid YAML?** Use spaces, not tabs, and ensure proper indentation
 - **Website not loading?** Check the URL and try accessing it in your browser
-- **Missing content?** Some websites load content with JavaScript - you may need to use the `browserless` strategy
+- **Missing content?** Some websites load content with JavaScript - you may need a JavaScript/browser-based extraction strategy instead of plain HTTP fetching
 - **Wrong data extracted?** Verify your selectors are pointing to the right elements
 
 **Need more help?** See our [comprehensive troubleshooting guide](/troubleshooting/troubleshooting) or ask in [GitHub Discussions](https://github.com/orgs/html2rss/discussions).
@@ -234,5 +234,5 @@ there.
 
 - **[Browse existing configs](https://github.com/html2rss/html2rss-configs/tree/master/lib/html2rss/configs)** - See real examples
 - **[Join discussions](https://github.com/orgs/html2rss/discussions)** - Connect with other users
-- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use `browserless`
+- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use static vs JavaScript/browser-based extraction
 - **[Learn advanced features](/ruby-gem/how-to/advanced-features/)** - Take your configs to the next level
diff --git a/src/content/docs/index.mdx b/src/content/docs/index.mdx
@@ -43,7 +43,7 @@ Most people should start with the web application:
 
 1. **[Creating Custom Feeds](/creating-custom-feeds)**: write and test your own configs
 2. **[Selectors Reference](/ruby-gem/reference/selectors/)**: learn the matching rules
-3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: decide when `browserless` is justified
+3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: choose the right extraction strategy for static vs JavaScript-heavy pages
 
 ### I'm building or integrating
 
diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx
@@ -16,7 +16,7 @@ html2rss uses parallel processing in auto-source discovery. This happens automat
 1. **Use appropriate selectors:** More specific selectors reduce processing time
 2. **Limit items when possible:** Use CSS selectors that target only the content you need
 3. **Cache responses:** The web application caches responses automatically
-4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required
+4. **Choose the right strategy:** Use static HTTP fetching for simple pages, and move to a JavaScript/browser-based extraction strategy when rendering or anti-bot handling is required
 
 ## Memory Optimization
 
diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
@@ -11,7 +11,7 @@ Keep this structure in mind:
 
 - `headers` stays top-level
 - `strategy` stays top-level
-- request-specific controls such as budgets and Browserless options live under `request`
+- request-specific controls such as budgets and strategy-specific options live under `request`
 
 ## When You Need Custom Headers
 
@@ -74,6 +74,7 @@ Request budgets are configured under `request`, not as top-level keys:
 - `request.max_redirects` limits redirect hops
 - `request.max_requests` limits the total request budget for the feed build
 - `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
+- `request.botasaurus.*` is reserved for Botasaurus-only behavior such as navigation mode and retries
 
 ## Common Use Cases
 
diff --git a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
@@ -1,6 +1,6 @@
 ---
 title: Handling Dynamic Content
-description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
+description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss using browser-based extraction strategies."
 ---
 
 import { Code } from "@astrojs/starlight/components";
@@ -9,7 +9,7 @@ Some websites load their content dynamically using JavaScript. The default `html
 
 ## Solution
 
-Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.
+Use a [browser-based extraction strategy](/ruby-gem/reference/strategy) when JavaScript-heavy pages do not work with default static fetching.
 
 Keep the strategy at the top level and put request-specific options under `request`:
 
@@ -36,9 +36,9 @@ Keep the strategy at the top level and put request-specific options under `reque
   lang="yaml"
 />
 
-## When to Use Browserless
+## When to Use Browser-Based Extraction
 
-The `browserless` strategy is necessary when:
+A browser-based extraction strategy is necessary when:
 
 - **Content loads after page load** - JavaScript fetches data from APIs
 - **Single Page Applications (SPAs)** - React, Vue, Angular apps
@@ -100,13 +100,13 @@ These preload steps can be combined in a single config when a site needs several
 
 ## Performance Considerations
 
-The `browserless` strategy is slower than the default `faraday` strategy because it:
+Browser-based extraction is slower than default static HTTP fetching because it:
 
 - Launches a headless Chrome browser
 - Renders the full page with JavaScript
 - Takes more memory and CPU resources
 
-**Use `faraday` for static content** and only switch to `browserless` when necessary.
+**Use static HTTP fetching for static content** and switch to browser-based extraction when needed. See the [Strategy Reference](/ruby-gem/reference/strategy) for concrete transports, defaults, and environment requirements.
 
 ## Related Topics
 
diff --git a/src/content/docs/ruby-gem/reference/cli-reference.mdx b/src/content/docs/ruby-gem/reference/cli-reference.mdx
@@ -23,6 +23,7 @@ Automatically discovers items from a page and prints the generated RSS feed to s
   code={`
   html2rss auto https://example.com/articles ; \
   html2rss auto https://example.com/app --strategy browserless ; \
+  BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss auto https://example.com/protected --strategy botasaurus ; \
   html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6 ; \
   html2rss auto https://example.com/articles --items_selector ".post-card"
 `}
@@ -31,6 +32,8 @@ Automatically discovers items from a page and prints the generated RSS feed to s
 
 Command: `html2rss auto URL`
 
+By default, the command runs with `--strategy auto`, which tries `faraday`, then `botasaurus`, then `browserless`.
+
 #### URL Surface Guidance For `auto`
 
 `auto` works best when the input URL already exposes a server-rendered list of entries.
@@ -63,6 +66,8 @@ When no extractable items are found, `auto` now classifies likely causes instead
 
 Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors.
 
+With the default `--strategy auto`, html2rss applies the fallback order on its own, so you do not need to pass a manual `--strategy` override.
+
 #### Browserless Setup And Diagnostics (CLI)
 
 `browserless` is opt-in for CLI usage.
@@ -97,6 +102,24 @@ If you see `Browserless connection failed`, check:
 
 For custom Browserless endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
 
+#### Botasaurus Environment Requirement (CLI)
+
+`botasaurus` is opt-in for CLI usage and requires `BOTASAURUS_SCRAPER_URL`:
+
+<Code
+  code={`
+  BOTASAURUS_SCRAPER_URL="http://localhost:4010" \
+  html2rss auto https://example.com/updates --strategy botasaurus
+`}
+  lang="bash"
+/>
+
+If you see a Botasaurus configuration error, check:
+
+- `BOTASAURUS_SCRAPER_URL` is set
+- `BOTASAURUS_SCRAPER_URL` is a valid URL
+- the Botasaurus scrape API is reachable from the shell environment running `html2rss`
+
 ### Feed
 
 Loads a YAML config, builds the feed, and prints the RSS XML to stdout.
@@ -106,6 +129,7 @@ Loads a YAML config, builds the feed, and prints the RSS XML to stdout.
   html2rss feed single.yml ; \
   html2rss feed feeds.yml my-first-feed ; \
   html2rss feed single.yml --strategy browserless ; \
+  BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss feed single.yml --strategy botasaurus ; \
   html2rss feed single.yml --max-redirects 5 --max-requests 6 ; \
   html2rss feed single.yml --params id:42 foo:bar
 `}
diff --git a/src/content/docs/ruby-gem/reference/strategy.mdx b/src/content/docs/ruby-gem/reference/strategy.mdx
@@ -1,6 +1,6 @@
 ---
 title: Strategy
-description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance."
+description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday, browserless, and botasaurus strategies for optimal performance."
 ---
 
 import { Code } from "@astrojs/starlight/components";
@@ -9,10 +9,13 @@ The `strategy` key defines how `html2rss` fetches a website's content.
 
 - **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
 - **`browserless`**: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
+- **`botasaurus`**: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires `BOTASAURUS_SCRAPER_URL`.
 
 `strategy` is a top-level config key. Request-specific controls live under `request`.
 
-Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `browserless` when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links.
+With the CLI default, `--strategy auto`, html2rss tries `faraday`, then `botasaurus`, then `browserless`.
+
+Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `botasaurus` as the first explicit browser-based strategy when you have a Botasaurus scrape API. Use `browserless` when you specifically need Browserless preload actions.
 
 ## `browserless`
 
@@ -62,11 +65,12 @@ Set the `strategy` at the top level of your feed configuration and put request c
 
 Use this split consistently:
 
-- `strategy`: selects `faraday` or `browserless`
+- `strategy`: selects `faraday`, `browserless`, or `botasaurus`
 - `headers`: top-level headers shared by all strategies
 - `request.max_redirects`: redirect limit for the request session
 - `request.max_requests`: total request budget for the whole feed build
 - `request.browserless.*`: Browserless-only options
+- `request.botasaurus.*`: Botasaurus-only options
 
 Example:
 
@@ -153,6 +157,58 @@ Check these first:
 
 For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`.
 
+## `botasaurus`
+
+`botasaurus` delegates page fetching to a Botasaurus scrape API endpoint. This strategy is explicit opt-in and requires:
+
+- `strategy: botasaurus`
+- `BOTASAURUS_SCRAPER_URL` set to your Botasaurus scrape API base URL (for example `http://localhost:4010`)
+
+### Configuration
+
+<Code
+  code={`
+  strategy: botasaurus
+  request:
+    max_redirects: 5
+    max_requests: 6
+    botasaurus:
+      navigation_mode: auto
+      max_retries: 2
+      headless: false
+  channel:
+    url: "https://example.com/protected-listing"
+  auto_source: {}
+  `}
+  lang="yml"
+/>
+
+Supported `request.botasaurus` options:
+
+- `navigation_mode` (`auto`, `get`, `google_get`, `google_get_bypass`)
+- `max_retries` (`0..3`)
+- `wait_for_selector`
+- `wait_timeout_seconds`
+- `block_images`
+- `block_images_and_css`
+- `wait_for_complete_page_load`
+- `headless`
+- `proxy`
+- `user_agent`
+- `window_size` (two integers, for example `[1920, 1080]`)
+- `lang`
+
+### Command-Line Usage
+
+<Code
+  code={`
+  BOTASAURUS_SCRAPER_URL="http://localhost:4010" \
+  html2rss auto https://example.com/updates --strategy botasaurus ; \
+  BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss feed my_config.yml --strategy botasaurus
+`}
+  lang="sh"
+/>
+
 ---
 
 For detailed documentation on the Ruby API, see the [official YARD documentation](https://www.rubydoc.info/gems/html2rss).
diff --git a/src/content/docs/troubleshooting/troubleshooting.mdx b/src/content/docs/troubleshooting/troubleshooting.mdx
@@ -32,14 +32,16 @@ The `auto` flow is URL-surface sensitive.
 
 If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors.
 
+With CLI defaults, `--strategy auto` is already active and tries `faraday`, then `botasaurus`, then `browserless`.
+
 ### Empty Feeds
 
 If your feed is empty, check the following:
 
 - **URL:** Ensure the `url` in your configuration is correct and accessible.
 - **`items.selector`:** Verify that the `items.selector` matches the elements on the page.
 - **Website Changes:** Websites change their HTML structure frequently. Your selectors may be outdated.
-- **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`.
+- **JavaScript Content:** If the content is loaded via JavaScript, move from `faraday` to a rendering strategy such as `browserless` (or `botasaurus` when you use a Botasaurus scrape API).
 - **Authentication:** Some sites require authentication — check if you need to add headers or use a different strategy.
 
 ### `No scrapers found` Failure Taxonomy (`auto`)
@@ -91,7 +93,9 @@ For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
 Common configuration-related errors:
 
 - **`UnsupportedResponseContentType`:** The website returned content that html2rss can't parse (not HTML or JSON).
-- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday` or `browserless`.
+- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday`, `browserless`, or `botasaurus`.
+- **`BOTASAURUS_SCRAPER_URL is required for strategy=botasaurus.`:** Set `BOTASAURUS_SCRAPER_URL` to your Botasaurus scrape API base URL when using `--strategy botasaurus`.
+- **`BOTASAURUS_SCRAPER_URL is invalid`:** Set `BOTASAURUS_SCRAPER_URL` to a well-formed URL (for example `http://localhost:4010`) and retry.
 - **`Configuration must include at least 'selectors' or 'auto_source'`:** You need to specify either manual selectors or enable auto-source.
 - **`stylesheet.type invalid`:** Only `text/css` and `text/xsl` are supported for stylesheets.
 
@@ -101,7 +105,7 @@ If parts of your items (e.g., title, link) are missing, check the following:
 
 - **Selector:** Ensure the selector for the missing part is correct and relative to the `items.selector`.
 - **Extractor:** Verify that you are using the correct `extractor` (e.g., `text`, `href`, `attribute`).
-- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with the Browserless service available) so the page can be rendered before extraction.
+- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with Browserless available) or `--strategy botasaurus` (with `BOTASAURUS_SCRAPER_URL` configured) so the page can be rendered before extraction.
 
 ### Date/Time Parsing Errors