Commit 7dedfbb

docs: align strategy docs with botasaurus-first auto fallback
1 parent 38527f4 commit 7dedfbb

8 files changed

Lines changed: 103 additions & 18 deletions

src/content/docs/creating-custom-feeds.mdx

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -48,7 +48,7 @@ When auto-sourcing isn't enough, you can write your own configuration files to c
4848
3. **Validate the config** with `html2rss validate your-config.yml`
4949
4. **Render the feed** with `html2rss feed your-config.yml`
5050
5. **Add it to `html2rss-web`** so you can use it through your normal instance
51-
6. **Escalate to `browserless`** if the content is rendered by JavaScript
51+
6. **Escalate strategy when needed**: if static fetching is insufficient, switch to a JavaScript/browser-based extraction strategy
5252

5353
This order keeps iteration fast and makes it easier to see whether the problem is the page structure, your
5454
selectors, or the fetch strategy.
@@ -210,7 +210,7 @@ there.
210210
- **No items found?** Check your selectors with browser tools (F12) - the `items.selector` might not match the page structure
211211
- **Invalid YAML?** Use spaces, not tabs, and ensure proper indentation
212212
- **Website not loading?** Check the URL and try accessing it in your browser
213-
- **Missing content?** Some websites load content with JavaScript - you may need to use the `browserless` strategy
213+
- **Missing content?** Some websites load content with JavaScript - you may need a JavaScript/browser-based extraction strategy instead of plain HTTP fetching
214214
- **Wrong data extracted?** Verify your selectors are pointing to the right elements
215215

216216
**Need more help?** See our [comprehensive troubleshooting guide](/troubleshooting/troubleshooting) or ask in [GitHub Discussions](https://github.com/orgs/html2rss/discussions).
@@ -234,5 +234,5 @@ there.
234234

235235
- **[Browse existing configs](https://github.com/html2rss/html2rss-configs/tree/master/lib/html2rss/configs)** - See real examples
236236
- **[Join discussions](https://github.com/orgs/html2rss/discussions)** - Connect with other users
237-
- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use `browserless`
237+
- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use static vs JavaScript/browser-based extraction
238238
- **[Learn advanced features](/ruby-gem/how-to/advanced-features/)** - Take your configs to the next level
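
The validate-then-render order from the numbered steps in this file can be sketched as a small shell helper. This is a minimal sketch, not part of html2rss: the function name `validate_then_render` and the `HTML2RSS_CMD` override are hypothetical, and it assumes the CLI exits non-zero when validation or the feed build fails.

```shell
#!/bin/sh
# Sketch: run the "validate" step before the "feed" step and stop at the first failure.
# Assumption: html2rss exits non-zero on an invalid config or a failed feed build.
# HTML2RSS_CMD is a hypothetical override so the helper can be exercised with a stub.
validate_then_render() {
  config="$1"
  cmd="${HTML2RSS_CMD:-html2rss}"

  # Step 3 in the list above: validate the config first.
  "$cmd" validate "$config" || {
    echo "config invalid: $config" >&2
    return 1
  }

  # Step 4: render the feed only once validation passed.
  "$cmd" feed "$config" || {
    echo "feed build failed: consider a different fetch strategy" >&2
    return 2
  }
}
```

A call such as `validate_then_render your-config.yml` keeps the two failure modes apart: return code 1 points at the config itself, while 2 points at fetching or extraction.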

src/content/docs/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ Most people should start with the web application:
4343

4444
1. **[Creating Custom Feeds](/creating-custom-feeds)**: write and test your own configs
4545
2. **[Selectors Reference](/ruby-gem/reference/selectors/)**: learn the matching rules
46-
3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: decide when `browserless` is justified
46+
3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: choose the right extraction strategy for static vs JavaScript-heavy pages
4747

4848
### I'm building or integrating
4949

src/content/docs/ruby-gem/how-to/advanced-features.mdx

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ html2rss uses parallel processing in auto-source discovery. This happens automat
1616
1. **Use appropriate selectors:** More specific selectors reduce processing time
1717
2. **Limit items when possible:** Use CSS selectors that target only the content you need
1818
3. **Cache responses:** The web application caches responses automatically
19-
4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required
19+
4. **Choose the right strategy:** Use static HTTP fetching for simple pages, and move to a JavaScript/browser-based extraction strategy when rendering or anti-bot handling is required
2020

2121
## Memory Optimization
2222

src/content/docs/ruby-gem/how-to/custom-http-requests.mdx

Lines changed: 2 additions & 1 deletion
@@ -11,7 +11,7 @@ Keep this structure in mind:
1111

1212
- `headers` stays top-level
1313
- `strategy` stays top-level
14-
- request-specific controls such as budgets and Browserless options live under `request`
14+
- request-specific controls such as budgets and strategy-specific options live under `request`
1515

1616
## When You Need Custom Headers
1717

@@ -74,6 +74,7 @@ Request budgets are configured under `request`, not as top-level keys:
7474
- `request.max_redirects` limits redirect hops
7575
- `request.max_requests` limits the total request budget for the feed build
7676
- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
77+
- `request.botasaurus.*` is reserved for Botasaurus-only behavior such as navigation mode and retries
7778

7879
## Common Use Cases
7980

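The key layout described in the hunks above (top-level `strategy` and `headers`, everything request-specific under `request`) can be sketched as a shell helper that writes a sample config. This is only an illustration: the helper name `write_sample_config` and the `User-Agent` value are assumptions, while the config keys themselves are the ones named in this commit.

```shell
#!/bin/sh
# Sketch: write a sample config that follows the documented key layout.
# write_sample_config is a hypothetical helper, not part of html2rss.
write_sample_config() {
  cat > "$1" <<'YAML'
strategy: botasaurus        # top-level: selects the fetch strategy
headers:                    # top-level: shared by all strategies
  User-Agent: "Mozilla/5.0 (example value)"
request:                    # request-specific controls live here
  max_redirects: 5
  max_requests: 6
  botasaurus:               # strategy-specific options
    navigation_mode: auto
    max_retries: 2
channel:
  url: "https://example.com/protected-listing"
auto_source: {}
YAML
}
```

Running `write_sample_config sample-feed.yml` and then `html2rss validate sample-feed.yml` would be one way to check whether the layout is accepted by the installed version.
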
src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
11
---
22
title: Handling Dynamic Content
3-
description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
3+
description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss using browser-based extraction strategies."
44
---
55

66
import { Code } from "@astrojs/starlight/components";
@@ -9,7 +9,7 @@ Some websites load their content dynamically using JavaScript. The default `html
99

1010
## Solution
1111

12-
Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.
12+
Use a [browser-based extraction strategy](/ruby-gem/reference/strategy) when JavaScript-heavy pages do not work with default static fetching.
1313

1414
Keep the strategy at the top level and put request-specific options under `request`:
1515

@@ -36,9 +36,9 @@ Keep the strategy at the top level and put request-specific options under `reque
3636
lang="yaml"
3737
/>
3838

39-
## When to Use Browserless
39+
## When to Use Browser-Based Extraction
4040

41-
The `browserless` strategy is necessary when:
41+
A browser-based extraction strategy is necessary when:
4242

4343
- **Content loads after page load** - JavaScript fetches data from APIs
4444
- **Single Page Applications (SPAs)** - React, Vue, Angular apps
@@ -100,13 +100,13 @@ These preload steps can be combined in a single config when a site needs several
100100

101101
## Performance Considerations
102102

103-
The `browserless` strategy is slower than the default `faraday` strategy because it:
103+
Browser-based extraction is slower than default static HTTP fetching because it:
104104

105105
- Launches a headless Chrome browser
106106
- Renders the full page with JavaScript
107107
- Takes more memory and CPU resources
108108

109-
**Use `faraday` for static content** and only switch to `browserless` when necessary.
109+
**Use static HTTP fetching for static content** and switch to browser-based extraction when needed. See the [Strategy Reference](/ruby-gem/reference/strategy) for concrete transports, defaults, and environment requirements.
110110

111111
## Related Topics
112112

src/content/docs/ruby-gem/reference/cli-reference.mdx

Lines changed: 24 additions & 0 deletions
@@ -23,6 +23,7 @@ Automatically discovers items from a page and prints the generated RSS feed to s
2323
code={`
2424
html2rss auto https://example.com/articles ; \
2525
html2rss auto https://example.com/app --strategy browserless ; \
26+
BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss auto https://example.com/protected --strategy botasaurus ; \
2627
html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6 ; \
2728
html2rss auto https://example.com/articles --items_selector ".post-card"
2829
`}
@@ -31,6 +32,8 @@ Automatically discovers items from a page and prints the generated RSS feed to s
3132

3233
Command: `html2rss auto URL`
3334

35+
Default behavior uses `--strategy auto`, which tries `faraday` then `botasaurus` then `browserless`.
36+
3437
#### URL Surface Guidance For `auto`
3538

3639
`auto` works best when the input URL already exposes a server-rendered list of entries.
@@ -63,6 +66,8 @@ When no extractable items are found, `auto` now classifies likely causes instead
6366

6467
Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors.
6568

69+
If you run with the default `--strategy auto`, no manual strategy override is required for fallback ordering.
70+
6671
#### Browserless Setup And Diagnostics (CLI)
6772

6873
`browserless` is opt-in for CLI usage.
@@ -97,6 +102,24 @@ If you see `Browserless connection failed`, check:
97102

98103
For custom Browserless endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
99104

105+
#### Botasaurus Environment Requirement (CLI)
106+
107+
`botasaurus` is opt-in for CLI usage and requires `BOTASAURUS_SCRAPER_URL`:
108+
109+
<Code
110+
code={`
111+
BOTASAURUS_SCRAPER_URL="http://localhost:4010" \
112+
html2rss auto https://example.com/updates --strategy botasaurus
113+
`}
114+
lang="bash"
115+
/>
116+
117+
If you see a Botasaurus configuration error, check:
118+
119+
- `BOTASAURUS_SCRAPER_URL` is set
120+
- `BOTASAURUS_SCRAPER_URL` is a valid URL
121+
- the Botasaurus scrape API is reachable from the shell environment running `html2rss`
122+
100123
### Feed
101124

102125
Loads a YAML config, builds the feed, and prints the RSS XML to stdout.
@@ -106,6 +129,7 @@ Loads a YAML config, builds the feed, and prints the RSS XML to stdout.
106129
html2rss feed single.yml ; \
107130
html2rss feed feeds.yml my-first-feed ; \
108131
html2rss feed single.yml --strategy browserless ; \
132+
BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss feed single.yml --strategy botasaurus ; \
109133
html2rss feed single.yml --max-redirects 5 --max-requests 6 ; \
110134
html2rss feed single.yml --params id:42 foo:bar
111135
`}

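The fallback order documented for `--strategy auto` (`faraday`, then `botasaurus`, then `browserless`) can be reproduced by hand with explicit `--strategy` values. The sketch below is an approximation under stated assumptions: the commit does not describe the criteria html2rss uses to move to the next strategy, so this version simply treats a non-zero exit status as failure. The function name `auto_with_fallback` and the `HTML2RSS_CMD` override are hypothetical.

```shell
#!/bin/sh
# Sketch: try each strategy in the documented order until one succeeds.
# Assumption: the CLI exits non-zero when a strategy fails to produce a feed.
# HTML2RSS_CMD is a hypothetical override used only to make the loop testable.
auto_with_fallback() {
  url="$1"
  cmd="${HTML2RSS_CMD:-html2rss}"

  for strategy in faraday botasaurus browserless; do
    if "$cmd" auto "$url" --strategy "$strategy"; then
      echo "succeeded with strategy: $strategy" >&2
      return 0
    fi
    echo "strategy failed: $strategy" >&2
  done

  # Every strategy failed.
  return 1
}
```

The feed XML still goes to stdout, while the progress messages go to stderr, so `auto_with_fallback https://example.com/articles > feed.xml` keeps the output clean. The `botasaurus` and `browserless` attempts still need their respective services and environment variables.
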
src/content/docs/ruby-gem/reference/strategy.mdx

Lines changed: 59 additions & 3 deletions
@@ -1,6 +1,6 @@
11
---
22
title: Strategy
3-
description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance."
3+
description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday, browserless, and botasaurus strategies for optimal performance."
44
---
55

66
import { Code } from "@astrojs/starlight/components";
@@ -9,10 +9,13 @@ The `strategy` key defines how `html2rss` fetches a website's content.
99

1010
- **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
1111
- **`browserless`**: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
12+
- **`botasaurus`**: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires `BOTASAURUS_SCRAPER_URL`.
1213

1314
`strategy` is a top-level config key. Request-specific controls live under `request`.
1415

15-
Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `browserless` when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links.
16+
If you use CLI `--strategy auto` (default), html2rss tries `faraday` then `botasaurus` then `browserless`.
17+
18+
Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `botasaurus` as the first explicit browser-based strategy when you have a Botasaurus scrape API. Use `browserless` when you specifically need Browserless preload actions.
1619

1720
## `browserless`
1821

@@ -62,11 +65,12 @@ Set the `strategy` at the top level of your feed configuration and put request c
6265

6366
Use this split consistently:
6467

65-
- `strategy`: selects `faraday` or `browserless`
68+
- `strategy`: selects `faraday`, `browserless`, or `botasaurus`
6669
- `headers`: top-level headers shared by all strategies
6770
- `request.max_redirects`: redirect limit for the request session
6871
- `request.max_requests`: total request budget for the whole feed build
6972
- `request.browserless.*`: Browserless-only options
73+
- `request.botasaurus.*`: Botasaurus-only options
7074

7175
Example:
7276

@@ -153,6 +157,58 @@ Check these first:
153157

154158
For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`.
155159

160+
## `botasaurus`
161+
162+
`botasaurus` delegates page fetching to a Botasaurus scrape API endpoint. This strategy is explicit opt-in and requires:
163+
164+
- `strategy: botasaurus`
165+
- `BOTASAURUS_SCRAPER_URL` set to your Botasaurus scrape API base URL (for example `http://localhost:4010`)
166+
167+
### Configuration
168+
169+
<Code
170+
code={`
171+
strategy: botasaurus
172+
request:
173+
max_redirects: 5
174+
max_requests: 6
175+
botasaurus:
176+
navigation_mode: auto
177+
max_retries: 2
178+
headless: false
179+
channel:
180+
url: "https://example.com/protected-listing"
181+
auto_source: {}
182+
`}
183+
lang="yml"
184+
/>
185+
186+
Supported `request.botasaurus` options:
187+
188+
- `navigation_mode` (`auto`, `get`, `google_get`, `google_get_bypass`)
189+
- `max_retries` (`0..3`)
190+
- `wait_for_selector`
191+
- `wait_timeout_seconds`
192+
- `block_images`
193+
- `block_images_and_css`
194+
- `wait_for_complete_page_load`
195+
- `headless`
196+
- `proxy`
197+
- `user_agent`
198+
- `window_size` (two integers, for example `[1920, 1080]`)
199+
- `lang`
200+
201+
### Command-Line Usage
202+
203+
<Code
204+
code={`
205+
BOTASAURUS_SCRAPER_URL="http://localhost:4010" \
206+
html2rss auto https://example.com/updates --strategy botasaurus ; \
207+
BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss feed my_config.yml --strategy botasaurus
208+
`}
209+
lang="sh"
210+
/>
211+
156212
---
157213

158214
For detailed documentation on the Ruby API, see the [official YARD documentation](https://www.rubydoc.info/gems/html2rss).
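
The `botasaurus` requirements above (the variable is set, it is a valid URL, and the API is reachable) can be checked before invoking `html2rss`. The following is a minimal preflight sketch, not something html2rss ships: the function name `check_botasaurus_env` and the `CHECK_REACHABLE` switch are hypothetical, and the reachability probe only confirms that the base URL answers at all, because the scrape API's endpoints are not described in this commit.

```shell
#!/bin/sh
# Sketch: preflight check for the botasaurus strategy.
# Return codes (hypothetical): 1 = unset, 2 = not an http(s) URL, 3 = unreachable.
check_botasaurus_env() {
  url="${BOTASAURUS_SCRAPER_URL:-}"

  if [ -z "$url" ]; then
    echo "BOTASAURUS_SCRAPER_URL is not set" >&2
    return 1
  fi

  # Rough format check only: require an http:// or https:// prefix plus something after it.
  case "$url" in
    http://?*|https://?*) ;;
    *)
      echo "BOTASAURUS_SCRAPER_URL does not look like an http(s) URL: $url" >&2
      return 2
      ;;
  esac

  # Optional probe, enabled with CHECK_REACHABLE=1. Without --fail, curl succeeds
  # on any HTTP response, so this tests connectivity rather than API health.
  if [ "${CHECK_REACHABLE:-0}" = "1" ]; then
    curl --silent --output /dev/null --max-time 5 "$url" || {
      echo "cannot reach $url from this shell" >&2
      return 3
    }
  fi

  return 0
}
```

One possible use is `check_botasaurus_env && html2rss feed my_config.yml --strategy botasaurus`, with `BOTASAURUS_SCRAPER_URL` exported so that both commands see it.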

src/content/docs/troubleshooting/troubleshooting.mdx

Lines changed: 7 additions & 3 deletions
@@ -32,14 +32,16 @@ The `auto` flow is URL-surface sensitive.
3232

3333
If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors.
3434

35+
If you use CLI defaults, `--strategy auto` is already active and attempts `faraday` then `botasaurus` then `browserless`.
36+
3537
### Empty Feeds
3638

3739
If your feed is empty, check the following:
3840

3941
- **URL:** Ensure the `url` in your configuration is correct and accessible.
4042
- **`items.selector`:** Verify that the `items.selector` matches the elements on the page.
4143
- **Website Changes:** Websites change their HTML structure frequently. Your selectors may be outdated.
42-
- **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`.
44+
- **JavaScript Content:** If the content is loaded via JavaScript, move from `faraday` to a rendering strategy such as `browserless` (or `botasaurus` when you use a Botasaurus scrape API).
4345
- **Authentication:** Some sites require authentication — check if you need to add headers or use a different strategy.
4446

4547
### `No scrapers found` Failure Taxonomy (`auto`)
@@ -91,7 +93,9 @@ For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
9193
Common configuration-related errors:
9294

9395
- **`UnsupportedResponseContentType`:** The website returned content that html2rss can't parse (not HTML or JSON).
94-
- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday` or `browserless`.
96+
- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday`, `browserless`, or `botasaurus`.
97+
- **`BOTASAURUS_SCRAPER_URL is required for strategy=botasaurus.`:** Set `BOTASAURUS_SCRAPER_URL` to your Botasaurus scrape API base URL when using `--strategy botasaurus`.
98+
- **`BOTASAURUS_SCRAPER_URL is invalid`:** Fix the URL format and retry.
9599
- **`Configuration must include at least 'selectors' or 'auto_source'`:** You need to specify either manual selectors or enable auto-source.
96100
- **`stylesheet.type invalid`:** Only `text/css` and `text/xsl` are supported for stylesheets.
97101

@@ -101,7 +105,7 @@ If parts of your items (e.g., title, link) are missing, check the following:
101105

102106
- **Selector:** Ensure the selector for the missing part is correct and relative to the `items.selector`.
103107
- **Extractor:** Verify that you are using the correct `extractor` (e.g., `text`, `href`, `attribute`).
104-
- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with the Browserless service available) so the page can be rendered before extraction.
108+
- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with Browserless available) or `--strategy botasaurus` (with `BOTASAURUS_SCRAPER_URL` configured) so the page can be rendered before extraction.
105109

106110
### Date/Time Parsing Errors
107111
