src/content/docs/creating-custom-feeds.mdx
Lines changed: 3 additions & 3 deletions
@@ -48,7 +48,7 @@ When auto-sourcing isn't enough, you can write your own configuration files to c
 3. **Validate the config** with `html2rss validate your-config.yml`
 4. **Render the feed** with `html2rss feed your-config.yml`
 5. **Add it to `html2rss-web`** so you can use it through your normal instance
-6. **Escalate to `browserless`** if the content is rendered by JavaScript
+6. **Escalate strategy when needed**: if static fetching is insufficient, switch to a JavaScript/browser-based extraction strategy

 This order keeps iteration fast and makes it easier to see whether the problem is the page structure, your
 selectors, or the fetch strategy.
@@ -210,7 +210,7 @@ there.
 - **No items found?** Check your selectors with browser tools (F12) - the `items.selector` might not match the page structure
 - **Invalid YAML?** Use spaces, not tabs, and ensure proper indentation
 - **Website not loading?** Check the URL and try accessing it in your browser
-- **Missing content?** Some websites load content with JavaScript - you may need to use the `browserless` strategy
+- **Missing content?** Some websites load content with JavaScript - you may need a JavaScript/browser-based extraction strategy instead of plain HTTP fetching
 - **Wrong data extracted?** Verify your selectors are pointing to the right elements

 **Need more help?** See our [comprehensive troubleshooting guide](/troubleshooting/troubleshooting) or ask in [GitHub Discussions](https://github.com/orgs/html2rss/discussions).
@@ -234,5 +234,5 @@ there.

 - **[Browse existing configs](https://github.com/html2rss/html2rss-configs/tree/master/lib/html2rss/configs)** - See real examples
 - **[Join discussions](https://github.com/orgs/html2rss/discussions)** - Connect with other users
-- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use `browserless`
+- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use static vs JavaScript/browser-based extraction
 - **[Learn advanced features](/ruby-gem/how-to/advanced-features/)** - Take your configs to the next level
src/content/docs/ruby-gem/how-to/advanced-features.mdx
Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ html2rss uses parallel processing in auto-source discovery. This happens automat
 1. **Use appropriate selectors:** More specific selectors reduce processing time
 2. **Limit items when possible:** Use CSS selectors that target only the content you need
 3. **Cache responses:** The web application caches responses automatically
-4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required
+4. **Choose the right strategy:** Use static HTTP fetching for simple pages, and move to a JavaScript/browser-based extraction strategy when rendering or anti-bot handling is required
src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
 ---
 title: Handling Dynamic Content
-description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
+description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss using browser-based extraction strategies."
@@ -100,13 +100,13 @@ These preload steps can be combined in a single config when a site needs several

 ## Performance Considerations

-The `browserless` strategy is slower than the default `faraday` strategy because it:
+Browser-based extraction is slower than default static HTTP fetching because it:

 - Launches a headless Chrome browser
 - Renders the full page with JavaScript
 - Takes more memory and CPU resources

-**Use `faraday` for static content** and only switch to `browserless` when necessary.
+**Use static HTTP fetching for static content** and switch to browser-based extraction when needed. See the [Strategy Reference](/ruby-gem/reference/strategy) for concrete transports, defaults, and environment requirements.
src/content/docs/ruby-gem/reference/strategy.mdx
Lines changed: 59 additions & 3 deletions
@@ -1,6 +1,6 @@
 ---
 title: Strategy
-description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance."
+description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday, browserless, and botasaurus strategies for optimal performance."
@@ -9,10 +9,13 @@ The `strategy` key defines how `html2rss` fetches a website's content.

 - **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
 - **`browserless`**: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
+- **`botasaurus`**: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires `BOTASAURUS_SCRAPER_URL`.

 `strategy` is a top-level config key. Request-specific controls live under `request`.

-Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `browserless` when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links.
+If you use CLI `--strategy auto` (default), html2rss tries `faraday` then `botasaurus` then `browserless`.
+
+Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `botasaurus` as the first explicit browser-based strategy when you have a Botasaurus scrape API. Use `browserless` when you specifically need Browserless preload actions.

 ## `browserless`

@@ -62,11 +65,12 @@ Set the `strategy` at the top level of your feed configuration and put request c

 Use this split consistently:

-- `strategy`: selects `faraday` or `browserless`
+- `strategy`: selects `faraday`, `browserless`, or `botasaurus`
 - `headers`: top-level headers shared by all strategies
 - `request.max_redirects`: redirect limit for the request session
 - `request.max_requests`: total request budget for the whole feed build

 For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`.

+## `botasaurus`
+
+`botasaurus` delegates page fetching to a Botasaurus scrape API endpoint. This strategy is explicit opt-in and requires:
+
+- `strategy: botasaurus`
+- `BOTASAURUS_SCRAPER_URL` set to your Botasaurus scrape API base URL (for example `http://localhost:4010`)
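
Note: the top-level/`request` split described in this file could look roughly like the fragment below. This is a minimal sketch, not a config taken from the PR: the URL, selector, header value, and numeric limits are placeholders, and the `channel`/`selectors` layout is assumed from typical html2rss feed configs rather than shown in this diff.

```yaml
# Hypothetical feed config illustrating the key split described above.
# All values are illustrative placeholders.
strategy: botasaurus          # top-level: faraday, browserless, or botasaurus
headers:                      # top-level: shared by all strategies
  User-Agent: "example-agent/1.0"
request:
  max_redirects: 3            # redirect limit for the request session
  max_requests: 5             # total request budget for the whole feed build
channel:                      # assumed layout, not shown in this diff
  url: https://example.com/news
selectors:                    # assumed layout, not shown in this diff
  items:
    selector: "article"
```

With `strategy: botasaurus`, the documented requirement is that `BOTASAURUS_SCRAPER_URL` is set in the environment (for example `http://localhost:4010`) before the feed is rendered.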
src/content/docs/troubleshooting/troubleshooting.mdx
Lines changed: 7 additions & 3 deletions
@@ -32,14 +32,16 @@ The `auto` flow is URL-surface sensitive.

 If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors.

+If you use CLI defaults, `--strategy auto` is already active and attempts `faraday` then `botasaurus` then `browserless`.
+
 ### Empty Feeds

 If your feed is empty, check the following:

 - **URL:** Ensure the `url` in your configuration is correct and accessible.
 - **`items.selector`:** Verify that the `items.selector` matches the elements on the page.
 - **Website Changes:** Websites change their HTML structure frequently. Your selectors may be outdated.
-- **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`.
+- **JavaScript Content:** If the content is loaded via JavaScript, move from `faraday` to a rendering strategy such as `browserless` (or `botasaurus` when you use a Botasaurus scrape API).
 - **Authentication:** Some sites require authentication - check if you need to add headers or use a different strategy.

 ### `No scrapers found` Failure Taxonomy (`auto`)
@@ -91,7 +93,9 @@ For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
 Common configuration-related errors:

 - **`UnsupportedResponseContentType`:** The website returned content that html2rss can't parse (not HTML or JSON).
-- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday` or `browserless`.
+- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday`, `browserless`, or `botasaurus`.
+- **`BOTASAURUS_SCRAPER_URL is required for strategy=botasaurus.`:** Set `BOTASAURUS_SCRAPER_URL` to your Botasaurus scrape API base URL when using `--strategy botasaurus`.
+- **`BOTASAURUS_SCRAPER_URL is invalid`:** Fix the URL format and retry.
 - **`Configuration must include at least 'selectors' or 'auto_source'`:** You need to specify either manual selectors or enable auto-source.
 - **`stylesheet.type invalid`:** Only `text/css` and `text/xsl` are supported for stylesheets.

@@ -101,7 +105,7 @@ If parts of your items (e.g., title, link) are missing, check the following:

 - **Selector:** Ensure the selector for the missing part is correct and relative to the `items.selector`.
 - **Extractor:** Verify that you are using the correct `extractor` (e.g., `text`, `href`, `attribute`).
-- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with the Browserless service available) so the page can be rendered before extraction.
+- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with Browserless available) or `--strategy botasaurus` (with `BOTASAURUS_SCRAPER_URL` configured) so the page can be rendered before extraction.