
Commit ce3b3f7

docs: tighten strategy UX wording for end users
1 parent 7dedfbb commit ce3b3f7

7 files changed: 45 additions & 42 deletions


src/content/docs/creating-custom-feeds.mdx

Lines changed: 2 additions & 10 deletions
@@ -11,14 +11,6 @@ When auto-sourcing isn't enough, you can write your own configuration files to c
 
 **Prerequisites:** You should be familiar with the [Getting Started](/getting-started) guide before diving into custom configurations.
 
-<Aside type="note" title="Release note">
-  This guide tracks the current documentation tree and may describe features that have not yet shipped in the
-  latest released `html2rss` gem. If you want the newest integrated behavior, prefer running
-  [`html2rss-web`](/web-application/getting-started) via Docker. The web application ships as a rolling
-  release and usually reflects the latest development state of the gem first. See [Versioning and
-  releases](/web-application/reference/versioning-and-releases/) for details.
-</Aside>
-
 <Aside type="tip" title="Use this guide when you need more control">
   Start with included feeds first. If your site is not covered, try [automatic feed
   generation](/web-application/how-to/use-automatic-feed-generation/) next. Reach for a custom config when you

@@ -48,7 +40,7 @@ When auto-sourcing isn't enough, you can write your own configuration files to c
 3. **Validate the config** with `html2rss validate your-config.yml`
 4. **Render the feed** with `html2rss feed your-config.yml`
 5. **Add it to `html2rss-web`** so you can use it through your normal instance
-6. **Escalate strategy when needed**: if static fetching is insufficient, switch to a JavaScript/browser-based extraction strategy
+6. **Escalate request strategy when needed**: use a browser-based rendering strategy only when troubleshooting requires it
 
 This order keeps iteration fast and makes it easier to see whether the problem is the page structure, your
 selectors, or the fetch strategy.

@@ -210,7 +202,7 @@ there.
 - **No items found?** Check your selectors with browser tools (F12) - the `items.selector` might not match the page structure
 - **Invalid YAML?** Use spaces, not tabs, and ensure proper indentation
 - **Website not loading?** Check the URL and try accessing it in your browser
-- **Missing content?** Some websites load content with JavaScript - you may need a JavaScript/browser-based extraction strategy instead of plain HTTP fetching
+- **Missing content?** Try a browser-based rendering strategy during troubleshooting
 - **Wrong data extracted?** Verify your selectors are pointing to the right elements
 
 **Need more help?** See our [comprehensive troubleshooting guide](/troubleshooting/troubleshooting) or ask in [GitHub Discussions](https://github.com/orgs/html2rss/discussions).
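The workflow above validates and renders a YAML config (`html2rss validate your-config.yml`, `html2rss feed your-config.yml`). A minimal config of the kind this guide describes might look like the following sketch; the URL and selector values are illustrative placeholders, not taken from this commit:

```yaml
# Minimal custom feed config (hypothetical example values).
channel:
  url: https://example.com/blog   # page to fetch
selectors:
  items:
    selector: ".post-card"        # one match per feed item
  title:
    selector: "h2"                # evaluated relative to each item
  link:
    selector: "a"
    extractor: href               # pull the href attribute instead of text
```

Save it, then run `html2rss validate your-config.yml` before rendering, per steps 3 and 4 above.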

src/content/docs/getting-started.mdx

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ If you are working directly with the gem instead of `html2rss-web`, start with:
 
 <Code code={`html2rss auto https://example.com/blog`} lang="bash" />
 
+For strategy behavior and manual overrides, see the [Strategy reference](/ruby-gem/reference/strategy).
+
 If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
 
 <Code code={`html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`} lang="bash" />

src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx

Lines changed: 3 additions & 1 deletion
@@ -5,12 +5,14 @@ description: "Learn how to handle JavaScript-heavy websites and dynamic content
 
 import { Code } from "@astrojs/starlight/components";
 
-Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content.
+Some websites load their content dynamically using JavaScript. Static fetch paths may not see this content reliably.
 
 ## Solution
 
 Use a [browser-based extraction strategy](/ruby-gem/reference/strategy) when JavaScript-heavy pages do not work with default static fetching.
 
+`browserless` is common for this workflow, and `botasaurus` is an alternate browser-based strategy when you run a Botasaurus scrape API.
+
 Keep the strategy at the top level and put request-specific options under `request`:
 
 <Code

src/content/docs/ruby-gem/reference/cli-reference.mdx

Lines changed: 13 additions & 9 deletions
@@ -22,17 +22,16 @@ Automatically discovers items from a page and prints the generated RSS feed to s
 <Code
   code={`
 html2rss auto https://example.com/articles ; \
-html2rss auto https://example.com/app --strategy browserless ; \
-BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss auto https://example.com/protected --strategy botasaurus ; \
 html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6 ; \
+BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss auto https://example.com/protected --strategy botasaurus ; \
 html2rss auto https://example.com/articles --items_selector ".post-card"
 `}
 lang="bash"
 />
 
 Command: `html2rss auto URL`
 
-Default behavior uses `--strategy auto`, which tries `faraday` then `botasaurus` then `browserless`.
+Default behavior is `--strategy auto`, which tries `faraday` then `botasaurus` then `browserless`.
 
 #### URL Surface Guidance For `auto`

@@ -52,25 +51,29 @@ When possible, pass a direct listing/update URL instead of a top-level homepage
 
 #### Failure Outcomes You Should Expect
 
-When no extractable items are found, `auto` now classifies likely causes instead of only returning a generic message:
+When no extractable items are found, `auto` classifies likely causes instead of only returning a generic message:
 
 - `blocked surface likely (anti-bot or interstitial)`:
-  - retry with `--strategy browserless`
   - try a more specific public listing URL
 - `app-shell surface detected`:
-  - retry with `--strategy browserless`
   - switch to a direct listing/update URL
 - `unsupported extraction surface for auto mode`:
   - switch to listing/changelog/category URLs
   - use explicit selectors in a feed config
 
 Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors.
 
-If you run with the default `--strategy auto`, no manual strategy override is required for fallback ordering.
+If all fallback tiers run but still extract zero items, html2rss raises:
+
+- `No RSS feed items extracted after auto fallback ...`
+
+If failures continue after URL/surface fixes, retry with an explicit browser-based override (`--strategy browserless`), or `--strategy botasaurus` when `BOTASAURUS_SCRAPER_URL` is configured.
+
+Start by changing the input URL to a direct listing/update page, then move to explicit selectors if needed.
 
 #### Browserless Setup And Diagnostics (CLI)
 
-`browserless` is opt-in for CLI usage.
+`browserless` is an explicit override for CLI usage.
 
 <Code
   code={`

@@ -104,7 +107,7 @@ For custom Browserless endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
 
 #### Botasaurus Environment Requirement (CLI)
 
-`botasaurus` is opt-in for CLI usage and requires `BOTASAURUS_SCRAPER_URL`:
+`botasaurus` is an explicit override for CLI usage and requires `BOTASAURUS_SCRAPER_URL`:
 
 <Code
   code={`

@@ -128,6 +131,7 @@ Loads a YAML config, builds the feed, and prints the RSS XML to stdout.
   code={`
 html2rss feed single.yml ; \
 html2rss feed feeds.yml my-first-feed ; \
+html2rss feed single.yml --strategy auto ; \
 html2rss feed single.yml --strategy browserless ; \
 BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss feed single.yml --strategy botasaurus ; \
 html2rss feed single.yml --max-redirects 5 --max-requests 6 ; \
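The `html2rss feed feeds.yml my-first-feed` form above implies a multi-feed YAML where each feed lives under a named key. As a sketch only: the `feeds:` wrapper and the feed body below are assumptions based on common html2rss-web conventions, not content from this commit:

```yaml
# Hypothetical layout for `html2rss feed feeds.yml my-first-feed`.
feeds:
  my-first-feed:
    channel:
      url: https://example.com/articles
    selectors:
      items:
        selector: ".post-card"   # placeholder selector
      title:
        selector: "h2"
```

Passing the feed name as the second CLI argument would then select one entry from this file.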

src/content/docs/ruby-gem/reference/strategy.mdx

Lines changed: 10 additions & 5 deletions
@@ -1,21 +1,26 @@
 ---
 title: Strategy
-description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday, browserless, and botasaurus strategies for optimal performance."
+description: "Learn how html2rss chooses request strategies by default with auto fallback, and when to override with faraday, botasaurus, or browserless."
 ---
 
 import { Code } from "@astrojs/starlight/components";
 
 The `strategy` key defines how `html2rss` fetches a website's content.
 
-- **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
+- **`auto`** (default): Tries concrete strategies in order: `faraday` -> `botasaurus` -> `browserless`.
+- **`faraday`**: Makes a direct HTTP request. It is fast but does not execute JavaScript.
 - **`browserless`**: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
 - **`botasaurus`**: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires `BOTASAURUS_SCRAPER_URL`.
 
 `strategy` is a top-level config key. Request-specific controls live under `request`.
 
-If you use CLI `--strategy auto` (default), html2rss tries `faraday` then `botasaurus` then `browserless`.
+`auto` falls back to the next strategy when the current attempt errors or extracts zero items. Use explicit `--strategy ...` only when you need to force a specific transport for troubleshooting or reproducibility.
 
-Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `botasaurus` as the first explicit browser-based strategy when you have a Botasaurus scrape API. Use `browserless` when you specifically need Browserless preload actions.
+## `auto` (default)
+
+The default strategy chain is:
+
+`faraday` -> `botasaurus` -> `browserless`
 
 ## `browserless`

@@ -65,7 +70,7 @@ Set the `strategy` at the top level of your feed configuration and put request c
 
 Use this split consistently:
 
-- `strategy`: selects `faraday`, `browserless`, or `botasaurus`
+- `strategy`: selects `auto`, `faraday`, `browserless`, or `botasaurus`
- `headers`: top-level headers shared by all strategies
 - `request.max_redirects`: redirect limit for the request session
 - `request.max_requests`: total request budget for the whole feed build
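Putting the split above together, a feed config that pins a strategy and sets request limits might look like this sketch; all values are illustrative and not taken from this commit:

```yaml
strategy: browserless          # top level: auto | faraday | browserless | botasaurus
headers:                       # top-level headers shared by all strategies
  User-Agent: "example-feed-bot"
channel:
  url: https://example.com/app
request:
  max_redirects: 5             # redirect limit for the request session
  max_requests: 6              # total request budget for the feed build
```

Keeping `strategy` and `headers` at the top level while nesting limits under `request` matches the split the reference describes.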

src/content/docs/troubleshooting/troubleshooting.mdx

Lines changed: 6 additions & 6 deletions
@@ -32,34 +32,34 @@ The `auto` flow is URL-surface sensitive.
 
 If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors.
 
-If you use CLI defaults, `--strategy auto` is already active and attempts `faraday` then `botasaurus` then `browserless`.
-
 ### Empty Feeds
 
 If your feed is empty, check the following:
 
 - **URL:** Ensure the `url` in your configuration is correct and accessible.
 - **`items.selector`:** Verify that the `items.selector` matches the elements on the page.
 - **Website Changes:** Websites change their HTML structure frequently. Your selectors may be outdated.
-- **JavaScript Content:** If the content is loaded via JavaScript, move from `faraday` to a rendering strategy such as `browserless` (or `botasaurus` when you use a Botasaurus scrape API).
+- **JavaScript Content:** If the content is loaded via JavaScript, use a browser-based rendering strategy.
 - **Authentication:** Some sites require authentication — check if you need to add headers or use a different strategy.
 
 ### `No scrapers found` Failure Taxonomy (`auto`)
 
 `auto` classifies no-scraper failures with actionable hints:
 
 - **Blocked surface likely (anti-bot or interstitial):**
-  - retry with `--strategy browserless`
   - try a more specific public listing URL
 - **App-shell surface detected:**
-  - retry with `--strategy browserless`
   - target a direct listing/update page instead of homepage/shell entrypoint
 - **Unsupported extraction surface for auto mode:**
   - switch to listing/changelog/category URLs
   - or use explicit selectors in YAML config
 
 Known anti-bot interstitial patterns (for example Cloudflare challenge pages) are surfaced as blocked-surface errors instead of silent empty extraction results.
 
+When all auto fallback tiers complete but still extract zero items, html2rss raises `No RSS feed items extracted after auto fallback ...`.
+
+If failures continue after URL/surface fixes, retry with an explicit browser-based override (`--strategy browserless`), or `--strategy botasaurus` when `BOTASAURUS_SCRAPER_URL` is configured.
+
 ### Browserless Connection / Setup Failures
 
 If you receive `Browserless connection failed (...)`:

@@ -93,7 +93,7 @@ For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
 Common configuration-related errors:
 
 - **`UnsupportedResponseContentType`:** The website returned content that html2rss can't parse (not HTML or JSON).
-- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday`, `browserless`, or `botasaurus`.
+- **`UnsupportedStrategy`:** The specified strategy is not available. Use `auto`, `faraday`, `browserless`, or `botasaurus`.
 - **`BOTASAURUS_SCRAPER_URL is required for strategy=botasaurus.`:** Set `BOTASAURUS_SCRAPER_URL` to your Botasaurus scrape API base URL when using `--strategy botasaurus`.
 - **`BOTASAURUS_SCRAPER_URL is invalid`:** Fix the URL format and retry.
 - **`Configuration must include at least 'selectors' or 'auto_source'`:** You need to specify either manual selectors or enable auto-source.

src/content/docs/web-application/how-to/use-automatic-feed-generation.mdx

Lines changed: 9 additions & 11 deletions
@@ -42,7 +42,7 @@ Then restart the stack:
 1. Open your instance at `http://localhost:4000`
 2. Paste a page URL into `Create a feed`
 3. Add a valid access token when prompted
-4. Choose a strategy if needed, then submit
+4. Submit the request
 5. Copy the generated feed URL or open it directly
 
 ## What Success Looks Like

@@ -59,23 +59,21 @@ That is enough to confirm the self-hosted flow is working.
 
 ## Strategy Behavior
 
-- `faraday` is the default strategy and should be your first try for most pages.
-- During the feed-creation API request (`POST /api/v1/feeds`) from the web UI, a `faraday` submission may be retried once with `browserless` when the first failure looks retryable.
-- If that fallback attempt fails, or if the first failure is clearly auth/URL/unsupported-strategy related, the UI stops and shows an error.
-- This retry behavior is scoped to feed creation. It is not a general retry layer for later feed rendering (`GET /api/v1/feeds/:token`) or preview loading.
+- Feed creation uses the backend default strategy behavior.
+- If feed creation fails, the UI surfaces structured retry/error guidance rather than exposing low-level strategy controls.
 
 ## Input URL Guidance (Quality First)
 
 Automatic generation is most successful when the input URL is already a listing/update surface.
 
 - Higher-success inputs:
-    - newsroom/press listing pages
-    - category/tag/archive/listing pages
-    - changelog/release/update pages
+  - newsroom/press listing pages
+  - category/tag/archive/listing pages
+  - changelog/release/update pages
 - Lower-success inputs:
-    - generic homepages
-    - search pages
-    - app-shell entrypoints (client-rendered shells)
+  - generic homepages
+  - search pages
+  - app-shell entrypoints (client-rendered shells)
 
 If output quality is poor, switch the input to a direct listing/update URL before assuming the feature is broken.
8179
