feat(python): Add CRW web scraper plugin #13709
@@ -0,0 +1,251 @@
# Copyright (c) Microsoft. All rights reserved.

import json
import logging
from typing import Annotated, Any

import aiohttp

from semantic_kernel.exceptions import FunctionExecutionException
from semantic_kernel.functions.kernel_function_decorator import kernel_function
from semantic_kernel.kernel_pydantic import KernelBaseModel

logger = logging.getLogger(__name__)


class WebScraperPlugin(KernelBaseModel):
    """A plugin that provides web scraping functionality using CRW.

    CRW is an open-source web scraper for AI agents that exposes a
    Firecrawl-compatible REST API. It supports scraping single pages,
    crawling entire websites, and discovering site maps.

    GitHub: https://github.com/nicepkg/crw

    Usage:
        kernel.add_plugin(
            WebScraperPlugin(base_url="http://localhost:3000"),
            "WebScraper",
        )

        # With authentication:
        kernel.add_plugin(
            WebScraperPlugin(
                base_url="http://localhost:3000",
                api_key="fc-your-api-key",
            ),
            "WebScraper",
        )

    Examples:
        {{WebScraper.scrape_url "https://example.com"}}
        {{WebScraper.crawl_website "https://example.com"}}
        {{WebScraper.map_site "https://example.com"}}
    """

    base_url: str = "http://localhost:3000"
    """Base URL of the CRW server."""

    api_key: str | None = None
    """Optional Bearer token for authenticating with the CRW server."""

    def _headers(self) -> dict[str, str]:
        """Build request headers including auth if configured."""
        headers: dict[str, str] = {"Content-Type": "application/json"}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        return headers

    async def _post(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
        """Send a POST request to the CRW server and return the JSON response."""
        url = f"{self.base_url.rstrip('/')}{path}"
        async with (
            aiohttp.ClientSession() as session,
            session.post(url, headers=self._headers(), data=json.dumps(body)) as response,
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
Suggested change (add a client timeout):
    async def _post(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
        """Send a POST request to the CRW server and return the JSON response."""
        url = f"{self.base_url.rstrip('/')}{path}"
+       timeout = aiohttp.ClientTimeout(total=30)
        async with (
-           aiohttp.ClientSession() as session,
+           aiohttp.ClientSession(timeout=timeout) as session,
Outdated

Copilot AI · Mar 26, 2026

_post() always calls response.json() before checking response.status. If CRW returns a non-JSON error body (or invalid JSON), this raises an aiohttp parsing exception instead of a FunctionExecutionException. Consider checking the status first and reading response.text() on errors, or wrapping the JSON parsing in a try/except with a safe fallback error message. The same concern applies to _get().

Bug: response.json() is called before checking response.status. If the server returns a non-JSON error (e.g., a 502 with HTML), this raises ContentTypeError/JSONDecodeError instead of FunctionExecutionException. Check the status first, then attempt JSON parsing.
Suggested change:
        ):
-           result = await response.json()
-           if response.status >= 400:
-               error_msg = result.get("error", f"HTTP {response.status}")
-               raise FunctionExecutionException(f"CRW request failed: {error_msg}")
-           return result
+           if response.status >= 400:
+               try:
+                   result = await response.json()
+                   error_msg = result.get("error", f"HTTP {response.status}")
+               except Exception:
+                   error_msg = f"HTTP {response.status}"
+               raise FunctionExecutionException(f"CRW request failed: {error_msg}")
+           return await response.json()
Outdated

Same bug in _get(): response.json() is called before the status check. Non-JSON error responses will raise an unhandled exception instead of FunctionExecutionException.
Suggested change:
            aiohttp.ClientSession() as session,
            session.get(url, headers=self._headers()) as response,
        ):
-           result = await response.json()
-           if response.status >= 400:
-               error_msg = result.get("error", f"HTTP {response.status}")
-               raise FunctionExecutionException(f"CRW request failed: {error_msg}")
-           return result
+           if response.status >= 400:
+               try:
+                   result = await response.json()
+                   error_msg = result.get("error", f"HTTP {response.status}")
+               except Exception:
+                   error_msg = f"HTTP {response.status}"
+               raise FunctionExecutionException(f"CRW request failed: {error_msg}")
+           return await response.json()
Copilot AI · Mar 26, 2026

The docstring says url "must be http or https", but the implementation only checks for an empty string. Either validate the scheme (and reject unsupported URLs early) or update the docstring so it does not state requirements that are not enforced.
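A minimal sketch of the scheme check this comment asks for. The helper name and error wording are illustrative, not part of the PR; inside the plugin the failure would likely be raised as a FunctionExecutionException rather than a ValueError:

```python
from urllib.parse import urlparse


def validate_scrape_url(url: str) -> str:
    """Reject empty URLs and any scheme other than http/https."""
    if not url:
        raise ValueError("url must not be empty")
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"url must be http or https, got {scheme or 'none'!r}")
    return url
```

This enforces the requirement the docstring already states, so the two stay in sync.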
Outdated

Copilot AI · Mar 26, 2026

The current split-and-strip logic for formats can still include empty entries (for example, from trailing commas). Consider filtering out empty strings and optionally validating against the supported set so the CRW API does not receive invalid formats.
Suggested change:
-           body["formats"] = [f.strip() for f in formats.split(",")]
+           # Normalize, filter out empty entries, and validate against supported formats
+           supported_formats = {"markdown", "html", "plainText", "links"}
+           requested_formats = [f.strip() for f in formats.split(",") if f.strip()]
+           valid_formats = [f for f in requested_formats if f in supported_formats]
+           body["formats"] = valid_formats or ["markdown"]
crawl_website's @kernel_function description says "Crawl a website starting from a URL, following links up to a specified depth", but the function returns a job ID string, not crawled content. An LLM invoking this function has no way to know that a second polling call is needed. Update the description to reflect the actual return value, or have the function poll until completion.

Suggested change:
+   description="Start an async crawl job and return a job ID. Call check_crawl_status with the returned ID to retrieve results.",
crawl_id is user-supplied and interpolated directly into the URL path without sanitization. A malicious value like ../../admin or id?x=y could alter the request target. Use urllib.parse.quote(crawl_id, safe="") to encode path-unsafe characters.
The hardcoded [:500] silently truncates page markdown with no indication to the caller that content was cut. Make this a configurable plugin attribute and signal truncation in the returned entry.

Suggested change:
+                   "markdown": page.get("markdown", "")[: self.max_markdown_preview],
Copilot AI · Mar 26, 2026

check_crawl_status() truncates each page's markdown to 500 characters, but the docstring says it returns crawl results. This truncation is a behavior change that API consumers may not expect. Consider documenting the truncation explicitly, returning the full content, or making the truncation length configurable.
Outdated

Likely bug: Firecrawl v1 /map returns {"success": true, "links": [...]}, with the links at the top level rather than nested under data. If CRW follows the Firecrawl spec, this will always return []. Should this be result.get("links", [])?

Suggested change:
-       links = result.get("data", {}).get("links", [])
+       links = result.get("links", [])
logger is defined but never used in this module, which will fail Ruff (unused variable). Remove the logging import and logger = logging.getLogger(__name__), or use logger for actual logging in this plugin.