The web_fetch tool is one of two tools available to the Rabbithole LLM during page
generation (the other being web_search).
It allows the LLM to retrieve the text content of any publicly accessible web page by URL,
which can then be incorporated into the generated HTML response.
A typical use case is fetching documentation, reference data, a README from GitHub, or article text that the LLM needs to quote, summarize, or link to within the generated page. The tool handles the HTTP request, follows redirects, strips HTML markup, and returns the resulting plain text — truncated at a configurable byte limit — back to the model as a tool result.
Whether web_fetch is available during page generation depends on the Rabbithole server
configuration. Both tools may be independently enabled or disabled via rabbithole.toml.
See Configuration.
The tool is registered with the LLM provider (e.g., Anthropic Claude, OpenAI) as a structured tool definition. The schema Rabbithole uses is:
{
  "name": "web_fetch",
  "description": "Fetch the text content of a web page by URL. Returns the page body as plain text (HTML tags stripped). Useful for reading documentation, articles, or reference material to incorporate into the generated page.",
  "input_schema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to fetch"
      }
    },
    "required": ["url"]
  }
}
This schema is sent to the LLM as part of the system prompt / tool registration block at the start of every page-generation request. The model may invoke it zero or more times before producing its final HTML output.
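As an illustration, such a definition could be built in Rust with serde_json roughly as follows. This is a sketch only; Rabbithole's actual registration code and any provider-specific wrapping may differ.

use serde_json::{json, Value};

/// Build the web_fetch tool definition sent to the LLM provider.
/// Field names follow the schema shown above; this is illustrative only.
fn web_fetch_tool_definition() -> Value {
    json!({
        "name": "web_fetch",
        "description": "Fetch the text content of a web page by URL. Returns the page body \
                        as plain text (HTML tags stripped).",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": { "type": "string", "description": "The URL to fetch" }
            },
            "required": ["url"]
        }
    })
}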
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | The fully-qualified URL to fetch. Must begin with http:// or https://. Other schemes (e.g., file://, ftp://) are rejected. Relative URLs are not accepted; the LLM must supply an absolute URL. |
There are no optional parameters in the tool call itself. Behaviour modifiers such as
byte limits, allowed domains, and blocked domains are configured server-side in
rabbithole.toml and are invisible to the LLM.
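The scheme rule sketches out to something like the following, assuming the url crate; the exact error wording beyond the documented "Error fetching URL:" prefix is illustrative, not Rabbithole's actual messages.

use url::Url;

/// Reject anything that is not an absolute http:// or https:// URL.
/// Url::parse also rejects relative URLs, matching the rule above.
fn validate_scheme(raw: &str) -> Result<Url, String> {
    let parsed = Url::parse(raw)
        .map_err(|e| format!("Error fetching URL: invalid URL ({e})"))?;
    match parsed.scheme() {
        "http" | "https" => Ok(parsed),
        other => Err(format!("Error fetching URL: unsupported scheme '{other}'")),
    }
}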
On success, the tool returns a plain-text string containing the extracted body text of the fetched page. Specifically:
- <script> and <style> blocks are removed entirely (content and tags).
- The text is truncated at max_fetch_bytes (default: 20,000 bytes) to prevent extremely large responses from consuming the model's context window.

On failure, the tool returns an error message string (not a structured error object) so the model can read and react to the failure. Error strings are prefixed to make them distinguishable — for example:

Error fetching URL: HTTP 403 Forbidden
Error fetching URL: domain not in allowed list
Error fetching URL: connection timed out after 10s
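As a sketch of this behaviour, the fetch pipeline can be pictured as handing back a Result that is flattened into the string the model sees. Only the "Error fetching URL:" prefix is documented above; the reason text depends on the underlying failure.

/// Convert a fetch outcome into the plain string handed back to the model as
/// the tool result. Errors become readable strings rather than structured objects.
fn to_tool_result(outcome: Result<String, String>) -> String {
    match outcome {
        Ok(text) => text,
        Err(reason) => format!("Error fetching URL: {reason}"),
    }
}

So an Err("HTTP 403 Forbidden".to_string()) would surface to the model as "Error fetching URL: HTTP 403 Forbidden".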
Rabbithole performs HTML-to-text conversion on the server side before returning the result to the LLM. The processing pipeline is:
1. The response body is decoded using the charset declared in the Content-Type header, falling back to UTF-8.
2. Content inside <script>…</script> and <style>…</style> tags is deleted before further processing, including the tags themselves.
3. All remaining HTML tags are removed, leaving only the text content.
4. HTML entities such as &amp;, &lt;, and &gt; are decoded to their Unicode equivalents.
5. The resulting text is truncated to max_fetch_bytes bytes (UTF-8 encoded). No truncation marker is appended automatically; the text simply ends at the cut-off point.
If the response Content-Type does not contain text/html, the
stripping step is skipped and the raw body is returned (still subject to truncation).
This allows fetching JSON APIs, plain-text files, Markdown documents (e.g., GitHub raw
URLs), etc.
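As a rough, dependency-free sketch of steps 2–4 above (this is not Rabbithole's actual implementation; a production stripper must also handle malformed markup, the full entity set, and charset detection):

/// Very simplified HTML-to-text pass: drops <script>/<style> blocks (tags and
/// content), removes remaining tags, and decodes a few common entities.
fn strip_html(html: &str) -> String {
    let lower = html.to_ascii_lowercase(); // same byte layout, used for case-insensitive search
    let mut out = String::with_capacity(html.len());
    let mut i = 0;

    while i < html.len() {
        if lower[i..].starts_with("<script") || lower[i..].starts_with("<style") {
            // Skip the whole element, including its content and closing tag.
            let close = if lower[i..].starts_with("<script") { "</script>" } else { "</style>" };
            match lower[i..].find(close) {
                Some(rel) => i += rel + close.len(),
                None => break, // unclosed element: drop the rest of the document
            }
        } else if html[i..].starts_with('<') {
            // Drop any other tag up to its closing '>'.
            match html[i..].find('>') {
                Some(rel) => i += rel + 1,
                None => break,
            }
        } else {
            // Copy plain text up to the next tag.
            let next = html[i..].find('<').map(|rel| i + rel).unwrap_or(html.len());
            out.push_str(&html[i..next]);
            i = next;
        }
    }

    // Decode a few common entities; a real implementation handles many more.
    out.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}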
Truncation limit (max_fetch_bytes)

To avoid filling the LLM's context window with a single large fetch, Rabbithole enforces a hard byte limit on the returned text. The relevant configuration key is:
# rabbithole.toml
[tools]
max_fetch_bytes = 20000   # default: 20,000 bytes (~20 KB of plain text)
| Config key | Default | Unit | Notes |
|---|---|---|---|
| tools.max_fetch_bytes | 20000 | bytes (UTF-8) | Applies after HTML stripping. Set to 0 to disable the limit (not recommended). |
The truncation is applied to the post-strip plain-text bytes, not the raw HTTP response. A 500 KB HTML page may strip down to 40 KB of text; only the first 20 KB of that text would be returned to the model.
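A sketch of byte-limited truncation that respects UTF-8 character boundaries; how Rabbithole handles a cut that would land mid-character is not documented here, so that detail is an assumption.

/// Truncate `text` to at most `max_bytes` bytes without splitting a UTF-8
/// character. No truncation marker is appended, matching the behaviour above.
fn truncate_to_bytes(text: &str, max_bytes: usize) -> &str {
    if max_bytes == 0 || text.len() <= max_bytes {
        return text; // limit disabled (0) or already within bounds
    }
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1; // back up to the previous character boundary
    }
    &text[..end]
}

For example, truncate_to_bytes(stripped_text, 20_000) would return at most the first 20,000 bytes of the post-strip text.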
Rabbithole follows HTTP redirects automatically, up to a maximum of 5 hops. Each redirect target URL is validated against the same domain allow/block lists as the original request. If a redirect leads to a blocked domain or exceeds the hop limit, the request fails with an appropriate error message.
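Sketched with the reqwest crate and automatic redirects disabled so that each hop can be re-validated, the loop might look like this. validate_url is a hypothetical stand-in for the scheme and domain checks described on this page, and the error wording is illustrative.

use reqwest::{redirect::Policy, Client, Response};

const MAX_REDIRECT_HOPS: usize = 5;

/// Follow redirects manually so every hop can be re-checked against the
/// allow/block lists before the next request is made.
async fn fetch_with_redirects(
    start: &str,
    validate_url: impl Fn(&str) -> Result<(), String>,
) -> Result<Response, String> {
    let client = Client::builder()
        .redirect(Policy::none()) // we follow redirects ourselves
        .build()
        .map_err(|e| format!("Error fetching URL: {e}"))?;

    let mut url = start.to_string();
    for _ in 0..=MAX_REDIRECT_HOPS {
        validate_url(url.as_str())?;
        let resp = client
            .get(url.as_str())
            .send()
            .await
            .map_err(|e| format!("Error fetching URL: {e}"))?;

        if !resp.status().is_redirection() {
            return Ok(resp);
        }
        // Resolve the Location header (which may be relative) against the current URL.
        let location = resp
            .headers()
            .get(reqwest::header::LOCATION)
            .and_then(|v| v.to_str().ok())
            .ok_or_else(|| "Error fetching URL: redirect without Location header".to_string())?;
        url = reqwest::Url::parse(&url)
            .and_then(|base| base.join(location))
            .map_err(|e| format!("Error fetching URL: {e}"))?
            .to_string();
    }
    Err("Error fetching URL: too many redirects".to_string())
}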
If the final HTTP response (after following redirects) has a status code other than
2xx, the tool returns an error string rather than the response body.
This prevents the LLM from inadvertently treating an error page's HTML as real content.
| Status range | Behaviour |
|---|---|
| 2xx | Success — body is processed and returned. |
| 3xx | Redirect — followed automatically up to 5 hops. |
| 4xx (e.g., 403, 404, 429) | Client error — returns error string: Error fetching URL: HTTP 4xx <reason> |
| 5xx | Server error — returns error string: Error fetching URL: HTTP 5xx <reason> |
Connection errors (DNS failure, TCP timeout, TLS handshake failure, etc.) also return an error string. The default connection timeout is 10 seconds. This can be adjusted in configuration:
# rabbithole.toml
[tools]
fetch_timeout_secs = 10   # default: 10 seconds
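In client-building terms this corresponds roughly to the following sketch; whether Rabbithole applies the value as a connect timeout, a total request timeout, or both is not specified here.

use std::time::Duration;
use reqwest::Client;

/// Build an HTTP client honouring fetch_timeout_secs from the configuration.
fn build_fetch_client(fetch_timeout_secs: u64) -> reqwest::Result<Client> {
    Client::builder()
        .connect_timeout(Duration::from_secs(fetch_timeout_secs))
        .timeout(Duration::from_secs(fetch_timeout_secs))
        .build()
}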
allowed_domains & blocked_domains
Rabbithole operators can constrain which hosts the web_fetch tool is permitted
to contact. Two complementary configuration lists exist:
allowed_domains (allowlist)

If this list is non-empty, only URLs whose hostname exactly matches or is a subdomain of an entry in the list are permitted. All other hostnames are rejected with:
Error fetching URL: domain not in allowed list
Example configuration:
# rabbithole.toml
[tools]
allowed_domains = [
"docs.rs",
"crates.io",
"github.com",
"raw.githubusercontent.com",
]
This is the most secure option for production deployments where you know in advance which
external sources the LLM should be able to read. An empty allowed_domains
list means no allowlist restriction is applied (all domains are permitted, subject
to blocked_domains).
blocked_domains (blocklist)
Entries in this list are always refused, even if allowed_domains is empty or
the domain would otherwise match an allowlist entry. The blocked list takes precedence.
# rabbithole.toml
[tools]
blocked_domains = [
"169.254.169.254", # AWS/GCP metadata service
"metadata.google.internal",
"example-internal.corp",
]
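The combined matching rules can be sketched as follows: blocked entries always win, allowlist entries match exactly or as a parent domain, and an empty allowlist means no restriction. The error wording for the blocked case is illustrative; only the allowlist message is documented above.

/// Decide whether `host` may be fetched under the configured lists.
/// A host matches a list entry if it equals the entry or is a subdomain of it.
fn domain_permitted(host: &str, allowed: &[String], blocked: &[String]) -> Result<(), String> {
    let matches = |entry: &str| host == entry || host.ends_with(&format!(".{entry}"));

    // Blocked entries always win, even over an allowlist match.
    if blocked.iter().any(|e| matches(e.as_str())) {
        return Err("Error fetching URL: domain is blocked".to_string());
    }
    // An empty allowlist means no allowlist restriction is applied.
    if !allowed.is_empty() && !allowed.iter().any(|e| matches(e.as_str())) {
        return Err("Error fetching URL: domain not in allowed list".to_string());
    }
    Ok(())
}

Under this rule, "docs.rs" in allowed_domains permits both docs.rs and any subdomain of it, but not a lookalike such as notdocs.rs.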
Regardless of configuration, Rabbithole always rejects requests to loopback addresses and the local machine. The following are hardcoded and cannot be overridden by configuration:
- localhost
- 127.0.0.0/8 (entire loopback range)
- ::1 (IPv6 loopback)
- 0.0.0.0

After DNS resolution, the resolved IP address is also checked against these ranges, to guard against DNS rebinding attacks where a public hostname resolves to a private IP.
Note that private (RFC 1918) ranges (10.x.x.x, 172.16.x.x–172.31.x.x, 192.168.x.x) and the
link-local range (169.254.x.x) are not blocked by default. Operators are strongly
recommended to add them to blocked_domains or enforce restrictions via firewall rules for
production deployments.
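A sketch of the address check using the standard library resolver, with the recommended private/link-local rejection placed behind a hypothetical block_private flag (Rabbithole's real check may be structured differently):

use std::net::{IpAddr, ToSocketAddrs};

/// Resolve `host` and reject loopback/unspecified addresses outright.
/// Optionally also reject private and link-local IPv4 ranges.
fn check_resolved_ips(host: &str, block_private: bool) -> Result<(), String> {
    let addrs = (host, 443)
        .to_socket_addrs()
        .map_err(|e| format!("Error fetching URL: DNS lookup failed ({e})"))?;

    for addr in addrs {
        let ip = addr.ip();
        // Hardcoded: localhost, 127.0.0.0/8, ::1 and 0.0.0.0 are never allowed.
        if ip.is_loopback() || ip.is_unspecified() {
            return Err("Error fetching URL: address is loopback or unspecified".to_string());
        }
        if block_private {
            if let IpAddr::V4(v4) = ip {
                if v4.is_private() || v4.is_link_local() {
                    return Err("Error fetching URL: address is in a private range".to_string());
                }
            }
        }
    }
    Ok(())
}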
The web_fetch tool introduces a Server-Side Request Forgery (SSRF) attack
surface by design — it is a mechanism for the server to make outbound HTTP requests
based on URLs supplied by an external system (the LLM, which in turn is prompted by
page visitors or seed prompts).
Server-side request forgery is a web security vulnerability that allows an attacker to cause the server-side application to make requests to an unintended location. In a typical SSRF attack, the attacker might cause the server to make a connection to internal-only services within the organization's infrastructure.
In Rabbithole's context, SSRF could arise if a malicious seed prompt or crafted URL path
caused the LLM to call web_fetch with a URL pointing at internal network
resources. What makes SSRF particularly dangerous is that these requests originate from the server itself, bypassing external firewalls and security controls that would normally block malicious traffic.
Typical SSRF targets include:

- Cloud instance metadata services, e.g. http://169.254.169.254/. An attacker can read the metadata to gain sensitive information.
- Services listening only on the local machine, e.g. http://127.0.0.1/admin or http://localhost:8080/metrics. These attacks exploit services bound to the loopback interface, assuming they're protected from external access.
- The application server itself, via http://localhost:443/ or other internal URLs, potentially bypassing input validation.
Mitigations and recommendations:

- Configure an allowed_domains allowlist — deny-lists are bypass-prone; prefer allow-lists.
- Only http:// and https:// schemes are accepted — no file://, gopher://, or other protocols.
- Add cloud metadata endpoints such as 169.254.169.254 and metadata.google.internal to blocked_domains, or restrict them via host-level firewall rules. Add all RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) to your egress deny rules.
- Consider using allowed_domains to restrict fetching to only the domains your use case requires; an example combining these settings is shown below.
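Putting those recommendations together, a locked-down configuration might look like the following (the domains listed are examples only):

# rabbithole.toml: example hardened [tools] configuration
[tools]
max_fetch_bytes = 20000
fetch_timeout_secs = 10
allowed_domains = [
    "docs.rs",
    "raw.githubusercontent.com",
]
blocked_domains = [
    "169.254.169.254",
    "metadata.google.internal",
]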
Preventing SSRF requires a layered approach that includes strict input validation, egress controls, robust cloud-native application security practices, API authentication, and the continuous management of API inventories.
The following illustrates a complete round-trip of the web_fetch tool as
it occurs internally during a Rabbithole page generation cycle.
{
"type": "tool_use",
"id": "toolu_01XzExample",
"name": "web_fetch",
"input": {
"url": "https://raw.githubusercontent.com/ajbt200128/rabbithole/main/README.md"
}
}
On receiving this call, Rabbithole:

1. Validates the URL scheme (https:// — OK).
2. Checks raw.githubusercontent.com against blocklist and allowlist — passes.
3. Performs the HTTP request; the response arrives with Content-Type: text/plain; charset=utf-8.
4. Because the content type is not text/html, skips HTML stripping.
5. Truncates the text at max_fetch_bytes (20,000 bytes) if necessary.

The tool result returned to the model:

{
"type": "tool_result",
"tool_use_id": "toolu_01XzExample",
"content": "# Rabbithole\n\nA Rust web server that dynamically generates entire websites
on the fly using LLMs. Each page is generated on first request and cached permanently.\n\n
## Features\n\n- Dynamic page generation via LLM\n- Permanent page caching\n- web_search
and web_fetch tools for LLM grounding\n- Configurable via TOML\n\n[...truncated at 20000 bytes]"
}
The model receives the tool result and continues its response, using the fetched content
to inform the HTML it generates. It may call web_fetch additional times
before emitting its final <!DOCTYPE html> response.
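Conceptually, each round of this loop collects the model's web_fetch calls and answers them with matching tool_result blocks, roughly as sketched below. run_web_fetch and the simplified JSON shapes are placeholders for illustration, not Rabbithole's real internals.

use serde_json::{json, Value};

/// One step of the tool loop: turn every web_fetch tool_use block in the
/// model's response into a matching tool_result block for the next request.
fn collect_tool_results(response: &Value, run_web_fetch: impl Fn(&str) -> String) -> Vec<Value> {
    let mut results = Vec::new();
    if let Some(blocks) = response["content"].as_array() {
        for block in blocks {
            if block["type"] == "tool_use" && block["name"] == "web_fetch" {
                let url = block["input"]["url"].as_str().unwrap_or_default();
                results.push(json!({
                    "type": "tool_result",
                    "tool_use_id": block["id"].clone(),
                    "content": run_web_fetch(url)
                }));
            }
        }
    }
    results
}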
If the URL is blocked or unreachable, the tool result content is an error string:
{
"type": "tool_result",
"tool_use_id": "toolu_01XzExample",
"content": "Error fetching URL: HTTP 404 Not Found"
}
When the LLM uses web_fetch to retrieve content from third-party websites,
it may encounter copyrighted material — articles, documentation, blog posts, source code,
and other works protected by copyright law. Several important points apply:
- The fact that web_fetch can retrieve a page does not mean its content may be freely reproduced in the generated HTML. The LLM (and by extension Rabbithole operators) should treat fetched content with the same copyright respect as any other source.
- Fetching a page may also be restricted by the site's robots.txt or terms of service. Rabbithole does not automatically check or enforce robots.txt. Operators are responsible for configuring allowed_domains to comply with the access policies of the sites they fetch from.
For detailed guidance on how Rabbithole's citation and quotation policy handles web-fetched content, see the Web Tools Citation & Copyright Policy.
See also:
- web_search and web_fetch
- [tools] settings including max_fetch_bytes, fetch_timeout_secs, allowed_domains, blocked_domains