Contents

  1. Overview
  2. JSON Schema
  3. Parameters
  4. Return Value
  5. HTML Stripping & Text Extraction
  6. Byte Limit & Truncation (max_fetch_bytes)
  7. Redirects & Error Handling
  8. Domain Configuration: allowed_domains & blocked_domains
  9. Security Considerations (SSRF)
  10. Example Tool Call & Response
  11. Copyright & Fetched Content
  12. See Also

1. Overview

The web_fetch tool is one of two tools available to the Rabbithole LLM during page generation (the other being web_search). It allows the LLM to retrieve the text content of any publicly accessible web page by URL, which can then be incorporated into the generated HTML response.

A typical use case is fetching documentation, reference data, a README from GitHub, or article text that the LLM needs to quote, summarize, or link to within the generated page. The tool handles the HTTP request, follows redirects, strips HTML markup, and returns the resulting plain text — truncated at a configurable byte limit — back to the model as a tool result.

Tool availability: Whether web_fetch is available during page generation depends on the Rabbithole server configuration. Both tools may be independently enabled or disabled via rabbithole.toml. See Configuration.

2. JSON Schema

The tool is registered with the LLM provider (e.g., Anthropic Claude, OpenAI) as a structured tool definition. The schema Rabbithole uses is:

{
  "name": "web_fetch",
  "description": "Fetch the text content of a web page by URL. Returns the page body
as plain text (HTML tags stripped). Useful for reading documentation, articles,
or reference material to incorporate into the generated page.",
  "input_schema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to fetch"
      }
    },
    "required": ["url"]
  }
}

This schema is sent to the LLM as part of the system prompt / tool registration block at the start of every page-generation request. The model may invoke it zero or more times before producing its final HTML output.
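
For illustration, here is a minimal sketch of constructing this definition server-side. The use of serde_json is an assumption; this document does not specify how Rabbithole assembles its provider payloads, only that the server is written in Rust.

// Sketch: the web_fetch tool definition as a serde_json value, matching
// the schema shown above. How Rabbithole actually builds the registration
// payload is not specified in this document.
use serde_json::{json, Value};

fn web_fetch_tool_definition() -> Value {
    json!({
        "name": "web_fetch",
        "description": "Fetch the text content of a web page by URL. Returns \
                        the page body as plain text (HTML tags stripped).",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": { "type": "string", "description": "The URL to fetch" }
            },
            "required": ["url"]
        }
    })
}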

3. Parameters

url (string, required)
    The fully-qualified URL to fetch. Must begin with http:// or https://. Other schemes (e.g., file://, ftp://) are rejected. Relative URLs are not accepted; the LLM must supply an absolute URL.

There are no optional parameters in the tool call itself. Behaviour modifiers such as byte limits, allowed domains, and blocked domains are configured server-side in rabbithole.toml and are invisible to the LLM.
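
As a sketch of the scheme rule above, assuming the url crate for parsing (which this document does not confirm):

// Sketch: enforce the http/https-only rule. Url::parse also rejects
// relative URLs, matching the absolute-URL requirement above.
use url::Url;

fn validate_scheme(raw: &str) -> Result<Url, String> {
    let parsed = Url::parse(raw)
        .map_err(|e| format!("Error fetching URL: invalid URL: {e}"))?;
    match parsed.scheme() {
        "http" | "https" => Ok(parsed),
        other => Err(format!("Error fetching URL: scheme {other}:// not allowed")),
    }
}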

4. Return Value

On success, the tool returns a plain-text string containing the extracted body text of the fetched page: HTML stripped, entities decoded, whitespace normalized, and truncated at max_fetch_bytes. Sections 5 and 6 describe these steps in detail.

On failure, the tool returns an error message string (not a structured error object) so the model can read and react to the failure. Error strings are prefixed to make them distinguishable — for example:

Error fetching URL: HTTP 403 Forbidden
Error fetching URL: domain not in allowed list
Error fetching URL: connection timed out after 10s

5. HTML Stripping & Text Extraction

Rabbithole performs HTML-to-text conversion on the server side before returning the result to the LLM. The processing pipeline is:

  1. Decode response bytes — The HTTP response body is decoded using the charset declared in the Content-Type header, falling back to UTF-8.
  2. Remove script and style blocks — Any content between <script>…</script> and <style>…</style> tags is deleted before further processing, including the tags themselves.
  3. Strip all remaining tags — A regex-based or HTML-parser-based pass removes all remaining angle-bracket tags, leaving only the inner text.
  4. Decode HTML entities — Common HTML entities such as &amp;, &lt;, &gt;, &nbsp; are decoded to their Unicode equivalents.
  5. Normalize whitespace — Runs of whitespace (spaces, tabs, newlines) are collapsed. Multiple consecutive blank lines are reduced to one.
  6. Truncate — The resulting string is truncated at max_fetch_bytes bytes (UTF-8 encoded). No truncation marker is appended automatically; the text simply ends at the cut-off point.

Note: Because tag removal is based on textual processing rather than a full DOM parse, malformed HTML may occasionally result in fragments of attribute values or tag names appearing in the output. The LLM is generally robust to this noise.

If the response Content-Type does not contain text/html, the stripping step is skipped and the raw body is returned (still subject to truncation). This allows fetching JSON APIs, plain-text files, Markdown documents (e.g., GitHub raw URLs), etc.
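
A rough sketch of steps 2 through 5 of this pipeline, using the simpler regex-based variant mentioned in step 3 (the exact implementation is not specified here; a real HTML parser would handle malformed markup better):

// Sketch of pipeline steps 2-5: drop script/style blocks, strip remaining
// tags, decode a few common entities, and normalize whitespace.
use regex::Regex;

fn html_to_text(html: &str) -> String {
    let script = Regex::new(r"(?is)<script.*?</script>").unwrap();
    let style  = Regex::new(r"(?is)<style.*?</style>").unwrap();
    let tags   = Regex::new(r"(?s)<[^>]*>").unwrap();
    let spaces = Regex::new(r"[ \t]+").unwrap();
    let blanks = Regex::new(r"\n{3,}").unwrap();

    let text = script.replace_all(html, "");
    let text = style.replace_all(&text, "");
    let text = tags.replace_all(&text, "");
    // Decode &amp; last so that e.g. &amp;lt; is not double-decoded.
    let text = text
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&");
    let text = spaces.replace_all(&text, " ");
    blanks.replace_all(&text, "\n\n").trim().to_string()
}

Note the decode order: translating &amp; first would turn a literal &amp;lt; in the source into a spurious <.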

6. Byte Limit & Truncation (max_fetch_bytes)

To avoid filling the LLM's context window with a single large fetch, Rabbithole enforces a hard byte limit on the returned text. The relevant configuration key is:

# rabbithole.toml
[tools]
max_fetch_bytes = 20000   # default: 20,000 bytes (~20 KB of plain text)

Config key             Default  Unit           Notes
tools.max_fetch_bytes  20000    bytes (UTF-8)  Applies after HTML stripping. Set to 0 to disable the limit (not recommended).

The truncation is applied to the post-strip plain-text bytes, not the raw HTTP response. A 500 KB HTML page may strip down to 40 KB of text; only the first 20 KB of that text would be returned to the model.

Tip: If you need the LLM to read a specific section of a long page, include the exact anchor or section URL in the prompt rather than the page root, where possible. Some sites structure their content so that section-specific URLs return shorter pages.
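
One subtlety the truncation step must handle: cutting a UTF-8 string at an arbitrary byte offset can split a multi-byte character. A minimal sketch (whether Rabbithole backs up to a character boundary in exactly this way is not documented):

// Sketch: truncate to at most `max` bytes without splitting a UTF-8
// code point, backing up to the nearest character boundary.
fn truncate_utf8(s: &str, max: usize) -> &str {
    if max == 0 || s.len() <= max {
        return s; // max == 0 disables the limit, per the config table above
    }
    let mut end = max;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

For example, truncate_utf8("héllo", 2) returns "h", since cutting at byte 2 would split the two-byte é.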

7. Redirects & Error Handling

HTTP Redirects

Rabbithole follows HTTP redirects automatically, up to a maximum of 5 hops. Each redirect target URL is validated against the same domain allow/block lists as the original request. If a redirect leads to a blocked domain or exceeds the hop limit, the request fails with an appropriate error message.

Security note: Redirect chains can be used to bypass domain restrictions. For example, an attacker-controlled domain could redirect to an internal address. Rabbithole re-validates every redirect target. See Security Considerations.
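
One way to get per-hop validation is to disable the HTTP client's automatic redirect handling and follow Location headers manually. A sketch assuming the reqwest crate; check_url_allowed is a hypothetical stand-in for the scheme, domain, and IP checks described elsewhere on this page:

// Sketch: follow up to 5 redirect hops, re-validating every target.
// Assumes reqwest's blocking client; real code must also resolve relative
// Location values against the current URL (omitted here).
use reqwest::blocking::{Client, Response};
use reqwest::redirect::Policy;

const MAX_HOPS: usize = 5;

// Hypothetical stand-in for the allow/block-list and IP-range checks.
fn check_url_allowed(_url: &str) -> Result<(), String> { Ok(()) }

fn fetch_with_validation(start: &str) -> Result<Response, String> {
    let client = Client::builder()
        .redirect(Policy::none()) // we follow redirects ourselves
        .build()
        .map_err(|e| e.to_string())?;

    let mut url = start.to_string();
    for _ in 0..=MAX_HOPS {
        check_url_allowed(&url)?;
        let resp = client
            .get(&url)
            .send()
            .map_err(|e| format!("Error fetching URL: {e}"))?;
        if !resp.status().is_redirection() {
            return Ok(resp);
        }
        url = resp
            .headers()
            .get(reqwest::header::LOCATION)
            .and_then(|v| v.to_str().ok())
            .ok_or_else(|| "Error fetching URL: redirect without Location header".to_string())?
            .to_string();
    }
    Err("Error fetching URL: too many redirects (max 5 hops)".into())
}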

Non-200 HTTP Status Codes

If the final HTTP response (after following redirects) has a status code other than 2xx, the tool returns an error string rather than the response body. This prevents the LLM from inadvertently treating an error page's HTML as real content.

Status range  Behaviour
2xx           Success — body is processed and returned.
3xx           Redirect — followed automatically up to 5 hops.
4xx           Client error (e.g., 403, 404, 429) — returns error string: Error fetching URL: HTTP 4xx <reason>
5xx           Server error — returns error string: Error fetching URL: HTTP 5xx <reason>

Network / Timeout Errors

Connection errors (DNS failure, TCP timeout, TLS handshake failure, etc.) also return an error string. The default connection timeout is 10 seconds. This can be adjusted in configuration:

# rabbithole.toml
[tools]
fetch_timeout_secs = 10   # default: 10 seconds
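
Assuming reqwest again (this document does not name Rabbithole's HTTP client), the timeout is applied when the client is built. Whether the documented 10 seconds covers only connection setup or the whole request is not specified, so this sketch sets both:

// Sketch: apply the configured fetch timeout when constructing the client.
use std::time::Duration;

fn build_client(fetch_timeout_secs: u64) -> Result<reqwest::blocking::Client, reqwest::Error> {
    reqwest::blocking::Client::builder()
        .connect_timeout(Duration::from_secs(fetch_timeout_secs)) // TCP/TLS setup
        .timeout(Duration::from_secs(fetch_timeout_secs))         // whole request
        .build()
}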

8. Domain Configuration: allowed_domains & blocked_domains

Rabbithole operators can constrain which hosts the web_fetch tool is permitted to contact. Two complementary configuration lists exist:

allowed_domains (allowlist)

If this list is non-empty, only URLs whose hostname exactly matches or is a subdomain of an entry in the list are permitted. All other hostnames are rejected with:

Error fetching URL: domain not in allowed list

Example configuration:

# rabbithole.toml
[tools]
allowed_domains = [
    "docs.rs",
    "crates.io",
    "github.com",
    "raw.githubusercontent.com",
]

This is the most secure option for production deployments where you know in advance which external sources the LLM should be able to read. An empty allowed_domains list means no allowlist restriction is applied (all domains are permitted, subject to blocked_domains).

blocked_domains (blocklist)

Entries in this list are always refused, even if allowed_domains is empty or the domain would otherwise match an allowlist entry. The blocked list takes precedence.

# rabbithole.toml
[tools]
blocked_domains = [
    "169.254.169.254",   # AWS/GCP metadata service
    "metadata.google.internal",
    "example-internal.corp",
]
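
The matching rules described above (exact match or subdomain; blocklist always wins; an empty allowlist means unrestricted) could be sketched as:

// Sketch of the documented matching rules: a hostname matches an entry if
// it equals the entry or is a subdomain of it; blocked_domains takes
// precedence; an empty allowed_domains list imposes no restriction.
// (The exact wording of the blocked-domain error is not documented.)
fn matches(host: &str, entry: &str) -> bool {
    host == entry || host.ends_with(&format!(".{entry}"))
}

fn domain_permitted(host: &str, allowed: &[String], blocked: &[String]) -> Result<(), String> {
    if blocked.iter().any(|d| matches(host, d)) {
        return Err("Error fetching URL: domain is blocked".into());
    }
    if !allowed.is_empty() && !allowed.iter().any(|d| matches(host, d)) {
        return Err("Error fetching URL: domain not in allowed list".into());
    }
    Ok(())
}

With the example allowlist above, raw.githubusercontent.com passes (exact match) and gist.github.com passes (subdomain of github.com), while example.org fails.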

Built-in localhost blocking

Regardless of configuration, Rabbithole always rejects requests to loopback addresses and the local machine (e.g., localhost, 127.0.0.1, ::1). This check is hardcoded and cannot be overridden by configuration.

After DNS resolution, the resolved IP address is also checked against the blocked ranges, to guard against DNS rebinding attacks where a public hostname resolves to a private IP.

Note: Private RFC1918 ranges (10.x.x.x, 172.16.x.x–172.31.x.x, 192.168.x.x) and the link-local range (169.254.x.x) are not blocked by default. For production deployments, adding them to blocked_domains, or enforcing the restriction via firewall rules, is strongly recommended.
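
A sketch of the post-resolution check using only the standard library; the optional private-range handling reflects the recommendation in the note above rather than default behaviour:

// Sketch: resolve the host and reject loopback targets unconditionally;
// private and link-local IPv4 ranges are rejected only if the operator
// opts in, mirroring the note above. IPv6 ULA handling is omitted.
use std::net::{IpAddr, ToSocketAddrs};

fn check_resolved_ips(host: &str, block_private: bool) -> Result<(), String> {
    let addrs = (host, 443)
        .to_socket_addrs()
        .map_err(|e| format!("Error fetching URL: DNS failure: {e}"))?;
    for addr in addrs {
        let bad = match addr.ip() {
            ip if ip.is_loopback() => true, // hardcoded, never overridable
            IpAddr::V4(v4) => block_private && (v4.is_private() || v4.is_link_local()),
            IpAddr::V6(_) => false,
        };
        if bad {
            return Err("Error fetching URL: resolved to a blocked IP range".into());
        }
    }
    Ok(())
}

Checking the resolution result separately from the actual connection still leaves a small rebinding window; a stricter implementation pins the connection to the vetted IP rather than resolving twice.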

9. Security Considerations (SSRF)

The web_fetch tool introduces a Server-Side Request Forgery (SSRF) attack surface by design — it is a mechanism for the server to make outbound HTTP requests based on URLs supplied by an external system (the LLM, which in turn is prompted by page visitors or seed prompts).

What is SSRF?

Server-side request forgery is a web security vulnerability that allows an attacker to cause the server-side application to make requests to an unintended location. In a typical SSRF attack, the attacker might cause the server to make a connection to internal-only services within the organization's infrastructure.

In Rabbithole's context, SSRF could arise if a malicious seed prompt or crafted URL path caused the LLM to call web_fetch with a URL pointing at internal network resources. What makes SSRF particularly dangerous is that these requests originate from the server itself, bypassing external firewalls and security controls that would normally block malicious traffic.

Threat scenarios specific to Rabbithole

  - Cloud metadata theft: a crafted seed prompt or URL path induces the LLM to call web_fetch against http://169.254.169.254/... and echo instance credentials into the generated page.
  - Internal service probing: fetches aimed at internal hostnames or private IPs reveal which internal services exist and what they return.
  - DNS rebinding: a public hostname that passes the domain check but resolves to a private IP.
  - Redirect laundering: an allowed public URL that redirects to a blocked or internal address.

Mitigations in Rabbithole

  - Scheme validation: only http:// and https:// URLs are accepted.
  - Domain allowlist/blocklist checks applied to the original URL and to every redirect target.
  - Hardcoded rejection of loopback addresses, with a post-DNS re-check of the resolved IP.
  - Connection timeout (fetch_timeout_secs) and response size limit (max_fetch_bytes) to bound resource consumption.

Production recommendation: If you are running Rabbithole on a cloud VM (AWS EC2, GCP, Azure, etc.), add 169.254.169.254 and metadata.google.internal to blocked_domains, or restrict them via host-level firewall rules. Add all RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) to your egress deny rules. Consider using allowed_domains to restrict fetching to only the domains your use case requires.

Preventing SSRF requires a layered approach: strict URL validation in the application, conservative domain allowlists, and network-level egress controls. No single mitigation should be relied on alone.

10. Example Tool Call & Response

The following illustrates a complete round-trip of the web_fetch tool as it occurs internally during a Rabbithole page generation cycle.

1. LLM emits a tool use block (Anthropic format)

{
  "type": "tool_use",
  "id": "toolu_01XzExample",
  "name": "web_fetch",
  "input": {
    "url": "https://raw.githubusercontent.com/ajbt200128/rabbithole/main/README.md"
  }
}

2. Rabbithole processes the request

  1. Validates URL scheme (https:// — OK).
  2. Checks hostname raw.githubusercontent.com against blocklist and allowlist — passes.
  3. Resolves DNS, verifies the resolved IP is not in a blocked range.
  4. Makes HTTP GET request with a 10-second timeout.
  5. Receives HTTP 200 response with Content-Type: text/plain; charset=utf-8.
  6. Since content type is not text/html, skips HTML stripping.
  7. Truncates to max_fetch_bytes (20,000 bytes) if necessary.

3. Rabbithole returns tool result to the LLM

{
  "type": "tool_result",
  "tool_use_id": "toolu_01XzExample",
  "content": "# Rabbithole\n\nA Rust web server that dynamically generates entire websites
on the fly using LLMs. Each page is generated on first request and cached permanently.\n\n
## Features\n\n- Dynamic page generation via LLM\n- Permanent page caching\n- web_search
and web_fetch tools for LLM grounding\n- Configurable via TOML\n\n[...truncated at 20000 bytes]"
}

4. LLM continues generation

The model receives the tool result and continues its response, using the fetched content to inform the HTML it generates. It may call web_fetch additional times before emitting its final <!DOCTYPE html> response.

Error example

If the fetch fails (for example, the URL is blocked, the host is unreachable, or the server returns a non-2xx status), the tool result content is an error string:

{
  "type": "tool_result",
  "tool_use_id": "toolu_01XzExample",
  "content": "Error fetching URL: HTTP 404 Not Found"
}

11. Copyright & Fetched Content

When the LLM uses web_fetch to retrieve content from third-party websites, it may encounter copyrighted material — articles, documentation, blog posts, source code, and other works protected by copyright law.

For detailed guidance on how Rabbithole's citation and quotation policy handles web-fetched content, see the Web Tools Citation & Copyright Policy.

12. See Also

  - Configuration: the rabbithole.toml reference, including the [tools] section.
  - web_search: the companion tool available during page generation.
  - Web Tools Citation & Copyright Policy: guidance on citing and quoting fetched content.
