Contents

  1. Overview
  2. JSON Schema
  3. Parameters
  4. Return Value
  5. HTML Stripping & Text Extraction
  6. Byte Limit & Truncation (max_fetch_bytes)
  7. Redirects & Error Handling
  8. Domain Configuration: allowed_domains & blocked_domains
  9. Security Considerations (SSRF)
  10. Example Tool Call & Response
  11. Copyright & Fetched Content
  12. See Also

1. Overview

The web_fetch tool is one of two tools available to the Rabbithole LLM during page generation (the other being web_search). It allows the LLM to retrieve the text content of any publicly accessible web page by URL, which can then be incorporated into the generated HTML response.

A typical use case is fetching documentation, reference data, a README from GitHub, or article text that the LLM needs to quote, summarize, or link to within the generated page. The tool handles the HTTP request, follows redirects, strips HTML markup, and returns the resulting plain text — truncated at a configurable byte limit — back to the model as a tool result.

Tool availability: Whether web_fetch is available during page generation depends on the Rabbithole server configuration. Both tools may be independently enabled or disabled via rabbithole.toml. See Configuration.

2. JSON Schema

The tool is registered with the LLM provider (e.g., Anthropic Claude, OpenAI) as a structured tool definition. The schema Rabbithole uses is:

{
  "name": "web_fetch",
  "description": "Fetch the text content of a web page by URL. Returns the page body
as plain text (HTML tags stripped). Useful for reading documentation, articles,
or reference material to incorporate into the generated page.",
  "input_schema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to fetch"
      }
    },
    "required": ["url"]
  }
}

This schema is sent to the LLM as part of the system prompt / tool registration block at the start of every page-generation request. The model may invoke it zero or more times before producing its final HTML output.
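
For illustration, here is a minimal sketch of constructing this definition server-side. The use of serde_json is an assumption; this document does not specify how Rabbithole assembles its provider payloads, only that the server is written in Rust.

// Sketch: the web_fetch tool definition as a serde_json value, matching
// the schema shown above. How Rabbithole actually builds the registration
// payload is not specified in this document.
use serde_json::{json, Value};

fn web_fetch_tool_definition() -> Value {
    json!({
        "name": "web_fetch",
        "description": "Fetch the text content of a web page by URL. Returns \
                        the page body as plain text (HTML tags stripped).",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": { "type": "string", "description": "The URL to fetch" }
            },
            "required": ["url"]
        }
    })
}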

3. Parameters

url (string, required)
    The fully-qualified URL to fetch. Must begin with http:// or https://. Other schemes (e.g., file://, ftp://) are rejected. Relative URLs are not accepted; the LLM must supply an absolute URL.

There are no optional parameters in the tool call itself. Behaviour modifiers such as byte limits, allowed domains, and blocked domains are configured server-side in rabbithole.toml and are invisible to the LLM.
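
As a sketch of the scheme rule above, assuming the url crate for parsing (which this document does not confirm):

// Sketch: enforce the http/https-only rule. Url::parse also rejects
// relative URLs, matching the absolute-URL requirement above.
use url::Url;

fn validate_scheme(raw: &str) -> Result<Url, String> {
    let parsed = Url::parse(raw)
        .map_err(|e| format!("Error fetching URL: invalid URL: {e}"))?;
    match parsed.scheme() {
        "http" | "https" => Ok(parsed),
        other => Err(format!("Error fetching URL: scheme {other}:// not allowed")),
    }
}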

4. Return Value

On success, the tool returns a plain-text string containing the extracted body text of the fetched page: HTML stripped, entities decoded, whitespace normalized, and truncated at max_fetch_bytes. Sections 5 and 6 describe these steps in detail.

On failure, the tool returns an error message string (not a structured error object) so the model can read and react to the failure. Error strings are prefixed to make them distinguishable — for example:

Error fetching URL: HTTP 403 Forbidden
Error fetching URL: domain not in allowed list
Error fetching URL: connection timed out after 10s

5. HTML Stripping & Text Extraction

Rabbithole performs HTML-to-text conversion on the server side before returning the result to the LLM. The processing pipeline is:

  1. Decode response bytes — The HTTP response body is decoded using the charset declared in the Content-Type header, falling back to UTF-8.
  2. Remove script and style blocks — Any content between <script>…</script> and <style>…</style> tags is deleted before further processing, including the tags themselves.
  3. Strip all remaining tags — A regex-based or HTML-parser-based pass removes all remaining angle-bracket tags, leaving only the inner text.
  4. Decode HTML entities — Common HTML entities such as &amp;, &lt;, &gt;, &nbsp; are decoded to their Unicode equivalents.
  5. Normalize whitespace — Runs of whitespace (spaces, tabs, newlines) are collapsed. Multiple consecutive blank lines are reduced to one.
  6. Truncate — The resulting string is truncated at max_fetch_bytes bytes (UTF-8 encoded). No truncation marker is appended automatically; the text simply ends at the cut-off point.

Note: Because tag removal is based on textual processing rather than a full DOM parse, malformed HTML may occasionally result in fragments of attribute values or tag names appearing in the output. The LLM is generally robust to this noise.

If the response Content-Type does not contain text/html, the stripping step is skipped and the raw body is returned (still subject to truncation). This allows fetching JSON APIs, plain-text files, Markdown documents (e.g., GitHub raw URLs), etc.
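
A rough sketch of steps 2 through 5 of this pipeline, using the simpler regex-based variant mentioned in step 3 (the exact implementation is not specified here; a real HTML parser would handle malformed markup better):

// Sketch of pipeline steps 2-5: drop script/style blocks, strip remaining
// tags, decode a few common entities, and normalize whitespace.
use regex::Regex;

fn html_to_text(html: &str) -> String {
    let script = Regex::new(r"(?is)<script.*?</script>").unwrap();
    let style  = Regex::new(r"(?is)<style.*?</style>").unwrap();
    let tags   = Regex::new(r"(?s)<[^>]*>").unwrap();
    let spaces = Regex::new(r"[ \t]+").unwrap();
    let blanks = Regex::new(r"\n{3,}").unwrap();

    let text = script.replace_all(html, "");
    let text = style.replace_all(&text, "");
    let text = tags.replace_all(&text, "");
    // Decode &amp; last so that e.g. &amp;lt; is not double-decoded.
    let text = text
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&");
    let text = spaces.replace_all(&text, " ");
    blanks.replace_all(&text, "\n\n").trim().to_string()
}

Note the decode order: translating &amp; first would turn a literal &amp;lt; in the source into a spurious <.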

6. Byte Limit & Truncation (max_fetch_bytes)

To avoid filling the LLM's context window with a single large fetch, Rabbithole enforces a hard byte limit on the returned text. The relevant configuration key is:

# rabbithole.toml
[tools]
max_fetch_bytes = 20000   # default: 20,000 bytes (~20 KB of plain text)

Config key             Default  Unit           Notes
tools.max_fetch_bytes  20000    bytes (UTF-8)  Applies after HTML stripping. Set to 0 to disable the limit (not recommended).

The truncation is applied to the post-strip plain-text bytes, not the raw HTTP response. A 500 KB HTML page may strip down to 40 KB of text; only the first 20 KB of that text would be returned to the model.

Tip: If you need the LLM to read a specific section of a long page, include the exact anchor or section URL in the prompt rather than the page root, where possible. Some sites structure their content so that section-specific URLs return shorter pages.
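
One subtlety the truncation step must handle: cutting a UTF-8 string at an arbitrary byte offset can split a multi-byte character. A minimal sketch (whether Rabbithole backs up to a character boundary in exactly this way is not documented):

// Sketch: truncate to at most `max` bytes without splitting a UTF-8
// code point, backing up to the nearest character boundary.
fn truncate_utf8(s: &str, max: usize) -> &str {
    if max == 0 || s.len() <= max {
        return s; // max == 0 disables the limit, per the config table above
    }
    let mut end = max;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

For example, truncate_utf8("héllo", 2) returns "h", since cutting at byte 2 would split the two-byte é.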

7. Redirects & Error Handling

HTTP Redirects

Rabbithole follows HTTP redirects automatically, up to a maximum of 5 hops. Each redirect target URL is validated against the same domain allow/block lists as the original request. If a redirect leads to a blocked domain or exceeds the hop limit, the request fails with an appropriate error message.

Security note: Redirect chains can be used to bypass domain restrictions. For example, an attacker-controlled domain could redirect to an internal address. Rabbithole re-validates every redirect target. See Security Considerations.
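
One way to get per-hop validation is to disable the HTTP client's automatic redirect handling and follow Location headers manually. A sketch assuming the reqwest crate; check_url_allowed is a hypothetical stand-in for the scheme, domain, and IP checks described elsewhere on this page:

// Sketch: follow up to 5 redirect hops, re-validating every target.
// Assumes reqwest's blocking client; real code must also resolve relative
// Location values against the current URL (omitted here).
use reqwest::blocking::{Client, Response};
use reqwest::redirect::Policy;

const MAX_HOPS: usize = 5;

// Hypothetical stand-in for the allow/block-list and IP-range checks.
fn check_url_allowed(_url: &str) -> Result<(), String> { Ok(()) }

fn fetch_with_validation(start: &str) -> Result<Response, String> {
    let client = Client::builder()
        .redirect(Policy::none()) // we follow redirects ourselves
        .build()
        .map_err(|e| e.to_string())?;

    let mut url = start.to_string();
    for _ in 0..=MAX_HOPS {
        check_url_allowed(&url)?;
        let resp = client
            .get(&url)
            .send()
            .map_err(|e| format!("Error fetching URL: {e}"))?;
        if !resp.status().is_redirection() {
            return Ok(resp);
        }
        url = resp
            .headers()
            .get(reqwest::header::LOCATION)
            .and_then(|v| v.to_str().ok())
            .ok_or_else(|| "Error fetching URL: redirect without Location header".to_string())?
            .to_string();
    }
    Err("Error fetching URL: too many redirects (max 5 hops)".into())
}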

Non-200 HTTP Status Codes

If the final HTTP response (after following redirects) has a status code other than 2xx, the tool returns an error string rather than the response body. This prevents the LLM from inadvertently treating an error page's HTML as real content.

Status range  Behaviour
2xx           Success — body is processed and returned.
3xx           Redirect — followed automatically up to 5 hops.
4xx           Client error (e.g., 403, 404, 429) — returns error string: Error fetching URL: HTTP 4xx <reason>
5xx           Server error — returns error string: Error fetching URL: HTTP 5xx <reason>

Network / Timeout Errors

Connection errors (DNS failure, TCP timeout, TLS handshake failure, etc.) also return an error string. The default connection timeout is 10 seconds. This can be adjusted in configuration:

# rabbithole.toml
[tools]
fetch_timeout_secs = 10   # default: 10 seconds
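
Assuming reqwest again (this document does not name Rabbithole's HTTP client), the timeout is applied when the client is built. Whether the documented 10 seconds covers only connection setup or the whole request is not specified, so this sketch sets both:

// Sketch: apply the configured fetch timeout when constructing the client.
use std::time::Duration;

fn build_client(fetch_timeout_secs: u64) -> Result<reqwest::blocking::Client, reqwest::Error> {
    reqwest::blocking::Client::builder()
        .connect_timeout(Duration::from_secs(fetch_timeout_secs)) // TCP/TLS setup
        .timeout(Duration::from_secs(fetch_timeout_secs))         // whole request
        .build()
}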

8. Domain Configuration: allowed_domains & blocked_domains

Rabbithole operators can constrain which hosts the web_fetch tool is permitted to contact. Two complementary configuration lists exist:

allowed_domains (allowlist)

If this list is non-empty, only URLs whose hostname exactly matches or is a subdomain of an entry in the list are permitted. All other hostnames are rejected with:

Error fetching URL: domain not in allowed list

Example configuration:

# rabbithole.toml
[tools]
allowed_domains = [
    "docs.rs",
    "crates.io",
    "github.com",
    "raw.githubusercontent.com",
]

This is the most secure option for production deployments where you know in advance which external sources the LLM should be able to read. An empty allowed_domains list means no allowlist restriction is applied (all domains are permitted, subject to blocked_domains).

blocked_domains (blocklist)

Entries in this list are always refused, even if allowed_domains is empty or the domain would otherwise match an allowlist entry. The blocked list takes precedence.

# rabbithole.toml
[tools]
blocked_domains = [
    "169.254.169.254",   # AWS/GCP metadata service
    "metadata.google.internal",
    "example-internal.corp",
]
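
The matching rules described above (exact match or subdomain; blocklist always wins; an empty allowlist means unrestricted) could be sketched as:

// Sketch of the documented matching rules: a hostname matches an entry if
// it equals the entry or is a subdomain of it; blocked_domains takes
// precedence; an empty allowed_domains list imposes no restriction.
// (The exact wording of the blocked-domain error is not documented.)
fn matches(host: &str, entry: &str) -> bool {
    host == entry || host.ends_with(&format!(".{entry}"))
}

fn domain_permitted(host: &str, allowed: &[String], blocked: &[String]) -> Result<(), String> {
    if blocked.iter().any(|d| matches(host, d)) {
        return Err("Error fetching URL: domain is blocked".into());
    }
    if !allowed.is_empty() && !allowed.iter().any(|d| matches(host, d)) {
        return Err("Error fetching URL: domain not in allowed list".into());
    }
    Ok(())
}

With the example allowlist above, raw.githubusercontent.com passes (exact match) and gist.github.com passes (subdomain of github.com), while example.org fails.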

Built-in localhost blocking

Regardless of configuration, Rabbithole always rejects requests to loopback addresses and the local machine (e.g., localhost, 127.0.0.1, ::1). This check is hardcoded and cannot be overridden by configuration.

After DNS resolution, the resolved IP address is also checked against the blocked ranges, to guard against DNS rebinding attacks where a public hostname resolves to a private IP.

Note: Private RFC1918 ranges (10.x.x.x, 172.16.x.x–172.31.x.x, 192.168.x.x) and the link-local range (169.254.x.x) are not blocked by default. For production deployments, adding them to blocked_domains, or enforcing the restriction via firewall rules, is strongly recommended.
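
A sketch of the post-resolution check using only the standard library; the optional private-range handling reflects the recommendation in the note above rather than default behaviour:

// Sketch: resolve the host and reject loopback targets unconditionally;
// private and link-local IPv4 ranges are rejected only if the operator
// opts in, mirroring the note above. IPv6 ULA handling is omitted.
use std::net::{IpAddr, ToSocketAddrs};

fn check_resolved_ips(host: &str, block_private: bool) -> Result<(), String> {
    let addrs = (host, 443)
        .to_socket_addrs()
        .map_err(|e| format!("Error fetching URL: DNS failure: {e}"))?;
    for addr in addrs {
        let bad = match addr.ip() {
            ip if ip.is_loopback() => true, // hardcoded, never overridable
            IpAddr::V4(v4) => block_private && (v4.is_private() || v4.is_link_local()),
            IpAddr::V6(_) => false,
        };
        if bad {
            return Err("Error fetching URL: resolved to a blocked IP range".into());
        }
    }
    Ok(())
}

Checking the resolution result separately from the actual connection still leaves a small rebinding window; a stricter implementation pins the connection to the vetted IP rather than resolving twice.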

9. Security Considerations (SSRF)

The web_fetch tool introduces a Server-Side Request Forgery (SSRF) attack surface by design — it is a mechanism for the server to make outbound HTTP requests based on URLs supplied by an external system (the LLM, which in turn is prompted by page visitors or seed prompts).

What is SSRF?

Server-side request forgery is a web security vulnerability that allows an attacker to cause the server-side application to make requests to an unintended location. In a typical SSRF attack, the attacker might cause the server to make a connection to internal-only services within the organization's infrastructure.

In Rabbithole's context, SSRF could arise if a malicious seed prompt or crafted URL path caused the LLM to call web_fetch with a URL pointing at internal network resources. What makes SSRF particularly dangerous is that these requests originate from the server itself, bypassing external firewalls and security controls that would normally block malicious traffic.

Threat scenarios specific to Rabbithole

  - Cloud metadata theft: a crafted seed prompt or URL path induces the LLM to call web_fetch against http://169.254.169.254/... and echo instance credentials into the generated page.
  - Internal service probing: fetches aimed at internal hostnames or private IPs reveal which internal services exist and what they return.
  - DNS rebinding: a public hostname that passes the domain check but resolves to a private IP.
  - Redirect laundering: an allowed public URL that redirects to a blocked or internal address.

Mitigations in Rabbithole

  - Scheme validation: only http:// and https:// URLs are accepted.
  - Domain allowlist/blocklist checks applied to the original URL and to every redirect target.
  - Hardcoded rejection of loopback addresses, with a post-DNS re-check of the resolved IP.
  - Connection timeout (fetch_timeout_secs) and response size limit (max_fetch_bytes) to bound resource consumption.

Production recommendation: If you are running Rabbithole on a cloud VM (AWS EC2, GCP, Azure, etc.), add 169.254.169.254 and metadata.google.internal to blocked_domains, or restrict them via host-level firewall rules. Add all RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) to your egress deny rules. Consider using allowed_domains to restrict fetching to only the domains your use case requires.

Preventing SSRF requires a layered approach: strict URL validation in the application, conservative domain allowlists, and network-level egress controls. No single mitigation should be relied on alone.

10. Example Tool Call & Response

The following illustrates a complete round-trip of the web_fetch tool as it occurs internally during a Rabbithole page generation cycle.

1. LLM emits a tool use block (Anthropic format)

{
  "type": "tool_use",
  "id": "toolu_01XzExample",
  "name": "web_fetch",
  "input": {
    "url": "https://raw.githubusercontent.com/ajbt200128/rabbithole/main/README.md"
  }
}

2. Rabbithole processes the request

  1. Validates URL scheme (https:// — OK).
  2. Checks hostname raw.githubusercontent.com against blocklist and allowlist — passes.
  3. Resolves DNS, verifies the resolved IP is not in a blocked range.
  4. Makes HTTP GET request with a 10-second timeout.
  5. Receives HTTP 200 response with Content-Type: text/plain; charset=utf-8.
  6. Since content type is not text/html, skips HTML stripping.
  7. Truncates to max_fetch_bytes (20,000 bytes) if necessary.

3. Rabbithole returns tool result to the LLM

{
  "type": "tool_result",
  "tool_use_id": "toolu_01XzExample",
  "content": "# Rabbithole\n\nA Rust web server that dynamically generates entire websites
on the fly using LLMs. Each page is generated on first request and cached permanently.\n\n
## Features\n\n- Dynamic page generation via LLM\n- Permanent page caching\n- web_search
and web_fetch tools for LLM grounding\n- Configurable via TOML\n\n[...truncated at 20000 bytes]"
}

4. LLM continues generation

The model receives the tool result and continues its response, using the fetched content to inform the HTML it generates. It may call web_fetch additional times before emitting its final <!DOCTYPE html> response.

Error example

If the fetch fails (for example, the URL is blocked, the host is unreachable, or the server returns a non-2xx status), the tool result content is an error string:

{
  "type": "tool_result",
  "tool_use_id": "toolu_01XzExample",
  "content": "Error fetching URL: HTTP 404 Not Found"
}

11. Copyright & Fetched Content

When the LLM uses web_fetch to retrieve content from third-party websites, it may encounter copyrighted material — articles, documentation, blog posts, source code, and other works protected by copyright law.

For detailed guidance on how Rabbithole's citation and quotation policy handles web-fetched content, see the Web Tools Citation & Copyright Policy.

12. See Also

  - Configuration: the rabbithole.toml reference, including the [tools] section.
  - web_search: the companion tool available during page generation.
  - Web Tools Citation & Copyright Policy: guidance on citing and quoting fetched content.
