Citation and Copyright Policy

Applies to: Web Tools (web_search / web_fetch) — Last updated: March 2026

⚠ Not Legal Advice: This page is informational documentation for Rabbithole operators. Nothing here constitutes legal advice. Copyright law is complex, varies by jurisdiction, and is actively evolving in response to AI systems. If you have specific legal concerns, consult a qualified attorney.

Overview

Rabbithole's web tools feature allows the LLM to call web_search and web_fetch at page-generation time, pulling live content from the internet to enrich generated pages with current facts, data, and references. This is a powerful capability, but it introduces real risks: the LLM may incorporate third-party copyrighted material into generated HTML pages that are then served publicly.

Unlike a human writer who actively makes judgments about quotation and attribution, an LLM generating HTML on the fly may reproduce copyrighted text without any indication that it is doing so. This document explains the risks, classifies content types by copyright sensitivity, and describes configuration strategies to minimize exposure.

How the Risk Arises

When web_fetch retrieves a URL, the full text of that page is injected into the LLM's context window. The model then generates HTML that may:

  - include verbatim snippets from non-free content;
  - constitute a derivative work of the fetched material;
  - paraphrase copyrighted content (such as news articles) so closely that the summary itself infringes.

Because Rabbithole serves generated pages publicly and caches them permanently, infringing content may be accessible for an extended period unless the cache is explicitly invalidated. Operators are responsible for the content served from their Rabbithole instances.

Content Risk Classification

Not all web content carries equal copyright risk. The table below provides a practical risk classification for the most common content types the web tools may encounter.

| Content type | Risk level | Notes |
| --- | --- | --- |
| News articles (Reuters, AP, NYT, etc.) | HIGH | Journalistic prose is fully protected. Even brief reproduction can constitute infringement. News organizations have actively litigated against AI content use. |
| Song lyrics | HIGH | Among the most aggressively enforced copyright. Even reproducing a few lines verbatim is highly risky. Concord Music Group v. Anthropic specifically alleged that lyrics can be accessed verbatim or near-verbatim from LLM outputs. |
| Book excerpts / fiction | HIGH | Creative literary works carry the strongest copyright protection. The LLM should not reproduce passages, and detailed plot summaries can themselves be derivative works. |
| Blog posts & editorial content | MEDIUM-HIGH | Protected by default. Attribution alone does not cure infringement. |
| Academic papers (closed-access journals) | MEDIUM-HIGH | Publisher-owned content. Fetching behind paywalls may additionally violate Terms of Service and the DMCA. |
| Wikipedia articles | LOW-MEDIUM | Released under CC BY-SA 4.0. Reproduction is permitted with attribution and share-alike compliance. Paraphrasing with attribution is still best practice. |
| Official technical documentation (IETF RFCs, W3C specs) | LOW | Much official standards documentation carries permissive reproduction rights, though always verify the specific license. |
| Open-source code and README files (MIT, Apache 2.0, etc.) | LOW | Permissive licenses generally allow reproduction with attribution. Verify the license: GPL requires copyleft compliance if code is distributed. |
| Government publications (US federal) | VERY LOW | US federal government works are generally in the public domain under 17 U.S.C. § 105. State and foreign government works vary. |
| Raw factual data (statistics, measurements, prices) | VERY LOW | Pure facts are not copyrightable. The expression used to convey facts may be, however, so reproduction of the surrounding prose remains risky. |
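The classification above can be mirrored in a small lookup used to screen seed prompts before fetches are configured. This is a hypothetical sketch: the helper and its domain list are illustrative only, not part of Rabbithole, and unknown domains fall back to a cautious tier because online content is protected by default.

```python
# Hypothetical risk lookup mirroring the classification table.
# The domain-to-tier mapping is illustrative, not exhaustive.
RISK_BY_DOMAIN = {
    "reuters.com": "HIGH",          # news prose
    "nytimes.com": "HIGH",          # news prose
    "en.wikipedia.org": "LOW-MEDIUM",  # CC BY-SA 4.0
    "rfc-editor.org": "LOW",        # standards documentation
    "govinfo.gov": "VERY LOW",      # US federal publications
}

def risk_for(domain: str, default: str = "MEDIUM-HIGH") -> str:
    """Return the risk tier for a domain, defaulting to the
    cautious tier since web content is protected by default."""
    return RISK_BY_DOMAIN.get(domain.lower(), default)
```

A screening step could then warn the operator before a seed prompt directs fetches at any domain rated HIGH.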

Legal Background

Copyright Fundamentals

Copyright attaches automatically to original creative works upon creation. In the United States, this is governed by the Copyright Act (17 U.S.C.). No registration is required. Web pages, articles, and other online content are copyrighted by default unless the author has explicitly licensed them otherwise.

When a scraping tool copies data from a website, it may be copying text, images, or code that someone else holds the exclusive right to reproduce and distribute.

Fair Use

US copyright law includes a fair use doctrine (17 U.S.C. § 107) that may permit limited reproduction of copyrighted material. Four factors are weighed:

  1. The purpose and character of the use (commercial vs. educational; transformative vs. reproductive);
  2. The nature of the copyrighted work;
  3. The amount and substantiality of the portion used;
  4. The effect on the market for the original work.

AI companies have maintained that training on copyrighted material is permitted under the "fair use" doctrine in copyright law, which allows copyrighted work to be repurposed under certain limited conditions. However, Rabbithole's use case differs from LLM training: it outputs third-party content into publicly served HTML, which is a more direct reproduction than inference-time transformation.

Verbatim duplication of a substantial work (reproducing a book word for word, say) is widely agreed not to be fair use. Operators should not rely on fair use as a general defense for verbatim reproduction of substantial portions of third-party content.

Terms of Service and the CFAA

In addition to copyright, web fetching may implicate website Terms of Service (ToS): most websites have ToS agreements that explicitly prohibit web scraping, and violating them may expose operators to breach-of-contract claims in some jurisdictions. The Computer Fraud and Abuse Act (CFAA) may also apply where access is unauthorized. Unauthorized fetching can therefore raise copyright, contract, privacy, and CFAA issues simultaneously.

The DMCA

The Digital Millennium Copyright Act (DMCA) is a critical statute that impacts web scraping activities. The DMCA prohibits circumventing technological measures that control access to copyrighted works. Web scraping can potentially violate the DMCA if it involves bypassing such measures to access protected content.

Privacy Regulations

If the LLM fetches pages containing personal data, additional regulations may apply. In the United States, the California Consumer Privacy Act (CCPA) imposes strict requirements on the collection and use of personal data. Similarly, the GDPR in the European Union mandates that personal data can only be collected and processed with explicit consent from the data subjects.

Configuring System Prompts for Safe Content Handling

The most effective mitigation available to Rabbithole operators is crafting system prompts that explicitly instruct the LLM to paraphrase fetched content rather than reproduce it. Capable models follow stylistic constraints of this kind fairly reliably, but compliance is not guaranteed, so treat prompting as one layer of mitigation rather than a complete safeguard.

Set a system prompt in your Rabbithole configuration using the system_prompt field. Below are recommended clauses to include:

Recommended System Prompt Clauses

# Copyright and Citation Instructions

When using web_search or web_fetch results, you MUST follow these rules:

1. NEVER reproduce verbatim text from fetched web pages. Always paraphrase
   content in your own words, synthesizing information rather than copying it.

2. NEVER reproduce song lyrics, poetry, or literary fiction passages under
   any circumstances.

3. For news articles: summarize the factual information only. Do not reproduce
   the article's sentences, even partially. Limit yourself to one very short
   quote (under 20 words) per source, and always place quotes in quotation marks
   with the source URL noted.

4. For technical documentation, open-source README files, or official standards:
   paraphrasing is preferred. If a short direct quote is essential for technical
   accuracy, keep it under 20 words and cite the source URL inline.

5. Facts, numbers, and data points may be stated directly, but the expressive
   language around them must be your own.

6. Always attribute the source of factual claims with an inline link or a
   "Source: [URL]" notation at the end of the relevant section.

7. Do not fetch or reproduce content from URLs that require login, are behind
   a paywall, or where robots.txt disallows crawling.
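Because prompt instructions alone cannot guarantee compliance, operators may also want a post-generation check. Below is a minimal sketch (the function is hypothetical, not part of Rabbithole) that flags long verbatim word runs shared between a fetched page and the generated output; in practice the threshold would track the quote limit in the clauses above.

```python
# Hypothetical post-generation check: flag word runs that appear verbatim
# in both a fetched source and the generated page text.
def longest_common_runs(source: str, generated: str, min_words: int = 20) -> list[str]:
    """Return runs of at least `min_words` consecutive source words
    that also appear verbatim in the generated text."""
    src_words = source.split()
    n = len(src_words)
    runs = []
    i = 0
    while i <= n - min_words:
        window = " ".join(src_words[i:i + min_words])
        if window in generated:
            # Extend the match as far as it continues verbatim.
            j = i + min_words
            while j < n and " ".join(src_words[i:j + 1]) in generated:
                j += 1
            runs.append(" ".join(src_words[i:j]))
            i = j
        else:
            i += 1
    return runs
```

A page that triggers this check could be held back from the cache for operator review instead of being served.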

Additional Prompt Hardening by Content Domain

For sites focused on specific domains, add targeted clauses:

# For music-related sites:
Never reproduce any portion of song lyrics. Instead, describe the lyrical
themes, mood, or subject matter in your own words.

# For news-aggregation sites:
Summarize only the key factual claims from news sources. The summary must
not exceed 2-3 sentences per article and must not replicate the original
phrasing. Always link to the original article.

# For academic content sites:
Do not reproduce passages from academic papers. Describe the methodology
and findings in plain language. Note the paper's authors, title, and DOI
if available.

Setting the System Prompt in Configuration

In your rabbithole.toml (or equivalent configuration file), the system prompt is set as follows. See the Configuration page for the full schema.

[llm]
model = "claude-3-5-sonnet-20241022"
system_prompt = """
You are an AI that generates complete HTML pages for a website.

[Copyright and Citation Instructions — paste the block above here]
"""

What the LLM Citation System Does

When Rabbithole is run with a capable model (such as Claude 3.5 Sonnet or GPT-4o), the default system prompt bundled with the server already instructs the model to limit quotation to a maximum of one short quote per source, always enclosed in quotation marks.

Operators who modify the system prompt should preserve or strengthen (never weaken) these constraints.

Safer Sources to Target with web_fetch

When designing prompts for Rabbithole pages that use web tools, prefer directing the LLM toward lower-risk sources:

  - Wikipedia and other CC-licensed wikis (attribute and respect share-alike terms);
  - official standards documentation (IETF RFCs, W3C specs);
  - permissively licensed open-source repositories and README files;
  - US federal government publications;
  - raw factual data such as statistics, measurements, and prices.

Sources to Avoid or Handle with Care

  - News outlets (Reuters, AP, NYT, etc.): summarize facts only, never the prose;
  - lyrics, poetry, and fiction sites: do not reproduce any portion;
  - paywalled or closed-access academic journals: fetching may also violate ToS and the DMCA;
  - blog posts and editorial content: paraphrase and attribute.

robots.txt and Crawl Etiquette

web_fetch in Rabbithole makes a single HTTP GET request per URL. Operators should be aware that many sites express crawl restrictions in robots.txt. While the legal enforceability of robots.txt is debated, ignoring it is widely considered poor practice and, in some jurisdictions, may strengthen a CFAA or ToS-violation claim.

Best practice: when building prompts that direct the LLM to fetch specific domains, manually review https://example.com/robots.txt first and avoid configuring fetches to paths disallowed for all agents (User-agent: *).
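That manual review can also be scripted with Python's stdlib urllib.robotparser. The sketch below checks a path against an inlined robots.txt body; in practice you would first download the site's actual robots.txt, and the helper name is an assumption.

```python
from urllib.robotparser import RobotFileParser

def allowed_for_all_agents(robots_txt: str, url: str) -> bool:
    """Check whether `url` is fetchable under the rules that
    robots.txt applies to all agents (User-agent: *)."""
    parser = RobotFileParser()
    # parse() takes the robots.txt body as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)
```

A prompt-design workflow could run this check over every domain a seed prompt targets and drop any disallowed paths before deployment.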

Attribution Best Practices

Even where reproduction is legally permissible (e.g., CC-licensed content, public domain), attribution is both ethically expected and, in many licenses, legally required. When the LLM uses fetched content, the generated page should include:

  - a link to the original source URL;
  - the title (and author, where known) of the source;
  - any notice the source's license requires (e.g., CC BY-SA attribution and share-alike terms).

You can reinforce this in your system prompt:

Always include a "Sources" section at the bottom of any page that uses
web_search or web_fetch results. List each source as:
  - [Title or description] — <a href="URL">URL</a>
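For operators who post-process generated pages rather than relying on the prompt alone, a Sources section in the format above could be rendered like this. The helper is hypothetical; html.escape keeps fetched titles and URLs from injecting markup into the page.

```python
from html import escape

def render_sources(sources: list[tuple[str, str]]) -> str:
    """Render (title, url) pairs as a Sources section in the
    format recommended above. Escapes untrusted fetched strings."""
    items = "\n".join(
        f'  <li>{escape(title)} — <a href="{escape(url)}">{escape(url)}</a></li>'
        for title, url in sources
    )
    return f"<h2>Sources</h2>\n<ul>\n{items}\n</ul>"
```

Appending this section server-side guarantees attribution appears even when the model forgets the instruction.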

Operator Responsibility

Rabbithole is open-source infrastructure. The project maintainers (github.com/ajbt200128/rabbithole) are not responsible for the content generated by instances operated by third parties. Each operator is solely responsible for:

  - the content served from their instance, including anything the LLM reproduces from fetched pages;
  - configuring system prompts and seed prompts to minimize infringement risk;
  - invalidating cached pages when infringing content is discovered;
  - complying with copyright, ToS, privacy, and other applicable law in their deployment jurisdiction.

Summary Checklist

  1. ☐ System prompt includes explicit instructions to paraphrase, not quote;
  2. ☐ System prompt forbids reproducing lyrics, poetry, and fiction;
  3. ☐ System prompt limits quotes to under 20 words, in quotation marks, one per source;
  4. ☐ System prompt requires source attribution and links;
  5. ☐ Seed prompts avoid directing the LLM to fetch high-risk domains;
  6. ☐ You have reviewed robots.txt for any domains you systematically fetch;
  7. ☐ You have a process to invalidate cached pages if infringing content is discovered;
  8. ☐ You have reviewed applicable law for your deployment jurisdiction.

Related Pages