Citation and Copyright Policy
Applies to: Web Tools (web_search / web_fetch) — Last updated: March 2026
Overview
Rabbithole's web tools feature allows the LLM to call
web_search and web_fetch at page-generation time, pulling live
content from the internet to enrich generated pages with current facts, data, and references.
This is a powerful capability, but it introduces real risks: the LLM may incorporate
third-party copyrighted material into generated HTML pages that are then served publicly.
Unlike a human writer who actively makes judgments about quotation and attribution, an LLM generating HTML on the fly may reproduce copyrighted text without any indication that it is doing so. This document explains the risks, classifies content types by copyright sensitivity, and describes configuration strategies to minimize exposure.
How the Risk Arises
When web_fetch retrieves a URL, the full text of that page is injected into
the LLM's context window. The model then generates HTML that may:
- Quote passages verbatim from the fetched content without quotation marks or attribution;
- Reproduce content that is only slightly paraphrased — what courts sometimes call "excessively close paraphrase";
- Aggregate large portions of a copyrighted work across multiple sections of a generated page.
In short, generated text may contain verbatim snippets of non-free content, constitute a derivative work, or, especially when summarizing news articles, amount to an excessively close paraphrase.
Because Rabbithole serves generated pages publicly and caches them permanently, infringing content may remain publicly accessible indefinitely unless the cache is explicitly invalidated. Operators are responsible for the content served from their Rabbithole instances.
Content Risk Classification
Not all web content carries equal copyright risk. The table below provides a practical risk classification for the most common content types the web tools may encounter.
| Content Type | Risk Level | Notes |
|---|---|---|
| News articles (Reuters, AP, NYT, etc.) | HIGH | Journalistic prose is fully protected. Even brief reproduction can constitute infringement. News organizations have actively litigated against AI content use. |
| Song lyrics | HIGH | Among the most aggressively enforced copyright. Even reproducing a few lines verbatim is highly risky. Concord Music Group v. Anthropic specifically alleged that lyrics can be accessed verbatim or near-verbatim from LLM outputs. |
| Book excerpts / fiction | HIGH | Creative literary works carry the strongest copyright protection. The LLM should not reproduce passages, and even detailed plot summaries can be derivative of the original. |
| Blog posts & editorial content | MEDIUM-HIGH | Protected by default. Attribution alone does not cure infringement. |
| Academic papers (closed-access journals) | MEDIUM-HIGH | Publisher-owned content. Fetching behind paywalls may additionally violate Terms of Service and the DMCA. |
| Wikipedia articles | LOW-MEDIUM | Released under CC BY-SA 4.0. Reproduction is permitted with attribution and share-alike compliance. Paraphrasing without attribution is still best practice. |
| Official technical documentation (IETF RFCs, W3C specs) | LOW | Much official standards documentation carries permissive reproduction rights, though always verify the specific license. |
| Open-source code and README files (MIT, Apache 2.0, etc.) | LOW | Permissive licenses generally allow reproduction with attribution. Verify the license — GPL requires copyleft compliance if code is distributed. |
| Government publications (US federal) | VERY LOW | US federal government works are generally in the public domain under 17 U.S.C. § 105. State and foreign government works vary. |
| Raw factual data (statistics, measurements, prices) | VERY LOW | Pure facts are not copyrightable. The expression used to convey facts may be, however, so reproduction of the surrounding prose remains risky. |
Legal Background
Copyright Fundamentals
Copyright attaches automatically to original creative works upon creation. In the United States, this is governed by the Copyright Act (17 U.S.C.). No registration is required. Web pages, articles, and other online content are copyrighted by default unless the author has explicitly licensed them otherwise.
When a scraping tool copies data from a website, it may be copying text, images, or code that someone else holds the exclusive rights to reproduce and distribute.
Fair Use
US copyright law includes a fair use doctrine (17 U.S.C. § 107) that may permit limited reproduction of copyrighted material. Four factors are weighed:
- The purpose and character of the use (commercial vs. educational; transformative vs. reproductive);
- The nature of the copyrighted work;
- The amount and substantiality of the portion used;
- The effect on the market for the original work.
AI companies have maintained that their datasets are protected by the "fair use" doctrine in copyright law, which allows for copyrighted work to be repurposed under certain limited conditions. However, Rabbithole's use case differs from LLM training: it is outputting third-party content into publicly served HTML, which is a more direct reproduction than inference-time transformation.
Verbatim duplication of a substantial portion of a work is very unlikely to qualify as fair use. Operators should not rely on fair use as a general defense for verbatim reproduction of substantial portions of third-party content.
Terms of Service and the CFAA
In addition to copyright, web fetching may implicate website Terms of Service. Most websites publish ToS that explicitly prohibit automated scraping, and violating them may expose operators to breach-of-contract claims in some jurisdictions. Where access is unauthorized, the Computer Fraud and Abuse Act (CFAA) may also apply.
The DMCA
The Digital Millennium Copyright Act (DMCA) also bears on web fetching. Its anti-circumvention provisions (17 U.S.C. § 1201) prohibit bypassing technological measures that control access to copyrighted works, so fetching that evades such measures to reach protected content can violate the DMCA independently of any infringement claim.
Privacy Regulations
If the LLM fetches pages containing personal data, additional regulations may apply. In the United States, the California Consumer Privacy Act (CCPA) imposes strict requirements on the collection and use of personal data. In the European Union, the GDPR requires a lawful basis, such as consent, for any collection or processing of personal data.
Configuring System Prompts for Safe Content Handling
The most effective mitigation available to Rabbithole operators is crafting system prompts that explicitly instruct the LLM to paraphrase fetched content rather than reproduce it. Capable models follow stylistic constraints of this kind fairly reliably, but no prompt guarantees compliance; generated output should still be monitored.
Set a system prompt in your Rabbithole configuration
using the system_prompt field. Below are recommended clauses to include:
Recommended System Prompt Clauses
# Copyright and Citation Instructions
When using web_search or web_fetch results, you MUST follow these rules:
1. NEVER reproduce verbatim text from fetched web pages. Always paraphrase
content in your own words, synthesizing information rather than copying it.
2. NEVER reproduce song lyrics, poetry, or literary fiction passages under
any circumstances.
3. For news articles: summarize the factual information only. Do not reproduce
the article's sentences, even partially. Limit yourself to one very short
quote (under 20 words) per source, and always place quotes in quotation marks
with the source URL noted.
4. For technical documentation, open-source README files, or official standards:
paraphrasing is preferred. If a short direct quote is essential for technical
accuracy, keep it under 20 words and cite the source URL inline.
5. Facts, numbers, and data points may be stated directly, but the expressive
language around them must be your own.
6. Always attribute the source of factual claims with an inline link or a
"Source: [URL]" notation at the end of the relevant section.
7. Do not fetch or reproduce content from URLs that require login, are behind
a paywall, or where robots.txt disallows crawling.
Additional Prompt Hardening by Content Domain
For sites focused on specific domains, add targeted clauses:
# For music-related sites:
Never reproduce any portion of song lyrics. Instead, describe the lyrical
themes, mood, or subject matter in your own words.
# For news-aggregation sites:
Summarize only the key factual claims from news sources. The summary must
not exceed 2-3 sentences per article and must not replicate the original
phrasing. Always link to the original article.
# For academic content sites:
Do not reproduce passages from academic papers. Describe the methodology
and findings in plain language. Note the paper's authors, title, and DOI
if available.
Setting the System Prompt in Configuration
In your rabbithole.toml (or equivalent configuration file), the
system prompt is set as follows. See the Configuration
page for the full schema.
[llm]
model = "claude-3-5-sonnet-20241022"
system_prompt = """
You are an AI that generates complete HTML pages for a website.
[Copyright and Citation Instructions — paste the block above here]
"""
What the LLM Citation System Does
When Rabbithole is run with a capable model (such as Claude 3.5 Sonnet or GPT-4o), the default system prompt bundled with the server already instructs the model to limit quotation to at most one short quote per source, always in quotation marks.
The built-in instructions tell the model to:
- Limit any single quote to under 20 words, always inside quotation marks;
- Use at most one quote per search result;
- Lead with synthesis and original analysis rather than reproduction;
- Provide inline source links or citations for factual claims.
Operators who modify the system prompt should preserve or strengthen (never weaken) these constraints.
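Since prompt instructions are not guaranteed to be followed, operators may also want a post-generation lint that flags quotes exceeding the 20-word limit. The sketch below is a simple heuristic over plain and curly double quotes, not a full HTML parser; the word limit mirrors the built-in instructions.

```python
import re

# Matches "straight-quoted" and \u201ccurly-quoted\u201d spans. HTML attribute
# values also use double quotes, but those are short and won't be flagged.
QUOTE_RE = re.compile(r'"([^"]+)"|\u201c([^\u201d]+)\u201d')
MAX_QUOTE_WORDS = 20  # mirrors the built-in prompt constraint

def overlong_quotes(html_text: str) -> list[str]:
    """Return quoted spans in the generated page that exceed the word limit."""
    flagged = []
    for match in QUOTE_RE.finditer(html_text):
        quote = match.group(1) or match.group(2)
        if len(quote.split()) > MAX_QUOTE_WORDS:
            flagged.append(quote)
    return flagged
```

A page whose output trips this check can be held back from the cache for manual review rather than served publicly.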
Safer Sources to Target with web_fetch
When designing prompts for Rabbithole pages that use web tools, prefer directing the LLM toward lower-risk sources:
- Wikipedia — CC BY-SA 4.0 licensed; paraphrase and attribute;
- GitHub repositories — Most use MIT or Apache 2.0; check the LICENSE file; README content is generally safe to paraphrase;
- Official project documentation (docs.rs, MDN, Python docs, IETF RFCs) — Typically permissively licensed or in the public domain;
- US government websites (.gov domains) — Federal works are generally public domain under 17 U.S.C. § 105;
- Open-access research (arxiv.org, PubMed Central) — Check the specific article license (CC BY is common on arXiv);
- The Wayback Machine / Internet Archive — Content retains its original copyright; the archive itself does not change rights;
- Structured data sources (OpenStreetMap, Wikidata) — ODbL or CC0 licensed; generally very safe.
Sources to Avoid or Handle with Care
- Major news outlets — Reuters, AP, NYT, Washington Post, BBC, The Guardian, etc. Paraphrase only; never reproduce prose;
- Music lyrics sites — Genius, AZLyrics, etc. Do not fetch these pages at all for the purpose of reproducing lyrics;
- eBook and audiobook content — Works published before 1931 are in the US public domain as of 2026 (Project Gutenberg hosts many of these); all later works require care;
- Paywalled content — Any page requiring a subscription or login; do not attempt to bypass access controls;
- Social media platforms — Platform Terms of Service typically prohibit both scraping and republication of user content.
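One way to operationalize the two lists above is a coarse domain filter applied before a URL ever reaches web_fetch. The sketch below is illustrative: the domain entries echo the examples above, and the "review" default for unknown domains is an assumption about your workflow, not a complete policy.

```python
from urllib.parse import urlparse

# Illustrative tiers based on the lists above; extend for your deployment.
LOWER_RISK = {"wikipedia.org", "github.com", "developer.mozilla.org",
              "arxiv.org", "openstreetmap.org"}
HIGH_RISK = {"genius.com", "azlyrics.com", "nytimes.com", "reuters.com"}

def classify_fetch_target(url: str) -> str:
    """Return 'allow', 'deny', or 'review' for a candidate web_fetch URL."""
    host = (urlparse(url).hostname or "").lower()
    for domain in LOWER_RISK:
        if host == domain or host.endswith("." + domain):
            return "allow"
    for domain in HIGH_RISK:
        if host == domain or host.endswith("." + domain):
            return "deny"
    return "review"  # unknown domains get a human look before fetching
```

Suffix matching (rather than substring matching) prevents a hostile domain like "notgithub.com" from slipping through as an allowed source.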
robots.txt and Crawl Etiquette
web_fetch in Rabbithole makes a single HTTP GET request per URL.
Operators should be aware that many sites express crawl restrictions in
robots.txt. While the legal enforceability of robots.txt
is debated, ignoring it is widely considered poor practice and, in some jurisdictions,
may strengthen a CFAA or ToS-violation claim.
Best practice: when building prompts that direct the LLM to fetch specific domains,
manually review https://example.com/robots.txt first and avoid configuring
fetches to paths disallowed for all agents (User-agent: *).
Attribution Best Practices
Even where reproduction is legally permissible (e.g., CC-licensed content, public domain), attribution is both ethically expected and, in many licenses, legally required. When the LLM uses fetched content, the generated page should include:
- A link to the original source URL;
- The name of the author or organization, if known;
- The license under which the content is reproduced, if applicable (e.g., "via Wikipedia, CC BY-SA 4.0").
You can reinforce this in your system prompt:
Always include a "Sources" section at the bottom of any page that uses
web_search or web_fetch results. List each source as:
- [Title or description] — <a href="URL">URL</a>
Operator Responsibility
Rabbithole is open-source infrastructure. The project maintainers (github.com/ajbt200128/rabbithole) are not responsible for the content generated by instances operated by third parties. Each operator is solely responsible for:
- The system prompt configuration they deploy;
- The URLs and seed prompts they configure for their site;
- Monitoring generated page content for policy compliance;
- Responding to takedown or DMCA requests directed at their hosted instance;
- Ensuring their deployment complies with applicable law in their jurisdiction.
Summary Checklist
- ☐ System prompt includes explicit instructions to paraphrase, not quote;
- ☐ System prompt forbids reproducing lyrics, poetry, and fiction;
- ☐ System prompt limits quotes to under 20 words, in quotation marks, one per source;
- ☐ System prompt requires source attribution and links;
- ☐ Seed prompts avoid directing the LLM to fetch high-risk domains;
- ☐ You have reviewed robots.txt for any domains you systematically fetch;
- ☐ You have a process to invalidate cached pages if infringing content is discovered;
- ☐ You have reviewed applicable law for your deployment jurisdiction.