Deploy Anti AI Scraper Measures #15

Open
opened 2026-04-14 21:52:43 +02:00 by mvdkleijn · 0 comments
Owner

## 1. What

The implementation of a multi-layered defensive perimeter around OpenCommit to identify, challenge, and mitigate unauthorized automated data extraction by AI-related crawlers (e.g., GPTBot, CCBot, AnthropicAI). This includes deploying technical controls at the application level (via `robots.txt`), the network/proxy level (via User-Agent filtering or rate limiting), and, if required, an identity challenge layer (such as Anubis) to verify human-driven traffic.
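As a starting sketch for the application-level layer, the `robots.txt` could disallow the named crawlers while leaving everyone else untouched. The User-Agent tokens below are the ones these vendors have published, but the list is illustrative and would need to be kept current; note also that `robots.txt` is purely advisory, which is why the proxy and challenge layers exist at all.

```
# Disallow known AI-training crawlers (illustrative, not exhaustive)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# All other agents (including search engines) remain allowed
User-agent: *
Disallow:
```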

## 2. Why

To protect the intellectual property and privacy of the repositories hosted on this instance by preventing uncompensated use of source code in Large Language Model (LLM) training sets. Additionally, this initiative aims to reduce infrastructure resource consumption and "noise" caused by high-frequency automated requests, ensuring higher availability and performance for legitimate human users and authorized integrations.

## 3. Boundaries

* **In-Scope:**
  * Configuration of `robots.txt` directives. (Robots.txt Traefik plugin?)
  * Implementation of reverse proxy rules (Traefik) for User-Agent blocking or rate limiting.
  * Evaluation and deployment of challenge-based utilities (e.g., Anubis).
  * Configuration of "Good Bot" allowlists to ensure no loss of critical service functionality.
* **Out-of-Scope:**
  * Modification of the Forgejo core codebase or internal application logic.
  * Blocking of legitimate human users or authenticated developers.
  * Decommissioning of search engine visibility (unless specifically identified as a risk during implementation).
  * Any measures that would break essential CI/CD webhooks or third-party integrations.
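One possible shape for the proxy-level rule, assuming Traefik v3 with file-provider dynamic configuration: a high-priority router matches AI-crawler User-Agents via `HeaderRegexp` and attaches the built-in `rateLimit` middleware. Router/middleware/service names, the entrypoint, and the regex are all placeholders; outright blocking (returning 403) would typically require a plugin or a dedicated deny service rather than the rate limiter shown here.

```yaml
# Traefik dynamic configuration (file provider) — a sketch, names illustrative.
http:
  middlewares:
    crawler-ratelimit:
      rateLimit:           # built-in Traefik middleware
        average: 10        # steady-state requests per second
        burst: 20

  routers:
    ai-scrapers:
      # HeaderRegexp is the Traefik v3 matcher (v2 used HeadersRegexp)
      rule: "HeaderRegexp(`User-Agent`, `(GPTBot|CCBot|ClaudeBot)`)"
      priority: 1000       # evaluated before the normal Forgejo router
      entryPoints:
        - websecure
      middlewares:
        - crawler-ratelimit
      service: forgejo     # or a dedicated deny/challenge service
```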

## 4. Definition of Done

* [ ] **Deployment:** A multi-layered defense strategy is active and operational across the edge/proxy layer.
* [ ] **Shadowing:** Initial deployment of measures is in a "shadow" or "log-only" mode so administrators can verify that normal traffic remains unaffected.
* [ ] **Validation:** Verification testing confirms that known AI scrapers (e.g., GPTBot) are either blocked or successfully challenged by the new layers.
* [ ] **Integrity Check:** An audit of "Good Bot" traffic (Search Engines, CI/CD webhooks, and Archive bots) confirms they remain unaffected and can still access necessary resources.
* [ ] **Observability:** Monitoring or logging is configured to track the frequency of blocked attempts and any impact on server latency/resource usage.
* [ ] **Documentation:** The new security architecture, including the "allowlist" of permitted bots, is documented for future maintenance.
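For the Observability item, one lightweight option (independent of whatever monitoring stack is chosen) is a small script that tallies AI-crawler hits from the proxy's access logs. The sketch below assumes combined-format logs where the User-Agent is the final quoted field; the bot token list is illustrative.

```python
import re
from collections import Counter

# Illustrative User-Agent substrings to audit for (keep in sync with the blocklist).
AI_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "anthropic-ai")

# In combined log format, the User-Agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_ai_hits(log_lines):
    """Count access-log lines per matching AI-crawler token."""
    hits = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if not m:
            continue
        ua = m.group(1)
        for bot in AI_BOTS:
            if bot in ua:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [14/Apr/2026:21:52:43 +0200] "GET /repo HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [14/Apr/2026:21:52:44 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(count_ai_hits(sample))  # Counter({'GPTBot': 1})
```

Run periodically (or piped from `tail -f`), this gives a first-pass trend of blocked attempts before committing to a full dashboard.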