How to use robots.txt to block the OpenAI, ChatGPT, Google AI, and CCBot crawlers

Nov 22, 2023


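Most readers already know what robots.txt is for: it tells crawlers which parts of a site they may fetch. Its basic syntax is as follows: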

Syntax to disallow crawling:
User-agent: {BOT-NAME-HERE}
Disallow: /

Syntax to allow crawling:
User-agent: {BOT-NAME-HERE}
Allow: /

The major companies also publish their own documentation on robots.txt:

Google:
https://developers.google.com/search/docs/crawling-indexing/robots/intro
Cloudflare:
https://www.cloudflare.com/learning/bots/what-is-robots-txt/

For OpenAI, the simplest and bluntest approach is to disallow both of its user agents for the whole site:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

ChatGPT's user agent token is ChatGPT-User. Its full user-agent string looks roughly like this:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
The standalone crawler is GPTBot, whose full user-agent string is:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
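Since robots.txt is only advisory, it can be useful to check whether requests carrying these tokens still show up in your access logs. The following is a minimal sketch of matching the two tokens in a user-agent string; the function name classify_ua and its output labels are hypothetical, chosen only for this example.

```shell
#!/bin/bash
# classify_ua (hypothetical helper): print which OpenAI agent, if any,
# a User-Agent string belongs to, based on the tokens shown above.
classify_ua() {
    local ua="$1"
    case "$ua" in
        *ChatGPT-User/*) echo "ChatGPT-User" ;;
        *GPTBot/*)       echo "GPTBot" ;;
        *)               echo "other" ;;
    esac
}
```

A call such as classify_ua "$http_user_agent" could then be used inside a log-processing loop; where the string comes from depends on your server's log format.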

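OpenAI also officially publishes the IP address ranges used by its crawler at the URL below, followed here by the ranges it listed at the time of writing: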

https://openai.com/gptbot-ranges.txt
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
20.9.164.0/24
52.230.152.0/24
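To see whether a given client address falls inside one of these ranges, the CIDR match can be done with plain integer arithmetic. This is a rough sketch for IPv4 only; the helper names ip_to_int and ip_in_cidr are assumptions made for this example, not part of any tool mentioned here.

```shell
#!/bin/bash
# ip_to_int (hypothetical helper): convert a dotted-quad IPv4 address
# to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# ip_in_cidr IP CIDR (hypothetical helper): exit status 0 if IP lies
# inside CIDR, non-zero otherwise.
ip_in_cidr() {
    local ip="$1" net="${2%/*}" bits="${2#*/}"
    # Build the netmask from the prefix length, e.g. /28 -> 0xFFFFFFF0
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}
```

For example, ip_in_cidr 20.15.240.70 20.15.240.64/28 succeeds because a /28 starting at .64 covers .64 through .79, while 20.15.240.80 falls into the next range.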

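The following shell script downloads those IP ranges and adds ufw firewall rules that deny them access to ports 80 and 443. Note that it blocks at the firewall level and does not generate or modify robots.txt: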

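#!/bin/bash
# Purpose: Block OpenAI ChatGPT bot CIDR
# Tested on: Debian and Ubuntu Linux
# Author: Vivek Gite {https://www.cyberciti.biz} under GPL v2.x+
# ------------------------------------------------------------------
file="/tmp/out.txt.$$"
# Download the published ranges; stop if the download fails
wget -q -O "$file" https://openai.com/gptbot-ranges.txt || { rm -f "$file"; exit 1; }

# Read one CIDR per line, including a last line without a trailing newline
while IFS= read -r cidr || [ -n "$cidr" ]
do
    # Skip blank lines so ufw is never called with an empty address
    [ -z "$cidr" ] && continue
    sudo ufw deny proto tcp from "$cidr" to any port 80
    sudo ufw deny proto tcp from "$cidr" to any port 443
done < "$file"
[ -f "$file" ] && rm -f "$file"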

robots.txt syntax to opt out of Google's AI products (the Google-Extended token):
User-agent: Google-Extended
Disallow: /

To block CCBot, the Common Crawl crawler:
User-agent: CCBot
Disallow: /
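The rules for the crawlers covered so far can be combined into a single file. As a small sketch, the function below prints one disallow-all group per bot name; make_robots_block is a hypothetical name used only for this example.

```shell
#!/bin/bash
# make_robots_block (hypothetical helper): print one "Disallow: /"
# group for each bot name passed as an argument.
make_robots_block() {
    local bot
    for bot in "$@"; do
        printf 'User-agent: %s\nDisallow: /\n\n' "$bot"
    done
}
```

Running make_robots_block GPTBot ChatGPT-User Google-Extended CCBot and redirecting the output to robots.txt in the site root would produce the four groups shown in this article. Any existing rules in that file would need to be merged by hand.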

Cloudflare has also added AI bot blocking to its WAF in the form of a new firewall rule, as shown in the figure below:

Figure: OpenAI crawler