How to use robots.txt to block the OpenAI, ChatGPT, Google AI, and CCBot crawlers

robots.txt tells crawlers which parts of a site they may crawl. Its basic syntax is as follows:

Syntax to disallow crawling:
User-agent: {BOT-NAME-HERE}
Disallow: /

Syntax to allow crawling:
User-agent: {BOT-NAME-HERE}
Allow: /

The major companies also publish their own documentation on robots.txt syntax:

Google
https://developers.google.com/search/docs/crawling-indexing/robots/intro
Cloudflare
https://www.cloudflare.com/learning/bots/what-is-robots-txt/

For OpenAI, the simplest and bluntest rules are:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
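
After deploying rules like these, it can be useful to confirm that a robots.txt file really contains a blanket block for a given bot. The following is a minimal sketch, not a full robots.txt parser: the helper name is_blocked is hypothetical, and it only looks for a group that names the bot and contains "Disallow: /". It ignores wildcards, Allow overrides, and path-specific rules.

```shell
#!/bin/bash
# is_blocked FILE BOT  (hypothetical helper, simplified check)
# Succeeds (exit 0) if FILE has a group naming BOT that contains "Disallow: /".
is_blocked() {
    local file="$1" bot="$2"
    awk -v bot="$bot" '
        {
            line = $0
            sub(/#.*/, "", line)                 # strip comments
            n = index(line, ":")
            if (n == 0) next                     # not a "key: value" line
            key = tolower(substr(line, 1, n - 1))
            val = substr(line, n + 1)
            gsub(/^[ \t]+|[ \t\r]+$/, "", key)   # trim whitespace
            gsub(/^[ \t]+|[ \t\r]+$/, "", val)
            if (key == "user-agent") {
                if (!in_agents) match_group = 0  # a new group starts here
                in_agents = 1
                if (tolower(val) == tolower(bot)) match_group = 1
            } else {
                in_agents = 0
                if (match_group && key == "disallow" && val == "/") found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

Example use: is_blocked robots.txt GPTBot && echo "GPTBot is blocked"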

OpenAI has also officially published the IP address ranges its crawler uses:

https://openai.com/gptbot-ranges.txt
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
20.9.164.0/24
52.230.152.0/24
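
To check whether an address seen in an access log falls inside one of these ranges, the CIDR arithmetic can be done directly in the shell. This is a rough sketch under stated assumptions: the helper names ip_to_int and in_cidr are hypothetical, and the input is assumed to be a well-formed dotted-quad IPv4 address without leading zeros.

```shell
#!/bin/bash
# ip_to_int A.B.C.D  (hypothetical helper)
# Prints the IPv4 address as a single integer.
ip_to_int() {
    local IFS=.
    set -- $1                      # split the address on dots
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP NETWORK/BITS  (hypothetical helper)
# Succeeds if IP lies inside the given CIDR block.
in_cidr() {
    local ip="$1" net="${2%/*}" bits="${2#*/}"
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

Example use: in_cidr 20.15.240.70 20.15.240.64/28 && echo "inside the GPTBot range"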

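The following shell script downloads those IP ranges and blocks them at the firewall with ufw (robots.txt itself cannot filter by IP address):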

#!/bin/bash
# Purpose: Block OpenAI ChatGPT bot CIDR
# Tested on: Debian and Ubuntu Linux
# Author: Vivek Gite {https://www.cyberciti.biz} under GPL v2.x+
# ------------------------------------------------------------------
file="/tmp/out.txt.$$"
# Download the published ranges; stop if the download fails
wget -q -O "$file" https://openai.com/gptbot-ranges.txt || { rm -f "$file"; exit 1; }

# Read one CIDR per line; the extra test also handles a last line without a newline
while IFS= read -r cidr || [ -n "$cidr" ]
do
    [ -z "$cidr" ] && continue   # skip blank lines
    sudo ufw deny proto tcp from "$cidr" to any port 80
    sudo ufw deny proto tcp from "$cidr" to any port 443
done < "$file"
rm -f "$file"
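
Before changing a live firewall, it may be safer to preview the rules first. The sketch below is one possible dry-run variant. It assumes the ranges have already been saved to a local file, and the function name gen_ufw_rules is hypothetical. It prints the same ufw commands as the script above instead of running them, and skips any line that does not look like an IPv4 CIDR.

```shell
#!/bin/bash
# gen_ufw_rules FILE  (hypothetical helper, dry run only)
# Prints one "ufw deny" command per CIDR and port; nothing is executed.
gen_ufw_rules() {
    local file="$1" cidr port
    while IFS= read -r cidr || [ -n "$cidr" ]; do
        cidr=$(printf '%s' "$cidr" | tr -d '\r')    # tolerate CRLF line endings
        # Skip anything that is not a plain IPv4 CIDR such as 20.15.240.64/28
        if ! printf '%s\n' "$cidr" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$'; then
            continue
        fi
        for port in 80 443; do
            printf 'ufw deny proto tcp from %s to any port %s\n' "$cidr" "$port"
        done
    done < "$file"
}
```

Example use: gen_ufw_rules ranges.txt prints the commands for review; once they look right, they can be run with sudo.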

robots.txt syntax to block Google's AI (the Google-Extended token):
User-agent: Google-Extended
Disallow: /

Block CCBot (the Common Crawl crawler):
User-agent: CCBot
Disallow: /
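
The per-bot rules in this post can be combined into a single robots.txt. As a small sketch, the hypothetical function build_robots_txt below prints one "Disallow: /" group for each user-agent token passed to it, with a blank line between groups. The tokens in the usage line are the ones named in this post.

```shell
#!/bin/bash
# build_robots_txt BOT...  (hypothetical helper)
# Prints a blanket "Disallow: /" group for every bot name given.
build_robots_txt() {
    local bot first=1
    for bot in "$@"; do
        [ "$first" -eq 1 ] || printf '\n'   # blank line between groups
        first=0
        printf 'User-agent: %s\nDisallow: /\n' "$bot"
    done
}
```

Example use: build_robots_txt GPTBot ChatGPT-User Google-Extended CCBot > robots.txt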

Cloudflare has also developed a WAF feature to stop AI bots and has rolled it out as a new firewall rule, as shown in the figure below:

[Figure: Cloudflare firewall rule for blocking AI bots]