Everyone knows what robots.txt is for; its basic syntax is as follows:
Syntax to disallow crawling:
User-agent: {BOT-NAME-HERE}
Disallow: /
Syntax to allow crawling:
User-agent: {BOT-NAME-HERE}
Allow: /
The major companies also publish their own documentation on robots.txt syntax.
Google:
https://developers.google.com/search/docs/crawling-indexing/robots/intro
Cloudflare:
https://www.cloudflare.com/learning/bots/what-is-robots-txt/
For OpenAI, the simplest and bluntest rules are:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
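Once these rules are deployed in the robots.txt at your site root, it is worth confirming the file is actually being served. A minimal check with curl, assuming your domain is example.com (a placeholder; substitute your own):
#!/bin/bash
# Sanity check: fetch the live robots.txt and show the GPTBot rule if present.
# example.com is a placeholder domain; replace it with your own.
curl -s https://example.com/robots.txt | grep -A 1 "GPTBot"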
ChatGPT's agent name is ChatGPT-User, and its full user-agent string looks roughly like this:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
The standalone crawler is GPTBot:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
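With these user-agent strings in hand, you can also check whether the crawlers have already been visiting your site by searching the web server access log. A minimal sketch, assuming an Nginx-style log at /var/log/nginx/access.log (adjust the path for your server):
#!/bin/bash
# Count and list recent requests from OpenAI's crawlers in the access log.
# The log path is an assumption; change it to match your web server.
log="/var/log/nginx/access.log"
echo "Total requests from GPTBot / ChatGPT-User:"
grep -Ec "GPTBot|ChatGPT-User" "$log"
echo "Most recent 20 hits:"
grep -E "GPTBot|ChatGPT-User" "$log" | tail -n 20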
OpenAI has also officially published the IP address ranges its crawler uses:
https://openai.com/gptbot-ranges.txt
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
20.9.164.0/24
52.230.152.0/24
Below is a shell script that takes this IP range list and blocks it with the UFW firewall:
#!/bin/bash
# Purpose: Block OpenAI ChatGPT bot CIDR
# Tested on: Debian and Ubuntu Linux
# Author: Vivek Gite {https://www.cyberciti.biz} under GPL v2.x+
# ------------------------------------------------------------------
file="/tmp/out.txt.$$"
wget -q -O "$file" https://openai.com/gptbot-ranges.txt 2>/dev/null
while IFS= read -r cidr
do
    # Skip any blank lines in the downloaded list
    [ -z "$cidr" ] && continue
    # Block both HTTP (80) and HTTPS (443) traffic from this range
    sudo ufw deny proto tcp from "$cidr" to any port 80
    sudo ufw deny proto tcp from "$cidr" to any port 443
done < "$file"
[ -f "$file" ] && rm -f "$file"
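If you would rather block at the web-server layer than the host firewall, the same range list can be turned into Nginx deny rules instead. A rough sketch, assuming a Debian/Ubuntu-style Nginx layout where /etc/nginx/conf.d/*.conf is included (adjust paths and the reload command to your setup):
#!/bin/bash
# Sketch: convert the published CIDR list into an Nginx "deny" include file.
# The output path assumes Nginx includes /etc/nginx/conf.d/*.conf; adjust as needed.
file="/tmp/gptbot-ranges.$$"
out="/etc/nginx/conf.d/block-gptbot.conf"
wget -q -O "$file" https://openai.com/gptbot-ranges.txt
# Emit one "deny <cidr>;" directive per non-empty line
awk 'NF { print "deny " $1 ";" }' "$file" | sudo tee "$out" >/dev/null
rm -f "$file"
# Validate the configuration and reload Nginx only if the test passes
sudo nginx -t && sudo systemctl reload nginx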
The robots.txt rules to block Google's AI crawler (Google-Extended):
User-agent: Google-Extended
Disallow: /
To block CCBot:
User-agent: CCBot
Disallow: /
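Putting the rules above together, a single robots.txt that covers all four AI crawlers mentioned here looks like this:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /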
Cloudflare also blocks AI bots through its WAF and has introduced a new firewall rule for this, as shown in the figure below: