How to Stop OpenAI’s Bot from Crawling Your Website

Introduction

Founded in 2015, OpenAI is an American artificial intelligence (AI) research laboratory created with the intention of developing artificial general intelligence. The company behind ChatGPT, one of the most capable AI-powered language models ever developed, has released a new web crawler: the OpenAI GPTBot. Unlike the indexers and spiderbots that search engines operate to index webpages (Googlebot, Applebot, Bingbot, WebCrawler, GRUB, etc.), GPTBot scans and extracts data from websites to train OpenAI’s large language models (LLMs), making that content a direct part of the company’s end product.

AI’s Access to Data and Content

While the benefits of AI in improving productivity and driving economic growth are undeniable, OpenAI has recently come under increased scrutiny and faces a litany of lawsuits alleging copyright infringement and data theft. Microsoft’s acquisition of GitHub, commonly used to host open-source software development projects, brought with it a lawsuit alleging that large swaths of licensed source code were ingested to train OpenAI’s Codex model.

Questions about how, when and what data OpenAI has used, and continues to use, to train its models are not without merit. Considering that OpenAI itself charges a fee to use its platform, it isn’t unreasonable for a website owner to charge companies like OpenAI for using their data. Social media platforms like Reddit and Twitter (the irony) have already said that they plan to begin charging for access to their content.

Age of Consent

Because we do still live in the age of consent, the decision on whether publicly accessible content is made available to any web-crawling spider should ultimately lie with the creator of the website. It is exactly for this reason that the robots exclusion standard, documented at robotstxt.org, was established back in 1994. By creating a robots.txt file on your web server, you can state explicitly which robots should not crawl your web content and/or which content you would like to block from being crawled.

In this article, we share how to stop OpenAI’s bot from crawling your website by setting one or more rules in a robots.txt file.

How to Create a Robots.txt File

A robots.txt file lives in the root folder of your website. Document roots for primary domains (e.g. example.com) usually live in a folder called ‘public_html’. This folder is where the robots.txt file should be placed. If your website is hosted on a subdomain (e.g. subdomain.example.com), the root folder is usually ‘public_html/subdomain.example.com’.

To create the file, open any text editor (e.g. Notepad) on your computer or server, type the desired rules, and save the file to the document root folder with the name ‘robots.txt’.
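
If you prefer to script this step, here is a minimal sketch in Python. It assumes a typical shared-hosting layout where the document root is a ‘public_html’ folder under your home directory, and it writes the GPTBot-blocking rules covered later in this article; adjust the path and rules to suit your own site.

# Minimal sketch: write a robots.txt file into the document root.
# The 'public_html' location is an assumption typical of shared hosting; adjust as needed.
from pathlib import Path

document_root = Path.home() / "public_html"   # assumed document root
rules = "User-agent: GPTBot\nDisallow: /\n"   # block OpenAI's GPTBot everywhere

robots_file = document_root / "robots.txt"
robots_file.write_text(rules)
print(f"Wrote {robots_file}")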

 

 

Basic Robots Rules

Rules are instructions for crawlers about which parts of your website’s publicly accessible content they can and cannot crawl. A basic robots.txt file that allows all bots, spiders, crawlers, etc. to access any part of your website’s content looks like this:

User-agent: *
Allow: /
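
If you want to sanity-check how a crawler is likely to interpret a rules file before uploading it, Python’s standard-library robots.txt parser offers a rough approximation. For the permissive file above, a minimal check might look like this:

# Quick check: with "allow everything" rules, any crawler may fetch any path.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Allow: /"])

print(parser.can_fetch("GPTBot", "https://example.com/any-page"))  # True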

 

 

Grouped Robots Rules

A robots.txt file can contain groups of rules, with each group containing multiple rules, one rule per line. Each group has to start with a ‘User-agent’ line that specifies which bot the rules apply to. This is an example of a grouped rule:

User-agent: Googlebot
Disallow: /somefolder
Allow: /otherfolder
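
The same standard-library parser can illustrate how a grouped rule is applied: the folders below are the placeholders from the example above, and only the named bot is bound by the group.

# Check how the grouped rules apply to Googlebot versus an unrelated crawler.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /somefolder",
    "Allow: /otherfolder",
])

print(parser.can_fetch("Googlebot", "https://example.com/somefolder/page"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/otherfolder/page"))    # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/somefolder/page"))  # True (no rules apply to it)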

 

 

Multiple Robots Rules

To combine rules for multiple groups or robots, you would use the examples above within the same robots.txt file. For example, the following rules block OpenAI’s GPTBot from accessing any of your website’s content while allowing Googlebot full access:

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
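
As a final sanity check, the same parser (again, only an approximation of how real crawlers read the file) confirms that this combined file shuts out GPTBot while leaving Googlebot unaffected:

# Confirm the combined rules: Googlebot allowed everywhere, GPTBot blocked everywhere.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Allow: /",
    "",
    "User-agent: GPTBot",
    "Disallow: /",
])

print(parser.can_fetch("Googlebot", "https://example.com/"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/"))     # False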

 

More useful examples of grouping robots.txt lines and rules are available in Google’s robots.txt documentation.

Having said that, we have not implemented a robots.txt block for GPTBot on Siliceous Solutions’ website. 🙂 Here’s an example of a code script OpenAI’s web crawlers could potentially have scraped:
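
The snippet below is a stand-in for illustration only; it is not code taken from our site, just the kind of trivial script a crawler might pick up from a public page.

# Hypothetical example of page content a crawler could scrape and ingest.
def greet(visitor: str) -> str:
    return f"Hello, {visitor} - thanks for crawling by!"

print(greet("GPTBot"))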