Hey there, website owners! Do you know that search engines and other online services often use AI crawlers to check out what’s on your site? These crawlers, deployed by giants like OpenAI and Google, collect data to train their evolving artificial intelligence (AI) models.
If you wish to exercise greater control over who gets to see and use your content, read on. We'll show you how to adjust your site's robots.txt file to fend off these AI web crawlers. A step-by-step guide is up next. 👀
AI training isn't necessarily a bad thing, but if you're concerned about the ethical and legal implications of how AI training data is sourced, the ability to block OpenAI's and Google Bard's web crawlers is a crucial first step. It won't remove any previously scraped content, but it's a starting point in a landscape increasingly concerned with data privacy and consent.
💡 Before we dive in, let’s quickly understand what a robots.txt
file is. Think of it as the bouncer at the door of your website. It tells crawlers which pages they can visit and which ones they can’t. This file sits in the main folder of your site, so crawlers can find it right away.
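For instance, a very simple robots.txt file might look something like this (ExampleBot is just a placeholder name used for illustration):
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
The first group tells a crawler calling itself ExampleBot to stay out of the /private/ section, while the second group leaves every other crawler free to browse the whole site.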
OpenAI has recently announced a feature that allows website operators to block its GPTBot web crawler from scraping content to help train its language models, like GPT-3 or GPT-4. This means you can now explicitly disallow OpenAI’s crawlers in your site’s robots.txt
file.
According to OpenAI, crawled web pages may contribute to future models, although the company filters out content behind paywalls and content known to gather personally identifiable information (PII).
However, opting out could be a significant step towards user privacy and data protection.
Here's how to do it, step by step:

1. Find your robots.txt file: This file is usually in the root directory of your website. If you can't find it, you might need to create one.
2. Open the robots.txt file with a text editor. If you're creating a new one, you can use any plain text editor, like Notepad on Windows or TextEdit on a Mac.
3. Add the following lines to your robots.txt file (this will tell the OpenAI crawler not to crawl any pages on your website):
   User-agent: GPTBot
   Disallow: /
4. Save the robots.txt file back to your root directory.
5. Test your robots.txt file. To prompt Googlebot to re-crawl it, you can use the following URL in Google Search Console (replacing yourwebsite.com with your own domain):
   https://www.google.com/webmasters/tools/robots?siteUrl=https://yourwebsite.com
6. Optional: You can also use the Allow directive in your robots.txt file to let the OpenAI crawler access specific pages on your website, for example:
   User-agent: GPTBot
   Allow: /directory-1/
   Disallow: /directory-2/
7. Save and re-upload your updated robots.txt file.

👉 Discover the 10 Fatal Website Mistakes You Must Avoid to Shield Your Reputation and Protect Your Users!
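Once your updated robots.txt file is live, you may want to confirm that GPTBot is really being turned away. Here's a small sketch using Python's built-in urllib.robotparser module (https://yourwebsite.com is a placeholder, so swap in your own domain before running it):

from urllib.robotparser import RobotFileParser

# Point the parser at your live robots.txt file
# (yourwebsite.com is a placeholder domain).
parser = RobotFileParser()
parser.set_url("https://yourwebsite.com/robots.txt")
parser.read()

# With "User-agent: GPTBot" and "Disallow: /" in place,
# this should print False for any page on the site.
print(parser.can_fetch("GPTBot", "https://yourwebsite.com/some-page/"))

# A crawler with no matching rules in the file is allowed by default,
# so this will typically print True.
print(parser.can_fetch("SomeOtherBot", "https://yourwebsite.com/some-page/"))

If the first check still prints True, double-check that the file was uploaded to your root directory and that the user agent is spelled exactly GPTBot.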
In line with AI's evolution, Google Bard has its own set of crawlers that visit websites for model training. Like OpenAI, Google recognizes the importance of user privacy and gives webmasters the choice to block its crawlers.
Google highlights the benefits of AI in improving their products and acknowledges the feedback from web publishers seeking more control. They introduced “Google-Extended,” a new tool for publishers to manage how their sites affect Bard and Vertex AI generative APIs. They emphasize transparency, control, and their commitment to engaging with the community for better AI applications.
To block Google Bard's crawler, follow the same steps as above, but add the following lines to your robots.txt file instead (this will tell the Google Bard crawler not to crawl any pages on your website):
User-agent: Google-Extended
Disallow: /
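If you'd like to shut the door on both crawlers at once, your robots.txt file can simply list both user agents, each with its own rule. It could look something like this:
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /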
You might wonder why you should bother doing this. Well, by updating your robots.txt
file, you take control. You decide who can look at your site’s content and who can’t. This can be especially important if you have sensitive information on your site that you don’t want to be part of AI training data.
It’s your website, and the choice of who gets to crawl it should be yours. By spending just a few minutes on your robots.txt
file, you can take control and prevent OpenAI's and Google's crawlers from exploring your content. It's a simple yet effective step to protect your site.
Compliance Isn’t Optional, It’s Required! 👉 Discover here a Simple Guide to Laws and Regulations for Websites – and how to comply!