Block AI Crawlers: Here’s How To Stop Your Site From Being Used for AI Training (OpenAI and Google Bard Guide)

Hey there, website owners! Did you know that AI companies and other online services often use web crawlers to check out what’s on your site? These crawlers, deployed by giants like OpenAI and Google, collect data to train their evolving artificial intelligence (AI) models.

If you wish to exercise greater control over who gets to see and use your content, read on. We’ll guide you on how to adjust your site’s robots.txt file to fend off these AI web crawlers. Keep reading; a step-by-step guide is up next. 👀

Crawlers

AI training isn’t necessarily a bad thing, but if you’re concerned about the ethical and legal implications of AI training data sourcing, the ability to block OpenAI’s and Bard web crawlers is a crucial first step. It won’t remove any content previously scraped, but it’s a starting point in a landscape increasingly concerned with data privacy and consent.

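💡 Before we dive in, let’s quickly understand what a robots.txt file is. Think of it as the bouncer at the door of your website. It tells crawlers which pages they can visit and which ones they can’t. Unlike a real bouncer, though, it can’t physically stop anyone: it works because well-behaved crawlers choose to follow its instructions. This file sits in the root folder of your site (for example, https://yourwebsite.com/robots.txt), so crawlers can find it right away.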
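
As a rough illustration of where crawlers look for the file, the sketch below derives the robots.txt address from any page URL on a site. The domain yourwebsite.com and the helper name robots_txt_url are placeholders chosen for this example only.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path,
    # so the result always points at the root of the site.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page URL; replace it with a page from your own site.
print(robots_txt_url("https://yourwebsite.com/blog/some-post"))
```

Whatever page a crawler starts from, it looks for the file at this single root-level location, which is why a robots.txt placed in a subfolder has no effect.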

OpenAI Crawlers

Start Here: What OpenAI’s Update Means for Your Website

OpenAI has recently announced a feature that allows website operators to block its GPTBot web crawler from scraping content to help train its language models, like GPT-3 or GPT-4. This means you can now explicitly disallow OpenAI’s crawlers in your site’s robots.txt file.

🔊 What OpenAI Says

According to OpenAI, crawled web pages may be used to improve future models, although the company says it filters out sources that sit behind paywalls or that are known to gather personally identifiable information (PII).

🔗
OpenAI stated:

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”

However, opting out could be a significant step towards user privacy and data protection.

📌 How to Block OpenAI’s Crawler

  1. Find Your robots.txt File: This file is usually in the root directory of your website. If you can’t find it, you might need to create one.
  2. Edit the File: Open the robots.txt file with a text editor. If you’re creating a new one, you can use any plain text editor like Notepad on Windows or TextEdit on a Mac.
  3. Add the Rules: Add the following lines to your robots.txt file. They tell the OpenAI crawler not to crawl any pages on your website:
    • User-agent: GPTBot
      Disallow: /
  4. Save and Upload: Save your changes and upload your robots.txt file back to your root directory.
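  5. Refresh Google’s robots.txt cache (optional): This step only concerns Google’s crawlers, not GPTBot. Google re-fetches robots.txt on its own and generally caches it for up to 24 hours, so your changes are picked up without any action on your part. If you want them picked up sooner, you can request a recrawl from the robots.txt report in Google Search Console.
  6. ✅ Once you have completed these steps, GPTBot, which OpenAI says respects robots.txt, should stop crawling your website for AI training. Keep in mind that robots.txt is a request rather than an enforcement mechanism: it only works for crawlers that choose to honor it.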
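
If you prefer to script steps 1 to 4, the following is a minimal sketch using only Python’s standard library. The function names block_crawler and is_blocked are made up for this example, and the demo runs on a throwaway file with a hypothetical existing rule (Disallow: /admin/) rather than on a real site. Uploading the result to your server is still a separate step.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser


def block_crawler(robots_path: Path, token: str = "GPTBot") -> str:
    """Append a 'Disallow: /' group for `token` unless the file already has one.

    Returns the resulting file content.
    """
    # Steps 1-2: read the existing file, or start from an empty one.
    content = robots_path.read_text(encoding="utf-8") if robots_path.exists() else ""

    # Collect the user-agent tokens that already have a group in the file.
    existing = {
        line.split(":", 1)[1].strip().lower()
        for line in content.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }

    # Step 3: add the rules only if the token has no group yet,
    # so existing rules for that crawler are never overwritten.
    if token.lower() not in existing:
        if content and not content.endswith("\n"):
            content += "\n"
        separator = "\n" if content else ""
        content += f"{separator}User-agent: {token}\nDisallow: /\n"
        # Step 4: save the file locally.
        robots_path.write_text(content, encoding="utf-8")
    return content


def is_blocked(robots_txt: str, token: str, url: str = "/") -> bool:
    """Check the rules the way a compliant crawler would read them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, url)


# Demo on a temporary copy, so no real robots.txt is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "robots.txt"
    demo_file.write_text("User-agent: *\nDisallow: /admin/\n", encoding="utf-8")
    updated = block_crawler(demo_file)
    print(updated)
    print("GPTBot blocked:", is_blocked(updated, "GPTBot", "/any-page"))
```

Running block_crawler twice on the same file leaves it unchanged the second time. To check the file your site actually serves, RobotFileParser also provides set_url() and read(), which fetch robots.txt over the network.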

Here are some additional things to keep in mind:

  • You can also use the Allow directive in your robots.txt file to allow the OpenAI crawler to access specific pages on your website.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

  • If you have a large website, you may want to consider using a web crawler management tool to help you manage your robots.txt file.
  • You can also use other methods to keep content away from crawlers, such as password protection. Note that noindex tags only keep pages out of search results; they do not stop a crawler from fetching a page.
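
To see how the Allow and Disallow example above plays out, here is a small sketch that evaluates it with Python’s built-in urllib.robotparser. The page paths and the helper name check_paths are invented for illustration. Python’s parser applies the first matching rule, whereas some crawlers pick the most specific one; for these non-overlapping directories both approaches give the same answer.

```python
from urllib.robotparser import RobotFileParser

# The per-directory rules from the example above.
RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def check_paths(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to True if user_agent may fetch it under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


# Hypothetical pages used only to exercise the rules.
result = check_paths(
    RULES,
    "GPTBot",
    ["/directory-1/page.html", "/directory-2/page.html", "/elsewhere/page.html"],
)
for path, allowed in result.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Note that a path matching no rule at all, like /elsewhere/page.html here, is allowed by default. If the goal is to allow only one directory, you would also need a Disallow: / line for everything else.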

Google Bard Crawlers

The Emergence of Google Bard

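In line with AI evolution, Google also uses web content to train and improve its generative AI products, including Bard (since renamed Gemini). Like OpenAI, Google offers webmasters a way to opt out. Rather than running a separate crawler, Google provides a robots.txt token called Google-Extended: Google’s existing crawlers still fetch the pages, and the token controls whether that content may be used for these AI products.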

🔊 What Google Bard Says

Google highlights the benefits of AI in improving their products and acknowledges the feedback from web publishers seeking more control. They introduced “Google-Extended,” a new tool for publishers to manage how their sites affect Bard and Vertex AI generative APIs. They emphasize transparency, control, and their commitment to engaging with the community for better AI applications.

🔗
Google stated:

“We’re enhancing our products with AI and introducing Google-Extended for publishers to control their content’s role in our AI systems. Our goal is transparency and collaboration with the web and AI communities.”

📌 How to Block Google Bard’s Crawler

  1. Pinpoint Your robots.txt File: As before, it’s usually in the site’s root directory.
  2. Access and Edit: Utilize a text editor to make changes.
  3. Add the Rules: To opt out, add the following lines to your robots.txt file. They tell Google not to use any pages on your website for Bard and the Vertex AI generative APIs:
    • User-agent: Google-Extended
      Disallow: /
  4. Commit and Update: Save your modifications and replace the file in the root directory.
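  5. Alert Google (optional): Google re-fetches robots.txt automatically and generally caches it for up to 24 hours. To speed this up, you can request a recrawl from the robots.txt report in Search Console.
  6. ✅ Your website is now opted out through Google-Extended. This rule does not affect how your site is crawled or ranked in Google Search.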
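
A quick way to sanity-check a file that contains both opt-outs is to parse it with Python’s standard urllib.robotparser, as in the sketch below. The helper name blocked_tokens is an assumption for this example, and the check only reflects how a compliant crawler would read the rules.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that combines the two groups described in this guide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def blocked_tokens(robots_txt: str, tokens: list[str], url: str = "/") -> list[str]:
    """Return the tokens that are not allowed to fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, url)]


# Googlebot is included to confirm that regular search crawling stays allowed.
print(blocked_tokens(ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"]))
```

Only GPTBot and Google-Extended should appear in the printed list. Googlebot has no matching group in this file, so it falls back to the default of being allowed.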

Why Should You Do This?

You might wonder why you should bother doing this. Well, by updating your robots.txt file, you take control. You decide which crawlers are welcome to use your site’s content and which aren’t. This can be especially important if you have content on your site that you don’t want to be part of AI training data. For truly sensitive information, rely on access controls such as password protection, since robots.txt only works for crawlers that honor it.

Final Thoughts

It’s your website, and the choice of who gets to crawl it should be yours. By spending just a few minutes on your robots.txt file, you can take control and tell OpenAI and Google not to use your content for AI training. It’s a simple yet effective first step to protect your site.

About us

iubenda

The solution to draft, update and maintain your Terms and Conditions. Optimised for eCommerce, marketplace, SaaS, apps & more.

www.iubenda.com