Google Introduces New Control for Web Publishers Over AI Training Data

Google has announced a new tool called Google-Extended, which allows web publishers to opt out of having their data used to train the company’s AI models while remaining accessible through Google Search. The tool gives publishers more control over whether their sites contribute to improving Bard, Google’s AI chatbot, and the Vertex AI generative APIs, part of its machine learning platform.

What Is Google-Extended?

Google-Extended is a “standalone product token that web publishers can use to manage whether their sites help improve Bard and Vertex AI generative APIs” and the AI models that power them. The token is used in robots.txt, the text file that tells web crawlers which parts of a site they may access.


Web publishers can use Google-Extended to specify which directories or pages of their sites they want to exclude from being used as training data for Google’s AI models, while still allowing them to be crawled and indexed by Google Search. For example, publishers can use the following syntax in their robots.txt file:

User-agent: Google-Extended
Disallow: /paywall-content/
Allow: /

This instructs Google-Extended not to access or use the content in the “paywall-content” directory to improve Bard and Vertex AI generative APIs, but to access and use content from all other site directories.
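You can verify how these rules resolve with Python’s standard-library robots.txt parser, which applies the same user-agent matching and path rules. This is a minimal sketch; the site and URLs (`example.com`, the article paths) are hypothetical, and only the rules themselves come from the example above.

```python
from urllib.robotparser import RobotFileParser

# The exact rules from the example robots.txt above.
rules = """\
User-agent: Google-Extended
Disallow: /paywall-content/
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Google-Extended is blocked from the paywalled directory...
print(rp.can_fetch("Google-Extended", "https://example.com/paywall-content/article"))  # False
# ...but may use content from every other directory.
print(rp.can_fetch("Google-Extended", "https://example.com/blog/post"))  # True
# Other crawlers have no matching group here, so they default to allowed.
print(rp.can_fetch("Googlebot", "https://example.com/paywall-content/article"))  # True
```

Note that the more specific `Disallow` rule takes precedence over the broad `Allow: /`, which is why only the paywalled directory is excluded.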

Why Did Google Launch Google-Extended?

Google launched Google-Extended in response to the rapidly evolving landscape of generative artificial intelligence (AI), which is a branch of AI that can create new content such as text, images, audio, or video based on existing data. Google’s Bard and Vertex AI are examples of generative AI products that use publicly available data scraped from the web to train their models and generate new content.

However, some web publishers may not want their data to be used for this purpose, as it may raise ethical, legal, or business concerns. For instance, some publishers may worry about the quality, accuracy, or originality of the content generated by AI models using their data. Others may want to protect their intellectual property rights or monetize their content in other ways.

Google-Extended aims to address these concerns by giving web publishers more choice and control over how their data is used for AI training, while still maintaining their visibility and discoverability on Google Search. The tool also reflects Google’s commitment to responsible AI development, which is guided by its established AI principles and consumer privacy policy.

How Can Web Publishers Use Google-Extended?

Web publishers who want to use Google-Extended can simply add the user-agent token and the corresponding rules to their robots.txt file. They can also test their robots.txt file using the robots.txt Tester tool in Google Search Console. Google says that it will respect the rules specified by web publishers using Google-Extended and will not use their data for improving Bard and Vertex AI generative APIs or future AI products.

Web publishers who want more information or updates about Google-Extended can fill out this form to join Google’s AI Web Publisher Controls Mailing List. They can also check out Google’s documentation on how to control crawling and indexing.

Google-Extended is not the only tool that web publishers can use to manage access to their content for AI training purposes. For example, OpenAI, another company that develops generative AI models such as ChatGPT, has its own web crawler called GPTBot that respects the robots.txt rules set by web publishers. Web publishers who do not want their content to be used in future OpenAI models can likewise add rules for the GPTBot user-agent token to their robots.txt file.
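Following the same pattern as the Google-Extended example, a robots.txt entry that blocks OpenAI’s crawler from an entire site would look like this:

User-agent: GPTBot
Disallow: /

To exclude only specific directories instead, replace the `/` in the Disallow line with the directory path, as in the earlier Google-Extended example.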
