OpenAI will respect crawling bans in robots.txt
OpenAI’s crawler will respect a directive in a website’s robots.txt file stating that the crawler is not welcome. OpenAI’s models will then no longer be trained on that site’s texts, although data that was previously collected remains in the models.
OpenAI, the maker of ChatGPT, already had its crawler skip web pages with paywalls, personal information, and content that goes against its terms, but this is the first time the crawler can also be kept away from other content.
Webmasters can add the directive to robots.txt, the text file that is part of web standards and gives instructions to non-human visitors. A common use of the file is telling search engines not to index page content for search results. Now the file can also be used to keep the user agent GPTBot out. Following the instructions is voluntary.
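As a minimal sketch of what such an entry looks like (GPTBot is the user agent name OpenAI documents; User-agent, Allow, and Disallow are standard robots.txt directives, and the directory names below are placeholders), blocking the crawler from an entire site takes two lines:

    # Block GPTBot from the whole site
    User-agent: GPTBot
    Disallow: /

    # Or allow some paths and block others (directory names are placeholders)
    User-agent: GPTBot
    Allow: /public/
    Disallow: /private/

Because following robots.txt is voluntary, these lines keep out cooperative crawlers like GPTBot but do not technically prevent access.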
OpenAI trains its large language models on texts from the internet. Those models form the basis for ChatGPT’s ability to interpret user questions and generate responses. Reddit and Twitter have been critical of OpenAI’s crawling; they consider it unacceptable that money is made from the content on their sites while OpenAI gives nothing in return. In response, they say, they are setting up paywalls and the like. DeviantArt already had its own ‘noai’ flag to keep crawlers away, The Verge notes.
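DeviantArt’s flag works differently from robots.txt: per DeviantArt’s announcement it is an HTML meta directive placed in the page itself rather than a site-wide rule. A hedged sketch of such a tag (noai and noimageai are the values DeviantArt published; honoring them is, as with robots.txt, voluntary on the crawler’s part):

    <!-- Opt this page out of AI text and image training -->
    <meta name="robots" content="noai, noimageai">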