Internet Archive wants to ignore robots.txt to get a more accurate picture
The Internet Archive has announced that it intends to ignore websites' robots.txt files more often in the future, in order to archive them more completely. The organization is already doing this for US government websites and now wants to apply the practice more widely.
The Internet Archive does not specify exactly in which cases the robots.txt file will be ignored, only that it concerns files aimed specifically at search engines. It goes on to say that ignoring the file on government websites “has not caused any problems” and that it “wants to apply the practice more often now.” According to the organization, respecting the file often makes it impossible to archive a website in its entirety, even though that is precisely the purpose of the Internet Archive.
In addition, according to the organization, websites increasingly use the file for SEO purposes and to hide entire domains, for example when a domain is no longer in use. In the past, such a domain then also disappeared from the web archive. The organization says it receives complaints about this almost daily. With the change in policy, the Internet Archive aims to “provide a more accurate picture of the internet from the user’s perspective.”
The robots.txt file has been around since the 1990s and is used to block internet bots, such as web crawlers, from certain parts of a website. In this way, login pages, for example, can be kept out of search results, although listing them in the file also makes them easier to find. It is also possible to block a specific user agent, for example that of the Internet Archive itself. Some organizations, including Google, respect such a file; others don't.
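As a rough illustration of how this works, the sketch below feeds a hypothetical robots.txt (the /login/ path is an assumption, and ia_archiver is the user agent commonly associated with the Internet Archive's crawler) to Python's standard urllib.robotparser module, which answers which crawlers may fetch which URLs.

from urllib import robotparser

# Hypothetical robots.txt: one path blocked for all crawlers,
# and the Internet Archive's crawler blocked from the entire site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login/

User-agent: ia_archiver
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic crawler may fetch the homepage, but not the login page.
print(parser.can_fetch("ExampleBot", "https://example.com/"))        # True
print(parser.can_fetch("ExampleBot", "https://example.com/login/"))  # False

# The Internet Archive's crawler is shut out of the whole site.
print(parser.can_fetch("ia_archiver", "https://example.com/"))       # False

Crawlers that respect the file check it this way before fetching a page; the Internet Archive's announced change means it would more often skip that check.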
The Internet Archive is a non-profit organization founded in 1996 with the goal of providing access to digitized material, including web pages, games and movies. The total size of the collection is now more than 15 petabytes; in 2012 it was 10 petabytes. The organization's web archive is known as the Wayback Machine.