When I first read this I was baffled and thought it was a mistake, but apparently it is for real: Archive.org is no longer respecting the robots.txt file, and in the comments section a "Mark Graham" states the following:
--
"Please know that site owners can always write to info@archive.org and request that content from a site be removed from the Wayback Machine and from future crawling. We process requests like that every day."
Source: Robots.txt meant for search engines don’t work well for web archives
--
The reason I didn't believe the original blog post (or maybe it was wishful thinking) is that I have blocked Archive.org since day #1, BUT then I checked a site and suddenly the whole history was available. So simply blocking Archive.org in the robots file is not enough: they will still crawl the website and just not show it in their database.
Reading further into Mark Graham's comments, he states that Archive.org doesn't even use the "ia_archiver" user agent. A little digging turned up the new user agent: "archive.org_bot".
The Bad Bots thread has been updated to reflect this: https://www.buildersociety.com/threads/block-unwanted-bots-on-apache-nginx-constantly-updated.1898/
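For anyone who wants to see the idea behind a user-agent block spelled out, here is a minimal Python sketch of the matching logic. It is not taken from the Bad Bots thread (which covers Apache and nginx rules); the names `BLOCKED_UA_TOKENS`, `is_blocked`, and `status_for` are made up for illustration, and the token list simply assumes the two user agents mentioned in this post.

```python
# Hypothetical helper: decide whether a request's User-Agent header belongs
# to a crawler we want to refuse. The token list is an assumption built from
# the two user agents named in this post.
BLOCKED_UA_TOKENS = ("archive.org_bot", "ia_archiver")


def is_blocked(user_agent):
    """Return True if the User-Agent contains any blocked token (case-insensitive)."""
    if not user_agent:
        # A missing or empty header is not treated as a blocked bot here.
        return False
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


def status_for(user_agent):
    """Return the HTTP status a server might send: 403 for blocked bots, else 200."""
    return 403 if is_blocked(user_agent) else 200
```

Keep in mind this kind of check only works as long as the crawler actually identifies itself with one of those strings, which is why the old "ia_archiver" entry alone stopped being enough.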