Archive.org Disregarding Robots.txt Block

Tay · Jun 23, 2017

When I first read this I was baffled and thought it was a mistake, but apparently it is for real: Archive.org is no longer respecting the robots.txt file. In the comments section, a "Mark Graham" states the following:

"Please know that site owners can always write to info@archive.org and request that content from a site be removed from the Wayback Machine and from future crawling. We process requests like that every day."

Source: Robots.txt meant for search engines don’t work well for web archives

--

The reason I didn't believe the original blog post (or maybe it was wishful thinking) was that I have blocked Archive.org since day #1, BUT then I checked a site and suddenly the whole history was available. So simply blocking Archive.org is not enough, since they will still crawl the website and just not show it in their database.

Reading Mark Graham's comments further, he states that Archive.org doesn't even use the "ia_archiver" user agent. A little digging found that the new user agent is "archive.org_bot".

The Bad Bots thread has been updated to reflect this: https://www.buildersociety.com/threads/block-unwanted-bots-on-apache-nginx-constantly-updated.1898/
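For anyone wanting to check which user agents their robots.txt rules actually cover, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URL and the blocked_agents helper are made up for illustration. It only shows how a compliant parser would read the rules; as discussed above, Archive.org may not honor them at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that only names the old user agent.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True if the rules disallow `url` for that agent}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(ROBOTS_TXT, ["ia_archiver", "archive.org_bot"])
for agent, blocked in result.items():
    print(f"{agent}: {'blocked' if blocked else 'NOT blocked'}")
```

With rules like these, only "ia_archiver" is disallowed from the homepage, so a bot identifying as "archive.org_bot" falls through to the wildcard rules. Adding a second User-agent block for the new name would close that gap, at least for crawlers that still read the file.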

Robin · Jun 24, 2017

Everything seems fine with my websites (I checked a handful).
I have used robots.txt and in some cases requested a manual removal just to mix it up - apparently, I have also blocked the wrong user agent.

Robots.txt

I am looking forward to seeing if these websites will suddenly appear and, more importantly, if they have actually been archived in the past like you mentioned.

Manual removal request

doublethinker · Jun 26, 2017

The Archiver: "I'm coming in!"
House Owner: "No you're not, you've not been on the guestlist for 20 years!"
The Archiver: "I don't care anymore!"

This is ridiculous, and so is the reasoning that because they are not a search engine (they are) and robots.txt has evolved to be only for search engines (it hasn't), their trespassing is justified.

Granted, I do support their cause, but they should respect the impermanent nature of the internet.

"As we have moved towards broader access it has not caused problems, which we take as a good sign."

Of course, it hasn't caused problems. It's not like you bang down the doors with guns blazing. The selective thinking of this guy, geez.

Thanks for the heads up.

built · Jun 26, 2017

Question: Why would you want to block archive.org?

CCarter · Jun 26, 2017

built said:
Question: Why would you want to block archive.org?

For the same reasons people want to delete their information from the Internet. It could be safety, security, privacy or erasing past mistakes. It wouldn't guarantee you'll get rid of all the data, but it will slow down the exposure, since having an easily accessible "archive" of past web pages could be detrimental to all four.

1. Safety - Some personal information was leaked or didn't belong on the website and it was quickly removed, but there are archives of it because of archive.org and other places.

2. Security - There might be something that was exposed like a security vulnerability that is fixed but having it in a random archive could give potential hackers/crackers a source of research to figure out what type of technology you are running.

3. Privacy - I know this might be completely perplexing to a lot of you younger kids - but there was a time where you didn't put your whole life on the internet so any random person can "Google" you and find photos of you, images of your family & friends, know where you frequent and hang out, know where your kids go to school. I mean shit some of you make it really easy for random people on the earth to look you up without a second thought.

4. Erasing Past Mistakes - something you wrote could be wrong and you want it erased.

5. Protecting Intellectual Property - something you wrote is important and you don't want random other websites having it in their collection. Especially if you delete your website - there is a reason you deleted the website. On top of that, there are people who go around re-creating websites from scratch when they drop, so deleting the content makes it more difficult to pretend to BE YOU AND YOUR FORMER ORGANIZATION - identity theft.

6. Because you can.
