A heads up for the builders.
I'm in the process of doing massive behind-the-scenes work for my case study site here, and part of that is crafting a giant robots.txt.
Building a database'd site with templates means you're going to have one header file being pulled for every page. I hate plugin dependency and don't really trust myself to create a custom set of tables for the purpose of slapping customized meta tags into the header. The best I could do is create a sophisticated if/elseif/else chain to add tags to the right pages. That's dumb to me at this level of the game.
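For what it's worth, the per-page tag logic in a shared header doesn't have to be a long conditional. Here is a minimal Python sketch of the idea, assuming the pages to keep out of the index can be recognized by path prefix. The prefixes and the function name `robots_meta` are invented for illustration, so treat it as one possible shape rather than a finished solution:

```python
# Hypothetical path prefixes that should stay out of the index.
# Swap in whatever sections apply to your own site.
NOINDEX_PREFIXES = ("/tag/", "/search/", "/print/")


def robots_meta(path: str) -> str:
    """Return the robots meta tag for the shared header.

    Pages under a listed prefix get a noindex tag; every other
    page gets an empty string, i.e. no robots tag at all.
    """
    # str.startswith accepts a tuple, so one call covers every prefix.
    if path.startswith(NOINDEX_PREFIXES):
        return '<meta name="robots" content="noindex">'
    return ""
```

The header template would call something like `robots_meta(current_path)` once and print the result, so adding a new section later means editing one tuple instead of another branch.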
So I thought, hey, I'll just tell the search engines not to crawl the pages I don't want indexed, and I'll do so in the robots.txt...
It's not going to cut it. For example:
About.com specifically told Google not to crawl this folder. "Don't crawl it" should mean "don't even look at it." Yet look at this:
Google went ahead and indexed the 2,760 URLs from that folder. They aren't showing the content, titles, descriptions, etc., but they are indexing the URLs.
Do they show up for any legit search terms? Probably not. Could our sites get caught in some stupid Panda crossfire due to something like this? Most likely.
So my heads up is to definitely make sure you use:
<meta name="robots" content="noindex">
If you want to make sure a page isn't indexed. And if you do this, don't block the same pages in robots.txt, or the robots.txt rule will take precedence and Google will never act on the meta tag. The likely reason is simpler than Google disobeying anything: a page blocked in robots.txt is never fetched, so the crawler never sees the noindex tag, yet the bare URL can still get indexed from links pointing at it. That fits the About.com results, where the URLs are indexed with no titles or descriptions.
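If you want to check your own site for this kind of doubling up, one rough way is to run the list of pages carrying a noindex tag against your robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; the function name `doubled_up`, the sample rules, and the example.com URLs are all made up for illustration, and the standard-library parser is only an approximation of how Googlebot itself reads the file:

```python
from urllib.robotparser import RobotFileParser


def doubled_up(robots_txt: str, noindex_urls: list[str],
               agent: str = "Googlebot") -> list[str]:
    """Return the noindex URLs that robots.txt also blocks for `agent`.

    A blocked page is never fetched, so its noindex tag is never seen.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, no network access needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]


# Hypothetical rules and URLs, purely for illustration.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
SAMPLE_NOINDEX = [
    "https://example.com/private/a.html",  # blocked and tagged: doubled up
    "https://example.com/blog/b.html",     # tagged only: fine
]

if __name__ == "__main__":
    for url in doubled_up(SAMPLE_ROBOTS, SAMPLE_NOINDEX):
        print("blocked in robots.txt, noindex will not be seen:", url)
```

Anything this reports is a page where the Disallow rule should probably go so the meta tag can do its job.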
But they do seem to honor the meta tags when the robots.txt isn't getting in the way. They crawl and follow links but at least won't index.
TL;DR
Don't double up duties in the robots.txt and meta robots tags. Use meta tags when you can. Yoast's WordPress plugin makes this easy in WordPress, for instance. Find a solution and do it right, or you'll end up like About.com. Search engines don't have to obey robots.txt or meta tags, so try to figure out which they are choosing to respect and go with that. In this case, Google will respect meta robots tags as far as indexing goes, as long as you don't double up in the robots.txt.