Let me give you the timeline:
For huge content sites, and especially eCommerce sites, a big issue arose when Panda came out: low quality pages sitting in the index. A particular problem was category / author / tag style archives.
Let's say you have 1000 pages on a site. Each post gets 3 tags, has an author or even two, and belongs to one or two categories. Each archive page shows 10 posts. Even if you didn't show author archives and tags, and only had one category, you'd still have 1000 divided by 10... 100 extra pages of category nonsense in the index. In this scenario, realistically you could end up with 300-500 pages of low-quality crap bloating the index: 1000 real pages of content plus another 500 pages of crap. Now imagine if you have sub-categories, with parent categories displaying all of the child posts... You could easily get up to a one-to-one ratio of content to crap.
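If you want to see that math spelled out, here's a back-of-the-napkin sketch using the hypothetical numbers from the example above (they're assumptions for illustration, not measurements from a real site):
Code:
<?php
// Back-of-the-napkin math for the hypothetical 1000-post site above.
// Every number here is an assumption from the example, not real data.
$posts          = 1000; // real content pages
$posts_per_page = 10;   // posts listed per archive page

$listings_per_post = array(
    'categories' => 1, // one category per post
    'tags'       => 3, // three tags per post
    'authors'    => 1, // one author archive per post
);

$archive_pages = 0;
foreach ( $listings_per_post as $per_post ) {
    // Every post shows up $per_post times in this archive type,
    // and every 10 listings creates another paginated URL.
    $archive_pages += ceil( ( $posts * $per_post ) / $posts_per_page );
}

echo "$posts real pages vs. roughly $archive_pages paginated archive pages\n";
// Prints: 1000 real pages vs. roughly 500 paginated archive pages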
What SEOs Used To Do
So what we started doing was setting the sub-pages of these archives to noindex. In terms of WordPress, we're talking about
.com/category/page/2
and beyond, while the first page remains indexed. Problem solved.
Fast forward to late last year / this year, and John Mueller went on record saying something like "If a page is set to noindex long enough, we will stop crawling the links on the page, essentially turning them nofollow." Well, our situation was that we noindexed these pages but they were still set to "follow" by default. This scared everybody, and the Yoast team even freaked out and removed the option altogether from their plugin. Suddenly everyone was getting hundreds of category sub-pages reindexed if they had relied on Yoast to set the meta robots to noindex.
Don't believe it's a problem? Remember recently when Yoast goofed up and set an option in everyone's database permanently to "no," causing thousands upon thousands of WordPress "attachment pages" to get indexed, and people's rankings started tanking almost immediately? You ended up with tons of pages that were nothing but a single image plus your header, footer, and sidebar sitting in the index. Yoast quickly tossed together a plugin that created a sub-sitemap of all of those URLs and returned an HTTP 410 status code for them. It quickly fixed the issue for most sites.
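I'm not going to reproduce Yoast's cleanup plugin here, and the snippet below is not their code, but if you ever had to kill attachment pages yourself, a minimal sketch of the 410 part would look something like this (function name and approach are just my own illustration):
Code:
// Sketch only, not Yoast's actual cleanup plugin: serve a 410 Gone
// for WordPress attachment pages so Google drops them from the index.
function ryu_410_attachment_pages() {
    if ( is_attachment() ) {
        status_header( 410 ); // tell Google the URL is permanently gone
        nocache_headers();

        // Show the theme's 404 template so human visitors still get a real page.
        $template = get_404_template();
        if ( $template ) {
            include $template;
        }
        exit;
    }
}
add_action( 'template_redirect', 'ryu_410_attachment_pages' );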
What SEOs Do Now
That brings us to the present. Largely based on that John Mueller comment, and not knowing what to do without the Yoast option, everyone now recommends leaving these sub-pages of the archives in the index. They think Google will suddenly stop being able to crawl their old posts, not because the pages are noindex but because all the links on them "become" nofollow.
What I'm Recommending
I'm dealing with some indexation issues myself and have spent a decent chunk of time thinking this through. I'm trying to trim my indexation down as much as I can.
Thankfully, Gary Illyes did an AMA on Reddit, which I summarized for you here, and someone asked him about this. He brought sanity back to the game by stating the obvious:
- Google can still crawl old pages because we have sitemaps and these pages are in its index anyway.
- Google will still crawl these noindex pages because our pagination links to them.
- Google will still crawl the "follow" links on these pages because they're set to "follow."
- Mueller probably meant that if a page is orphaned then the links essentially become "nofollow" because Google never crawls the page again.
My conclusion today was that we need to continue following the old path of setting these pages to noindex.
How Do You Get This Done?
If you were depending on Yoast, that option is gone. One thing you can do is use the hook that Yoast still provides for this (sketched below), but I don't recommend it. You never know when Yoast will change or remove it, especially now that they think it's a bad idea.
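For the record, the hook I'm talking about is (if I remember right) Yoast's wpseo_robots filter, which lets you override the robots string their plugin prints. Something along these lines would do it, but again, I wouldn't build on it:
Code:
// Hedged example: force noindex,follow on paginated archives through
// Yoast's wpseo_robots filter. This assumes the filter keeps working
// the way it does today, which is exactly why I don't recommend it.
function ryu_yoast_noindex_paged( $robots ) {
    if ( is_paged() ) {
        return 'noindex,follow';
    }
    return $robots;
}
add_filter( 'wpseo_robots', 'ryu_yoast_noindex_paged' );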
I can't speak to every single CMS out there, but I can tell you how to get it done in WordPress.
Code:
// Set Sub-Pages of Archives to Noindex
function ryu_noindex_paged() {
    if ( is_paged() ) {
        echo '<meta name="robots" content="noindex,follow" />';
    }
}
add_action( 'wp_head', 'ryu_noindex_paged' );
You'll drop this into the functions.php of your theme or child theme, and that's it. You can confirm it's working by checking the source code and looking in the
<head>
of those pages to make sure it's there, and checking other pages to make sure it's not there. "It" being <meta name="robots" content="noindex,follow" />.
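If you'd rather not click around in a browser, a quick-and-dirty check like this works too (the URL is a placeholder; swap in one of your own paginated archive URLs):
Code:
<?php
// Rough check: does page 2 of an archive carry the noindex meta tag?
// The URL below is a placeholder; use one of your own /page/2/ URLs.
$url  = 'https://example.com/category/news/page/2/';
$html = file_get_contents( $url );

$has_noindex = false !== $html
    && false !== stripos( $html, 'name="robots"' )
    && false !== stripos( $html, 'noindex' );

if ( $has_noindex ) {
    echo "noindex found on $url\n";
} else {
    echo "WARNING: no noindex meta tag found on $url\n";
}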
Let me point out that having "follow" in there is not necessary, since it's the default, but anything that doesn't hurt and helps clarify the intent of the code is good. You can remove it (and the comma before it) if you wish and write yourself a nice comment about it.
This is made possible by two features provided by WordPress:
- the is_paged() function
- the wp_head hook
Do You Disagree?
What I want to know is whether you're one of the people out there who disagrees with what I'm saying. I want to hear your perspective and see whether I'm missing a core piece of logic here. Please speak up if that's you so we can all learn more on the topic.
While we're here, let me say: if you haven't done so already, you should definitely check right now and make sure your indexation isn't bloated. Use the new Search Console Coverage Report, do a
site:domain.com
search on Google, check your sitemaps, do a crawl of your site. Anything you can do to glean insight will help you sort it out. Don't trust theme creators, plugin developers, etc., to have this covered properly. Check now, and check every few months. The Coverage Report will have your back from here on out. It'll take minutes to check.
Edit: Let me add, you have to use the meta robots noindex method to drop these pages out of the index and to keep them from entering it. You cannot do this with robots.txt. Putting a noindex directive in robots.txt does not work with Google, and all the normal Disallow directive does is tell Google not to crawl the page; it doesn't tell Google not to index it. You'll end up with empty URLs indexed in Google, because Google can't read the page but will still index the URL, and if it can't read the page it can't see the noindex command in the
<head>.
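One footnote on that: the only other place Google will read a noindex from is the HTTP response itself, via an X-Robots-Tag header. For our paginated archives the meta tag above is all you need, but if you ever have to noindex something where you can't touch the <head> (non-HTML files, a stubborn theme), the header is the equivalent. Here's the header version of the same paginated-archive snippet, purely for illustration; the hook choice and function name are my own, not a standard recipe:
Code:
// Sketch: header-based equivalent of the meta robots tag. Google treats
// "X-Robots-Tag: noindex" the same as a noindex meta tag in the <head>.
function ryu_noindex_paged_header() {
    if ( is_paged() && ! headers_sent() ) {
        header( 'X-Robots-Tag: noindex, follow' );
    }
}
add_action( 'template_redirect', 'ryu_noindex_paged_header' );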