Let me give you the timeline:
For huge content sites, and especially eCommerce sites, a big issue arose when Panda came out: low quality pages sitting in the index. A particular problem was category / author / tag style archives.
Let's say you have 1000 pages on a site. Each post gets 3 tags, has an author or even two, and belongs to one or two categories. Each archive page shows 10 posts. Even if you didn't show author archives and tags, and only had one category, you'd still have 1000 divided by 10... 100 extra pages of category nonsense in the index. In this scenario, realistically you could end up with 300-500 pages of low-quality crap bloating the index: 1000 real pages of content plus another 500 pages of crap. Now imagine if you have sub-categories, with parent categories displaying all of the child posts... You could easily get up to a one-to-one ratio of content to crap.
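If you want to see that math spelled out, here's a back-of-the-napkin sketch using the hypothetical numbers from the example above (they're assumptions for illustration, not measurements from a real site):
Code:
<?php
// Back-of-the-napkin math for the hypothetical 1000-post site above.
// Every number here is an assumption from the example, not real data.
$posts          = 1000; // real content pages
$posts_per_page = 10;   // posts listed per archive page

$listings_per_post = array(
    'categories' => 1, // one category per post
    'tags'       => 3, // three tags per post
    'authors'    => 1, // one author archive per post
);

$archive_pages = 0;
foreach ( $listings_per_post as $per_post ) {
    // Every post shows up $per_post times in this archive type,
    // and every 10 listings creates another paginated URL.
    $archive_pages += ceil( ( $posts * $per_post ) / $posts_per_page );
}

echo "$posts real pages vs. roughly $archive_pages paginated archive pages\n";
// Prints: 1000 real pages vs. roughly 500 paginated archive pages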
What SEOs Used To Do
So what we started doing was setting the sub-pages of these archives to noindex. In terms of WordPress, we're talking about
.com/category/page/2
and beyond, while the first page remains indexed. Problem solved.
Fast forward to late last year / this year, and John Mueller went on record saying something like "If a page is set to noindex long enough, we will stop crawling the links on the page, essentially turning them nofollow." Well, our situation was that we noindexed these pages but they were still set to "follow" by default. This scared everybody, and the Yoast team even freaked out and removed the option altogether from their plugin. Suddenly everyone was getting hundreds of category sub-pages reindexed if they had relied on Yoast to set the meta robots to noindex.
Don't believe it's a problem? Remember recently when Yoast goofed up and set an option in everyone's database permanently to "no," causing thousands upon thousands of WordPress "attachment pages" to get indexed, and people's rankings started tanking almost immediately? You ended up with tons of pages that were nothing but a single image plus your header, footer, and sidebar sitting in the index. Yoast quickly tossed together a plugin that created a sub-sitemap of all of those URLs and returned an HTTP 410 status code for them. It quickly fixed the issue for most sites.
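I'm not going to reproduce Yoast's cleanup plugin here, and the snippet below is not their code, but if you ever had to kill attachment pages yourself, a minimal sketch of the 410 part would look something like this (function name and approach are just my own illustration):
Code:
// Sketch only, not Yoast's actual cleanup plugin: serve a 410 Gone
// for WordPress attachment pages so Google drops them from the index.
function ryu_410_attachment_pages() {
    if ( is_attachment() ) {
        status_header( 410 ); // tell Google the URL is permanently gone
        nocache_headers();

        // Show the theme's 404 template so human visitors still get a real page.
        $template = get_404_template();
        if ( $template ) {
            include $template;
        }
        exit;
    }
}
add_action( 'template_redirect', 'ryu_410_attachment_pages' );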
What SEOs Do Now
That brings us to the present. Largely based on that John Mueller comment, and not knowing what to do without the Yoast option, everyone now recommends leaving these sub-pages of the archives in the index. They think Google will suddenly stop being able to crawl their old posts, not because the pages are noindex but because all the links on them "become" nofollow.
What I'm Recommending
I'm dealing with some indexation issues myself and have spent a decent chunk of time thinking this through. I'm trying to trim my indexation down as much as I can.
Thankfully, Gary Illyes did an AMA on Reddit, which I summarized for you here, and someone asked him about this. He brought sanity back to the game by stating the obvious:
- Google can still crawl old pages because we have sitemaps and these pages are in its index anyway.
- Google will still crawl these noindex pages because our pagination links to them.
- Google will still crawl the "follow" links on these pages because they're set to "follow."
- Mueller probably meant that if a page is orphaned then the links essentially become "nofollow" because Google never crawls the page again.
My conclusion today was that we need to continue following the old path of setting these pages to noindex.
How Do You Get This Done?
If you were depending on Yoast, that option is gone. One thing you can do is use the hook that Yoast still provides for this (sketched below), but I don't recommend it. You never know when Yoast will change or remove it, especially now that they think it's a bad idea.
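For the record, the hook I'm talking about is (if I remember right) Yoast's wpseo_robots filter, which lets you override the robots string their plugin prints. Something along these lines would do it, but again, I wouldn't build on it:
Code:
// Hedged example: force noindex,follow on paginated archives through
// Yoast's wpseo_robots filter. This assumes the filter keeps working
// the way it does today, which is exactly why I don't recommend it.
function ryu_yoast_noindex_paged( $robots ) {
    if ( is_paged() ) {
        return 'noindex,follow';
    }
    return $robots;
}
add_filter( 'wpseo_robots', 'ryu_yoast_noindex_paged' );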
I can't speak to every single CMS out there, but I can tell you how to get it done in WordPress.
Code:
// Set Sub-Pages of Archives to Noindex
function ryu_noindex_paged() {
    if ( is_paged() ) {
        echo '<meta name="robots" content="noindex,follow" />';
    }
}
add_action( 'wp_head', 'ryu_noindex_paged' );
You'll drop this into the functions.php of your theme or child theme, and that's it. You can confirm it's working by checking the source code and looking in the
<head>
of those pages to make sure it's there, and checking other pages to make sure it's not there. "It" being <meta name="robots" content="noindex,follow" />.
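If you'd rather not click around in a browser, a quick-and-dirty check like this works too (the URL is a placeholder; swap in one of your own paginated archive URLs):
Code:
<?php
// Rough check: does page 2 of an archive carry the noindex meta tag?
// The URL below is a placeholder; use one of your own /page/2/ URLs.
$url  = 'https://example.com/category/news/page/2/';
$html = file_get_contents( $url );

$has_noindex = false !== $html
    && false !== stripos( $html, 'name="robots"' )
    && false !== stripos( $html, 'noindex' );

if ( $has_noindex ) {
    echo "noindex found on $url\n";
} else {
    echo "WARNING: no noindex meta tag found on $url\n";
}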
Let me point out that having "follow" in there is not necessary, since it's the default, but anything that doesn't hurt and helps clarify the intent of the code is good. You can remove it (and the comma before it) if you wish and write yourself a nice comment about it.
This is made possible by two features provided by WordPress:
- the is_paged() function
- the wp_head hook
Do You Disagree?
What I want to know is whether you're one of the people out there who disagrees with what I'm saying. I want to hear your perspective and see whether I'm missing a core piece of logic here. Please speak up if that's you so we can all learn more on the topic.
While we're here, let me say: if you haven't done so already, you should definitely check right now and make sure your indexation isn't bloated. Use the new Search Console Coverage Report, do a
site:domain.com
search on Google, check your sitemaps, do a crawl of your site. Anything you can do to glean insight will help you sort it out. Don't trust theme creators, plugin developers, etc., to have this covered properly. Check now, and check every few months. The Coverage Report will have your back from here on out. It'll take minutes to check.
Edit: Let me add, you have to use the meta robots noindex method to drop these pages out of the index and to keep them from entering it. You cannot do this with robots.txt. Putting a noindex directive in robots.txt does not work with Google, and all the normal Disallow directive does is tell Google not to crawl the page; it doesn't tell Google not to index it. You'll end up with empty URLs indexed in Google, because Google can't read the page but will still index the URL, and if it can't read the page it can't see the noindex command in the
<head>.
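One footnote on that: the only other place Google will read a noindex from is the HTTP response itself, via an X-Robots-Tag header. For our paginated archives the meta tag above is all you need, but if you ever have to noindex something where you can't touch the <head> (non-HTML files, a stubborn theme), the header is the equivalent. Here's the header version of the same paginated-archive snippet, purely for illustration; the hook choice and function name are my own, not a standard recipe:
Code:
// Sketch: header-based equivalent of the meta robots tag. Google treats
// "X-Robots-Tag: noindex" the same as a noindex meta tag in the <head>.
function ryu_noindex_paged_header() {
    if ( is_paged() && ! headers_sent() ) {
        header( 'X-Robots-Tag: noindex, follow' );
    }
}
add_action( 'template_redirect', 'ryu_noindex_paged_header' );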