Exp 4 - Technical SEO - The Foundation to Your Site's Success



Technical SEO is the foundation of search engine optimization. I don't mean it is the most basic or easy to grasp concept. I mean that everything builds off of this groundwork.

There's no expectation to get it 100% right, but getting it badly wrong can spell doom for your site. On the lighter end, you can end up with reduced SERP exposure or an algorithmic penalty. On the heavier end, you could end up de-indexed.

Here's an example of getting something seemingly minor wrong and what happens when you fix it. This is one of my own sites:

traffic.png

Take a guess what fixing this easily-overlooked, easily-fixed error did to my revenue?

I'd wager most people don't even realize Technical SEO exists, and most of those who do ignore it because they figure WordPress has it sorted out. Then they install a random theme that hurts them, then they compound the issue with plugins. Then they get confident and mess with their robots.txt and .htaccess files. And since they assume it's already sorted out for them, they never watch the Search Console Coverage Reports that are spitting out errors left and right as their indexation tanks and their impressions and clicks begin to suffer.

That's enough fear mongering for today. It's not an exaggeration, but let's get on to the education side of things...

What is Technical SEO?

Tech SEO revolves around how you control two critical aspects of how a search engine works:
  • Crawling
  • Indexing
For the sake of the newcomers, let me define these two items.

Crawling - This refers to how the search engine's "spiders" (bots that crawl along the web) move throughout your website. Where do they go? What can they see or not see? Discoverability is the word to remember.
Indexing - Spiders take note of everything they're able to find so it can be included in the search results. Once a page is indexed, it's free to appear in the SERPs. That is the process of indexing. But should everything be indexed? And what about things that should be but aren't?

These are the two main items we are concerned with. Everything else is a sub-set of these two things, and usually both at the same time.

Let's cover both of these in order and in greater depth, including the tasks and problems you'll want to consider as you audit your website. That is a term you'll hear tossed around: "auditing your website." It's the act of analyzing how spiders crawl your site, checking what is and isn't indexed, looking at server logs, and so on, in an effort to improve crawling & indexing.

We'll start by doing a quick and very basic overview of the process, dealing only with discoverability and indexing. It's better for you to see the entire process, even if it's super limited, than to get lost in the details for now.

Crawling - Hide & Seek

Maybe you haven't stopped to think how a search engine like Google finds your site and then discovers everything about it. Google drops its spiders out onto the web in various places, like a seed set of authority sites, RSS feed aggregators, ping sites, recent domain registry lists, etc. Then those spiders go out and multiply. If they hit a page with 10 links, they'll spawn off 9 more spiders to follow all the links. They want to discover everything and have become exceedingly good at it.

Guess what? You can do the same thing. The main tool in your SEO Audit arsenal is a crawler: software that sets spiders loose wherever you tell it and crawls based on the parameters you set.
Dedicated crawlers will put you in control of the spiders and give varying amounts of data at different prices (including free). The typical SaaS's now offer crawls too, like Ahrefs, Moz, Sitebulb, SEMRush, etc. These will also try to unearth any problems you might have. But if you don't know what to watch out for, you probably shouldn't lean on those shortcuts; keep reading this guide instead.

Using a crawler is the fastest and easiest (and only sane) way of getting the data you need to fix any problems on your site.

"Hops" during a crawl (leaps to the next page or file) don't only occur using hyperlinks. They occur through embedded absolute URLs (https://...) in the source code and relative URLS (.../picture.jpg). If a link can be found, it will be crawled (unless you direct the spider otherwise, which we cover later).

Typically you'd have varying goals per crawl instead of trying to do everything in one shot. The first and main thing I'd try to uncover is how efficiently I'm using my "crawl budget."

Crawl Budget
A crawl budget refers to how many resources Google will allocate to crawling your website at any given time. Google assigns every website a crawl budget. They say small sites don't need to worry about it. But there also seems to be a correlation between crawl frequency, crawl depth, and positive rankings. So just because you don't "need to worry about it" doesn't mean you shouldn't.

Big sites and eCommerce sites definitely need to worry about this. There's a million ways you can goof this up, like sending spiders into infinite loops, endless dofollow & noindex hops, and through countless amounts of duplicate content like in faceted navigation on an eCommerce store. If this problem exists long enough, Google will reduce your crawl budget, rather than waste a ton of time in your funhouse hall of mirrors.

Small sites can't skip this, though. What happens when spiders unearth large portions of your admin area or start digging through your folder hierarchy and exposing that to the masses in the index?

Your first order of business is to use one of the crawlers above, tell it to ignore external links and images, and set it loose on your site starting at the homepage. Just see what it turns up and make sure it's only what you expect and nothing more. Remember, for this short example we're only worrying about discoverability.

Indexing - Sought & Found

Google's spiders' job is to crawl your site, use it to find other sites, and index everything they're allowed to index. When you run your crawl on your own site, your objective is to determine two things:
  • What is discoverable on my site?
  • What is indexable on my site?
Just because something is discoverable but not indexable doesn't mean all is well. You can still be wasting your crawl budget. The first thing you want to do is figure out which areas of the site are getting crawled that shouldn't be. We'll fix that in a second.

Second, you want to compare the number of unique URLs found in your crawl against what Google is actually indexing. The old way to find out how much of your site Google indexed was to use the site:domain.com search operator and compare that number to what was reported in the old, inaccurate Webmaster Tools. Now that it's Search Console v2.0 with the Coverage Report, you can just look at that, which is far more accurate.

Real Life Example

Here's a real life example that happened to me. I created an affiliate link system using PHP redirects and text files. I could build a link like this: domain.com/walmart/automotive/product-name. So in this made-up case, I have a folder at the root of my site called walmart that contains various folders, one of them called automotive. All of these folders contain .htaccess files to make sure everything in them is set to noindex.

Inside the automotive folder, there is the .htaccess file and a .txt file. The text file contains lines like this: product-name,https://walmart.com/product-name?aff=myID. This level of folders also contains a PHP file that creates a redirect. It scans the text file for product-name, matches the start of the line, and redirects to the URL after the comma.

So note: all of these folders are noindex, and hitting a folder directly 302 redirects to the homepage, while a correct product URL 302 redirects to the product page at the respective store. Also, each redirection tosses up HTTP headers that reiterate the noindex directive and also throw up a nofollow for all links in the redirect.
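To make that concrete, here's a rough sketch of what that kind of redirect script could look like. The file name (links.txt) and the slug parameter are placeholders for illustration, not my actual setup:

    <?php
    // Illustrative sketch only. Assumes an .htaccess rewrite passes the requested
    // product slug in as ?slug=product-name (parameter name is a placeholder).
    $slug  = basename($_GET['slug'] ?? '');
    $lines = file(__DIR__ . '/links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    // Ask spiders not to index the redirect and not to follow anything from it.
    header('X-Robots-Tag: noindex, nofollow');

    foreach ((array) $lines as $line) {
        $parts = explode(',', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) === $slug) {
            header('Location: ' . trim($parts[1]), true, 302); // matched product -> affiliate URL
            exit;
        }
    }

    header('Location: /', true, 302); // no match -> 302 to the homepage, as described above
    exit;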

So what's the problem? None of the folders or redirections should get indexed, and all of the links pointing into them should be nofollow.

Two things happened:
  1. Turns out, the nofollow HTTP header directive only applies to links on the page. Since there are no links on the page (because there is no page, only a redirection) these ended up being dofollow links. So now I have "incentivized links" on my site (aka paid links).
  2. At some point I said "why even let Google crawl these folders" and used the robots.txt to block crawling to anything in the /walmart/ folder.
#2 was my fatal mistake. Before long, Google had indexed almost 800 blank "pages," one for each affiliate link. There was no title tag, no meta description, no content. Google couldn't read any of it anyways. So they used the anchor texts of the links as the title tags and said "there's no meta description due to robots.txt" in the SERPs.
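For reference, the kind of rule I'm talking about is as simple as this in robots.txt (using the made-up folder name from above; the exact rule could vary, but this is the idea):

    User-agent: *
    Disallow: /walmart/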

Note that my site had about 300 pages of real content and now had an additional 800 blank pages indexed. Guess what happened? PANDA. My site quality score went down and down, because Panda analyzes the cumulative quality of only what's in the index. And I had roughly an 8:3 ratio of literally "nothing" pages to good pages in the index. This is why indexation matters.

It took me around 6 months to get that crap out of the index once I removed the robots.txt disallow directive to that folder. Even with tricks like adding sitemaps with just those links, it took forever.

#1 ended up being an issue too. Fortuitously, around that time John Mueller came out and said the "nofollow HTTP header" only applies to links on the page. If it's a redirect, there are no links, thus the HTTP header is ignored. This meant I had to go through my entire site and add nofollow rel tags to all of these hyperlinks... manually. I hadn't added them before because the HTTP header should have covered it.

So not only did I have a bunch of junk indexed (because Google could NOT crawl the affiliate link folder, which meant they could NOT see the noindex directive in the HTTP header), I was also wasting my crawl budget once they were allowed to crawl again, because the links weren't explicitly nofollow'd on the page.

Panda is supposedly now the Panda Everflux, running every 30 days. That may be true, but do you really think they're refreshing the Panda data every 30 days? I think not. If you tank your site like this, you have to find the problem, fix it, and wait while your income takes a hit until you get a true Panda refresh.

This is why crawling & indexing matter.



In the above real life example, these problems were only surfaced thanks to a crawl and looking at my indexation in Search Console's Coverage Report.

My site is in a much better place now, because not only did I clean up the indexation even further (did some content pruning) but my crawl budget is even better than it was before the goof up thanks to placing explicit nofollows on the links.

All of this raises two questions:
  • "What other problems might we surface during a technical SEO audit?"
  • "How do we control crawling and indexing?"
Let me preface this portion by saying that our goal in the Digital Strategy Crash Course has never been to tell you what to do or what to think, only HOW to think.

Therefore, I won't be walking you through an entire audit. That's pretty much impossible to do. It could be an entire book, which is why all the blog posts about this topic suck.

What I will do is give you a common list of items to look for, where some are more critical than others. Then I want to give you the rundown on .htaccess and robots.txt.

And finally I want to give you a few more scenarios of things to think about and show you how you'd go about fixing them. I want to show you how to think about these situations so you can figure out the solutions to unique problems as you encounter them.

Common Problems To Look For in an Audit
  • Broken Images
  • Broken YouTube Embeds
  • Broken External Links
  • Broken CSS & JS Files
  • Dofollow Affiliate Links
  • Problems with Canonicals
  • Bad Schema
  • Problems with Hreflang
  • Multiple Versions of Site Accessible (http vs https, www vs non-www)
  • Mixed Content Errors (http resources on an https site)
  • Duplicate Meta Titles
  • Duplicate Meta Descriptions
  • Duplicate Pages
  • Search Results Pages Indexable
  • Slow Page Speed (Use Day 27)
  • Missing Robots.txt
  • Missing Sitemaps
  • Redirect Chains
  • Missing Image Alt Texts
  • Problems with Faceted Navigation
  • Etc.
Some audit checklists will even take you through on-page SEO stuff like missing H1 tags. I'd leave that to a separate on-page focused crawl.

All of the problems listed above are easily fixable, and you should look at them all, perhaps once a year if you don't change your theme or architecture. Being easy to fix doesn't mean they aren't important. Many of them, I'm sure, feed into trust and quality scores, even though they seem simple and commonplace. They're becoming easier to get right and easier to get wrong thanks to the proliferation of CMS's.

Remember, the foundation to SEO is crawling & indexing. Google wants to crawl fast, find no broken hops, index everything they find, and not crawl what shouldn't be indexed. If you can nail that, you're doing better than most, even though it sounds simple.

.htaccess & robots.txt & meta directives

These files and directives are where you will get yourself into the most and biggest trouble. They have multiple reasons for existing, but let's keep it narrowed to crawling & indexing.

Server Level - .htaccess
.htaccess is a hidden dot-file that can exist in any folder of your site, but you'll mainly only use the one at the root of your site in the public_html folder. For lack of better terms, think of this as a "server level" file, while robots.txt should be thought of as a "site level" file. Think of meta directives as "page level" requests for the spiders to consider. No spider has to obey any of these except the .htaccess rules.

The .htaccess file is loaded, parsed, and read on every page load, whether you're using caching or not. So it makes sense to keep it small. It will contain info about your preferred version (www or non-www), your security level (redirects to https://), your 301 redirects, blocking crawler bots, limiting access to your site or sub-folders by IP address, browser caching, and much more. For the most part, you shouldn't mess with this unless you really know what you're doing. Your CMS should have you taken care of here, and any serious plugin that touches it will do so correctly, like any of the main server caching plugins.
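As an illustration of the kind of rules that live there, here's a common "preferred version" sketch using Apache's mod_rewrite (the domain is a placeholder, and your CMS or host may already handle this for you):

    RewriteEngine On

    # Force HTTPS
    RewriteCond %{HTTPS} off
    RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

    # Force the www version of the domain
    RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
    RewriteRule ^(.*)$ https://www.domain.com%{REQUEST_URI} [L,R=301]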

The .htaccess file is NOT the place to create a solution for 99% of your problems.​

For limiting and controlling crawling, you have two main methods:
  • robots.txt
  • rel="nofollow"
Site Level - robots.txt
This is the file to use if you want to block off entire portions of your site from being crawled by spiders while keeping them accessible to humans. But big problems can arise from doing this. Robots.txt should never be used to block crawling of pages that might otherwise get indexed. If you block them, Google can't determine whether they should be indexed or not and will go ahead and index them, because remember, their goal is to index everything they can discover unless you tell them not to index it. If you tell them but they can't read the directions, then it's getting indexed.

The robots.txt file is NOT the place to create a solution for 99% of your problems.​

Page Level - nofollow & meta directives
Most pages you don't want crawled, like your admin area, aren't crawl-able anyways because they're password protected and you're not linking to your login page on most sites. 99% of the time the problems with crawling & indexing are on the public facing side of your site, not your folders or your admin area. And you'll want to deal with those problems at the page level.

If you do not want spiders to crawl a link, simply add rel="nofollow" to the anchor tag on the page. This will also restrict the flow of page rank, so don't try to save crawl budget by nofollow-ing your About page link in your footer. You'll waste page rank and hurt your rankings. Let them crawl it (they'll tag the link once they've seen it and not crawl it a million times), and interlink to other pages you want to rank instead.

Example: I have social share buttons on my pages. They link out to the API urls for Facebook, Twitter, (RIP Google+), Reddit, etc. Not only do they lead nowhere useful for Google, I don't endorse those pages as editorial links, and I don't want to waste my crawl budget on them or have Google trust my crawls less because they land on trash. So the solution was to edit my PHP templates so all of those links have a rel="nofollow" attribute. Yes, I lose page rank, but I get shares.​
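Purely as an illustration (the markup and URL here are placeholders, not my actual template), the edited share link ends up looking something like this:

    <a href="https://twitter.com/intent/tweet?url=https://domain.com/my-post/" rel="nofollow">Share on Twitter</a>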
Maybe you want the cleanest indexation ever. Your spider (and Google's) finds hundreds of paginated pages like your categories or author archives. They all have the same title tags and meta descriptions, and have zero unique content on them besides post titles and excerpts. Even though Google has said it's okay to have them in the index, you want them gone, because you don't believe everything Google says.

The problem here is that you want them crawl-able. You want them to be spider'd so your posts are continually re-crawled. It's good to have this even though you submitted a sitemap to Search Console too. How do you reconcile having them crawled but not indexed?

Solution: HTML gives you the <meta> tag and various ways to use it. This goes in the <head> in your source code and is not visible on the page to a user, but is visible to a spider. It gives them various information about your page and directions on how to crawl. One of these is <meta name="robots" content="noindex" />. This tells the spider/robot not to index the page. And since there is not a "nofollow" in the comma separated content attribute, the links remain "dofollow."​
So now Google will crawl your paginated pages but not index them! Of course, I didn't mention the PHP magic you'll need to make sure it only happens on paginated pages and not the first page of the sequence (you want page/1/ of your categories indexed!). If someone asks I can help out with that.

Side Note - URL Parameters
Let me say, for the eCommerce guys, using Search Console's URL Parameter directives is a must. It's easier than trying to cover it with Regex in .htaccess (which is bad for indexing reasons) and robots.txt. You can tell Google exactly which parameters to ignore, which will save your butt in your faceted navigation.
 


Rather than banging your head against the wall and Googling for hours trying to find out exactly what to do for a problem you run into, you can think it through logically. If you can define the problem and understand the effect it's causing, then the solution tends to present itself. Let's look at how that's done.

Some Basic Scenarios You Could Encounter

I said above I wanted to lay out some common scenarios that you could find yourself in, depending on the work you've done, the CMS you're using, the client's site you're working on, etc. I want to lay out the problems and then think it through to the solutions.

I already talked about the problem of having affiliate redirects indexed. I also mentioned a problem above with pagination. Let's cover that again since it's more mundane, then jump to more.

Paginated Pages Being Indexed

Problem:

You've got 1000's of paginated pages on all of your categories, author archives, date archives, and tags. You don't want anything but page 1 of the categories and authors indexed, and you don't want the rest crawled by spiders but still accessible by humans.

Think It Through:
I need to figure out how to place a <meta name="robots" content="noindex" /> on ALL of the tags and date archives pages. I need the same on all pages EXCEPT page 1 on the categories and author archives. Then I need to stop spiders from entering tags or date pages.

Solution:
For tags and dates, I'll create a function in my theme's functions.php that hooks into wp_head() with an if statement that says "if this is a tag or date archive, insert the meta tag into the <head>." I'll do the exact same for pages 2 through infinity of the categories and author archives, but I'll check for the "paged" state WordPress uses to indicate we're paginated past page 1. To stop spiders from entering the tag or date archives, I'll edit those widgets or whatever (or use a function if one exists) to add rel="nofollow" to all the links they generate on the widgets and anywhere else on the site. Really, I'm better off removing tags and date archives altogether.
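Here's a rough sketch of that functions.php approach, assuming a standard WordPress theme. The function name is just a placeholder; is_tag(), is_date(), is_category(), is_author(), and is_paged() are WordPress's own conditionals:

    <?php
    // Noindex all tag and date archives, plus page 2+ of category and author archives.
    function my_noindex_thin_archives() {
        if ( is_tag() || is_date() || ( ( is_category() || is_author() ) && is_paged() ) ) {
            echo '<meta name="robots" content="noindex" />' . "\n";
        }
    }
    add_action( 'wp_head', 'my_noindex_thin_archives' );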

Search Pages Getting Indexed

Problem:

Chinese spammers realized my search pages were indexable and are building links using the ?s="search-term" at random, causing my indexation to spike like crazy.

Think It Through:
I want my search pages crawlable just to help with discoverability and re-crawls of existing pages. But I need them to not be indexable.

Solution:
I'll simply use an if statement in a function in functions.php or an if statement in the <head> of the header.php to add our meta tag only on search results pages.
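A minimal sketch of the header.php variant, assuming a standard WordPress theme where is_search() is available, would be something like this inside the <head>:

    <?php if ( is_search() ) : ?>
        <!-- Only output on search results pages -->
        <meta name="robots" content="noindex" />
    <?php endif; ?>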

Bots Crawling For Images

Problem:

Image scrapers and various other spiders are eating up my server bandwidth by crawling and downloading my entire /wp-content/uploads/ folder non-stop. It's causing 503 HTTP errors when users, including Google, try to access the site.

Think It Through:
This isn't a page so I can't use a page-level solution. I should try a site-level solution and then jump up to a server level solution if I have to.

Solution:
In robots.txt I can disallow all bots from crawling that folder, then specifically just allow Google's Image Spider. But bad bots aren't going to respect that directive, so I'll need to use .htaccess. I can try to capture all the bot IP addresses and crawl-delay them or block them from the site altogether, but they'll have a huge IP pool, so I can try instead to catch their user-agents and redirect them somewhere else or block them. But they're using Safari and Firefox... So what I can do instead is block everyone from seeing the contents of these folders using the .htaccess code for that, or dropping an index.html into the folder. Now at least the bad bots have to crawl my site page by page to get the images, which will slow them down and slow down the bandwidth usage. And this is how Google functions anyways.
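For the well-behaved bots, the robots.txt portion of that might look something like this (a sketch, not a drop-in config):

    User-agent: *
    Disallow: /wp-content/uploads/

    User-agent: Googlebot-Image
    Allow: /wp-content/uploads/

And the .htaccess directive that stops the server from listing folder contents is:

    Options -Indexes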

Broken Links Accumulating

Problem:

Over time, tons of those thousands of links you placed in your posts are now broken, return some HTTP status code that's not 200, or have turned into spam sites. Google is losing trust in your outbound links, which is causing you relevancy problems. Your crawl budget is undoubtedly hurt, too. Even half the YouTube embeds are broken now.

Think It Through:
Obviously I don't want to link out to someone's PBN or adult site. Tons of other sites changed to HTTPS now and I'm seeing a lot of 301 chains with several hops, enough that Google won't crawl through them all. I re-categorized a lot of my own posts and now I have internal links going through 301's too.

Solution:
Sort your crawl list of links by "location page" and then by HTTP status code. First fix all of the 404 errors. Find replacement pages or remove the links altogether. Next go through your 200 codes and look at the title tags of the destination page and make sure they're still related to your site and haven't become a pharma or casino site. Now I'd look at the 301 redirects and swap them to the destination page, especially if it's a redirect from HTTP to HTTPS so Google knows your users always stay secure. This will take forever but if you do it once and then crawl once a month you can stay on top of it.

Mixed Content Errors

Problem:

You switched your site to be SSL compliant on the HTTPS protocol. You're noticing some pages are showing warnings and possibly not even loading your page, warning of "mixed content." This means that some assets on the page are still being loaded on HTTP, which means the page isn't secure. It's either secure or not. There's no such thing as "99% secure."

Think It Through:
Chrome won't even load these pages and my traffic is tanking. I have to figure out what's going on. The answer is that at least one asset on that page (and hopefully not globally across the whole site) is loading over HTTP.

Solution:
If you can crawl that page only and get a list of every asset's URL on the page, you can see which start with http:// and swap them to https://. Usually it's something hardcoded in the post's content or in a template somewhere. That could be a hot-linked image, for instance. Fixing it is easy; locating them is the harder part. Another solution is to open up your browser's web developer tools and look at the console errors, which may list the problematic file paths.
 


I'll be frank. Most technical SEO problems are for the obsessive; Google likely ignores most of it, and it doesn't weigh strongly into the algorithm, if at all. But there are a few big items that can really screw you up or really help you out on your path to success.

Four More of the Most Important Technical SEO Items

In addition to indexation and crawling, there are four more items that I'd tag as the most important concerns for Tech SEO. Those are Sitemaps, Indexation Bloat & Quality, Responsive Design Errors, and Slow Site Speed.

Sitemap
First and foremost, you need to have one. No, not an HTML sitemap. Your best move is to have an auto-generated XML sitemap. Plugins like Yoast can handle this for you. They'll automatically add new posts to it, list the last-updated time stamp for Google, and more. They'll even split it into several sitemaps so you can stay organized and not have too many URLs in one sitemap.

The reason this matters is Google wants to do two things with your site. They want to crawl and index new content and they want to update older content when there's a change. If they have to actually crawl your whole site and do comparisons of cached posts versus new versions, that's wasted crawl budget and calculation time on their end. They'll assign you less frequent crawling and less crawling per crawl event.

An auto-generated sitemap tells them when you have new posts and when old ones are updated. Problem solved. I believe 100% that if you do Google favors like this, they'll see it as a trust and quality signal.

Once you've created this sitemap, you want to do two things. First, drop a line like this in your robots.txt file so spiders can find it: Sitemap: https://domain.com/sitemap.xml. Second, submit it directly to Google Search Console. This has huge extra benefits that'll become obvious once you submit it.

This is your direct line of communication with Google. Use it. It's how you get indexed within seconds sometimes.

Indexation Bloat & Quality
I've already talked about the dangers of indexation bloat above, but I want to talk about indexation quality. Let's assume you have zero bloat and a 100% indexation rate. That's great. But you still may have a quality score of 60%. Think of that as a way of punishing you and only giving you 60% of the ranking power you actually have earned.

The problem is that Google may be deeming a lot of what they're indexing as low quality. Your best moves are to either:
  • Improve the content quality
  • De-index the content (but not delete)
  • Delete the content (and thus de-index)
  • 301 redirect it to a better page
Again, this is back to doing Google the favor of not filling up their index with hot garbage. Do them the favor and they'll do you a solid in return. You will absolutely see a traffic benefit by improving your sitewide Panda quality score.

So the question becomes one of determining what is low quality and what isn't. Usually that has to do with content length. But at the same time, a 100 word page could be the perfect high quality result for a "definition" intent SERP. So you have to look at intent too. The truth is there's no hard answer; it's different for each site. Think of things like:
  • How much content is on the page, and does it add value to the conversation?
  • How much organic traffic does the page receive from Google (do they value it?)
  • How many links and social shares does the page have (do your users value it?)
  • Is it too closely related to better pages on my site?
That's basically it. If you determine the page is low quality, do one of the four options above. Unless you want this to take forever, you'll probably either delete it or 301 it to a better and tightly related page.

If you delete a bunch of content, a tip is to keep a list of the URLs and then create a temporary sitemap of the 404'd pages and submit that to Search Console to get them all re-crawled and quickly de-indexed. Remember to delete the sitemap so Google doesn't lose trust in your other sitemaps.
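As a sketch, such a temporary sitemap is just a bare-bones XML file like this (the URLs are placeholders for the pages you deleted):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://domain.com/deleted-post-1/</loc></url>
      <url><loc>https://domain.com/deleted-post-2/</loc></url>
    </urlset>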

You'll wait somewhere on the order of 6 months before you see results. Your pages have to be re-crawled, then the offline calculation for quality has to be performed for your whole domain, then you have to wait on a Panda refresh when that data is rolled out.

It's worth it, but don't assume you have this problem. You will know if you've been publishing complete trash or not. Don't lie to yourself about it, but don't be paranoid and start deleting good content either. You can seriously harm yourself doing this.

Responsive Design Errors
Responsive design is non-negotiable now. Google is a mobile-first index already. This section is short. All you need to do is log into Search Console after you've had it for a month or so and it will tell you if there are mobile design errors.

Let me state up front that their mobile design test is super broken. It often never loads the CSS file and thus thinks there are a ton of problems. Then later it'll load the CSS and remove all the errors. So don't get confused and hire some expensive designer. Make sure those errors are legit. You can run Google's standalone mobile-friendly test to find the true results. That one is not broken.

The errors you'll see boil down to:
  1. Links & Buttons are too small and too close together
  2. The text is too small to be read comfortably
  3. Elements on the page extend off the screen
The third item means that mobile users on smart phones and tablets have horizontal scroll bars showing up. All of these are CSS problems with media queries, which you can learn how to deal with on the Exp 1 day or hire someone to fix it.

Having these problems can mean you rank lower in the main Google index. You may still rank okay in the desktop index, but I'd assume they'll phase that out entirely one day.

Slow Site Speed
I absolutely can't go into this again, since I wrote an entire day on site speed optimization here in the crash course. What it boils down to, though, is that since everything is mobile-first these days, and mobile connections are slower (even when on WiFi), your payloads need to be really light or you risk being de-ranked in the mobile-first index as a lower quality result. It's well worth doing. You'll get a lot more long-tail traffic, your users will stick around longer, and you'll benefit even more from those metrics.

____________

And that's the end of this expansion pack. I tried to write this in a surface level, plain-language manner so everyone can understand it. But to really dig into the topic requires a lot of technical knowledge, methodology, and an understanding of jargon. I'm happy to dig as deep as you want, just ask or comment.
 
Example: I have social share buttons on my pages. They link out to the API urls for Facebook, Twitter, (RIP Google+), Reddit, etc. Not only do they lead nowhere useful for Google, I don't endorse those pages as editorial links, and I don't want to waste my crawl budget on them or have Google trust my crawls less because they land on trash. So the solution was to edit my PHP templates so all of those links have a rel="nofollow" attribute. Yes, I lose page rank, but I get shares.

Why does this lose overall page rank?
Hypothetically, if there are 9 DF (dofollow) links and 1 NF (nofollow) link on a page with PR 100, then each DF would obtain a juice of 11.111 rather than 10.

But it seems like you're saying each DF would receive 10.

Can you clarify what you mean by saying you lose page rank by using NF?
 
@CashCowAdv, the aspect to realize is that Nofollow links and Dofollow links are treated identically when it comes to how much page rank flows out of a page, as far as we know based on what Matt Cutts told us in his blog post on PageRank Sculpting.

Let's start at the beginning. Some time before 2009, you could get away with PageRank sculpting, meaning you could direct the flow of page rank juice where you wanted it to go. This was a problem for the algorithm.

This was done by using nofollow attributes on places like sitewide links in your footer to your About or Contact or Terms of Service pages. They don't need page rank, so why flow it in their direction?

Google didn't like this manipulation so they changed the algorithm. There are two things to understand about Nofollow links:
  • A Nofollow link on a page WILL send page rank out from it.
  • A page being linked from a Nofollow link WILL NOT receive page rank from that link.
So the question becomes "where does that page rank go?" It goes nowhere. It's simply lost to the aether in between web pages. It's gone.

Think of a link as a wormhole between two universes (two web pages). PageRank is the space ship traveling through that wormhole, but in the case of a Nofollow link, there's some kind of opening in the side of the wormhole that sucks the PageRank ship out of it like a vacuum. It gets sucked into hyperspace and is gone into the multiverse. It never makes it to the destination page.

That's how Nofollow links work now. They bleed page rank juice out, but it never makes it to the destination page. It's wasted. This is how they chose to stop PageRank Sculpting.

The best assumption we can make with the highest confidence is that each Nofollow link "bleeds out" the same amount of page rank juice as a Dofollow link.

So in the old days, if a page had 10 Nofollow links and 10 Dofollow links, only the Dofollow links would send out page rank juice at 10% each.

But in the present, as far as page rank is concerned, there are 20 total links on the page. It doesn't care which is Dofollow or Nofollow. Each flows out 5% of the juice. If it happens to flow out of a Nofollow link it's wasted.
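To put numbers on your example of 9 dofollow and 1 nofollow link on a page with PR 100: under the old model the nofollow link was ignored, so each dofollow link passed 100 / 9 ≈ 11.1. Under the current model all 10 links count, so each link is allotted 100 / 10 = 10. Each dofollow link passes along its 10, and the nofollow link's 10 simply evaporates.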

But, and this is a big 'but', my assumption is that, while Nofollow links do bleed out page rank, it's not the same amount as a Dofollow link. Maybe it's half as much. I don't know.

The reason I assume this is because I'm also 100% confident that links within the main content of a web page flow out more page rank juice than links in the supplementary or navigational content. That means links within the article paragraphs themselves send out more page rank juice than links in the sidebar, header menu, or footer menu.

How I know this, I'd rather not say, but I got it direct from Google.

So it only makes sense they'd apply this type of calculation to Nofollow links as well and not tell us.

So to summarize, we know for a fact (if we assume Matt Cutts didn't lie and what he said is still true and not changed) that Nofollow links do bleed out page rank. What we don't know for sure but I assume is the case, is that Nofollow links bleed out less page rank than Dofollow links, and the amount will vary depending on the location of the link on the page (main content, supplementary content, navigational content).
 