PDFs as thin Content

Interested to hear people's take on this.

I am working on a large site that has a ton of PDF files that are linked to as useful content -

eg. Click here to download the blue widget guide ->> mysite.com/bluewidgetguide.pdf

I notice that a ton of these PDF-only pages are being included in Google's index.

Would you noindex them?

Surely they are just cluttering up the index and wasting crawl budget that could be spent on useful content pages.
 
Purely for SEO maybe. I think you can put links in your PDFs as well tho, so you can still retain your user.

To give a meaningful answer I think we need a bit more context on why you are writing the guides in PDF and not just as an article.

If you want brand awareness, PDFs can be fine and might be better. If they download the file and share it around with coworkers/friends because in the niche it is more natural to do it that way (I can't think of an example, but you know, why not? Maybe they print it out in schools or something) it can give you more brand awareness.

In that case, it doesn't really matter that they never reach your website. The "cluttering" in the google index doesn't matter because you reached your goal.

If you want them on your site, then I would personally lean to no-indexing, but if you want them to just go and download the pdf anyway I don't see the point.

The way I interpret your situation is that the website has helpful articles that have some ads or affiliate links for monetization, and it uses PDFs as extra information for the reader.
If I interpreted your case correctly, I would convert the PDFs to articles in the long run, and right now de-index them.
 
Pretty much my train of thought. They are supplemental to the main content and do not warrant their own indexing in my opinion.

Cheers
 
At this point, I'd think of a PDF the same way I would any other indexable page, only these aren't formatted with HTML and CSS. What I mean is that if the quality of the PDF "page" is high, then it won't be thin content (and almost certainly not thin content in the sense of mismatching user intent, like turning a page optimized for an informational query into a sales page).

As you guys mentioned above, if they can't match the level of quality in your normal posts, then I wouldn't have them indexable. Ultimately I'd rather take the content and use it on a "real" page anyways, create a lander out of the PDF content, or whatever makes sense for the purpose of the PDF. But I'd rather it be on a page than in a file, even if they're treated similarly.
 
Cool thread. On one of my sites I have some PDFs I made that users could print out and refer to (better user experience, even if offline) hoping the link would maybe be shared etc. I just searched and saw they've gained no links though. They are exact copies of some of my content. Time to noindex them I think.

So how do you noindex a PDF?

To answer my own question:
Code:
    # Goes inside the server block; matches any URI ending in .pdf, case-insensitively
    location ~* \.pdf$ {
        add_header X-Robots-Tag "noindex, noarchive, nosnippet";
    }
 
@Darth, you beat me to it but I'm going to post this anyways for .htaccess on Apache servers. The problem is that a PDF isn't an HTML page, so there is nowhere to put a robots meta tag. Instead you need to send the X-Robots-Tag HTTP header on the file request:

Code:
# Requires mod_headers to be enabled
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

For anyone looking to do this, do NOT get clever and also block Googlebot from crawling these files in robots.txt, or it will never fetch them and never see the noindex directive. You'll end up with blank, empty entries being indexed. Just let them be crawled and accessed, but not indexed.
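
If you want to sanity check that combination before relying on it, something like the following Python sketch could help. It is only an illustration: the function names are made up for this post, it works on a robots.txt string and a dict of response headers that you fetch yourself, and Python's standard robots.txt parser does plain prefix matching, so it will not interpret Google-style `*` or `$` wildcard rules. It also ignores X-Robots-Tag values scoped to a specific bot (e.g. "googlebot: noindex").

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if this robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def has_noindex(headers: dict) -> bool:
    """Return True if an X-Robots-Tag header carries a noindex directive."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = [d.strip().lower() for d in value.split(",")]
            if "noindex" in directives:
                return True
    return False


def noindex_will_be_seen(robots_txt: str, url: str, headers: dict) -> bool:
    """The header only helps if the crawler is allowed to fetch the file."""
    return crawl_allowed(robots_txt, url, "Googlebot") and has_noindex(headers)


if __name__ == "__main__":
    pdf_url = "https://mysite.com/bluewidgetguide.pdf"
    pdf_headers = {"X-Robots-Tag": "noindex, noarchive, nosnippet"}

    # Crawlable and tagged: the directive can be read.
    print(noindex_will_be_seen("User-agent: *\nDisallow: /private/", pdf_url, pdf_headers))

    # Blocked in robots.txt: the header is never fetched, so it cannot help.
    print(noindex_will_be_seen("User-agent: *\nDisallow: /bluewidgetguide.pdf", pdf_url, pdf_headers))
```

The point of keeping the two checks separate is that each failure has a different fix: a missing header is a server config problem, while a blocked URL is a robots.txt problem.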
 
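@Darth if those PDFs are identical to your pages and might earn a natural link some day, the best approach would be a canonical sent as an HTTP header. There is no "X-Canonical" header as such; it is the standard Link header with rel="canonical", in the form `Link: <URL of the HTML version>; rel="canonical"`, sent on the PDF response.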

canon-header-2.png
 