Leaked Internal Google Engineering Documentation

This is interesting, like the Yandex leak. The only issue is that these 14,000+ data points aren't all ranking factors, so there's a lot to wade through. They also don't include the weightings used in the algorithm, so you can't tell how important any one factor is.

Just like the Department of Justice investigation is showing with regard to advertisers, this shows that Google has repeatedly lied to publishers about how their algorithms work. Some of these lies include:
  • They don't use SERP CTR to re-rank pages.
  • They don't treat sub-domains as separate domains.
  • There's no such thing as a ranking sandbox.
  • They don't consider the age of a domain in ranking.
  • There's no such thing as a site-wide authority score.
I'm trying to go through all of it now but I have to step away, so I'll go ahead and press "post". The Sparktoro Fishkin article is pretty interesting, especially about how the PageRank toolbar and later Chrome existed only to gather clickstream data from the net-at-large.
 
The NavBoost (clickstream data) is an interesting one that stood out both in this leak and the DOJ trial.

From Rand’s post: “The power of Navboost and the intent of users is likely the most powerful ranking factor in Google’s systems.”

It's been clear for some time that this is the case, but I'm still shocked that they place so much weight on these signals, because surely they're very susceptible to being gamed?

Has anyone here ever run an experiment around the use of click farms or similar to fake those signals?

I’m thinking specifically:
  • Searches / clicks on your brand name
  • Non-brand searches / clicks making it appear your site has been ‘sought out’ in the SERPs
  • Long dwell time on site after clicking
Typically I’ve always erred towards sustainable methods, but if they’re going to abandon all measurement / rewarding of content quality in favour of CTR then I think it deserves to be gamed.
 
I have a love-hate relationship with ChatGPT. I don't have time to sift through this so I asked ChatGPT for the major takeaways in terms of SEO strategy.

Here's what it gathered from the Sparktoro article:

<gpt>Based on the insights from the leaked Google documents, here are ten important changes an SEO practitioner might consider to enhance their strategy:
  1. Enhance User Engagement: Prioritize metrics like click-through rates and time on site to improve rankings.
  2. Focus on User Experience: Optimize site layout, speed, and navigation to boost user satisfaction and interaction.
  3. Content Quality: Create high-quality, engaging content that addresses user intent more precisely.
  4. Diverse Metrics: Incorporate a broader set of analytics to understand and improve user interactions.
  5. Adaptive Content Strategies: Continuously update content to remain relevant and engaging for users.
  6. Privacy and Data Usage: Be transparent about data collection and usage to build trust.
  7. Bias Understanding: Adapt strategies considering any potential biases in Google’s algorithm.
  8. Algorithm Changes: Stay updated on changes to Google’s algorithm and adjust strategies accordingly.
  9. Advanced SEO Tools: Utilize advanced SEO tools to analyze and predict the impact of changes in SEO tactics.
  10. Community Engagement: Foster an active community around your content to increase engagement and visibility.
Each of these changes aims to align SEO practices more closely with the evolving nature of Google's search algorithms as suggested by the leaked documents.</gpt>

Looking forward to following this thread as people get a chance to dig in and provide real insights.

More insights like this please:
Just like the Department of Justice investigation is showing with regard to advertisers, this shows that Google has repeatedly lied to publishers about how their algorithms work. Some of these lies include:
  • They don't use SERP CTR to re-rank pages.
  • They don't treat sub-domains as separate domains.
  • There's no such thing as a ranking sandbox.
  • They don't consider the age of a domain in ranking.
  • There's no such thing as a site-wide authority score.
They're easily the most direct and actionable insights I've seen yet!
 
I guess this makes John Mueller, Gary Illyes, and Danny Sullivan's jobs redundant?

People have always been able to find examples that contradict their public positions, but now that this level of deception has been exposed, what's the point of the propaganda anymore?

Google's revenue-focused leadership should see this as a cost cutting opportunity. Axe the entire PR/disinformation teams entirely. Own your monopoly. Own your evil. Go full big oil/big pharma and give zero fucks.

The DOJ is either going to break you up, at which point you pull your golden parachute cord and dip, or, more likely, they'll just slap you with an inconsequential six-figure fine, just like they do to other monopolies that they feel are eroding public trust/belief in "the system", aka the status quo.

There's no need to try to pretend to be anyone's friends anymore.

Side note, is it just me or does the timing of this leak seem suspect given the ongoing fiasco that is "AI Overviews", the leaks about engineering/management politics, and a slew of publishers highlighting the decay of SERP quality?

Would love some insider context. I imagine internally things have got to be pretty fucking wild right now...
 
I guess this makes John Mueller, Gary Illyes, and Danny Sullivan's jobs redundant?
I don't think so. The number of people that will be exposed to these documents is already tiny. The number that will be exposed to them and realize they contradict past "truths" is even smaller. And even among those people, there are some who want to be lied to. They like corporate narratives and whatnot, as you alluded to later with "the system". It makes them feel like there's a structure they're safe within. Some people willingly pull the wool over their own eyes.

But yeah, for the other 50% of "SEOs" and the other 99% of the population, they'll only ever encounter the public relations spin. And that's the true goal of all this. It's not to defend the algorithm against SEOs. It's to have the public see Google as a smiling face that interacts with its users, offering support and guidance.

Side note, is it just me or does the timing of this leak seem suspect given the ongoing fiasco [...]
I would suspect this was purposefully leaked by someone on the inside who is as salty about it all as the rest of us on the outside. Google employees undoubtedly build their own websites, too. The implosion is coming and someone is hastening the process, accelerating towards the apocalypse. Google is behind in the AI race, without a doubt. There's no single facet they're winning in (imagery, video, text, NLP), and even as a generalist they're behind. It's not too late to hitch themselves to someone who's winning (like Microsoft did with OpenAI), but they want to be Apple too much, with proprietary everything.
 
SEO just got interesting again...

<gpt>Based on the insights from the leaked Google documents, here are ten important changes an SEO practitioner might consider to enhance their strategy:
This "summary" is a VERY ChatGPT answer - generic fluffy bullshit that sounds smart but doesn't say much and doesn't seem like it actually analyzed the document, but instead relied on it's own training data to probabilistically guess at what might look correct.
  6. Privacy and Data Usage: Be transparent about data collection and usage to build trust.
Yes, this will definitely help me rank better.

  7. Bias Understanding: Adapt strategies considering any potential biases in Google’s algorithm.
Gee thanks ChatGPT, super helpful. You just loosely defined SEO.

Also, not a single mention of links.

Sorry @Smith not trying to derail your helpfulness, I appreciate you but I love any chance to shit on ChatGPT. :smile:

---

I haven't read Rand's take, but I did take a look at Mike's. It's all pretty interesting, though also more confirmation of what the community has always thought/known.

My initial takeaways:
  • Freshness matters - update content to include more unique insights on topic and change published/updated dates regularly
    • Google classifies content into different tiers - frequently updated content is kept in flash memory, somewhat updated content on SSDs, and rarely updated content on HDDs
  • Domain/Page age matters - the sandbox exists and how long you've been in the Google index matters, aged domains FTW
  • Anchors matter a LOT - the velocity of "spammy" anchors (exact/partial kw anchors?) coming to your site vs. your regular "graph" of link acquisition impacts your rankings. Not too much said here, but he proposes this has to do with Google being able to "ignore" neg SEO (lol)
    • from experience, high velocity of spammy anchors from a neg SEO attack WILL ABSOLUTELY tank your page.
  • Pages are constantly re-ranked - I'd guess there's an initial ranking of a document, and then constant re-rankings using "twiddlers" that look at a whole variety of separate "signals"
    • This could come in the form of "modification factors" which get applied to your "ranking" (see the sketch after this list)
  • Still Just Clicks + Links - it really is all about user query signals (search query, click via serp, dwell time on page, return to serp) and number/quality of INDEPENDENT - i.e. diff referring domains - links (relevance and strength of site/page of the link, anchor text). They weigh relevance factors on BOTH SIDES (your page + linking page) of your link, and (as learned in the DOJ case) fake their understanding of user satisfaction via clicks, bounce-backs, etc.
    • Get links from sites that already rank well (on pages that already rank well if you can) and get users to stay on your page for as long as possible. Use natural anchor text and try to rely on page title, url and H1 for relevance (both on your page and linking page)
  • Author Reputation may be a thing - Google DOES store author information and may cross-reference it with other factors.
  • Demotions of Content Happen for any Number of Reasons - anchor mismatches, location-based, product-review-based, EMD-based, etc.
    • probably based on the "re-ranking twiddlers" that run constantly.
  • 20 ITERATIONS - this one was very interesting. Apparently Goog only stores the last 20 iterations of a page. Mike posits it might be possible to have a "clean slate" on a page after this threshold is hit
  • Individual pages associated with homepage PageRank - your overall "domain authority" matters for all other pages you try to rank. Get those high quality RDs!
  • Page Title is VERY important in terms of relevance to a query. Goog weighs the search query against a page's title to help determine ranking.
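To make the twiddler idea concrete, here's a rough sketch of how per-signal "twiddlers" applying modification factors to a base score might work. To be clear, the class names, signal names, and numbers below are all invented for illustration - none of this is taken from the leak itself:

```python
# Hypothetical sketch of "twiddler"-style re-ranking: each twiddler looks at a
# document and returns a multiplicative modification factor applied to its
# base score. All names, signals, and thresholds here are invented.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Doc:
    url: str
    base_score: float                 # initial ranking score for the query
    signals: dict = field(default_factory=dict)

# A twiddler maps a document to a modification factor (1.0 = no change).
Twiddler = Callable[[Doc], float]

def navboost_twiddler(doc: Doc) -> float:
    # Boost documents with strong historical click/engagement signals.
    return 1.2 if doc.signals.get("good_clicks", 0) > 100 else 1.0

def anchor_mismatch_twiddler(doc: Doc) -> float:
    # Demote documents whose inbound anchors don't match the page content.
    return 0.7 if doc.signals.get("anchor_mismatch") else 1.0

def rerank(docs: List[Doc], twiddlers: List[Twiddler]) -> List[Doc]:
    def final_score(doc: Doc) -> float:
        score = doc.base_score
        for twiddle in twiddlers:
            score *= twiddle(doc)     # apply each modification factor in turn
        return score
    return sorted(docs, key=final_score, reverse=True)

results = rerank(
    [Doc("a.com/page", 0.80, {"good_clicks": 500}),
     Doc("b.com/page", 0.90, {"anchor_mismatch": True})],
    [navboost_twiddler, anchor_mismatch_twiddler],
)
print([d.url for d in results])  # a.com/page (0.96) now outranks b.com/page (0.63)
```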
Just my initial takeaways, will probably have to read these docs again and again to really absorb it.

IMPORTANT NOTE: I'm a fuckin moron, so I may be totally wrong about everything. Please don't quote me on any of this, and feel free to tell me I'm stupid (just give me an explanation as to what and why)

Interested to hear others' takeaways or thoughts.
 
The NavBoost (clickstream data) is an interesting one that stood out both in this leak and the DOJ trial.

From Rand’s post: “The power of Navboost and the intent of users is likely the most powerful ranking factor in Google’s systems.”

It's been clear for some time that this is the case, but I'm still shocked that they place so much weight on these signals, because surely they're very susceptible to being gamed?

Has anyone here ever run an experiment around the use of click farms or similar to fake those signals?

I’m thinking specifically:
  • Searches / clicks on your brand name
  • Non-brand searches / clicks making it appear your site has been ‘sought out’ in the SERPs
  • Long dwell time on site after clicking
Typically I’ve always erred towards sustainable methods, but if they’re going to abandon all measurement / rewarding of content quality in favour of CTR then I think it deserves to be gamed.

As far as I know, people have been doing that for 10+ years now. It worked amazingly well at first but pretty quickly stopped working the same. It lasted much longer in the app stores, though. It was an easy way to the top of the app store in 2017 or so.

Now, it probably still works, but you would need to be pretty sophisticated to make it look like a normal user activity.

--

Doesn't sound any different from what many people have been saying around here since I joined this forum.
 
You're assuming that they could even measure content quality (I mean the actual text on a page) in the first place.

This leak is further proof that HCU/ "rewarding quality content" was bullshit.
I think what happened with HCU was that they defined "quality" as whatever got the best user signals. It makes sense in my niche. Gov sites started ranking really well even though they were unoptimized, but if you think about it, gov sites would have a high time on site and low click-through rate when the query is "FL drivers license" related.

My takeaway so far from not much research:
* You can't use SEO as your only channel. It doesn't work. You need other channels like social, Discover, etc.
* It is best to run a full marketing team with an affiliate manager, SEO, PPC, Google News, Google Discover, social media manager, etc. The old days of using *just* SEO are over, as your user scores will be terrible.
* You gotta treat your site like how a product manager treats a SaaS: you need to iterate and refine the site to increase time on site and CTR and lower bounce rate. Your product is the content and site design.
* You can't start a new site and hope it will rank. The algo favors existing sites so much.
 
This seems like an API confirming everything that people have already been testing and saying works, and that Google has been saying doesn't.

Basically I saw nothing that I haven't already seen in variable tests from Kyle, Fight Club, Rand Fishkin, Matt, and other people that test in the industry. But it was nice to see a confirmation.

One thing that stood out was the bucket of whitelisted domains for medical queries.

Also, they only keep the last 20 versions of a page, meaning you may essentially be able to go clean slate after 20... interesting.
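If that "last 20 versions" reading is right, the clean-slate idea is basically a rolling window: once the history is full, every new version pushes the oldest one out. A toy sketch, where the window size is the only number taken from the leak:

```python
# Toy illustration of the "only the last 20 versions are kept" reading: the
# page history acts as a rolling window, so each new version past 20 pushes
# the oldest version out. Purely a speculative reading of the leak.
from collections import deque

page_history = deque(maxlen=20)    # only the 20 most recent versions survive

for version in range(1, 26):       # publish 25 revisions of the same URL
    page_history.append(f"version-{version}")

print(len(page_history))   # 20
print(page_history[0])     # "version-6" - versions 1-5 have rolled out of the history
```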
 
I think what happened with HCU was that they defined "quality" as whatever got the best user signals. It makes sense in my niche. Gov sites started ranking really well even though they were unoptimized, but if you think about it, gov sites would have a high time on site and low click-through rate when the query is "FL drivers license" related.

Nah, I don't think so... I doubt this leak reveals much about HCU.

I don't think the HCU implementation is as complex as people think. My guess is that they simply evaluated each website's index for "keyword-focused content" above a certain threshold (i.e. 90%+ of your entire URLs are blog posts, with no product URLs or service pages - not uncommon for niche sites) and demoted them. Then they put a padlock on those sites so none can recover.
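For what it's worth, here's a toy sketch of the kind of ratio check I'm guessing at. The 90% threshold comes from my example above, and the URL heuristic is a crude stand-in for whatever classification Google actually uses - pure speculation, not from the leak:

```python
# Pure speculation: a crude version of the "mostly keyword-focused blog posts"
# check guessed at above. The threshold and the URL heuristic are invented.
def looks_like_pure_niche_site(urls: list, threshold: float = 0.9) -> bool:
    def is_blog_post(url: str) -> bool:
        # Stand-in heuristic: treat typical listicle/blog URL patterns as "keyword-focused".
        return "/blog/" in url or "/best-" in url or "/review" in url
    blog_share = sum(is_blog_post(u) for u in urls) / len(urls)
    return blog_share >= threshold   # demotion/padlock candidates above the threshold

print(looks_like_pure_niche_site(
    ["example.com/best-air-fryers/", "example.com/blog/air-fryer-tips/",
     "example.com/best-blenders/", "example.com/about/"]
))  # False - 3 of 4 URLs (75%) look keyword-focused, below the 90% threshold
```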
 
Sorry @Smith not trying to derail your helpfulness, I appreciate you but I love any chance to shit on ChatGPT. :smile:
lol no offense taken at all... I wish I had time to sit down and dig through this right now. Appreciate you guys digging through the mountain of lies and looking forward to reviewing all your insights later.
 
I was scrolling through the document and found this:

isCovidLocalAuthority (type: boolean(), default: nil) - Bit to determine whether the site has the local authority covid signal, as computed by go/covid-local-authority

:happy:

... And another funny one:

isElectionAuthority (type: boolean(), default: nil) - Bit to determine whether the site has the election authority signal, as computed by go/election-authority
 
I was scrolling through the document and found this:

isCovidLocalAuthority (type: boolean(), default: nil) - Bit to determine whether the site has the local authority covid signal, as computed by go/covid-local-authority

:happy:

... And another funny one:

isElectionAuthority (type: boolean(), default: nil) - Bit to determine whether the site has the election authority signal, as computed by go/election-authority

Snort. Fun to see those in black and white.
 
Well, that's pretty interesting. The covid thing is in-your-face blatant.

I think with the CTR stuff you have to use cohort-grouped accounts that are the selected sample.
Wonder if anyone's figured out how to warm up their bots to get into the "we count" group.
A few years back that was the be-all and end-all of how to rank or tank a site.
 
Search Engine Land talked to a Google spokesperson and got this quote:

“We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”
Here's my interpretation... "We-we w-would caution you from b-believing your own eyes. You don't understand what you're seeing (and it's old and, we promise, not all the info is there anyways, even if it was real or to be trusted). By the way, we already told you how Search works, while also lying to you to protect our results from other monsters that aren't you. You're cool, I know you trust me."
 
I think the main takeaway here is that while you can have many variables in a machine learning model, the model left to itself will usually end up leaning on just a few variables for its best fit.

I bet the siteAuthority variable is responsible for most of the shit going on in the last few years. I also bet the siteAuthority variable was trained on the "isElectionAuthority" variable.
 
#5: Google Uses Click Data to Determine How to Weight Links in Rankings

This one’s fascinating, and comes directly from the anonymous source who first shared the leak. In their words: “Google has three buckets/tiers for classifying their link indexes (low, medium, high quality). Click data is used to determine which link graph index tier a document belongs to. See SourceType here, and TotalClicks here.” In summary:

  • If Forbes.com/Cats/ has no clicks it goes into the low-quality index and the link is ignored
  • If Forbes.com/Dogs/ has a high volume of clicks from verifiable devices (all the Chrome-related data discussed previously), it goes into the high-quality index and the link passes ranking signals
Once the link becomes “trusted” because it belongs to a higher tier index, it can flow PageRank and anchors, or be filtered/demoted by link spam systems. Links from the low-quality link index won’t hurt a site’s ranking; they are merely ignored.

From https://sparktoro.com/blog/an-anony...ents-with-me-everyone-in-seo-should-see-them/.

That makes perfect sense. So you need links with traffic for them to count. Without traffic, no PageRank. So it is really good to get affiliate links to your site, since they'll send traffic.
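Here's a rough sketch of the tiering logic described in the quoted excerpt, just to spell it out. The click threshold and the "verifiable devices" flag are my own invented stand-ins, not values from the leak:

```python
# Rough sketch of the click-based link tiering described in the quoted excerpt:
# the tier a linking page lands in decides whether its link passes ranking
# signals. The click threshold and parameter names are invented stand-ins.
def link_index_tier(total_clicks: int, clicks_from_verifiable_devices: bool) -> str:
    """Classify a linking page into a link-graph index tier."""
    if total_clicks == 0:
        return "low"       # link gets ignored - passes nothing, but doesn't hurt either
    if clicks_from_verifiable_devices and total_clicks >= 1000:
        return "high"      # "trusted" - can flow PageRank and anchor text
    return "medium"

def link_passes_signals(tier: str) -> bool:
    return tier in ("medium", "high")

print(link_index_tier(0, False))    # "low"  -> like the zero-click Forbes.com/Cats/ example
print(link_index_tier(5000, True))  # "high" -> like the high-click Forbes.com/Dogs/ example
```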
 
I watched that.

I wouldn't start a new site anymore and use SEO as a channel. It's a fucking dead end for new sites. You can't even find a niche anymore.

What I would do (and am doing) is just go directly to where the traffic is. So I'll run ads, do Google Discover, do Google News, promote my affiliate program, and do social media marketing. Traffic is more important than rankings. The API leak shows that NavBoost is a huge ranking factor. Unless you have traffic already, you won't rank. It is that simple.
 
The API leak shows that NavBoost is a huge ranking factor. Unless you have traffic already, you won't rank. It is that simple.
lololol. SMH.
 