Is Google Using Your Website to Train its AI Large Language Model?

Ryuzaki

This should be fun for everyone. If you visit this Washington Post article and scroll down, you'll find a search field that lets you look up your site's rank in Google's C4 dataset and the percentage of all tokens it accounts for:


For instance, BuSo is rank 183,243 and makes up 120k tokens, which is 0.00008% of all tokens.
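Fun aside: if you take those two numbers at face value, you can back out the size of the whole corpus. This is just arithmetic on the figures above, nothing pulled from the article itself:

```python
# Back-of-the-envelope: if 120k tokens is 0.00008% of the corpus,
# the implied total size of C4 is tokens / share.
buso_tokens = 120_000
buso_share = 0.00008 / 100   # convert the percentage to a fraction

total_tokens = buso_tokens / buso_share
print(f"Implied C4 size: {total_tokens:.2e} tokens")  # ~1.5e11, i.e. ~150 billion
```

That roughly matches the ~156 billion tokens commonly cited for C4's English split, so the tool's percentages look internally consistent.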

Don't dox yourself, but did you find your website in the list?

My newest main project isn't in the list, but my previous main site sits surprisingly high up, an order of magnitude beyond BuSo in its percentage of all tokens.

Anyways, the C4 dataset (the Colossal Clean Crawled Corpus) covers about 15 million English-language websites. It's been used to train Google's T5 and Facebook's LLaMA. We don't know whether OpenAI used it for ChatGPT.
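If you want to poke at the raw data yourself, the cleaned English split is mirrored on Hugging Face as allenai/c4. A minimal sketch using the datasets library, streaming so you don't have to download the full ~300GB up front:

```python
# Stream a few documents from the C4 English split (allenai/c4 on Hugging Face).
# Each record has "text", "url", and "timestamp" fields.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    print(doc["url"])            # the page the text was scraped from
    print(doc["text"][:200])     # first 200 characters of the cleaned text
    print("---")
    if i >= 2:                   # just peek at the first three documents
        break
```

You could filter on doc["url"] to hunt for your own domain, though streaming through ~150 billion tokens will take a while.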

I think all these companies need to cut us all a check if our sites appear in the list! Rabble Rabble!
 
Yes, I found a few of my sites in there, including some quite small ones and some newer ones. Interesting playing around with that searchbox...
 
Yeah, I found one of my very early niche sites and one I made in 2017 that I only keep around because it makes enough to cover domain renewal every year. Both are very small compared to the sites I'm creating today, so I'm surprised they're on the list. Hell, the 2017 one is nothing but 150-word 'press releases' for streetwear drops.
 
Interesting, got SERPWoo at 0.00004%

.0001 gang checking in
3 zeros is quite impressive.

Wickedfire.com - the dataset has never heard of it.

BlackHatWorld.com is at 0.001%

digitalpoint.com is at 0.00003%

searchenginejournal.com is at 0.002%

seroundtable is at 0.001%

Twitter is at 0.008%

A bit concerning that CNN.com is at 0.01%.
--

Also note, this is another example of interactive content that gets backlinks naturally.

And they knew what they were doing because they forced me to sign up for their little newsletter. The curiosity was greater than the pain of spam - they're learning.
 
Pretty much everything of mine that I checked that's at least a year old is in there - interesting stuff. No wonder my hand-written stuff sets off the AI content detector alarms! :wink:
 
Interesting, even my small/hyper-niche stuff is in there, like my 10-25 page rank-and-rent sites. The common theme across all of these is not rehashing the same old information on the internet, so I guess unique content is what the beast wants to feed on.
 
"About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown."

That seems strange to me... why would they have been pulling data, or trying to, on sites so shitty they aren't even around anymore? Unless I totally misunderstood, which is very possible.

Primary decade+ old domain haha
0.0000004%
 
why would they have been pulling data, or trying to, on sites so shitty they aren't even around anymore?

To spot BAD content and sites. It makes sense: they see what shitty websites look like and use that to predict whether you'll still be around in the end.

We do the same thing as humans when doing audits. We notice most bad sites have no social media presence or brand signals, which are indicators of a shitty website, and sure enough those sites aren't around later on. That's a good indication we were right about the webmaster's dedication.

You want the bad data to show Google what NOT to waste time with.
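To make that concrete, here's a toy sketch of the kind of heuristic we run in our heads during an audit. The signal names and the rule are entirely hypothetical, nothing Google has published:

```python
# Toy heuristic only: these signals and the rule are made up for illustration,
# mirroring the human audit logic described above.
def looks_abandoned(site: dict) -> bool:
    """Flag a site that's missing every brand/dedication signal."""
    signals = [
        site.get("has_social_profiles", False),
        site.get("has_brand_searches", False),
        site.get("updated_in_last_year", False),
    ]
    # No social presence, no brand searches, no recent updates: a likely goner,
    # i.e. a candidate example of what NOT to learn from.
    return not any(signals)

print(looks_abandoned({"has_social_profiles": False}))   # True: no signals at all
print(looks_abandoned({"has_social_profiles": True,
                       "updated_in_last_year": True}))   # False: shows some dedication
```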
 