Is Google Using Your Website to Train its AI Large Language Model?

Ryuzaki

This should be fun for everyone. If you visit this Washington Post article and scroll down, you'll find a search field that lets you look up your site's rank in Google's C4 dataset and the percentage of all tokens it accounts for:


For instance, BuSo is rank 183,243 and makes up 120k tokens, which is 0.00008% of all tokens.
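Fun aside: if you take those two numbers at face value, you can back out the size of the whole corpus. This is just arithmetic on the figures above, nothing pulled from the article itself:

```python
# Back-of-the-envelope: if 120k tokens is 0.00008% of the corpus,
# the implied total size of C4 is tokens / share.
buso_tokens = 120_000
buso_share = 0.00008 / 100   # convert the percentage to a fraction

total_tokens = buso_tokens / buso_share
print(f"Implied C4 size: {total_tokens:.2e} tokens")  # ~1.5e11, i.e. ~150 billion
```

That roughly matches the ~156 billion tokens commonly cited for C4's English split, so the tool's percentages look internally consistent.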

Don't dox yourself, but did you find your website in the list?

My newest main project isn't in the list, but my previous main site sits surprisingly high up, an order of magnitude beyond BuSo in its percentage of all tokens.

Anyways, the C4 dataset (the Colossal Clean Crawled Corpus) covers about 15 million English-language websites. It's been used to train Google's T5 and Facebook's LLaMA. We don't know whether OpenAI used it for ChatGPT.
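If you want to poke at the raw data yourself, the cleaned English split is mirrored on Hugging Face as allenai/c4. A minimal sketch using the datasets library, streaming so you don't have to download the full ~300GB up front:

```python
# Stream a few documents from the C4 English split (allenai/c4 on Hugging Face).
# Each record has "text", "url", and "timestamp" fields.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    print(doc["url"])            # the page the text was scraped from
    print(doc["text"][:200])     # first 200 characters of the cleaned text
    print("---")
    if i >= 2:                   # just peek at the first three documents
        break
```

You could filter on doc["url"] to hunt for your own domain, though streaming through ~150 billion tokens will take a while.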

I think all these companies need to cut us all a check if our sites appear in the list! Rabble Rabble!
 
Yes, I found a few of my sites in there, including some quite small ones and some newer ones. Interesting playing around with that searchbox...
 
Yeah, I found one of my very early niche sites and one I made in 2017 that I only keep around because it makes enough to cover domain renewal every year. Both are very small compared to the sites I'm creating today, so I'm surprised they're on the list. Hell, the 2017 one is nothing but 150-word 'press releases' for streetwear drops.
 
Interesting, got SERPWoo at 0.00004%

.0001 gang checking in
3 zeros is quite impressive.

Wickedfire.com - the dataset has never heard of it.

BlackHatWorld.com is at 0.001%

digitalpoint.com is at 0.00003%

searchenginejournal.com is at 0.002%

seroundtable is at 0.001%

Twitter is at 0.008%

A bit concerning that CNN.com is at 0.01%.
--

Also note, this is another example of interactive content that gets backlinks naturally.

And they knew what they were doing because they forced me to sign up for their little newsletter. The curiosity was greater than the pain of spam - they're learning.
 
Pretty much everything of mine that I checked that's at least a year old is in there - interesting stuff. No wonder my hand-written stuff sets off the AI content detector alarms! :wink:
 
Interesting, even my small/hyper-niche stuff is in there, like my 10-25 page rank-and-rent sites. The common theme across all of these is not rehashing the same old information on the internet, so I guess unique content is what the beast wants to feed on.
 
"About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown."

That seems strange to me... why would they have been pulling data, or trying to, on sites so shitty they aren't even around anymore? Unless I totally misunderstood, which is very possible.

Primary decade+ old domain haha
0.0000004%
 
why would they have been pulling data, or trying to, on sites so shitty they aren't even around anymore?

To spot BAD content and sites. It makes sense: they see what shitty websites look like and use that to predict whether you'll still be around in the end.

We do the same thing as humans when doing audits. We notice most bad sites have no social media presence or brand signals, which are indicators of a shitty website, and sure enough those sites aren't around later on. That's a good indication we were right about the webmaster's dedication.

You want the bad data to show Google what NOT to waste time with.
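To make that concrete, here's a toy sketch of the kind of heuristic we run in our heads during an audit. The signal names and the rule are entirely hypothetical, nothing Google has published:

```python
# Toy heuristic only: these signals and the rule are made up for illustration,
# mirroring the human audit logic described above.
def looks_abandoned(site: dict) -> bool:
    """Flag a site that's missing every brand/dedication signal."""
    signals = [
        site.get("has_social_profiles", False),
        site.get("has_brand_searches", False),
        site.get("updated_in_last_year", False),
    ]
    # No social presence, no brand searches, no recent updates: a likely goner,
    # i.e. a candidate example of what NOT to learn from.
    return not any(signals)

print(looks_abandoned({"has_social_profiles": False}))   # True: no signals at all
print(looks_abandoned({"has_social_profiles": True,
                       "updated_in_last_year": True}))   # False: shows some dedication
```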
 