This should be fun for everyone. If you visit this Washington Post article and scroll down, you'll find a search field that shows where your site ranks in Google's C4 dataset and what percentage of all tokens it makes up:
For instance, BuSo is rank 183,243 and makes up 120k tokens, which is 0.00008% of all tokens.
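If you want to sanity-check the numbers the tool gives you, here's a rough sketch of the arithmetic in Python. The function names are just ones I made up for illustration, and both inputs are the rounded figures quoted above, so treat the result as a ballpark rather than an exact count.

```python
def implied_total_tokens(site_tokens, share_percent):
    """Back out the dataset's total token count from one site's
    token count and its share of all tokens (given as a percent)."""
    return site_tokens / (share_percent / 100)


def token_share_percent(site_tokens, total_tokens):
    """Percentage of all tokens that a single site contributes."""
    return site_tokens / total_tokens * 100


# Rounded figures quoted for BuSo in this post.
buso_tokens = 120_000
buso_share_percent = 0.00008

total = implied_total_tokens(buso_tokens, buso_share_percent)
print(f"Implied dataset size: about {total / 1e9:.0f} billion tokens")

# A site with ten times BuSo's share (an order of magnitude more)
# would hold roughly ten times the tokens.
bigger_site_tokens = total * (buso_share_percent * 10) / 100
print(f"10x the share is about {bigger_site_tokens / 1e6:.1f} million tokens")
```

So 120k tokens at 0.00008% implies a dataset somewhere around 150 billion tokens, give or take the rounding.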
Don't dox yourself, but did you find your website in the list?
My newest main project isn't in the list, but my previous main site ranks surprisingly high, with a share of all tokens about an order of magnitude larger than BuSo's.
Anyways, the C4 dataset (the Colossal Clean Crawled Corpus) is made up of 15 million English-language websites. It's been used to train Google's T5 and Facebook's LLaMA. We don't know if OpenAI used it for ChatGPT.
I think all these companies need to cut us all a check if our sites appear in the list! Rabble Rabble!