Remember the post I published in late March 2023, with steps you can take to restrict ChatGPT from using content from your WordPress site?
It may not have worked if your site’s content was already scraped.
That is what happened with this site, lireo.com.
Not what I expected.
And from what I understand, there’s nothing I can do about it.
Read on to learn more.
Millions of Sites Used to Train Large Language Models
Thanks to a message from my friend Chris Wiegman, I learned about today’s Washington Post story, “Inside the secret list of websites that make AI like ChatGPT sound smart.”
The authors, Kevin Schaul, Szu Yu Chen, and Nitasha Tiku, analyzed Google’s Colossal Clean Crawled Corpus (C4) data set (along with Allen Institute for AI researchers), which contains content from 15 million websites.
About one-third of the sites weren’t categorized, and weren’t used in the Post analysis.
Of the remaining 10 million websites the Washington Post ranked, this site ranks 442,028th, with 53K tokens appearing in the data set.
Wondering what tokens are?
I didn’t know either, but the Post story explains. Tokens are:
…small bits of text used to process disorganized information—typically a word or phrase.
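To make that definition a little more concrete, here is a rough sketch in Python of what splitting text into tokens can look like. It is an illustrative assumption only: the function name `rough_tokens` is made up for this example, and it simply treats each word and each punctuation mark as one token. It is not the method the Post or the C4 data set uses, so the counts it gives will not match the numbers in the story.

```python
import re


def rough_tokens(text):
    """Split text into word-like tokens and standalone punctuation marks.

    Hypothetical helper for illustration; real tokenizers are more involved.
    """
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hey, Google. Why are you ignoring copyright?"
tokens = rough_tokens(sample)

print(tokens)       # the individual tokens
print(len(tokens))  # how many tokens the sample contains
```

Run against every page of a site, a count like this gives a feel for how a personal blog can add up to tens of thousands of tokens.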
What Sites Are in the Data Set?
A lot.
There’s a wide range of sites in the data set, with the three biggest categories listed as:
- Business and industrial websites (16 percent)
- Technology (15 percent)
- News & Media (13 percent)
And the biggest sites in the data set are:
- Google Patents (where you can search and read the full text of patents from around the world)
- Wikipedia
- Scribd
Think it’s only businesses, news, and technology in the data set? No.
Personal blogs are also included.
What Bothers Me About the Findings in the Post Story
Like many other people who publish online, my content is copyrighted.
I didn’t provide consent for my content to be scraped and used for Large Language Model (LLM) training data.
Where is my check from Google? The check paying me for helping to train their LLM with my content?
The Post authors point out that issue in their story, highlighting that the “copyright symbol appears more than 200 million times in the C4 data set.”
Hey, Google. Why are you ignoring copyright?
Seems to me you’re in violation of the Digital Millennium Copyright Act (DMCA) and all my content should be removed from your data set.
Wondering If Your Site Is Included?
Read the Washington Post story and use the search box to discover if your website is in Google’s C4 data set.
If you find out how to bill Google for using my content and your content, let me know.
Both oldaintdead.com and webteacher.ws are included. I didn’t check any of my other sites.
Hi Virginia,
Thank you for your comment. What are your thoughts about Google using your web content for training their large language models? (BTW, I didn’t realize you had other sites besides oldaintdead.com and webteacher.ws.)
Both the ones I mentioned are self-hosted. I have another blog that uses wordpress.com free space, which is less under my control in terms of content and copyright. And places like Twitter, Instagram, etc. aren’t under my control. But it bothers me that my personal copyrighted spaces are being used.