December 25, 2024

Humans, Generative AI, and Learning from Copyrighted Materials

Author: david

If you’re not listening to the Latent Space podcast, you’re missing some of the best thinking on generative AI happening right now. The show notes for a recent episode begin: “Stop me if you’ve heard this before: ‘GPT3 was trained on the entire Internet.’ Blatantly, demonstrably untrue: the GPT3 dataset is a little over 600GB, primarily Wikipedia, Books corpuses, WebText and 2016-2019 CommonCrawl. The MacBook Air I am typing this on has more free disk space than that. In contrast, the ‘entire internet’ is estimated to be 64 zettabytes, or 64 trillion GB. So it’s more accurate to say that…”
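The arithmetic in the quote holds up: a zettabyte is 10^21 bytes, so 64 zettabytes is indeed 64 trillion GB. As a quick back-of-envelope sketch (in Python, using only the two figures quoted above), here is how small GPT-3’s training set is against that estimate:

```python
# Back-of-envelope comparison using the figures quoted above.
GPT3_DATASET_GB = 600          # "a little over 600GB"
INTERNET_ZB = 64               # "64 zettabytes"
GB_PER_ZB = 10**12             # 1 ZB = 10^21 bytes = 10^12 GB

internet_gb = INTERNET_ZB * GB_PER_ZB      # 64 trillion GB, matching the quote
fraction = GPT3_DATASET_GB / internet_gb   # GPT-3's slice of the whole

print(f"Estimated internet size: {internet_gb:.1e} GB")  # 6.4e+13 GB
print(f"GPT-3 dataset fraction:  {fraction:.1e}")        # ~9.4e-12
```

That works out to roughly one part in a hundred billion, which is why the show notes call the “entire Internet” claim demonstrably untrue.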
