In addition to the possible business threat, forcing OpenAI to identify its use of copyrighted data would expose the company to potential lawsuits. Generative AI systems like ChatGPT and DALL-E are trained using large amounts of data scraped from the web, much of it copyright protected. When companies disclose these data sources it leaves them open to legal challenges. OpenAI rival Stability AI, for example, is currently being sued by stock image maker Getty Images for using its copyrighted data to train its AI image generator.
Aaaaaand there it is. They don’t want to admit how much copyrighted material they’ve been using.
LLMs are not book reports. They are not synthesizing information. They’re just pulling words based on probability distributions. Those probability distributions are based entirely on what training data has been fed into them.
You can see what this really means in action when you call on them to spit out paragraphs on topics they haven’t ingested enough sources for. Their distributions are sparse, and they’ll spit out entire chunks of text that are pulled directly from those sources, without citation.
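To make the sparse-distribution point concrete, here is a minimal sketch using a toy bigram model. It is an illustration only, not how ChatGPT or any production LLM is built, and the names `train_bigram` and `generate` and the sample sentence are made up for this example. It shows that when each word has been seen with only one follower, sampling from the "distribution" can do nothing except replay the training text.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample up to `length` words, picking each next word in proportion
    to how often it followed the previous word in training."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word was never seen with a follower
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out)


# Hypothetical "sparse" training set: one sentence, every word unique,
# so every word has exactly one possible follower.
sparse_corpus = "the quick brown fox jumps over a lazy dog"
model = train_bigram(sparse_corpus)
sample = generate(model, "the", 9)
print(sample == sparse_corpus)
```

With more varied training text, each word picks up several possible followers and the output starts to diverge from any single source. Whether and how often large models reproduce passages this way is an empirical question the toy model does not settle.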
If you write a book report that just reprinted significant swaths of the book, that would be plagiarism, and yes, would 100% be called copyright infringement.
Importantly, though, the copyright infringement for these models does not come at the point where it spits out passages from a copyrighted work. It occurs at the point where the work is copied and used for purposes that fall outside what the work is licensed for. And most people have not licensed their words for billion dollar companies to use them in for-profit products.
@Kichae
The exact same thing a human does when writing a sentence. I’m starting to think that the backlash against AI is simply because it’s showing us what simple machines we humans are as far as thinking and creativity goes.
Do you have an example of this? I’ve used GPT extensively for a while now, and I’ve never had it do that. If it gives me a chunk of data directly from a source, it always lists the source for me. However, I may not be digging deep enough into things it doesn’t understand. If we have a repeatable case of this, I’d love to see it so I can better understand it.
This is the meat and potatoes of it. When a work is made public, be it a book, movie, song, physical or digital, it is placed in the public domain and can be freely consumed by the public, and it then becomes part of our own particular data set. However, the public, up until a year ago, wasn’t capable of doing what an AI does on such a large scale and with such ease of use. The problem isn’t that it’s using copyrighted material to create. Humans do that all the time; we just call it an “homage” or “parody” or “style”. An AI can do it much better, much more accurately, and much more quickly, though. That’s the rub, and I’m fine with updating the laws based on evolving technology, but let’s call a spade a spade. AI isn’t doing anything that humans haven’t been doing for as long as there has been verbal storytelling. The difference is that AI is so much better at it than we are, and we need to decide if we should adjust what we allow our own works to be used for. If we do, though, it must affect the AI in the same way that it does the human, otherwise this debate will never end. If we hamstring the data that an AI can learn from, a human must have the same handicap.