Pearson taking legal action over use of its textbooks to train language models (standard.co.uk)
35 points by redbell on May 9, 2023 | 60 comments


My favorite anti-depression book is Feeling Good by Dr. Burns. I especially like the "if that were true, it would mean that..." exercise in the book, where you drill down to the root cause of your negative thoughts.

So I asked ChatGPT about the book, and the exercise, and how to do it, and examples of the 9 cognitive distortions you're supposed to label your responses with. It was very helpful and knew the material well.

It made me wonder about this exact point. If you wrote a book on a very rare subject that you were basically the only popular source of detailed info on... and now these models just answer questions for people without having to buy the book... is that really fair?

Edit: I guess this isn't really that different from the fact that I've read the book, described it to many friends, and even done the exercise with them.


> answer questions for people without having to buy the book... is that really fair?

Yes.

It's fair because that happens all the time with non-AI intelligences (i.e., people). No author has the right to prevent me from discussing a book with someone. If I were a clinician and incorporated that exercise into my practice, as long as I wasn't copying any content (e.g., worksheets, scoring tables, etc.), then I owe nothing to the author.

It's also a fundamental principle of copyright law, at least in the United States.

https://supreme.justia.com/cases/federal/us/499/340/

https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....


Are you OK with the developers of LLMs capturing most of the value of every incremental piece of content created by humans in perpetuity?

The difference is scale. A human can use their learnings from copyrighted work to get a job making $80,000/year. An LLM's developers can use its learnings from copyrighted work to become the biggest, most profitable company ever.


> Are you OK with the developers of LLMs capturing most of the value of every incremental piece of content created by humans in perpetuity?

If it helps humanity as a whole, sure. Artificial gate-keeping based on economic factors such as people being able to keep their $80k jobs is not justifiable to me to stop technological progress; one could have said the exact same thing centuries ago with the advent of any number of inventions.


How does it help humanity as a whole if nobody is able to capture any value for their work?


What kind of "value" are they capturing?

In my view, something one makes which can be built upon by another freely and without limit is what helps humanity. I don't want anyone trying to capture rent-seeking "value" out of work that cannot be extended by others. That is one reason why I abhor copyright, patents, and intellectual property in general.


Is an LLM that slurps up all data ever created, without compensating the creators of that data, and then charges for access, not the definition of "rent-seeking"?

Next question: does this share similarities with academic journals, which most folks do consider rent-seeking organizations?


I never said that was okay either. LLMs should be open source and open data. Yes, this viewpoint shares similarities with academic journals that also rent seek.


Thank you for the interesting discussion!

> I never said that was okay either. LLMs should be open source and open data.

> That is one reason why I abhor copyright, patents, and intellectual property in general.

If it turns out that closed-source LLMs become the most valuable companies in the world, any thoughts on how to reconcile your ideals with the reality of how the technology has developed in my hypothetical example?


There are multiple gradations of value. Linux exists alongside Windows and macOS. They each have different strengths and weaknesses. It could be that closed-source LLMs become the most prevalent, but we already have open-source LLMs available today that run on-device, essentially for free. I don't necessarily see the world moving only towards closed-source LLMs.


How do you propose that LLMs are paid for?


Through government funds, as other public works projects are. Or if people want to privately create their own, they can do so, as long as the results are open.


Why should just LLMs be funded that way and not all software?


You're right, all software should be funded that way.


Who would decide what to fund?


How are public works projects funded now? The same way. And if corporations want to fund certain projects too, they can. It's simple: extrapolate from how we fund current projects and do the same with software; it's not that complicated.


Earlier you said:

> In my view, something one makes which can be built upon by another freely and without limit is what helps humanity.

Now you are saying that LLMs, and maybe even all software, should only be funded by the approval of appointed government committees.

Do you not see the contradiction in your viewpoint?


Did you miss the parts where I mentioned that private individuals and corporations can also fund their own software, as we do today? My point is that as long as code, models, and data are open source, we will be free to mix, match, and improve upon them, instead of only having rich corporations hold them. One way to do that is to have the government fund projects and open-source them, as an alternative to corporate closed-source control.


> Did you miss the parts where I mentioned that private individuals and corporations can also fund their own software, as we do today?

You aren’t being honest.

Today we’re allowed to fund our own software and use it as we please for private purposes or to license it to others. Today there is no law forcing us to give it away.

This isn’t what you are proposing.


How am I not being honest? If you misunderstand or don't read my words as they are written, that's not really my problem.

How do you know what I'm proposing? Perhaps we should have a law forcing software to be open source.


You cannot extend anyone’s work if you have no food or resources.


That is an economic argument for something like UBI, not an argument to artificially keep jobs and limit technological progress just so people can make money.


All jobs are artificially kept, as UBI would be. They are all artifacts of the law.

Money is a way to allocate resources, and technological progress depends on people having incentives to contribute.

Do you think LLMs are a natural phenomenon?


> technological progress depends on people having incentives to contribute

Does it? That stands in stark contrast to open source in general, where people contribute without any monetary incentives.


> where people contribute without any monetary incentives.

First of all, you added the word 'monetary' to straw-man the position.

Secondly, even if we run with your straw man, I see little evidence that this is true. Most successful projects are either corporately funded or begging for corporate funding. You see article after article here bemoaning the lack of funding for open source.


Okay, if we count incentives in general, then anything we do is incentivized: me eating food is incentivized by not starving to death. It's not a particularly enlightening argument, which is why I preemptively added "monetary," as we were already talking about economic incentives and UBI. But if you want to explain what other incentives you mean, please do; I'd like to hear them.

In today's world with no UBI, it's no wonder people bemoan the lack of funding, since staying alive while producing the OSS one wants to make is a real issue. If we had something akin to UBI, so that we didn't force people to stall technological progress just to keep a job, this issue would disappear.

And to be clear, I'm also not against corporate funding for OSS, so long as it does remain OSS at the end of the day. Linux has corporate funding yet maintains its OSS status.


Earlier you said “people contribute without any monetary incentives”, and now you are admitting that they do need monetary incentives because there is no UBI.

Which is the truth?


People contribute already without any monetary incentives, but it sure would help to have UBI. I'm not sure what part you're confused about, as it's not an "either/or" proposition.


Does "work" here mean buying rights to books? And since Pearson aren't paying their fair share, why should the public do the work of holding a trial?


What do you mean they aren’t paying their fair share? You just said they bought the rights.

Also it seems like you are saying we should just pre-judge cases without using the courts. Basically just end the rule of law. Not sure that’s a good plan.


AI and humans aren't the same thing. Even if we do anthropomorphize AI.


> No author has the right to prevent me from discussing a book with someone.

a lossy compression function is not "someone"


Oh? And what section of Title 17 does that ostensible distinction implicate?


At what point is it discussing a subject, vs. basically listing verbatim the things, methods, and reasoning from a book?


Reproduction of the precise form of expression is the only thing covered by copyright.

Facts are not protected by any IP law in the United States.

Ideas may be protected by patent law, if the idea constitutes a novel, non-obvious invention.


> It made me wonder about this exact point. If you wrote a book on a very rare subject that you were basically the only popular source of detailed info on... and now these models just answer questions for people without having to buy the book... is that really fair?

Dr. Burns seems to have done well, so in his case I'd think he was at least adequately compensated for his life's work. You can't settle the "fairness" of copyright in all cases by that alone, but in this particular case I feel there's little or no harm done.

If people start selling these GPT things because they let you avoid fair recompense, that's another thing we'll have to work out as a society.


> If you wrote a book on a very rare subject that you were basically the only popular source of detailed info on... and now these models just answer questions for people without having to buy the book... is that really fair?

What if a person bought a copy of the book, read it and learned the material, and then provided advice to people based on it? I think most people would say that that is fair. So why is it less fair for the AI to do what is effectively the same thing? Is it simply a matter of scale? Or is there something more fundamental at work?


I'd argue ChatGPT is closer to restating copyrighted materials in a different way. And that is legally questionable. Whether that actually matters depends on how much money is at stake and how much each party cares to spend to enforce or protect their viewpoint.

Allowing these systems to utilize copyrighted materials without compensating the copyright holder is a legal loophole the size of the sun. Then it becomes a perfectly legal defense to write "an AI" that slurps up copyrighted material and produces results that are "based on" that material.

My issue is that the copyrighted work is fundamentally integrated into the algorithm. Say you wrote a program that takes a copyrighted work as input and prints out the contents, but with some words replaced by synonyms. Would the output of that be okay? What if phrases were replaced? What if entire sections were added, rearranged, and/or removed? At some point the results become so mangled as to be unrecognizable, but the fact that a copyrighted work serves as an input to the program is still an issue, IMHO.
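
For concreteness, here's a minimal Python sketch of that hypothetical synonym-swapping program; the synonym table and the launder function are invented purely for illustration:

    # A toy version of the thought experiment above: reprint a text with
    # some words swapped for synonyms. The synonym table is made up.
    SYNONYMS = {
        "happy": "glad",
        "sad": "unhappy",
        "thought": "notion",
        "book": "volume",
    }

    def launder(text: str) -> str:
        # No matter how many substitutions are layered on, the
        # copyrighted work remains the sole input to the program.
        return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

    print(launder("A sad thought from a happy book"))
    # -> "A unhappy notion from a glad volume"

One could keep stacking phrase-level or section-level rewrites onto this pipeline, but the output would still be a mechanical function of the copyrighted input, which is exactly the point.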

A major difference between humans and computers is that a human can't (generally) recall hundreds of pages of text and regurgitate it on demand. So a human who has read a book only retains a fraction of the contents, and even then, generally they don't retain all of the information accurately.


I get what you're saying, but I'm still not convinced that introducing the notion of "a computer program" vs "a human" changes anything in the equation. That is to say, if you, as a human, had a photographic memory, and after reading a copyrighted work began to spit out big chunks of it verbatim in response to a query, then you would already be guilty of copyright infringement, no? Likewise, if a computer program reads 10 books on, I dunno, let's say quantum physics, and then, in response to a query, emits a technically correct answer that isn't obviously cribbed from any one book... has it violated copyright?

My point is not to say that contemporary "AI" systems never emit things that are arguably copyright infringement. It's more just to say that there's nothing about being an AI that makes it necessary that its output be copyright infringing.


As far as I can tell, restating materials in a different way does not infringe copyright.


It can be seen as a derivative work. A translation of Harry Potter into French, Japanese, or Latin is not free from the copyright of the original work. That is literally restating material in a different way.


Right, that's just citing sources. Even if your source is only one book (the CliffsNotes model).


The answer is pretty obvious: one is a computer program, literally designed to pilfer copyrighted works; the other is a human being.


If you smuggle "pilfer" into the description of AI, then in that case it's "obvious".

But the comparison could just as easily be written, "One is a computer program, literally designed to learn from copyrighted works; the other is a human being that learns from copyrighted works."

If ChatGPT is parroting entire paragraphs of copyrighted work in its answers, then "pilfer" is probably the right word. Or maybe "plagiarize". But if it's training on the information gleaned from the work, and using that training to synthesize the answers, isn't that a lot more like how a learned person applies their knowledge?

I haven't yet made up my mind completely on how IP should interact with AI. I see good arguments for and against. But the answer is far from obvious!


So we can either paint ourselves into a corner with legislation and lawyers and prevent "AI" progress, or have a constitutional referendum to decide whether or not it's fair use.


Why is it "pilfering" when the computer program does it, but "learning" when a human does it?


Because those two things are not even equivalent.

A human can reason, make decisions, act in their best interest, judge the context of a situation, apply it to new scenarios, judge emotion, and create a new derivative work.

The computer program is literally taking someone's work and adding a fancy way to search through it, without adding any value.

We need to stop fantasizing that these fancy toys are actually any sort of intelligence.


> We need to stop fantasizing that these fancy toys are actually any sort of intelligence.

That strikes me as a rather extraordinary claim. It seems obvious to me that contemporary AIs are "some" sort of intelligence, albeit with open questions around "what kind of intelligence?" and "how intelligent are they?".


Because we say so? We get to define the rules here, and there's no imperative to treat the two as the same thing.


I do wonder if it will disincentivise the author from writing the book in the first place. What's the point if an AI is going to regurgitate the material to every Tom, Dick, and Harry, thereby killing the demand for the book?


That was basically my edit; we may have missed each other.

The only point I can see making is that for some reason I doubt OpenAI actually paid the author for the contents of the book, but maybe they did.

Maybe they borrowed ebooks one by one from some digital library and ingested them that way for free.


Somehow I feel like they just scraped Libgen and Sci-Hub.

Much more "scalable", i.e., an easy way to get a raw training dataset. Move fast and break things.


What if one person bought the book, copied it, and distributed it for negligible cost to anyone with a passing interest in the contents? Sounds pretty similar to copyright infringement.


As a side note, how would we all feel if someone wrote a depression/cognitive behavioral therapy AI chatbot, charged $20 a month for it, and then it just basically ran you through the exact methods from the Feeling Good book?


This happens all the time with "self help" TikTokers and YouTubers and whatnot who are just recycling others' ideas and content.

How would you like to stop this? It doesn't violate copyright unless they read the book verbatim. I guess the ideas could be patented as a utility method by the author, which would prevent someone else from practicing that method for 20 years.

Mostly this is just the way things were designed, and I think that's good in the sense that the purpose of creation is to further human experience, not to maximize creator revenue.


That was the business model of Joyable, except it charged more and had a human periodically sanity-check your progress on the exercises.


People and information want to be free.

I'm sure the author would have no problem with everyone in the world receiving a free summary — once they know they've been fed and cared for.


Burns was a working professional before writing the book. If he wanted to release the book for free, he would have. If he wanted to provide summaries of it online, he would have. If he wanted to discuss the book online for free, he would have, and he does. But I don't think having it fed into a question-and-answer service is something he really wanted.


> Bird also said it was usually easy to tell what a large language model such as ChatGPT has been trained on, because “you can ask it”.

So what does that mean? They literally asked, "Did you read book X during training?"

Sounds like they’re fools. Desperate fools, most likely.


They don't have much of an argument; a lot of their material is full of doublespeak narratives and watered-down information, characteristics which any computer will fail to utilize properly.



