Hacker News

FSF statement on copyright infringement lawsuit Bartz v. Anthropic

256 points by m463 4 days ago

110 Comments
I'm really confused by the FSF statement here. The court ruled that the use of copyrighted information is fair use. The issue is that Anthropic pirated (obtained illegally) copyrighted work and that was the offense. FSF books are free to download and store etc. The license says: "This is a free license allowing use of the work for any purpose without payment." So how can they claim that their rights were infringed when the court ruled that the problem was the illegal downloading of copyrighted work? It's impossible to illegally download an FSF book.

By briandw 8 hours ago
1. There are some uses of the copies that were not granted summary judgement
> And, as for any copies made from central library copies but not used for training, this order does not grant summary judgment for Anthropic. On this record in this posture, the central library copies were retained even when no longer serving as sources for training copies, “hundreds of engineers” could access them to make copies for other uses, and engineers did make other copies.
Whether or not those other actions met the requirements of the FDL is untested and would be the subject of a trial, had this gone to trial, but it didn't.
2. The FDL does have requirements that must be met for the use of copies to be permissible -- it doesn't allow you to do anything you want.

By kube-system 3 hours ago
While there are requirements on making copies in the FDL, I think it is extraordinarily unlikely that a court would find that a company making internal copies would violate the license when those restrictions are just along the lines of "and you must include a copy of the license".
And the FSF would be extremely foolish to ever pursue such a suit, because extremely ordinary non-AI related activities involving working with internal local documents also make copies in a similar way. If OpenAI violated the FDL by doing so then the FDL is a foot gun of a license that companies would be well advised to avoid.
The only suit that makes any sense would be the one against using the FDL licensed documents to train the not-FDL licensed AI... and the judge already rejected that in this case.

By gpm 3 hours ago
The requirements being easy to meet doesn't absolve someone from having to follow them. And the FSF clearly says they aren't pursuing a suit in this case. I'm not suggesting that filing suit here would make sense, rather, that the FSF does care deeply and fundamentally about requiring copyleft restrictions -- they vehemently don't tolerate permissive use under their licenses.
> If OpenAI violated the FDL by doing so then the FDL is a foot gun of a license that companies would be well advised to avoid.
That has been said about a lot of the FSF's licenses, and in fact, many companies do avoid them.

By kube-system 3 hours ago
I interpreted this to be a class action to which they were a party, not something they principally launched themselves?

By nativeit 3 hours ago
It is. I'm just commenting on why it isn't that straightforward that the FSF presumably wouldn't care. Copyleft is an exercise of copyright. The FSF doesn't believe in permissive use of works - they believe in using copyright licensing to force others to share the way they believe others should share.

By kube-system 3 hours ago
They said that the LLM can hold the information but can't reproduce it ipsis litteris (verbatim). But if it produces a plagiarized version then it needs to be settled with those who publish it.
That said, one might say it was unintentional and it would be impossible to verify all LLM work with that premise.

By motbus3 3 hours ago
I also found the statement bizarre. They don’t seem to have any argument for compensation of any kind unless the books were under a restrictive license that required derived works to also be open source.

By aaaronic 4 hours ago
No, even under such a restrictive license (which I think the GFDL is?) there's no argument.
Their copyright was not violated by Anthropic downloading the books, because Anthropic had a license to do that.
And their copyright was not violated by Anthropic training on the books, because the court found that no one's copyright was violated by doing this. Anthropic didn't need a license to do this. So the restrictive terms of the license can't prevent it.
I mean they might have an argument for compensation based on "well the settlement Anthropic agreed to didn't exclude us even though they didn't violate our copyright"... but just for the compensation outlined in the settlement.

By gpm 4 hours ago
I think they are implying that they believe models trained on their copyrighted information should be open-source.

By gwbas1c 3 hours ago
Not 'open source' but 'free', a distinction about which RMS has very strong feelings.

By kube-system 3 hours ago
I didn't understand it well; I thought it was because English is not my native language.

By makz 2 hours ago
The framing of 'share your weights freely' as a remedy is interesting but underspecified. The FSF's argument is essentially that training on copyrighted code without permission is infringement, and the remedy should be open weights. But open weights don't undo the infringement -- they just make a potentially infringing artifact publicly available. That's not how copyright remedies work. What they're actually asking for is more like a compulsory license, which Congress would have to create. The demand for open weights as a copyright remedy is a policy argument dressed up as a legal one.

By bobokaytop 17 hours ago
In GPL cases for software, making the offending proprietary code publicly available under the GPL has been the usual outcome.
But whether you can actually be compelled to do that isn't well tested in court. Challenging whether the GPL is enforceable in that way leads you down the path that you had no valid license at all, and for past GPL offenders that would have been the worse outcome. AI companies could change that.

By wongarsu 16 hours ago
> But open weights don't undo the infringement -- they just make a potentially infringing artifact publicly available.
This is true when talking about the infringement of the copyrights of others. But when discussing the infringement of GPL copyleft, making a potentially infringing artifact publicly available likely satisfies the license conditions.
The evil is that this case was settled, and before being settled was decided in a way contrary to all previous copyright decisions. The courts decided that rap records had to clear every single sample, thereby basically destroying the art form, but now you can literally feed every book into a blender, piece another book together out of the pieces, and sell it.
Hip-hop when it peaked with the Bomb Squad was such a frenetic mix of so many recognizable, unrecognizable, and transformed sources that it doesn't resemble anything that was made after the decisions against Biz Markie and De La Soul. Afterwards, you just licensed one song, slightly cut it up, and rapped over it. It was just a new way to sell old shit to young people unfamiliar with it.
Now you can literally just train a machine on the same stuff, and it's legal. A machine transformation was elevated over human creativity, simply because rich people wanted it.

By pessimizer 7 hours ago
> The courts decided that rap records had to clear every single sample, thereby basically destroying the art form, but now you can literally feed every book into a blender, piece another book together out of the pieces, and sell it.
Are they still enforcing the old way on hip hop samples, or has that changed with the recent rulings? If the new way of doing things is applied fairly to everyone that seems like a win.

By zavec 3 hours ago
> The framing of 'share your weights freely' as a remedy is interesting but underspecified. The FSF's argument is essentially that training on copyrighted code without permission is infringement, and the remedy should be open weights.
Ignoring the fact that the statement doesn't talk about FSF code in the training data at all, [0] are you sure about that? From the start of the last of the three paragraphs in the statement:
```
  Obviously, the right thing to do is protect computing freedom: share complete training inputs with every user of the LLM, together with the complete model, training configuration settings, and the accompanying software source code. Therefore, we urge Anthropic and other LLM developers that train models using huge datasets downloaded from the Internet to provide these LLMs to their users in freedom.
```
This seems to me to be consistent with the FSF's stance of "You told the computer how to do it. The right thing to do is to give the humans operating that computer the software, input data, and instructions that they need to do it, too.".
[0] In fact, it talks about the inclusion of a book published under the terms of the GNU FDL, [1] which requires distribution of modified copies of a covered work to -themselves- be covered by the GNU FDL.
[1] <https://www.gnu.org/licenses/fdl-1.3.html>

By simoncion 14 hours ago
> It is a class action lawsuit… the parties agreed to settle instead of waiting for the trial…
It would be nice if members of the class could vote to force a case to trial. For the typical token settlement amount, I’m sure many would rather have the precedent-setting case instead.

By teeray 13 hours ago
If/when you get a postcard/spam email that you're included in a potential class action lawsuit settlement, you can opt out of the class (in which case you preserve your legal rights to sue separately) or file comments with the Court.

By ksherlock 12 hours ago
You can, but then you lose the power of a collective and have to manage a lawsuit yourself. If you are being represented as part of a group, then you should have means to direct that representation.

By teeray 10 hours ago
With some coordination, perhaps enough people could opt out and start a new class action lawsuit.

By ksherlock 3 hours ago
Surely some firms choose to hold referendums already, but I could see that being a good law! As Better Call Saul explored in its early seasons, the interests of the large law firm can easily diverge significantly from the interests of the plaintiffs.

By bbor 7 hours ago
Maybe not with the current administration pressuring the courts on the matter in some deranged manner or another.

By gentleman11 5 hours ago
A related topic that I have thought about in the past is whether LLM-derived code would necessitate release under a copyleft license because of the training data. I never saw a cogent analysis that explained either why or why not this is the case beyond practicality due to models having been utilized in closed source codebases already…

By Topfi 17 hours ago
The short answer is that we don't know. The longer answer based purely on this case is that there's an argument that training is fair use and so copyleft doesn't have any impact on the model, but this is one case in California and doesn't inherently set precedent in the US in general and has no impact at all on legal interpretations in other countries.

By mjg59 17 hours ago
The dearth of case law here still makes a negative outcome for FSF pretty dangerous, even if they don't appeal it and set precedent in higher courts. It might not be binding but every subsequent case will be able to cite it, potentially even in other common law countries that lack case law on the topic.
And then there is the chilling effect. If FSF can't enforce their license, who is going to sue to overturn the precedent? Large companies, publishers, and governments have mostly all done deals with the devil now. Joe Blow random developer is going to get a strip mall lawyer and overturn this? Seems unlikely.

By bragr 16 hours ago
I don't think this argument is a winner. It fails on a few grounds:
First, unless you can point to regurgitation of memorized code, you're not able to make an argument about distribution or replication. This is part of the problem that most publishers are having with prose text and LLMs. Modern LLMs don't memorize harry potter like GPT3 did. The memorization older models showed came from problems in the training data, e.g. harry potter and people writing about harry potter are extraordinarily over-represented. It's similar to how with stable diffusion you could prompt for anything in the region of "Van Gogh's Starry Night" and get it, since it was in the training data 50-100 different ways. You can't reliably do this with Opus or GPT5. If they're not redistributing the code verbatim, they're not in violation of the license. One could argue that the models produce "derivative works", but...
The derivative works argument is inapt. The point of it is to disrupt someone's end-run around the license by saying that building on top of GPL code is not enough to non-GPL it. We imagine this will still work for LLMs because of the GPL's virality--I can't enclose a critical GPL module in non-GPL code and not release the GPL code. But the models aren't DOING THAT. They're not reaching for XYZ GPL'd project to build with. They're vibing out a sparsely connected network of information about literally trillions of lines of software. What comes out is a mishmash of code from here and there, and only coincidentally resembles GPL code, when it does. In order to make this argument work, you need a theory of how LLMs are trained and operate that supports it. Regardless of whether or not one of those theories exists, in court, you'd need to show that your theory was better than the company's expert witness's theory. Good luck.
Second, infringement would need discovery to uncover and would be contingent on user input. This is why the NYT sued for deleted user prompts to ChatGPT--the plaintiffs can't show in public that the content is infringing, so they need to seek discovery to find evidence. That's only going to work in cases where you survive a motion to dismiss--which is EXACTLY where a few of these suits have failed. You need to show first that you can succeed on the merits, then you proceed. That will cut down many of these challenges since they just can't show the actual infringement.
Third, and I think this is the most important, the license protections here are enforced by *copyright*. For copyright it very much matters if something is lifted verbatim vs modified. It is not like patent protection where things like clean room design are shown to have mattered to real courts on real matters. In additional contrast to patents, copyright doesn't care if the outcome is close. That's very much a concern for patents. If I patent a gizmo and you produce a gizmo that operates through nearly identical mechanisms to those I patented, then you can be sued--they don't need to be exact. If I write a novel about a boy wizard with glasses who takes a train to a school in Scotland and you write a novel about a boy wizard with glasses who takes a boat to a school in Inishmurray, I can't sue you for copyright infringement. You need to copy the words I wrote and distribute them to rise to a violation.

By adampunk 10 hours ago
> Modern LLMs don't memorize harry potter like GPT3 did. [...] You can't reliably do this with Opus or GPT5.
If you try any modern LLM, you will find that you can. Easily [0], reliably [1], consistently [2]. All these examples are with models released in 2025/26.
[0] https://arxiv.org/html/2601.02671?amp=&amp=
[1] https://arxiv.org/abs/2506.12286
[2] https://ai.stanford.edu/blog/verbatim-memorization/

By Topfi 6 hours ago
You can't do that without already having the contents of the book, in which case getting an LLM to regurgitate it with partial prompting shouldn't be legally relevant at all. What it regurgitates will have errors, and if you try to chain that as prompt cues without re-basing each cue to the actual text (which you have separately), the LLM's output will rapidly lose coherence with the original work.
If its responses were perfect so that you could chain them, or if you could ask "please give me words 10-15 of chapter 3 paragraph 4 of HPatSS", and it did so, then you'd have a better case to complain. Still, the counterargument is that repeated prompting like that, explicitly asking for copyright violation, is the real crime. Are you going to throw someone in prison if they memorize the entirety of HPatSS and recite arbitrary parts of it on demand?
Combining both issues: that LLMs are only regurgitating mostly accurate continuations, and they're only providing that to the person who explicitly asked... any meaningful copyright violation moves downstream. If you record someone reciting HPatSS from memory, and post it on youtube, you are (or should be considered) the real copyright violator, not them.
If you ask for an identifiable short segment of writing, or a piece of art, and get something close enough that violates copyright, that should really be your problem if you redistribute it (whether manually or because you've coded something to allow 3rd parties to submit LLM prompts and feed answers back to them, and they go on to redistribute it).
Blaming LLMs for "copyright violation" is like persuading a retarded person to do something illegal and then blaming them for it.

By harshreality 5 hours ago
So, did they have to do anything special to those models in order to get them to regurgitate ~100%? Any special prompts they needed to use to get Sonnet to cough that up?
What is the real copyright risk of there being an arcane procedure to sometimes recover most of a text? So far it’s nothing. Which is what I’m saying. Pragmatically this is a loser of an argument in a court room. It is too easy for the chain of reasoning to be disrupted and even undisrupted the argument for model maker liability is attenuated.

By adampunk 6 hours ago
> unless you can point to regurgitation of memorized code
I have, on many occasions, gotten an LLM to do just this. It's not particularly hard. In the most recent case Google's search bar LLM happily regurgitated a DigitalOcean article as if it was its own output. Searching for some strings in the comments located the original page and it was a 95% match between origin and output.
> The memorization older models showed came from problems in the training data,
And what proof do you have that they "fixed" this? And what was the fix?
> harry potter and people writing about harry potter
I'm not sure that's how you get GPT to reproduce upwards of 85% of Harry Potter novels.
> Second, infringement would need discovery to uncover and would be contingent on user input.
That's not at all how copyright infringement works. That would be if you wanted to prove malice and get triple damages. Copyright infringement is an exceptionally simple violation of the law. You either copied, or you did not.
> For copyright it very much matters if something is lifted verbatim vs modified.
Transformation is a valid defense for _some_ uses. It is not for commercial uses. Using LLM generated code for commercial purposes is a hazard.

By themafia 7 hours ago
This must be why all of these copyright plaintiffs are having tremendous days in court! If even half of this were correct, they wouldn’t be losing in summary judgment.
We have yet to see a single judgment come down against a model maker for distributing the gist of content. We have yet to see a single judgment come down against a model maker for infringement at all.
Copyright is just an inapt tool here. It’s not going to do the job. It is not as though big interests have not tried to use this tool. It just doesn’t reflect what’s actually happening and it’s going to lose again and again.
We can imagine a theoretical legal regime where what is done with large language models counts as copyright infringement, we just don’t live in a world where that regime holds.

By adampunk 6 hours ago
Thank you FSF!
The hero we need, but not the hero we deserve..
The issue is that every CS masters student & AI researcher knows how to build a SOTA LLM.. But, only a few companies have the resources.
The process:
(1) steal as much data from the internet as possible (data is everything) (2) raise incomprehensible amounts of money (3) find a location where you can take over the energy grid for training (4) put a black box around it so nobody can see the weights (5) charge users $$$ to use (6) retrain models with user session data (opt in by default) (7) peek around at how users are using, (maybe) change policies to stop them from using that way, and (maybe) rapidly develop features for that use case.
(Sorry that last one is jaded and not fair - just included to give you a picture of what could be happening with this sort of tech) …
The entire premise of the product is “built on the backs of any & everyone who has ever published a work”

By MajorArana 10 hours ago
> The entire premise of the product is “built on the backs of any & everyone who has ever published a work”
Do any products exist which are not built on uncompensated work of other people in the past?
Generally speaking societies do better when knowledge is shared and not hoarded.
Hoarding knowledge via legal constructs is great at concentrating wealth to the hoarder at the expense of everyone else.
We should restore copyright to its original term lengths.
I agree with the stance of Anthropic et al that these models should be built with all possible information.
I agree with the stance of the FSF that the resulting models should be as freely usable/available as possible.

By margalabargala 8 hours ago
> Generally speaking societies do better when knowledge is shared and not hoarded.
These companies do even better because we're not allowed to share the knowledge (read, illegally copy protected works) and they are.

By tpxl 7 hours ago
Exactly.

By margalabargala 7 hours ago
What weak, counter-productive, messaging. This is like having a bully punching you in the face and responding with “hey man, I’m not going to do anything about this, I’m not even going to tell an adult, but I’d urge you to consider not punching me in the face”. Great news for the bully! You just removed one concern from their mind, essentially giving the permission to be as bad to you as they want.

By latexr 16 hours ago
It's the FSF and their licensing is what it is. What other messaging would be consistent with the foundation's mission?

By nazgulsenpai 8 hours ago
They could not mention they usually don’t sue and that they are small and “have to pick [their] battles”, which effectively means “there will be no repercussions from our side, we won’t even consider trying, so continue to do as you please and even worse”.
Saying nothing is an option. It is very possible (and the FSF has done it) to put yourself into a weaker position by saying something.
You don’t have to lie, but you don’t have to unpromptedly volunteer you don’t have a hand to play, either.

By latexr 7 hours ago
Thanks for explaining, that's fair.

By nazgulsenpai 6 hours ago
At uncloseai dot com we took up the AGPLv3 only because it will be maximum freedom: if anyone like Anthropic or OpenAI gets caught training on us, they should open their weights to satisfy the license and terms of use. We have a browser extension & a white paper also covered by the same license. I will also settle out of court to maximize user freedom by opening the model weights & distillation for free.

By unfirehose 2 hours ago
It looks like the stance of FSF is for proliferation of the copyleft to trained LLMs
> "Therefore, we urge Anthropic and other LLM developers that train models using huge datasets downloaded from the Internet to provide these LLMs to their users in freedom"

By kavalg 17 hours ago
No, it looks like the stance of the FSF is that models should be free as a matter of principle, the same as their stance when it comes to software. Nothing in the linked post contradicts the description that the judgement was that the training was fair use.

By mjg59 17 hours ago
Ironically RMS is to be blamed for AI coding.
https://bit1993.bearblog.dev/blame-rms-for-ai-coding/

By bit1993 3 hours ago
That’s a wild take. AI companies are to blame for AI coding.

By nativeit 3 hours ago
RMS did not invent software freedom. That was the natural state of the world before Congress said otherwise; and RMS was not even the only one building freely redistributable software. He was just the most politically vocal about it.

By kmeisthax an hour ago
In the same way that Christopher Columbus is to be blamed for this comment, sure.

By stavros 3 hours ago
The FSF seems toothless when it comes to actually enforcing anything regarding license violations.

By jamesnorden 14 hours ago
How dare they? Defending freedom of these filthy people and dignity of authors against these nice familiar corporations!
The rephrased¹ title "FSF Threatens Anthropic over Infringed Copyright: Share Your LLMs Free" certainly doesn’t dramatise enough how odious an act it can be.
¹ Original title is "The FSF doesn't usually sue for copyright infringement, but when we do, we settle for freedom"

By psychoslave 15 hours ago
Huh, I've been waiting for the FSF to say something about the current big issue: mandatory Operating System age-asking. Maybe now that they've meddled in a copyright lawsuit that has no broader ramifications for the public (the people they supposedly fight for), they can get back to that.

By phendrenad2 11 hours ago
> Among the works we hold copyrights over is Sam Williams and Richard Stallman's Free as in freedom: Richard Stallman's crusade for free software, which was found in datasets used by Anthropic as training inputs for their LLMs.
This is the reason why AI companies won't let anyone inspect which content was in the training set. It turns out the suspicions from many copyright holders (including the FSF) were true (of course).
Anthropic and others will never admit it, hence why they wanted to settle and not risk going to trial. AI boosters obviously will continue to gaslight copyright holders to believe nonsense like: "It only scraped the links, so AI didn't directly train on your content!", or "AI can't see like humans, it only sees numbers, binary or digits" or "AI didn't reproduce exactly 100% of the content just like humans do when tracing from memory!".
They will not share the data-set used to train Claude, even if it was trained on AGPLv3 code.

By rvz 18 hours ago
There are already legal requirements in the EU that you must publish what goes into your training set. This information must apparently be published before August 2 next year.

By impossiblefork 15 hours ago
Guess the solution is to not do it and simply pay fines (or not pay fines, if you don't have any EU operations).

By ronsor 7 hours ago
Yes, unfortunately. I don't really understand this obsession with regulations that involve fines. One would think that people would have the courage to make laws that either ban things or don't.
I think the fines will effectively be mandatory though, even with no obvious EU operations.

By impossiblefork 4 hours ago
They simply have way too much incentive to train on anything they can get their hands on. They are driving businesses that are billions in losses so far. Someone somewhere is probably being told to feed the monster anything they can get, and not to document it, threatened with an NDA and personal financial ruin if the proof of it ever came out. Opaque processes acting as a shield, like they do in so many other businesses.

By zelphirkalt 16 hours ago
>share complete training inputs with every user of the LLM
They don't have the rights to distribute the training data.

By charcircuit 17 hours ago
So if a user can bring an LLM to output a copy of some training data, then the ones who distribute the LLM are engaging in illegal activity?

By zelphirkalt 16 hours ago
It isn't illegal, as an LLM is transformative.

By charcircuit 15 hours ago
So are awk and sed. Good luck convincing any judge/lawyer.

By anthk 7 hours ago
It's not copyright infringement to distribute awk or sed either. In fact they come bundled with most Linux distributions.

By charcircuit an hour ago
Good. I want to see more lawsuits going after these hyperscalers for blatantly disregarding copyright law while simultaneously benefiting from it. In a just world they would all go down and we would be left with just the OSS models. But we don't live in a fair world :(

By slopinthebag 18 hours ago
Where's the threat? The FSF was notified that as part of the settlement in Bartz v. Anthropic they were potentially entitled to money, but in this case the works in question were released under a license that allowed free duplication and distribution so no harm was caused. There's then a note that if the FSF had been involved in such a suit they'd insist on any settlement requiring that the trained model be released under a free license. But they weren't, and they're not.
(Edit: In the event of it being changed to match the actual article title, the current subject line for this thread is " FSF Threatens Anthropic over Infringed Copyright: Share Your LLMs Freel")

By mjg59 19 hours ago
> but in this case the works in question were released under a license that allowed free duplication and distribution so no harm was caused.
FSF licenses contain attribution and copyleft clauses. It's "do whatever you want with it provided that you X, Y and Z". Just taking the first part without the second part is a breach of the license.
It's like renting a car without paying and then claiming "well you said I can drive around with it for the rest of the day, so where is the harm?" while conveniently ignoring the payment clause.
You may be confusing this with a "public domain" license.

By teiferer 18 hours ago
If what you do with a copyrighted work is covered by fair use it doesn't matter what the license says - you can do it anyway. The GFDL imposes restrictions on distribution, not copying, so merely downloading a copy imposes no obligation on you and so isn't a copyright infringement either.
I used to be on the FSF board of directors. I have provided legal testimony regarding copyleft licenses. I am excruciatingly aware of the difference between a copyleft license and the public domain.

By mjg59 17 hours ago
> I am excruciatingly aware of the difference between a copyleft license and the public domain.
Then why did you say "no harm was caused"? Clearly the harm of "using our copylefted work to create proprietary software" was caused. Do you just mean economic harm? If so, I think that's where the parent comment's confusion originates.

By danlitt 16 hours ago
No harm under copyright law

By mjg59 10 hours ago
> The GFDL imposes restrictions on distribution, not copying, so merely downloading a copy imposes no obligation on you and so isn't a copyright infringement either.
The restrictions fall not only on verbatim distribution, but on derivative works too. I am not aware whether model outputs are settled to be or not to be (hehe) derivative works in a court of law, but that question is at the very least very much valid.

By friendzis 15 hours ago
It's the third sentence of the article:
> the district court ruled that using the books to train LLMs was fair use but left for trial the question of whether downloading them for this purpose was legal.

By mcherm 14 hours ago
No, those are separate issues.
The pipeline is something like: download material -> store material -> train models on material -> store models trained on material -> serve output generated from models.
These questions focus on the inputs to the model training; the question I have raised focuses on the outputs of the model. If [certain] outputs are considered derivative works of input material, then we have a cascade of questions about which parts of the pipeline are covered by the license requirements. Even if any of the upstream parts of this simplified pipeline are considered legal, it does not imply that the rest of the pipeline is compliant.

By friendzis 14 hours ago
I'm also skeptical that it's impossible to get an LLM to reproduce some code verbatim. Google had that paper a while back about getting diffusion models to spit out images that were essentially raw training data, and I wouldn't be surprised if the same is possible for LLMs.

By protimewaster 8 hours ago
Models, however, can reproduce copyleft code verbatim, and are being redistributed. Doesn't that count?
Licences like AGPL also don't have redistribution as their only restriction.

By snovv_crash 16 hours ago
Stack Overflow has verbatim copied GPL code in some of its questions and answers. As presented by SO, that code is not under the GPL license (this also applies to other licenses - the BSD advertising clause and the original JSON license will cause similar problems).
Arguably, the use of the code in the Stack Overflow question and answer is fair use.
The problem occurs not when someone reads the Q&A with the improperly licensed code but rather when they then copy that code verbatim into their own non-GPL product and distribute that without adherence to the GPL.
It's the last step - some human distributing the improperly licensed software - that is the violation of the GPL.
This same chain of what is allowed and what is not is equally applicable to LLMs. Providing examples from GPL licensed material to answer a question isn't a license violation. The human copying that code (from any source) and pasting it into their own software is a license violation.
---
Some while back I had a discussion with a Swiss developer about the indefinite article used before "hobbit" in a text game. They used "an hobbit" and in the discussion of fixing it, I quoted the first line of The Hobbit. "In a hole in the ground there lived a hobbit." That cleared it up and my use of it in that (and this) discussion is fair use.
If someone listening to that conversation (or reading this one) thought that the bit that I quoted would be great on a T-shirt and then printed that up and distributed it - that would be a copyright violation.
Google's use of thumbnails for images was found to be fair use. https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com...
> The Ninth Circuit did, however, overturn the district court's decision that Google's thumbnail images were unauthorized and infringing copies of Perfect 10's original images. Google claimed that these images constituted fair use, and the circuit court agreed. This was because they were "highly transformative."
If I were to then take those thumbnails from a Google image search and distribute them as an icon library, I would then be guilty of copyright infringement.
I believe that Stack Overflow, Google Images, and LLM models and their output constitute examples of transformative fair use. What someone does with that output is where copyright infringement happens.
My claim isn't that AI vendors are blameless but rather that in the issue of copyright and license adherence it is the human in the process that is the one who has agency and needs to follow copyright (and for AI agents that were unleashed without oversight, it is the human that spun them up or unleashed them).

By shagie 12 hours ago
That's really interesting. I'm a lawyer, and I had always interpreted the license like a ToS between the developers. That (in my mind) meant that the license could impose arbitrary limitations above the default common law and statutory rules and that once you touched the code you were pregnant with those limitations, but this does make sense. TIL. So, thanks.

By piker 16 hours ago
Licenses != contracts, and well, the FSF's position has always been that the GPL isn't a contract, and contracts are what allow you to impose arbitrary limitations. Most EULAs are actually contracts.

By ronsor 5 hours ago
Does the reasoning in the cases where people to whom GPL software was distributed could sue the distributor for source code, rather than relying on the copyright holder suing for breach of copyright, strengthen the argument that arbitrary limitations are enforceable?

By graemep 14 hours ago
Unrelated question regarding this part, since you seem to be an expert on this:
> If what you do with a copyrighted work is covered by fair use it doesn't matter what the license says - you can do it anyway.
How is it that contracts can prohibit trial by jury but they can't prohibit fair use of copyrighted work? Is there a list of things a contract is and isn't allowed to prohibit, and explanations/reasons for them?

By dataflow 10 hours ago
The general answer is because there is a statute or court opinion that says so for one thing and a different one that says something else for the other thing.
It's also relevant that copyright (and fair use) is federal law, contracts are state law and federal law preempts state law.

By AnthonyMouse 10 hours ago
This means that you can ignore any part of licenses you don't want to and just copy any software you want, non-free software included.

By materialpoint 15 hours ago
No. The GFDL grants you permission to copy the work.

By mjg59 10 hours ago
This is in fact how I operate.

By mikkupikku 15 hours ago
But fair use is dependent on you getting the work legally. Is downloading a book with the intention of violating the GFDL a legal way of acquiring it?

By thayne 11 hours ago
This article is talking about a book though, not software.
"Sam Williams and Richard Stallman's Free as in freedom: Richard Stallman's crusade for free software"
"GNU Free Documentation License (GNU FDL). This is a free license allowing use of the work for any purpose without payment."
I'm not familiar with this license or how it compares to their software licenses, but it sounds closer to a public domain license.

By jcul 18 hours ago
It sounds that way a bit from the one sentence. But that’s not the case at all.
> 4. MODIFICATIONS
> You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
Etc etc.
In short, it is a copyleft license. You must also license derivative works under this license.
Just FYI, the GNU FDL is (unsurprisingly) available for free online - so if you want to know what it says, you can read it!

By kennywinker 18 hours ago
And the judgement said that the training was fair use, but that the duplication might be an infringement. The GFDL doesn't restrict duplication, only distribution, so if training on GFDLed material is fair use and not the creation of a derivative work then there's no damage.

By mjg59 17 hours ago
> The GFDL doesn't restrict duplication
Right. I can publish the work in whole without asking permission. That’s unrestricted duplication.
However, as I read it, an LLM spitting out snippets from the text is not “duplicating” the work. That would fall under modifications. From the license:
> A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
I read that pretty clearly as any work containing text from a gnu fdl document is a modification not a duplication.

By kennywinker 7 hours ago
Last time I checked online LLMs distribute parts of their training corpus when you prompt them.

By leni536 15 hours ago
For this to stand up in court you'd need to show that an LLM is distributing "a modified version of the document".
If I took a book and cut it up into individual words (or partial words even), and then used some of the words with words from every other book to write a new book, it'd be hard to argue that I'm really "distributing the first book", even if the subject of my book is the same as the first one.
This really just highlights how the law is a long way behind what's achievable with modern computing power.

By onion2k 17 hours ago
You’re just describing transformative use. I’m not a lawyer, but an example from music - taking a single drum hit from a James Brown song is apparently not transformative. Taking a vibe from another song is also maybe not transformative, e.g. Robin Thicke and Pharrell’s “Blurred Lines” was found to legally take the “feel” from Marvin Gaye’s “Got to Give It Up”.
Which is all to say that the law is actually really bad at determining what is right and wrong, and our moral compasses should not defer to the law. Unfortunately, moral compasses are often skewed by money - like how normal compasses are skewed by magnets.

By kennywinker 7 hours ago
Presumably, a suitable prompt could get the LLM to produce whole sections of the book which would demonstrate that the LLM contains a modified version.

By ndsipa_pomu 17 hours ago
FDL is famously annoying.
Wikipedia used to be under FDL and they lobbied FSF to allow an escape hatch to Creative Commons for a few months, because FDL was so annoying.

By karel-3d 17 hours ago
Telling mjg59 they are confused about a license is an audacious move. But I understand your question and I have the same question.

By ghighi7878 15 hours ago
They don't need the "do whatever" permission if everything they do is fair use. They only need the downloading permission, and it's free to download.

By Dylan16807 18 hours ago
I don't like the editorialized title either but I would say that the actual post title
"The FSF doesn't usually sue for copyright infringement, but when we do, we settle for freedom"
and this sentence at the end
" We are a small organization with limited resources and we have to pick our battles, but if the FSF were to participate in a lawsuit such as Bartz v. Anthropic and find our copyright and license violated, we would certainly request user freedom as compensation."
could be seen as "threatening".

By darkwater 17 hours ago
It's just an indication to model trainers that they should take care to omit FSF software from training.
Not a nothing burger, but not totally insignificant either.

By lelanthran 19 hours ago
Is it? The FSF's description of the judgement is that the training was fair use, but that the actual downloading of the material may have been a copyright infringement. What software does the FSF hold copyright to that can't be downloaded freely? Under what circumstances would the FSF be in a position to influence the nature of a settlement if they weren't harmed?

By mjg59 18 hours ago
Is harm necessary to show in a copyright infringement case?

By jfoster 18 hours ago
Copyright infringement causes harm, so if there's no harm there's no infringement. You can freely duplicate GFDLed material, so downloading it isn't an infringement. If training a model on that downloaded material is fair use then there's no infringement.

By mjg59 17 hours ago
Is the FSF threatening Anthropic? The way I read it looks like they are not:
> We are a small organization with limited resources and we have to pick our battles, but if the FSF were to participate in a lawsuit such as Bartz v. Anthropic and find our copyright and license violated, we would certainly request user freedom as compensation.
Sounds more like “we can’t and won’t sue, but this is the kind of compensation that we think would be appropriate”

By grodriguez100 17 hours ago
HN really needs some stricter rules for editorialized titles. The HN title has nothing to do with the link (unless the article was edited?)

By raincole 16 hours ago
The rule is fine and clear; it just wasn’t followed here. There’s no reason to have a stricter rule; what you’re complaining about is its enforcement. Two moderators can’t read everything. If you have a complaint, email them (contact link at the bottom of the page); they are quite responsive.

By latexr 16 hours ago
flag the submission?

By touristtam 16 hours ago
The title is:
The FSF doesn't usually sue for copyright infringement, but when we do, we settle for freedom

By politelemon 19 hours ago
Misleading title

By khalic 16 hours ago