Attempting to prohibit(?) code contributions which might have used an LLM such as Copilot?

2023-11-12

Someone asked me this question in the fediverse:

I wrote this: https://github.com/gophercloud/gopherkube/blob/main/CONTRIBUTING.md#licensing

Does it make sense from a legal perspective?

As usual, nothing in this blogpost is legal advice. Treat this as the ramblings of a lawyer who wrote this having pondered it while having lunch on a Sunday, while also trying to work out (i.e. “has failed so far”) how to get a Brother label printer working successfully under Linux.

The relevant text

The project’s CONTRIBUTING text says this:

The code is licensed under the Apache License, Version 2.0. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

And each source file (that I checked) contains the usual, suggested-by-Apache, licence header:

Copyright 2023.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Critically, the website text goes on to say:

By sending a Pull request, you certify that everything you’re sending is your own production and you agree to submit it to the same license. In particular, by contributing to this project you certify that no part of your contribution was generated by a Language model (this includes, among others: ChatGPT and Microsoft Copilot) unless you can prove that the Language model you used was exclusively trained on material that was licensed to be used for that purpose.

This is a developer certificate of origin - a “DCO”.

What is a “DCO” / developer certificate of origin?

A DCO is a means by which a project, and users of the project, can (attempt to) get some more assurances about the contributions which people are making.

A DCO is typically:

something you agree to / attest on each and every commit
a promise made by you, not your employer
a promise that you authored the commit in question, and that you have the right to submit it

A DCO is not:

a licence
a copyright assignment, in which you give someone else your rights

Here’s the DCO for the Linux Foundation.
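
In projects which use the Linux-style DCO, the attestation is usually recorded as a “Signed-off-by:” line at the end of each commit message, which git appends when you run git commit -s. As a rough sketch only, here is how a project might check for that line in Python. The function names are made up for illustration, and the assumption that a project would check for this line at all is an assumption, not something taken from the project discussed in this post.

```python
import re

# Assumption: attestation is recorded Linux-style, as a trailer of the form
# "Signed-off-by: Name <email>" (the line that `git commit -s` appends).
SIGNOFF = re.compile(
    r"^Signed-off-by: (?P<name>.+) <(?P<email>[^<>\s]+@[^<>\s]+)>$"
)


def signoffs(commit_message: str) -> list[tuple[str, str]]:
    """Return a (name, email) pair for each Signed-off-by line found."""
    found = []
    for line in commit_message.splitlines():
        match = SIGNOFF.match(line.strip())
        if match:
            found.append((match.group("name"), match.group("email")))
    return found


def has_signoff(commit_message: str) -> bool:
    """True if the commit message carries at least one sign-off line."""
    return bool(signoffs(commit_message))
```

Note what this does and does not show: it can tell you that someone attached their name to a commit, but not what they were promising, nor whether the promise was true. That depends entirely on the text of the DCO itself.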

Picking through this DCO

you certify that everything you’re sending is your own production

I don’t know what “your own production” means.

I am reading this through a lawyer’s eyes; perhaps it is clearer to others - or, at least, perhaps the lack of clarity is acceptable or even desirable.

you agree to submit it to the same license.

I am assuming - but I don’t know for sure - that when someone attempts to commit something to the project, they need to agree to this.

I am also assuming that the wording about the Apache License, Version 2.0, is included in what they agree to each time. Otherwise, “the same license” is vague.

I reckon it would be clearer to say “You are licensing [your production] under the Apache License, Version 2.0.”

In particular

I think that this qualifies “your own production”, but I am not sure.

by contributing to this project

Why not link it specifically to this commit?

you certify that no part of your contribution was generated by a Language model (this includes, among others: ChatGPT and Microsoft Copilot)

So you certify not just that you have not used an LLM, but that nothing in your contribution was generated by one.

I doubt many people would be willing to make that promise for anything other than their own code.

I suspect that this would stop people from including third party Free / open source software, even if it fell within the licensing requirements.

The breadth of this might be by design - it is a political statement, after all.

unless you can prove that the Language model you used was exclusively trained on material that was licensed to be used for that purpose.

Hmmm…

Again, I doubt that anyone could say this unless they had compiled the training data set, and trained the language model, themselves.

What does “can prove” mean? I’m not required to demonstrate that proof on demand (“I can prove it, but I’m not going to”), and I’m not required to adhere to any particular standard of proof (a lesser point, IMHO).

exclusively trained on material that was licensed to be used for that purpose

The requirement of “licensed to be used for that purpose” excludes anything in the public domain. Which is odd - if something is no longer subject to the restrictions of copyright, why would it be excluded here? Surely I, like anyone else, am, and should be, free to use it for training an LLM, if I so wanted?

Does it cover code under a permissive licence (for example, the MIT licence)? The MIT licence requires me to include a copy of the software-specific copyright notice, and the “permission statement”:

in all copies or substantial portions of the Software

If I include MIT-licensed software in my training data, and I keep the licensing information intact, I’ve met the MIT licence’s requirements in respect of those data.

Is the resulting LLM a “copy” of the software? Does it contain a “substantial portion” of the software? I don’t know.

But if it does not, then there is no requirement to include either of those things.

So is that code “licensed to be used for that purpose” or not?

licensed to be used for that purpose

Does this require a specific statement within the licence, that the licence grants permission to use it within an LLM / training data set? i.e. that it was licensed for that purpose?

Or is it sufficient that the scope of the licence permits this, even if not explicit (which, IMHO, would cover the MIT licence)? i.e. that “that purpose” falls within the scope of the grant of licence.

And the much bigger question, to which I offer no answer, is whether training an LLM on someone’s code is an act restricted by copyright or, for some other reason, does not require a licence. If it is not an act restricted by copyright, or is otherwise permitted by copyright law, then no licence is needed. Even so, this DCO would prohibit use of that code, because it specifically requires that the code was “licensed to be used for that purpose”, even if no licence is, in law, required.

The “so what?” of a DCO

Let’s say that someone contributes code to the project which was generated by an LLM, in a way which does not comply with the requirements of the DCO.

So what?

Realistically, is the project going to attempt to identify, and sue, that person? Wherever they are in the world? Do they have the resources?

Do they want to be The Free Software Project That Sued A Contributor?

Does having this in a DCO add anything?

I’m still pondering this.

I can see that it has value from a political point of view.

From a legal point of view…? My current conclusion is “meh”. Not just for the reasons above, in terms of establishing a (potential) claim as against a contributor, but more perhaps in terms of whether it shields a project from liability from someone whose code was used to train the LLM which was used to generate the commit which was made to the project.

There is a big question here, obviously, in terms of whether the output of the LLM infringes the copyright in the training data. I’m offering no answer to that.

But if it does in this particular case, does this DCO statement shield the project from a claim of infringement?

I don’t think it does, from an English law point of view, at least - copyright infringement does not require intention. Being able to point to someone who contributed the code doesn’t (to my mind, based on a few moments’ thought over lunch) amount to a legal defence. So you’re back to trying to claim from the contributor for some or all of the losses that the project suffers.

Conclusion

Politically clear(ish) and interesting.

Legally a bit of a mess … if anyone cares.

Is this a project which has so many people falling over themselves to contribute that, if a few are turned away by this, it doesn’t matter?

Or is this a project where those likely to contribute are likely bought in to the political vision, and so would be happy with it?

Or where someone is so keen to contribute that they would be willing to put themselves at legal risk (likely or not) to do so?

Neil's blog