
Clifford Chance

Artificial intelligence

Talking Tech

Developers can now let an AI assistant write code for them - but what are the IP implications?

Intellectual Property · Artificial Intelligence · 5 July 2022

A year after its announcement, GitHub Copilot was launched last month to the public at large. The tool promises to "fundamentally change the nature of software development" by providing AI-based coding suggestions, saving developers time and effort. What are the intellectual property implications for those who build or buy software created using Copilot?

The launch of Copilot also coincides with the UK Government's response to its public consultation on artificial intelligence and intellectual property, and the impact of the Government's proposals is considered below.

What Copilot does

In essence, Copilot offers developers highly advanced 'auto-complete' functionality, in exchange for a monthly or yearly subscription fee. By simply starting to type (e.g. a function definition and parameters), Copilot will 'take over' and suggest lines of code in real time. It is even capable of suggesting the code for an entire function based only on the developer's 'comment' describing in plain English what the function should do (i.e. code annotations starting with // in Java or # in Python). The developer then simply checks that the code does what they intended.
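For illustration, a developer might type nothing more than the plain-English comment below, and an AI assistant could suggest the function body that follows. This is a hypothetical example of the kind of suggestion such a tool might produce, not actual Copilot output:

```python
# Return the median of a list of numbers, raising ValueError if the list is empty.
def median(values):
    if not values:
        raise ValueError("median() requires at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:  # odd number of values: take the middle one
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the middle pair
```

The developer's remaining task, as described above, is simply to check that the suggested code does what the comment intended.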

The potential of AI-assisted coding is truly immense. Such tools allow developers to focus on the 'big picture' of the software they are building, minimise scope for human error, ease the process of learning a new programming language or framework, and significantly accelerate the process of building software.

How it works

Copilot is powered by an AI model created by OpenAI, called 'Codex'. The model was trained on a combination of publicly available source code (e.g. as stored on GitHub repositories) and natural language text.

As a developer types using a compatible text editor, the text data is transmitted to the Copilot service in real time. Codex then considers the input text data, accounting for the surrounding context in which it appears (both in the file currently being edited and in related project files) and generates lines or blocks of code that do what the developer probably intended to achieve.

As to accuracy, GitHub's own trial suggests that 26% of Copilot auto-completions are accepted by the developer without amendment. Having been trained on the existing stock of publicly available source code, accuracy varies significantly depending on the volume of similar code within the training dataset. As a result, Copilot's utility may be diminished where a developer is building wholly novel software.

UK Government consultation

The UK Intellectual Property Office (IPO) ran a public consultation on Artificial Intelligence and IP: copyright and patents from October 2021 to January 2022, and the results of that consultation were published on 28 June 2022. The UK Government has stated that it will expand the scope of the current 'data mining' exception to liability for copyright infringement, but leave other aspects of copyright and patent protection for AI-generated material unchanged. The relevant conclusions from the consultation are addressed below where applicable to Copilot.

Who owns the copyright in the end product?

Where a developer uses Copilot when writing source code, a critical issue is whether the developer will own the copyright in that code. If not, the developer will not be able to assign or on-license their work to their client – you cannot give what you do not have. Generally, the author is the first owner of the copyright under UK law; identification of the author is therefore key to determining ownership.

GitHub suggests that "Copilot is a tool, like a compiler or pen" and, as a result, "the code you write with GitHub Copilot's help belongs to you". Unfortunately, the default legal position is not so clear-cut. A compiler merely follows instructions to translate source code written by a human, and a pen moves only as the writer manipulates it; in both cases, the specific human 'author' of the copyright work is easy to identify without controversy. Copilot, on the other hand, itself 'authors' text that would ordinarily have been conceived and recorded by a human.

Unusually, UK copyright law protects computer-generated works that do not have a human creator. The author of such works is deemed to be "the person by whom the arrangements necessary for the creation of the work are undertaken"[1]. In response to the public consultation on Artificial Intelligence and IP, the UK Government has decided not to make any changes to the existing law on the deemed authorship of computer-generated works.

Considering the way Copilot works, GitHub as the provider of Copilot might in some instances be a stronger candidate for the deemed author (and therefore the deemed owner) than the developer using the tool who, one could argue, simply presses the "start" button.

However, even if GitHub rather than the user were the first owner by operation of law, Copilot's terms of use state that:

"The code, functions, and other output returned to you by GitHub Copilot are called "Suggestions." GitHub does not claim any rights in Suggestions, and you retain ownership of and responsibility for [your code], including Suggestions you include in [your code]."

GitHub clearly has no interest in owning Copilot-generated source code that is incorporated into a developer's work. However, whether this term is effective in assigning IP rights to the developer is an open question.

Does copyright subsist at all?

Source code is generally protected by copyright as a 'literary work' provided that the code is 'original'. The English courts have held that originality means that an author has applied their own skill, judgement and individual effort, and has not copied the work. Whether a computer-generated work is required to meet, and does meet, the originality requirements established under EU law has been the subject of much academic discussion, which goes beyond the scope of this note[2].

On the narrow question of whether the new code has been 'copied', where Copilot generates an entirely new work, albeit resulting from training based on existing code, that new work is likely to satisfy the 'not copied' requirement. However, in some cases code may be derived from a very small training dataset, based on very few examples. Such code is less likely to satisfy the originality requirements, and may not attract copyright protection at all.

Did Copilot's training programme infringe existing copyright?

There has been lively debate as to whether Copilot's training data, being source code taken from public repositories, infringes copyright in that source code[3]. Where AI is trained using datasets comprising literary works (i.e. source code), the AI tool will typically need to make a copy of the work, which requires the owner's permission unless the copying falls under a statutory exception to copyright infringement liability.

If the training of the AI models has taken place in the EU, it may fall under the 'text and data mining' (TDM) exception. EU member states must permit "reproductions and extractions of lawfully accessible works … for the purposes of text and data mining"[4]. However, similar exceptions are not available in every jurisdiction.

The UK left the EU before implementing the TDM exception, and a similar pre-existing exemption in the CDPA is limited to research "for a non-commercial purpose"[5], which may not apply to Codex and Copilot. However, the UK Government has now stated that it intends to adopt a new TDM exception, to allow TDM "for any purpose"[6]. Unlike the position in the EU, once the UK Government's proposals are implemented, copyright owners will not have the right to opt out of the TDM exception in the UK, thereby strengthening the position of data miners.

In the case of training data produced by GitHub users, GitHub may have circumvented the problem by contract. The GitHub Terms of Service define 'Service' to include "the applications, software, products, and services provided by GitHub, including any Beta Previews". The Terms go on to grant a licence of content generated by users of GitHub to GitHub to "provide the Service" and for the purpose of "improving the Service over time", which may include the use of TDM to improve any of its products.

Is there a risk of third-party infringement claims or copyleft 'infection'?

A further concern, though difficult to quantify at this early stage, is that Copilot may inadvertently reproduce code that it has been trained on verbatim (like a toddler repeating phrases it has heard adults say).

This is conceded by GitHub which, in its FAQs, claims that "about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set". Given how machine learning works, the risk is more pronounced in relation to niche software for which there is a narrow base of source code available as training data for the model.

The effect is that a developer's code may end up with snippets that have been, in effect, copied and pasted from source code, which opens the possibility of third-party infringement claims.

Further, existing source code that finds its way into a developer's work may have been made available under an open-source licence such as the 'GNU General Public License'. Many open-source licences are restrictive 'copyleft' licences. The concern is that a developer's work might inadvertently be a derivative of copyleft-licensed code, which under the terms of the relevant copyleft licence must also be licensed under the same or a similar copyleft licence.

Further uptake of Copilot by the global community of developers may reveal the degree to which it regurgitates pre-existing code. However, lines of code being reproduced in this way may not reach the "substantial part" threshold required to give rise to liability for copyright infringement in the UK.

Managing the risks

Until the questions of legal ownership and infringement are resolved, developers and the clients that engage their services should consider:

  • subjecting all output to thorough code review (as part of a robust governance programme); and
  • running open-source and third-party code review audits to identify issues before code goes into production.
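The second suggestion can be illustrated with a toy check for long verbatim overlaps between AI-generated code and a known corpus, echoing GitHub's own ~150-character figure. This is a simplistic sketch with assumed function names, not a depiction of any real audit tool:

```python
# A toy check for long verbatim overlaps between generated code and a known
# source file -- a crude sketch of what an open-source audit might look for.
def longest_common_run(a: str, b: str) -> str:
    """Return the longest substring appearing in both a and b."""
    best = ""
    # Dynamic programming over suffix match lengths (fine for short inputs).
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = a[i - curr[j]:i]
        prev = curr
    return best

def flag_verbatim_copy(generated: str, corpus: str, threshold: int = 150) -> bool:
    """Flag output containing a verbatim run of at least `threshold` characters."""
    return len(longest_common_run(generated, corpus)) >= threshold
```

Real audit tools are considerably more sophisticated (tokenising code and matching against large licence-annotated databases), but the underlying idea is the same: identify reproduced material before it reaches production.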

The above sets out the UK legal position, and companies should exercise caution and where possible seek legal advice tailored to their circumstances and the location of their activities.


[1] Sections 9(3) and 178 of the Copyright Designs and Patents Act 1988 (CDPA).

[2] See Hervey & Lavy, The Law of Artificial Intelligence, at 8-137 to 8-146.

[3] The Free Software Foundation has called for White Papers on these questions.

[4] Article 4(1) Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC (Copyright Directive).

[5] Section 29A(1) CDPA.

[6] 28 June 2022 Government response to the consultation on Artificial Intelligence and Intellectual Property.