The Data Bill returns to the Commons tomorrow. Already a complex piece of legislation, the bill covers everything from retention of biometric data to the registration of births and deaths. Unfortunately, some peers have attempted to hijack the bill to settle a wider debate over how our copyright laws should adapt to the development of exciting new AI services.
The Government announced a consultation on this topic at the end of last year and has received thousands of responses. Ministers and officials are working through those and attempting to find a resolution that works for everyone. That should be possible.
AI models learn by looking at publicly available data on the Internet, just like humans do. I have no right to track you down and demand payment if reading this article sharpens your language skills and allows you to get that exciting new job you were hoping for (you’re welcome). Copyright was never intended to frustrate the kind of activity that AI training represents.
The Government proposes to make clear that AI training does not represent a breach of copyright. This is known as a text and data mining (TDM) exception, and similar protection (albeit varying in form) is already in place among our peers: Japan, the EU, Singapore and the United States.
Debates over where the line is drawn between inspiration and copying are both legitimate and important. Courts have been wrestling with that question in various forms for generations, long before the rise of AI. As the film-maker James Cameron argued recently, though, that question comes down to looking at the outputs (what comes out of the model), not the data that goes in.
Instead, the focus should be upon the measures that the Government is considering alongside a text and data mining exception.
The first of these is a machine-readable opt-out, so that rights holders who do not want their content in AI training datasets can instruct that it is not used. There is ongoing work to make this control easier for developers to implement, more granular and functional for a greater range of content (e.g. images and video), building on well-established protocols.
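One long-established protocol of this kind is robots.txt. As a purely illustrative sketch (the crawler name ‘ExampleAIBot’ and the URLs are hypothetical placeholders, not part of any proposal), here is how a developer might honour that kind of opt-out before collecting a page for a training dataset, using Python’s standard urllib.robotparser:

```python
# Illustrative sketch: honouring a robots.txt-style opt-out before
# collecting a page for an AI training dataset.
# 'ExampleAIBot' and the URLs below are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

CRAWLER_USER_AGENT = "ExampleAIBot"  # hypothetical AI training crawler


def may_collect(page_url: str, robots_url: str) -> bool:
    """Return True only if the site's robots.txt permits this crawler."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse the site's robots.txt
    return parser.can_fetch(CRAWLER_USER_AGENT, page_url)


if __name__ == "__main__":
    page = "https://example.com/article"
    if may_collect(page, "https://example.com/robots.txt"):
        print(f"Permitted to collect {page} for training data")
    else:
        print(f"Opted out: skipping {page}")
```

A more granular version of the same idea would extend this check to individual images, videos and other media, which is the direction the ongoing protocol work points in.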
An opt-out has real costs. Some content will be opted out for no particularly good reason, content that could otherwise be improving the performance of AI models. There will be costs for AI developers (particularly onerous for startups) that are not imposed in other major economies. UK AI development will pay a price, but this is the kind of area where the Government might be able to find a compromise.
By contrast, the suggestion that developers should need to secure an ‘opt-in’ for each source is obviously incompatible with the realities of how large language models are developed – you would be trying to opt in the Internet.
The second is around transparency. The proposal from the Lords here seems almost intended to be impossible for AI developers to comply with. Developers would be required to disclose comprehensive information regarding all text and data used in pre-training, training and fine-tuning of AI models on a monthly basis, whether it is subject to copyright or not.
This is a hopeless endeavour. Copyright does not have to be registered anywhere, so there is no database AI developers can consult to find out who holds what rights over what content. You would be asking companies to summarise all of the terabytes of data that go into sophisticated large language models.
It would also kill competition. If the task is hard to imagine for even the largest developers, it is unthinkable for the startups, scale-ups and other smaller developers who are currently an important part of a dynamic AI sector. Even if they could comply, they would be giving away the secret sauce of how their models are built and trained, making it impossible to succeed by taking a new and different approach.
Finally, it would be a real risk to the security of AI models. If you tell people what data is being used and how, you create all kinds of opportunities to exploit those models, poisoning the inputs in ways that mean they don’t work as intended. This would be an unnecessary security risk and would undermine the role AI can play in improving important public services.
The final element raised in the Lords is what this means for models trained in other countries. Peers have demanded that any services with ‘links’ to the UK should have to comply with UK copyright law, regardless of where the actual AI development takes place. This is the opposite of how copyright law normally works and would imply that AI developers need to know not just the copyright law where they are working, but also the law in the dozens of countries where their work might be used.
This would force companies to choose between leaving the UK market or compromising the development of their models globally. The most sophisticated services will be delayed or never introduced here, with severe consequences for the ability of all kinds of UK businesses to compete in global markets.
All these extreme measures are motivated by concerns raised by some media organisations and other rights holders.
In some cases, they are hoping there is a pot of gold if they can force AI developers to come to them for content that can be used to train models. This fundamentally misunderstands how those models work. They are trained on enormous volumes of content from across the Internet, and no individual piece of content adds material value. No one wins from an exhaustive commercial negotiation over the fraction of a penny of value that each individual source contributes.
The real opportunity is for companies with distinctive datasets to engage with a thriving AI development sector, bring those datasets to the table and license them. Deals like that are being done all the time already, and we could see more in the UK if we have the TDM exception that will enable more investment here.
In other cases, they are worried about being displaced by AI. As has been seen with earlier waves of creative technology (e.g. digital editing software), any tool that improves productivity has the potential to reduce the need for some kinds of tasks, but it will create other opportunities. The UK’s creative industries are much more likely to thrive by using new tools to compete for growing global markets, as they have done through those other waves of technological change, than by cutting the UK off from AI developments that will be put to work by their competitors abroad. Even if we wanted to put AI technology back in a box, with all the missed opportunities that would represent, that is not in the gift of UK copyright law.
On the flip side, this issue is vital to a thriving UK AI sector. A survey we released last week of 500 developers, investors and others working in the UK AI ecosystem showed how much this matters. First, 94% reported that their own work was very or somewhat reliant on models built using text and data mining, with a full 54% reporting ‘very reliant’. A further 76% thought that if the United Kingdom chose not to introduce a text and data mining protection equivalent to those in the EU, US and Japan, it would be an important signal, and the sector would likely reconsider whether the UK is a competitive environment for AI investments.
It also matters for the Government’s wider plans for AI to contribute to UK economic growth and improvements in public services. For 64% of those we surveyed, the Government’s wider commitment to AI would seem a lot less credible if the UK did not introduce such a protection. Its AI Opportunities Action Plan, published in January, is a jigsaw, and initiatives like AI Growth Zones will not deliver their full value if the copyright reform piece is missing.
There is patient work to be done to finalise the Government’s plans for copyright and AI. I know industry will be keen to engage with that work every step of the way. For that to happen, Parliament first needs to reject the shortcut proposed in the Lords and give Ministers time to bring forward constructive proposals.