Text and data mining beyond borders

AI needs access to huge amounts of data in order to be trained. In this article, I discuss the need for suitable text and data mining exceptions in copyright law that stimulate AI development while still enabling human authors and creators to earn revenue.

Machine learning models, including generative AI, all analyze existing data in order to learn from it. That means that all AI tools require text and data mining (TDM) activities. Art. 2(2) of the EU CDSM Directive defines text and data mining as “any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.” When the data mined includes copyright-protected works (such as literary, musical or artistic works, software or databases), these activities are likely to interfere with the rights of rightholders, in particular the reproduction right. So either authorization is sought from the (millions of) rightholders of the data used, which can be close to impossible for large AI models processing huge quantities of data, or an exception allows TDM under certain circumstances.

European Union

The CDSM Directive introduced two TDM exceptions. Importantly, both require lawful access to the works on which TDM is performed. In other words, users of copyright-protected works generally need to have obtained access lawfully, for example by paying for it, which provides remuneration for rightholders. This requirement is not present in the US and Japanese approaches. The first exception, in Article 3, is narrow; it covers TDM activities carried out 1) by research organizations and cultural heritage institutions and 2) for the purpose of scientific research. Private actors who want to carry out TDM for commercial purposes will not benefit from this exception. Article 4 contains a second, broader exception, which applies to TDM activities for any kind of purpose, even with a commercial motive. However, rightholders can opt out of this exception. While the purpose of such an opt-out may have been to empower rightholders to reach a mutually beneficial agreement with the high-tech industry, several obstacles currently stand in the way.

United States

The US relies on its well-established fair use doctrine[1] to define the boundaries of permissible use of copyrighted works. Any fair use assessment weighs four factors together, namely 1) the purpose and character of the use, including whether it is of a commercial or non-profit nature and whether it is transformative (adding something new), 2) the nature of the copyrighted work and whether its use relates to its creative expression, 3) the amount and substantiality of the portion used, and 4) the effect of the use upon the potential market for the work. The US Court of Appeals for the Second Circuit in Authors Guild v. Google[2] already found that the TDM activities in the Google Books project fell under fair use: by making snippets of books and publications searchable, Google created a new and valuable (transformative) service that increased public exposure to those books and therefore did not compete with existing markets.

While this finding applies to training AI that makes useful information available, we do not yet know whether it also holds true for training generative AI. First, in the case The New York Times recently brought against OpenAI, a document of more than 100 pages identified Times articles that were recognizable in ChatGPT’s output. Where copyrighted works are reproduced in the output, the use can hardly be called transformative. Such reproduction is not likely to occur regularly in generative AI models, however. Second, and more importantly, the difficulty lies in the fourth factor, which considers the effect on the potential market for original, human-created works. Arguably, AI-generated works are (already) in direct competition with works created by human authors, leading to a potential chilling effect on human creativity.

Japan

Japan has implemented an exception[3] that does not consider “reading” a work as copying, as long as the exploitation does not unreasonably prejudice the interests of copyright owners. Where AI training merely uses these works as a source of useful information, no one is enjoying the original expression of a work. Since that expression is the essence of what copyright protects, the mere mining of data does not infringe copyright. For generative AI, however, it is questionable whether that reasoning will hold true. Arguably, generative AI creates something new that imitates human styles of painting, writing or making music. Here, the expression in the generated output is enjoyed by humans. As under the US fair use doctrine, where human-created and artificially generated works compete with each other, the interests of copyright owners could be unreasonably prejudiced.

Conclusion

It follows that coming up with a workable solution for TDM activities is not easy. I argue that any approach needs to consider how the human creator can still be remunerated, either by requiring lawful access or in the form of a levy or statutory license payable when AI developers use human-created works as training data. At the same time, the regulatory effect of such rules must be considered, as it is in no region’s or country’s interest to be left out of AI training activities. We all have an interest in dominant AI models being shaped by data about norms and values, cultural heritage, political systems, etc. from all regions.
 

[1] 17 U.S.C. § 106(1) and § 107.

[2] Authors Guild v. Google, Inc., No. 13-4829-cv (2d Cir. Oct. 16, 2015).

[3] Art. 30-4(ii) of the Japanese Copyright Act.

A. Moerland

Anke Moerland is Associate Professor of Intellectual Property Law in the European and International Law Department, Maastricht University. She holds a PhD on Intellectual property protection in EU bilateral trade agreements from Maastricht University.