Progress in Natural Language Processing Technologies: Regulating Quality and Accessibility of Training Data

Keywords: personal data, data regime, generative neural network, artificial intelligence, natural language processing, large language models, data access, copyright

Abstract

Progress in natural language processing (NLP) technologies is a key driver of innovative digital products and carries major socioeconomic importance. However, inadequate legal regulation of the quality and accessibility of training data is a major obstacle to this technological development. The paper focuses on regulatory issues affecting the quality and accessibility of the data needed to train language models. Analyzing the normative barriers and proposing ways to remove them, the author argues for a comprehensive regulatory system designed to ensure the sustainable development of the technology.

Author Biography

Ilya Ilyin, Saint Petersburg State University

Postgraduate Student

Published
2024-07-20
How to Cite
Ilyin, I. (2024). Progress in Natural Language Processing Technologies: Regulating Quality and Accessibility of Training Data. Legal Issues in the Digital Age, 5(2), 36–56. https://doi.org/10.17323/2713-2749.2024.2.36.56
Section
Artificial Intelligence and Law