Progress in Natural Language Processing Technologies: Regulating Quality and Accessibility of Training Data
Abstract
Progress in natural language processing (NLP) technologies is a key driver of innovative digital products and carries major socioeconomic importance. However, inadequate legal regulation of the quality and accessibility of training data is a major obstacle to this technological development. The paper focuses on regulatory issues affecting the quality and accessibility of the data needed to train language models. By analyzing the normative barriers and proposing ways to remove them, the author argues for a comprehensive regulatory system designed to ensure sustainable development of the technology.