Automation of Forensic Authorship Attribution: Problems and Prospects
Abstract
The article validates an integrative attribution algorithm that analyzes an author’s idiostyle with the methods of interpretative linguistics and objectifies the resulting data with mathematical statistics. The algorithm addresses the identification problem of attribution. The parameters chosen to describe an author’s individual style rest on the assumption that a text is the product of an authentic language personality, as described by psycholinguistic (Yu.N. Karaulov), sociolinguistic and forensic linguistic (S.M. Vul, M. Coulthard, R. Shuy) methods. To test the hypothesis that the identification problem of attribution is best resolved by an integrative methodology, we created the KhoRom application, which brings together these approaches to the analysis of language personality: http://khorom-attribution.ru/#/. The application compares two language personality models and determines how similar they are using three metrics: the Pearson correlation coefficient, the coefficient of determination of a linear regression, and Student’s t-test. Importantly, it also describes the interpreted model of language personality, informing the user of the significance of each parameter’s values. The user can choose parameters, view how each parameter is realized in the document, and edit the final list of parameter realizations, so that errors in automatic processing can be corrected manually. The application is only one part of the attribution algorithm: the statistical output must still be assessed by an expert following the methodological recommendations developed for the algorithm. The methodology was validated on texts of various lengths and genres, including documents in fiction, journalistic, official and colloquial styles.
For texts of all discourses except colloquial, the developed algorithm has demonstrated a high level of accuracy (F-score of 0.8 to 1). For better applicability of the algorithm to colloquial texts, the authors have developed a number of improvements pending implementation.
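The three similarity metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the KhoRom implementation: the function names, the representation of a language personality model as a list of numeric parameter values, and the choice of the pooled-variance two-sample form of Student's t-test are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two parameter vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y on x.

    For a regression with one predictor and an intercept, R^2 equals the
    square of the Pearson correlation coefficient.
    """
    return pearson_r(x, y) ** 2


def student_t(x, y):
    """Two-sample Student's t statistic with pooled variance (assumed form)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))


def compare_models(x, y):
    """Return all three metrics for two equally long parameter vectors."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two vectors of equal length, at least 3 values")
    return {
        "pearson_r": pearson_r(x, y),
        "r_squared": r_squared(x, y),
        "t_statistic": student_t(x, y),
    }


if __name__ == "__main__":
    # Hypothetical parameter values for a known and a questioned text.
    known = [1.0, 2.0, 3.0, 4.0, 5.0]
    questioned = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(compare_models(known, questioned))
```

In this sketch a correlation close to 1 indicates that the two models vary together across parameters, while the t statistic reflects how far their mean parameter values differ. As the abstract stresses, such figures are inputs to expert judgment, not a verdict in themselves.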
References
Аpresyan Yu.D. (1966) Ideas and methods of modern structural linguistics. Мoscow: Nauka, 302 p. (in Russ.)
Bacciu A., Morgia M. et al. (2019) Cross-domain authorship attribution combining instance-based and profile-based features. Notebook for PAN at CLEF 2019. Available at: http://ceur-ws.org/Vol-2380/paper_220.pdf (accessed: 05.07.2020)
Baranov A.N. (2001) Introduction to Applied Linguistics. Manual. Moscow: Editorial URSS, 360 p. (in Russ.)
Batura Т.V. (2012) Formal Ways of text authorship identification. Vestnik Novosibirskogo gosudarstvennogo universiteta. Informatcionnye tehnologii=Journal of Novosibirsk State University. Information Technology, vol. 2, no. 4, pp. 81–94 (in Russ.)
Belousov К.I. (2010) Linguistic Models and Language Reality Modeling Issues. Vestnik Orenburgskogo gosudarstvennogo universiteta=Journal of Orenburg State University, no. 11, pp. 94–97 (in Russ.)
Bessmertny I.А., Nugumanova А.B. (2012) Automatic Thesaurus Building Method Based on Statistical Processing of Texts in the Natural Language. Izvestia Tomskogo gosudarstvennogo politekhnicheskogo universiteta=Proceedings of Tomsk State Polytechnical University, no. 5, pp. 125–130 (in Russ.)
Bloch B. (1948) A set of postulates for phonemic analysis. Language, vol. 24, no. 1, pp. 3–46. DOI: https://doi.org/10.2307/410284
Bloomfield L. (1926) A set of postulates for the science of language. Language, vol. 2, no. 2, pp. 153–164. DOI: https://doi.org/10.2307/408741
Campbell L. (1867) The Sophistes and Politicus of Plato. Oxford: Clarendon Press, 170 p.
Coulthard M. (2004) Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, vol. 25, no. 4, pp. 431–447. DOI: https://doi.org/10.1093/applin/25.4.431
Custódio J., Paraboni I. (2018) EACH-USP Ensemble Cross-Domain Authorship Attribution. Notebook for PAN at CLEF 2018. Available at: http://ceur-ws.org/Vol-2125/paper_76.pdf (accessed: 05.07.2020)
Encyclopedia of Forensic Science (1999) T.V. Averyanova (ed.). Moscow: Prospekt, 442 p. (in Russ.)
Galyashina Е.I., Yermolova Е.I. (2005) Linguo-forensic tools for authorship attribution of written and oral texts. Papers of the International Research Conference. Moscow, pp. 20–22 (in Russ.)
Gomzin A. et al. (2018) Detection of author’s educational level and age based on comments analysis. Paper presented at Dialogue, Moscow, 30 May–2 June 2018. Available at: http://www.dialog-21.ru/media/4279/gomzin_turdakov.pdf (accessed: 05.07.2020) (in Russ.) DOI: https://doi.org/10.22320/07183607.2018.21.37.00
Goroshko Е.I. (2003) Forensic authorship attribution: gender identification of the author of a document. Theory and practice of forensic investigation and science. Pravo, no. 3, pp. 221–226 (in Russ.)
Hjelmslev L. (2005) Prolegomena to a theory of language. Мoscow: Editorial URSS, 243 p. (in Russ.)
Juola P. (2006) Authorship Attribution. Foundations and Trends in Information Retrieval, vol. 1, no. 3, pp. 233–334. DOI: https://doi.org/10.1561/1500000005
Ionova S.V., Ogorelkov I.V. (2020) Gender-based Individual Speech Diagnostics in Authorship Attribution: Quantitative Approach. Vestnik Volgogradskogo gosudarstvennogo universiteta. Lingvistika=Journal of Volgograd State University. Linguistics, vol. 19, no. 1, pp. 115–127 (in Russ.) DOI: https://doi.org/10.15688/jvolsu2.2020.1.10
Karaulov Yu. N. (1987) Russian Language and Language Personality. Мoscow: Nauka, 264 p. (in Russ.)
Khmelyov D.V. (2002) Linguo-analyzer. E-resource. Available at: http://www.rusf.ru/books/analysis/ (accessed: 16.11.2017) (in Russ.)
Khomenko A., Baranova Yu., Romanov A., Zadvornov K. (2021) Linguistic modeling as a basis for creating authorship attribution software. Computational linguistics and intellectual technologies. Proceedings of the International Conference “Dialogue 2021”. Moscow. Available at: http://www.dialog-21.ru/media/5315/khomenkoaplusetal048.pdf (accessed: 23.06.2021) (in Russ.)
Komissarov А.Yu. (2000) Forensic Investigation of Written Speech: Manual. Мoscow: Forensic Agency of the Interior Ministry of the Russian Federation, 126 p. (in Russ.)
Koppel M., Schler J. (2003) Exploiting Stylistic Idiosyncrasies for Authorship Attribution. Proceedings of IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, vol. 69, pp. 72–80.
Korobov M. (2015) Morphological analyzer and generator for Russian and Ukrainian languages. In: Khachay M.Y., Konstantinova N.A. (eds.). AIST 2015. CCIS, vol. 542, pp. 320–332 (in Russ.) DOI: https://doi.org/10.1007/978-3-319-26123-2_31
Leonard R., Ford J., Christensen T. (2017) Forensic linguistics: applying the science of linguistics to the issues of the law. Hofstra Law Review, vol. 45, pp. 881–897.
Linguistics of Constructions (2010) Е.V. Rakhilina (ed.). Мoscow: Azbukovnik Publishing, 584 p. (in Russ.)
Litvinova Т.А. (2019) Idiolect as Object of Corpus Idiolectology: Towards a New Field in Linguistics. Vestnik Novgorodskogo gosudarstvennogo universiteta imeni Yaroslava Mudrogo=Bulletin of the Yaroslav Mudriy Novgorod State University, no. 7, pp. 1–5 (in Russ.)
Litvinova T., Rangel F. et al. (2017) Overview of the RusProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. Working notes of FIRE 2017. Forum for Information Retrieval Evaluation. Bangalore, pp. 1–7. Available at: http://ceur-ws.org/Vol-2036/T1-1.pdf (accessed: 05.07.2019) (in Russ.)
Litvinova T.A., Gromova A.V. (2020) The Use of Computer Technologies for Forensic Authorship Attribution: Issues and Prospects. Vestnik Volgogradskogo gosudarstvennogo universiteta. Lingvistika=Journal of Volgograd State University. Linguistics, vol. 19, no. 1, pp. 77–88 (in Russ.) DOI: https://doi.org/10.15688/jvolsu2.2020.1.7
Litvinova T., Sboev A., Panicheva P. (2018) Profiling the age of Russian bloggers. Proceedings of the 7th International Conference, AINL 2018. Saint Petersburg, pp. 167–177 (in Russ.) DOI: https://doi.org/10.1007/978-3-030-01204-5_16
Losev A.F. (2004) Introduction to the General Theory of Linguistic Models. Moscow: Editorial URSS, 293 p. (in Russ.)
Marusenko M.A. (1990) The use of image recognition methods for attribution of anonymous and pseudonymous literary texts. Leningrad: University, 164 p. (in Russ.)
Marusenko M.A. (2003) Attribution of anonymous and pseudonymous texts as a standard image recognition problem. Istoriographiya y istochnikovedeniye otechestvennoy istorii=Historiography and Research of Sources of National History, no. 3, pp. 18–22 (in Russ.)
McMenamin G. (2002) Forensic linguistics: advances in forensic stylistics. London: Routledge, 361 p. DOI: https://doi.org/10.1201/9781420041170.ch9
Murauer B., Tschuggnall M., Specht G. (2018) Dynamic Parameter Search for Cross-Domain Authorship Attribution. Notebook for PAN at CLEF 2018. Available at: http://ceur-ws.org/Vol-2125/paper_84.pdf (accessed: 05.07.2020)
Muttenthaler L., Lucas G., Amann J. (2019) Authorship Attribution in Fan-Fictional Texts given variable length Character and Word N-Grams. Notebook for PAN at CLEF 2019. Available at: http://ceur-ws.org/Vol-2380/paper_49.pdf (accessed: 05.07.2020)
Paducheva E.V. (1974) On semantics of syntax. Moscow: Nauka, 291 p. (in Russ.)
Radbil T.B., Markina M.V. (2019) Probability Statistical Models in Attribution of Texts by Russian Language Authors. Politicheskaya Lingvistika=Political Linguistics, no. 2, pp. 156–166 (in Russ.) DOI: https://doi.org/10.26170/pl19-02-18
Revzin I.I. (1977) Modern Structural Linguistics: Issues and Methods. Moscow: Nauka, 263 p. (in Russ.)
Rodionova E.S. (2008a) Linguistic Methods of Attribution and Dating of Literary Texts: towards Corneille-Moliere Problem. Candidate of Philological Sciences Summary. Saint Petersburg, 25 p. (in Russ.)
Rodionova E.S. (2008b) Methods of literary text attribution. In: Structural and applied linguistics: inter-university collection. A.S. Gerd (ed.). Saint Petersburg: University, pp. 118–127 (in Russ.)
Rogov A.A. et al. (2019) Software support for solving text attribution problems. Programmnaya Inzheneriya=Programming Engineering, no. 5, pp. 234–240 (in Russ.) DOI: https://doi.org/10.17587/prin.10.234-240
Romanov A.S. (2010) Methodology and Software Package for Identification of Authors of Unknown Texts. Candidate of Engineering Sciences Summary. Tomsk, 26 p. (in Russ.)
Romanov A.S., Kurtukova A., Fedotova A. et al. (2021) Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks. Future Internet, vol. 13, issue 1, article 3, pp. 1–16. DOI: https://doi.org/10.3390/fi13010003
Rubtsova I.I., Yermolayeva Е.I., Bezrukova А.I. et al. (2007) Comprehensive methodology of authorship attribution: methodological recommendations. Мoscow: Forensic Agency of the Ministry of Interior, 192 p. (in Russ.)
Russian Grammar Rules: Collected works (2005) N.Yu. Shvedova (ed.). Moscow: Nauka, 665 p. Available at: http://rusgram.narod.ru/index.html (in Russ.)
Shevelyov О. G. (2007) Methods of automatic classification of texts in the natural language: manual. Tomsk: ТМL-Press, 144 p. (in Russ.)
Shtoff V. (1966) Modeling and philosophy. Мoscow: Nauka, 304 p. (in Russ.)
Shuy R. (2005) Creating Language Crimes: How Law Enforcement Uses (and Misuses) Language. N. Y.: Oxford University Press, 194 p. DOI: https://doi.org/10.1093/acprof:oso/9780195181661.001.0001
Sidorov Yu.V. et al. (1999) Computer-assisted system for linguistic analysis of literary texts. In: Saint Petersburg Assembly of Young Researchers and Specialists. Abstracts of reports. Saint Petersburg: University Press, p. 66 (in Russ.)
Stamatatos E. (2017) Authorship attribution using text distortion. Proceedings of 15th Conference of the European Chapter of the Association for Computational Linguistics, Long Papers, pp. 1138–1149. DOI: https://doi.org/10.18653/v1/E17-1107
Stepanenko A.A. (2017) Gender attribution of computer network communication texts. Vestnik Tomskogo gosudarstvennogo universiteta=Journal of Tomsk State University, no. 5, pp. 17–25 (in Russ.) DOI: https://doi.org/10.17223/15617793/415/3
Timashev A.N. (2007) Atributor: version 1.01: software description. Available at: http://www.textology.ru/atr_resum.html (accessed: 01.02.2016) (in Russ.)
Vinogradov V.V. (1961) The Authorship Problem and Theory of Styles. Мoscow: Goslitizdat, 614 p. (in Russ.)
Vul S.М. (2007) Forensic Authorship Identification: Methodological Basis. Guidebook. Kharkov: KhNIISE Press, 64 p. (in Russ.)
Wright D. (2017) Implementing word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem. International Journal of Corpus Linguistics, vol. 22, no. 2, pp. 212–241. DOI: https://doi.org/10.1075/ijcl.22.2.03wri
Zakharov V.N. et al. (2000) System Support Programme for Attribution of Articles Authored by F.M. Dostoevsky. Trudy Petrozavodskogo gosudarstvennogo universiteta=Research Works of the Petrozavodsk State University. Applied Mathematics and Information Technology Series, issue 9, pp. 113–122 (in Russ.)
Zakharov V.N., Khokhlova M.V. (2008) The Statistical Method for Identification of Collocations. Language Engineering in Search of Meanings. Collection of reports to the conference workshop “Web-Based Linguistic Information Technologies”. 11th All-Russia Joint Conference “Internet and Modern Society”. Saint Petersburg: University, pp. 40–54 (in Russ.)