Automation of Forensic Authorship Attribution: Problems and Prospects

Keywords: attribution, language personality, automated text processing, linguistic model, mathematical model, attributive software, forensic authorship attribution

Abstract

The article validates an integrative attribution algorithm that analyzes an author’s idiostyle with the methods of interpretative linguistics and objectifies the resulting data by means of mathematical statistics. The algorithm addresses the identification task of attribution. The parameters chosen to describe an author’s individual style rest on the assumption that a text is the product of an authentic language personality, which can be described by psycholinguistic (Yu.N. Karaulov), sociolinguistic and forensic linguistic (S.M. Vul, M. Coulthard, R. Shuy) methods. To test the hypothesis that the identification task is best solved by an integrative methodology, we created the KhoRom application, which brings together these approaches to the analysis of language personality: http://khorom-attribution.ru/#/. The application compares two language personality models and measures how similar they are using three metrics: the Pearson correlation coefficient, the coefficient of determination of a linear regression, and Student’s t-test. Importantly, the application also describes the interpreted model of language personality, informing the user of the significance of each parameter’s values. The user can choose parameters, view how each parameter is realized in the document, and edit the final list of parameter realizations, so that errors made by the application can be corrected manually. The application is only one part of the attribution algorithm: the statistical output must still be assessed by an expert following the methodological recommendations developed for the algorithm. The effectiveness of the methodology has been confirmed by validation on texts of various lengths and genres, covering fiction, journalism, official and colloquial styles. For texts of all discourses except colloquial, the algorithm demonstrated a high level of accuracy (F-score of 0.8 to 1). To improve its applicability to colloquial texts, the authors have developed a number of improvements that are pending implementation.
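The three similarity metrics named in the abstract can be illustrated with a minimal sketch. It is not the KhoRom implementation. It assumes that each language personality model is reduced to a vector of numeric parameter values, and that the t-test is the independent two-sample form with pooled variance; the abstract specifies neither. All function names and sample values are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Undefined (division by zero) if either vector is constant.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def determination(x, y):
    """Coefficient of determination of a simple linear regression of y on x.

    For one predictor with an intercept, R^2 equals the squared Pearson r.
    """
    return pearson_r(x, y) ** 2


def student_t(x, y):
    """Two-sample Student's t statistic with pooled variance (assumed form)."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    var_x = sum((a - mx) ** 2 for a in x) / (nx - 1)
    var_y = sum((b - my) ** 2 for b in y) / (ny - 1)
    pooled = ((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))


# Hypothetical relative frequencies of five style parameters in two texts.
known_author = [0.12, 0.30, 0.05, 0.22, 0.31]
disputed_text = [0.10, 0.28, 0.07, 0.25, 0.30]

print("Pearson r:", pearson_r(known_author, disputed_text))
print("R squared:", determination(known_author, disputed_text))
print("t statistic:", student_t(known_author, disputed_text))
```

In this sketch, a correlation and a coefficient of determination close to 1, together with a t statistic close to 0, would point towards similar models. As the abstract stresses, such figures still require expert interpretation.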

Author Biographies

Tatiana Romanova, Department of Humanities, National Research University Higher School of Economics

Professor, Doctor of Sciences (Philology)

Anna Khomenko, Department of Humanities, National Research University Higher School of Economics

Senior Lecturer, Candidate of Sciences (Philology), expert

References

Apresyan Yu.D. (1966) Ideas and methods of modern structural linguistics. Moscow: Nauka, 302 p. (in Russ.)

Bacciu A., Morgia M. et al. (2019) Cross-domain authorship attribution combining instance-based and profile-based features. Notebook for PAN at CLEF 2019. Available at: http://ceur-ws.org/Vol-2380/paper_220.pdf (accessed: 05.07.2020)

Baranov A.N. (2001) Introduction to Applied Linguistics. Manual. Moscow: Editorial URSS, 360 p. (in Russ.)

Batura Т.V. (2012) Formal Ways of text authorship identification. Vestnik Novosibirskogo gosudarstvennogo universiteta. Informatcionnye tehnologii=Journal of Novosibirsk State University. Information Technology, vol. 2, no. 4, pp. 81–94 (in Russ.)

Belousov К.I. (2010) Linguistic Models and Language Reality Modeling Issues. Vestnik Orenburgskogo gosudarstvennogo universiteta=Journal of Orenburg State University, no. 11, pp. 94–97 (in Russ.)

Bessmertny I.А., Nugumanova А.B. (2012) Automatic Thesaurus Building Method Based on Statistical Processing of Texts in the Natural Language. Izvestia Tomskogo gosudarstvennogo politekhnicheskogo universiteta=Proceedings of Tomsk State Polytechnical University, no. 5, pp. 125–130 (in Russ.)

Bloch B. (1948) A set of postulates for phonemic analysis. Language, vol. 24, no. 1, pp. 3–46.

Bloomfield L. (1926) A set of postulates for the science of language. Language, vol. 2, no. 2, pp. 153–164.

Campbell L. (1867) The Sophistes and Politicus of Plato. Oxford: Clarendon Press, 170 p.

Coulthard M. (2004) Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, vol. 25, no. 4, pp. 431–447.

Custódio J., Paraboni I. (2018) EACH-USP Ensemble Cross-Domain Authorship Attribution. Notebook for PAN at CLEF 2018. Available at: http://ceur-ws.org/Vol-2125/paper_76.pdf (accessed: 05.07.2020)

Encyclopedia of Forensic Science (1999) T.V. Averyanova (ed.). Moscow: Prospekt, 442 p. (in Russ.)

Galyashina Е.I., Yermolova Е.I. (2005) Linguo-forensic tools for authorship attribution of written and oral texts. Papers of the International Research Conference. Moscow, pp. 20–22 (in Russ.)

Gomzin A. et al. (2018) Detection of author’s educational level and age based on comments analysis. Paper presented at Dialogue, Moscow, 30 May–2 June 2018. Available at: URL: http://www.dialog-21.ru/media/4279/gomzin_turdakov.pdf (accessed: 05.07.2020) (in Russ.)

Goroshko Е.I. (2003) Forensic authorship attribution: gender identification of the author of a document. Theory and practice of forensic investigation and science. Pravo, no. 3, pp. 221–226 (in Russ.)

Hjelmslev L. (2005) Prolegomena to a theory of language. Мoscow: Editorial URSS, 243 p. (in Russ.)

Juola P. (2006) Authorship Attribution. Foundations and Trends in Information Retrieval, vol. 1, no. 3, pp. 233–334.

Ionova S.V., Ogorelkov I.V. (2020) Gender-based Individual Speech Diagnostics in Authorship Attribution: Quantitative Approach. Vestnik Volgogradskogo gosudarstvennogo universiteta. Lingvistika=Journal of Volgograd State University. Linguistics, vol. 19, no. 1, pp. 115–127. DOI:https://doi.org/10.15688/jvolsu2.2020.1.10 (in Russ.)

Karaulov Yu.N. (1987) Russian Language and Language Personality. Moscow: Nauka, 264 p. (in Russ.)

Khmelyov D.V. (2002) Linguo-analyzer. E-resource. Available at: URL: http://www.rusf.ru/books/analysis/ (accessed: 16.11.2017) (in Russ.)

Khomenko A., Baranova Yu., Romanov A., Zadvornov K. (2021) Linguistic modeling as a basis for creating authorship attribution software. Computational linguistics and intellectual technologies. Proceedings of the International Conference “Dialogue 2021”, Moscow. Available at: URL: http://www.dialog-21.ru/media/5315/khomenkoaplusetal048.pdf (accessed: 23.06.2021) (in Russ.)

Komissarov А.Yu. (2000) Forensic Investigation of Written Speech: Manual. Мoscow: Forensic Agency of the Interior Ministry of the Russian Federation, 126 p. (in Russ.)

Koppel M., Schler J. (2003) Exploiting Stylistic Idiosyncrasies for Authorship Attribution. Proceedings of IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, vol. 69, pp. 72–80.

Korobov M. (2015) Morphological analyzer and generator for Russian and Ukrainian languages. In: Khachay M.Y., Konstantinova N.A. (eds.). AIST 2015. CCIS, vol. 542, pp. 320–332. Available at: https://doi.org/10.1007/978-3-319-26123-2_31 (accessed: 05.07.2020) (in Russ.)

Leonard R., Ford J., Christensen T. (2017) Forensic linguistics: applying the science of linguistics to the issues of the law. Hofstra Law Review, vol. 45, pp. 881–897.

Linguistics of Constructions (2010) Е.V. Rakhilina (ed.). Мoscow: Azbukovnik Publishing, 584 p. (in Russ.)

Litvinova Т.А. (2019) Idiolect as Object of Corpus Idiolectology: Towards a New Field in Linguistics. Vestnik Novgorodskogo gosudarstvennogo universiteta imeni Yaroslava Mudrogo=Bulletin of the Yaroslav Mudriy Novgorod State University, no. 7, pp. 1–5 (in Russ.)

Litvinova T., Rangel F. et al. (2017) Overview of the RusProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. Working notes of FIRE 2017. Forum for Information Retrieval Evaluation. Bangalore, pp. 1–7. Available at: URL: http://ceur-ws.org/Vol-2036/T1-1.pdf (accessed: 05.07.2019) (in Russ.)

Litvinova T.A., Gromova A.V. (2020) The Use of Computer Technologies for Forensic Authorship Attribution: Issues and Prospects. Vestnik Volgogradskogo gosudarstvennogo universiteta. Lingvistika=Journal of Volgograd State University. Linguistics, vol. 19, no. 1, pp. 77–88. DOI:https://doi.org/10.15688/jvolsu2.2020.1.7 (in Russ.)

Litvinova T., Sboev A., Panicheva P. (2018) Profiling the age of Russian bloggers. Proceedings of the 7th International Conference, AINL 2018. Saint Petersburg, pp. 167–177 (in Russ.)

Losev A.F. (2004) Introduction to the General Theory of Linguistic Models. Moscow: Editorial URSS, 293 p. (in Russ.)

Marusenko M.A. (1990) The use of image recognition methods for attribution of anonymous and pseudonymous literary texts. Leningrad: University, 164 p. (in Russ.)

Marusenko M.A. (2003) Attribution of anonymous and pseudonymous texts as a standard image recognition problem. Istoriographiya y istochnikovedeniye otechestvennoy istorii=Historiography and Research of Sources of National History, no. 3, pp. 18–22 (in Russ.)

McMenamin G. (2002) Forensic linguistics: advances in forensic stylistics. London: Routledge, 361 p.

Murauer B., Tschuggnall M., Specht G. (2018) Dynamic Parameter Search for Cross-Domain Authorship Attribution. Notebook for PAN at CLEF 2018. Available at: http://ceur-ws.org/Vol-2125/paper_84.pdf (accessed: 05.07.2020)

Muttenthaler L., Lucas G., Amann J. (2019) Authorship Attribution in Fan-Fictional Texts given variable length Character and Word N-Grams. Notebook for PAN at CLEF 2019. Available at: http://ceur-ws.org/Vol-2380/paper_49.pdf (accessed: 05.07.2020)

Paducheva E.V. (1974) On semantics of syntax. Moscow: Nauka, 291 p. (in Russ.)

Radbil T.B., Markina M.V. (2019) Probability Statistical Models in Attribution of Texts by Russian Language Authors. Politicheskaya Lingvistika=Political Linguistics, no. 2, pp. 156–166 (in Russ.)

Revzin I.I. (1977) Modern Structural Linguistics: Issues and Methods. Moscow: Nauka, 263 p. (in Russ.)

Rodionova E.S. (2008a) Linguistic Methods of Attribution and Dating of Literary Texts: towards Corneille-Moliere Problem. Candidate of Philological Sciences Summary. Saint Petersburg, 25 p. (in Russ.)

Rodionova E.S. (2008b) Methods of literary text attribution. In: Structural and applied linguistics: inter-university collection. A.S. Gerda (ed.). Saint Petersburg: University, pp. 118–127 (in Russ.)

Rogov A.A. et al. (2019) Software support for solving text attribution problems. Programmnaya Inzheneriya=Programming Engineering, no. 5, pp. 234–240 (in Russ.)

Romanov A.S. (2010) Methodology and Software Package for Identification of Authors of Unknown Texts. Candidate of Engineering Sciences Summary. Tomsk, 26 p. (in Russ.)

Romanov A.S., Kurtukova A., Fedotova A. et al. (2021) Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks. Future Internet, vol. 13, issue 3, pp. 1–16.

Rubtsova I.I., Yermolayeva Е.I., Bezrukova А.I. et al. (2007) Comprehensive methodology of authorship attribution: methodological recommendations. Мoscow: Forensic Agency of the Ministry of Interior, 192 p. (in Russ.)

Russian Grammar Rules: Collected works (2005) N. Yu. Shvedov (ed.). Мoscow: Nauka, 665 p. Available at: URL: http://rusgram.narod.ru/index.html. (in Russ.)

Shevelyov О. G. (2007) Methods of automatic classification of texts in the natural language: manual. Tomsk: ТМL-Press, 144 p. (in Russ.)

Shtoff V. (1966) Modeling and philosophy. Мoscow: Nauka, 304 p. (in Russ.)

Shuy R. (2005) Creating Language Crimes: How Law Enforcement Uses (and Misuses) Language. N. Y.: Oxford University Press, 194 p.

Sidorov Yu.В. et al. (1999) Computer-assisted system for linguistic analysis of literary texts. In: Saint Petersburg Assembly of Young Researchers and Specialists. Abstracts of reports. Saint Petersburg: University Press, p. 66. (in Russ.)

Stamatatos E. (2017) Authorship attribution using text distortion. Proceedings of 15th Conference of the European Chapter of the Association for Computational Linguistics, Long Papers, pp. 1138–1149.

Stepanenko А.А. (2017) Gender attribution of computer network communication texts. Vestnik Tomskogo gosudarstvennogo universiteta=Journal of Tomsk State University, no. 5, pp. 17–25. DOI: 10.17223/15617793/415/3 (in Russ.)

Timashev А.N. (2007) Atributor: version 1.01: software description. Available at: URL: http://www.textology.ru/atr_resum.html (accessed: 01.02.2016) (in Russ.)

Vinogradov V.V. (1961) The Authorship Problem and Theory of Styles. Мoscow: Goslitizdat, 614 p. (in Russ.)

Vul S.M. (2007) Forensic Authorship Identification: Methodological Basis. Guidebook. Kharkov: KhNIISE Press, 64 p. (in Russ.)

Wright D. (2017) Implementing word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem. International Journal of Corpus Linguistics, vol. 22, no. 2, pp. 212–241.

Zakharov V.N. et al. (2000) System Support Programme for Attribution of Articles Authored by F.M. Dostoevsky. Trudy Petrozavodskogo gosudarstvennogo universiteta=Research Works of the Petrozavodsk State University. Applied Mathematics and Information Technology Series, issue 9, pp. 113–122 (in Russ.)

Zakharov V.N., Khokhlova M.V. (2008) The Statistical Method for Identification of Collocations. Language Engineering in Search of Meanings. Collection of reports to the conference workshop “Web-Based Linguistic Information Technologies”. 11th All-Russia Joint Conference “Internet and Modern Society”. Saint Petersburg: University, pp. 40–54 (in Russ.)

Published
2022-07-01
How to Cite
Romanova, T., & Khomenko, A. (2022). Automation of Forensic Authorship Attribution: Problems and Prospects. Legal Issues in the Digital Age, 3(2), 90-115. Retrieved from https://lida.hse.ru/article/view/14588