Establishing Similarities between Text Documents

Abstract

This article discusses a method for assessing the similarity of texts by comparing sentences drawn from different texts. The method takes four factors into account: how fully a sentence from the compared text covers the standard (reference) sentence; the overall informational significance of the words of the standard sentence that occur in the sentence of the compared text; the similarity of the syntactic structures of the sentences; and the coincidence of semantic meanings and connections. The method is illustrated by applying it to the problem of finding similarities between two texts.
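The first two factors named in the abstract (coverage of the standard sentence and the informational significance of its words) can be illustrated with a minimal sketch. This is not the authors' algorithm: the tokenizer, the IDF-style weighting in `idf_weights`, and the best-match averaging in `text_similarity` are assumptions chosen for illustration, and the syntactic and semantic factors are not modeled at all.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def idf_weights(sentences):
    """Assumed significance measure: smoothed inverse sentence frequency.

    Words that occur in fewer sentences receive a higher weight.
    """
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}


def weighted_coverage(standard, compared, weights, default=1.0):
    """Share of the standard sentence's word weight found in the compared sentence."""
    std_words = set(tokenize(standard))
    cmp_words = set(tokenize(compared))
    total = sum(weights.get(w, default) for w in std_words)
    if total == 0:
        return 0.0
    covered = sum(weights.get(w, default) for w in std_words & cmp_words)
    return covered / total


def text_similarity(standard_sentences, compared_sentences, weights):
    """Average, over standard sentences, of the best coverage by any compared sentence."""
    if not standard_sentences or not compared_sentences:
        return 0.0
    best = [
        max(weighted_coverage(s, c, weights) for c in compared_sentences)
        for s in standard_sentences
    ]
    return sum(best) / len(best)
```

Note that `weighted_coverage` is asymmetric by design: it measures how much of the standard sentence is found in the compared sentence, not the reverse, which matches the abstract's wording about coverage of the standard sentence.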

General Information

Keywords: similarity of texts, comparison of texts, word usage, natural language

Journal rubric: Data Analysis

Article type: scientific article

DOI: https://doi.org/10.17759/mda.2023130403

Received: 20.11.2023

Accepted:

For citation: Khoroshilov A.A., Kan A.V., Evdokimova E.A., Pitskhelauri S.G. Establishing Similarities between Text Documents. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2023. Vol. 13, no. 4, pp. 45–58. DOI: 10.17759/mda.2023130403. (In Russ., abstr. in Engl.)

Information About the Authors

Aleksander A. Khoroshilov, Doctor of Engineering, Senior Researcher, Central Research Institute of the Ministry of Defence of the Russian Federation, Moscow, Russia, ORCID: https://orcid.org/0000-0001-6641-3105, e-mail: khoroshilov@mail.ru

Anna V. Kan, PhD in Engineering, Associate Professor, Institute of Moscow Aviation Institute (National Research University), Head of the Analytical Department, Federal State Budgetary Institution «National Research Center» Institute named after N.E. Zhukovsky, Moscow, Russia, ORCID: https://orcid.org/0000-0001-9410-406X, e-mail: kan_a@mail.ru

Ekaterina A. Evdokimova, 1st Category Mathematician, Federal Research Center «Informatics and Management», Russian Academy of Sciences (IPI RAS), Moscow, Russia, ORCID: https://orcid.org/0000-0003-4719-2786, e-mail: evdokimovaekan@mail.ru

Sofya G. Pitskhelauri, Master's student, Institute of Information Technologies and Applied Mathematics, Moscow Aviation Institute (National Research University) (MAI), Moscow, Russia, e-mail: sofyauptuns@gmail.com
