Modelling and Data Analysis
2023. Vol. 13, no. 4, 45–58
doi:10.17759/mda.2023130403
ISSN: 2219-3758 / 2311-9454 (online)
Establishing Similarities between Text Documents
Abstract
This article presents a method for assessing the similarity of texts by comparing sentences drawn from different texts. The method takes into account four factors: how fully a sentence from the compared text covers the standard sentence, the overall informational significance of the standard sentence's words that occur in the compared sentence, the similarity of the sentences' syntactic structures, and the coincidence of semantic meanings and relations. The method is illustrated by applying it to the problem of finding the similarities between two texts.
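The first two factors named in the abstract, coverage of the standard sentence and the informational significance of its words, can be illustrated with a minimal sketch. This is not the authors' formula: the IDF-style weighting, the function names (tokenize, idf_weights, weighted_coverage), and the default weight are all assumptions made for illustration, and the syntactic and semantic components of the method are not modelled here.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def idf_weights(sentences):
    """Assumed significance measure: words found in fewer sentences weigh more."""
    n = len(sentences)
    doc_freq = Counter()
    for s in sentences:
        doc_freq.update(set(tokenize(s)))
    return {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def weighted_coverage(standard, candidate, weights, default=1.0):
    """Share of the standard sentence's total word weight found in the candidate.

    Words missing from `weights` get the hypothetical `default` weight.
    """
    std_words = set(tokenize(standard))
    cand_words = set(tokenize(candidate))
    total = sum(weights.get(w, default) for w in std_words)
    if total == 0:
        return 0.0
    covered = sum(weights.get(w, default) for w in std_words & cand_words)
    return covered / total
```

With all weights equal, the score reduces to the plain fraction of standard-sentence words that occur in the compared sentence; unequal weights let rare, informative words dominate the score over frequent function words.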
General Information
Keywords: similarity of texts, comparison of texts, word usage, natural language
Journal rubric: Data Analysis
Article type: scientific article
DOI: https://doi.org/10.17759/mda.2023130403
Received: 20.11.2023
Accepted:
For citation: Khoroshilov A.A., Kan A.V., Evdokimova E.A., Pitskhelauri S.G. Establishing Similarities between Text Documents. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2023. Vol. 13, no. 4, pp. 45–58. DOI: 10.17759/mda.2023130403. (In Russ., abstr. in Engl.)