Semantic analysis of test responses using synthetic data generation

B.Y. Polyakov

doi:10.17759/mda.2025150410

Modelling and Data Analysis
2025. Vol. 15, no. 4, 156–164
ISSN: 2219-3758 / 2311-9454 (online)

Abstract

Purpose. To evaluate the feasibility of using synthetic data generated by large language models (LLMs) to train automated classifiers of text responses in educational and professional testing. Methods. One hundred example responses were generated with LLMs, preprocessed (tokenization, stemming, TF-IDF weighting), and used to train two classification models: logistic regression and a radial basis function (RBF) network. Both models were then evaluated on a test dataset. Results. Logistic regression achieved an accuracy of 80%, and the RBF network achieved 65-90%. Three systematic limitations were identified: strong dependence on keywords, insensitivity to semantic inversions, and blindness to context. Conclusions. The approach is promising for auxiliary assessment tools, but these limitations currently rule out fully replacing human evaluators. Further refinement is needed before practical use.
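
The preprocessing and classification steps named in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the toy responses, the suffix-stripping stemmer, the smoothed IDF formula, and the training settings (epochs, learning rate) are all assumptions chosen for illustration, and the RBF network is not covered. The sketch also shows one way a bag-of-words representation can miss a meaning reversal: swapping subject and object leaves the TF-IDF vector unchanged, so the classifier returns the same score for both sentences. Treating word-order reversal as an instance of the "semantic inversions" mentioned in the abstract is likewise an assumption.

```python
import math
import re
from collections import Counter

def preprocess(text):
    """Tokenize, lowercase, and crudely strip suffixes (stand-in for a real stemmer)."""
    stems = []
    for token in re.findall(r"[a-z]+", text.lower()):
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems

def fit_idf(docs):
    """Smoothed inverse document frequency per term (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(doc, idf):
    """L2-normalized TF-IDF vector as a dict; terms unseen in training are dropped."""
    counts = Counter(term for term in doc if term in idf)
    vec = {term: counts[term] * idf[term] for term in sorted(counts)}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {term: v / norm for term, v in vec.items()}

def predict_proba(vec, weights, bias):
    """Probability that a response is correct under the logistic model."""
    z = bias + sum(weights.get(term, 0.0) * v for term, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=300, lr=0.5):
    """Logistic regression fitted by plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            error = predict_proba(vec, weights, bias) - y
            bias -= lr * error
            for term, v in vec.items():
                weights[term] = weights.get(term, 0.0) - lr * error * v
    return weights, bias

# Hypothetical toy responses (1 = correct, 0 = incorrect); not the study's data.
responses = [
    ("Plants convert light into chemical energy", 1),
    ("Light energy is converted into glucose by plants", 1),
    ("Chlorophyll absorbs light and plants produce glucose", 1),
    ("Plants eat soil to grow", 0),
    ("Animals breathe oxygen at night", 0),
    ("Soil gives plants all their food", 0),
]
docs = [preprocess(text) for text, _ in responses]
labels = [label for _, label in responses]
idf = fit_idf(docs)
vectors = [tfidf(doc, idf) for doc in docs]
weights, bias = train_logreg(vectors, labels)
train_predictions = [int(predict_proba(v, weights, bias) >= 0.5) for v in vectors]

# Same words, reversed roles: the bag-of-words vectors are identical.
vec_a = tfidf(preprocess("Plants convert light into glucose"), idf)
vec_b = tfidf(preprocess("Glucose converts light into plants"), idf)
score_a = predict_proba(vec_a, weights, bias)
score_b = predict_proba(vec_b, weights, bias)
print(train_predictions, vec_a == vec_b, score_a == score_b)
```

Because the representation discards word order, no choice of classifier on top of it can separate the two sentences; addressing this would require order-aware features (for example n-grams) or contextual embeddings.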

General Information

Keywords: LLM, large language models, generative AI, test automation, text processing

Journal rubric: Short Messages

Article type: announcement

DOI: https://doi.org/10.17759/mda.2025150410

Received 24.10.2025

Revised 11.11.2025

Accepted 13.11.2025

Published 28.12.2025

For citation: Polyakov, B.Y. (2025). Semantic analysis of test responses using synthetic data generation. Modelling and Data Analysis, 15(4), 156–164. (In Russ.). https://doi.org/10.17759/mda.2025150410

License: CC BY-NC 4.0

Information About the Authors

Borislav Y. Polyakov, Junior Researcher, Research Assistant, Laboratory of Mathematical Psychology and Applied Software of the Center for Information Technologies for Psychological Research, Moscow State University of Psychology and Education, Moscow, Russian Federation, ORCID: https://orcid.org/0000-0002-6457-9520, e-mail: deslion@yandex.ru

Conflict of interest

The author declares no conflict of interest.

Ethics statement

The study was conducted using synthetic data generated by language models. As the research did not involve human participants, ethics committee approval was not required.