Semantic analysis of test responses using synthetic data generation

 
Abstract

Purpose. To evaluate whether synthetic data generated by large language models (LLMs) can be used to train automated classifiers of text responses in educational and professional testing. Methods. In the experiment, 100 example responses were generated with LLMs, preprocessed (tokenization, stemming, TF-IDF), and used to train two classification models, logistic regression and an RBF network, which were then evaluated on a test dataset. Results. The models achieved accuracies of 80% and 65-90%, respectively. Systematic limitations were identified: heavy dependence on keywords, insensitivity to semantic inversions, and contextual blindness in classification. Conclusions. The approach shows promise for developing auxiliary assessment tools, but its current limitations rule out fully replacing human evaluators. Further refinement is needed before practical implementation.
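
The preprocessing and classification steps named in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the toy responses, the crude suffix-stripping stemmer, the smoothed IDF formula, and the training hyperparameters are all assumptions made for illustration, and the RBF network is omitted. The sketch also shows one plausible reason for the reported insensitivity to semantic inversions: a bag-of-words TF-IDF vector ignores word order, so a response and its inverted counterpart receive the same score.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Tokenize into lowercase words, then apply a crude suffix-stripping stemmer."""
    stems = []
    for token in re.findall(r"[a-z]+", text.lower()):
        for suffix in ("ing", "ed", "es", "s"):
            # Strip at most one suffix and keep at least three characters.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    counts = Counter(term for term in doc if term in idf)
    vec = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {term: v / norm for term, v in vec.items()}


def predict_proba(vec, weights, bias):
    """Probability of the 'correct' class under a logistic model."""
    z = bias + sum(weights.get(term, 0.0) * v for term, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=300, lr=0.5):
    """Logistic regression fitted by plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, label in zip(vectors, labels):
            error = predict_proba(vec, weights, bias) - label
            bias -= lr * error
            for term, v in vec.items():
                weights[term] = weights.get(term, 0.0) - lr * error * v
    return weights, bias


# Hypothetical toy responses (1 = correct, 0 = incorrect); not data from the study.
train = [
    ("plants convert sunlight into chemical energy", 1),
    ("chlorophyll absorbs light and plants produce glucose", 1),
    ("sunlight is converted into glucose by plants", 1),
    ("plants eat soil to grow", 0),
    ("animals breathe oxygen at night", 0),
    ("the moon makes plants grow", 0),
]

docs = [preprocess(text) for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(docs)
weights, bias = train_logreg([tfidf(doc, idf) for doc in docs], labels)

# The same words in a different order: the meaning is inverted, the vector is not.
original = "sunlight is converted into glucose by plants"
inverted = "glucose is converted into sunlight by plants"
p_original = predict_proba(tfidf(preprocess(original), idf), weights, bias)
p_inverted = predict_proba(tfidf(preprocess(inverted), idf), weights, bias)
print(f"original: {p_original:.4f}  inverted: {p_inverted:.4f}")
```

Both sentences contain the same stems, so the two printed probabilities coincide up to floating-point rounding. Distinguishing them would require features that encode word order or sentence structure, which this kind of representation does not provide.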

General Information

Keywords: LLM, large language models, generative AI, test automation, text processing

Journal rubric: Short Messages

Article type: announcing

DOI: https://doi.org/10.17759/mda.2025150410

Received 24.10.2025

Revised 11.11.2025

Accepted

Published

For citation: Polyakov, B.Y. (2025). Semantic analysis of test responses using synthetic data generation. Modelling and Data Analysis, 15(4), 156–164. (In Russ.). https://doi.org/10.17759/mda.2025150410

© Polyakov B.Y., 2025

License: CC BY-NC 4.0

Information About the Authors

Borislav Y. Polyakov, Junior Researcher, Research Assistant, Laboratory of Mathematical Psychology and Applied Software of the Center for Information Technologies for Psychological Research, Moscow State University of Psychology and Education, Moscow, Russian Federation, ORCID: https://orcid.org/0000-0002-6457-9520, e-mail: deslion@yandex.ru

Conflict of interest

The author declares no conflict of interest.

Ethics statement

The study was conducted using synthetic data generated by language models. As the research did not involve human participants, ethics committee approval was not required.
