Identification and Classification of Toxic Statements by Machine Learning Methods

Abstract

Social media platforms can receive several million comments per day, so their owners are interested in automatic content filtering. This paper considers the task of identifying offensive statements in text. Several methods of converting text to vectors are examined, including TF-IDF, Word2Vec, and GloVe. Results are presented for classical text classification methods and for neural network methods (LSTM, CNN).
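As an illustration of the simplest pipeline named in the abstract (TF-IDF vectors fed to a classical classifier), the following is a minimal sketch in plain Python. It is not the authors' implementation: the smoothed IDF formula, the choice of logistic regression, the hyperparameters, the function names, and the four toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse TF-IDF vector (dict) scaled to unit L2 norm; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def predict_proba(vec, weights, bias):
    """Probability of the positive (toxic) class under the logistic model."""
    z = bias + sum(weights[t] * v for t, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Logistic regression fitted by plain stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            bias -= lr * err
            for t, v in vec.items():
                weights[t] -= lr * err * v
    return weights, bias


# Made-up toy data (1 = toxic, 0 = neutral); far too small for real use.
train_texts = [
    "you are a stupid idiot",
    "what a stupid question",
    "how do i learn python",
    "what is the best way to cook rice",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_texts)
X = [tfidf(text, idf) for text in train_texts]
weights, bias = train_logreg(X, train_labels)
predictions = [int(predict_proba(x, weights, bias) > 0.5) for x in X]
```

On this toy set the model only has to memorize four separable sentences, so the sketch says nothing about real-world accuracy. It shows only how rarer terms receive larger IDF weights and how the resulting sparse vectors feed a linear classifier.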

General Information

Keywords: Natural Language Processing (NLP), Classification, Gradient boosting, XGBoost, CatBoost, Recurrent Neural Network, LSTM, Convolutional Neural Network

Journal rubric: Optimization Methods

Article type: scientific article

DOI: https://doi.org/10.17759/mda.2022120103

Received: 18.01.2022

Accepted:

For citation: Platonov E.N., Rudenko V.Y. Identification and Classification of Toxic Statements by Machine Learning Methods. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2022. Vol. 12, no. 1, pp. 27–48. DOI: 10.17759/mda.2022120103. (In Russ., abstr. in Engl.)

Information About the Authors

Evgeniy N. Platonov, PhD in Physics and Mathematics, Assistant Professor, Moscow Aviation Institute (National Research University), Moscow, Russia, ORCID: https://orcid.org/0000-0001-8502-1350, e-mail: en.platonov@gmail.com

Veronika Y. Rudenko, Student of the Institute of Information Technologies and Applied Mathematics, Moscow Aviation Institute (National Research University), Moscow, Russia, ORCID: https://orcid.org/0000-0003-0010-331X, e-mail: super.ruden2011@mail.ru