Comparative analysis of the effectiveness of methods for constructing quite interpretable linear regression models


Abstract

In earlier work, the author reduced the problem of constructing a quite interpretable linear regression, estimated by ordinary least squares, to a mixed-integer 0-1 linear programming problem. In such models, the signs of the estimates agree with the substantive meaning of the factors, the absolute contributions of the variables to the overall determination are significant, and the degree of multicollinearity is low. The optimal solution of this problem can also be found by the generating all subsets method, that is, by estimating every candidate subset of variables. The purpose of this article is to compare the effectiveness of these two approaches. The computational experiments used 5 real statistical data sets of various sizes. In total, more than 550 different mixed-integer 0-1 problems were solved with the LPSolve package under different conditions, and the efficiency of solving the same problems by the generating all subsets method in the Gretl package was assessed. In all experiments, the proposed method proved many times more efficient than the generating all subsets method. The highest efficiency was achieved on subset selection problems with 103 variables: solving each of them by generating all subsets would require estimating approximately 2^103 (about 10.1 nonillion) models, which a conventional computer could not complete in 1000 years. In LPSolve, each of these problems was solved in 32–191 seconds. The proposed method also processed a large data sample containing 40 explanatory variables and 515,345 observations in an acceptable time, which confirms that its efficiency does not depend on the sample size. Tightening the requirements on multicollinearity and on the absolute contributions of variables in the linear constraints of the problem was found to almost always reduce the speed of its solution.
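The scale of the all-subsets search for 103 variables can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the throughput of one billion estimated models per second are assumptions chosen for the example, not figures from the article.

```python
# Rough count of the models an exhaustive all-subsets search must estimate.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def subset_count(n_vars: int) -> int:
    """Number of non-empty subsets of n_vars candidate regressors."""
    return 2 ** n_vars - 1


def years_to_enumerate(n_vars: int, models_per_second: float) -> float:
    """Years needed to estimate every subset at a given (assumed) throughput."""
    return subset_count(n_vars) / models_per_second / SECONDS_PER_YEAR


n = 103
print(f"{subset_count(n):.3e} candidate models")
# Hypothetical throughput (assumption): 1e9 OLS fits per second.
print(f"{years_to_enumerate(n, 1e9):.2e} years at 1e9 models per second")
```

Even under this optimistic assumed throughput, the enumeration time exceeds the 1000-year horizon mentioned above by many orders of magnitude.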

General Information

Keywords: linear regression, ordinary least squares method, interpretability, mixed-integer 0-1 linear programming problem, generating all subsets method, contributions of variables to determination, multicollinearity, efficiency

Journal rubric: Optimization Methods

Article type: scientific article

DOI: https://doi.org/10.17759/mda.2023130404

Received: 30.10.2023

Accepted:

For citation: Bazilevskiy M.P. Comparative analysis of the effectiveness of methods for constructing quite interpretable linear regression models. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2023. Vol. 13, no. 4, pp. 59–83. DOI: 10.17759/mda.2023130404. (In Russ., abstr. in Engl.)

Information About the Authors

Mikhail P. Bazilevskiy, PhD in Engineering, Associate Professor, Department of Mathematics, Irkutsk State Transport University (ISTU), Irkutsk, Russia, ORCID: https://orcid.org/0000-0002-3253-5697, e-mail: mik2178@yandex.ru
