Development of an ETL Process Based on Open Source Technologies to Solve the Problem of Data Delivery to Consumers


Abstract

The article discusses the development of an ETL process for a data warehouse built on open source technologies rather than proprietary vendor-supplied software. The process delivers data from the source to the consumer, with a focus on delivery speed, resource consumption, and ease of development. The article presents the architecture of the solution, describes the processes being replaced, and implements data transfer through the new process. It draws on modern data-processing tools, describing how to interact with them and how the technical characteristics of the process were selected.

General Information

Keywords: database, open source, software, ETL process, data delivery

Journal rubric: Software

Article type: scientific article

DOI: https://doi.org/10.17759/mda.2023130210

Received: 12.04.2023

For citation: Starkov V.V., Gorbatova S.S., Vodolaga V.I. Development of an ETL Process Based on Open Source Technologies to Solve the Problem of Data Delivery to Consumers. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2023. Vol. 13, no. 2, pp. 180–193. DOI: 10.17759/mda.2023130210. (In Russ., abstr. in Engl.)

References

  1. David Loshin. ETL (Extract, Transform, Load). Business Intelligence. 2nd ed. Morgan Kaufmann, 2012. 400 p.
  2. Ralph Kimball, Joe Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. John Wiley & Sons, 2004. 528 p.
  3. David Haertzen. ETL Tools. The Analytical Puzzle: Profitable Data Warehousing, Business Intelligence and Analytics. Technics Publications, 2012. 346 p.
  4. S. Riza, U. Lezerson, Sh. Ouen, D. Uills. Spark dlya professionalov: sovremennye patterny obrabotki bol'shikh dannykh = Advanced Analytics with Spark. Patterns for Learning from Data at Scale (O’Reilly, 2015). 2017. 272 p.
  5. Uorren R., Karau Kh. Effektivnyi Spark. Masshtabirovanie i optimizatsiya = High Performance Spark. Best Practices for Scaling and Optimizing Apache Spark. 2018. 352 p.
  6. Kh. Karau, E. Konvinski, P. Vendell, M. Zakhariya. Izuchaem Spark. Molnienosnyi analiz dannykh = Learning Spark: Lightning-Fast Big Data Analytics (O’Reilly, 2015). 2015. 304 p.
  7. Narkhid Niya, Shapira Gven, Palino Todd. Apache Kafka. Potokovaya obrabotka i analiz dannykh. SPb., 2019. 320 p.
  8. Deepak Vohra. Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools. 1st ed. Apress, 2016. p. 429.

Information About the Authors

Viacheslav V. Starkov, e-mail: starkov.viatcheslav@yandex.ru

Svetlana S. Gorbatova, Senior Lecturer, Moscow Institute of Steel and Alloys (National Research Technological University) (NUST MISIS), Moscow, Russia, ORCID: https://orcid.org/0009-0005-5213-6780, e-mail: ssgorbatova@misis.ru

Victoria I. Vodolaga, Master's Degree, Lomonosov Moscow State University (MSU), Moscow, Russia, ORCID: https://orcid.org/0009-0003-1816-0088, e-mail: vikavodolaga1@gmail.com
