Analytical Review of Data Lakes and Perspectives of Application in the Field of Education

pp. 373–382
Lviv Polytechnic National University, Department of Information Systems and Networks

An analytical review of the development of Data Lakes and their application in various industries, as part of Big Data solutions, was conducted. The available standard architectural solutions for organizing a Data Lake are considered, along with specialized areas that require different or additional components depending on the field in which the Data Lake is used. Proper organization of a Data Lake relies on various data processing tools, including distributed data storage systems, semantic networks, and especially metadata. Metadata plays a key role in identifying the purpose of data and the possible relationships between data and entities. The prospects for using Data Lakes are also reviewed, in particular in the context of Smart City, distance education, and the education industry in general.
