PSOBER: PSO based entity resolution

Y. Aassem; I. Hafidi; H. Khalfi; N. Aboutabit

Entity resolution is the task of mapping the records in a database to their corresponding entities. The problem is challenging because records often lack complete information, records are distributed unevenly across entities, and records of different entities sometimes overlap. In this paper, we propose an unsupervised method that casts entity resolution as a partitioning problem and solves it with an optimization-algorithm-based technique, which partitions the records across entities. A comparative analysis against the genetic algorithm on datasets demonstrates the efficiency of the proposed approach.

entity resolution

cluster validity index

particle swarm optimization

distance measure

genetic algorithm

unsupervised algorithm
