Fuzzy Matching: An Alternative Technique for Merging Extracted Web Data

Abstract

Web scraping is a popular method for collecting data from websites. Because data on the internet is updated frequently, it is a good source of accurate information. However, websites are heterogeneous, so records drawn from different web sources may describe the same item differently, which makes the quality of the combined data inconsistent. A previous study proposed the record linkage method to merge data from multiple websites. That method used a deterministic technique, which merges records by comparing the strings of a matching variable. However, the deterministic technique links two records only when the matching variable is an exact match. This study explores fuzzy matching as an alternative technique. A comparison in this study found that fuzzy matching performs slightly better in merging web data. However, the main drawback of fuzzy matching is that it is hard to determine the threshold that triggers a match. Therefore, future work should focus on finding an optimal method for determining the fuzzy matching threshold, to make the process more streamlined.
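
The contrast between the two techniques can be illustrated with a minimal sketch. It is not the implementation used in this study: the product names, the function names, the lowercase normalization, the 0.8 threshold, and the choice of Python's standard-library difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] after lowercasing and trimming."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def deterministic_merge(left, right):
    """Pair records only when the matching variable is exactly equal."""
    right_set = set(right)
    return [(name, name) for name in left if name in right_set]

def fuzzy_merge(left, right, threshold=0.8):
    """Pair each left record with its most similar right record,
    but only if the similarity reaches the threshold."""
    pairs = []
    for a in left:
        if not right:
            break
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            pairs.append((a, best))
    return pairs

# Hypothetical product names scraped from two websites (illustrative only).
site_a = ["Fresh Milk 1L", "White Bread 400g", "Brown Sugar 1kg"]
site_b = ["Fresh Milk 1 L", "white bread 400g", "Cooking Oil 2kg"]

exact_pairs = deterministic_merge(site_a, site_b)   # no exact string matches
fuzzy_pairs = fuzzy_merge(site_a, site_b)           # milk and bread are paired
strict_pairs = fuzzy_merge(site_a, site_b, threshold=0.99)
loose_pairs = fuzzy_merge(site_a, site_b, threshold=0.3)
```

In this toy data the deterministic merge links nothing, because spacing and capitalization differ between the sites, while the fuzzy merge links the milk and bread records. Raising the threshold to 0.99 drops the milk pair, and lowering it to 0.3 also pairs "Brown Sugar 1kg" with an unrelated product, which illustrates why choosing the threshold is difficult.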

Author Biographies

Lee Qi Zian, Universiti Teknikal Malaysia Melaka

Lee Qi Zian is a graduate student at the Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka. He received his Bachelor's Degree in Computer Science (Artificial Intelligence) in 2021. As of 2023, he is pursuing his Master's Degree in Technology.

Nur Zareen Zulkarnain, Universiti Teknikal Malaysia Melaka

Nur Zareen Zulkarnain is currently a Senior Lecturer at Universiti Teknikal Malaysia Melaka. She received her Ph.D. in Computer Science (Natural Language Processing) from the University of Salford, United Kingdom. Her research interests include sentiment analysis, ontology, informatics, and data analytics.

Yogan Jaya Kumar, Universiti Teknikal Malaysia Melaka

Yogan Jaya Kumar is a Senior Lecturer at Universiti Teknikal Malaysia Melaka. He earned his Bachelor's Degree and Master's Degree from Universiti Sains Malaysia. He completed his Ph.D. in Computer Science in 2014. His research interests include text mining, information extraction, and AI applications.

Published
2024-05-31
How to Cite
Zian, L., Zulkarnain, N. Z., & Kumar, Y. (2024). Fuzzy Matching: An Alternative Technique for Merging Extracted Web Data. Journal of Advanced Computing Technology and Application (JACTA), 6(1), 1-13. https://doi.org/10.54554/jacta.2024.06.01.001
Section
Articles