Fuzzy Matching: An Alternative Technique for Merging Extracted Web Data
Abstract
Web scraping is a popular method for collecting data from websites because data on the internet is updated frequently, which makes it a good source of accurate information. However, websites are heterogeneous, so the same item may be recorded differently across sources, which makes the quality of the combined data inconsistent. A previous study proposed a record linkage method to merge data from multiple websites. That method used a deterministic technique, which merges records by comparing the strings of a matching variable. However, the deterministic technique links two records only when the matching variable is an exact match. This study explores fuzzy matching as an alternative technique. A comparison in this study found that fuzzy matching performs slightly better in merging web data. However, its main drawback is that the threshold that triggers a match is hard to determine. Future work should therefore focus on finding an optimal method for determining the fuzzy matching threshold, to make the process more streamlined.
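The difference between the two techniques can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the product names, the 0.8 threshold, and the choice of the standard library's difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] after lowercasing and trimming."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def deterministic_merge(left, right):
    """Pair records only when the matching variable is an identical string."""
    right_set = set(right)
    return [(name, name) for name in left if name in right_set]


def fuzzy_merge(left, right, threshold=0.8):
    """Pair each left record with its most similar right record,
    keeping the pair only if the score reaches the threshold."""
    pairs = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            pairs.append((a, best))
    return pairs


# Hypothetical product names scraped from two different websites.
site_a = ["Fresh Milk 1L", "White Bread 400g", "Brown Eggs (12)"]
site_b = ["Fresh Milk 1 L", "White Bread 400g", "Olive Oil 500ml"]

print(deterministic_merge(site_a, site_b))
print(fuzzy_merge(site_a, site_b, threshold=0.8))
```

In this sketch the deterministic merge links only the identically spelled record, while the fuzzy merge also links the two spellings of the milk product. Raising the threshold to 1.0 reduces the fuzzy merge to exact matching, and lowering it too far would start linking unrelated products, which is the threshold-selection difficulty noted above.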