DATA QUALITY IN IOT OPEN DATASETS: A METHODOLOGICAL REVIEW
Abstract
The growth of the Internet of Things (IoT) has sharply increased the volume of data generated by connected devices, and with it the incidence of duplicate data, which threatens data quality and reliability. This study assesses the quality of open-source IoT datasets, focusing on the occurrence and impact of duplicate data. Using a Systematic Literature Review (SLR) and a literature-based comparative analysis, we review and compare existing techniques for detecting duplicates. Our findings reveal that, although various methods have been proposed, standardized approaches designed for the specific characteristics of IoT environments are still lacking. The study concludes by highlighting the need for more reliable and scalable solutions capable of handling the diverse and dynamic nature of IoT data, and it offers insights into future research directions.