The Proposed Algorithm for Semi-Structured Data Integration: Case Study of Setiu Wetland Data Set

Authors

  • Mustafa Man, School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.
  • Ily Amalina Ahmad Sabri, School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.

Keywords:

Document Object Model, JSON, Semi-Structured Data, WEIDJ

Abstract

Recent advances in web technology and computer science provide the environmental community with expanding resources for data collection and analysis. Today, people face challenges in the design of analysis methods, workflows, and interaction with data sets. Data integration is one of the older research fields in the database area. It deals with three types of data: structured, semi-structured and unstructured. Web pages are a form of semi-structured data. In this paper, we briefly introduce the problem of data extraction from web pages, with a focus on images. We also discuss the evolution of image extraction from semi-structured to structured format using WEIDJ (Wrapper for Extraction of Images using the Document Object Model (DOM) and JavaScript Object Notation (JSON)). An experiment was conducted on the same website using the two approaches, JSON and DOM, to compare their time performance.
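To make the idea of moving images from a semi-structured page into a structured format concrete, the following is a minimal sketch and not the WEIDJ algorithm itself. It assumes the page is well-formed XHTML so that Python's standard `xml.dom.minidom` can build a DOM tree; real web pages would usually need a more tolerant HTML parser. The function name `extract_images_as_json`, the sample markup, and the image file names are all hypothetical.

```python
import json
from xml.dom import minidom

# Hypothetical, well-formed sample page; the file names are illustrative only.
SAMPLE_PAGE = """<html><body>
<h1>Setiu Wetland</h1>
<img src="images/mangrove.jpg" alt="Mangrove"/>
<p>Some text <img src="images/lagoon.png" alt="Lagoon"/></p>
<img alt="no source"/>
</body></html>"""


def extract_images_as_json(xhtml):
    """Parse the page into a DOM tree and return its images as a JSON string."""
    document = minidom.parseString(xhtml)
    records = []
    # getElementsByTagName searches the whole tree, so nested images are found too.
    for node in document.getElementsByTagName("img"):
        src = node.getAttribute("src")  # returns "" when the attribute is absent
        if src:
            records.append({"src": src, "alt": node.getAttribute("alt")})
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(extract_images_as_json(SAMPLE_PAGE))
```

Images without a `src` attribute are skipped, so the resulting JSON array contains only records that point to an actual image location.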

Published

2017-10-20

How to Cite

Man, M., & Ahmad Sabri, I. A. (2017). The Proposed Algorithm for Semi-Structured Data Integration: Case Study of Setiu Wetland Data Set. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 9(3-3), 79–84. Retrieved from https://jtec.utem.edu.my/jtec/article/view/2876