The Proposed Algorithm for Semi-Structured Data Integration: Case Study of Setiu Wetland Data Set
Keywords:
Document Object Model, JSON, Semi-Structured Data, WEIDJ
Abstract
Recent advances in web technology and computer science give the environmental community expanding resources for data collection and analysis. At the same time, researchers face challenges in designing analysis methods and workflows and in interacting with data sets. Data integration is one of the older research fields in the database area. It deals with three types of data: structured, semi-structured, and unstructured. Web pages are a form of semi-structured data. In this paper, we briefly introduce the problem of data extraction from web pages, focusing on images. We also discuss the evolution of image extraction from semi-structured to structured format using WEIDJ, a wrapper for extracting images that uses the Document Object Model (DOM) and JavaScript Object Notation (JSON). An experiment was conducted on the same website using the two approaches, JSON and DOM, to compare their time performance.
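To make the idea of moving images from a semi-structured page into a structured format concrete, the following is a minimal sketch and not the WEIDJ algorithm itself. It assumes that each image is described by its `src` and `alt` attributes, uses Python's event-driven `html.parser` as a simple stand-in for a DOM traversal, and serializes the result as JSON. The names `ImageCollector` and `extract_images_as_json` are hypothetical and chosen only for illustration.

```python
import json
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects one record per <img> element seen while parsing (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; self-closing <img/> tags also arrive here.
        if tag == "img":
            attributes = dict(attrs)
            src = attributes.get("src")
            if src:  # skip images that carry no source URL
                self.images.append({"src": src, "alt": attributes.get("alt") or ""})


def extract_images_as_json(html):
    """Return a JSON array describing every image with a src attribute in the page."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return json.dumps(collector.images, indent=2)


if __name__ == "__main__":
    # Assumed sample markup, not taken from the Setiu Wetland data set.
    sample_page = (
        "<html><body>"
        '<img src="mangrove.jpg" alt="Mangrove">'
        "<p>Some text</p>"
        '<img src="lagoon.png"/>'
        "</body></html>"
    )
    print(extract_images_as_json(sample_page))
```

Each image becomes one record with fixed fields, which is what lets the output be loaded into a table or queried later. A timing comparison in the spirit of the experiment described above could wrap the call in `time.perf_counter()`, although the measurement setup used by the authors is not stated here.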
License
TRANSFER OF COPYRIGHT AGREEMENT
The manuscript is herewith submitted for publication in the Journal of Telecommunication, Electronic and Computer Engineering (JTEC). It has not been published before, and it is not under consideration for publication in any other journal. It contains no material that is scandalous, obscene, libelous or otherwise contrary to law. When the manuscript is accepted for publication, I, as the author, hereby agree to transfer to JTEC all rights, including those pertaining to electronic forms and transmissions, under existing copyright laws, except for the following, which the author(s) specifically retain(s):
- All proprietary right other than copyright, such as patent rights
- The right to make further copies of all or part of the published article for my use in classroom teaching
- The right to reuse all or part of this manuscript in a compilation of my own works or in a textbook of which I am the author; and
- The right to make copies of the published work for internal distribution within the institution that employs me
I agree that copies made under these circumstances will continue to carry the copyright notice that appears in the original published work. I agree to inform my co-authors, if any, of the above terms. I certify that I have obtained written permission for the use of text, tables, and/or illustrations from any copyrighted source(s), and I agree to supply such written permission(s) to JTEC upon request.