Experimental Approach Based on Ensemble and Frequent Itemsets Mining for Image Spam Filtering

Authors

  • Nor Azman Mat Ariff, Center for Artificial Intelligence Technology, Faculty of Technology and Information Science, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor Darul Ehsan, Malaysia.
  • Azizi Abdullah, Center for Artificial Intelligence Technology, Faculty of Technology and Information Science, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor Darul Ehsan, Malaysia.
  • Mohammad Faidzul Nasrudin, Center for Artificial Intelligence Technology, Faculty of Technology and Information Science, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor Darul Ehsan, Malaysia.

Keywords:

Ensemble Methods, Frequent Itemset Mining, Image Spam, SVM

Abstract

Excessive amounts of image spam cause many problems for e-mail users. Since image spam is difficult to detect with conventional text-based spam filters, various image processing techniques have been proposed. In this paper, we present an ensemble method that uses frequent itemset mining (FIM) for filtering image spam. Although FIM techniques are well established in data mining, they are not commonly used in ensemble methods. To obtain good filtering performance, we use the SIFT descriptor, which is widely regarded as an effective image descriptor. K-means clustering is applied to the SIFT keypoints to produce a visual codebook. The bag-of-words (BOW) feature vector for each image is generated using a hard bag-of-features (HBOF) approach. FIM descriptors are obtained from the frequent itemsets of the BOW feature vectors. We combine BOW and FIM with three feature selection methods, namely Information Gain (IG), Symmetrical Uncertainty (SU) and Chi Square (CS), together with a spatial pyramid, in an ensemble method. We performed experiments on the Dredze and SpamArchive datasets. The results show that our ensemble using frequent itemset mining significantly outperforms both the traditional BOW and the naive approach that concatenates all descriptors directly into a single, very large input vector.
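
The step from a codebook to BOW vectors and then to FIM descriptors can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a codebook already exists (toy 2-D points stand in for SIFT descriptors and k-means centres), that each image is turned into a transaction containing every visual word it uses at least once, and that the FIM descriptor is a binary vector marking which frequent itemsets occur in the image. The abstract does not specify these encodings, and all function names below are hypothetical.

```python
from itertools import combinations

def hard_bow(descriptors, codebook):
    """Hard assignment: each descriptor votes for its single nearest codeword."""
    hist = [0] * len(codebook)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist

def to_transaction(hist):
    """Assumed binarization: an image 'contains' every visual word it uses."""
    return frozenset(i for i, n in enumerate(hist) if n > 0)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise (Apriori-style) search for itemsets found in at least
    min_support transactions."""
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items]
    frequent = []
    for size in range(1, max_size + 1):
        kept = [s for s in current
                if sum(s <= t for t in transactions) >= min_support]
        frequent.extend(kept)
        # Candidates of the next size are unions of two surviving itemsets.
        current = sorted(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == size + 1},
            key=sorted,
        )
        if not current:
            break
    return frequent

def fim_descriptor(transaction, itemsets):
    """Assumed encoding: 1 if the image contains the itemset, else 0."""
    return [1 if s <= transaction else 0 for s in itemsets]

# Toy data: three codewords and three "images" with a few descriptors each.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
images = [
    [(0.1, 0.0), (0.9, 0.1)],
    [(0.0, 0.1), (1.1, 0.0), (0.1, 0.9)],
    [(0.2, 0.8)],
]

bows = [hard_bow(img, codebook) for img in images]
transactions = [to_transaction(h) for h in bows]
itemsets = frequent_itemsets(transactions, min_support=2)
fim_vectors = [fim_descriptor(t, itemsets) for t in transactions]
```

In this toy run, the only frequent itemset larger than a single visual word is the pair {0, 1}, which the first two images share. A real pipeline would likely use a dedicated mining library and a tuned support threshold instead of this brute-force search.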

Published

2018-02-05

How to Cite

Mat Ariff, N. A., Abdullah, A., & Nasrudin, M. F. (2018). Experimental Approach Based on Ensemble and Frequent Itemsets Mining for Image Spam Filtering. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 10(1-5), 121–126. Retrieved from https://jtec.utem.edu.my/jtec/article/view/3642