Experimental Approach Based on Ensemble and Frequent Itemsets Mining for Image Spam Filtering
Keywords:
Ensemble Methods, Frequent Itemset Mining, Image Spam, SVM
Abstract
Excessive amounts of image spam cause many problems for e-mail users. Since image spam is difficult to detect using conventional text-based spam filtering approaches, various image processing techniques have been proposed. In this paper, we present an ensemble method using frequent itemset mining (FIM) for filtering image spam. Although FIM techniques are well established in data mining, they are not commonly used in ensemble methods. To obtain good filtering performance, we use the SIFT descriptor, which is widely regarded as an effective image descriptor. K-means clustering is applied to the SIFT keypoints to produce a visual codebook. The bag-of-words (BOW) feature vector for each image is generated using a hard bag-of-features (HBOF) approach. FIM descriptors are obtained from the frequent itemsets of the BOW feature vectors. We combine BOW and FIM with three feature selection methods, namely Information Gain (IG), Symmetrical Uncertainty (SU) and Chi-Square (CS), together with a spatial pyramid, in an ensemble method. We have performed experiments on the Dredze and SpamArchive datasets. The results show that our ensemble using frequent itemset mining significantly outperforms both the traditional BOW approach and the naive approach that combines all descriptors directly in a single, very large input vector.
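The following is a minimal sketch of the hard bag-of-features and FIM descriptor steps described in the abstract, not the authors' implementation. It assumes a toy 2-D codebook in place of a k-means codebook built from 128-D SIFT descriptors, and it uses brute-force itemset enumeration in place of a dedicated FIM algorithm. All function names (`hard_bow`, `to_transaction`, `frequent_itemsets`, `fim_descriptor`) and the `min_support` and `max_size` parameters are hypothetical. Encoding an image as a binary vector of contained frequent itemsets is one plausible reading of "FIM descriptors", since the abstract does not specify the construction.

```python
from itertools import combinations


def hard_bow(descriptors, codebook):
    """Hard assignment: each descriptor votes for its nearest codeword."""
    hist = [0] * len(codebook)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist


def to_transaction(hist):
    """Treat the set of visual words present in an image as one transaction."""
    return frozenset(i for i, count in enumerate(hist) if count > 0)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of itemsets up to max_size (illustration only)."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            # Support = number of transactions containing the whole itemset.
            support = sum(cand <= t for t in transactions)
            if support >= min_support:
                frequent.append(cand)
    return frequent


def fim_descriptor(transaction, itemsets):
    """Binary vector: 1 if the image contains every word of the itemset."""
    return [int(s <= transaction) for s in itemsets]


# Toy data (assumption): a 3-word codebook of 2-D points instead of SIFT.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
images = [
    [(0.1, 0.0), (0.9, 1.1)],
    [(0.0, 0.2), (1.2, 0.9), (4.8, 5.1)],
    [(5.2, 4.9)],
]

histograms = [hard_bow(img, codebook) for img in images]
transactions = [to_transaction(h) for h in histograms]
itemsets = frequent_itemsets(transactions, min_support=2)
fim_features = [fim_descriptor(t, itemsets) for t in transactions]
```

In a full pipeline, the BOW histograms and FIM vectors would each feed a separate classifier whose outputs are then combined, rather than being concatenated into one long input vector.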
License
TRANSFER OF COPYRIGHT AGREEMENT
The manuscript is herewith submitted for publication in the Journal of Telecommunication, Electronic and Computer Engineering (JTEC). It has not been published before, and it is not under consideration for publication in any other journals. It contains no material that is scandalous, obscene, libelous or otherwise contrary to law. When the manuscript is accepted for publication, I, as the author, hereby agree to transfer to JTEC, all rights including those pertaining to electronic forms and transmissions, under existing copyright laws, except for the following, which the author(s) specifically retain(s):
- All proprietary rights other than copyright, such as patent rights
- The right to make further copies of all or part of the published article for my use in classroom teaching
- The right to reuse all or part of this manuscript in a compilation of my own works or in a textbook of which I am the author; and
- The right to make copies of the published work for internal distribution within the institution that employs me
I agree that copies made under these circumstances will continue to carry the copyright notice that appears in the original published work. I agree to inform my co-authors, if any, of the above terms. I certify that I have obtained written permission for the use of text, tables, and/or illustrations from any copyrighted source(s), and I agree to supply such written permission(s) to JTEC upon request.