Textual, Non-Textual, and Hybrid Feature Engineering for SMS Spam Classification

Bibliographic Details
Published in: IEEE Access
Main Authors: Aditi Ranjit Kumar Verma, Shriya Sadana
Format: Article
Language: English
Published: IEEE 2025-01-01
Online Access: https://ieeexplore.ieee.org/document/11202451/
Description
Summary: Contemporary spam filters increasingly rely on resource-intensive deep learning models. This study evaluates the performance, robustness, and deployability of lightweight learning models. It provides a head-to-head evaluation of probabilistic (Naïve Bayes) and margin-based (Support Vector Machine, SVM) classifiers on three feature spaces (textual, non-textual, and hybrid) derived from the 5574-message UCI Short Message Service (SMS) spam collection. Our primary finding is that a hybrid model, which fuses a bag-of-words (BoW) representation with 22 handcrafted metadata features, achieves the highest accuracy, with SVM peaking at 98.3%. To assess resilience, the model was tested against adversarial attacks. The hybrid SVM model exhibited strong robustness on altered data and maintained 72.41% accuracy against challenging semantic attacks. Furthermore, the hybrid SVM model demonstrated strong cross-dataset generalization, achieving 74.38% accuracy when trained on the original UCI data and tested on a modern, diverse dataset of SMS, Telegram, and email messages. Deployment analysis confirmed the efficiency of the framework, with the fastest model processing ~200 requests/s at less than 10 ms latency and ~1.5% average CPU load on a standard CPU. The results establish three key principles for next-generation spam filters: 1) lexical information remains the dominant signal; 2) lightweight metadata provides measurable incremental value when paired with text; and 3) margin-based classifiers exploit multimodal fusion most effectively. Taken together, these findings indicate that a lightweight hybrid feature-engineering approach provides a robust, generalizable, and resource-efficient solution for real-time spam mitigation, and thus a practical alternative to computationally expensive deep learning architectures.
ISSN: 2169-3536
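
Illustration: the summary describes a hybrid feature space that fuses a bag-of-words representation with handcrafted metadata features. The sketch below shows one way such a fusion could look in plain Python. It is not the authors' implementation: this record does not list the 22 metadata features, so the five used here (message length, digit count, uppercase ratio, exclamation marks, URL flag) are hypothetical stand-ins, and every function name is an assumption made for illustration. The classifier stage (Naïve Bayes or SVM) is omitted.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def tokenize(text):
    """Lowercase the message and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(messages):
    """Map every token seen in the training messages to a column index."""
    vocab = {}
    for msg in messages:
        for tok in tokenize(msg):
            vocab.setdefault(tok, len(vocab))
    return vocab


def bow_vector(text, vocab):
    """Count each vocabulary token in the message; unseen tokens are dropped."""
    counts = Counter(tokenize(text))
    vec = [0.0] * len(vocab)
    for tok, n in counts.items():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] = float(n)
    return vec


def metadata_vector(text):
    """Hypothetical metadata features; NOT the paper's 22 features."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return [
        float(len(text)),                       # message length in characters
        float(sum(c.isdigit() for c in text)),  # number of digit characters
        upper_ratio,                            # share of letters in uppercase
        float(text.count("!")),                 # number of exclamation marks
        1.0 if URL_RE.search(text) else 0.0,    # 1.0 if a URL-like string occurs
    ]


def hybrid_vector(text, vocab):
    """Concatenate the BoW counts with the metadata features."""
    return bow_vector(text, vocab) + metadata_vector(text)


# Toy messages invented for this example (not taken from the UCI collection).
train = [
    "WIN a FREE prize now! Call 08001234567",
    "are we still on for lunch today",
]
vocab = build_vocabulary(train)
x = hybrid_vector("FREE prize!! visit www.example.com", vocab)
```

The resulting vector has one column per vocabulary token followed by the metadata columns, so either classifier family can consume it as a single input. In practice the metadata columns would probably need scaling before a margin-based model is trained on them, since raw lengths and raw counts sit on very different ranges.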