Ensemble of feature selection models for malware datasets

dc.contributor.advisorDAG, HASANen_US
dc.contributor.authorCÜREBAL, FARUK
dc.date2022-09
dc.date.accessioned2023-08-02T10:42:43Z
dc.date.available2023-08-02T10:42:43Z
dc.date.issued2022
dc.departmentEnstitüler, Lisansüstü Eğitim Enstitüsü, İşletme Ana Bilim Dalıen_US
dc.description.abstractWhile the development of technology has made our lives easier, our dependence on it has also increased. Cybercriminals develop various types of malware to exploit this dependence. Thus, malware classification is essential for security researchers and incident response teams to take action against them and accelerate mitigation. In this study, we selected seven feature selection methods considering their popularity, effectiveness, and complexity: LOFO Importance (Leave One Feature Out) , FRUFS (Feature Relevance based Unsupervised Feature Selection), AGRM (A General Framework for Auto-Weighted Feature Selection with Global Redundancy Minimization), MI (Mutual Information), Chi-square test, mRMR (Minimum Redundancy and Maximum Relevance), BoostARoota. We performed all the experiments in this study using XGBoost (Extreme Gradient Boosting), RF (Random Forest), and HGB (Histogram-Based Gradient Boosting) machine learning classifiers and accuracy, F1-score, and AUC-score (Area under the ROC Curve) evaluation metrics. We measured the parameter sensitivities of these feature selection methods having adjustable parameters on two high-dimensional datasets: the Microsoft Malware Prediction dataset and the API Call Sequences dataset. These feature selection methods and parameters are FRUFS (model-c, random-state), BoostARoota (clf, iters), and LOFO (model). Only the ‘model’ parameter of the LOFO algorithm significantly affects the accuracy and F1-score evaluation metric results among the adjustable parameters. We then compared these seven feature selection algorithms using two high-dimensional malware datasets: the Microsoft Malware Prediction dataset and the API Import dataset. Overall results show that AGRM obtained better metric results than other feature selection methods. Behind AGRM, FRUFS, LOFO, MI, and mRMR achieved the best results in different metrics. Compared to MI and mRMR, LOFO is much less used in the malware domain, while FRUFS has not been used before. Since AGRM performs better and FRUFS and LOFO are newer than other algorithms, we decided to continue our work with these three feature selection methods. Finally, we combined three selected feature selection methods, LOFO Importance, FRUFS, and AGRM, to find the most important features and work with fewer features by reducing the multidimensionality. We trained three feature subsets from these feature selection methods with three models, XGBoost, RF, and HGB classifiers, using a stacking ensemble on the Microsoft Malware Prediction dataset and the API Import dataset. From the nine prediction probabilities we obtained, we eliminated the prediction probabilities containing the same information by setting a threshold in the correlation matrix. We gave the final prediction probabilities we obtained to the SVM (Support Vector Machine) meta classifier. Our model obtained an average of 1.2% better classification accuracy than the selected three feature selection methods on one of the well know malware datasets (Microsoft Malware Prediction dataset). For the API Import dataset, our model obtained an average 8% better classification accuracy than LOFO and FRUFS feature selection algorithms, and AGRM could not be used in that comparison due to insufficient RAM. Therefore, our proposed model was trained with fewer features and got better results.en_US
dc.identifier.urihttps://hdl.handle.net/20.500.12469/4453
dc.identifier.yoktezid766000en_US
dc.language.isoenen_US
dc.publisherKadir Has Üniversitesien_US
dc.relation.publicationcategoryTezen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectFeature Selectionen_US
dc.subjectEnsembleen_US
dc.subjectFRUFSen_US
dc.subjectAGRMen_US
dc.subjectLOFOen_US
dc.subjectMalware Classificationen_US
dc.titleEnsemble of feature selection models for malware datasetsen_US
dc.typeMaster Thesisen_US
dspace.entity.typePublication

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Faruk_Cürebal.pdf
Size:
895.08 KB
Format:
Adobe Portable Document Format
Description:
Ensemble of Feature Selection Models for Malware Datasets

Collections