Ensemble of Feature Selection Models for Malware Datasets

CÜREBAL, FARUK

dc.contributor.advisor	DAG, HASAN	en_US
dc.contributor.author	CÜREBAL, FARUK
dc.date	2022-09
dc.date.accessioned	2023-08-02T10:42:43Z
dc.date.available	2023-08-02T10:42:43Z
dc.date.issued	2022
dc.description.abstract	While the development of technology has made our lives easier, it has also increased our dependence on it. Cybercriminals develop various types of malware to exploit this dependence. Malware classification is therefore essential for security researchers and incident response teams to act against malware and accelerate mitigation. In this study, we selected seven feature selection methods based on their popularity, effectiveness, and complexity: LOFO Importance (Leave One Feature Out), FRUFS (Feature Relevance based Unsupervised Feature Selection), AGRM (A General Framework for Auto-Weighted Feature Selection with Global Redundancy Minimization), MI (Mutual Information), the Chi-square test, mRMR (Minimum Redundancy and Maximum Relevance), and BoostARoota. We performed all experiments using three machine learning classifiers, XGBoost (Extreme Gradient Boosting), RF (Random Forest), and HGB (Histogram-Based Gradient Boosting), and three evaluation metrics: accuracy, F1-score, and AUC (Area Under the ROC Curve). We first measured the parameter sensitivity of the feature selection methods that have adjustable parameters on two high-dimensional datasets: the Microsoft Malware Prediction dataset and the API Call Sequences dataset. These methods and their parameters are FRUFS (model-c, random-state), BoostARoota (clf, iters), and LOFO (model). Among the adjustable parameters, only the ‘model’ parameter of LOFO significantly affects the accuracy and F1-score results. We then compared the seven feature selection algorithms on two high-dimensional malware datasets: the Microsoft Malware Prediction dataset and the API Import dataset. Overall, AGRM obtained better metric results than the other feature selection methods. Behind AGRM, the next-best results on the various metrics were achieved by FRUFS, LOFO, MI, and mRMR.
Compared to MI and mRMR, LOFO is much less used in the malware domain, while FRUFS has not been used before. Because AGRM performs better and FRUFS and LOFO are newer than the other algorithms, we continued our work with these three feature selection methods. Finally, we combined LOFO Importance, FRUFS, and AGRM to find the most important features and to work with fewer features by reducing dimensionality. Using a stacking ensemble, we trained three models (XGBoost, RF, and HGB classifiers) on each of the three feature subsets produced by these methods, on the Microsoft Malware Prediction dataset and the API Import dataset. From the nine resulting prediction probabilities, we eliminated those carrying the same information by applying a threshold to their correlation matrix. We passed the remaining prediction probabilities to an SVM (Support Vector Machine) meta-classifier. On the well-known Microsoft Malware Prediction dataset, our model achieved on average 1.2% better classification accuracy than the three selected feature selection methods. On the API Import dataset, our model achieved on average 8% better classification accuracy than the LOFO and FRUFS feature selection algorithms; AGRM could not be included in that comparison due to insufficient RAM. Thus, our proposed model was trained with fewer features and achieved better results.	en_US
dc.identifier.uri	https://hdl.handle.net/20.500.12469/4453
dc.language.iso	en	en_US
dc.publisher	Kadir Has Üniversitesi	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Feature Selection	en_US
dc.subject	Ensemble	en_US
dc.subject	FRUFS	en_US
dc.subject	AGRM	en_US
dc.subject	LOFO	en_US
dc.subject	Malware Classification	en_US
dc.title	Ensemble of Feature Selection Models for Malware Datasets	en_US
dc.type	Master Thesis	en_US
dspace.entity.type	Publication
gdc.coar.access	open access
gdc.coar.type	text::thesis::master thesis
gdc.description.department	Enstitüler, Lisansüstü Eğitim Enstitüsü, İşletme Ana Bilim Dalı	en_US
gdc.description.publicationcategory	Tez	en_US
gdc.identifier.yoktezid	766000	en_US
relation.isOrgUnitOfPublication	b20623fc-1264-4244-9847-a4729ca7508c
relation.isOrgUnitOfPublication.latestForDiscovery	b20623fc-1264-4244-9847-a4729ca7508c
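
The abstract describes removing redundant base-model outputs by thresholding a correlation matrix before the SVM meta-classifier is trained. The sketch below illustrates one way such a filter could work; it is not the thesis's implementation. The greedy keep-or-drop rule, the 0.95 threshold, the column names, and the toy probabilities are all assumptions made for illustration, since the record does not state them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep columns whose |correlation| with every kept column
    stays at or below `threshold`.

    `columns` maps a hypothetical label such as "LOFO+RF" to a list of
    predicted probabilities, one per sample.
    """
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Toy, made-up probabilities for five samples (not results from the thesis).
base_predictions = {
    "LOFO+XGBoost": [0.90, 0.20, 0.80, 0.10, 0.70],
    "LOFO+RF": [0.88, 0.22, 0.79, 0.12, 0.69],  # nearly duplicates the first column
    "FRUFS+HGB": [0.60, 0.50, 0.30, 0.40, 0.90],
}

meta_features = drop_correlated(base_predictions, threshold=0.95)
print(list(meta_features))
```

In a full pipeline, the surviving columns would form the input matrix of the meta-classifier; that training step is omitted here to keep the sketch dependency-free.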

Files

Original bundle

Name: Faruk_Cürebal.pdf
Size: 895.08 KB
Format: Adobe Portable Document Format
Description: Ensemble of Feature Selection Models for Malware Datasets

Collections

Thesis Collection (Tez Koleksiyonu)