An ensemble of pre-trained transformer models for imbalanced multiclass malware classification

dc.authoridDemirkiran, Ferhat/0000-0001-7335-9370
dc.authoridUnal, Ugur/0000-0001-6552-6044
dc.authorscopusid57219836294
dc.authorscopusid56497768800
dc.authorscopusid57215332698
dc.authorscopusid6507328166
dc.contributor.authorDemirkiran, Ferhat
dc.contributor.authorCayir, Aykut
dc.contributor.authorUnal, Ugur
dc.contributor.authorDag, Hasan
dc.date.accessioned2024-06-23T21:36:49Z
dc.date.available2024-06-23T21:36:49Z
dc.date.issued2022
dc.departmentKadir Has Universityen_US
dc.department-temp[Demirkiran, Ferhat] Kadir Has Univ, Cyber Secur Grad Program, Istanbul, Turkey; [Cayir, Aykut] Huawei R&D Ctr, Istanbul, Turkey; [Cayir, Aykut; Unal, Gur; Dag, Hasan] Kadir Has Univ, Management Informat Syst, Istanbul, Turkeyen_US
dc.descriptionDemirkiran, Ferhat/0000-0001-7335-9370; Unal, Ugur/0000-0001-6552-6044en_US
dc.description.abstractClassification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Hence, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features in machine and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships among API calls. Unlike traditional machine and deep learning models, transformer-based models process the sequences as a whole and learn relationships among API calls via multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that a Transformer model with a single transformer block layer surpasses the performance of the widely used base architecture, LSTM. Moreover, the pre-trained transformer models BERT and CANINE outperform the alternatives in classifying highly imbalanced malware families according to the evaluation metrics F1-score and AUC. Furthermore, our proposed bagging-based random transformer forest (RTF) model, an ensemble of BERT or CANINE, reaches state-of-the-art evaluation scores on three out of four datasets; specifically, it achieves a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets. (C) 2022 Elsevier Ltd. All rights reserved.en_US
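The abstract describes RTF as a bagging-based ensemble of fine-tuned transformers. A minimal sketch of the bagging-and-soft-voting idea follows, using toy per-model probability matrices as stand-ins for the outputs of the actual BERT/CANINE members (the base-model names and data here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_samples, rng):
    # Draw a bootstrap sample (with replacement), as bagging does
    # to build each ensemble member's training set.
    return rng.integers(0, n_samples, size=n_samples)

def soft_vote(prob_list):
    # Average the class-probability matrices produced by the
    # ensemble members, then take the argmax class per sample.
    avg = np.mean(prob_list, axis=0)
    return avg.argmax(axis=1)

# Toy stand-in: 3 ensemble members, 4 samples, 2 classes.
probs = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]),
    np.array([[0.8, 0.2], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.7, 0.3], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]]),
]
print(soft_vote(probs).tolist())  # → [0, 1, 0, 1]
```

Averaging probabilities (soft voting) rather than taking a majority of hard labels lets confident members outweigh uncertain ones, which matters under the class imbalance the paper targets.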
dc.identifier.citation16
dc.identifier.doi10.1016/j.cose.2022.102846
dc.identifier.issn0167-4048
dc.identifier.issn1872-6208
dc.identifier.scopus2-s2.0-85136643921
dc.identifier.scopusqualityQ1
dc.identifier.urihttps://doi.org/10.1016/j.cose.2022.102846
dc.identifier.urihttps://hdl.handle.net/20.500.12469/5644
dc.identifier.volume121en_US
dc.identifier.wosWOS:000881541300005
dc.identifier.wosqualityQ1
dc.language.isoenen_US
dc.publisherElsevier Advanced Technologyen_US
dc.relation.publicationcategoryMakale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanıen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectTransformeren_US
dc.subjectTokenization-freeen_US
dc.subjectAPI Callsen_US
dc.subjectImbalanceden_US
dc.subjectMulticlassen_US
dc.subjectBERTen_US
dc.subjectCANINEen_US
dc.subjectEnsembleen_US
dc.subjectMalware classificationen_US
dc.titleAn ensemble of pre-trained transformer models for imbalanced multiclass malware classificationen_US
dc.typeArticleen_US
dspace.entity.typePublication
relation.isAuthorOfPublicatione02bc683-b72e-4da4-a5db-ddebeb21e8e7
relation.isAuthorOfPublication.latestForDiscoverye02bc683-b72e-4da4-a5db-ddebeb21e8e7