An ensemble of pre-trained transformer models for imbalanced multiclass malware classification

dc.contributor.author Demirkiran, Ferhat
dc.contributor.author Cayir, Aykut
dc.contributor.author Unal, Ugur
dc.contributor.author Dag, Hasan
dc.contributor.other Management Information Systems
dc.contributor.other 03. Faculty of Economics, Administrative and Social Sciences
dc.contributor.other 01. Kadir Has University
dc.date.accessioned 2024-06-23T21:36:49Z
dc.date.available 2024-06-23T21:36:49Z
dc.date.issued 2022
dc.description Demirkiran, Ferhat/0000-0001-7335-9370; Unal, Ugur/0000-0001-6552-6044 en_US
dc.description.abstract Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Hence, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely used as features by machine learning and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine learning and deep learning models remain incapable of capturing sequence relationships among API calls. In contrast, transformer-based models process the sequences as a whole and learn relationships among API calls through multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that a Transformer model with a single transformer block surpasses the performance of the widely used base architecture, LSTM. Moreover, the pre-trained transformer models BERT and CANINE perform better in classifying highly imbalanced malware families according to the evaluation metrics F1-score and AUC score. Furthermore, our proposed bagging-based random transformer forest (RTF) model, an ensemble of BERT or CANINE models, reaches state-of-the-art evaluation scores on three out of four datasets; specifically, it achieves a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets. (C) 2022 Elsevier Ltd. All rights reserved. en_US
dc.identifier.citationcount 16
dc.identifier.doi 10.1016/j.cose.2022.102846
dc.identifier.issn 0167-4048
dc.identifier.issn 1872-6208
dc.identifier.scopus 2-s2.0-85136643921
dc.identifier.uri https://doi.org/10.1016/j.cose.2022.102846
dc.identifier.uri https://hdl.handle.net/20.500.12469/5644
dc.language.iso en en_US
dc.publisher Elsevier Advanced Technology en_US
dc.relation.ispartof Computers & Security
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject Transformer en_US
dc.subject Tokenization-free en_US
dc.subject API Calls en_US
dc.subject Imbalanced en_US
dc.subject Multiclass en_US
dc.subject BERT en_US
dc.subject CANINE en_US
dc.subject Ensemble en_US
dc.subject Malware classification en_US
dc.title An ensemble of pre-trained transformer models for imbalanced multiclass malware classification en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.id Demirkiran, Ferhat/0000-0001-7335-9370
gdc.author.id Unal, Ugur/0000-0001-6552-6044
gdc.author.institutional Dağ, Hasan
gdc.author.institutional Demirkıran, Ferhat
gdc.author.scopusid 57219836294
gdc.author.scopusid 56497768800
gdc.author.scopusid 57215332698
gdc.author.scopusid 6507328166
gdc.bip.impulseclass C3
gdc.bip.influenceclass C4
gdc.bip.popularityclass C4
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.description.department Kadir Has University en_US
gdc.description.departmenttemp [Demirkiran, Ferhat] Kadir Has Univ, Cyber Secur Grad Program, Istanbul, Turkey; [Cayir, Aykut] Huawei R&D Ctr, Istanbul, Turkey; [Cayir, Aykut; Unal, Ugur; Dag, Hasan] Kadir Has Univ, Management Informat Syst, Istanbul, Turkey en_US
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q1
gdc.description.startpage 102846
gdc.description.volume 121 en_US
gdc.description.wosquality Q1
gdc.identifier.openalex W4288070321
gdc.identifier.wos WOS:000881541300005
gdc.oaire.diamondjournal false
gdc.oaire.impulse 38.0
gdc.oaire.influence 5.19993E-9
gdc.oaire.isgreen true
gdc.oaire.keywords FOS: Computer and information sciences
gdc.oaire.keywords Computer Science - Machine Learning
gdc.oaire.keywords Computer Science - Cryptography and Security
gdc.oaire.keywords Artificial Intelligence (cs.AI)
gdc.oaire.keywords Computer Science - Artificial Intelligence
gdc.oaire.keywords Statistics - Machine Learning
gdc.oaire.keywords Machine Learning (stat.ML)
gdc.oaire.keywords Cryptography and Security (cs.CR)
gdc.oaire.keywords Machine Learning (cs.LG)
gdc.oaire.popularity 2.590277E-8
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 0202 electrical engineering, electronic engineering, information engineering
gdc.oaire.sciencefields 02 engineering and technology
gdc.openalex.fwci 6.93
gdc.openalex.normalizedpercentile 1.0
gdc.openalex.toppercent TOP 1%
gdc.opencitations.count 26
gdc.plumx.crossrefcites 40
gdc.plumx.mendeley 77
gdc.plumx.newscount 1
gdc.plumx.scopuscites 52
gdc.scopus.citedcount 52
gdc.wos.citedcount 35
relation.isAuthorOfPublication e02bc683-b72e-4da4-a5db-ddebeb21e8e7
relation.isAuthorOfPublication 695a8adc-2330-4d32-ab37-8b781716d609
relation.isAuthorOfPublication.latestForDiscovery e02bc683-b72e-4da4-a5db-ddebeb21e8e7
relation.isOrgUnitOfPublication ff62e329-217b-4857-88f0-1dae00646b8c
relation.isOrgUnitOfPublication acb86067-a99a-4664-b6e9-16ad10183800
relation.isOrgUnitOfPublication b20623fc-1264-4244-9847-a4729ca7508c
relation.isOrgUnitOfPublication.latestForDiscovery ff62e329-217b-4857-88f0-1dae00646b8c
