An ensemble of pre-trained transformer models for imbalanced multiclass malware classification

dc.authoridDemirkiran, Ferhat/0000-0001-7335-9370
dc.authoridUnal, Ugur/0000-0001-6552-6044
dc.authorscopusid57219836294
dc.authorscopusid56497768800
dc.authorscopusid57215332698
dc.authorscopusid6507328166
dc.contributor.authorDemirkiran, Ferhat
dc.contributor.authorCayir, Aykut
dc.contributor.authorUnal, Ugur
dc.contributor.authorDag, Hasan
dc.date.accessioned2024-06-23T21:36:49Z
dc.date.available2024-06-23T21:36:49Z
dc.date.issued2022
dc.departmentKadir Has Universityen_US
dc.department-temp[Demirkiran, Ferhat] Kadir Has Univ, Cyber Secur Grad Program, Istanbul, Turkey; [Cayir, Aykut] Huawei R&D Ctr, Istanbul, Turkey; [Cayir, Aykut; Unal, Gur; Dag, Hasan] Kadir Has Univ, Management Informat Syst, Istanbul, Turkeyen_US
dc.descriptionDemirkiran, Ferhat/0000-0001-7335-9370; Unal, Ugur/0000-0001-6552-6044en_US
dc.description.abstractClassification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Hence, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features in machine and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships among API calls. Unlike traditional machine and deep learning models, transformer-based models process the sequences as a whole and learn relationships among API calls via multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that a Transformer model with a single transformer block layer surpasses the performance of the widely used base architecture, LSTM. Moreover, the pre-trained transformer models BERT and CANINE outperform the alternatives in classifying highly imbalanced malware families according to the evaluation metrics F1-score and AUC. Furthermore, our proposed bagging-based random transformer forest (RTF) model, an ensemble of BERT or CANINE, reaches state-of-the-art evaluation scores on three out of four datasets; specifically, it achieves a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets. (C) 2022 Elsevier Ltd. All rights reserved.en_US
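The abstract describes RTF as a bagging-based ensemble of fine-tuned transformers. A minimal sketch of the bagging-and-soft-voting idea follows, using toy per-model probability matrices as stand-ins for the outputs of the actual BERT/CANINE members (the base-model names and data here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_samples, rng):
    # Draw a bootstrap sample (with replacement), as bagging does
    # to build each ensemble member's training set.
    return rng.integers(0, n_samples, size=n_samples)

def soft_vote(prob_list):
    # Average the class-probability matrices produced by the
    # ensemble members, then take the argmax class per sample.
    avg = np.mean(prob_list, axis=0)
    return avg.argmax(axis=1)

# Toy stand-in: 3 ensemble members, 4 samples, 2 classes.
probs = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]),
    np.array([[0.8, 0.2], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.7, 0.3], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]]),
]
print(soft_vote(probs).tolist())  # → [0, 1, 0, 1]
```

Averaging probabilities (soft voting) rather than taking a majority of hard labels lets confident members outweigh uncertain ones, which matters under the class imbalance the paper targets.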
dc.identifier.citation16
dc.identifier.doi10.1016/j.cose.2022.102846
dc.identifier.issn0167-4048
dc.identifier.issn1872-6208
dc.identifier.scopus2-s2.0-85136643921
dc.identifier.scopusqualityQ1
dc.identifier.urihttps://doi.org/10.1016/j.cose.2022.102846
dc.identifier.urihttps://hdl.handle.net/20.500.12469/5644
dc.identifier.volume121en_US
dc.identifier.wosWOS:000881541300005
dc.identifier.wosqualityQ1
dc.language.isoenen_US
dc.publisherElsevier Advanced Technologyen_US
dc.relation.publicationcategoryMakale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanıen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectTransformeren_US
dc.subjectTokenization-freeen_US
dc.subjectAPI Callsen_US
dc.subjectImbalanceden_US
dc.subjectMulticlassen_US
dc.subjectBERTen_US
dc.subjectCANINEen_US
dc.subjectEnsembleen_US
dc.subjectMalware classificationen_US
dc.titleAn ensemble of pre-trained transformer models for imbalanced multiclass malware classificationen_US
dc.typeArticleen_US
dspace.entity.typePublication
relation.isAuthorOfPublicatione02bc683-b72e-4da4-a5db-ddebeb21e8e7
relation.isAuthorOfPublication.latestForDiscoverye02bc683-b72e-4da4-a5db-ddebeb21e8e7