An ensemble of pre-trained transformer models for imbalanced multiclass malware classification

Date

2022

Publisher

Elsevier Advanced Technology

Green Open Access

Yes

Publicly Funded

No
Impulse
Top 1%
Influence
Top 10%
Popularity
Top 1%

Abstract

Classifying malware into families is crucial for understanding how malware infects devices, computers, and systems. Malware identification therefore enables security researchers and incident responders to take precautions against malware and to accelerate mitigation. Because API call sequences represent the behavior of malware, they are widely used as features by machine learning and deep learning models for malware classification. However, traditional machine learning and deep learning models remain incapable of capturing sequence relationships among API calls. In contrast, transformer-based models process each sequence as a whole and learn relationships among API calls through multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that a Transformer model with a single transformer block surpasses the performance of the widely used base architecture, LSTM. Moreover, the pre-trained transformer models BERT and CANINE achieve better performance in classifying highly imbalanced malware families according to two evaluation metrics, F1-score and AUC score. Furthermore, our proposed bagging-based random transformer forest (RTF) model, an ensemble of BERT or CANINE models, reaches state-of-the-art evaluation scores on three of the four datasets; in particular, it achieves a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets. © 2022 Elsevier Ltd. All rights reserved.
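
The bagging idea behind an ensemble such as RTF can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how member predictions are combined, so majority voting is assumed here; `OverlapStub` is a hypothetical stand-in for a fine-tuned BERT or CANINE classifier; and the family labels and the tiny `TRAIN` set are invented for illustration only.

```python
import random
from collections import Counter

# Toy training set: (API call sequence, malware family). Illustrative only.
TRAIN = [
    (["CreateFileW", "WriteFile"], "dropper"),
    (["CreateFileW", "WriteFile", "CloseHandle"], "dropper"),
    (["RegCreateKeyExW", "RegSetValueExW"], "spyware"),
    (["GetAsyncKeyState", "RegSetValueExW"], "spyware"),
]


def bootstrap_sample(items, rng):
    """Draw len(items) examples with replacement (the bagging step)."""
    return [rng.choice(items) for _ in items]


class OverlapStub:
    """Hypothetical base learner standing in for a fine-tuned transformer.

    It predicts the label of the training sequence that shares the most
    API calls with the query sequence.
    """

    def fit(self, samples):
        self.samples = list(samples)
        return self

    def predict(self, api_calls):
        query = set(api_calls)
        best = max(self.samples, key=lambda s: len(query & set(s[0])))
        return best[1]


def train_bagged_ensemble(train, n_models, make_model, seed=0):
    """Fit each ensemble member on its own bootstrap sample."""
    rng = random.Random(seed)
    return [make_model().fit(bootstrap_sample(train, rng))
            for _ in range(n_models)]


def vote(models, api_calls):
    """Combine member predictions by majority vote (an assumption)."""
    votes = Counter(model.predict(api_calls) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    ensemble = train_bagged_ensemble(TRAIN, n_models=5, make_model=OverlapStub)
    print(vote(ensemble, ["CreateFileW", "WriteFile"]))
```

In a realistic setting each member would be a separately fine-tuned transformer, and class probabilities might be averaged instead of counting votes; either choice is compatible with the bagging scheme sketched here.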

Description

Demirkiran, Ferhat/0000-0001-7335-9370; Unal, Ugur/0000-0001-6552-6044

Keywords

Transformer, Tokenization-free, API Calls, Imbalanced, Multiclass, BERT, CANINE, Ensemble, Malware classification, FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Cryptography and Security, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Cryptography and Security (cs.CR), Machine Learning (cs.LG)

Fields of Science

0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology

WoS Q

Q1

Scopus Q

Q1
OpenCitations Citation Count
26

Source

Computers & Security

Volume

121

Start Page

102846

PlumX Metrics
Citations

CrossRef : 40

Scopus : 57

Captures

Mendeley Readers : 80

SCOPUS™ Citations

58

checked on Feb 06, 2026

Web of Science™ Citations

39

checked on Feb 06, 2026

Page Views

6

checked on Feb 06, 2026

OpenAlex FWCI
11.11256532
