Building high-quality datasets of packed executables: Enhancing static detection models via curated packed binary datasets

(2025)

Files

Alloin_02372201_2025.pdf
  • Embargoed access from 2026-09-02
  • Adobe PDF
  • 2.07 MB

Details

Supervisors
Faculty
Degree label
Abstract
Executable packing refers to a set of techniques applied to executable binaries, including compression, encryption, virtualization, and other operations. These techniques are used both for legitimate purposes, such as protection against software piracy and license management, and for malicious ones, such as distributing harmful software. Packing methods continue to evolve, becoming increasingly complex and sophisticated, particularly at evading antivirus detection when applied to malicious binaries. As a result, packing poses a significant challenge in malware analysis, especially for accurate detection. The Packing Box tool was developed to generate datasets by mass-packing batches of samples and to combine or filter datasets. The main aim of this experimental toolkit is to identify the most effective machine learning models for static detection of executable packing. While many functionalities of the toolkit are already well developed, it still lacks the ability to evaluate dataset quality and some of the functionality needed to generate large datasets, and studies based on this framework still rely on small, unevaluated datasets. The lack of datasets of proven quality for studying packing detection significantly hinders progress in the field. Without reliable and consistent datasets, it is difficult to compare detection methods, especially machine learning models, in a meaningful way. As a result, the effectiveness of packing detection techniques remains uncertain, and the development of more robust and accurate models is slowed. Standardized datasets of proven high quality are essential not only for evaluating existing approaches but also for driving innovation in the detection of packed executables.
The aim of this master’s thesis is to address the lack of standardized, high-quality datasets by creating and improving dataset generation techniques, implementing robust methods for dataset validation, and assessing the quality of the resulting datasets to ensure their reliability and effectiveness for training and evaluating packing detection models. These techniques are then integrated into the Packing Box to enhance its capabilities.