Acknowledgements

AI Malware Guardian's detection models were trained using publicly available research datasets. We gratefully acknowledge the authors of these datasets and their contributions to the security research community. All datasets are used strictly in accordance with their respective licenses.

Open Research Datasets

COMISET

Full title: COMISET: Dataset for the analysis of malicious events in Windows systems

Authors: Pérez-Sánchez, A., Palacios, R., & López, G. (2025)

Institution: Comillas Pontifical University

DOI: 10.5281/zenodo.15375146

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Usage: COMISET was used to train the behavioral pattern detection component of AI Malware Guardian. The original dataset was not modified or redistributed — only derived model weights are incorporated into the product.

Citation:

Pérez-Sánchez, A., Palacios, R., & López, G. (2025). COMISET: Dataset for the
analysis of malicious events in Windows systems [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.15375146

Binary-30K

Full title: Binary-30K: A Large-Scale Multi-Platform Binary Dataset for Machine Learning Research

Author: Bommarito, Michael J., II (2025)

Dataset: huggingface.co/datasets/mjbommar/binary-30k-tokenized

Paper: github.com/mjbommar/binary-bpe-paper

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Usage: Binary-30K was used to train the static PE file anomaly detection baseline model in AI Malware Guardian. The original dataset was not modified or redistributed — only derived model weights are incorporated into the product.

Citation:

@article{bommarito2025binary30k,
  title={Binary-30K: A Large-Scale Multi-Platform Binary Dataset for Machine Learning Research},
  author={Bommarito, Michael J., II},
  journal={arXiv preprint},
  year={2025},
  url={https://github.com/mjbommar/binary-bpe-paper}
}

LANL Comprehensive Multi-Source Cyber-Security Events (cyber1)

Full title: Comprehensive Multi-Source Cyber-Security Events

Author: Kent, Alexander D.

Institution: Los Alamos National Laboratory

Year: 2015

Dataset URL: csr.lanl.gov/data/cyber1/

License: CC0 1.0 Universal (Public Domain Dedication) — Los Alamos National Laboratory has waived all copyright and related rights. No attribution is legally required; acknowledgement is provided voluntarily.

Usage: The process start/stop event logs (proc.txt.gz) and labeled red team ground truth events (redteam.txt.gz) from this dataset were used to augment training data for the behavioral detection component of AI Malware Guardian. The original dataset was not modified or redistributed — only derived model weights are incorporated into the product.

Citation:

Kent, A. D. (2015). Comprehensive Multi-Source Cyber-Security Events [Data set].
Los Alamos National Laboratory. https://csr.lanl.gov/data/cyber1/

Benign & Malicious PE Files

Full title: Benign & Malicious PE Files

Author: Mauricio (Kaggle: amauricio)

Dataset URL: kaggle.com/datasets/amauricio/pe-files-malwares

License: CC0 1.0 Universal (Public Domain Dedication) — No attribution legally required; acknowledgement is provided voluntarily.

Usage: Static PE structural features from this dataset were used to augment training data for the static PE anomaly detection component of AI Malware Guardian. The dataset provides 79 features extracted via the pefile Python library from benign and malicious Windows executables, covering the DOS Header, File Header, Optional Header, and section metadata. The original dataset was not modified or redistributed; only derived model weights are incorporated into the product.
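For readers unfamiliar with the header fields named above, the following is a minimal sketch of where a few of them sit in a PE file. It is not the dataset author's extraction code and not the product's pipeline: the dataset used the pefile library, whereas this sketch swaps in Python's standard struct module so that it runs without third-party packages. The function name read_pe_header_features and the helper make_demo_pe are hypothetical, and only a handful of DOS Header and File Header fields are read, not the full set of 79 features.

```python
import struct


def read_pe_header_features(data: bytes) -> dict:
    """Read a few DOS Header and File Header fields from raw PE bytes.

    Hypothetical helper for illustration only; it covers a small subset
    of the header fields and none of the section metadata.
    """
    # The DOS Header is 64 bytes long and starts with the "MZ" signature.
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")

    # e_lfanew (offset 0x3C) stores the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")

    # The 20-byte COFF File Header follows the 4-byte PE signature.
    (machine, number_of_sections, time_date_stamp, _symbol_table_ptr,
     _number_of_symbols, size_of_optional_header,
     characteristics) = struct.unpack_from("<HHIIIHH", data, e_lfanew + 4)

    return {
        "e_lfanew": e_lfanew,
        "Machine": machine,
        "NumberOfSections": number_of_sections,
        "TimeDateStamp": time_date_stamp,
        "SizeOfOptionalHeader": size_of_optional_header,
        "Characteristics": characteristics,
    }


def make_demo_pe() -> bytes:
    """Build synthetic header bytes (not a runnable executable) for the demo."""
    dos_header = bytearray(0x40)
    dos_header[:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 0x40)  # PE signature at 0x40
    file_header = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 0xE0, 0x0102)
    return bytes(dos_header) + b"PE\x00\x00" + file_header


if __name__ == "__main__":
    for name, value in read_pe_header_features(make_demo_pe()).items():
        print(f"{name}: {value:#x}")
```

The demo input is synthetic, so no real executable is needed to try the sketch. A production extractor would more likely rely on pefile, which also parses the Optional Header and the section table.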

Citation:

Mauricio. (n.d.). Benign & Malicious PE Files [Data set]. Kaggle.
https://www.kaggle.com/datasets/amauricio/pe-files-malwares

Malware & Goodware Dynamic Analysis Reports

Full title: Malware & Goodware Dynamic Analysis Reports

Author: Greimas (Kaggle)

Year: 2023

Dataset URL: kaggle.com/datasets/greimas/malware-and-goodware-dynamic-analysis-reports

License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Attribution is required. ShareAlike applies to derivative datasets, not to trained machine learning models.

Usage: Dynamic analysis reports from approximately 26,200 Windows PE samples (8,600 goodware and 17,675 malware), executed in Windows 7 VMware virtual machines via the CAPEv2 sandbox, were used to augment training data for the behavioral detection component of AI Malware Guardian. The original dataset was not modified or redistributed; only derived model weights are incorporated into the product.

Citation:

Greimas. (2023). Malware & Goodware Dynamic Analysis Reports [Data set]. Kaggle.
https://www.kaggle.com/datasets/greimas/malware-and-goodware-dynamic-analysis-reports
Licensed under CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/

Quo Vadis: Dynamic Malware Analysis Dataset

Full title: Quo Vadis: Dynamic Malware Analysis Dataset (Malware and Benignware Behavioral Reports from Speakeasy Emulator)

Author: Trizna, Dmitrijs

Year: 2022

DOI: 10.1145/3560830.3563726

Dataset URL: kaggle.com/datasets/dmitrijstrizna/quo-vadis-malware-emulation

License: GNU Lesser General Public License, version 3.0 (LGPL-3.0). The LGPL governs distribution of the dataset itself. This product uses the dataset solely as training input; the original dataset was not modified, incorporated into source code, or redistributed in any form — only derived model weights are incorporated into the product.

Usage: Behavioral emulation reports for 93,533 32-bit Windows PE files were used to augment training data for the behavioral detection component of AI Malware Guardian. The reports were generated via the Speakeasy Windows emulator and cover seven malware families (backdoor, coinminer, dropper, keylogger, ransomware, RAT, and trojan) plus benign samples.

Citation:

@inproceedings{10.1145/3560830.3563726,
  author    = {Trizna, Dmitrijs},
  title     = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual
               and Behavioral Malware Representations},
  year      = {2022},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3560830.3563726},
  doi       = {10.1145/3560830.3563726},
  booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence
               and Security},
  pages     = {127--136},
  series    = {AISec'22}
}

License Texts

The Creative Commons Attribution 4.0 International license requires that attribution be provided when the licensed work is shared. The full text of the CC BY 4.0 license is available at creativecommons.org/licenses/by/4.0/legalcode.

The Creative Commons Attribution-ShareAlike 4.0 International license requires attribution and that derivative datasets be shared under the same terms. The full text of the CC BY-SA 4.0 license is available at creativecommons.org/licenses/by-sa/4.0/legalcode. Note: ShareAlike applies only to derivative datasets, not to machine learning models trained on the data.

The GNU Lesser General Public License 3.0 governs distribution of the Quo Vadis dataset. This product does not distribute, incorporate, or link against the dataset — it was used solely as training input. The full text of the LGPL 3.0 is available at gnu.org/licenses/lgpl-3.0.html.

Third-Party Software Notices

AI Malware Guardian is built using open-source software components. The following notices are provided to fulfill the requirements of each component's license. A complete NOTICES.txt file is also bundled alongside the application in the installation directory.

ONNX Runtime

Licensed under the MIT License.

Source: github.com/microsoft/onnxruntime

Tauri

Licensed under the MIT License or Apache License 2.0.

Source: github.com/tauri-apps/tauri

Additional Open-Source Components

This product also includes other open-source libraries, each licensed under the MIT License, the Apache License 2.0, or both. Full copyright notices are listed in the bundled NOTICES.txt file.