AI Malware Guardian's detection models were trained using publicly available research datasets. We gratefully acknowledge the authors of these datasets and their contributions to the security research community. All datasets are used strictly in accordance with their respective licenses.
Full title: COMISET: Dataset for the analysis of malicious events in Windows systems
Authors: Pérez-Sánchez, A., Palacios, R., & López, G. (2025)
Institution: Comillas Pontifical University
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Usage: COMISET was used to train the behavioral pattern detection component of AI Malware Guardian. The original dataset was not modified or redistributed — only derived model weights are incorporated into the product.
Citation:
Pérez-Sánchez, A., Palacios, R., & López, G. (2025). COMISET: Dataset for the analysis of malicious events in Windows systems [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15375146
Full title: Binary-30K: A Large-Scale Multi-Platform Binary Dataset for Machine Learning Research
Author: Bommarito, Michael J., II (2025)
Dataset: huggingface.co/datasets/mjbommar/binary-30k-tokenized
Paper: github.com/mjbommar/binary-bpe-paper
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Usage: Binary-30K was used to train the static PE file anomaly detection baseline model in AI Malware Guardian. The original dataset was not modified or redistributed — only derived model weights are incorporated into the product.
Citation:
@article{bommarito2025binary30k,
title={Binary-30K: A Large-Scale Multi-Platform Binary Dataset for Machine Learning Research},
author={Bommarito, Michael J., II},
journal={arXiv preprint},
year={2025},
url={https://github.com/mjbommar/binary-bpe-paper}
}
Full title: Comprehensive Multi-Source Cyber-Security Events
Author: Kent, Alexander D.
Institution: Los Alamos National Laboratory
Year: 2015
Dataset URL: csr.lanl.gov/data/cyber1/
License: CC0 1.0 Universal (Public Domain Dedication) — Los Alamos National Laboratory has waived all copyright and related rights. No attribution is legally required; acknowledgement is provided voluntarily.
Usage: The process start/stop event logs (proc.txt.gz) and labeled red team ground truth events (redteam.txt.gz) from this dataset were used to augment training data for the behavioral detection component of AI Malware Guardian. The original dataset was not modified or redistributed — only derived model weights are incorporated into the product.
Citation:
Kent, A. D. (2015). Comprehensive Multi-Source Cyber-Security Events [Data set]. Los Alamos National Laboratory. https://csr.lanl.gov/data/cyber1/
Full title: Benign & Malicious PE Files
Author: Mauricio (Kaggle: amauricio)
Dataset URL: kaggle.com/datasets/amauricio/pe-files-malwares
License: CC0 1.0 Universal (Public Domain Dedication) — No attribution legally required; acknowledgement is provided voluntarily.
Usage: Seventy-nine static PE structural features, extracted with the pefile Python library from benign and malicious Windows executables and covering the DOS Header, File Header, Optional Header, and section metadata, were used to augment training data for the static PE anomaly detection component of AI Malware Guardian. The original dataset was not modified or redistributed; only derived model weights are incorporated into the product.
Citation:
Mauricio. (n.d.). Benign & Malicious PE Files [Data set]. Kaggle. https://www.kaggle.com/datasets/amauricio/pe-files-malwares
Full title: Malware & Goodware Dynamic Analysis Reports
Author: Greimas (Kaggle)
Year: 2023
Dataset URL: kaggle.com/datasets/greimas/malware-and-goodware-dynamic-analysis-reports
License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Attribution is required. ShareAlike applies to derivative datasets, not to trained machine learning models.
Usage: Dynamic analysis reports from approximately 26,200 Windows PE samples (8,600 goodware and 17,675 malware), executed in Windows 7 VMware virtual machines under the CAPEv2 sandbox, were used to augment training data for the behavioral detection component of AI Malware Guardian. The original dataset was not modified or redistributed; only derived model weights are incorporated into the product.
Citation:
Greimas. (2023). Malware & Goodware Dynamic Analysis Reports [Data set]. Kaggle. https://www.kaggle.com/datasets/greimas/malware-and-goodware-dynamic-analysis-reports
Licensed under CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/
Full title: Quo Vadis: Dynamic Malware Analysis Dataset (Malware and Benignware Behavioral Reports from Speakeasy Emulator)
Author: Trizna, Dmitrijs
Year: 2022
Dataset URL: kaggle.com/datasets/dmitrijstrizna/quo-vadis-malware-emulation
License: GNU Lesser General Public License, version 3.0 (LGPL-3.0). The LGPL governs distribution of the dataset itself. This product uses the dataset solely as training input; the original dataset was not modified, incorporated into source code, or redistributed in any form — only derived model weights are incorporated into the product.
Usage: Behavioral emulation reports for 93,533 32-bit Windows PE files were used to augment training data for the behavioral detection component of AI Malware Guardian. The reports were generated with the Speakeasy Windows emulator and cover seven malware families (backdoor, coinminer, dropper, keylogger, ransomware, RAT, and trojan) plus benign samples.
Citation:
@inproceedings{10.1145/3560830.3563726,
author = {Trizna, Dmitrijs},
title = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual
and Behavioral Malware Representations},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3560830.3563726},
doi = {10.1145/3560830.3563726},
booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence
and Security},
pages = {127--136},
series = {AISec'22}
}
The Creative Commons Attribution 4.0 International license requires that attribution be provided when the licensed work is shared. The full text of the CC BY 4.0 license is available at creativecommons.org/licenses/by/4.0/legalcode.
The Creative Commons Attribution-ShareAlike 4.0 International license requires attribution and requires that derivative datasets be shared under the same terms. ShareAlike applies only to derivative datasets, not to machine learning models trained on the data. The full text of the CC BY-SA 4.0 license is available at creativecommons.org/licenses/by-sa/4.0/legalcode.
The GNU Lesser General Public License 3.0 governs distribution of the Quo Vadis dataset. This product does not distribute, incorporate, or link against the dataset — it was used solely as training input. The full text of the LGPL 3.0 is available at gnu.org/licenses/lgpl-3.0.html.
AI Malware Guardian is built using open-source software components. The following notices are provided to fulfill the requirements of each component's license. A complete NOTICES.txt file is also bundled alongside the application in the installation directory.
Copyright © Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Source: github.com/microsoft/onnxruntime
Copyright © 2019–2024 Tauri Programme within The Commons Conservancy.
Licensed under the MIT License or Apache License 2.0.
Source: github.com/tauri-apps/tauri
This product also includes the following open-source libraries, each licensed under the MIT License and/or Apache License 2.0. Full copyright notices are listed in the bundled NOTICES.txt file.