The Malware Open-source Threat Intelligence Family dataset contains 3,095 disarmed PE malware samples from 454 families

Related tags

Deep LearningMOTIF
Overview

MOTIF Dataset

The Malware Open-source Threat Intelligence Family (MOTIF) dataset contains 3,095 disarmed PE malware samples from 454 families, labeled with ground truth confidence. Family labels were obtained by surveying thousands of open-source threat reports published by 14 major cybersecurity organizations between Jan. 1st, 2016 Jan. 1st, 2021. The dataset also provides a comprehensive alias mapping for each family and EMBER raw features for each file.

Further information about the MOTIF dataset is provided in our paper.

If you use the provided data or code, please make sure to cite our paper:

@misc{joyce2021motif,
      title={MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels},
      author={Robert J. Joyce and Dev Amlani and Charles Nicholas and Edward Raff},
      year={2021},
      eprint={2111.15031},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Downloading the Dataset

Due to the size of the dataset, you must use Git LFS in order to clone the repository. Installation instructions for Git LFS are linked here. On Debian-based systems, the Git LFS package can be installed using:

sudo apt-get install git-lfs

Once Git LFS is installed, you can clone this repository using:

git lfs clone https://github.com/boozallen/MOTIF.git

Dataset Contents

The main dataset is located in dataset/ and contains the following files:

motif_dataset.jsonl

Each line of motif_dataset.jsonl is a .json object with the following entries:

Name Description
md5 MD5 hash of malware sample
sha1 SHA-1 hash of malware sample
sha256 SHA-256 hash of malware sample
reported_hash Hash of malware sample provided in report
reported_family Normalized family name provided in report
aliases List of known aliases for family
label Unique id for malware family (for ML purposes)
report_source Name of organization that published report
report_date Date report was published
report_url URL of report
report_ioc_url URL to report appendix (if any)
appeared Year and month malware sample was first seen

Each .json object also contains EMBER raw features (version 2) for the file:

Name Description
histogram EMBER histogram
byteentropy EMBER byte histogram
strings EMBER strings metadata
general EMBER general file metadata
header EMBER PE header metadata
section EMBER PE section metadata
imports EMBER imports metadata
exports EMBER exports metadata
datadirectories EMBER data directories metadata

motif_families.csv

This file contains an alias mapping for each of the 454 malware families in the MOTIF dataset. It also contains a succinct description of the family and the threat group or campaign that the family is attributed to (if any).

Column Description
Aliases List of known aliases for family
Description Brief sentence describing capabilities of malware family
Attribution (If any) Name of threat actor malware/campaign is attributed to

motif_reports.csv

This file provides information gathered from our original survey of open-source threat reports. We identified 4,369 malware hashes with 595 distinct reported family names during the survey, but we were unable to obtain some of the files and we restricted the MOTIF dataset to only files in the PE file format. The reported hash, family, source, date, URL, and IOC URL of any malware samples which did not make it into the final MOTIF dataset are located here.

MOTIF.7z

The disarmed malware samples are provided in this 1.47GB encrypted .7z file, which can be unzipped using the following password:

i_assume_all_risk_opening_malware

Each file is named in the format MOTIF_MD5, with MD5 indicating the file's hash prior to when it was disarmed.

X_train.dat and y_train.dat

EMBERv2 feature vectors and labels are provided in X_train.dat and y_train.dat, respectively. Feature vectors were computed using LIEF v0.9.0. These files are named for compatibility with the EMBER read_vectorized_features() function. MOTIF is not split into a training or test set, and X_train.dat and y_train.dat contain feature vectors and labels for the entire dataset.

Benchmark Models

We provide code for training the ML models described in our paper, located in benchmarks/. To support these models, code for modified versions of MalConv2 is included in the MalConv2/ directory.

Requirements:

Packages required for training the ML models can be installed using the following commands:

pip3 install -r requirements.txt
python3 setup.py install

Training the LightGBM or outlier detection models also requires EMBER:

pip3 install git+https://github.com/elastic/ember.git

Training the models:

The LightGBM model can be trained using the following command, where /path/to/MOTIF/dataset/ indicates the path to the dataset/ directory.

python3 lgbm.py /path/to/MOTIF/dataset/

The MalConv2 model can be trained using the following command, where /path/to/MOTIF/MOTIF_defanged/ indicates the path to the unzipped folder containing the disarmed malware samples:

python3 malconv.py /path/to/MOTIF/MOTIF_defanged/ /path/to/MOTIF/dataset/motif_dataset.jsonl

The three outlier detection models can be trained using the following command:

python3 outliers.py /path/to/MOTIF/dataset/

Proper Use of Data

Use of this dataset must follow the provided terms of licensing. We intend this dataset to be used for research purposes and have taken measures to prevent abuse by attackers. All files are prevented from running using the same technique as the SOREL dataset. We refer to their statement regarding safety and abuse of the data.

The malware we’re releasing is “disarmed” so that it will not execute. This means it would take knowledge, skill, and time to reconstitute the samples and get them to actually run. That said, we recognize that there is at least some possibility that a skilled attacker could learn techniques from these samples or use samples from the dataset to assemble attack tools to use as part of their malicious activities. However, in reality, there are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use. In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers.

Owner
Booz Allen Hamilton
The official GitHub organization of Booz Allen Hamilton
Booz Allen Hamilton
Using CNN to mimic the driver based on training data from Torcs

Behavioural-Cloning-in-autonomous-driving Using CNN to mimic the driver based on training data from Torcs. Approach First, the data was collected from

Sudharshan 2 Jan 05, 2022
李云龙二次元风格化!打滚卖萌,使用了animeGANv2进行了视频的风格迁移

李云龙二次元风格化!一键star、fork,你也可以生成这样的团长! 打滚卖萌求star求fork! 0.效果展示 视频效果前往B站观看效果最佳:李云龙二次元风格化: github开源repo:李云龙二次元风格化 百度AIstudio开源地址,一键fork即可运行: 李云龙二次元风格化!一键fork

oukohou 44 Dec 04, 2022
Calculates carbon footprint based on fuel mix and discharge profile at the utility selected. Can create graphs and tabular output for fuel mix based on input file of series of power drawn over a period of time.

carbon-footprint-calculator Conda distribution ~/anaconda3/bin/conda install anaconda-client conda-build ~/anaconda3/bin/conda config --set anaconda_u

Seattle university Renewable energy research 7 Sep 26, 2022
Small little script to scrape, parse and check for active tor nodes. Can be used as proxies.

TorScrape TorScrape is a small but useful script made in python that scrapes a website for active tor nodes, parse the html and then save the nodes in

5 Dec 04, 2022
Free like Freedom

This is all very much a work in progress! More to come! ( We're working on it though! Stay tuned!) Installation Open an Anaconda Prompt (in Windows, o

2.3k Jan 04, 2023
The ICS Chat System project for NYU Shanghai Fall 2021

ICS_Chat_System [Catenger] This is the ICS Chat System project for NYU Shanghai Fall 2021 Creators: Shavarsh Melikyan, Skyler Chen and Arghya Sarkar,

1 Dec 20, 2021
Pop-Out Motion: 3D-Aware Image Deformation via Learning the Shape Laplacian (CVPR 2022)

Pop-Out Motion Pop-Out Motion: 3D-Aware Image Deformation via Learning the Shape Laplacian (CVPR 2022) Jihyun Lee*, Minhyuk Sung*, Hyunjin Kim, Tae-Ky

Jihyun Lee 88 Nov 22, 2022
Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models This repo contains code for DDPM training. Based on Denoising Diffusion Probabilistic Models, Improved Denois

Alexander Markov 7 Dec 15, 2022
Fully-automated scripts for collecting AI-related papers

AI-Paper-collector Fully-automated scripts for collecting AI-related papers List of Conferences to crawel ACL: 21-19 (including findings) EMNLP: 21-19

Gordon Lee 776 Jan 08, 2023
Towards Flexible Blind JPEG Artifacts Removal (FBCNN, ICCV 2021)

Towards Flexible Blind JPEG Artifacts Removal (FBCNN, ICCV 2021) Jiaxi Jiang, Kai Zhang, Radu Timofte Computer Vision Lab, ETH Zurich, Switzerland 🔥

Jiaxi Jiang 282 Jan 02, 2023
1st ranked 'driver careless behavior detection' for AI Online Competition 2021, hosted by MSIT Korea.

2021AICompetition-03 본 repo 는 mAy-I Inc. 팀으로 참가한 2021 인공지능 온라인 경진대회 중 [이미지] 운전 사고 예방을 위한 운전자 부주의 행동 검출 모델] 태스크 수행을 위한 레포지토리입니다. mAy-I 는 과학기술정보통신부가 주최하

Junhyuk Park 9 Dec 01, 2022
Train an imgs.ai model on your own dataset

imgs.ai is a fast, dataset-agnostic, deep visual search engine for digital art history based on neural network embeddings.

Fabian Offert 5 Dec 21, 2021
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

The Hypersim Dataset For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real i

Apple 1.3k Jan 04, 2023
RM Operation can equivalently convert ResNet to VGG, which is better for pruning; and can help RepVGG perform better when the depth is large.

RM Operation can equivalently convert ResNet to VGG, which is better for pruning; and can help RepVGG perform better when the depth is large.

184 Jan 04, 2023
Camera calibration & 3D pose estimation tools for AcinoSet

AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild Daniel Joska, Liam Clark, Naoya Muramatsu, Ricardo Jericevich, Fre

African Robotics Unit 42 Nov 16, 2022
School of Artificial Intelligence at the Nanjing University (NJU)School of Artificial Intelligence at the Nanjing University (NJU)

F-Principle This is an exercise problem of the digital signal processing (DSP) course at School of Artificial Intelligence at the Nanjing University (

Thyrix 5 Nov 23, 2022
Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation

Tiny-NewsRec The source codes for our paper "Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation". Requirements PyTorch == 1.6.0 Tensor

Yang Yu 3 Dec 07, 2022
Code for BMVC2021 paper "Boundary Guided Context Aggregation for Semantic Segmentation"

Boundary-Guided-Context-Aggregation Boundary Guided Context Aggregation for Semantic Segmentation Haoxiang Ma, Hongyu Yang, Di Huang In BMVC'2021 Pape

Haoxiang Ma 31 Jan 08, 2023
A lightweight tool to get an AI Infrastructure Stack up in minutes not days.

K3ai will take care of setup K8s for You, deploy the AI tool of your choice and even run your code on it.

k3ai 105 Dec 04, 2022