A curated list of amazingly awesome Cybersecurity datasets

Overview

Awesome-Cybersecurity-Datasets

Please contribute to this list with new datasets by sending me a pull request or by contacting me at @santiagohramos.

Happy learning!

Datasets

Network traffic

  • Unified Host and Network Dataset - The Unified Host and Network Dataset is a subset of network and computer (host) events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days. The host event logs originated from most enterprise computers running the Microsoft Windows operating system on Los Alamos National Laboratory's (LANL) enterprise network. The network event data originated from many of the internal enterprise routers within the LANL enterprise network.
  • Comprehensive, Multi-Source Cyber-Security Events - This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory's corporate, internal computer network.
  • User-Computer Authentication Associations in Time - This anonymized data set encompasses 9 continuous months and represents 708,304,516 successful authentication events from users to computers collected from the Los Alamos National Laboratory (LANL) enterprise network.
  • Canadian Institute for Cybersecurity datasets - Canadian Institute for Cybersecurity datasets are used around the world by universities, private industry and independent researchers.
  • KDD Cup 1999 Data - This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
  • 2017-SUEE-data-set - The data sets contain traffic in and out of the web server of the Student Union for Electrical Engineering (Fachbereichsvertretung Elektrotechnik) at Ulm University. Internal hosts are hosts from within the university network; some are cable-bound, while others connect through one of two Wi-Fi services on campus (eduroam and welcome). The data was mixed with attack traffic.
  • CTU-13 Dataset - A Labeled Dataset with Botnet, Normal and Background traffic.
  • PCAP files - Malware Traffic, Network Forensics, SCADA/ICS Network Captures, Packet Injection Attacks / Man-on-the-Side Attacks...
  • pcapt - Big repository of PCAP files.
  • Project Sonar - Project Sonar produces multiple UDP datasets every month. This data is gathered by sending protocol-specific UDP probes across the entire IPv4 address space. The types of probes sent each week continue to expand as the project matures.
  • IoT devices captures - This dataset represents the traffic emitted during the setup of 31 smart home IoT devices of 27 different types (4 types are represented by 2 devices each). Each setup was repeated at least 20 times per device-type.
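The counts quoted in the IoT devices captures entry can be cross-checked with simple arithmetic. The sketch below uses only the figures from that description; the function names are illustrative and do not come from the dataset itself. It confirms that 27 types with 4 doubled types give 31 devices, and derives a lower bound on the number of setup captures from "at least 20 times per device-type".

```python
# Figures quoted in the dataset description above.
DEVICE_TYPES = 27
TYPES_WITH_TWO_DEVICES = 4
MIN_SETUPS_PER_TYPE = 20


def device_count(types: int, doubled_types: int) -> int:
    """Total devices when some types are represented by two devices each."""
    single_types = types - doubled_types
    return single_types + 2 * doubled_types


def min_setup_captures(types: int, min_setups: int) -> int:
    """Lower bound on setup captures if every type is repeated min_setups times."""
    return types * min_setups


total_devices = device_count(DEVICE_TYPES, TYPES_WITH_TWO_DEVICES)
lower_bound = min_setup_captures(DEVICE_TYPES, MIN_SETUPS_PER_TYPE)
print(f"devices: {total_devices}, setup captures: at least {lower_bound}")
```

The lower bound is only that: the description says "at least 20", so the real number of captures may be higher.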

Malware

  • UNSW-NB15 data set - This data set has nine families of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools are used, and twelve algorithms were developed, to generate a total of 49 features with the class label.
  • Malware Training Sets - As of the blog post date, the collected classified dataset comprises the following samples: APT1: 292 samples; Crypto: 2024 samples; Locker: 434 samples; Zeus: 2014 samples.
  • The Drebin Dataset - The dataset contains 5,560 applications from 179 different malware families. The samples have been collected in the period of August 2010 to October 2012 and were made available to us by the MobileSandbox project.
  • Stratosphere IPS - Malware captures, Normal captures, mixed captures...
  • Microsoft Malware Classification Challenge - You are provided with a set of known malware files representing a mix of 9 different families. Each malware file has an Id, a 20-character hash value uniquely identifying the file, and a Class, an integer representing one of 9 family names.
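The label structure described for the Microsoft Malware Classification Challenge (an `Id` of 20 characters and an integer `Class` for one of 9 families) lends itself to a quick sanity check before training. The following is a minimal sketch under stated assumptions: the CSV header `Id,Class`, the class range 1 to 9, and the sample rows are all hypothetical and chosen for illustration, not taken from the challenge files.

```python
import csv
import io
from collections import Counter

ID_LENGTH = 20      # "a 20 character hash value", per the description
NUM_FAMILIES = 9    # "one of 9 family names"; the 1..9 range is an assumption


def load_labels(csv_text: str) -> dict:
    """Parse Id/Class rows and reject malformed or duplicate entries."""
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        file_id, family = row["Id"], int(row["Class"])
        if len(file_id) != ID_LENGTH:
            raise ValueError(f"unexpected Id length: {file_id!r}")
        if not 1 <= family <= NUM_FAMILIES:
            raise ValueError(f"unexpected Class value: {family}")
        if file_id in labels:
            raise ValueError(f"duplicate Id: {file_id!r}")
        labels[file_id] = family
    return labels


# Hypothetical sample rows; real Ids are hashes, not sequential strings.
SAMPLE_CSV = "Id,Class\n" + "".join(
    f"sampleid{n:012d},{family}\n"
    for n, family in enumerate([1, 3, 3, 9], start=1)
)

labels = load_labels(SAMPLE_CSV)
family_counts = Counter(labels.values())
print(sorted(family_counts.items()))
```

Counting samples per family up front also exposes class imbalance, which matters when choosing an evaluation metric.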

Software

  • Javascript Vulnerability dataset - Dataset constructed from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub.

WebApps

  • West Point NSA Data Sets - Snort Intrusion Detection Log. Domain Name Service Logs. Web Server Logs. Log Server Aggregate Log.
  • Web Attack Payloads - A collection of web attack payloads.
  • Machine-Learning-driven-Web-Application-Firewall - Set of good and bad queries to a web application firewall.
  • Internet-Wide Scan Data Repository - The Censys project publishes daily snapshots of what we know about each IPv4 host, Alexa Top Million website, and known X.509 certificate. These datasets contain structured, non-ephemeral JSON records that identify a host's configuration.
  • 500K HTTP Headers - Recently we crawled the Top 500K sites (as ranked by Alexa). Following requests from readers we are making available the HTTP Headers for research purposes.
  • HTTP DATASET CSIC 2010 - The HTTP dataset CSIC 2010 contains thousands of web requests automatically generated. It can be used for the testing of web attack protection systems. It was developed at the Information Security Institute of CSIC (Spanish Research National Council).
  • ISOT datasets - Through different projects, the ISOT Lab has collected various datasets, some of which are available for public sharing: ISOT Web Interactions Dataset (Mouse/Keystroke/Site Actions), ISOT Botnet Dataset...
  • Web Logs Secrepo - Web logs generated by the SecRepo community and the SecRepo web application.
  • Common Crawl - The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions.
  • Website Classification Dataset - The entire selective archive is manually curated, including classification of sites into a two-tiered subject hierarchy. We have made this manually-generated classification information available as an open dataset, in tab-separated column format.
  • AZSecure-data - The AZSecure-data PORTAL currently provides access to Web forums, Internet phishing websites, Twitter data, and other data.

URLs & Domain Names

  • Malicious URLs Dataset - The data set consists of about 2.4 million URLs (examples) and 3.2 million features.
  • cybercrime-tracker - List of labeled malicious URLs.
  • Malware Domain List - Malware Domain List.
  • ZeuS Tracker - ZeuS Tracker tracks ZeuS Command&Control servers (hosts) around the world and provides you with a domain blocklist and an IP blocklist.
  • Feodo Tracker - List of Feodo botnet C&C servers.
  • Ransomware Tracker - Ransomware Tracker offers various types of blocklists that allow you to block Ransomware botnet C&C traffic.
  • URLhaus - URLhaus is a project from abuse.ch with the goal of sharing malicious URLs that are being used for malware distribution.
  • Alexa Top 1 Million - CSV dataset with the most popular sites by Alexa.
  • OpenDNS Top Domains List - The OpenDNS Top Domains List is the top 10,000 domain names our resolvers all over the globe are receiving queries for, sorted by popularity.
  • The Majestic Million - The million domains we find with the most referring subnets.
  • StopForumSpam - The data provided here represents what we believe will only ever be used for abuse. IP addresses, domains and usernames listed here will be returned in API results as "blacklisted".
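Several of the entries above are distributed as domain or IP blocklists. As a rough illustration of how such a list might be used, the sketch below parses a plain-text blocklist and checks whether a URL's host appears in it. The file format (one host per line, `#` for comments), the sample entries, and all function names are assumptions made for this example; each real feed documents its own format, which may differ.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist content using reserved example names and addresses.
SAMPLE_BLOCKLIST = """\
# one host or IP per line; '#' starts a comment (assumed format)
malware.example
203.0.113.7
C2.Example.NET  # inline comments are stripped too
"""


def parse_blocklist(text: str) -> set:
    """Return the set of lower-cased hosts listed in the blocklist text."""
    entries = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            entries.add(entry)
    return entries


def host_of(url: str) -> str:
    """Extract the host from a URL, tolerating a missing scheme."""
    if "://" not in url and not url.startswith("//"):
        url = "//" + url  # urlsplit only fills hostname after '//'
    return (urlsplit(url).hostname or "").lower()


def is_blocked(url: str, blocklist: set) -> bool:
    """Exact-match lookup of the URL's host against the blocklist."""
    return host_of(url) in blocklist


BLOCKLIST = parse_blocklist(SAMPLE_BLOCKLIST)
for candidate in ("http://malware.example/payload.exe", "https://benign.example/"):
    print(candidate, "->", "blocked" if is_blocked(candidate, BLOCKLIST) else "allowed")
```

This performs exact host matching only; matching subdomains of a listed domain, or full URLs as published by URL-level feeds, would need additional logic.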

Host

  • The ADFA Intrusion Detection Datasets - Provides a contemporary Linux dataset and a contemporary Windows dataset for evaluating host-based intrusion detection systems (HIDS).
  • Unified Host and Network Dataset - The Unified Host and Network Dataset is a subset of network and computer (host) events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days. The host event logs originated from most enterprise computers running the Microsoft Windows operating system on Los Alamos National Laboratory's (LANL) enterprise network. The network event data originated from many of the internal enterprise routers within the LANL enterprise network.
  • Public Security Log Sharing Site - This site contains various free shareable log samples from various systems, security and network devices, applications, etc. The logs are collected from real systems, some contain evidence of compromise and other malicious activity. Wherever possible, the logs are NOT sanitized, anonymized or modified in any way (just as they came from the logging system).
  • Aktaion2 Data - The project is meant to be a learning/teaching tool on how to blend multiple security signals and behaviors into an expressive framework for intrusion detection.

Email

Fraud

  • Credit Card Fraud - The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred in two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
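The imbalance figures in the Credit Card Fraud entry can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted in that description; the function names are illustrative and not part of the dataset. It also shows why plain accuracy is a poor metric here: a model that never flags fraud would still score above 99.8%.

```python
# Counts quoted in the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def fraud_rate(frauds: int, total: int) -> float:
    """Fraction of transactions labelled as fraud."""
    return frauds / total


def majority_baseline_accuracy(frauds: int, total: int) -> float:
    """Accuracy of a trivial classifier that always predicts 'not fraud'."""
    return (total - frauds) / total


rate = fraud_rate(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)
baseline = majority_baseline_accuracy(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)
print(f"fraud rate: {rate:.4%}")
print(f"always-negative baseline accuracy: {baseline:.4%}")
```

The computed rate is about 0.1727%, which matches the quoted 0.172% when truncated rather than rounded. Metrics such as precision, recall, or area under the precision-recall curve are generally more informative than accuracy at this level of imbalance.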

Honeypots

  • DDS Dataset Collection - A tar/gzip CSV file from a collection of AWS honeypots. A zip CSV file of domains and a high level classification of dga or legit along with a subclass of either legit, cryptolocker, gox or newgoz.
  • Threat_Research - Centralized repository to dump threat research data gathered from my network of honeypots.

Binaries

  • The ember dataset - The ember dataset is a collection of 1.1 million sha256 hashes from PE files that were scanned sometime in 2017. This repository makes it easy to reproducibly train the benchmark model, extend the provided feature set, or classify new PE files with the benchmark model.

Phishing

  • Phishing Websites Data Set - In this dataset, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features.

Passwords

MISC

  • SecRepo - Samples of Security Related Data.
  • PANDA SHARE - This site stores recordings of executions produced by the PANDA dynamic analysis platform. The goal is to make dynamic analysis repeatable. Any dynamic analysis run on the same replay will produce the same results.
  • SHERLOCK - The dataset is essentially a massive time-series dataset spanning nearly every single kind of software and hardware sensor that can be sampled from a Samsung Galaxy S5 smartphone, without root privileges. The dataset contains over 600 billion data points in over 10 billion data records.
  • WerdLists - Wordlists, Dictionaries and Other Data Sets for Writing Software Security Test Cases.