Full-featured Decision Trees and Random Forests learner.

Last update: Aug 15, 2022

Overview

CID3

CID3 is a full-featured Decision Trees and Random Forests learner. It can save trees or forests to disk for later use, query saved trees and Random Forests, and fill out an unlabeled file with the predicted classes. Documentation is not yet available, but the program options can be shown with this command:

% java -jar cid3.jar -h

usage: java -jar cid3.jar
 -a,--analysis <name>    show causal analysis report
 -c,--criteria <name>    input criteria: c[Certainty], e[Entropy], g[Gini]
 -f,--file <name>        input file
 -h,--help               print this message
 -o,--output <name>      output file
 -p,--partition          partition train/test data
 -q,--query <type>       query model, enter: t[Tree] or r[Random forest]
 -r,--forest <amount>    create random forest, enter # of trees
 -s,--save               save tree/random forest
 -t,--threads <amount>   maximum number of threads (default is 500)
 -v,--validation         create 10-fold cross-validation
 -ver,--version          version

List of features

Uses a new Certainty formula as the splitting criterion.
Provides a causal analysis report, which shows how some attribute values cause a particular classification.
Creates full trees, showing error rates for train and test data, attribute importance, causes and false positives/negatives.
If no test data is provided, it can split the training dataset into 80% for training and 20% for testing.
Creates random forests, showing error rates for train and test data, attribute importance, causes and false positives/negatives. Random forests are built in parallel to reduce training time.
Creates 10-fold cross-validation for trees and random forests, showing error rates, mean and standard error, and false positives/negatives. Cross-validation folds are created in parallel.
Saves trees and random forests to disk in a compressed file (e.g. model.tree, model.forest).
Queries trees and random forests from saved files. Queries can contain missing values: just enter the character “?”.
Makes predictions and fills out cases files with those predictions, either from single trees or random forests.
Imputes missing values for train and test data. Continuous attributes are imputed with the mean value. Discrete attributes are imputed with the mode, that is, the most frequent value.
Supports ignoring attributes: in the .names file, set the attribute type to ignore.
Supports three splitting criteria: Certainty, Entropy and Gini. If no criterion is specified, Certainty is used.
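
The mean/mode imputation described in the list above can be illustrated with a minimal Python sketch. This is not CID3's code (CID3 is a Java program); the function names are hypothetical, and using "?" as the missing-value marker in data columns is an assumption borrowed from the query syntax.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # assumption: same marker the README mentions for queries


def impute_continuous(values):
    """Replace missing entries with the mean of the known values."""
    known = [float(v) for v in values if v != MISSING]
    fill = mean(known)
    return [fill if v == MISSING else float(v) for v in values]


def impute_discrete(values):
    """Replace missing entries with the mode (most frequent known value)."""
    known = [v for v in values if v != MISSING]
    # On ties, most_common keeps the value that was seen first.
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in values]


ages = ["22", "?", "26", "30"]
ports = ["S", "C", "?", "S"]
print(impute_continuous(ages))  # [22.0, 26.0, 26.0, 30.0]
print(impute_discrete(ports))   # ['S', 'C', 'S', 'S']
```

How CID3 breaks ties for the mode is not stated in this README; the sketch simply keeps the first value encountered.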

Example run with titanic dataset

% java -jar cid3.jar -f titanic

CID3 [Version 1.1]              Saturday October 30, 2021 06:34:11 AM
------------------
[ ✓ ] Read data: 891 cases for training. (10 attributes)
[ ✓ ] Decision tree created.

Rules: 276
Nodes: 514

Importance Cause   Attribute Name
---------- -----   --------------
      0.57   yes ············ Sex
      0.36   yes ········· Pclass
      0.30   yes ··········· Fare
      0.28   yes ······· Embarked
      0.27   yes ·········· SibSp
      0.26   yes ·········· Parch
      0.23    no ············ Age


[==== TRAIN DATA ====] 

Correct guesses:  875
Incorrect guesses: 16 (1.8%)

# Of Cases  False Pos  False Neg   Class
----------  ---------  ---------   -----
       549         14          2 ····· 0
       342          2         14 ····· 1

Time: 0:00:00
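
As a quick sanity check, the summary figures in the TRAIN DATA report can be recomputed from the per-class table. The Python sketch below copies the numbers from the output above. It assumes the usual reading of the columns: a false negative is a case of that class predicted as another class, and a false positive is a case of another class predicted as that class.

```python
# Per-class counts copied from the TRAIN DATA table above.
table = {
    "0": {"cases": 549, "false_pos": 14, "false_neg": 2},
    "1": {"cases": 342, "false_pos": 2, "false_neg": 14},
}

total = sum(row["cases"] for row in table.values())
# Each misclassified case is a false negative for exactly one class (its true class).
incorrect = sum(row["false_neg"] for row in table.values())
correct = total - incorrect
error_rate = 100 * incorrect / total

print(total, correct, incorrect, f"{error_rate:.1f}%")  # 891 875 16 1.8%
```

With two classes, the false positives of one class equal the false negatives of the other, which matches the 14/2 and 2/14 pattern in the table.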

Requirements

CID3 requires JDK 15 or higher.

The data format is similar to that of C4.5 and C5.0. The data file is CSV and can be split into two separate files, such as titanic.data and titanic.test. The class attribute must be the last column of the file. The other required file is the "names" file, named like titanic.names, which contains the names and types of the attributes. Its first line lists the possible values of the class attribute; this line can be left empty except for a single dot (.). Below is an example of the titanic.names file:

0,1.  
PassengerId: ignore.  
Pclass: 1,2,3.  
Sex : male,female.  
Age: continuous.  
SibSp: discrete.  
Parch: discrete.  
Ticket: ignore.  
Fare: continuous.  
Cabin: ignore.  
Embarked: discrete.
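
To make the layout of the names file concrete, here is a minimal Python sketch that reads text in the format shown above. It is not CID3's parser; the function name parse_names and the returned structure are hypothetical. It only handles the constructs that appear in this example: a first line of class values, then one "name: type." entry per line.

```python
KEYWORDS = ("continuous", "discrete", "ignore")


def parse_names(text):
    """Parse names-file text into (class_values, attributes).

    attributes maps each attribute name to a keyword string
    ("continuous", "discrete", "ignore") or to a list of listed values.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # First line: comma-separated class values ending in a dot.
    # A lone "." yields an empty list of class values.
    first = lines[0].rstrip(".")
    classes = [v.strip() for v in first.split(",") if v.strip()]

    attributes = {}
    for ln in lines[1:]:
        name, _, spec = ln.rstrip(".").partition(":")
        spec = spec.strip()
        if spec in KEYWORDS:
            attributes[name.strip()] = spec
        else:
            attributes[name.strip()] = [v.strip() for v in spec.split(",")]
    return classes, attributes


SAMPLE = """0,1.
PassengerId: ignore.
Pclass: 1,2,3.
Sex : male,female.
Age: continuous.
"""

classes, attrs = parse_names(SAMPLE)
print(classes)          # ['0', '1']
print(attrs["Pclass"])  # ['1', '2', '3']
print(attrs["Sex"])     # ['male', 'female']
print(attrs["Age"])     # continuous
```

The sketch strips whitespace around names, so an entry written as "Sex : male,female." is read the same way as "Sex: male,female.".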

Example of causal analysis

% java -jar cid3.jar -f adult -a education

This example shows that the attribute "education" is a cause, as determined by the certainty-raising inequality. Once an attribute is known to be a cause, the causal certainties of its values can be compared. When its value is "Doctorate", it causes the earnings to be greater than $50,000, with a probability of 0.73. A paper will soon be published with all the formulas used to calculate the Certainty for splitting the nodes and the certainty-raising inequality used for causal analysis.

Importance Cause   Attribute Name
---------- -----   --------------
      0.56   yes ······ education

Report of causal certainties
----------------------------

[ Attribute: education ]

    1st-4th --> <=50K  (0.97)

    5th-6th --> <=50K  (0.95)

    7th-8th --> <=50K  (0.94)

    9th --> <=50K  (0.95)

    10th --> <=50K  (0.94)

    11th --> <=50K  (0.95)

    12th --> <=50K  (0.93)

    Assoc-acdm --> <=50K  (0.74)

    Assoc-voc --> <=50K  (0.75)

    Bachelors --> Non cause.

    Doctorate --> >50K  (0.73)

    HS-grad --> <=50K  (0.84)

    Masters --> >50K  (0.55)

    Preschool --> <=50K  (0.99)

    Prof-school --> >50K  (0.74)

    Some-college --> <=50K  (0.81)

Releases (v1.2.4)

v1.2.4 (Apr 28, 2022)

Fixed a bug when entering an attribute name for the causal analysis report.

v1.2.3 (Mar 10, 2022)

Implemented progress animation when option -s is invoked.

v1.2.2 (Mar 2, 2022)

Added progress animation to the analysis report.

v1.2.1 (Jan 21, 2022)

Replaced a problematic character.

v1.2 (Nov 9, 2021)

This version includes the correct calculation of causal certainties and the certainty-raising inequality. The analysis report is now also sorted by attribute values.

v1.1.5 (Nov 7, 2021)

Correctly implemented the causal analysis, using the certainty-raising inequality and the causal certainties.

v1.1.3 (Nov 7, 2021)

Implemented causes for specific attribute values.

v1.1.2 (Nov 6, 2021)

Minor patch.

v1.1.1 (Oct 31, 2021)

This is a hurried patch to fix a problem in the causal analysis report. The report now works as intended.

v1.1 (Oct 30, 2021)

Release v1.1 contains many new features and fixes. Implemented the report of causal certainties, which shows how certain attribute values cause a particular classification.

v1.0.7 (Oct 28, 2021)

Code cleanup and new features implemented. Querying a tree now checks for invalid input and asks for correct input. This will be the last patch until version v1.1.

v1.0.6 (Oct 28, 2021)

Correctly aligned text on console.

v1.0.5 (Oct 27, 2021)

Reintroduced attribute importance for Entropy and Gini criteria.

v1.0.4 (Oct 27, 2021)

Removed causal analysis from Entropy and Gini criteria. It only makes sense with Certainty.

v1.0.3 (Oct 23, 2021)

Rolled back the parallel tests of Random Forests. It is much faster now.

v1.0.2 (Oct 23, 2021)

Minor changes.

v1.0.1 (Oct 23, 2021)

Testing Random Forests is now done in parallel.

v1.0 (Oct 18, 2021)

Releasing version v1.0.


Owner

Alejandro Penate-Diaz
