Predictive Modeling & Analytics on Home Equity Line of Credit

Overview

Predictive Modeling & Analytics on Home Equity Line of Credit Data (Python)

HMEQ Data Set

In this assignment we will use Python to examine a data set containing Home Equity Loans. The data set contains two target variables. The first target, TARGET_BAD_FLAG, indicates whether or not the loan defaulted. If the value is set to 1, then the loan went bad and the bank lost money. If the value is set to 0, the loan was repaid.

The second target, TARGET_LOSS_AMT, indicates the amount of money that was lost for loans that went bad. The remaining variables contain information about the customer at the time that the loan was issued.

This is the data that we will use throughout this class in order to develop predictive models that will be used to determine the level of risk for each loan.

As with all real world data, this data is far from perfect.

It contains both numerical and categorical variables. It contains missing data. It contains outliers.

Table of Contents

  • Data Preparation
  • Tree Based Models
  • Regression Based Models
  • Neural Networks

Building Machine Learning Models

Developed different predictive models to determine the level of risk of each loan, based on whether or not the loan defaulted and on the loss amount for bad loans. Evaluated the default models with accuracy and ROC curves, and the loss amount models with RMSE.

Data Preparation

  • Download the HMEQ Data set
  • Read the data into Python
  • Explore both the input and target variables using statistical techniques.
  • Explore both the input and target variables using graphs and other visualization.
  • Look for relationships between the input variables and the targets.
  • Fix (impute) all missing data.
  • Note: For numerical data, create a flag variable to indicate if the value was missing
  • Convert all categorical variables to numeric variables
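
The imputation and encoding steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the tiny data frame is made up, the `M_` prefix for the missing-value flags and the `"MISSING"` category are assumptions, and median imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mimics a few HMEQ columns (values are illustrative only).
df = pd.DataFrame({
    "TARGET_BAD_FLAG": [0, 1, 0, 1],
    "LOAN": [1100, 1300, 1500, 1500],
    "YOJ": [10.5, np.nan, 4.0, 7.0],
    "DEBTINC": [np.nan, 35.2, np.nan, 41.0],
    "JOB": ["Other", None, "Office", "Other"],
})

targets = ["TARGET_BAD_FLAG"]
num_cols = [c for c in df.select_dtypes(include="number").columns if c not in targets]
cat_cols = list(df.select_dtypes(exclude="number").columns)

# Numerical inputs: add a 0/1 flag (hypothetical "M_" prefix), then impute the median.
for col in num_cols:
    if df[col].isna().any():
        df["M_" + col] = df[col].isna().astype(int)
        df[col] = df[col].fillna(df[col].median())

# Categorical inputs: treat "missing" as its own category, then one-hot encode.
for col in cat_cols:
    df[col] = df[col].fillna("MISSING")
df = pd.get_dummies(df, columns=cat_cols, dtype=int)

print(df.head())
```

Keeping the flag variables lets a model learn from the fact that a value was missing, which can itself be informative for credit risk.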

Tree Based Models

We will continue to use Python to develop predictive models. In this assignment, we will use three different tree based techniques to analyze the data: DECISION TREES, RANDOM FORESTS, and GRADIENT BOOSTING. The deliverables for each technique are given below.

Create a Training and Test Data Set:

Decision Trees:

  • Develop a decision tree to predict the probability of default
  • Calculate the accuracy of the model on both the training and test data set
  • Create a graph that shows the ROC curves for both the training and test data set. Clearly label each curve and display the Area Under the ROC curve.
  • Display the Decision Tree using a Graphviz program
  • List the variables included in the decision tree that predict loan default.
  • Develop a decision tree to predict the loss amount assuming that the loan defaults
  • Calculate the RMSE for both the training data set and the test data set
  • Display the Decision Tree using a Graphviz program
  • List the variables included in the decision tree that predict loss amount.
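
As a rough sketch of the default-prediction deliverables above, the example below fits a decision tree with scikit-learn, reports accuracy and AUC on both splits, lists the variables the tree actually used, and produces Graphviz DOT text. The data is synthetic and the `VAR_i` column names, the 80/20 split, and `max_depth=3` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the prepared HMEQ inputs; column names are placeholders.
X_arr, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"VAR_{i}" for i in range(8)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)

# Accuracy uses hard class predictions; AUC uses the predicted probability of class 1.
acc_train = accuracy_score(y_train, tree.predict(X_train))
acc_test = accuracy_score(y_test, tree.predict(X_test))
auc_train = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
print(f"Accuracy train={acc_train:.3f} test={acc_test:.3f}")
print(f"AUC      train={auc_train:.3f} test={auc_test:.3f}")

# Variables with non-zero importance are the ones the tree split on.
tree_vars = [name for name, imp in zip(X.columns, tree.feature_importances_) if imp > 0]
print("Variables used by the tree:", tree_vars)

# DOT text that a Graphviz program can render, e.g. `dot -Tpng tree.dot -o tree.png`.
dot_text = export_graphviz(
    tree, out_file=None, feature_names=list(X.columns),
    class_names=["Good", "Bad"], filled=True,
)
```

The loss amount tree follows the same pattern with `DecisionTreeRegressor`, fitted only on the loans that defaulted and scored with RMSE instead of accuracy.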

Random Forests:

  • Develop a Random Forest to predict the probability of default
  • Calculate the accuracy of the model on both the training and test data set
  • Create a graph that shows the ROC curves for both the training and test data set. Clearly label each curve and display the Area Under the ROC curve.
  • List the variables included in the Random Forest that predict loan default.
  • Develop a Random Forest to predict the loss amount assuming that the loan defaults
  • Calculate the RMSE for both the training data set and the test data set
  • List the variables included in the Random Forest that predict loss amount.
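
For the loss amount models, one possible approach is to train only on loans that defaulted and to report RMSE on both splits. The sketch below uses a Random Forest regressor on made-up data; the simulated relationship between LOAN and the loss, the choice of input columns, and the hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: the loss is simulated as roughly half the loan amount plus noise.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "LOAN": rng.uniform(1000, 50000, n),
    "DEBTINC": rng.uniform(10, 60, n),
    "CLAGE": rng.uniform(0, 400, n),
    "TARGET_BAD_FLAG": rng.integers(0, 2, n),
})
df["TARGET_LOSS_AMT"] = np.where(
    df["TARGET_BAD_FLAG"] == 1, 0.5 * df["LOAN"] + rng.normal(0, 1000, n), np.nan
)

# Loss is only defined for bad loans, so the regression uses those rows only.
inputs = ["LOAN", "DEBTINC", "CLAGE"]
bad_loans = df[df["TARGET_BAD_FLAG"] == 1]
X_train, X_test, y_train, y_test = train_test_split(
    bad_loans[inputs], bad_loans["TARGET_LOSS_AMT"], test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# RMSE is the square root of the mean squared error.
rmse_train = np.sqrt(mean_squared_error(y_train, forest.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, forest.predict(X_test)))
print(f"RMSE train={rmse_train:.1f} test={rmse_test:.1f}")

# Rank the inputs by importance to list the variables that drive the loss amount.
importances = pd.Series(forest.feature_importances_, index=inputs).sort_values(ascending=False)
print(importances)
```

A large gap between the training and test RMSE suggests overfitting. Swapping in `GradientBoostingRegressor` gives the Gradient Boosting version of the same deliverable.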

Gradient Boosting:

  • Develop a Gradient Boosting model to predict the probability of default
  • Calculate the accuracy of the model on both the training and test data set
  • Create a graph that shows the ROC curves for both the training and test data set. Clearly label each curve and display the Area Under the ROC curve.
  • List the variables included in the Gradient Boosting that predict loan default.
  • Develop a Gradient Boosting to predict the loss amount assuming that the loan defaults
  • Calculate the RMSE for both the training data set and the test data set
  • List the variables included in the Gradient Boosting that predict loss amount.

ROC Curves:

  • Generate a ROC curve for the Decision Tree, Random Forest, and Gradient Boosting models using the Test Data Set
  • Use different colors for each curve and clearly label them
  • Include the Area under the ROC Curve (AUC) on the graph.
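
A combined ROC graph of this kind could be drawn as in the sketch below. It assumes scikit-learn and matplotlib, uses synthetic data in place of HMEQ, and picks colors and hyperparameters arbitrarily. The `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared HMEQ data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

models = {
    "Decision Tree": (DecisionTreeClassifier(max_depth=4, random_state=2), "red"),
    "Random Forest": (RandomForestClassifier(n_estimators=100, random_state=2), "green"),
    "Gradient Boosting": (GradientBoostingClassifier(random_state=2), "blue"),
}

fig, ax = plt.subplots()
auc_by_model = {}
for name, (model, color) in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc_by_model[name] = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=color, label=f"{name} (AUC = {auc_by_model[name]:.2f})")

# Diagonal reference line: a model that guesses at random.
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC curves on the test data set")
ax.legend(loc="lower right")
```

Calling `fig.savefig("roc_test.png")` afterwards writes the graph to a file; the file name is just an example.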

Regression Based Models

We will continue to use Python to develop predictive models. In this assignment, we will use two different types of regression: Linear and Logistic. We will use Logistic regression to determine the probability of a loan default. Linear regression will be used to calculate the loss amount assuming that the loan defaults.

Create a Training and Test Data Set:

Logistic Regression

  • Develop a logistic regression model to determine the probability of a loan default. Use all of the variables.
  • Develop a logistic regression model to determine the probability of a loan default. Use the variables that were selected by a DECISION TREE.
  • Develop a logistic regression model to determine the probability of a loan default. Use the variables that were selected by a RANDOM FOREST.
  • Develop a logistic regression model to determine the probability of a loan default. Use the variables that were selected by a GRADIENT BOOSTING model.
  • Develop a logistic regression model to determine the probability of a loan default. Use the variables that were selected by STEPWISE SELECTION.
  • For each of the models
    • Calculate the accuracy of the model on both the training and test data set
    • Create a graph that shows the ROC curves for both the training and test data set. Clearly label each curve and display the Area Under the ROC curve.
    • Display a ROC curve for the test data with all your models on the same graph (tree based and regression). Discuss which one is the most accurate. Which one would you recommend using?
    • For one of the Regression Models, print the coefficients. Do the variables make sense? If not, what would you recommend?
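
One way to feed tree-selected variables into a logistic regression and then inspect the coefficients is sketched below. The data is synthetic, the `VAR_i` names are placeholders, and treating "non-zero feature importance" as "selected by the decision tree" is an assumption about how the selection is defined.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared HMEQ inputs; column names are placeholders.
X_arr, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=3)
X = pd.DataFrame(X_arr, columns=[f"VAR_{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# Step 1: let a decision tree pick the variables (those it actually split on).
tree = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X_train, y_train)
tree_vars = [c for c, imp in zip(X.columns, tree.feature_importances_) if imp > 0]

# Step 2: fit the logistic regression on the selected variables only.
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train[tree_vars], y_train)

acc_test = accuracy_score(y_test, logit.predict(X_test[tree_vars]))
auc_test = roc_auc_score(y_test, logit.predict_proba(X_test[tree_vars])[:, 1])
print(f"Test accuracy={acc_test:.3f}  test AUC={auc_test:.3f}")

# Step 3: print the coefficients; the sign shows the direction of each effect.
coefs = pd.Series(logit.coef_[0], index=tree_vars)
print("Intercept:", logit.intercept_[0])
print(coefs)
```

On the real data, a positive coefficient on a variable such as DELINQ would agree with the conventional wisdom in the Data Dictionary, while an unexpected sign is a prompt to check for correlated inputs or imputation effects.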

Linear Regression:

  • Develop a linear regression model to determine the expected loss if the loan defaults. Use all of the variables.
  • Develop a linear regression model to determine the expected loss if the loan defaults. Use the variables that were selected by a DECISION TREE.
  • Develop a linear regression model to determine the expected loss if the loan defaults. Use the variables that were selected by a RANDOM FOREST.
  • Develop a linear regression model to determine the expected loss if the loan defaults. Use the variables that were selected by a GRADIENT BOOSTING model.
  • Develop a linear regression model to determine the expected loss if the loan defaults. Use the variables that were selected by STEPWISE SELECTION.
  • For each of the models
    • Calculate the RMSE for both the training data set and the test data set
    • List the RMSE for the test data set for all of the models created (tree based and regression). Discuss which one is the most accurate. Which one would you recommend using?
    • For one of the Regression Models, print the coefficients. Do the variables make sense? If not, what would you recommend?
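
scikit-learn has no classical p-value-based stepwise procedure, so the sketch below substitutes `SequentialFeatureSelector`, a greedy forward selection scored by cross-validation, as a stand-in for STEPWISE SELECTION. The data is simulated so that only `VAR_0` and `VAR_3` matter, and the number of variables to keep is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data: the target depends on VAR_0 and VAR_3 only (placeholder names).
rng = np.random.default_rng(4)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"VAR_{i}" for i in range(6)])
y = 3.0 * X["VAR_0"] - 2.0 * X["VAR_3"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Forward selection: add one variable at a time, keeping the best cross-validated score.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X_train, y_train)
step_vars = list(X.columns[selector.get_support()])
print("Selected variables:", step_vars)

# Fit the final linear regression on the selected variables and report RMSE.
linreg = LinearRegression().fit(X_train[step_vars], y_train)
rmse_train = np.sqrt(mean_squared_error(y_train, linreg.predict(X_train[step_vars])))
rmse_test = np.sqrt(mean_squared_error(y_test, linreg.predict(X_test[step_vars])))
print(f"RMSE train={rmse_train:.3f} test={rmse_test:.3f}")
print(pd.Series(linreg.coef_, index=step_vars))
```

For the loss amount models, the same steps would be applied to the defaulted loans only, with TARGET_LOSS_AMT as the target.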

Neural Networks

We will continue to use Python to develop predictive models. In this assignment, we will use TensorFlow to build neural networks. One model will be used to predict the probability of a loan default. A second model will be used to predict the loss amount assuming that the loan defaults.

Create a Training and Test Data Set:

TensorFlow Model to Predict Loan Defaults:

  • Develop a model using TensorFlow that will predict Loan Default.
  • For your model, do the following:
    • Try at least three different Activation Functions
    • Try one and two hidden layers
    • Try using a Dropout Layer
  • Explore using a variable selection technique
  • For each of the models
    • Calculate the accuracy of the model on both the training and test data set
    • Create a graph that shows the ROC curves for both the training and test data set. Clearly label each curve and display the Area Under the ROC curve.
    • Display a ROC curve for the test data with all your models on the same graph (tree based, regression, and TF). Discuss which one is the most accurate. Which one would you recommend using?
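
The experiments listed above (activation functions, one or two hidden layers, dropout) can be organized around a small model-building function. The sketch below uses the `tf.keras` API on made-up data; the helper name `build_model`, the layer width, the dropout rate, and the number of epochs are all assumptions chosen to keep the example short.

```python
import numpy as np
import tensorflow as tf

# Made-up binary classification data standing in for the prepared HMEQ inputs.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]


def build_model(n_inputs, activation="relu", hidden_layers=1, units=16, dropout=0.0):
    """Hypothetical helper: a dense network with a sigmoid output for default probability."""
    layers = [tf.keras.Input(shape=(n_inputs,))]
    for _ in range(hidden_layers):
        layers.append(tf.keras.layers.Dense(units, activation=activation))
        if dropout > 0:
            layers.append(tf.keras.layers.Dropout(dropout))
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Try three activation functions with one and two hidden layers, all with dropout.
results = {}
for activation in ["relu", "tanh", "sigmoid"]:
    for hidden_layers in [1, 2]:
        model = build_model(X.shape[1], activation, hidden_layers, dropout=0.2)
        model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)
        _, acc_train = model.evaluate(X_train, y_train, verbose=0)
        _, acc_test = model.evaluate(X_test, y_test, verbose=0)
        results[(activation, hidden_layers)] = (acc_train, acc_test)

for key, (acc_train, acc_test) in results.items():
    print(key, f"train={acc_train:.3f} test={acc_test:.3f}")
```

For the ROC curves, `model.predict(X_test).ravel()` returns the predicted probabilities that `sklearn.metrics.roc_curve` expects. Neural networks usually train better when the inputs are scaled first, which this sketch skips because the made-up data is already standardized.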

TensorFlow Model to Predict Loss Given Default:

  • Develop a model using TensorFlow that will predict the loss amount assuming that the loan defaults.
  • For your model, do the following:
    • Try at least three different Activation Functions
    • Try one and two hidden layers
    • Try using a Dropout Layer
  • Explore using a variable selection technique
  • For each of the models
    • Calculate the RMSE for both the training data set and the test data set
    • List the RMSE for the test data set for all of the models created (tree based, regression, and TF). Discuss which one is the most accurate. Which one would you recommend using?

Data Dictionary

| VARIABLE | DEFINITION | ROLE | TYPE | CONVENTIONAL WISDOM |
|---|---|---|---|---|
| TARGET_BAD_FLAG | BAD=1 (Loan was defaulted) | TARGET | BINARY | HMEQ = Home Equity Line of Credit Loan. BINARY TARGET |
| TARGET_LOSS_AMT | If loan was Bad, this was the amount not repaid. | TARGET | NUMBER | HMEQ = Home Equity Line of Credit Loan. NUMERICAL TARGET |
| LOAN | HMEQ Credit Line | INPUT | NUMBER | The bigger the loan, the more risky the person. |
| MORTDUE | Current Outstanding Mortgage Balance | INPUT | NUMBER | If you owe a lot of money on your current mortgage versus the value of your house, you are more risky. |
| VALUE | Value of your house | INPUT | NUMBER | If you owe a lot of money on your current mortgage versus the value of your house, you are more risky. |
| REASON | Why do you want a loan? | INPUT | CATEGORY | If you are consolidating debt, that might mean you are having financial trouble. |
| JOB | What do you do for a living? | INPUT | CATEGORY | Some jobs are unstable (and therefore are more risky). |
| YOJ | Years on Job | INPUT | NUMBER | If you have been at your job for a while, you are less likely to lose that job. That makes you less risky. |
| DEROG | Derogatory Marks on Credit Record. These are very bad things that stay on your credit report for 7 years. These include bankruptcies or liens placed on your property. | INPUT | NUMBER | Lots of Derogatories mean that something really bad happened to you (such as a bankruptcy) in your past. This makes you more risky. |
| DELINQ | Delinquencies on your current credit report. This refers to the number of times you were overdue when paying bills in the last three years. | INPUT | NUMBER | When you have a lot of delinquencies, you might be more likely to default on a loan. |
| CLAGE | Credit Line Age (in months) is how long you have had credit. Are you a new high school student with a new credit card or have you had credit cards for many years? | INPUT | NUMBER | If you have had credit for a long time, you are considered less risky than a new high school student. |
| NINQ | Number of inquiries. This is the number of times within the last 3 years that you went out looking for credit (such as opening a credit card at a store). | INPUT | NUMBER | Conventional wisdom is that if you are looking for more credit, you might be in financial trouble. Thus you are risky. |
| CLNO | Number of credit lines you have (credit cards, loans, etc.). | INPUT | NUMBER | This is a double-edged sword. People who have a lot of credit lines tend to be safe. The reason is that if OTHER PEOPLE think you are trustworthy enough for a credit card, then maybe you are. However, if you have too many credit lines, you might be risky because you have the potential to run up a lot of debt. |
| DEBTINC | Debt to Income Ratio. Take the money you spend every month and divide it by the amount of money you earn every month. | INPUT | NUMBER | If your debt to income ratio is high then you are risky because you might not be able to pay your bills. |
Owner
Dhaval Patel