Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).

Overview

PandasVault ⁠— Advanced Pandas Functions and Code Snippets

The only Pandas utility package you will ever need. It has no exotic external dependencies. All functions have been compared and tested against alternatives; only the fastest equivalents have been kept and included in this package. The package contains more than 20 wrapped functions and 100 snippets.


Github PandasVault Link, LinkedIn

You have the option to view this Readme or run a Colab Notebook.

pip install pandasvault

If you can identify performance improvements, or improvements in code length and styling, please open a pull request. This package is new; all help and criticism are appreciated. I would love to hear about any additional function ideas. If you have a function to contribute, please open an issue or email me at d.snow(at)nyu.edu.

List of Code

Table Processing

Table Exploration

Feature Processing

Feature Engineering

Model Validation


List of Functions

import pandas as pd
import numpy as np
import pandasvault as pv

"""TABLE PROCESSING"""
df = pv.list_shuff(["target","c","d"],df)
df = pv.reduce_mem_usage(df)

"""TABLE EXPLORATION"""
df = pv.corr_list(df)
df = pv.missing_data(df)

"""FEATURE PROCESSING"""
df = pv.drop_corr(df, thresh=0.1,keep_cols=["target"])
df = pv.replace_small_cat(df,["cat"])
qconstant_col = pv.constant_feature_detect(data=df,threshold=0.9)
df_train, scl = pv.scaler(df,target="target",cols_ignore=["a"],type="MinMax")
df_test = pv.scaler(df_test,scaler=scl,train=False, target="target",cols_ignore=["a"])
df = pv.impute_null_with_tail(df,cols=df.columns)
index,para = pv.outlier_detect(df,"a",threshold=0.5,method="IQR")
df = pv.windsorization(data=df,col='a',para=para,strategy='both')
df = pv.impute_outlier(data=df,col='a', outlier_index=index,strategy='mean')

"""FEATURE EXTRACTION"""
df = pv.auto_dummy(df, unique=3)
df = pv.binarise_empty(df, frac=0.6)
df = pv.polynomials(df, ["a","b"]) 
df = pv.transformations(df,["a","b"])
df = pv.pca_feature(df,variance_or_components=0.80,drop_cols=["target","a"])
df = pv.multiple_lags(df, start=1, end=2,columns=["a","target"])
df = pv.multiple_rolling(df, columns=["a"])
df = pv.date_features(df, date="date_fake")
df['distance_central'] = df.apply(pv.haversine_distance,axis=1)

"""MODEL VALIDATION"""
scores = pv.classification_scores(y_test, y_predict, y_prob)

Functions and Snippets Applied


If you are running the code for the first time load this test dataframe:

!pip install pandasvault
import pandas as pd
import numpy as np
import pandasvault as pv

np.random.seed(1)
"""quick way to create a data frame for testing""" 
df_test = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd']) \
    .assign(target=lambda x: (x['b']+x['a']/x['d'])*x['c'])

Table Processing



>>> Configure Pandas (func)


import pandas as pd

def pd_config():
    options = {
        'display': {
            'max_colwidth': 25,
            'expand_frame_repr': False,  # Don't wrap to multiple pages
            'max_rows': 14,
            'max_seq_items': 50,         # Max length of printed sequence
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None   # Controls SettingWithCopyWarning
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+

if __name__ == '__main__':
    pv.pd_config()
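
To revert to the library defaults later, pandas can reset options by name or by prefix; a minimal sketch (the regex form is assumed to match all display options):

pd.reset_option("display.max_rows")   # reset a single option
pd.reset_option("^display")           # reset everything under the display namespace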

>>> Data Frame Formatting


df = df_test.copy()
df["number"] = [3,10,1]
df_out = (
  df.style.format({"a":"${:.2f}", "target":"${:.5f}"})
 .hide_index()
 .highlight_min("a", color ="red")
 .highlight_max("a", color ="green")
 .background_gradient(subset = "target", cmap ="Blues")
 .bar("number", color = "lightblue", align = "zero")
 .set_caption("DF with different stylings")
) ; df_out

See Colab for Output
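
If you want to inspect the styled table outside a notebook, one option is to write the rendered HTML to a file (the file name is illustrative; older pandas exposes Styler.render(), newer versions Styler.to_html()):

html = df_out.render()                  # or: html = df_out.to_html()
with open("styled_frame.html", "w") as f:
    f.write(html)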


>>> Data Frames For Testing


df1 = pd.util.testing.makeDataFrame() # contains random values
print("Contains missing values")
df2 = pd.util.testing.makeMissingDataframe() # contains missing values
print("Contains datetime values")
df3 = pd.util.testing.makeTimeDataFrame() # contains datetime values
print("Contains mixed values")
df4 = pd.util.testing.makeMixedDataFrame(); df4.head() # contains mixed values
Contains missing values
Contains datetime values
Contains mixed values
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07

>>> Lower Case Columns


## Lower-case all DataFrame column names 
df = df_test.copy() ; df
df.columns = ["A","BGs","c","dag","Target"]
df.columns = map(str.lower, df.columns); df
a bgs c dag target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910

>>> Front and Back Column Selection


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def front(self, n):
    return self.iloc[:, :n]

def back(self, n):
    return self.iloc[:, -n:]

pd.back = back
pd.front = front

pd.back(df,2)
d target
0 -1.0730 1.1227
1 -0.7612 -5.9994
2 -2.0601 -0.5910

>>> Fast Data Frame Split


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
test =  df.sample(frac=0.4)
train = df[~df.isin(test)].dropna(); train
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994

>>> Create Features and Labels List


df = df_test.head()
y = 'target'
X = [name for name in df.columns if name not in [y, 'd']]
print('y =', y)
print('X =', X)
y = target
X = ['a', 'b', 'c']

>>> Short Basic Commands


1, "1", "0") df["k"] = df["category"].astype(str) +": " + df["d"].round(1).astype(str) df = df.append(df, ignore_index=True) ; df.head() ">
df = df_test.copy()
df["category"] = np.where( df["target"]>1, "1",  "0")
df["k"] = df["category"].astype(str) +": " + df["d"].round(1).astype(str) 
df = df.append(df, ignore_index=True) ; df.head()
a b c d target category k
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1 1: -1.1
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 0 0: -0.8
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0 0: -2.1
3 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1 1: -1.1
4 0.8654 -2.3015 1.7448 -0.7612 -5.9994 0 0: -0.8
1) & (df['b'] <1)] """filter by conditions and the condition on row labels(index)""" df[(df.a > 0) & (df.index.isin([0, 1]))] """regexp filters on strings (vectorized), use .* instead of *""" df[df.category.str.contains(r'.*[0-9].*')] """logical NOT is like this""" df[~df.category.str.contains(r'.*[0-9].*')] """creating complex filters using functions on rows""" df[df.apply(lambda x: x['b'] > x['c'], axis=1)] """Pandas replace operation""" df["a"].round(2).replace(0.87, 17, inplace=True) df["a"][df["a"] < 4] = 19 """Conditionals and selectors""" df.loc[df["a"] > 1, ["a","b","target"]] """Selecting multiple column slices""" df.iloc[:, np.r_[0:2, 4:5]] """apply and map examples""" df[["a","b","c"]].applymap(lambda x: x+1) """add 2 to row 3 and return the series""" df[["a","b","c"]].apply(lambda x: x[0]+2,axis=0) """add 3 to col A and return the series""" df.apply(lambda x: x['a']+1,axis=1) """ Split delimited values in a DataFrame column into two new columns """ df['new1'], df['new2'] = zip(*df['k'].apply(lambda x: x.split(': ', 1))) """ Doing calculations with DataFrame columns that have missing values In example below, swap in 0 for df['col1'] cells that contain null """ df['new3'] = np.where(pd.isnull(df['b']),0,df['a']) + df['c'] """ Exclude certain data type or include certain data types """ df.select_dtypes(exclude=['O','float']) df.select_dtypes(include=['int']) """one liner to normalize a data frame""" (df[["a","b"]] - df[["a","b"]].mean()) / (df[["a","b"]].max() - df[["a","b"]].min()) """groupby used like a histogram to obtain counts on sub-ranges of a variable, pretty handy""" df.groupby(pd.cut(df.a, range(0, 1, 2))).size() """use a local variable use inside a query of pandas using @""" mean = df["a"].mean() df.query("a > @mean") """Calculate the % of missing values in each column""" df.isna().mean() """Calculate the % of missing values in each row""" rows = df.isna().mean(axis=1) ; df.head() ">
"""set display width, col_width etc for interactive pandas session""" 
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_rows', 100)
           
"""when you have an excel sheet with spaces in column names"""
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

"""Add prefix to all columns"""
df.add_prefix("1_")

"""Add suffix to all columns"""
df.add_suffix("_Z")

"""Droping column where missing values are above a threshold"""
df.dropna(thresh = len(df)*0.95, axis = "columns") 

"""Given a dataframe df to filter by a series ["a","b"]:""" 
df[df['category'].isin(["1","0"])]

"""filter by multiple conditions in a dataframe df"""
df[(df['a'] >1) & (df['b'] <1)]

"""filter by conditions and the condition on row labels(index)"""
df[(df.a > 0) & (df.index.isin([0, 1]))]

"""regexp filters on strings (vectorized), use .* instead of *"""
df[df.category.str.contains(r'.*[0-9].*')]

"""logical NOT is like this"""
df[~df.category.str.contains(r'.*[0-9].*')]

"""creating complex filters using functions on rows"""
df[df.apply(lambda x: x['b'] > x['c'], axis=1)]

"""Pandas replace operation"""
df["a"].round(2).replace(0.87, 17, inplace=True)
df["a"][df["a"] < 4] = 19

"""Conditionals and selectors"""
df.loc[df["a"] > 1, ["a","b","target"]]

"""Selecting multiple column slices"""
df.iloc[:, np.r_[0:2, 4:5]] 

"""apply and map examples"""
df[["a","b","c"]].applymap(lambda x: x+1)

"""add 2 to row 3 and return the series"""
df[["a","b","c"]].apply(lambda x: x[0]+2,axis=0)

"""add 3 to col A and return the series"""
df.apply(lambda x: x['a']+1,axis=1)

""" Split delimited values in a DataFrame column into two new columns """
df['new1'], df['new2'] = zip(*df['k'].apply(lambda x: x.split(': ', 1)))

""" Doing calculations with DataFrame columns that have missing values
  In the example below, swap in 0 for df['b'] cells that contain null """ 
df['new3'] = np.where(pd.isnull(df['b']),0,df['a']) + df['c']

""" Exclude certain data type or include certain data types """
df.select_dtypes(exclude=['O','float'])
df.select_dtypes(include=['int'])

"""one liner to normalize a data frame""" 
(df[["a","b"]] - df[["a","b"]].mean()) / (df[["a","b"]].max() - df[["a","b"]].min())

"""groupby used like a histogram to obtain counts on sub-ranges of a variable, pretty handy""" 
df.groupby(pd.cut(df.a, range(0, 30, 10))).size()

"""use a local variable use inside a query of pandas using @"""
mean = df["a"].mean()
df.query("a > @mean")

"""Calculate the % of missing values in each column"""
df.isna().mean() 

"""Calculate the % of missing values in each row"""
rows = df.isna().mean(axis=1) ; df.head()
a b c d target category k new1 new2 new3
0 19.0 -0.6118 -0.5282 -1.0730 1.1227 1 1: -1.1 1 -1.1 18.4718
1 19.0 -2.3015 1.7448 -0.7612 -5.9994 0 0: -0.8 0 -0.8 20.7448
2 19.0 -0.2494 1.4621 -2.0601 -0.5910 0 0: -2.1 0 -2.1 20.4621
3 19.0 -0.6118 -0.5282 -1.0730 1.1227 1 1: -1.1 1 -1.1 18.4718
4 19.0 -2.3015 1.7448 -0.7612 -5.9994 0 0: -0.8 0 -0.8 20.7448

>>> Read Commands


df = pd.util.testing.makeMixedDataFrame()
df.to_csv("data.csv") ; df
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
"""To avoid Unnamed: 0 when loading a previously saved csv with index"""
"""To parse dates"""
"""To set data types"""

df_out = pd.read_csv("data.csv", index_col=0,
                 parse_dates=['D'],
                 dtype={"c":"category", "B":"int64"}).set_index("D")

"""Copy data to clipboard; like an excel copy and paste
df = pd.read_clipboard()
"""

"""Read table from website
df = pd.read_html(url, match="table_name")
"""

""" Read pdf into dataframe ()
!pip install tabula
from tabula import read_pdf
df = read_pdf('test.pdf', pages='all')
"""
df_out.head()
A B C
D
2009-01-01 0.0 0 foo1
2009-01-02 1.0 1 foo2
2009-01-05 2.0 0 foo3
2009-01-06 3.0 1 foo4
2009-01-07 4.0 0 foo5

>>> Create Ordered Categories


df = df_test.copy()
df["cats"] = ["bad","good","excellent"]; df
a b c d target cats
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 bad
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 good
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 excellent
"bad"] ">
import pandas as pd
from pandas.api.types import CategoricalDtype

print("Let's create our own categorical order.")
cat_type = CategoricalDtype(["bad", "good", "excellent"], ordered = True)
df["cats"] = df["cats"].astype(cat_type)

print("Now we can use logical sorting.")
df = df.sort_values("cats", ascending = True)

print("We can also filter this as if they are numbers.")
df[df["cats"] > "bad"]
Let's create our own categorical order.
Now we can use logical sorting.
We can also filter this as if they are numbers.
a b c d target cats
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 good
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 excellent

>>> Select Columns Based on Regex


df = df_test.head(); df
df.columns = ["a_l", "b_l", "c_r","d_r","target"]  ; df
a_l b_l c_r d_r target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
df_out = df.filter(regex="_l",axis=1) ; df_out
a_l b_l
0 1.6243 -0.6118
1 0.8654 -2.3015
2 0.3190 -0.2494

>>> Accessing Group of Groupby Object


df = df_test.copy()
df = df.append(df, ignore_index=True)
df["groupie"] = ["falcon","hawk","hawk","eagle","falcon","hawk"]; df
a b c d target groupie
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 falcon
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 hawk
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 hawk
3 1.6243 -0.6118 -0.5282 -1.0730 1.1227 eagle
4 0.8654 -2.3015 1.7448 -0.7612 -5.9994 falcon
5 0.3190 -0.2494 1.4621 -2.0601 -0.5910 hawk
gbdf = df.groupby("groupie")
hawk = gbdf.get_group("hawk").mean(); hawk
a         0.5012
b        -0.9334
c         1.5563
d        -1.6272
target   -2.3938
dtype: float64

>>> Multiple External Selection Criteria


df = df_test.copy()
0 cr2 = df["b"] < 0 cr3 = df["c"] > 0 cr4 = df["d"] >-1 df[cr1 & cr2 & cr3 & cr4] ">
cr1 = df["a"] > 0
cr2 = df["b"] < 0
cr3 = df["c"] > 0
cr4 = df["d"] >-1

df[cr1 & cr2 & cr3 & cr4]
a b c d target
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994

>>> Memory Reduction Script (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
import gc

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        gc.collect()
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
df_out = pv.reduce_mem_usage(df); df_out
Memory usage of dataframe is 0.00 MB
Memory usage after optimization is: 0.00 MB
Decreased by 36.3%
a b c d target
0 1.6240 -0.6118 -0.5283 -1.0732 1.1230
1 0.8652 -2.3008 1.7451 -0.7612 -6.0000
2 0.3191 -0.2494 1.4619 -2.0605 -0.5908

>>> Verify Primary Key (func)


df = df_test.copy()
df["first_d"] = [0,1,2]
df["second_d"] = [4,1,9] ; df
a b c d target first_d second_d
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 0 4
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 1 1
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 2 9
def verify_primary_key(df, column_list):
    '''Verify if the columns in column_list can be treated as a primary key'''

    return df.shape[0] == df.groupby(column_list).size().reset_index().shape[0]

verify_primary_key(df, ["first_d","second_d"])
True
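
An equivalent check, assuming the key columns contain no missing values, is to look for duplicated combinations directly:

# True when every (first_d, second_d) pair occurs exactly once
not df.duplicated(subset=["first_d","second_d"]).any()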

>>> Shift Columns to Front (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def list_shuff(items, df):
    "Bring a list of columns to the front"
    cols = list(df)
    for i in range(len(items)):
        cols.insert(i, cols.pop(cols.index(items[i])))
    df = df.loc[:, cols]
    df.reset_index(drop=True, inplace=True)
    return df

df_out = pv.list_shuff(["target","c","d"],df); df_out
target c d a b
0 1.1227 -0.5282 -1.0730 1.6243 -0.6118
1 -5.9994 1.7448 -0.7612 0.8654 -2.3015
2 -0.5910 1.4621 -2.0601 0.3190 -0.2494

>>> Multiple Column Assignments


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
-1") .style.set_caption("Average consumption")) ; df_out ">
df_out = (df.assign(stringed = df["a"].astype(str),
            ounces = df["b"]*12,#                                     this will allow yo set a title
            galons = lambda df: df["a"]/128)
           .query("b > -1")
           .style.set_caption("Average consumption")) ; df_out
Average consumption
a b c d target stringed ounces galons
0 1.624 -0.6118 -0.5282 -1.073 1.123 1.6243453636632417 -7.341 0.01269
2 0.319 -0.2494 1.462 -2.06 -0.591 0.31903909605709857 -2.992 0.002492

>>> Method Chaining Technique


df = df_test.copy()
df[df>df.mean()]  = None ; df
a b c d target
0 NaN NaN -0.5282 NaN NaN
1 0.8654 -2.3015 NaN NaN -5.9994
2 0.3190 NaN NaN -2.0601 NaN
# with line continuation character
df_out = df.dropna(subset=["b","c"],how="all") \
.loc[df["a"]>0] \
.round(2) \
.groupby(["target","b"]).max() \
.unstack() \
.fillna(0) \
.rolling(1).sum() \
.reset_index() \
.stack() \
.ffill().bfill() 

df_out
a c d target
b
0 -2.3 0.87 0.0 0.0 -6.0
0.87 0.0 0.0 -6.0

>>> Load Multiple Files


import os
os.makedirs("folder",exist_ok=True,); df_test.to_csv("folder/first.csv",index=False) ; df_test.to_csv("folder/last.csv",index=False)
import glob
files = glob.glob('folder/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
df_out = pd.concat(dfs)
df_out
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
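
If you also need to know which file each row came from, a small variation of the same loop tags every frame with its source name before concatenating:

import os

dfs = [pd.read_csv(fp).assign(source=os.path.basename(fp)) for fp in files]
df_tagged = pd.concat(dfs, ignore_index=True); df_tagged.head()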

>>> Drop Rows with Column Substring


df = df_test.copy()
df["string_feature"] = ["1xZoo", "Safe7x", "bat4"]; df
a b c d target string_feature
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1xZoo
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 Safe7x
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 bat4
substring = ["xZ","7z", "tab4"]

df_out = df[~df.string_feature.str.contains('|'.join(substring))]; df_out
a b c d target string_feature
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 Safe7x
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 bat4

>>> Unnest (Explode) a Column


df = df_test.head()
df["g"] = [[str(a)+lista for a in range(4)] for lista in ["a","b","c"]]; df
a b c d target g
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 [0a, 1a, 2a, 3a]
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 [0b, 1b, 2b, 3b]
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 [0c, 1c, 2c, 3c]
df_out = df.explode("g"); df_out.iloc[:5,:]
a b c d target g
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 0a
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1a
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 2a
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 3a
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 0b

>>> Nest List Back into Column


### Run above example first 
df = df_out.copy()
df_out['g'] = df_out.groupby(df_out.index)['g'].agg(list); df_out.head()
a b c d target g
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 [0a, 1a, 2a, 3a]
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 [0a, 1a, 2a, 3a]
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 [0a, 1a, 2a, 3a]
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 [0a, 1a, 2a, 3a]
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 [0b, 1b, 2b, 3b]

>>> Split Cells With Lists


df = df_test.head()
df["g"] = [",".join([str(a)+lista for a in range(4)]) for lista in ["a","b","c"]]; df
a b c d target g
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 0a,1a,2a,3a
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 0b,1b,2b,3b
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0c,1c,2c,3c
df_out = df.assign(g = df["g"].str.split(",")).explode("g"); df_out.head()
a b c d target g
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 0a
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1a
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 2a
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 3a
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 0b

Table Exploration



>>> Groupby Functionality


df = df_test.head() 
df["gr"] = [1, 1 , 0] ;df
a b c d target gr
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 1
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0
In [34]: gb.<TAB>  # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     
gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    
gb.plot       gb.rank       gb.std        gb.transform  gb.aggregate  
gb.count      gb.cumprod    gb.dtype      gb.first      gb.nth
gb.groups     gb.hist       gb.max        gb.min        gb.gender        
gb.prod       gb.resample   gb.sum        gb.var        gb.ohlc  
gb.apply      gb.cummax     gb.cumsum     gb.fillna          
gb.head       gb.indices    gb.mean       gb.name            
gb.quantile   gb.size       gb.tail       gb.weight
df_out = df.groupby('gr').agg([np.sum, np.mean, np.std]); df_out.iloc[:,:8]
a b c
sum mean std sum mean std sum mean
gr
0 0.3190 0.3190 NaN -0.2494 -0.2494 NaN 1.4621 1.4621
1 2.4898 1.2449 0.5367 -2.9133 -1.4566 1.1949 1.2166 0.6083
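
If the MultiIndex columns produced by .agg([...]) get in the way, named aggregation (pandas >= 0.25) returns flat column names; a minimal sketch on the same frame:

df_flat = df.groupby("gr").agg(a_mean=("a", "mean"),
                               a_std=("a", "std"),
                               target_sum=("target", "sum")); df_flat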

>>> Cross Correlation Series Without Duplicates (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def corr_list(df):

  return  (df.corr()
          .unstack()
          .sort_values(kind="quicksort",ascending=False)
          .drop_duplicates().iloc[1:])
          
pv.corr_list(df)
b       target    0.9215
a       d         0.6605
        target    0.3206
b       a        -0.0724
c       d        -0.1764
        b        -0.4545
target  d        -0.4994
c       target   -0.7647
b       d        -0.7967
a       c        -0.8555
dtype: float64

>>> Missing Data Report (func)


df = df_test.copy()
df[df>df.mean()]  = None ; df
a b c d target
0 NaN NaN -0.5282 NaN NaN
1 0.8654 -2.3015 NaN NaN -5.9994
2 0.3190 NaN NaN -2.0601 NaN
def missing_data(data):
    "Create a dataframe with a percentage and count of missing values"
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

df_out = pv.missing_data(df); df_out
See Colab for Output

>>> Duplicated Rows Report


df = df_test.copy()
df["a"].iloc[2] = df["a"].iloc[1]
df["b"].iloc[2] = df["b"].iloc[1] ; df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.8654 -2.3015 1.4621 -2.0601 -0.5910
# Get a report of all duplicate records in a dataframe, based on specific columns
df_out = df[df.duplicated(['a', 'b'], keep=False)] ; df_out
a b c d target
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.8654 -2.3015 1.4621 -2.0601 -0.5910
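
To go from reporting to removal, drop_duplicates works on the same key columns; here keeping the first occurrence of each (a, b) pair:

df_dedup = df.drop_duplicates(subset=["a", "b"], keep="first"); df_dedup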

>>> Skewness (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
from scipy.stats import skew

def display_skewness(data):
    '''show skewness information

        Parameters
        ----------
        data: pandas dataframe

        Return
        ------
        df: pandas dataframe
    '''
    numeric_cols = data.columns[data.dtypes != 'object'].tolist()
    skew_value = []

    for i in numeric_cols:
        skew_value += [skew(data[i])]
    df = pd.concat(
        [pd.Series(numeric_cols), pd.Series(data.dtypes[data.dtypes != 'object'].apply(lambda x: str(x)).values)
            , pd.Series(skew_value)], axis=1)
    df.columns = ['var_name', 'col_type', 'skew_value']

    return df

display_skewness(df)
var_name col_type skew_value
0 a float64 0.1963
1 b float64 -0.6210
2 c float64 -0.6659
3 d float64 -0.5427
4 target float64 -0.5418

Feature Processing



>>> Remove Correlated Pairs (func)


df= df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def drop_corr(df, thresh=0.99,keep_cols=[]):
    df_corr = df.corr().abs()
    upper = df_corr.where(np.triu(np.ones(df_corr.shape), k=1).astype(bool))
    to_remove = [column for column in upper.columns if any(upper[column] > thresh)] ## Change to 99% for selection
    to_remove = [x for x in to_remove if x not in keep_cols]
    df_corr = df_corr.drop(columns = to_remove)
    return df.drop(to_remove,axis=1)

df_out = pv.drop_corr(df, thresh=0.1,keep_cols=["target"]); df_out
a b target
0 1.6243 -0.6118 1.1227
1 0.8654 -2.3015 -5.9994
2 0.3190 -0.2494 -0.5910

>>> Replace Infrequently Occurring Categories


df = df_test.copy()
df = df.append([df]*2)
df["cat"] = ["bat","bat","rat","mat","mat","mat","mat","mat","mat"]; df
a b c d target cat
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 bat
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 bat
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 rat
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 mat
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 mat
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 mat
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 mat
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 mat
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 mat
def replace_small_cat(df, columns, thresh=0.2, term="Other"):
  for col in columns:

    # Step 1: count the frequencies
    frequencies = df[col].value_counts(normalize = True)

    # Step 2: establish your threshold and filter the smaller categories
    small_categories = frequencies[frequencies < thresh].index

    df[col] = df[col].replace(small_categories, term)

  return df

df_out = pv.replace_small_cat(df,["cat"]); df_out.head()
a b c d target cat
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 bat
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 bat
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 Other
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 mat
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 mat

>>> Quasi-Constant Features Detection (func)


df = df_test.copy()
df["a"] = 3 
def constant_feature_detect(data,threshold=0.98):
    """ detect features that show the same value for the 
    majority/all of the observations (constant/quasi-constant features)
    
    Parameters
    ----------
    data : pd.Dataframe
    threshold : threshold to identify the variable as constant
        
    Returns
    -------
    list of variables names
    """
    
    data_copy = data.copy(deep=True)
    quasi_constant_feature = []
    for feature in data_copy.columns:
        predominant = (data_copy[feature].value_counts() / float(
                      len(data_copy))).sort_values(ascending=False).values[0]
        if predominant >= threshold:
            quasi_constant_feature.append(feature)
    print(len(quasi_constant_feature),' variables are found to be almost constant')    
    return quasi_constant_feature

# the original dataset has no constant variable
qconstant_col = pv.constant_feature_detect(data=df,threshold=0.9)
df_out = df.drop(qconstant_col, axis=1) ; df_out
1  variables are found to be almost constant
b c d target
0 -0.6118 -0.5282 -1.0730 1.1227
1 -2.3015 1.7448 -0.7612 -5.9994
2 -0.2494 1.4621 -2.0601 -0.5910
### I will take care of outliers separately

>>> Filling Missing Values Separately


df = df_test.copy()
df[df>df.mean()]  = None ; df
a b c d target
0 NaN NaN -0.5282 NaN NaN
1 0.8654 -2.3015 NaN NaN -5.9994
2 0.3190 NaN NaN -2.0601 NaN
# Clean up missing values in multiple DataFrame columns
dict_fill = {'a': 4,
              'b': 3,
              'c': 5,
              'd': 9999,
              'target': "False"}
df = df.fillna(dict_fill) ;df
a b c d target
0 4.0000 3.0000 -0.5282 9999.0000 False
1 0.8654 -2.3015 5.0000 9999.0000 -5.999
2 0.3190 3.0000 5.0000 -2.0601 False
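
If you would rather not hard-code the fill values, the same pattern works with per-column statistics; a sketch that builds the dictionary from each numeric column's median:

df = df_test.copy()
df[df > df.mean()] = None

dict_fill = df.median(numeric_only=True).to_dict()  # {'a': ..., 'b': ..., ...}
df = df.fillna(dict_fill); df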

>>> Conditioned Column Value Replacement


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
# Set DataFrame column values based on other column values
df.loc[(df['a'] >1 ) & (df['c'] <0), ['target']] = np.nan ;df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 NaN
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
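
When several condition/replacement pairs are needed, np.select keeps the logic readable; a minimal sketch on the same frame (the labels are purely illustrative):

conditions = [(df['a'] > 1) & (df['c'] < 0),   # same condition as above
              (df['b'] < -1)]
choices = ["flagged", "negative_b"]
df["flag"] = np.select(conditions, choices, default="ok"); df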

>>> Remove Non-numeric Values in Data Frame


df = df_test.copy().assign(target=lambda row: row["a"].round(4).astype(str)+"SC"+row["b"].round(4).astype(str))
df["a"] = "TI4560L" + df["a"].round(4).astype(str) ; df
a b c d target
0 TI4560L1.6243 -0.6118 -0.5282 -1.0730 1.6243SC-0.6118
1 TI4560L0.8654 -2.3015 1.7448 -0.7612 0.8654SC-2.3015
2 TI4560L0.319 -0.2494 1.4621 -2.0601 0.319SC-0.2494
df_out = df.replace('[^0-9]+', '', regex=True); df_out
a b c d target
0 456016243 -0.6118 -0.5282 -1.0730 1624306118
1 456008654 -2.3015 1.7448 -0.7612 0865423015
2 45600319 -0.2494 1.4621 -2.0601 031902494
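
Note that the pattern above also strips minus signs and decimal points; if those should survive, widen the character class (still a blunt, illustrative regex):

df_out = df.replace(r'[^0-9.\-]+', '', regex=True); df_out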

>>> Feature Scaling, Normalisation, Standardisation (func)


df= df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

def scaler(df,scaler=None,train=True, target=None, cols_ignore=None, type="Standard"):

  if cols_ignore:
    hold = df[cols_ignore].copy()
    df = df.drop(cols_ignore,axis=1)
  if target:
    x = df.drop([target],axis=1).values #returns a numpy array
  else:
    x = df.values
  if train:
    if type=="Standard":
      scal = StandardScaler()
    elif type=="MinMax":
      scal = MinMaxScaler()
    scal.fit(x)
    x_scaled = scal.transform(x)
  else:
    x_scaled = scaler.transform(x)
  
  if target:
    df_out = pd.DataFrame(x_scaled, index=df.index, columns=df.drop([target],axis=1).columns)
    df_out[target]= df[target]
  else:
    df_out = pd.DataFrame(x_scaled, index=df.index, columns=df.columns)
  
  if cols_ignore:
    df_out = pd.concat((hold,df_out),axis=1)
  if train:
    return df_out, scal
  else:
    return df_out

df_out_train, scl = pv.scaler(df,target="target",cols_ignore=["a"],type="MinMax")
df_out_test = pv.scaler(df_test,scaler=scl,train=False, target="target",cols_ignore=["a"]); df_out_test
a b c d target
0 1.6243 0.8234 0.0000 0.76 1.1227
1 0.8654 0.0000 1.0000 1.00 -5.9994
2 0.3190 1.0000 0.8756 0.00 -0.5910
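
Because the fitted scaler is returned, the scaled features can be mapped back to their original units; a sketch, assuming the column order matches what the scaler was fitted on (here b, c and d):

scaled_cols = ["b", "c", "d"]
pd.DataFrame(scl.inverse_transform(df_out_test[scaled_cols]),
             columns=scaled_cols, index=df_out_test.index)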

>>> Impute Null with Tail Distribution (func)


df = df_test.copy()
df[df>df.mean()]  = None ; df
a b c d target
0 NaN NaN -0.5282 NaN NaN
1 0.8654 -2.3015 NaN NaN -5.9994
2 0.3190 NaN NaN -2.0601 NaN
from warnings import warn

def impute_null_with_tail(df,cols=[]):
    """
    replacing the NA by values that are at the far end of the distribution of that variable
    calculated by mean + 3*std
    """
    
    df = df.copy(deep=True)
    for i in cols:
        if df[i].isnull().sum()>0:
            df[i] = df[i].fillna(df[i].mean()+3*df[i].std())
        else:
            warn("Column %s has no missing" % i)
    return df    
  

df_out = pv.impute_null_with_tail(df,cols=df.columns); df_out
a b c d target
0 1.7512 NaN -0.5282 NaN NaN
1 0.8654 -2.3015 NaN NaN -5.9994
2 0.3190 NaN NaN -2.0601 NaN

>>> Detect Outliers (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def outlier_detect(data,col,threshold=3,method="IQR"):
  
    if method == "IQR":
      IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
      Lower_fence = data[col].quantile(0.25) - (IQR * threshold)
      Upper_fence = data[col].quantile(0.75) + (IQR * threshold)
    if method == "STD":
      Upper_fence = data[col].mean() + threshold * data[col].std()
      Lower_fence = data[col].mean() - threshold * data[col].std()   
    if method == "OWN":
      Upper_fence = data[col].mean() + threshold * data[col].std()
      Lower_fence = data[col].mean() - threshold * data[col].std() 
    if method =="MAD":
      median = data[col].median()
      median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])
      modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])
      outlier_index = np.abs(modified_z_scores) > threshold
      print('Num of outlier detected:',outlier_index.value_counts()[1])
      print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))
      return outlier_index, (median_absolute_deviation, median_absolute_deviation)


    para = (Upper_fence, Lower_fence)
    tmp = pd.concat([data[col]>Upper_fence,data[col]<Lower_fence],axis=1)
    outlier_index = tmp.any(axis=1)
    print('Num of outlier detected:',outlier_index.value_counts()[1])
    print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))
    
    return outlier_index, para
    

index,para = pv.outlier_detect(df,"a",threshold=0.5,method="IQR")
print('Upper bound:',para[0],'\nLower bound:',para[1])
Num of outlier detected: 1
Proportion of outlier detected 0.3333333333333333
Upper bound: 1.5712030633954956 
Lower bound: 0.2658967957893529

>>> Windsorize Outliers (func)


# RUN above example first
df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def windsorization(data,col,para,strategy='both'):
    """
    top-coding & bottom coding (capping the maximum of a distribution at an arbitrarily set value,vice versa)
    """

    data_copy = data.copy(deep=True)  
    if strategy == 'both':
        data_copy.loc[data_copy[col]>para[0],col] = para[0]
        data_copy.loc[data_copy[col]<para[1],col] = para[1]
    elif strategy == 'top':
        data_copy.loc[data_copy[col]>para[0],col] = para[0]
    elif strategy == 'bottom':
        data_copy.loc[data_copy[col]<para[1],col] = para[1]  
    return data_copy


df_out = pv.windsorization(data=df,col='a',para=para,strategy='both'); df_out
a b c d target
0 1.5712 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910

>>> Drop Outliers


## run the top two examples
df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
df_out = df[~index] ; df_out
a b c d target
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910

>>> Impute Outliers


def impute_outlier(data,col,outlier_index,strategy='mean'):
    """
    impute outlier with mean/median/most frequent values of that variable.
    """

    data_copy = data.copy(deep=True)
    if strategy=='mean':
        data_copy.loc[outlier_index,col] = data_copy[col].mean()
    elif strategy=='median':
        data_copy.loc[outlier_index,col] = data_copy[col].median()
    elif strategy=='mode':
        data_copy.loc[outlier_index,col] = data_copy[col].mode()[0]   
        
    return data_copy
  
df_out = pv.impute_outlier(data=df,col='a', outlier_index=index,strategy='mean'); df_out
a b c d target
0 0.9363 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910

Feature Engineering



>>> Automated Dummy (one-hot) Encoding (func)


df["a"], 1, 2) ">
df = df_test.copy()
df["e"] = np.where(df["c"]> df["a"], 1,  2)
def auto_dummy(df, unique=15):
  # Creating dummies for small object uniques
  if len(df)<unique:
    raise ValueError('unique is set higher than data length')
  list_dummies =[]
  for col in df.columns:
      if (len(df[col].unique()) < unique):
          list_dummies.append(col)
          print(col)
  df_edit = pd.get_dummies(df, columns = list_dummies) # create dummies for the selected columns
  #df_edit = pd.concat([df[["year","qtr"]],df_edit],axis=1)
  return df_edit

df_out = pv.auto_dummy(df, unique=3); df_out
a b c d target e_1 e_2
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 0 1
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 1 0
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 1 0

>>> Binarise Empty Columns (func)


df = df_test.copy()
df[df>df.mean()]  = None ; df
a b c d target
0 NaN NaN -0.5282 NaN NaN
1 0.8654 -2.3015 NaN NaN -5.9994
2 0.3190 NaN NaN -2.0601 NaN
def binarise_empty(df, frac=80):
  # Binarise slightly empty columns
  this =[]
  for col in df.columns:
      if df[col].dtype != "object":
          is_null = df[col].isnull().astype(int).sum()
          if (is_null/df.shape[0]) > frac: # if more than `frac` of the column is null, binarise
              print(col)
              this.append(col)
              df[col] = df[col].astype(float)
              df[col] = df[col].apply(lambda x: 0 if (np.isnan(x)) else 1)
  df = pd.get_dummies(df, columns = this) 
  return df

df_out = pv.binarise_empty(df, frac=0.6); df_out
b
c
d
target
a b_0 b_1 c_0 c_1 d_0 d_1 target_0 target_1
0 NaN 1 0 0 1 1 0 1 0
1 0.8654 0 1 1 0 1 0 0 1
2 0.3190 1 0 1 0 0 1 1 0

>>> Polynomials (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def polynomials(df, feature_list):
  for feat in feature_list:
    for feat_two in feature_list:
      if feat==feat_two:
        continue
      else:
       df[feat+"/"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min()) #zero division guard
       df[feat+"X"+feat_two] = df[feat]*(df[feat_two])

  return df

df_out = pv.polynomials(df, ["a","b"]) ; df_out
a b c d target a/b aXb b/a bXa
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 0.9613 -0.9937 -0.4687 -0.9937
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 inf -1.9918 -4.2124 -1.9918
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0.1555 -0.0796 -inf -0.0796

>>> Transformations (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def transformations(df,features):
  df_new = df[features]
  df_new = df_new - df_new.min()

  sqr_name = [str(fa)+"_POWER_2" for fa in df_new.columns]
  log_p_name = [str(fa)+"_LOG_p_one_abs" for fa in df_new.columns]
  rec_p_name = [str(fa)+"_RECIP_p_one" for fa in df_new.columns]
  sqrt_name = [str(fa)+"_SQRT_p_one" for fa in df_new.columns]

  df_sqr = pd.DataFrame(np.power(df_new.values, 2),columns=sqr_name, index=df.index)
  df_log = pd.DataFrame(np.log(df_new.add(1).abs().values),columns=log_p_name, index=df.index)
  df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values),columns=rec_p_name, index=df.index)
  df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values),columns=sqrt_name, index=df.index)

  dfs = [df, df_sqr, df_log, df_rec, df_sqrt]

  df=  pd.concat(dfs, axis=1)

  return df

df_out = pv.transformations(df,["a","b"]); df_out.iloc[:,:8]
a b c d target a_POWER_2 b_POWER_2 a_LOG_p_one_abs
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 1.7038 2.8554 0.8352
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 0.2985 0.0000 0.4359
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0.0000 4.2114 0.0000

>>> Genetic Programming


! pip install gplearn
Collecting gplearn
  Downloading https://files.pythonhosted.org/packages/43/6b/ee38cd74b32ad5056603aabbef622f9691f19d0869574dfc610034f18662/gplearn-0.4.1-py3-none-any.whl (41kB)
     |████████████████████████████████| 51kB 2.5MB/s
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.6/dist-packages (from gplearn) (0.22.1)
Requirement already satisfied: joblib>=0.13.0 in /usr/local/lib/python3.6/dist-packages (from gplearn) (0.14.1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20.0->gplearn) (1.17.5)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20.0->gplearn) (1.4.1)
Installing collected packages: gplearn
Successfully installed gplearn-0.4.1
df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
from gplearn.genetic import SymbolicTransformer
function_set = ['add', 'sub', 'mul', 'div',
                'sqrt', 'log', 'abs', 'neg', 'inv','tan']

gp = SymbolicTransformer(generations=800, population_size=200,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=6)

gen_feats = gp.fit_transform(df.drop("target", axis=1), df["target"]); df.iloc[:,:8]
df_out = pd.concat((df,pd.DataFrame(gen_feats, columns=["gen_"+str(a) for a in range(gen_feats.shape[1])])),axis=1); df_out.iloc[:,:8]
    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    10.14             0.91       22                1                0     43.36m
a b c d target gen_0 gen_1 gen_2
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 -1.8292 -2.6469 0.5059
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 -3.5190 99.1619 3.6243
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 -1.4668 1.3677 3.1826

>>> Principal Component Features (func)


df =df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
from sklearn.decomposition import PCA, IncrementalPCA

def pca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,drop_cols=None):

  if memory_issues:
    if not mem_iss_component:
      raise ValueError("If you have memory issues, you have to preselect mem_iss_component")
    pca = IncrementalPCA(mem_iss_component)
  else:
    if variance_or_components>1:
      pca = PCA(n_components=variance_or_components) 
    else: # automated selection based on variance
      pca = PCA(n_components=variance_or_components,svd_solver="full") 
  X_pca = pca.fit_transform(df.drop(drop_cols,axis=1))
  df = pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])])),axis=1)
  return df

df_out = pv.pca_feature(df,variance_or_components=0.80,drop_cols=["target","a"]); df_out
target a PCA_1 PCA_2
0 1.1227 1.6243 -1.2944 -0.7684
1 -5.9994 0.8654 1.5375 -0.4537
2 -0.5910 0.3190 -0.2431 1.2220
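
To see how much variance each retained component explains, you can fit PCA directly on the same columns and inspect explained_variance_ratio_; a standalone sketch:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.80, svd_solver="full")
pca.fit(df.drop(["target", "a"], axis=1))
print(pca.explained_variance_ratio_)  # variance share of each kept component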

>>> Multiple Lags (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def multiple_lags(df, start=1, end=3,columns=None):
  if not columns:
    columns = df.columns.to_list()
  lags = range(start, end+1)  # Just two lags for demonstration.

  df = df.assign(**{
      '{}_t_{}'.format(col, t): df[col].shift(t)
      for t in lags
      for col in columns
  })
  return df

df_out = pv.multiple_lags(df, start=1, end=2,columns=["a","target"]); df_out
a b c d target a_t_1 target_t_1 a_t_2 target_t_2
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 NaN NaN NaN NaN
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 1.6243 1.1227 NaN NaN
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0.8654 -5.9994 1.6243 1.1227

>>> Multiple Rolling (func)


df = df_test.copy(); df
a b c d target
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910
def multiple_rolling(df, windows = [1,2], functions=["mean","std"], columns=None):
  windows = [1+a for a in windows]
  if not columns:
    columns = df.columns.to_list()
  rolling_dfs = (df[columns].rolling(i)                                    # 1. Create window
                  .agg(functions)                                # 1. Aggregate
                  .rename({col: '{0}_{1:d}'.format(col, i)
                                for col in columns}, axis=1)  # 2. Rename columns
                for i in windows)                                # For each window
  df_out = pd.concat((df, *rolling_dfs), axis=1)
  da = df_out.iloc[:,len(df.columns):]
  da = [col[0] + "_" + col[1] for col in  da.columns.to_list()]
  df_out.columns = df.columns.to_list() + da 

  return  df_out                      # 3. Concatenate dataframes

df_out = pv.multiple_rolling(df, columns=["a"]); df_out
a b c d target a_2_mean a_2_std a_3_mean a_3_std
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 NaN NaN NaN NaN
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 1.2449 0.5367 NaN NaN
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 0.5922 0.3863 0.9363 0.6555

>>> Date Features


df = df_test.copy()
df["date_fake"] = pd.date_range(start="2019-01-03", end="2019-01-06", periods=len(df)); df
a b c d target date_fake
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 2019-01-03 00:00:00
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 2019-01-04 12:00:00
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 2019-01-06 00:00:00
def date_features(df, date="date"):
  df[date] = pd.to_datetime(df[date])
  df[date+"_month"] = df[date].dt.month.astype(int)
  df[date+"_year"]  = df[date].dt.year.astype(int)
  df[date+"_week"]  = df[date].dt.week.astype(int)
  df[date+"_day"]   = df[date].dt.day.astype(int)
  df[date+"_dayofweek"]= df[date].dt.dayofweek.astype(int)
  df[date+"_dayofyear"]= df[date].dt.dayofyear.astype(int)
  df[date+"_hour"] = df[date].dt.hour.astype(int)
  df[date+"_int"] = pd.to_datetime(df[date]).astype(int)
  return df

df_out = date_features(df, date="date_fake"); df_out.iloc[:,:8]
a b c d target date_fake date_fake_month date_fake_year
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 2019-01-03 00:00:00 1 2019
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 2019-01-04 12:00:00 1 2019
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 2019-01-06 00:00:00 1 2019
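
Calendar fields such as month or day-of-week are cyclical (December sits next to January); a common follow-up, sketched here, is to encode them as sine/cosine pairs:

# map month (1-12) onto the unit circle so month 12 and month 1 end up adjacent
df_out["month_sin"] = np.sin(2 * np.pi * df_out["date_fake_month"] / 12)
df_out["month_cos"] = np.cos(2 * np.pi * df_out["date_fake_month"] / 12)
df_out[["date_fake_month", "month_sin", "month_cos"]]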

>>> Haversine Distance (Location Feature) (func)


df = df_test.copy()
df["latitude"] = [39, 35 , 20]
df["longitude"]=  [-77, -40 , -10 ]
from math import sin, cos, sqrt, atan2, radians
def haversine_distance(row, lon="latitude", lat="longitude"):
    c_lat,c_long = radians(52.5200), radians(13.4050)
    R = 6373.0
    long = radians(row['longitude'])
    lat = radians(row['latitude'])
    
    dlon = long - c_long
    dlat = lat - c_lat
    a = sin(dlat / 2)**2 + cos(lat) * cos(c_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    
    return R * c

df['distance_central'] = df.apply(pv.haversine_distance,axis=1); df.iloc[:,4:]
target latitude longitude distance_central
0 1.1227 39 -77 6702.7127
1 -5.9994 35 -40 4583.5988
2 -0.5910 20 -10 4141.6783

>>> Parse Address


df = df_test.copy()
df["addr"] = pd.Series([
            'Washington, D.C. 20003',
            'Brooklyn, NY 11211-1755',
            'Omaha, NE 68154' ]) ; df
a b c d target addr
0 1.6243 -0.6118 -0.5282 -1.0730 1.1227 Washington, D.C....
1 0.8654 -2.3015 1.7448 -0.7612 -5.9994 Brooklyn, NY 112...
2 0.3190 -0.2494 1.4621 -2.0601 -0.5910 Omaha, NE 68154
regex = (r'(?P<city>[A-Za-z ]+), (?P<state>[A-Z]{2}) (?P<zip>\d{5}(?:-\d{4})?)')

df.addr.str.replace('.', '', regex=False).str.extract(regex)
city state zip
0 Washington DC 20003
1 Brooklyn NY 11211-1755
2 Omaha NE 68154

>>> Processing Strings in Pandas


df = pd.util.testing.makeMixedDataFrame()
df["C"] = df["C"] + " " + df["C"] ; df
A B C D
0 0.0 0.0 foo1 foo1 2009-01-01
1 1.0 1.0 foo2 foo2 2009-01-02
2 2.0 0.0 foo3 foo3 2009-01-05
3 3.0 1.0 foo4 foo4 2009-01-06
4 4.0 0.0 foo5 foo5 2009-01-07
"""convert column to UPPERCASE"""

col_name = "C"
df[col_name].str.upper()

"""count string occurence in each row"""
df[col_name].str.count(r'\d') # counts number of digits

"""count # o chars in each row"""
df[col_name].str.count('o') # counts number of digits

"""split rows"""
s = pd.Series(["this is a regular sentence", "https://docs.p.org", np.nan])
s.str.split()

"""this creates new columns with the different split values (instead of lists)"""
s.str.split(expand=True)  

"""limit the number of splits to 1, and start spliting from the rights side"""
s.str.rsplit("/", n=1, expand=True) 
0 1
0 this is a regula... None
1 https:/ docs.p.org
2 NaN NaN

>>> Filtering Strings in Pandas


df = pd.util.testing.makeMixedDataFrame()
df["C"] = df["C"] + " " + df["C"] ; df
A B C D
0 0.0 0.0 foo1 foo1 2009-01-01
1 1.0 1.0 foo2 foo2 2009-01-02
2 2.0 0.0 foo3 foo3 2009-01-05
3 3.0 1.0 foo4 foo4 2009-01-06
4 4.0 0.0 foo5 foo5 2009-01-07
Mon)""" df[col_name].str.replace(r'(\w+day\b)', lambda x: x.groups[0][:3]) # () in r'' creates a group with one element, which we acces with x.groups[0] """create dataframe from regex groups (str.extract() uses first match of the pattern only)""" df[col_name].str.extract(r'(\d?\d):(\d\d)') df[col_name].str.extract(r'(?P \d?\d):(?P \d\d)') df[col_name].str.extract(r'(?P

Model Validation



>>> Classification Metrics (func)


y_test = [0, 1, 1, 1, 0]
y_predict = [0, 0, 1, 1, 1]
y_prob = [0.2,0.6,0.7,0.7,0.9]
from sklearn.metrics import roc_auc_score, average_precision_score, confusion_matrix
from sklearn.metrics import log_loss, brier_score_loss, accuracy_score

def classification_scores(y_test, y_predict, y_prob):

  confusion_mat = confusion_matrix(y_test,y_predict)

  TN = confusion_mat[0][0]
  FP = confusion_mat[0][1]
  TP = confusion_mat[1][1]
  FN = confusion_mat[1][0]

  TPR = TP/(TP+FN)
  # Specificity or true negative rate
  TNR = TN/(TN+FP) 
  # Precision or positive predictive value
  PPV = TP/(TP+FP)
  # Negative predictive value
  NPV = TN/(TN+FN)
  # Fall out or false positive rate
  FPR = FP/(FP+TN)
  # False negative rate
  FNR = FN/(TP+FN)
  # False discovery rate
  FDR = FP/(TP+FP)

  ll = log_loss(y_test, y_prob) # It's low, but means nothing on its own. 
  br = brier_score_loss(y_test, y_prob) # It's low, but means nothing on its own. 
  acc = accuracy_score(y_test, y_predict)
  print(acc)
  auc = roc_auc_score(y_test, y_prob)
  print(auc)
  prc = average_precision_score(y_test, y_prob) 

  data = np.array([np.arange(1)]*1).T

  df_exec = pd.DataFrame(data)

  df_exec["Average Log Likelihood"] = ll
  df_exec["Brier Score Loss"] = br
  df_exec["Accuracy Score"] = acc
  df_exec["ROC AUC Sore"] = auc
  df_exec["Average Precision Score"] = prc
  df_exec["Precision - Bankrupt Firms"] = PPV
  df_exec["False Positive Rate (p-value)"] = FPR
  df_exec["Precision - Healthy Firms"] = NPV
  df_exec["False Negative Rate (recall error)"] = FNR
  df_exec["False Discovery Rate "] = FDR
  df_exec["All Observations"] = TN + TP + FN + FP
  df_exec["Bankruptcy Sample"] = TP + FN
  df_exec["Healthy Sample"] = TN + FP
  df_exec["Recalled Bankruptcy"] = TP + FP
  df_exec["Correct (True Positives)"] = TP
  df_exec["Incorrect (False Positives)"] = FP
  df_exec["Recalled Healthy"] = TN + FN
  df_exec["Correct (True Negatives)"] = TN
  df_exec["Incorrect (False Negatives)"] = FN

  df_exec = df_exec.T[1:]
  df_exec.columns = ["Metrics"]
  return df_exec


met = pv.classification_scores(y_test, y_predict, y_prob); met
0.6
0.5
Metrics
Average Log Likelihood 0.7500
Brier Score Loss 0.2380
Accuracy Score 0.6000
ROC AUC Score 0.5000
Average Precision Score 0.6944
Precision - Bankrupt Firms 0.6667
False Positive Rate (p-value) 0.5000
Precision - Healthy Firms 0.5000
False Negative Rate (recall error) 0.3333
False Discovery Rate 0.3333
All Observations 5.0000
Bankruptcy Sample 3.0000
Healthy Sample 2.0000
Recalled Bankruptcy 3.0000
Correct (True Positives) 2.0000
Incorrect (False Positives) 1.0000
Recalled Healthy 2.0000
Correct (True Negatives) 1.0000
Incorrect (False Negatives) 1.0000
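
For a quick, human-readable confusion matrix to sit alongside these metrics, pd.crosstab works directly on the label lists; a small sketch:

pd.crosstab(pd.Series(y_test, name="actual"),
            pd.Series(y_predict, name="predicted"))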