Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Overview

pyspark-anonymizer

Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Installing

pip install pyspark-anonymizer

Usage

Before Masking

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")
df.limit(5).toPandas()
marketplace customer_id review_id product_id product_parent product_title star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 US 51163966 R2RX7KLOQQ5VBG B00000JBAT 738692522 Diamond Rio Digital Player 3 0 0 N N Why just 30 minutes? RIO is really great, but Diamond should increa... 1999-06-22 1999
1 US 30050581 RPHMRNCGZF2HN B001BRPLZU 197287809 NG 283220 AC Adapter Power Supply for HP Pavil... 5 0 0 N Y Five Stars Great quality for the price!!!! 2014-11-17 2014
2 US 52246039 R3PD79H9CTER8U B00000JBAT 738692522 Diamond Rio Digital Player 5 1 2 N N The digital audio "killer app" One of several first-generation portable MP3 p... 1999-06-30 1999
3 US 16186332 R3U6UVNH7HGDMS B009CY43DK 856142222 HDE Mini Portable Capsule Travel Mobile Pocket... 5 0 0 N Y Five Stars I like it, got some for the Grandchilren 2014-11-17 2014
4 US 53068431 R3SP31LN235GV3 B00000JBSN 670078724 JVC FS-7000 Executive MicroSystem (Discontinue... 3 5 5 N N Design flaws ruined the better functions I returned mine for a couple of reasons: The ... 1999-07-13 1999

After Masking

In this example we will add the following data anonymizers:

  • drop_column on column "marketplace"
  • replace all values to "*" of the "customer_id" column
  • replace_with_regex "R\d" (R and any digit) to "*" on "review_id" column
  • sha256 on "product_id" column
  • filter_row with condition "product_parent != 738692522"
from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

dataframe_anonymizers = [
    {
        "method": "drop_column",
        "parameters": {
            "column_name": "marketplace"
        }
    },
    {
        "method": "replace",
        "parameters": {
            "column_name": "customer_id",
            "replace_to": "*"
        }
    },
    {
        "method": "replace_with_regex",
        "parameters": {
            "column_name": "review_id",
            "replace_from_regex": "R\d",
            "replace_to": "*"
        }
    },
    {
        "method": "sha256",
        "parameters": {
            "column_name": "product_id"
        }
    },
    {
        "method": "filter_row",
        "parameters": {
            "where": "product_parent != 738692522"
        }
    }
]

df_parsed = pyspark_anonymizer.Parser(df, dataframe_anonymizers, spark_functions).parse()
df_parsed.limit(5).toPandas()
customer_id review_id product_id product_parent product_title star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 * RPHMRNCGZF2HN 69031b13080f90ae3bbbb505f5f80716cd11c4eadd8d86... 197287809 NG 283220 AC Adapter Power Supply for HP Pavil... 5 0 0 N Y Five Stars Great quality for the price!!!! 2014-11-17 2014
1 * *U6UVNH7HGDMS c99947c06f65c1398b39d092b50903986854c21fd1aeab... 856142222 HDE Mini Portable Capsule Travel Mobile Pocket... 5 0 0 N Y Five Stars I like it, got some for the Grandchilren 2014-11-17 2014
2 * *SP31LN235GV3 eb6b489524a2fb1d2de5d2e869d600ee2663e952a4b252... 670078724 JVC FS-7000 Executive MicroSystem (Discontinue... 3 5 5 N N Design flaws ruined the better functions I returned mine for a couple of reasons: The ... 1999-07-13 1999
3 * *IYAZPPTRJF7E 2a243d31915e78f260db520d9dcb9b16725191f55c54df... 503838146 BlueRigger High Speed HDMI Cable with Ethernet... 3 0 0 N Y Never got around to returning the 1 out of 2 ... Never got around to returning the 1 out of 2 t... 2014-11-17 2014
4 * *RDD9FILG1LSN c1f5e54677bf48936fb1e9838869630e934d16ac653b15... 587294791 Brookstone 2.4GHz Wireless TV Headphones 5 3 3 N Y Saved my. marriage, I swear to god. Saved my.marriage, I swear to god. 2014-11-17 2014

Anonymizers from DynamoDB

You can store anonymizers on DynamoDB too.

Creating DynamoDB table

To create the table follow the steps below.

Using example script

On AWS console:

  • DynamoDB > Tables > Create table
  • Table name: "pyspark_anonymizer" (or any other of your own)
  • Partition key: "dataframe_name"
  • Customize the settings if you want
  • Create table

Writing Anonymizer on DynamoDB

You can run the example script, then edit your settings from there.

Parse from DynamoDB

from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer
import boto3
from botocore.exceptions import ClientError as client_error

dynamo_table = "pyspark_anonymizer"
dataframe_name = "table_x"

dynamo_table = boto3.resource('dynamodb').Table(dynamo_table)
spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

df_parsed = pyspark_anonymizer.ParserFromDynamoDB(df, dataframe_name, dynamo_table, spark_functions, client_error).parse()

df_parsed.limit(5).toPandas()

The output will be same as the previous. The difference is that the anonymization settings will be in DynamoDB

Currently supported data masking/anonymization methods

  • Methods
    • drop_column - Drop a column.
    • replace - Replace all column to a string.
    • replace_with_regex - Replace column contents with regex.
    • sha256 - Apply sha256 hashing function.
    • filter_row - Apply a filter to the dataframe.
PySurvival is an open source python package for Survival Analysis modeling

PySurvival What is Pysurvival ? PySurvival is an open source python package for Survival Analysis modeling - the modeling concept used to analyze or p

Square 265 Dec 27, 2022
Implementation of the Object Relation Transformer for Image Captioning

Object Relation Transformer This is a PyTorch implementation of the Object Relation Transformer published in NeurIPS 2019. You can find the paper here

Yahoo 158 Dec 24, 2022
PROTEIN EXPRESSION ANALYSIS FOR DOWN SYNDROME

PROTEIN-EXPRESSION-ANALYSIS-FOR-DOWN-SYNDROME Down syndrome (DS) is a chromosomal disorder where organisms have an extra chromosome 21, sometimes know

1 Jan 20, 2022
Projeto: Machine Learning: Linguagens de Programacao 2004-2001

Projeto: Machine Learning: Linguagens de Programacao 2004-2001 Projeto de Data Science e Machine Learning de análise de linguagens de programação de 2

Victor Hugo Negrisoli 0 Jun 29, 2021
Tangram makes it easy for programmers to train, deploy, and monitor machine learning models.

Tangram Website | Discord Tangram makes it easy for programmers to train, deploy, and monitor machine learning models. Run tangram train to train a mo

Tangram 1.4k Jan 05, 2023
A python library for easy manipulation and forecasting of time series.

Time Series Made Easy in Python darts is a python library for easy manipulation and forecasting of time series. It contains a variety of models, from

Unit8 5.2k Jan 04, 2023
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective. 10x Larger Models 10x Faster Trainin

Microsoft 8.4k Dec 30, 2022
QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

152 Jan 02, 2023
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

FFT-accelerated Interpolation-based t-SNE (FIt-SNE) Introduction t-Stochastic Neighborhood Embedding (t-SNE) is a highly successful method for dimensi

Kluger Lab 547 Dec 21, 2022
Predict the output which should give a fair idea about the chances of admission for a student for a particular university

Predict the output which should give a fair idea about the chances of admission for a student for a particular university.

ArvindSandhu 1 Jan 11, 2022
A machine learning web application for binary classification using streamlit

Machine Learning web App This is a machine learning web application for binary classification using streamlit options this application contains 3 clas

abdelhak mokri 1 Dec 20, 2021
QML: A Python Toolkit for Quantum Machine Learning

QML is a Python2/3-compatible toolkit for representation learning of properties of molecules and solids.

176 Dec 09, 2022
A concept I came up which ditches the idea of "layers" in a neural network.

Dynet A concept I came up which ditches the idea of "layers" in a neural network. Install Copy Dynet.py to your project. Run the example Install matpl

Anik Patel 4 Dec 05, 2021
Module for statistical learning, with a particular emphasis on time-dependent modelling

Operating system Build Status Linux/Mac Windows tick tick is a Python 3 module for statistical learning, with a particular emphasis on time-dependent

X - Data Science Initiative 410 Dec 14, 2022
Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

AutoViz and Auto_ViML 519 Jan 03, 2023
Transform ML models into a native code with zero dependencies

m2cgen (Model 2 Code Generator) - is a lightweight library which provides an easy way to transpile trained statistical models into a native code

Bayes' Witnesses 2.3k Jan 03, 2023
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
ETNA is an easy-to-use time series forecasting framework.

ETNA is an easy-to-use time series forecasting framework. It includes built in toolkits for time series preprocessing, feature generation, a variety of predictive models with unified interface - from

Tinkoff.AI 674 Jan 07, 2023
PyTorch extensions for high performance and large scale training.

Description FairScale is a PyTorch extension library for high performance and large scale training on one or multiple machines/nodes. This library ext

Facebook Research 2k Dec 28, 2022
A Lucid Framework for Transparent and Interpretable Machine Learning Models.

Currently a Beta-Version lucidmode is an open-source, low-code and lightweight Python framework for transparent and interpretable machine learning mod

lucidmode 15 Aug 12, 2022