Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Overview

pyspark-anonymizer

Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Installing

pip install pyspark-anonymizer

Usage

Before Masking

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")
df.limit(5).toPandas()
marketplace customer_id review_id product_id product_parent product_title star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 US 51163966 R2RX7KLOQQ5VBG B00000JBAT 738692522 Diamond Rio Digital Player 3 0 0 N N Why just 30 minutes? RIO is really great, but Diamond should increa... 1999-06-22 1999
1 US 30050581 RPHMRNCGZF2HN B001BRPLZU 197287809 NG 283220 AC Adapter Power Supply for HP Pavil... 5 0 0 N Y Five Stars Great quality for the price!!!! 2014-11-17 2014
2 US 52246039 R3PD79H9CTER8U B00000JBAT 738692522 Diamond Rio Digital Player 5 1 2 N N The digital audio "killer app" One of several first-generation portable MP3 p... 1999-06-30 1999
3 US 16186332 R3U6UVNH7HGDMS B009CY43DK 856142222 HDE Mini Portable Capsule Travel Mobile Pocket... 5 0 0 N Y Five Stars I like it, got some for the Grandchilren 2014-11-17 2014
4 US 53068431 R3SP31LN235GV3 B00000JBSN 670078724 JVC FS-7000 Executive MicroSystem (Discontinue... 3 5 5 N N Design flaws ruined the better functions I returned mine for a couple of reasons: The ... 1999-07-13 1999

After Masking

In this example we will add the following data anonymizers:

  • drop_column on column "marketplace"
  • replace all values to "*" of the "customer_id" column
  • replace_with_regex "R\d" (R and any digit) to "*" on "review_id" column
  • sha256 on "product_id" column
  • filter_row with condition "product_parent != 738692522"
from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

dataframe_anonymizers = [
    {
        "method": "drop_column",
        "parameters": {
            "column_name": "marketplace"
        }
    },
    {
        "method": "replace",
        "parameters": {
            "column_name": "customer_id",
            "replace_to": "*"
        }
    },
    {
        "method": "replace_with_regex",
        "parameters": {
            "column_name": "review_id",
            "replace_from_regex": "R\d",
            "replace_to": "*"
        }
    },
    {
        "method": "sha256",
        "parameters": {
            "column_name": "product_id"
        }
    },
    {
        "method": "filter_row",
        "parameters": {
            "where": "product_parent != 738692522"
        }
    }
]

df_parsed = pyspark_anonymizer.Parser(df, dataframe_anonymizers, spark_functions).parse()
df_parsed.limit(5).toPandas()
customer_id review_id product_id product_parent product_title star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 * RPHMRNCGZF2HN 69031b13080f90ae3bbbb505f5f80716cd11c4eadd8d86... 197287809 NG 283220 AC Adapter Power Supply for HP Pavil... 5 0 0 N Y Five Stars Great quality for the price!!!! 2014-11-17 2014
1 * *U6UVNH7HGDMS c99947c06f65c1398b39d092b50903986854c21fd1aeab... 856142222 HDE Mini Portable Capsule Travel Mobile Pocket... 5 0 0 N Y Five Stars I like it, got some for the Grandchilren 2014-11-17 2014
2 * *SP31LN235GV3 eb6b489524a2fb1d2de5d2e869d600ee2663e952a4b252... 670078724 JVC FS-7000 Executive MicroSystem (Discontinue... 3 5 5 N N Design flaws ruined the better functions I returned mine for a couple of reasons: The ... 1999-07-13 1999
3 * *IYAZPPTRJF7E 2a243d31915e78f260db520d9dcb9b16725191f55c54df... 503838146 BlueRigger High Speed HDMI Cable with Ethernet... 3 0 0 N Y Never got around to returning the 1 out of 2 ... Never got around to returning the 1 out of 2 t... 2014-11-17 2014
4 * *RDD9FILG1LSN c1f5e54677bf48936fb1e9838869630e934d16ac653b15... 587294791 Brookstone 2.4GHz Wireless TV Headphones 5 3 3 N Y Saved my. marriage, I swear to god. Saved my.marriage, I swear to god. 2014-11-17 2014

Anonymizers from DynamoDB

You can store anonymizers on DynamoDB too.

Creating DynamoDB table

To create the table follow the steps below.

Using example script

On AWS console:

  • DynamoDB > Tables > Create table
  • Table name: "pyspark_anonymizer" (or any other of your own)
  • Partition key: "dataframe_name"
  • Customize the settings if you want
  • Create table

Writing Anonymizer on DynamoDB

You can run the example script, then edit your settings from there.

Parse from DynamoDB

from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer
import boto3
from botocore.exceptions import ClientError as client_error

dynamo_table = "pyspark_anonymizer"
dataframe_name = "table_x"

dynamo_table = boto3.resource('dynamodb').Table(dynamo_table)
spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

df_parsed = pyspark_anonymizer.ParserFromDynamoDB(df, dataframe_name, dynamo_table, spark_functions, client_error).parse()

df_parsed.limit(5).toPandas()

The output will be same as the previous. The difference is that the anonymization settings will be in DynamoDB

Currently supported data masking/anonymization methods

  • Methods
    • drop_column - Drop a column.
    • replace - Replace all column to a string.
    • replace_with_regex - Replace column contents with regex.
    • sha256 - Apply sha256 hashing function.
    • filter_row - Apply a filter to the dataframe.
Repositório para o #alurachallengedatascience1

1° Challenge de Dados - Alura A Alura Voz é uma empresa de telecomunicação que nos contratou para atuar como cientistas de dados na equipe de vendas.

Sthe Monica 16 Nov 10, 2022
Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)"

CRAN Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)" This code doesn't exa

4 Nov 11, 2021
nn-Meter is a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices

A DNN inference latency prediction toolkit for accurately modeling and predicting the latency on diverse edge devices.

Microsoft 241 Dec 26, 2022
Bayesian optimization in JAX

Bayesian optimization in JAX

Predictive Intelligence Lab 26 May 11, 2022
Anytime Learning At Macroscale

On Anytime Learning At Macroscale Learning from sequential data dumps (key) Requirements Python 3.7 Pytorch 1.9.0 Hydra 1.1.0 (pip install hydra-core

Meta Research 8 Mar 29, 2022
Create large-scale ML-driven multiscale simulation ensembles to study the interactions

MuMMI RAS v0.1 Released: Nov 16, 2021 MuMMI RAS is the application component of the MuMMI framework developed to create large-scale ML-driven multisca

4 Feb 16, 2022
Extreme Learning Machine implementation in Python

Python-ELM v0.3 --- ARCHIVED March 2021 --- This is an implementation of the Extreme Learning Machine [1][2] in Python, based on scikit-learn. From

David C. Lambert 511 Dec 20, 2022
Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )

implementation of machine learning Algorithms such as decision tree and random forest and xgboost on darasets then compare results for each and implement ant colony and genetic algorithms on tsp map,

Mohamadreza Rezaei 1 Jan 19, 2022
Python factor analysis library (PCA, CA, MCA, MFA, FAMD)

Prince is a library for doing factor analysis. This includes a variety of methods including principal component analysis (PCA) and correspondence anal

Max Halford 915 Dec 31, 2022
Time Series Prediction with tf.contrib.timeseries

TensorFlow-Time-Series-Examples Additional examples for TensorFlow Time Series(TFTS). Read a Time Series with TFTS From a Numpy Array: See "test_input

Zhiyuan He 476 Nov 17, 2022
Dive into Machine Learning

Dive into Machine Learning Hi there! You might find this guide helpful if: You know Python or you're learning it 🐍 You're new to Machine Learning You

Michael Floering 11.1k Jan 03, 2023
A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile

matrixprofile-ts matrixprofile-ts is a Python 2 and 3 library for evaluating time series data using the Matrix Profile algorithms developed by the Keo

Target 696 Dec 26, 2022
LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRerank, Seq2Slate.

LibRerank LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRer

126 Dec 28, 2022
Automatic extraction of relevant features from time series:

tsfresh This repository contains the TSFRESH python package. The abbreviation stands for "Time Series Feature extraction based on scalable hypothesis

Blue Yonder GmbH 7k Jan 06, 2023
Pragmatic AI Labs 421 Dec 31, 2022
Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models.

Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow scikit-learn's functionality wit

Soledad Galli 33 Dec 27, 2022
Polyglot Machine Learning example for scraping similar news articles.

Polyglot Machine Learning example for scraping similar news articles In this example, we will see how we can work with Machine Learning applications w

MetaCall 15 Mar 28, 2022
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 35 Jan 02, 2023
This is the material used in my free Persian course: Machine Learning with Python

This is the material used in my free Persian course: Machine Learning with Python

Yara Mohamadi 4 Aug 07, 2022
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022