One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.


One Stop Anomaly Shop (OSAS)

Quick start guide

Step 1: Get/build the docker image

Option 1: Use precompiled image (might not reflect latest changes):

docker pull tiberiu44/osas:latest
docker image tag tiberiu44/osas:latest osas:latest

Option 2: Build the image locally

git clone https://github.com/adobe/OSAS.git
cd OSAS
docker build . -f docker/osas-elastic/Dockerfile -t osas:latest

Step 2: After building the docker image you can start OSAS by typing:

docker run -p 8888:8888/tcp -p 5601:5601/tcp -v <ABSOLUTE PATH TO DATA FOLDER>:/app osas

IMPORTANT NOTE: Modify the above command by replacing <ABSOLUTE PATH TO DATA FOLDER> with the absolute path to your data folder.

After OSAS has started (it might take 1-2 minutes) you can use your browser to access some standard endpoints:

For debugging (in case you need it), start the container with an interactive shell:

docker run -p 8888:8888/tcp -p 5601:5601/tcp -v <ABSOLUTE PATH TO DATA FOLDER>:/app -ti osas /bin/bash

Building the test pipeline

This guide will take you through all the necessary steps to configure, train and run your own pipeline on your own dataset.

Prerequisite: Add your own CSV dataset to your data folder (the one provided in the docker run command).

Once you have started your docker image, use the OSAS console to gain CLI access to all the tools.

In what follows, we assume that your dataset is called dataset.csv. Please update the commands as necessary in case you use a different name/location.

Be sure you are running scripts in the root folder of OSAS:

cd /osas

Step 1: Build a custom pipeline configuration file. This can be done fully manually or by bootstrapping with our configuration autogenerator script:

python3 osas/main/autoconfig.py --input-file=/app/dataset.csv --output-file=/app/dataset.conf

The above command will generate a custom configuration file for your dataset. It will try to guess field types and useful combinations of fields. You can edit the generated file (which should be available in the shared data folder) using your favourite editor.

Standard templates for label generator types are:

[LG_MULTINOMIAL]
generator_type = MultinomialField
field_name = <FIELD_NAME>
absolute_threshold = 10
relative_threshold = 0.1

[LG_TEXT]
generator_type = TextField
field_name = <FIELD_NAME>
lm_mode = char
ngram_range = (3, 5)

[LG_NUMERIC]
generator_type = NumericField
field_name = <FIELD_NAME>

[LG_MULTINOMIAL_COMBINER]
generator_type = MultinomialFieldCombiner
field_names = ['<FIELD_1>', '<FIELD_2>', ...]
absolute_threshold = 10
relative_threshold = 0.1

[LG_KEYWORD]
generator_type = KeywordBased
field_name = <FIELD_NAME>
keyword_list = ['<KEYWORD_1>', '<KEYWORD_2>', '<KEYWORD_3>', ...]

[LG_REGEX]
generator_type = KnowledgeBased
field_name = <FIELD_NAME>
rules_and_labels_tuple_list = [('<REGEX_1>','<LABEL_1>'), ('<REGEX_2>','<LABEL_2>'), ...]

You can use the above templates to add as many label generators as you want. Just make sure that the header IDs (the names in square brackets) are unique within the configuration file.
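The templates use INI-style syntax, so a configuration can be sanity-checked before training. The sketch below assumes the file can be read with Python's standard configparser module; this is an assumption about the file format, not a description of how OSAS itself loads it. The field names (user, command), the keywords and the helper load_generators are hypothetical.

```python
import ast
import configparser

SAMPLE_CONF = """
[LG_MULTINOMIAL]
generator_type = MultinomialField
field_name = user
absolute_threshold = 10
relative_threshold = 0.1

[LG_TEXT]
generator_type = TextField
field_name = command
lm_mode = char
ngram_range = (3, 5)

[LG_KEYWORD]
generator_type = KeywordBased
field_name = command
keyword_list = ['wget', 'curl']
"""


def load_generators(text):
    """Return {section: {option: parsed value}} for every section."""
    # interpolation=None keeps '%' characters in regex rules untouched.
    # The default strict mode raises DuplicateSectionError on repeated headers.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    generators = {}
    for section in parser.sections():
        options = {}
        for key, raw in parser[section].items():
            try:
                options[key] = ast.literal_eval(raw)  # numbers, tuples, lists
            except (ValueError, SyntaxError):
                options[key] = raw  # plain strings such as field names
        generators[section] = options
    return generators


generators = load_generators(SAMPLE_CONF)
for name, options in generators.items():
    print(name, options["generator_type"], options["field_name"])
```

With strict parsing, a repeated header such as a second [LG_TEXT] section fails immediately with configparser.DuplicateSectionError, which is a cheap way to catch non-unique header IDs.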

Step 2: Train the pipeline

python3 osas/main/train_pipeline.py --conf-file=/app/dataset.conf --input-file=/app/dataset.csv --model-file=/app/dataset.json

The above command will generate a pretrained pipeline using the previously created configuration file and the dataset.

Step 3: Run the pipeline on a dataset

python3 osas/main/run_pipeline.py --conf-file=/app/dataset.conf --model-file=/app/dataset.json --input-file=/app/dataset.csv --output-file=/app/dataset-out.csv

The above command will run the pretrained pipeline on any compatible dataset. In the example we run the pipeline on the training data, but you can use previously unseen data. It will generate an output file with labels and anomaly scores, and it will also import your data into Elasticsearch/Kibana. To view the results, use the web interface.

Pipeline explained

The pipeline sequentially applies all label generators on the raw data, collects the labels and uses an anomaly scoring algorithm to generate anomaly scores. There are two main component classes: LabelGenerator and ScoringAlgorithm.
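The flow described above can be sketched in a few lines. This is a toy illustration only: the class names LabelGenerator and ScoringAlgorithm come from the text, but the call interface, the two example generators, the label strings and the label-counting scorer are all hypothetical and are not the OSAS API.

```python
class LabelGenerator:
    """Hypothetical interface: turns one raw row into a list of labels."""

    def __call__(self, row):
        raise NotImplementedError


class ScoringAlgorithm:
    """Hypothetical interface: turns a list of labels into a score."""

    def __call__(self, labels):
        raise NotImplementedError


class LongCommand(LabelGenerator):
    def __call__(self, row):
        return ["CMD_LONG"] if len(row["command"]) > 20 else []


class RootUser(LabelGenerator):
    def __call__(self, row):
        return ["USER_ROOT"] if row["user"] == "root" else []


class LabelCount(ScoringAlgorithm):
    def __call__(self, labels):
        return float(len(labels))  # toy score, not an OSAS algorithm


def run_pipeline(rows, generators, scorer):
    """Apply every generator in order, collect the labels, then score."""
    results = []
    for row in rows:
        labels = [label for gen in generators for label in gen(row)]
        results.append({**row, "labels": labels, "score": scorer(labels)})
    return results


rows = [
    {"user": "alice", "command": "ls"},
    {"user": "root", "command": "curl http://example.com/x | sh"},
]
out = run_pipeline(rows, [LongCommand(), RootUser()], LabelCount())
for item in out:
    print(item["user"], item["labels"], item["score"])
```

Keeping label generation separate from scoring means a new rule or statistic only has to emit labels; the scoring step does not need to know where they came from.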

Label generators

NumericField

  • This type of LabelGenerator handles numerical fields. It computes the mean and standard deviation (sigma) of the training data and generates labels according to the distance d between the current value and the mean (d<=sigma NORMAL, sigma<d<=2sigma BORDERLINE, 2sigma<d OUTLIER)

Params:

  • field_name: what field to look for in the data object
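The labelling rule can be sketched as follows. This is a minimal illustration, not the OSAS implementation: it assumes the population standard deviation and plain label strings, whereas the real generator may compute sigma differently or decorate labels with the field name.

```python
import statistics


def fit(values):
    """Learn mean and standard deviation from the training values."""
    return statistics.mean(values), statistics.pstdev(values)


def label(value, mean, sigma):
    """Map the distance from the mean to one of three labels."""
    distance = abs(value - mean)
    if distance <= sigma:
        return "NORMAL"
    if distance <= 2 * sigma:
        return "BORDERLINE"
    return "OUTLIER"


mean, sigma = fit([10, 12, 11, 9, 10, 11, 12, 9])
labels = [label(v, mean, sigma) for v in (10, 12.5, 20)]
print(labels)
```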

TextField

  • This type of LabelGenerator handles text fields. It builds an n-gram based language model and computes the perplexity of newly observed data. It also holds statistics over the training data (mean and stdev). (perplexity<=sigma NORMAL, sigma<perplexity<=2sigma BORDERLINE, 2sigma<perplexity OUTLIER)

Params:

  • field_name: What field to look for
  • lm_mode: Type of LM to build: char or token
  • ngram_range: N-gram range to use for computation
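To make the perplexity idea concrete, here is a minimal character-level sketch. It assumes a single n-gram order (n=3) with add-one smoothing, which is a simplification chosen for illustration; how OSAS combines the orders in ngram_range or smooths its counts is not stated here. The function names and sample strings are hypothetical.

```python
import math
from collections import Counter


def train_char_lm(texts, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for text in texts:
        padded = " " * (n - 1) + text  # pad so the first character has a context
        vocab.update(padded)
        for i in range(len(text)):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    # +1 reserves probability mass for characters never seen in training
    return ngrams, contexts, len(vocab) + 1


def perplexity(text, model, n=3):
    """Perplexity of text under the add-one smoothed n-gram model."""
    ngrams, contexts, vocab_size = model
    padded = " " * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        prob = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / max(len(text), 1))


train = ["ls -la /home", "ls -l /tmp", "cd /home/user", "ls /var/log"]
model = train_char_lm(train)
seen = perplexity("ls -la /tmp", model)
odd = perplexity("xQ9#zz!!kW", model)
print(seen < odd)
```

Text that resembles the training data yields a lower perplexity than text made of unseen character sequences; the resulting values can then be compared against the training statistics to pick a label.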

MultinomialField

  • This type of LabelGenerator handles fields with discrete value sets. It computes the probability of seeing a specific value and alerts based on relative and absolute thresholds.

Params:

  • field_name: What field to use
  • absolute_threshold: Minimum absolute value for occurrences to trigger alert for
  • relative_threshold: Minimum relative value for occurrences to trigger alert for
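The sketch below shows one plausible reading of the two thresholds: a value is flagged when it was seen fewer than absolute_threshold times, or when its relative frequency falls below relative_threshold. The exact semantics in OSAS (for example, whether relative_threshold is a fraction or a percentage, or how the two conditions are combined) are not spelled out above, so treat this rule, the function names and the sample data as assumptions.

```python
from collections import Counter


def fit_counts(values):
    """Count how often each discrete value occurs in the training data."""
    return Counter(values), len(values)


def is_rare(value, counts, total, absolute_threshold=10, relative_threshold=0.1):
    """Flag a value seen too few times, in absolute or relative terms."""
    count = counts[value]  # Counter returns 0 for unseen values
    probability = count / total if total else 0.0
    return count < absolute_threshold or probability < relative_threshold


counts, total = fit_counts(["alice"] * 60 + ["bob"] * 35 + ["mallory"] * 5)
flags = {user: is_rare(user, counts, total) for user in ("alice", "mallory", "eve")}
print(flags)
```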

MultinomialFieldCombiner

  • This type of LabelGenerator handles fields with discrete value sets and builds advanced features by combining values across the same dataset entry. It computes the probability of seeing a specific combination of values and alerts based on relative and absolute thresholds.

Params:

  • field_names: What fields to combine
  • absolute_threshold: Minimum absolute value for occurrences to trigger alert for
  • relative_threshold: Minimum relative value for occurrences to trigger alert for
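Combining fields matters because two values can each be common while their pairing is rare. The following sketch illustrates this by treating the tuple of field values as a single composite value; the tuple representation, the helper name combine and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def combine(row, field_names):
    """Build one composite value from several fields of the same row."""
    return tuple(row[name] for name in field_names)


rows = [
    {"user": "alice", "host": "web-1"},
    {"user": "alice", "host": "web-1"},
    {"user": "bob", "host": "db-1"},
    {"user": "alice", "host": "db-1"},
]
pair_counts = Counter(combine(row, ["user", "host"]) for row in rows)
pair_probability = {pair: n / len(rows) for pair, n in pair_counts.items()}
print(pair_probability)
```

Here alice and db-1 each appear often on their own, but the pair (alice, db-1) occurs only once, which is the kind of signal a per-field count cannot capture.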

KeywordBased

  • This is a rule-based label generator. It applies a simple tokenization procedure to the input text, dropping special characters and numbers and splitting on white-space. It then looks for a specific set of keywords and generates labels accordingly.

Params:

  • field_name: What field to use
  • keyword_list: The list of keywords to look for
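A minimal sketch of the tokenize-then-match procedure follows. It assumes that special characters and digits are replaced by spaces and that matching is case-insensitive; the label format (KEYWORD_<NAME>) and the function names are hypothetical and may differ from what OSAS emits.

```python
import re


def tokenize(text):
    """Drop digits and special characters, then split on white-space."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    return cleaned.lower().split()


def keyword_labels(text, keyword_list):
    """Return one label per configured keyword found among the tokens."""
    tokens = set(tokenize(text))
    return ["KEYWORD_" + kw.upper() for kw in keyword_list if kw.lower() in tokens]


command = "wget http://10.0.0.5/a.sh && chmod 777 a.sh"
labels = keyword_labels(command, ["wget", "chmod", "nc"])
print(tokenize(command))
print(labels)
```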

OSAS has four unsupervised anomaly detection algorithms:

  • IFAnomaly: n-hot encoding, singular value decomposition, isolation forest (IF)

  • LOFAnomaly: n-hot encoding, singular value decomposition, local outlier factor (LOF)

  • SVDAnomaly: n-hot encoding, singular value decomposition, inverted transform, input reconstruction error

  • StatisticalNGramAnomaly: compute label n-gram probabilities, compute anomaly score as a sum of negative log likelihood
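As an illustration of the last approach, the sketch below scores an event by summing the negative log-likelihood of its label combinations. It is a simplified reading of the description above, not the OSAS implementation: here a "label n-gram" is taken to be any combination of up to max_n labels from the same event, and probabilities use add-one smoothing. Both choices, along with the function names and label strings, are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def label_ngrams(labels, max_n=2):
    """All sorted label combinations of size 1..max_n for one event."""
    ordered = sorted(set(labels))
    return [c for n in range(1, max_n + 1) for c in combinations(ordered, n)]


def fit(events, max_n=2):
    """Count in how many training events each label combination appears."""
    counts = Counter()
    for labels in events:
        counts.update(label_ngrams(labels, max_n))
    return counts, len(events)


def score(labels, model, max_n=2):
    """Sum of negative log-likelihoods; higher means more anomalous."""
    counts, total = model
    result = 0.0
    for gram in label_ngrams(labels, max_n):
        prob = (counts[gram] + 1) / (total + 2)  # add-one smoothing
        result += -math.log(prob)
    return result


train = [["USER_NORMAL", "CMD_NORMAL"]] * 98 + [["USER_RARE", "CMD_OUTLIER"]] * 2
model = fit(train)
common = score(["USER_NORMAL", "CMD_NORMAL"], model)
unusual = score(["USER_RARE", "CMD_OUTLIER"], model)
print(common < unusual)
```

Events whose labels and label pairs were frequent in training receive a score near zero, while rare combinations accumulate a large score.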

Owner
Adobe, Inc.