Deduplication is the task of combining different representations of the same real-world entity.

Last update: Nov 17, 2022

Overview

DedupliPy

Deduplication is the task of combining different representations of the same real-world entity. This package implements deduplication using active learning, which allows for rapid training without a large, manually labelled dataset.

DedupliPy is an end-to-end solution with advantages over existing solutions:

- active learning: no large, manually labelled dataset is required
- during active learning, the user is notified when the model has converged and training may be finished
- works out of the box, while advanced users can choose settings as desired (custom blocking rules, custom metrics, interaction features)
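
To illustrate what blocking rules and similarity metrics do in a deduplication pipeline, here is a minimal standard-library sketch. It is not DedupliPy's implementation or API: the `blocking_key`, `similarity` and `candidate_pairs` helpers and the sample records are hypothetical, and `difflib.SequenceMatcher` stands in for whatever metrics the package actually uses.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records, for illustration only.
records = ["Jon Smith", "John Smith", "Anna de Vries", "Ana de Vries", "Bob Jansen"]


def blocking_key(name: str) -> str:
    """Hypothetical blocking rule: group records by their lowercased first letter."""
    return name[0].lower()


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], computed with difflib from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(items):
    """Yield only the pairs whose members share a blocking key."""
    blocks = {}
    for item in items:
        blocks.setdefault(blocking_key(item), []).append(item)
    for block in blocks.values():
        yield from combinations(block, 2)


# Score each candidate pair; a classifier would later decide which pairs match.
pairs = [(a, b, similarity(a, b)) for a, b in candidate_pairs(records)]
all_pairs = list(combinations(records, 2))
```

With this rule only 2 of the 10 possible pairs are compared. Blocking trades a small risk of missed matches for far fewer comparisons, which is why the choice of rule matters on large datasets.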

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Install directly from PyPI:

pip install deduplipy

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply deduplication to your pandas DataFrame df as follows:

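from deduplipy.deduplicator import Deduplicator
# col_names lists the columns used to compare records
myDedupliPy = Deduplicator(col_names=['name', 'address'])
myDedupliPy.fit(df)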

This starts an interactive learning session in which you indicate whether each presented pair is a match (y) or not (n). Once training has converged, you are notified that it may be finished. Predictions on (new) data are obtained as follows:

result = myDedupliPy.predict(df)
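
The interactive session described above follows the general pattern of active learning: ask about the pair the model is least sure of, update the model, and stop once no informative pairs remain. The sketch below shows that pattern with a deliberately simple model, a single threshold on a similarity score. It is an assumption-laden illustration, not DedupliPy's algorithm: `active_learn_threshold`, the scores and the simulated `oracle` (which replaces the y/n prompt) are all hypothetical, and it assumes matches always score higher than non-matches.

```python
def active_learn_threshold(scores, oracle, max_queries=20):
    """Learn a match threshold on similarity scores by uncertainty sampling.

    scores: similarity score per candidate pair.
    oracle: callable taking a pair index and returning True for a match.
    Returns the learned threshold and the number of labels requested.
    """
    labelled = {}
    lo, hi = 0.0, 1.0  # highest known non-match score, lowest known match score
    for _ in range(max_queries):
        threshold = (lo + hi) / 2
        # Pairs whose label is not yet implied by the answers given so far.
        uncertain = [
            i for i, s in enumerate(scores)
            if i not in labelled and lo < s < hi
        ]
        if not uncertain:
            break  # converged: further labels would not move the threshold
        # Query the pair closest to the current decision boundary.
        query = min(uncertain, key=lambda i: abs(scores[i] - threshold))
        labelled[query] = oracle(query)
        if labelled[query]:
            hi = min(hi, scores[query])
        else:
            lo = max(lo, scores[query])
    return (lo + hi) / 2, len(labelled)


# Hypothetical scores; the simulated user treats scores >= 0.8 as matches.
scores = [0.95, 0.91, 0.88, 0.62, 0.55, 0.40, 0.35, 0.10]
threshold, n_queries = active_learn_threshold(scores, lambda i: scores[i] >= 0.8)
predicted = [s >= threshold for s in scores]
```

In this toy run only three of the eight pairs need a label before the loop stops, which is the benefit active learning is meant to deliver.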
