Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
A Telegram Bot Written In Python

TelegraphUploader A Telegram Bot Written In Python DEPLOY Local Machine Clone the repository Install requirements: pip3 install -r requirements.txt e

Wahyusaputra 2 Dec 29, 2021
A simple Telegram bot that converts a phone number to a direct whatsapp chat link

Open in WhatsApp I was using a great app to open a whatsapp chat with a given number directly without saving that number in my contact list, but I fel

Pathfinder 19 Dec 24, 2022
This is Instagram reposter that repost TikTok videos.

from-tiktok-to-instagram-reposter This script reposts videos from Tik Tok to your Instagram account. You must enter the username and password and slee

Mohammed 19 Dec 01, 2022
BeeDrive: Open Source Privacy File Transfering System for Teams and Individual Developers

BeeDrive For privacy and convenience purposes, more and more people try to keep data on their own hardwires instead of third-party cloud services such

Xuansheng Wu 8 Oct 31, 2022
PyFacebook

== PyFacebook == PyFacebook is a Python client library for the Facebook API. Samuel Cormier-Iijima ( Samuel Cormier-Iijima 573 Dec 20, 2022

Python-random-quote - A file-based quote bot written in Python

Let's Write a Python Quote Bot! This repository will get you started with building a quote bot in Python. It's meant to be used along with the Learnin

amir mohammad fateh 1 Jan 02, 2022
Dns-Client-Server - Dns Client Server For Python

Dns-client-server DNS Server: supporting all types of queries and replies. Shoul

Nishant Badgujar 1 Feb 15, 2022
Discord-disnake - This package allows to use disnake without changing the discord namespace

This package is a shim This module allows to use disnake using discord namespace. This is not an independent library. Installing Python 3.8 or higher

5 Dec 13, 2022
Automatically changes your discord status

Automatically changes your discord status, Be careful as this may get you rate limited and banned

octo 5 Sep 20, 2022
Automate TikTok follower bot, like bot, share bot, view bot and more using selenium

Zefoy TikTok Automator Automate TikTok follower bot, like bot, share bot, view bot and more using selenium. Click here to report bugs. Usage Download

555 Dec 30, 2022
A python library for anti-captcha.com

AntiCaptcha A python library for anti-captcha.com Documentation for the API Requirements git Install git clone https://github.com/ShayBox/AntiCaptcha.

Shayne Hartford 3 Dec 16, 2022
Code release for Transferable Curriculum for Weakly-Supervised Domain Adaptation (AAAI2019)

TCL Code release for Transferable Curriculum for Weakly-Supervised Domain Adaptation (AAAI2019) Dataset Office-31 dataset, with 0.4 label noise Requir

THUML @ Tsinghua University 17 Jul 07, 2022
This Wrapper is a Discum Copy With Addons, original one is made by Merubokkusu

Remaded Discum Its not Official Discum Wrapper ! This Wrapper is a Discum Copy With Addons, original one is made by Merubokkusu Authors @merubokkusu (

discum-remaded 8 Aug 09, 2022
Quickly edit your slack posts.

Lightning Edit Quickly edit your Slack posts. Heavily inspired by @KhushrajRathod's LightningDelete. Usage: Note: Before anything, be sure to head ove

Cole Wilson 14 Nov 19, 2021
Thread-safe Python RabbitMQ Client & Management library

AMQPStorm Thread-safe Python RabbitMQ Client & Management library. Introduction AMQPStorm is a library designed to be consistent, stable and thread-sa

Erik Olof Gunnar Andersson 167 Nov 20, 2022
A self hosted slack bot to conduct standups & generate reports.

StandupMonkey A self hosted slack bot to conduct standups & generate reports. Report Bug · Request Feature Installation Install already hosted bot (Us

Muhammad Haseeb 69 Jan 01, 2023
🤖 Automated follow/unfollow bot for GitHub. Uses GitHub API. Written in python.

GitHub Follow Bot Table of Contents Disclaimer How to Use Install requirements Authenticate Get a GitHub Personal Access Token Add your GitHub usernam

João Correia 37 Dec 27, 2022
Lumi-Bot - Discord bot that fetches cryptocurrency prices utilizing CoinGeko API

Lumi-Bot Discord bot that fetches and monitors cryptocurrency prices utilizing C

Diego Castro 2 Oct 08, 2022
Change between dark/light mode depending on the ambient light intensity

svart Change between dark/light mode depending on the ambient light intensity Installation Install using pip $ python3 -m pip install --user svart Ins

Siddharth Dushantha 169 Nov 26, 2022
A tiktok autoclaimer/sniper used to get og/rare usernames on tiktok.com

TikTok Autoclaimer A tiktok autoclaimer/sniper used to get og/rare usernames on tiktok.com Report Bug · Request Feature Features Asynchronous User fri

dropout 24 Dec 08, 2022