Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Using multiple API sources, create an app that allows users to filter through random locations based on their temperature range choices.

World_weather_analysis Overview Using multiple API sources, create an app that allows users to filter through random locations based on their temperat

Jason Boyer 2 Sep 16, 2022
A simple bot discord in PY with moderation controls

Voila un bot discord en py avec les commandes simples de modération tout simplement faut changer les lignes 70 vous mettez votre token de votre bot 53

Ethan 1 Nov 20, 2021
Change Discord HypeSquad in few seconds!

a simple python script that change your hypesquad to what house you choose

Ho3ein 5 Nov 16, 2022
ACL 2022: CAKE: A Scalable Commonsense-Aware Framework For Multi-View Knowledge Graph Completion

CAKE ACL 2022: CAKE: A Scalable Commonsense-Aware Framework For Multi-View Knowledge Graph Completion Introduction This is the PyTorch implementation

Niu Guanglin 31 Dec 07, 2022
Botto - A discord bot written in python that uses the hikari and lightbulb modules to make this bot

❓ About Botto Hi! This is botto, a discord bot written in python that uses the h

3 Sep 13, 2022
Andrei 1.4k Dec 24, 2022
A Discord bot to play bluffing games like Dobbins or Bobbins

Usage: pip install -r requirements.txt python3 bot.py DISCORD_BOT_TOKEN Gameplay: All commands are case-insensitive, with trailing punctuation and spa

4 May 27, 2022
Using Streamlit to build a simple UI on top of the OpenSea API

OpenSea API Explorer Using Streamlit to build a simple UI on top of the OpenSea API. 🤝 Contributing Contributions, issues and feature requests are we

Gavin Capriola 1 Jan 04, 2022
Discord Token Generator based on HTTPX, makes unverified tokens and automatically joins your server! this is used for memberboosting

Discord Token Generator | 2021 Features: (1) hCaptcha Bypasser, latest hfuck.py Updated by me (2) Free Proxy Support/Scrapper (3) Custom Realistic Dat

2 Nov 30, 2021
A pixeldrain python package using pixeldrain official api

Made with Python3 (C) @FayasNoushad Copyright permission under MIT License License - https://github.com/FayasNoushad/Pixeldrain/blob/main/LICENSE In

Fayas Noushad 6 Jan 26, 2022
Dumps to CSV all the resources in an organization's member accounts

AWS Org Inventory Dumps to CSV all the resources in an organization's member accounts. Set your environment's AWS_PROFILE and AWS_DEFAULT_REGION varia

Iain Samuel McLean Elder 2 Dec 24, 2021
A telegram bot to monitor the latest NFT price on BSC.

NFT_Monitor This is a telegram bot for monitoring price and ranking of NFT on Binance Smart Chain. Can fetch latest ranking and price in real time. .P

Niko Pang 10 Oct 09, 2022
❄️ Don't waste your money paying for new tokens, once you have used your tokens, clean them up and resell them!

TokenCleaner Don't waste your money paying for new tokens, once you have used your tokens, clean them up and resell them! If you have a very large qua

0xVichy 59 Nov 14, 2022
Pagination for your discord.py bot using the discord_components library!

Paginator - discord_components This repository is just an example code for how to carry out pagination using the discord_components library for python

Skull Crusher 9 Jan 31, 2022
Light weight Scripts and Apps for checking availability of Covid Vaccines in India. Notifies when vaccine becomes avialable in your area.

vaccine-checker Light weight Scripts and Apps for checking availability of Covid Vaccines in India. Notifies when vaccine becomes avialable in your ar

Abishek V Ashok 8 Jun 16, 2021
Webservice that notifies users on Slack when a change in GitLab concern them.

Gitlab Slack Notifier Webservice that notifies users on Slack when a change in GitLab concern them. Setup Slack Create a Slack app, go to "OAuth & Per

Heuritech 2 Nov 04, 2021
Discord Bot Sending Members - Leaked by BambiKu ( Me )

Wokify Bot Discord Bot Sending Members - Leaked by BambiKu ( Me ) Info The Bot was orginaly made by someone else! Ghost-Dev just wanted to sell "priva

bambiku 6 Jul 05, 2022
⚡ A really fast and powerful Discord Token Checker

discord-token-checker ⚡ A really fast and powerful Discord Token Checker How To Use? Do pip install -r requirements.txt in your command prompt Make to

vida 25 Feb 26, 2022
Discord.py-Bot-Template - Discord Bot Template with Python 3.x

Discord Bot Template with Python 3.x This is a template for creating a custom Di

Keagan Landfried 3 Jul 17, 2022
Telegram bot for searching videos in your PDisk account by @AbirHasan2005

PDisk-Videos-Search A Telegram bot for searching videos in your PDisk account by @AbirHasan2005. Configs API_ID - Get from @TeleORG_Bot API_HASH - Get

Abir Hasan 39 Oct 21, 2022