Improving Representations via Similarities

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

The object to implement:

Embetter(multi_output=True, epochs=50, **sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there is an opportunity regarding NaN y-values: we can simply choose which values to translate and which to ignore.
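The NaN observation could be sketched roughly as follows. This is only an illustration, assuming `y` is a 2D array with one column per output; the helper names `observed_label_mask` and `observed_pairs` are hypothetical and not part of embetter.

```python
import numpy as np


def observed_label_mask(y):
    """Boolean mask that is True wherever a label is present (not NaN)."""
    y = np.asarray(y, dtype=float)
    return ~np.isnan(y)


def observed_pairs(y):
    """Return (row, output, label) for every non-NaN entry of a 2D label array.

    Hypothetical helper: NaN entries are ignored instead of being translated.
    """
    y = np.asarray(y, dtype=float)
    rows, cols = np.nonzero(observed_label_mask(y))
    return [(int(r), int(c), float(y[r, c])) for r, c in zip(rows, cols)]


# Three rows, two outputs; NaN marks "no label for this output".
y_example = np.array([[1.0, np.nan], [np.nan, 0.0], [1.0, 1.0]])
pairs_example = observed_pairs(y_example)
```

With this layout a row can contribute to one output and be skipped for another, which is the flexibility the observation points at.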

Comments
  • [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    "every keyword argument accepted by `__init__` should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection."

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    from embetter.text import SentenceEncoder

    se = SentenceEncoder()
    repr(se)  # raises the AttributeError shown above
    

    Fix:

    Add self.device on SentenceEncoder

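    import torch
    from sentence_transformers import SentenceTransformer as SBERT
    from embetter.base import EmbetterBase

    class SentenceEncoder(EmbetterBase):
        # ... (rest of the class elided in the original issue)
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            # Fall back to the GPU when one is available, otherwise the CPU.
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            # Store every __init__ argument as an attribute so scikit-learn can find it.
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)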
    
    opened by CarloLepelaars 4
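To see why the missing attribute breaks `repr`, here is a standard-library-only sketch. It is a simplified imitation of the parameter introspection scikit-learn performs, not its actual implementation; `get_params_sketch`, `BrokenEncoder`, `FixedEncoder` and `introspection_error` are hypothetical names used only for illustration.

```python
import inspect


def get_params_sketch(estimator):
    """Read each __init__ argument back from an attribute of the same name."""
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # `device` is accepted but never stored


class FixedEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device  # every argument has a matching attribute


def introspection_error(estimator):
    """Return the AttributeError message, or None if introspection works."""
    try:
        get_params_sketch(estimator)
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `introspection_error(BrokenEncoder())` returns a message that names `device`, mirroring the error in the report, while `FixedEncoder` passes cleanly.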
  • Color Histograms - Additional Tricks

    This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

    opened by koaning 4
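For orientation, a color-histogram embedding might look like the sketch below. It is not taken from either linked article; it assumes an image given as a (height, width, channels) array of values in [0, 255], and `color_histogram` is a hypothetical helper name.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Concatenate one normalized histogram per color channel."""
    image = np.asarray(image)
    features = []
    for channel in range(image.shape[-1]):
        # Count pixel intensities of this channel in `bins` equal-width bins.
        counts, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        # Normalize so images of different sizes stay comparable.
        features.append(counts / counts.sum())
    return np.concatenate(features)


# A 4x4 image whose first channel is saturated and whose other channels are zero.
image_example = np.zeros((4, 4, 3), dtype=np.uint8)
image_example[..., 0] = 255
hist_example = color_histogram(image_example, bins=4)
```

The result has `bins * channels` entries, so the embedding size is independent of the image resolution.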
  • Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, GloVe, etc.)? The embedder would need:

    • A filename pointing to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
    • A pooling function, or the name of one (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can open a PR sometime next week.

    Stéphan

    opened by stephantul 3
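The three ingredients listed in this proposal could fit together roughly as sketched here. This is an illustration under assumptions, not the proposed implementation: it assumes the plain-text format of one word followed by its values per line, lowercases the input, and reuses the default `token_pattern` of scikit-learn's TfidfVectorizer. The names `parse_embedding_lines`, `embed_text` and `POOLERS` are hypothetical.

```python
import re

# Pooling functions applied per embedding dimension.
POOLERS = {
    "mean": lambda column: sum(column) / len(column),
    "max": max,
    "sum": sum,
}


def parse_embedding_lines(lines):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        vectors[word] = [float(value) for value in values]
    return vectors


def embed_text(text, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Tokenize with a regex, look up known words, and pool their vectors."""
    tokens = re.findall(token_pattern, text.lower())
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # No known word: fall back to a zero vector of the right size.
        dim = len(next(iter(vectors.values())))
        return [0.0] * dim
    pool = POOLERS[pooling]
    return [pool(column) for column in zip(*found)]


vectors_example = parse_embedding_lines(["good 1.0 2.0", "bad 3.0 -2.0"])
```

In practice the lines would come from iterating over the opened embedding file; out-of-vocabulary words are simply skipped in this sketch.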
  • [FEATURE] SpaCyEmbedder

    I think it would be nice to add an embedder that can easily vectorize text through spaCy. I already have an implementation class for this and would be happy to contribute it here.

    spaCy docs on Doc.vector: https://spacy.io/api/doc#vector

    Example code for single string:

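    import spacy

    # Assumes the model was installed with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector  # document vector as a NumPy array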
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

    opened by CarloLepelaars 1
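A plain method without a Mixin could be as small as the sketch below. The class `FeatureNamesSketch` and the `embedding_<i>` naming scheme are assumptions made for illustration; scikit-learn transformers return an array of strings here, while this sketch returns a list to stay dependency-free.

```python
class FeatureNamesSketch:
    """Hypothetical stand-in for an encoder with a fixed output width."""

    def __init__(self, n_features, prefix="embedding"):
        self.n_features = n_features
        self.prefix = prefix

    def get_feature_names_out(self, input_features=None):
        # `input_features` is accepted for scikit-learn API compatibility but
        # ignored, since embedding dimensions have no input-column names.
        return [f"{self.prefix}_{i}" for i in range(self.n_features)]


names_example = FeatureNamesSketch(3).get_feature_names_out()
```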
  • Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

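    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from embetter.grab import ColumnGrabber
    from embetter.text import SentenceEncoder

    text_emb_pipeline = make_pipeline(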
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )
    
    # This pipeline can also be trained to make predictions, using
    # the embedded features. 
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )
    
    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })
    
    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases (0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].