Improving Representations via Similarities

Last update: Jan 08, 2023

Related tags

Miscellaneous embetter

Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there's an opportunity with regard to NaN y-values. We can simply choose which values to translate and which to ignore.
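A minimal sketch of that idea, assuming y is a 2-D float array with one column per output and NaN marking a missing label. The helper name drop_missing_labels is hypothetical and is not part of embetter; it only shows how rows could be kept or ignored per output column.

```python
import numpy as np


def drop_missing_labels(X, y):
    """Yield (output_index, X_known, y_known) for each output column of y.

    Rows whose label is NaN in a given column are skipped for that column
    only. Hypothetical helper for illustration, not an embetter function.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    for j in range(y.shape[1]):
        known = ~np.isnan(y[:, j])  # True where a label is present
        yield j, X[known], y[known, j]


X = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
y = np.array([[1.0, np.nan],
              [0.0, 1.0],
              [np.nan, 0.0]])

subsets = list(drop_missing_labels(X, y))
for j, X_known, y_known in subsets:
    print(j, X_known.shape, y_known)
```

Each output column then trains only on the rows where its label is known, so a partially labelled row still contributes to the outputs it does have.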

Comments

[WIP] Feature/progress bar
Fixes issue #20

[x] Adds progress bar to all text and image embedders.

[x] Tests for SentenceEncoder.

[ ] Use perfplot for progress bar?

[ ] Can we ensure fast NumPy vectorization while using a progress bar?
opened by CarloLepelaars 5
[BUG] `device` should be attribute on `SentenceEncoder`
The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.
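To illustrate why a missing attribute surfaces as soon as the estimator is printed, here is a simplified, standard-library-only imitation of the parameter lookup. It is a sketch, not scikit-learn's actual code: the get_params function and the BrokenEncoder and FixedEncoder classes are hypothetical stand-ins.

```python
import inspect


def get_params(estimator):
    """Simplified imitation of scikit-learn's parameter lookup:
    read the __init__ signature, then fetch an attribute of the same name."""
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # `device` is accepted but never stored


class FixedEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device  # stored, so the lookup succeeds


try:
    get_params(BrokenEncoder())
    error = None
except AttributeError as exc:
    error = str(exc)

params = get_params(FixedEncoder(device="cpu"))
print(error)
print(params)
```

Because the lookup is driven by the __init__ signature, any argument that is not stored under its own name fails at this step, which is also the step a Pipeline representation relies on.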

Reproduction: Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    .
    .
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
opened by CarloLepelaars 4
Color Histograms - Additional Tricks

This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

opened by koaning 4
Support for word embeddings
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

A filename to a local embedding file (e.g., glove.6b.100d.txt)

Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).

A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.

Stéphan
opened by stephantul 3
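The three parameters proposed in the issue above (an embedding file, a token pattern, and a pooling function) could fit together roughly as follows. This is only a sketch under stated assumptions: load_vectors and embed are hypothetical names, the "file" is a small in-memory list in a GloVe-style text layout, and the default pattern mirrors the default token_pattern of scikit-learn's TfidfVectorizer.

```python
import re

import numpy as np

# Pooling functions selectable by name, as suggested in the issue.
POOLERS = {"mean": np.mean, "max": np.max, "sum": np.sum}


def load_vectors(lines):
    """Parse GloVe-style lines: a token followed by its float components."""
    vectors = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        vectors[token] = np.array([float(v) for v in values])
    return vectors


def embed(texts, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Pool the vectors of known tokens per text.

    Texts without any known token map to a zero vector.
    """
    dim = len(next(iter(vectors.values())))
    pool = POOLERS[pooling]
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        tokens = re.findall(token_pattern, text.lower())
        known = [vectors[t] for t in tokens if t in vectors]
        if known:
            out[i] = pool(np.stack(known), axis=0)
    return out


# Toy in-memory stand-in for a local embedding file.
toy_file = ["good 1.0 0.0", "bad 0.0 1.0", "movie 0.5 0.5"]
vectors = load_vectors(toy_file)
texts = ["Good movie", "bad bad movie", "unknown"]
X_mean = embed(texts, vectors, pooling="mean")
X_sum = embed(texts, vectors, pooling="sum")
print(X_mean)
```

Mapping texts with no known tokens to zeros is one possible design choice; raising an error or returning NaN would be equally defensible.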
[FEATURE] SpaCyEmbedder
I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

SpaCy Docs on vector: https://spacy.io/api/doc#vector

Example code for single string:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
opened by CarloLepelaars 2
`get_feature_names_out` for encoders

I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

opened by CarloLepelaars 1
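As a rough sketch of what such a method could look like when added directly to a class (without a mixin): ToyEncoder below is a hypothetical stand-in, not an embetter class, and the naming scheme (lowercased class name plus dimension index) is an assumption borrowed from the convention scikit-learn uses for generated feature names.

```python
import numpy as np


class ToyEncoder:
    """Hypothetical stand-in for an embedder with a fixed output width."""

    def __init__(self, n_dims=4):
        self.n_dims = n_dims

    def transform(self, X):
        # Placeholder embedding: one zero vector per input item.
        return np.zeros((len(X), self.n_dims))

    def get_feature_names_out(self, input_features=None):
        # One name per embedding dimension, prefixed with the class name.
        prefix = type(self).__name__.lower()
        return np.array(
            [f"{prefix}{i}" for i in range(self.n_dims)], dtype=object
        )


enc = ToyEncoder(n_dims=3)
names = enc.get_feature_names_out()
print(names)
```

A real encoder would need to know its output width before transform is called, which may mean reading it from the underlying model.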
Remove the classification layer in timm models

I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

opened by kacperlukawski 1
xception mobilenet

https://keras.io/api/applications/

https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2 https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

opened by koaning 0

'SentenceEncoder' object has no attribute 'device'

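import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

text_emb_pipeline = make_pipeline(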
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])

This code gives this error: 'SentenceEncoder' object has no attribute 'device'

opened by nicholas-dinicola 6

Releases(0.2.2)

0.2.2(Dec 20, 2022)

Adds GPU support for Sentence Encoders.
0.2.1(Dec 5, 2022)

Fixed some error messages related to installing extra dependencies.
0.2.0(Oct 10, 2022)

Fixes a bug related to the Timm vision models.
0.1.0(Sep 19, 2022)

The first original release. Should have enough components to be interesting.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
