Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Last update: Aug 19, 2021

Overview

Yase

Yet Another Sequence Encoder - encode sequences to vector of vectors in python !

Why Yase ?

Yase enable you to encode any sequence which can be represented by string to be encoded into a list of word-vector representation.

When searching over a tool to encode a sentence as a list of word-vector, it was clear that there was no simple tool to use. And so, i decided to create Yase.

Note : If you only want to get the word-vector of a word, or average of word-vector in sentence, you should probably better check Spacy.

Requirements

Yase requirements are :

numpy
tqdm

Mapping file

The mapping should be a columnar file like :

<token> <vector value>
token1 0.1 0.6 -1.2
token2 0.6 -2.3 3.4

All data should be separated by space, thus no space is allowed in token. You should be able to directly use Facebook Fast Text pretrained word vector as mapping.

Input file

Input file should be a list of text, with one sample per line.

hello world
Yase is awesome !

The default separator is a space " " but any regular expression can be provided.

Note that Yase is case insensitive

How to use

yase is command line tool. You can install by with :pip install git+https://github.com/PPACI/yase.git

>> yase
usage: yase [-h] --input input.txt [--input-encoding UTF8] --output
               output.txt --mapping mapping.vec [--mapping-encoding UTF8]
               [--separator \ |\.|\,] [--no-replace]
               [--cleaning-json cleaning.json]

Yet Another Sequence Encoder

optional arguments:
  -h, --help            show this help message and exit
  --input input.txt     Path to file to encode
  --input-encoding UTF8
                        encoding of input file. UTF8 by default
  --output output.txt   Path to output file
  --mapping mapping.vec
                        Path to mapping file
  --mapping-encoding UTF8
                        encoding of mapping file. UTF8 by default
  --separator \ |\.|\,  regular expression used to split the input sequence
  --no-replace          don't clean input data
  --cleaning-json cleaning.json
                        Path to your own json replacement file for cleaning.
                        Will use the included replacement file otherwise.

If you wanted to use the english word vector for an input file like previously described :

yase --input "input.txt" --output "output.csv" --mapping "wiki.en.vec"

Output format

The idea behind yase is to be as easy as possible to integrate it in all data science processing.

Yase output it's your data as CSV.

The only problem with CSV is that it's difficult to integrate multi-dimensional array. So we had to find a compromise..

Yase encode the vector columns in JSON format, which is easily readable and is very similar to python array representation.

The output file will be similar to :

inputs	vectors
hello world	[[1,1,1],[2,2,2]]
yase is awesome !	[[3,3,3],[4,4,4]]

Cleaning

Yase will automatically try to clean your input file by applying regex in the right order.

For example : Hello I'm yase.Nice to meet you will magically become Hello I m yase . Nice to meet you.

Remember that yase is case insensitive. So yase will understand as hello i m yase . nice to meet you.

Lastly, if your mapping doesn't include a mapping for ".", you will obtain vectors for hello i m yase nice to meet you

Of course, you can disable this behaviour by providing --no-replace argument.

Providing your own replacement file

You can do this by providing a path to your file with --cleaning-json.

The replacement file is a json like :

{
  "\"": "",
  "'": "",
  ",": " , ",
  "\\.": " . ",
  "  ": " "
}

Input are regex, so remind to escape . or *.

Note that replacement are made in the same order as in the json. So here, the first replacement will be to remove "

How to load a yase output ?

As said previously, the choice made with Yase make it possible to use it as simply as :

import pandas, json

csv = pandas.read_csv("output.csv")
csv.vectors = csv.vectors.apply(json.loads)

csv.head()

Note that Pandas is not mandatory but very recommended for data science.

TODO

Optimize Mapping loading time
Optional argument to output fixed size vectors for all input sequences
Surely lot of thing !

Can i contribute ?

Off course ! If you want to improve Yase, your idea / pull requests / issues are welcomed !

Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Related tags

Overview

Yase

Why Yase ?

Requirements

Mapping file

Input file

How to use

Output format

Cleaning

Providing your own replacement file

How to load a yase output ?

TODO

Can i contribute ?

Owner

Pierre PACI

BERT-based Financial Question Answering System

scikit-learn wrappers for Python fastText.

Practical Machine Learning with Python

Global Rhythm Style Transfer Without Text Transcriptions

Codename generator using WordNet parts of speech database

MRC approach for Aspect-based Sentiment Analysis (ABSA)

Sample data associated with the Aurora-BP study

📔️ Generate a text-based journal from a template file.

中文空间语义理解评测

We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

Rhyme with AI

NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

MiCECo - Misskey Custom Emoji Counter

Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

Python api wrapper for JellyFish Lights