Semi-automated vocabulary generation from semantic vector models

Overview

vec2word

Semi-automated vocabulary generation from semantic vector models

This script generates a list of potential conlang word forms along with associated possible glosses based on a word-shape template and a word2vec-style semantic vector model. The process works something like this:

  1. Acquire a word2vec-style semantic vector model (either word2vec binary format or text format).
  2. Define a word-shape template.
  3. Use Principle Component Analysis to project the vector model down to the same number of dimensions as you have slots in your template.
  4. Match the new model dimensions to slots based on how many phonemes can go in a slot vs. the variance in a given dimension (large phoneme range pairs with large variance), and then discretize those dimensions into the appropriate number of buckets.
  5. Use the buckets each vector ends up getting put in to select phonemes for each template slot and generate new conlang words, along with a list of all of the model words whose vectors ended up in that same set of buckets.

This results in word forms in which each phoneme represents a category in some semantic classification scheme, rather like a traditional philosophical language--except, the categories are not obviously-sensible, human-defined categories such as you might find in a thesaurus, but weird collections of whatever happens to project into similar places in low-dimensional space. Getting reasonable definitions for your new words will still require work at selecting among the various options provided to you, or making up a new one in a similar semantic space--whatever you decide that means. Ideally, this should result in a lexicon with lots of discoverable sound-symbolism, but very little obvious regular morphology.

You could also decide that, rather than generating complete words, you just want to generate, e.g., individual syllables, which could then be compounded together to produce words with more specific meanings--essentially, simulating the process by which Chinese produced lots of homophones (single phonetic forms with wildly varying ambiguous meanings) and then used compounding to re-disambiguate the lexicon.

Or generate triliteral consonant roots, whose semantics will be narrowed down by intercalated vowel patterns.

Or something else entirely! Play around, experiment, have fun!

Example use

python vec2word.py model.bin "t,d,n,k,g,q,p,b,m" "i,u,e" "t,n,k,q,p,m" > syllables.txt

This uses the model.bin model to produce "words" on a CVC template and save the results in syllables.txt. For longer templates, just add more command-line arguments, each consisting of a comma-separated list of phonemes/graphemes that are allowed in the slot.

Many pre-built word2vec models suitable for use with this script can be downloaded from the NLPL Word Vectors Repository.

Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Rita Anjana 55 Nov 28, 2022
Code for Emergent Translation in Multi-Agent Communication

Emergent Translation in Multi-Agent Communication PyTorch implementation of the models described in the paper Emergent Translation in Multi-Agent Comm

Facebook Research 75 Jul 15, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

Yan Yuanmeng 478 Dec 25, 2022
This program do translate english words to portuguese

Python-Dictionary This program is used to translate english words to portuguese. Web-Scraping This program use BeautifulSoap to make web scraping, so

João Assalim 1 Oct 10, 2022
English loanwords in the world's languages

Wiktionary as CLDF Content cldf1 and cldf2 contain cldf-conform data sets with a total of 2 377 756 entries about the vocabulary of all 1403 languages

Viktor Martinović 3 Jan 14, 2022
Switch spaces for knowledge graph embeddings

SwisE Switch spaces for knowledge graph embeddings. Requirements: python3 pytorch numpy tqdm Reproduce the results To reproduce the reported results,

Shuai Zhang 4 Dec 01, 2021
Voice Assistant inspired by Google Assistant, Cortana, Alexa, Siri, ...

author: @shival_gupta VoiceAI This program is an example of a simple virtual assitant It will listen to you and do accordingly It will begin with wish

Shival Gupta 1 Jan 06, 2022
Shellcode antivirus evasion framework

Schrodinger's Cat Schrodinger'sCat is a Shellcode antivirus evasion framework Technical principle Please visit my blog https://idiotc4t.com/ How to us

idiotc4t 27 Jul 09, 2022
Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables

Mortgage-Application-Analysis Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables: age, in

1 Jan 29, 2022
The Classical Language Toolkit

Notice: This Git branch (dev) contains the CLTK's upcoming major release (v. 1.0.0). See https://github.com/cltk/cltk/tree/master and https://docs.clt

Classical Language Toolkit 754 Jan 09, 2023
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
Codes for processing meeting summarization datasets AMI and ICSI.

Meeting Summarization Dataset Meeting plays an essential part in our daily life, which allows us to share information and collaborate with others. Wit

xcfeng 39 Dec 14, 2022
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

🤗 Contributing to OpenSpeech 🤗 OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform ta

Openspeech TEAM 513 Jan 03, 2023
Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

Milo Fultz 2 Oct 10, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
Code examples for my Write Better Python Code series on YouTube.

Write Better Python Code This repository contains the code examples used in my Write Better Python Code series published on YouTube: https:/

858 Dec 29, 2022
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022