Korean Sentence Embedding Repository

Overview

Korean-Sentence-Embedding

🍭 Korean sentence embedding repository. You can download the pre-trained models and run inference right away. The repository also provides an environment in which you can train your own models.

Baseline Models

Baseline models used for Korean sentence embedding: KLUE-PLMs

| Model | Embedding size | Hidden size | # Layers | # Heads |
|---|---|---|---|---|
| KLUE-BERT-base | 768 | 768 | 12 | 12 |
| KLUE-RoBERTa-base | 768 | 768 | 12 | 12 |

NOTE: All the pretrained models are uploaded to the Hugging Face Model Hub. See https://huggingface.co/klue.

How to start

  • Get datasets to train or test.
bash get_model_dataset.sh
  • If you want to do inference quickly, download the pre-trained models and then you can start some downstream tasks.
bash get_model_checkpoint.sh
cd KoSBERT/
python SemanticSearch.py

Available Models

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [SBERT]-[EMNLP 2019]
  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings [SimCSE]-[EMNLP 2021]

KoSentenceBERT

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv (first phase, training on NLI), sts-train.tsv (second phase, continued training on STS)
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
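
The two training phases listed above can be illustrated with the objectives described in the Sentence-BERT paper: a softmax classifier over the concatenation (u, v, |u - v|) for NLI, and a mean squared error between cosine similarity and the gold score for STS. The following is a minimal sketch under that assumption, not the repository's training code; the function names `nli_features` and `sts_regression_loss` and the `max_score=5.0` rescaling are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nli_features(u, v):
    """Phase 1 (NLI): classifier input (u, v, |u - v|), as in the SBERT paper."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def sts_regression_loss(pairs, gold_scores, max_score=5.0):
    """Phase 2 (STS): MSE between cosine similarity and the gold score
    rescaled from [0, max_score] to [0, 1] (hypothetical rescaling)."""
    errors = [(cosine(u, v) - gold / max_score) ** 2
              for (u, v), gold in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)


# Toy sentence embeddings, chosen only for illustration.
u, v = [1.0, 0.0], [0.0, 1.0]
print(nli_features(u, v))
print(sts_regression_loss([(u, u), (u, v)], [5.0, 0.0]))
```

In the first phase the feature vector would be passed to a 3-way softmax layer (entailment, neutral, contradiction); in the second phase the encoder is tuned directly so that cosine similarity tracks the human score.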

KoSimCSE

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
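
Because the training files are NLI datasets, the setup resembles the supervised variant of SimCSE, in which each premise is pulled toward its entailment hypothesis and pushed away from the other hypotheses in the batch, with contradiction hypotheses acting as hard negatives. The sketch below shows that contrastive loss in plain Python. It assumes the supervised objective and the temperature of 0.05 from the SimCSE paper; the function name `supervised_simcse_loss` is illustrative and this is not the repository's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_simcse_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean contrastive loss over a batch of (premise, entailment,
    contradiction) embedding triples."""
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-softmax of the anchor's own positive (index i).
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings, chosen only for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]
print(supervised_simcse_loss(anchors, positives, hard_negatives))
```

The loss is close to zero when each anchor is most similar to its own positive, and it grows when another candidate in the batch scores higher.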

Performance

  • Semantic Textual Similarity test set results
| Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|
| KoSBERT†SKT | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERTbase | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTabase | 80.70 | 81.03 | 80.97 | 81.06 | 80.84 | 80.97 | 79.20 | 78.93 |
| KoSimCSE-BERT†SKT | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERTbase | 82.73 | 83.51 | 82.32 | 82.78 | 82.43 | 82.88 | 77.86 | 76.70 |
| KoSimCSE-RoBERTabase | 83.64 | 84.05 | 83.32 | 83.84 | 83.33 | 83.79 | 80.92 | 79.84 |
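
Each column above pairs a similarity function (cosine, negative Euclidean distance, negative Manhattan distance, dot product) with a correlation measure (Pearson or Spearman) against the gold STS labels. The following is a minimal plain-Python sketch of how such scores can be computed. It assumes the table reports correlation multiplied by 100; the function name `sts_scores` and the toy data are illustrative, not taken from the repository.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sts_scores(pairs, gold):
    """Return {metric: (pearson * 100, spearman * 100)} for embedding pairs."""
    sims = {
        "cosine": [cosine(u, v) for u, v in pairs],
        # Distances are negated so that larger always means more similar.
        "euclidean": [-math.dist(u, v) for u, v in pairs],
        "manhattan": [-sum(abs(a - b) for a, b in zip(u, v)) for u, v in pairs],
        "dot": [sum(a * b for a, b in zip(u, v)) for u, v in pairs],
    }
    return {name: (100 * pearson(s, gold), 100 * spearman(s, gold))
            for name, s in sims.items()}


# Toy embedding pairs and gold similarity labels on a 0-5 scale.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]),
         ([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [-1.0, 0.0])]
gold = [5.0, 3.0, 1.0, 0.0]
for name, (p, s) in sts_scores(pairs, gold).items():
    print(f"{name}: Pearson {p:.2f}, Spearman {s:.2f}")
```

Spearman depends only on the ordering of the similarity scores, so it is less sensitive than Pearson to how a similarity function is scaled.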

Downstream Tasks

  • KoSBERT: Semantic Search, Clustering
python SemanticSearch.py
python Clustering.py
  • KoSimCSE: Semantic Search
python SemanticSearch.py

Semantic Search (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences (English glosses in the comments)
corpus = ['한 남자가 음식을 먹는다.',  # A man is eating food.
          '한 남자가 빵 한 조각을 먹는다.',  # A man is eating a piece of bread.
          '그 여자가 아이를 돌본다.',  # The woman is taking care of a child.
          '한 남자가 말을 탄다.',  # A man is riding a horse.
          '한 여자가 바이올린을 연주한다.',  # A woman is playing the violin.
          '두 남자가 수레를 숲 솦으로 밀었다.',  # Two men pushed a cart into the woods.
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',  # A man is riding a white horse on ground enclosed by a wall.
          '원숭이 한 마리가 드럼을 연주한다.',  # A monkey is playing the drums.
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']  # A cheetah is running behind its prey.

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences (English glosses in the comments):
queries = ['한 남자가 파스타를 먹는다.',  # A man is eating pasta.
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',  # Someone in a gorilla costume is playing the drums.
           '치타가 들판을 가로 질러 먹이를 쫓는다.']  # A cheetah chases its prey across a field.

# Find the top_k corpus sentences closest to each query, based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    # Move the scores to the CPU and convert them to a NumPy array for np.argpartition
    cos_scores = cos_scores.cpu().numpy()

    # np.argpartition sorts only the first top_k positions, so the rest of the corpus is not fully sorted
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
  • Results are as follows:

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6141)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5952)
한 남자가 말을 탄다. (Score: 0.1231)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0752)
두 남자가 수레를 숲 솦으로 밀었다. (Score: 0.0486)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6656)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2988)
한 여자가 바이올린을 연주한다. (Score: 0.1566)
한 남자가 말을 탄다. (Score: 0.1112)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0262)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7570)
두 남자가 수레를 숲 솦으로 밀었다. (Score: 0.3658)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.3583)
한 남자가 말을 탄다. (Score: 0.0505)
그 여자가 아이를 돌본다. (Score: -0.0087)

Clustering (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences (English glosses in the comments)
corpus = ['한 남자가 음식을 먹는다.',  # A man is eating food.
          '한 남자가 빵 한 조각을 먹는다.',  # A man is eating a piece of bread.
          '그 여자가 아이를 돌본다.',  # The woman is taking care of a child.
          '한 남자가 말을 탄다.',  # A man is riding a horse.
          '한 여자가 바이올린을 연주한다.',  # A woman is playing the violin.
          '두 남자가 수레를 숲 솦으로 밀었다.',  # Two men pushed a cart into the woods.
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',  # A man is riding a white horse on ground enclosed by a wall.
          '원숭이 한 마리가 드럼을 연주한다.',  # A monkey is playing the drums.
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',  # A cheetah is running behind its prey.
          '한 남자가 파스타를 먹는다.',  # A man is eating pasta.
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',  # Someone in a gorilla costume is playing the drums.
          '치타가 들판을 가로 질러 먹이를 쫓는다.']  # A cheetah chases its prey across a field.

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
  • Results are as follows:
Cluster  1
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster  2
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']

Cluster  3
['한 남자가 말을 탄다.', '두 남자가 수레를 숲 솦으로 밀었다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster  4
['치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster  5
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']

References

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
Owner
Self-softmax