Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

Last update: Nov 24, 2022

Overview

KoSimCSE

A PyTorch implementation of Simple Contrastive Learning of Sentence Embeddings (SimCSE) for Korean
- SimCSE

Installation

git clone https://github.com/BM-K/KoSimCSE.git
cd KoSimCSE
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
pip install .
cd ..
pip install -r requirements.txt

Training (supervised only)

Model
- SKT KoBERT
Dataset
- kakaobrain NLU dataset
  - train: KorNLI
  - dev & test: KorSTS
Setting
- epochs: 3
- dropout: 0.1
- batch size: 256
- temperature: 0.05
- learning rate: 5e-5
- warm-up ratio: 0.05
- max sequence length: 50
- evaluation steps during training: 250
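To make the role of the temperature setting concrete, here is a minimal NumPy sketch of the supervised SimCSE objective as described in the SimCSE paper. It assumes each premise has one entailment sentence (positive) and one contradiction sentence (hard negative), and that the encoder has already produced embeddings. The function names are hypothetical and are not taken from this repository, which implements training in PyTorch.

```python
import numpy as np


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    """Illustrative supervised SimCSE loss (hypothetical helper, not the repo's code).

    anchor, positive, negative: arrays of shape (batch, hidden).
    Row i of `positive` is the positive for row i of `anchor`; every other
    positive and every negative in the batch acts as a negative.
    """
    logits = np.concatenate(
        [cosine_matrix(anchor, positive), cosine_matrix(anchor, negative)],
        axis=1,
    ) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(anchor.shape[0])
    # Cross-entropy with the matching positive (column i) as the target.
    return float(-log_prob[idx, idx].mean())
```

A smaller temperature sharpens the softmax, so small differences in cosine similarity translate into larger differences in the loss.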
Run training, testing, and semantic search in sequence:

bash run_example.sh
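The warm-up ratio in the settings above can be read as the fraction of training steps over which the learning rate ramps up to its peak. The sketch below assumes a linear ramp followed by a linear decay to zero; the decay shape and the function name are assumptions for illustration, not confirmed details of this repository's scheduler.

```python
def linear_warmup_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.05):
    """Learning rate at `step` under linear warm-up then linear decay.

    Hypothetical helper: only base_lr and warmup_ratio come from the
    settings listed above; total_steps depends on dataset and batch size.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With a hypothetical 1,000 total steps, the first 50 steps are warm-up and the peak rate of 5e-5 is reached at step 50.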

Pre-Trained Models

Using BERT [CLS] token representation
Pre-trained model checkpoint
- Google Drive Sharing
- ./output/nli_checkpoint.pt
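Using the [CLS] token representation means the sentence embedding is the encoder's output vector at the first token position. A minimal sketch follows, assuming the encoder output has shape (batch, sequence length, hidden size) and that [CLS] is at position 0, as in standard BERT inputs. The function name is hypothetical.

```python
import numpy as np


def cls_pooling(last_hidden_state):
    """Return the first-token ([CLS]) vector for each sentence.

    last_hidden_state: array of shape (batch, seq_len, hidden).
    """
    return last_hidden_state[:, 0, :]


# Toy stand-in for encoder output: 2 sentences, 50 tokens, 768 dimensions.
hidden = np.zeros((2, 50, 768))
sentence_embeddings = cls_pooling(hidden)  # shape (2, 768)
```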

Performance

Model	Cosine Pearson	Cosine Spearman	Euclidean Pearson	Euclidean Spearman	Manhattan Pearson	Manhattan Spearman	Dot Pearson	Dot Spearman
KoSBERT_SKT*	78.81	78.47	77.68	77.78	77.71	77.83	75.75	75.22
KoSimCSE_SKT	81.55	82.11	81.70	81.69	81.65	81.60	78.19	77.18

*: KoSBERT_SKT
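The column names in the table pair a similarity function (cosine, Euclidean, Manhattan, dot product) with a correlation measure (Pearson or Spearman) computed against the gold KorSTS scores. The sketch below shows how the cosine variants could be computed with NumPy alone. It is an illustration with hypothetical function names, not the repository's evaluation code, and the rank function breaks ties by order rather than averaging them.

```python
import numpy as np


def cosine_scores(a, b):
    """Row-wise cosine similarity between paired sentence embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(x):
    """Ranks 0..n-1 of the values in x (simplification: no tie averaging)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Given predicted similarities `pred = cosine_scores(emb1, emb2)` and gold scores `gold`, the two cosine columns would correspond to `pearson(pred, gold)` and `spearman(pred, gold)`, presumably scaled by 100 as is common for STS results.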

Example Downstream Task

Semantic Search

python SemanticSearch.py

import numpy as np
from model.utils import pytorch_cos_sim
from data.dataloader import convert_to_tensor, example_model_setting


def main():
    model_ckpt = './output/nli_checkpoint.pt'
    model, transform, device = example_model_setting(model_ckpt)

    # Corpus with example sentences
    corpus = ['한 남자가 음식을 먹는다.',
              '한 남자가 빵 한 조각을 먹는다.',
              '그 여자가 아이를 돌본다.',
              '한 남자가 말을 탄다.',
              '한 여자가 바이올린을 연주한다.',
              '두 남자가 수레를 숲 속으로 밀었다.',
              '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
              '원숭이 한 마리가 드럼을 연주한다.',
              '치타 한 마리가 먹이 뒤에서 달리고 있다.']

    inputs_corpus = convert_to_tensor(corpus, transform)

    corpus_embeddings = model.encode(inputs_corpus, device)

    # Query sentences:
    queries = ['한 남자가 파스타를 먹는다.',
               '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
               '치타가 들판을 가로 질러 먹이를 쫓는다.']

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    for query in queries:
        query_embedding = model.encode(convert_to_tensor([query], transform), device)
        cos_scores = pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
        cos_scores = cos_scores.cpu().detach().numpy()

        top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

        print("\n\n======================\n\n")
        print("Query:", query)
        print("\nTop %d most similar sentences in corpus:" % top_k)

        for idx in top_results:
            print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))


if __name__ == '__main__':
    main()

Result

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6002)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5938)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.0696)
한 남자가 말을 탄다. (Score: 0.0328)
원숭이 한 마리가 드럼을 연주한다. (Score: -0.0048)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6489)
한 여자가 바이올린을 연주한다. (Score: 0.3670)
한 남자가 말을 탄다. (Score: 0.2322)
그 여자가 아이를 돌본다. (Score: 0.1980)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1628)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7756)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1814)
한 남자가 말을 탄다. (Score: 0.1666)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.1530)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1270)
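The ranking in the script relies on `np.argpartition(-cos_scores, range(top_k))`. Passing a sequence as `kth` places each of those positions at its sorted position, so the first `top_k` indices come out ordered from highest to lowest score. The standalone sketch below isolates that step; the helper name and the toy scores are made up for illustration.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k highest scores, ordered from best to worst."""
    k = min(k, len(scores))
    # Negate so that the smallest values correspond to the highest scores.
    return np.argpartition(-scores, range(k))[:k]


toy_scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
best = top_k_indices(toy_scores, 3)
```

For large corpora this avoids fully sorting every score, since only the first `k` positions are guaranteed to be in order.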

Citing

SimCSE

@article{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   journal={arXiv preprint arXiv:2104.08821},
   year={2021}
}

KorNLU Datasets

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
