Question and answer retrieval in Turkish with BERT

Overview

trfaq

Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉

What is this?

At this repo, I'm releasing the training script and a full working inference example for my model mys/bert-base-turkish-cased-nli-mean-faq-mnr published on HuggingFace. Please note that the training code at finetune_tf.py is a simplified version of the original, which is intended for educational purposes and not optimized for anything. However, it contains an implementation of the Multiple Negatives Symmetric Ranking loss, and you can use it in your own work. Additionally, I cleaned and filtered the Turkish subset of the clips/mqa dataset, as it contains lots of mis-encoded texts. You can download this cleaned dataset here.

Model

This is a finetuned version of mys/bert-base-turkish-cased-nli-mean for FAQ retrieval, which is itself a finetuned version of dbmdz/bert-base-turkish-cased for NLI. It maps questions & answers to 768 dimensional vectors to be used for FAQ-style chatbots and answer retrieval in question-answering pipelines. It was trained on the Turkish subset of clips/mqa dataset after some cleaning/ filtering and with a Multiple Negatives Symmetric Ranking loss. Before finetuning, I added two special tokens to the tokenizer (i.e., for questions and for answers) and resized the model embeddings, so you need to prepend the relevant tokens to the sequences before feeding them into the model. Please have a look at my accompanying repo to see how it was finetuned and how it can be used in inference. The following code snippet is an excerpt from the inference at the repo.

Usage

see inference.py for a full working example.

" + q for q in questions] answers = ["" + a for a in answers] def answer_faq(model, tokenizer, questions, answers, return_similarities=False): q_len = len(questions) tokens = tokenizer(questions + answers, padding=True, return_tensors='tf') embs = model(**tokens)[0] attention_masks = tf.cast(tokens['attention_mask'], tf.float32) sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True) masked_embs = embs * tf.expand_dims(attention_masks, axis=-1) masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32) a = tf.math.l2_normalize(masked_embs[:q_len, :], axis=1) b = tf.math.l2_normalize(masked_embs[q_len:, :], axis=1) similarities = tf.matmul(a, b, transpose_b=True) scores = tf.nn.softmax(similarities) results = list(zip(answers, scores.numpy().squeeze().tolist())) sorted_results = sorted(results, key=lambda x: x[1], reverse=True) sorted_results = [{"answer": answer.replace("", ""), "score": f"{score:.4f}"} for answer, score in sorted_results] return sorted_results for question in questions: results = answer_faq(model, tokenizer, [question], answers) print(question.replace("", "")) print(results) print("---------------------") ">
questions = [
    "Merhaba",
    "Nasılsın?",
    "Bireysel araç kiralama yapıyor musunuz?",
    "Kurumsal araç kiralama yapıyor musunuz?"
]

answers = [
    "Merhaba, size nasıl yardımcı olabilirim?",
    "İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?",
    "Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?",
    "Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?"
]


questions = ["" + q for q in questions]
answers = ["" + a for a in answers]


def answer_faq(model, tokenizer, questions, answers, return_similarities=False):
    q_len = len(questions)
    tokens = tokenizer(questions + answers, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]

    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)
    a = tf.math.l2_normalize(masked_embs[:q_len, :], axis=1)
    b = tf.math.l2_normalize(masked_embs[q_len:, :], axis=1)

    similarities = tf.matmul(a, b, transpose_b=True)
        
    scores = tf.nn.softmax(similarities)
    results = list(zip(answers, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"answer": answer.replace("", ""), "score": f"{score:.4f}"} for answer, score in sorted_results]
    return sorted_results


for question in questions:
    results = answer_faq(model, tokenizer, [question], answers)
    print(question.replace("", ""))
    print(results)
    print("---------------------")

And the output is:

Merhaba
[{'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.2931'}, {'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2751'}, {'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2200'}, {'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.2118'}]
---------------------
Nasılsın?
[{'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2808'}, {'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.2623'}, {'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2320'}, {'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.2249'}]
---------------------
Bireysel araç kiralama yapıyor musunuz?
[{'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2861'}, {'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.2768'}, {'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2215'}, {'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.2156'}]
---------------------
Kurumsal araç kiralama yapıyor musunuz?
[{'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.3060'}, {'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2929'}, {'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2066'}, {'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.1945'}]
---------------------
Owner
M. Yusuf Sarıgöz
AI research engineer and Google Developer Expert on Machine Learning. Open to new opportunities.
M. Yusuf Sarıgöz
CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

- 基于标题的大规模商品实体检索top1 一、任务介绍 CCKS 2020:基于标题的大规模商品实体检索,任务为对于给定的一个商品标题,参赛系统需要匹配到该标题在给定商品库中的对应商品实体。 输入:输入文件包括若干行商品标题。 输出:输出文本每一行包括此标题对应的商品实体,即给定知识库中商品 ID,

43 Nov 11, 2022
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
AI and Machine Learning workflows on Anthos Bare Metal.

Hybrid and Sovereign AI on Anthos Bare Metal Table of Contents Overview Terraform as IaC Substrate ABM Cluster on GCE using Terraform TensorFlow ResNe

Google Cloud Platform 8 Nov 26, 2022
This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

POS-Tagger This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

Raihan Ahmed 1 Dec 09, 2021
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to ach

Keon Lee 237 Jan 02, 2023
This is a really simple text-to-speech app made with python and tkinter.

Tkinter Text-to-Speech App by Souvik Roy This is a really simple tkinter app which converts the text you have entered into a speech. It is created wit

Souvik Roy 1 Dec 21, 2021
This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

37 Dec 04, 2022
UniSpeech - Large Scale Self-Supervised Learning for Speech

UniSpeech The family of UniSpeech: WavLM (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing UniSpeech (ICML 202

Microsoft 281 Dec 15, 2022
(ACL-IJCNLP 2021) Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models.

BERT Convolutions Code for the paper Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models. Contains expe

mlpc-ucsd 21 Jul 18, 2022
SAINT PyTorch implementation

SAINT-pytorch A Simple pyTorch implementation of "Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing" based on https://arx

Arshad Shaikh 63 Dec 25, 2022
Production First and Production Ready End-to-End Keyword Spotting Toolkit

Production First and Production Ready End-to-End Keyword Spotting Toolkit

223 Jan 02, 2023
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Moment-DETR QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries Jie Lei, Tamara L. Berg, Mohit Bansal For dataset de

Jie Lei 雷杰 133 Dec 22, 2022
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

Maluuba Inc. 309 Oct 19, 2022
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

49 Dec 17, 2022
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
Utility for Google Text-To-Speech batch audio files generator. Ideal for prompt files creation with Google voices for application in offline IVRs

Google Text-To-Speech Batch Prompt File Maker Are you in the need of IVR prompts, but you have no voice actors? Let Google talk your prompts like a pr

Ponchotitlán 1 Aug 19, 2021
Score-Based Point Cloud Denoising (ICCV'21)

Score-Based Point Cloud Denoising (ICCV'21) [Paper] https://arxiv.org/abs/2107.10981 Installation Recommended Environment The code has been tested in

Shitong Luo 79 Dec 26, 2022