BMS-Molecular-Translation

Introduction

This is our pipeline for the Bristol-Myers Squibb – Molecular Translation competition, by Vadim Timakin and Maksim Zhdanov. We earned bronze medals in this competition. A significant part of the code originated from Y.Nakama's notebook.

The competition was an image-to-text translation task: given an image of a molecular skeletal structure, predict its InChI (International Chemical Identifier) string, for example:

InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)
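To make the target format concrete, here is a minimal sketch that splits the example string above into its slash-separated layers (version, molecular formula, then layers marked by a one-letter prefix such as `c` for connectivity and `h` for hydrogens). The helper name `split_inchi_layers` is hypothetical and not part of our pipeline, and the sketch assumes each prefix occurs at most once, which holds for this example but not for every InChI.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Remaining layers start with a one-letter prefix (c, h, ...).
            # Assumption: each prefix appears only once in the string.
            layers[part[0]] = part[1:]
    return layers


example = (
    "InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)"
    "6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)"
)
layers = split_inchi_layers(example)
print(layers["formula"])  # C16H13Cl2NO3
```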

Solution

General Encoder-Decoder concept

Most participants used a CNN encoder to extract image features and a decoder (LSTM/GRU/Transformer) to generate the text sequence. This is the standard approach to the image captioning problem.
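The decoding side of this setup can be sketched as a greedy loop that emits one token at a time until an end token appears. This is only an illustration, not our training or inference code: `greedy_decode`, `toy_next_scores`, and the token names are hypothetical, and the toy scoring function stands in for a real CNN encoder plus decoder conditioned on an image.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 300,
    sos: str = "<sos>",
    eos: str = "<eos>",
) -> str:
    """Repeatedly pick the highest-scoring next token until <eos> or max_len."""
    tokens = [sos]
    for _ in range(max_len):
        scores = next_scores(tokens)  # a real model would also use image features
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return "".join(tokens[1:])  # drop the start token


# Toy stand-in for the model: always "predicts" the InChI of methane.
TARGET = ["InChI=1S/", "C", "H", "4", "/h", "1H4", "<eos>"]


def toy_next_scores(tokens: List[str]) -> Dict[str, float]:
    wanted = TARGET[len(tokens) - 1]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in set(TARGET)}


print(greedy_decode(toy_next_scores))  # InChI=1S/CH4/h1H4
```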

Pseudo-labelling with InChI validation using RDKit

RDKit is an open-source cheminformatics toolkit, and it proved quite useful for this problem. Our first model scored around 7-8 on the public leaderboard (mean Levenshtein distance, lower is better), so we decided to pseudo-label the test data. Pseudo-labelling normally adds a significant number of wrong predictions to the extended training set. To limit this, we validated all of our predicted InChI strings with RDKit and selected around 800k samples that passed validation. Having fewer wrong labels among the pseudo-labels improved the score.
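The filtering step can be sketched as follows. This is a minimal illustration under assumptions, not our exact code: the function names are hypothetical, and the default validator assumes that "valid" means RDKit's `Chem.MolFromInchi` returns a molecule rather than `None`. The validator is a parameter so the selection logic can be tried without RDKit installed.

```python
from typing import Callable, Dict


def rdkit_is_valid(inchi: str) -> bool:
    """Return True if RDKit can parse the InChI string (requires RDKit)."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    return Chem.MolFromInchi(inchi) is not None


def select_pseudo_labels(
    predictions: Dict[str, str],
    is_valid: Callable[[str], bool] = rdkit_is_valid,
) -> Dict[str, str]:
    """Keep only test predictions whose InChI passes validation."""
    return {
        image_id: inchi
        for image_id, inchi in predictions.items()
        if is_valid(inchi)
    }


# Toy usage with a stand-in validator (assumption: real runs would use RDKit).
toy_predictions = {
    "img_001": "InChI=1S/CH4/h1H4",
    "img_002": "InChI=1S/C2H6/c1-2/h1-2H3",
    "img_003": "InChI=1S/C((",
}
toy_validator = lambda s: s.count("(") == s.count(")")
pseudo_labels = select_pseudo_labels(toy_predictions, is_valid=toy_validator)
print(sorted(pseudo_labels))  # ['img_001', 'img_002']
```

The surviving pairs would then be appended to the training set as extra labelled samples.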

Predictions normalization

This notebook explains InChI normalization.
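The notebook's contents are not reproduced here. As a rough sketch of one common way to normalize predictions, and only as an assumption about the general idea, a predicted string can be parsed with RDKit and written back out, keeping the original prediction when the round trip fails. The function names are hypothetical; `Chem.MolFromInchi` and `Chem.MolToInchi` are the RDKit calls assumed.

```python
from typing import Callable, List, Optional


def rdkit_roundtrip(inchi: str) -> Optional[str]:
    """Parse and re-emit an InChI with RDKit; None if either step fails."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        return None
    return Chem.MolToInchi(mol) or None


def normalize_predictions(
    predictions: List[str],
    roundtrip: Callable[[str], Optional[str]] = rdkit_roundtrip,
) -> List[str]:
    """Replace each prediction with its round-tripped form when available."""
    normalized = []
    for inchi in predictions:
        result = roundtrip(inchi)
        # Fall back to the raw prediction if normalization failed.
        normalized.append(result if result else inchi)
    return normalized
```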

Blending

Finally, we blended about 20 predictions from 2 models (mostly from different epochs), using RDKit validation to keep only predictions that correspond to a valid InChI structure.
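One way such a blend could work is sketched below. The exact selection rule is an assumption: for each image, take the most frequent prediction among those that pass validation, and fall back to the most frequent prediction overall when none is valid. The names `blend_predictions` and `is_valid` are hypothetical; in practice the validator could check that RDKit's `Chem.MolFromInchi` does not return `None`.

```python
from collections import Counter
from typing import Callable, Dict, List


def blend_predictions(
    candidates: Dict[str, List[str]],
    is_valid: Callable[[str], bool],
) -> Dict[str, str]:
    """Pick one InChI per image by majority vote over valid candidates.

    Assumes every image has at least one candidate prediction.
    """
    blended = {}
    for image_id, preds in candidates.items():
        valid = [p for p in preds if is_valid(p)]
        pool = valid if valid else preds  # fall back if nothing validates
        blended[image_id] = Counter(pool).most_common(1)[0][0]
    return blended


# Toy usage with a stand-in validator (balanced parentheses only).
toy_validator = lambda s: s.count("(") == s.count(")")
toy_candidates = {
    "img_001": ["InChI=1S/C((", "InChI=1S/C((", "InChI=1S/CH4/h1H4"],
    "img_002": ["InChI=1S/C(", "InChI=1S/C(", "InChI=1S/C(("],
}
blended = blend_predictions(toy_candidates, toy_validator)
print(blended["img_001"])  # InChI=1S/CH4/h1H4
```

In the first toy case the invalid string has more votes but is discarded, so the single valid candidate wins.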

Final private LB score 1.79

Owner

Maksim Zhdanov
