Repository for Project Insight: NLP as a Service


Contents

  1. Introduction
  2. Installation
  3. Project Details
  4. License

Introduction

Project Insight is designed to provide NLP as a service. It includes the code base for both a front-end GUI (Streamlit) and a backend server (FastAPI), and demonstrates the use of transformer models on various downstream NLP tasks.

The downstream NLP tasks covered:

  • News Classification

  • Entity Recognition

  • Sentiment Analysis

  • Summarization

  • Information Extraction (to do)

Users can select different models from the drop-down menu to run inference.

Users can also call the backend FastAPI server directly to run inference from the command line.
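For command-line inference against the backend, a request could be sent from a short Python script. The following is a minimal sketch only: the base URL, the `/api/v1/<task>/predict` route, and the `model` and `text` payload fields are assumptions made for illustration and are not taken from this repository. Check each service's FastAPI documentation for the real routes and request schema.

```python
import json
import urllib.request


def build_inference_request(base_url, task, model, text):
    """Build (but do not send) a JSON POST request for one NLP task.

    The route and the payload field names are hypothetical placeholders.
    """
    url = f"{base_url.rstrip('/')}/api/v1/{task}/predict"  # assumed route
    payload = json.dumps({"model": model, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_inference(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the target URL; call run_inference(req)
    # once the backend is up and the real route is known.
    req = build_inference_request(
        "http://localhost:8080", "sentiment", "distilbert", "I love this!"
    )
    print(req.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before the backend is running.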

Features of the solution

  • Python Code Base: Built with FastAPI and Streamlit, so the complete code base is in Python.
  • Expandable: The backend is designed so that it can be extended with more Transformer-based models, which then become available in the front-end app automatically.
  • Micro-Services: The backend is designed with a microservices architecture, with a Dockerfile for each service and Nginx as a reverse proxy to each independently running service.
    • This makes it easy to update, maintain, start, and stop individual NLP services.

Installation

  • Clone the Repo.
  • Run Docker Compose to spin up the FastAPI-based backend service.
  • Run the Streamlit app with the streamlit run command.

Setup and Documentation

  1. Download the models

    • Download the models from here
    • Save them in the specific model folders inside the src_fastapi folder.
  2. Running the backend service.

    • Go to the src_fastapi folder
    • Run the Docker Compose command
    $ cd src_fastapi
    src_fastapi:~$ sudo docker-compose up -d
  3. Running the frontend app.

    • Go to the src_streamlit folder
    • Run the app with the streamlit run command
    $ cd src_streamlit
    src_streamlit:~$ streamlit run NLPfily.py
  4. Access to FastAPI Documentation: Since this is a microservice-based design, every NLP task has its own separate documentation.
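FastAPI serves interactive Swagger documentation at `/docs` by default, unless the application sets a different `docs_url`. Assuming each service keeps that default, the documentation address for every task can be derived from the service's base URL. The base URLs in this sketch are hypothetical placeholders; the actual ones depend on how Nginx and Docker Compose are configured in your deployment.

```python
def docs_urls(service_base_urls):
    """Map each service name to its assumed Swagger UI URL (FastAPI default /docs)."""
    return {
        name: f"{base.rstrip('/')}/docs"
        for name, base in service_base_urls.items()
    }


if __name__ == "__main__":
    # Hypothetical base URLs; replace them with the ones your deployment exposes.
    services = {
        "classification": "http://localhost:8080/api/v1/classification",
        "sentiment": "http://localhost:8080/api/v1/sentiment",
    }
    for name, url in docs_urls(services).items():
        print(f"{name}: {url}")
```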

Project Details

Demonstration

Project Insight Demo

Directory Details

  • Front End: Front-end code is in the src_streamlit folder, along with the Dockerfile and requirements.txt.

  • Back End: Back-end code is in the src_fastapi folder.

    • This folder contains a directory for each task: Classification, ner, summary, etc.
    • Each NLP task is implemented as a microservice with its own FastAPI server, requirements, and Dockerfile, so that the services can be maintained and managed independently.
    • Each NLP task has its own folder, and within that folder each trained model has its own folder. For example:
    - sentiment
        > app
            > api
                > distilbert
                    - model.bin
                    - network.py
                    - tokeniser files
                > roberta
                    - model.bin
                    - network.py
                    - tokeniser files
    
    • For each new model under a service, a new folder has to be added.

    • Each model folder needs the following files:

      • Model bin file.
      • Tokenizer files.
      • network.py, which defines the model class if a customized model is used.
    • config.json: This file contains the details of the models in the backend and the dataset they are trained on.
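As an illustration of how config.json could drive the model drop-down, the sketch below lists the model names available for each task. It assumes the whole file follows the task → "model-N" → {"name", "info"} layout shown in the classification excerpt in this README; the function names are hypothetical and are not part of the project's code.

```python
import json


def load_config(path):
    """Read and parse a config.json file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def list_models(config):
    """Return {task: [model display names]} from a parsed config."""
    return {
        task: [entry["name"] for entry in models.values()]
        for task, models in config.items()
    }


if __name__ == "__main__":
    # Inline sample mirroring the assumed layout; a real run would use
    # load_config("config.json") instead.
    sample = json.loads(
        """
        {
            "classification": {
                "model-1": {"name": "DistilBERT", "info": "Model Info"},
                "model-2": {"name": "BERT", "info": "Model Info"}
            }
        }
        """
    )
    print(list_models(sample))  # {'classification': ['DistilBERT', 'BERT']}
```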

How to Add a new Model

  1. Fine-tune a transformer model for the specific task. You can leverage the transformers-tutorials.

  2. Save the model files and tokenizer files, and create a network.py script if you use a customized training network.

  3. Create a directory within the NLP task folder, named after the model, and save all the files in it.

  4. Update config.json with the model details and dataset details.

  5. Update <service>pro.py with the correct imports and the conditions where the model is selected. For example, for a new BERT model in the classification task, do the following:

    • Create a new directory named bert in classification/app/api/.

    • Update config.json with the following:

      "classification": {
          "model-1": {
              "name": "DistilBERT",
              "info": "This model is trained on News Aggregator Dataset from UC Irvine Machine Learning Repository. The news headlines are classified into 4 categories: **Business**, **Science and Technology**, **Entertainment**, **Health**. [New Dataset](https://archive.ics.uci.edu/ml/datasets/News+Aggregator)"
          },
          "model-2": {
              "name": "BERT",
              "info": "Model Info"
          }
      }
    • Update classificationpro.py with the following snippets:

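      # Import the model class only if a customized class is used
      from classification.bert import BertClass
      # Tokenizer class used below (skip if the file already imports it)
      from transformers import BertTokenizerFast

      # In the section where the model is selected
      if model == "bert":
          self.model = BertClass()
          self.tokenizer = BertTokenizerFast.from_pretrained(self.path)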
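Step 4 above (updating config.json) can also be scripted. The sketch below adds a model under the next free "model-N" key for a task, assuming the key pattern and the "name" and "info" fields shown in the config.json excerpt. The helper names are hypothetical and are not part of the project's code.

```python
import json
from pathlib import Path


def add_model_entry(config, task, name, info):
    """Add a model under the next free "model-N" key and return that key."""
    models = config.setdefault(task, {})
    index = 1
    while f"model-{index}" in models:
        index += 1
    key = f"model-{index}"
    models[key] = {"name": name, "info": info}
    return key


def update_config_file(path, task, name, info):
    """Load config.json, add the model entry, and write the file back."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    key = add_model_entry(config, task, name, info)
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return key


if __name__ == "__main__":
    config = {
        "classification": {
            "model-1": {"name": "DistilBERT", "info": "Model Info"}
        }
    }
    print(add_model_entry(config, "classification", "BERT", "Model Info"))  # model-2
```

Editing the file by hand works just as well; a helper like this only guards against reusing an existing "model-N" key.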

License

This project is licensed under the GPL-3.0 License. See the LICENSE.md file for details.

Owner
Abhishek Kumar Mishra