NLP Overview

  1. Introduction
  • The field of NLP encompasses a variety of topics that involve the computational processing and understanding of human languages. Since the 1980s, the field has increasingly relied on data-driven computation involving statistics, probability, and machine learning.

  • Recent increases in computational power and parallelization, harnessed by Graphics Processing Units (GPUs), now allow for “deep learning”, which utilizes artificial neural networks (ANNs), sometimes with billions of trainable parameters. Additionally, the contemporary availability of large datasets, facilitated by sophisticated data collection processes, enables the training of such deep architectures.

  2. Natural Language Processing
  • The field of natural language processing, also known as computational linguistics, involves the engineering of computational models and processes to solve practical problems in understanding human languages, although it is sometimes difficult to clearly distinguish to which area a given issue belongs.

  • Work in NLP can be divided into two broad sub-areas: core areas and applications.

  • Often one needs to handle one or more of the core issues successfully and apply those ideas and procedures to solve practical problems.

  • Currently, NLP is primarily a data-driven field using statistical and probabilistic computations along with machine learning.

2.1. Core Areas of NLP

  • The core issues are those that are inherently present in any computational linguistic system. To perform translation, text summarization, image captioning, or any other linguistic task, there must be some understanding of the underlying language. This understanding can be broken down into at least four main areas: language modeling, morphology, parsing, and semantics. The number of scholarly works in each area over the last decade is shown in Figure 3.


Fig. 3: Publication Volume for Core Areas of NLP. The number of publications, indexed by Google Scholar, relating to each topic over the last decade is shown. While all areas have experienced growth, language modeling has grown the most.

2.1.1 Language Modeling and Word Embeddings

  • Language modeling can be viewed in two ways. First, it determines which words follow which. By extension, however, this can be viewed as determining what words mean, as individual words are only weakly meaningful, deriving their full value only from their interactions with other words.

  • The core areas address fundamental problems such as language modeling, which underscores quantifying associations among naturally occurring words.

  • Arguably, the most important task in NLP is that of language modeling. Language modeling (LM) is an essential piece of almost any application of NLP. Language modeling is the process of creating a model to predict words or simple linguistic components given previous words or components. This is useful for applications in which a user types input, to provide predictive ability for fast text entry. However, its power and versatility emanate from the fact that it can implicitly capture syntactic and semantic relationships among words or components in a linear neighborhood, making it useful for tasks such as machine translation or text summarization. Using prediction, such programs are able to generate more relevant, human-sounding sentences.
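
  • To make the prediction view concrete, the sketch below builds a toy bigram language model that estimates which word is likely to follow a given word; the corpus, function name, and resulting probabilities are purely illustrative and not taken from the surveyed papers.

```python
from collections import defaultdict, Counter

# Toy corpus; a real language model would be trained on millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count how often each word follows each other word (bigram counts).
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def next_word_probs(prev_word):
    """Estimate P(next word | prev_word) by relative frequency."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Predict plausible continuations of "the".
print(next_word_probs("the"))
```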

2.1.2. Morphology

  • Morphology is the study of how words themselves are formed. It considers the roots of words and the use of prefixes and suffixes, compounds, and other intraword devices to convey tense, gender, plurality, and other linguistic constructs.

  • Morphological processing deals with the segmentation of words into their meaningful components and with identifying the true parts of speech of words as they are used.

  • Morphology is concerned with finding segments within single words, including roots and stems, prefixes, suffixes, and, in some languages, infixes. Affixes (prefixes, suffixes, or infixes) are used to overtly modify stems for gender, number, person, et cetera.
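
  • As a small, classical illustration of affix handling (not one of the neural morphology models discussed above), the sketch below uses NLTK's rule-based Porter stemmer to strip suffixes and recover approximate stems.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer applies suffix-stripping rules; it approximates the stem
# rather than performing a full morphological analysis.
stemmer = PorterStemmer()
for word in ["running", "runs", "easily", "nationalities", "studied"]:
    print(f"{word:15s} -> {stemmer.stem(word)}")
```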

2.1.3. Parsing

  • Parsing considers which words modify others, forming constituents and leading to a sentential structure. Semantics, by contrast, is the study of what words mean; it takes into account the meanings of the individual words and how they relate to and modify others, as well as the context these words appear in and some degree of world knowledge, i.e., “common sense”. There is a significant amount of overlap between these areas.

  • Syntactic processing, or parsing, builds sentence diagrams as possible precursors to semantic processing.

  • Parsing examines how different words and phrases relate to each other within a sentence. There are at least two distinct forms of parsing: constituency parsing and dependency parsing. In constituency parsing, phrasal constituents are extracted from a sentence in a hierarchical fashion. Dependency parsing looks at the relationships between pairs of individual words.
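
  • A dependency-parse sketch is shown below using spaCy, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`; each token is linked to its syntactic head by a labeled relation, which is exactly the pairwise-word view described above.

```python
import spacy

# Load a small pre-trained English pipeline (assumed to be installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Print each word, its dependency label, and the head word it depends on.
for token in doc:
    print(f"{token.text:8s} --{token.dep_:>6s}--> {token.head.text}")
```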

  • Remaining Challenges: Outside of universal parsing, a parsing challenge that needs to be further investigated is the building of syntactic structures without the use of treebanks for training. Attempts have been made using attention scores and Tree-LSTMs, as well as outside-inside auto-encoders. If such approaches are successful, they have potential use in many environments, including in the context of low-resource languages and out-of-domain scenarios. While a number of other challenges remain, these are the largest and are expected to receive the most focus.

2.1.4. Semantics

  • Semantic processing attempts to distill the meaning of words, phrases, and higher-level components in text.

  • Semantic processing involves understanding the meaning of words, phrases, sentences, or documents at some level. Word embeddings, such as Word2Vec and GloVe, claim to capture meanings of words, following the Distributional Hypothesis of Meaning. As a corollary, when vectors corresponding to phrases, sentences, or other components of text are processed using a neural network, a representation that can be loosely thought to be semantically representative is computed compositionally.
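
  • A minimal sketch of the Distributional Hypothesis in action is given below, using pre-trained GloVe vectors loaded through gensim's downloader (the model identifier is one of gensim's standard names and is downloaded on first use); words that occur in similar contexts receive similar vectors and therefore high cosine similarity.

```python
import gensim.downloader as api

# Load 50-dimensional GloVe word vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Cosine similarity between two word vectors.
print(vectors.similarity("king", "queen"))

# Nearest neighbors in the embedding space.
print(vectors.most_similar("paris", topn=3))
```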

  • In this section, neural semantic processing research is separated into two distinct areas: work on comparing the semantic similarity of two portions of text, and work on capturing and transferring meaning in high-level constituents, particularly sentences.

  • Semantic Challenges: In addition to the challenges already mentioned, researchers believe that being able to solve tasks well does not indicate actual understanding. Integrating deep networks with general word-graphs (e.g. WordNet) or knowledge-graphs (e.g. DBPedia) may be able to endow a sense of understanding. Graph-embedding is an active area of research, and work on integrating language-based models and graph models has only recently begun to take off, giving hope for better machine understanding.

  3. Summary of Core Issues
  • Deep learning has generally performed very well, surpassing existing states of the art in many individual core NLP tasks, and has thus created the foundation on which useful natural language applications can be and are being built. However, it is clear from examining the research reviewed here that natural language is an enigmatically complex topic, with myriad core or basic tasks, of which deep learning has only grazed the surface. It is also not clear how architectures for ably executing individual core tasks can be synthesized to build a common edifice, possibly a much more complex distributed neural architecture, that shows competence in multiple or “all” core tasks. More fundamentally, it is not clear how mastery of basic tasks may lead to superior performance in applied tasks, which are the ultimate engineering goals, especially in the context of building effective and efficient deep learning models. Many, if not most, successful deep learning architectures for applied tasks, discussed in the next section, seem to forgo explicit architectural components for core tasks and learn such tasks implicitly. Thus, some researchers argue that the large amount of work on core issues is not fully justified, while others argue that further extensive research in these areas is necessary to better understand and develop systems that perform these tasks more effectively, whether explicitly or implicitly.
  4. Applications of NLP Using Deep Learning
  • While the study of core areas of NLP is important to understanding how neural models work, it is meaningless in and of itself from an engineering perspective, which values applications that benefit humanity, not pure philosophical and scientific inquiry. Current approaches to solving several immediately useful NLP tasks are summarized here. Note that the issues included here are only those involving the processing of text, not the processing of verbal speech. Because speech processing [162], [163] requires expertise on several other topics including acoustic processing, it is generally considered another field of its own, sharing many commonalities with the field of NLP. The number of studies in each discussed area over the last decade is shown in Figure 4.


Fig. 4: Publication Volume for Applied Areas of NLP. All areas of applied natural language processing discussed have witnessed growth in recent years, with the largest growth occurring in the last two to three years.

  • The application areas involve the following topics:

    • Extraction of useful information (e.g. named entities and relations), translation of text between and among languages, summarization of written works, automatic answering of questions by inferring answers, and classification and clustering of documents.

4.1. Information Retrieval

  • The purpose of Information Retrieval (IR) systems is to help people find the right (most useful) information in the right (most convenient) format at the right time (when they need it) [164]. Among many issues in IR, a primary problem that needs addressing pertains to ranking documents with respect to a query string in terms of relevance scores for ad-hoc retrieval tasks, similar to what happens in a search engine.
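
  • A minimal ad-hoc retrieval sketch is shown below: documents are ranked against a query by TF-IDF cosine similarity using scikit-learn. The documents and query are illustrative only; the neural IR models discussed in the surveys learn such relevance scores instead of computing them from term statistics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny document collection and a user query (illustrative only).
documents = [
    "Deep learning improves machine translation quality.",
    "GPUs accelerate the training of neural networks.",
    "Information retrieval ranks documents by relevance to a query.",
]
query = "how are documents ranked for a search query"

# Represent documents and the query as TF-IDF vectors.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```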

4.2. Information Extraction

  • Information extraction extracts explicit or implicit information from text. The outputs of systems vary, but often the extracted data and the relationships within it are saved in relational databases [172]. Commonly extracted information includes named entities and relations, events and their participants, temporal information, and tuples of facts.
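
  • The sketch below shows named-entity extraction with spaCy (assuming the en_core_web_sm model is installed); the extracted entities could then be stored in a relational database, as described above.

```python
import spacy

# Load a small pre-trained English pipeline that includes an NER component.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a startup in San Diego in January 2016 for $200 million.")

# Each entity span carries a label such as ORG, GPE, DATE, or MONEY.
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.label_}")
```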

4.3. Text Generation

  • Many NLP tasks require the generation of human-like language. Summarization and machine translation convert one text to another in a sequence-to-sequence (seq2seq) fashion. Other tasks, such as image and video captioning and automatic weather and sports reporting, convert non-textual data to text. Some tasks, however, produce text without any input data to convert (or with only small amounts used as a topic or guide). These tasks include poetry generation, joke generation, and story generation.
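
  • As a minimal sketch of open-ended text generation, the example below uses the Hugging Face pipeline API with GPT-2; the model choice and prompt are illustrative, not prescribed by the surveys.

```python
from transformers import pipeline

# Load a small pre-trained generative model (downloaded on first use).
generator = pipeline("text-generation", model="gpt2")

# Generate a continuation of the prompt, one token at a time.
result = generator(
    "Once upon a time, in a quiet village,",
    max_new_tokens=40,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```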

4.4. Summarization

  • Summarization finds elements of interest in documents in order to produce an encapsulation of the most important content. There are two primary types of summarization: extractive and abstractive. The first focuses on sentence extraction, simplification, reordering, and concatenation to relay the important information in documents using text taken directly from the documents. Abstractive summaries rely on expressing documents’ contents through generation-style abstraction, possibly using words never seen in the documents [48].
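
  • The toy extractive summarizer below scores sentences by the frequency of the words they contain and keeps the top-scoring ones in document order; it only illustrates the select-and-concatenate idea, whereas the neural extractive models in the literature learn these scores.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences, preserving document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = [
        (sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, reverse=True)[:num_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

print(extractive_summary(
    "Deep learning has transformed NLP. Large models are trained on huge corpora. "
    "These models now power translation, summarization, and question answering. "
    "The weather was pleasant yesterday."
))
```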

4.5. Question Answering

  • Similar to summarization and information extraction, question answering (QA) gathers relevant words, phrases, or sentences from a document. QA returns this information in a coherent fashion in response to a request. Current methods resemble those of summarization.
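
  • A minimal extractive question-answering sketch using the Hugging Face pipeline API is shown below; the default QA model is downloaded automatically, and the context passage is illustrative.

```python
from transformers import pipeline

# Load a pre-trained extractive question-answering model (default checkpoint).
qa = pipeline("question-answering")

context = (
    "Machine translation is the quintessential application of NLP. It uses "
    "mathematical and algorithmic techniques to translate documents from one "
    "language to another."
)
answer = qa(question="What is the quintessential application of NLP?",
            context=context)

# The answer is a span extracted from the context, with a confidence score.
print(answer["answer"], answer["score"])
```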

4.6. Machine Translation

  • Machine translation (MT) is the quintessential application of NLP. It involves the use of mathematical and algorithmic techniques to translate documents from one language to another. Performing effective translation is intrinsically onerous even for humans, requiring proficiency in areas such as morphology, syntax, and semantics, as well as an adept understanding and discernment of cultural sensitivities, for both of the languages (and associated societies) under consideration [48].
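
  • As a sketch of neural machine translation in practice, the example below loads a pre-trained English-to-German MarianMT checkpoint from the Helsinki-NLP OPUS-MT project through the Hugging Face pipeline API (the specific model name is assumed to be available on the model hub).

```python
from transformers import pipeline

# Load a pre-trained English-to-German translation model (downloaded on first use).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation is the quintessential application of NLP.")
print(result[0]["translation_text"])
```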
  5. Summary of Deep Learning NLP Applications
  • Numerous other applications of natural language processing exist including grammar correction, as seen in word processors, and author mimicking, which, given sufficient data, generates text replicating the style of a particular writer. Many of these applications are infrequently used, understudied, or not yet exposed to deep learning. However, the area of sentiment analysis should be noted, as it is becoming increasingly popular and utilizing deep learning. In large part a semantic task, it is the extraction of a writer’s sentiment—their positive, negative, or neutral inclination towards some subject or idea [268]. Applications are varied, including product research, futures prediction, social media analysis, and classification of spam [269], [270]. The current state of the art uses an ensemble including both LSTMs and CNNs [271].
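
  • A minimal sentiment-classification sketch is given below using the default Hugging Face sentiment pipeline; the LSTM/CNN ensemble cited above is not packaged for download, so a generic pre-trained Transformer classifier stands in here purely for illustration.

```python
from transformers import pipeline

# Load a pre-trained sentiment classifier (default checkpoint, downloaded on first use).
classifier = pipeline("sentiment-analysis")

# Returns a label (e.g. POSITIVE or NEGATIVE) and a confidence score.
print(classifier("The new model works remarkably well on noisy social media text."))
```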

  • This section has provided a number of select examples of the applied usages of deep learning in natural language processing. Countless studies have been conducted in these and similar areas, chronicling the ways in which deep learning has facilitated the successful use of natural language in a wide variety of applications. Only a minuscule fraction of such work has been referred to in this survey.

  • While more specific recommendations for practitioners have been discussed in some individual subsections, the current trend in state-of-the-art models in all application areas is to use pre-trained stacks of Transformer units in some configuration, whether in encoder-decoder configurations or just as encoders. Thus, self-attention, which is the mainstay of the Transformer, has become the norm, along with cross-attention between encoder and decoder units, if decoders are present. In fact, in many recent papers, if not most, Transformers have begun to replace the LSTM units that were preponderant just a few months ago. Pre-training of these large Transformer models has also become the accepted way to endow a model with generalized knowledge of language. Models such as BERT, which have been trained on corpora of billions of words, are available for download, thus providing a practitioner with a model that already possesses a great amount of general knowledge of language. A practitioner can further train it with one's own general corpora, if desired, but such training is not always necessary, considering the enormous amount of pre-training that downloaded models have received. To train a model to perform a certain task well, the last step a practitioner must go through is to use available downloadable task-specific corpora, or build one's own task-specific corpus; this last training step is usually supervised. It is also recommended that if several tasks are to be performed, multi-task training be used wherever possible.
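
  • The sketch below illustrates this recommended workflow: download a pre-trained Transformer (here BERT) and attach a task-specific classification head for supervised fine-tuning on one's own labeled corpus; the optimizer, data loading, and training loop are omitted, and the two-label setup is only an example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download a pre-trained BERT encoder and add an untrained classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Tokenize a sentence and run it through the model; fine-tuning on a
# task-specific corpus would update these weights with supervised labels.
inputs = tokenizer(
    "Pre-trained models carry general knowledge of language.",
    return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) before fine-tuning
```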

Reference:

[1] https://arxiv.org/pdf/1807.10854.pdf

[2] https://arxiv.org/pdf/1708.02709.pdf

[3] AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing. https://arxiv.org/pdf/2108.05542.pdf

[4] Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey. https://arxiv.org/pdf/2111.01243.pdf

Appendix:

Encompasses (v) to include different types of things

Underscores (v) to emphasize something, or to show that it is important

Data-driven science is an interdisciplinary field that applies scientific methods to extract knowledge from data

Data-driven learning is a learning approach driven by research-like access to data

Inherently (adv) as a basic or essential feature that gives something its character

Arguably (adv) used for stating your opinion or belief, especially when you think other people may disagree

Versatility (n) the ability to change easily or to be used for different purposes

Emanate (v) to come from a particular place

Implicitly (adv) not stated directly, but expressed in the way that someone behaves, or understood from what they are saying

Corollary (n) something that will also be true if a particular idea or statement is true, or something that will also exist if a particular situation exists
