This repository contains all the source code needed for the project: An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning.

Overview

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning


Outline :

Examination assessment by educational institutions is an essential process, since it is one of the fundamental steps in determining a student’s progress and achievement in a given subject or course. To meet learning objectives, the questions must be drawn from the topics that the students have mastered. Generating examination questions from a large amount of available text material presents some complications. Given the length of current textbooks, manually writing good-quality, well-balanced questions is a slow and time-consuming task for faculty. As a result, faculty rely on the cognitive domain of Bloom’s taxonomy, a popular framework for assessing students’ intellectual abilities. The primary goal of this research paper is therefore to demonstrate an effective pipeline for generating questions from a given text corpus using deep learning. We also employ various neural network architectures to classify questions into the levels of the cognitive domain of Bloom’s taxonomy, in order to judge the complexity and specificity of those questions. The findings of this study show that the proposed pipeline is effective both at generating questions that are comparable to manually annotated questions and at classifying questions from multiple domains according to Bloom’s taxonomy.

Main Proposed Pipeline Layout :

Used Datasets

  • SQuAD 2.0 Dataset - Used in the Question Generation Module. Released in 2018, it has over 150,000 question-answer pairs.

  • Dataset introduced by Yahya et al. (2012) - Used in the Question Classification Module. It consists of around 600 open-ended questions covering a wide variety of questions belonging to the different levels of the cognitive domain. The original dataset required some basic pre-processing and was then manually converted into a dataframe. Check out the main paper cited here.

  • Quora Question Pairs Dataset - Used in the case study on computing semantic similarity between the questions generated by the T5 Transformer and the manually annotated questions from the survey form.

Question Generation Module:

The dataset used for question generation is SQuAD (the Stanford Question Answering Dataset) 2.0. SQuAD 2.0 is an extension of the original SQuAD v1.1, which was published in 2016 by Stanford University.

In this paper, we implemented a T5 Transformer, fine-tuned with PyTorch Lightning on the SQuAD 2.0 dataset. T5 is essentially an encoder-decoder model that casts every NLP problem into a text-to-text format.
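To make the text-to-text idea concrete, the sketch below packs an answer context, a passage, and a target question into a pair of plain strings, which is the shape of data an encoder-decoder model such as T5 is fine-tuned on. The `answer:` / `context:` prefixes and the names `QGExample` and `build_qg_example` are hypothetical and chosen for illustration; the prompt layout used in this repository may differ.

```python
from dataclasses import dataclass


@dataclass
class QGExample:
    source: str  # text fed to the encoder
    target: str  # text the decoder is trained to produce


def _collapse_whitespace(text: str) -> str:
    # Passages copied from books often contain line breaks and double spaces.
    return " ".join(text.split())


def build_qg_example(answer: str, context: str, question: str) -> QGExample:
    """Pack an (answer, context, question) triple into text-to-text form."""
    # Assumed prompt layout: "answer: ... context: ..." (illustrative only).
    source = (
        f"answer: {_collapse_whitespace(answer)} "
        f"context: {_collapse_whitespace(context)}"
    )
    return QGExample(source=source, target=_collapse_whitespace(question))


example = build_qg_example(
    answer="Greek Word",
    context="The word protein is derived from Greek word, “proteios” "
            "which means primary or of prime importance.",
    question="What does the word protein come from?",
)
print(example.source)
print(example.target)
```

Both the input and the output are ordinary strings, so the same model interface can serve other tasks by changing only the prefixes.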

Table 1

| Passage | Answer Context |
| --- | --- |
| The term health is very frequently used by everybody. How do we define it? Health does not simply mean "absence of disease" or "physical fitness". It could be defined as a state of complete physical, mental and social well-being. When people are healthy, they are more efficient at work. This increases productivity and brings economic prosperity. Health also increases longevity of people and reduces infant and maternal mortality. When the functioning of one or more organs or systems of the body is adversely affected, characterized by appearance of various signs and symptoms, we say that we are not healthy, i.e., we have a disease. Diseases can be broadly grouped into infectious and non-infectious. Diseases which are easily transmitted from one person to another, are called infectious diseases. | Easily transmitted from one person to another |
| Proteins are the most abundant biomolecules of the living system. Chief sources of proteins are milk, cheese, pulses, peanuts, fish, meat, etc. They occur in every part of the body and form the fundamental basis of structure and functions of life. They are also required for growth and maintenance of the body. The word protein is derived from Greek word, “proteios” which means primary or of prime importance. | Greek Word |

Table 1 shows the passages that we fed into the model and the answer contexts for which we want questions to be generated. We took these passages from various high-school-level books.

Table 2

| Answer Context | Easily transmitted from one person to another | Greek Word |
| --- | --- | --- |
| Questions Generated | How are infectious diseases defined? | What does the word protein come from? |
| Questions Received | What do you mean by infectious disease? | What is "proteios"? From which language was it derived from? |

As shown in Table 2, the Questions Generated row lists the questions our model generated for each answer context. The Questions Received row lists the questions we obtained by circulating a survey that contained the same passages and answer contexts.

Results

After training, we observed a steady decrease in training loss (Fig. 3). The validation loss fluctuated, as shown in Fig. 4. Because of limited computational resources, we could train for only a limited amount of time, which explains the fluctuations in validation loss.

  • Training Loss = 0.070
  • Validation Loss = 2.39

Question Classification Module :

A deep learning-based model for multi-class classification that takes a text as input and assigns it to one of the categories in the cognitive domain of Bloom's taxonomy.
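As a rough illustration of the final step of such a classifier, the sketch below turns raw output scores into a predicted category with a softmax followed by an argmax. It assumes the six levels of the original cognitive domain (Knowledge through Evaluation) in that order; the label set, the label order, and the example scores are assumptions for illustration and are not taken from this repository.

```python
import math

# Assumed label set: the six levels of the original cognitive domain.
BLOOM_LEVELS = [
    "Knowledge",
    "Comprehension",
    "Application",
    "Analysis",
    "Synthesis",
    "Evaluation",
]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(logits):
    """Return the most probable level and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return BLOOM_LEVELS[best], probs[best]


# Made-up scores for a single question, one score per level.
level, confidence = predict_level([0.2, 0.1, 2.5, 0.3, -1.0, 0.0])
print(level, round(confidence, 3))
```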

Dataset Used : Yahya et al. (2012)

Model Pipeline :

Model Architecture :

Results :

Summarised Evaluation :

| S.No | Model | Optimizer | Accuracy (%) | Loss | Dropout |
| --- | --- | --- | --- | --- | --- |
| 1 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam | 80.83 | 0.6842 | |
| 2 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop | 80.00 | 1.50 | |
| 3 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam with ClipNorm=1.25 | 83.33 | 0.86 | |
| 4 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop with ClipNorm=1.25 | 79.17 | 2.10 | |
| 5 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam | 86.67 | 0.59 | Recurrent Dropout=0.1 |
| 6 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop | 78.83 | 2.54 | Recurrent Dropout=0.1 |
| 7 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam with ClipNorm=1.25 | 85.83 | 0.56 | Recurrent Dropout=0.1 |
| 8 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop with ClipNorm=1.25 | 75.83 | 0.76 | Recurrent Dropout=0.1 |
| 9 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 100-D | Adam with ClipNorm=1.25 | 73.33 | 1.28 | |
| 10 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 300-D | Adam with ClipNorm=1.25 | 75.83 | 0.88 | |
| 11 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 100-D | RMSprop with ClipNorm=1.25 | 73.33 | 2.31 | |
| 12 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 300-D | RMSprop with ClipNorm=1.25 | 80.00 | 1.12 | |
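Several rows in the table use ClipNorm=1.25. Assuming this refers to norm-based gradient clipping (as exposed by the `clipnorm` argument of Keras optimizers), the idea can be sketched in plain Python: a gradient whose L2 norm exceeds the threshold is rescaled to have exactly that norm, and smaller gradients are left unchanged. The function name `clip_by_norm` and the example gradient are illustrative only.

```python
import math


def clip_by_norm(gradient, clip_norm=1.25):
    """Rescale a gradient vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        # Small gradients pass through unchanged.
        return list(gradient)
    scale = clip_norm / norm
    # Direction is preserved; only the magnitude shrinks.
    return [g * scale for g in gradient]


# A gradient with norm 5.0 is scaled down to norm 1.25.
clipped = clip_by_norm([3.0, 4.0])
print(clipped)
```

Because only the magnitude changes, clipping limits the size of any single update without altering its direction.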

The best performance was exhibited by the following neural network : ConvNet 1D with 2 Bidirectional LSTM layers, trained with the Adam optimizer and recurrent dropout = 0.1 as the regulariser.

The following results were obtained :

  • Accuracy : 86.67 %
  • Loss : 0.59

Accuracy vs Loss Plot :

Siamese Neural Network for Computing Sentence Similarity – A Case Study :

After a thorough analysis of the outputs (the questions generated by the proposed model), we carried out a case study to evaluate how semantically similar the generated questions are to manually annotated questions. For this evaluation, we used a pipeline based on Siamese neural networks. The aim of the study was to gain insight into the effectiveness of our proposed pipeline: how well the model generates questions compared with manual annotation, which requires considerably more effort and time.
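The defining property of a Siamese setup is that the same encoder processes both inputs before their encodings are compared. The toy sketch below illustrates only that structure: it uses a bag-of-words encoder and cosine similarity as stand-ins, which are assumptions for illustration and not the trained network or the similarity function used in this project, so its scores are not comparable to the ones reported here.

```python
import math
import re
from collections import Counter


def encode(sentence):
    """Toy shared encoder: lowercase bag-of-words counts.

    This is a stand-in for a trained neural encoder.
    """
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def siamese_score(question_a, question_b):
    # The same encoder (shared "weights") is applied to both questions.
    return cosine_similarity(encode(question_a), encode(question_b))


score = siamese_score(
    "How are infectious diseases defined?",
    "Define infectious disease.",
)
print(round(score, 4))
```

A word-overlap encoder treats "diseases" and "disease" as unrelated tokens, which is one reason a learned encoder is preferable for judging semantic similarity.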

Model Architecture :

| Generated Questions | Manually Annotated Questions | Context | Similarity Score |
| --- | --- | --- | --- |
| Why is health more efficient at work? | How does health affect efficiency at work? | Increases Productivity And Brings Economic Prosperity | 0.4464 |
| What is the health of people more efficient at work? | What are the outcomes of being more efficient at work as a result of good health? | Increases Productivity And Brings Economic Prosperity | 0.4811 |
| What is the term infectious disease? | What do you mean by infectious disease? | Easily Transmitted From One Person To Another | 0.3505 |
| How are infectious diseases defined? | Define infectious disease. | Easily Transmitted From One Person To Another | 0.2489 |
| According to classical electromagnetic theory, an accelerating charged particle does what ? | According to electromagnetic theory what happens when a charged particle accelerates ? | Emits Radiation In The Form Of Electromagnetic Waves | 0.2074 |
| What does the theory of an accelerating charged particle imply ? | What does the classical electromagnetic theory state ? | Emits Radiation In The Form Of Electromagnetic Waves | 0.0474 |
| What was the Harappans's strategy of sending expeditions to ? | What was the primary reason for settlements and expeditions as seen from Harappans's ? | Strategy For Procuring Raw Materials | 0.4222 |
| What was the idea behind sending expeditions to Rajasthan ? | Why did the Harappans's send expeditions to areas in Rajasthan ? | Strategy For Procuring Raw Materials | 0.6870 |
| What was a feature of the Ganeshwar culture ? | What was the distinctive feature of the Ganeshwar culture ? | Non-Harappan Pottery | 0.6439 |
| What type of artefacts are from the Ganeshwar culture ? | What kind of artefacts are from Ganeshwar culture ? | Non-Harappan Pottery | 0.4309 |
| Proteins form the basis of what? | What is the significance of proteins ? | Function Of Life | 0.1907 |
| What are proteins the fundamental basis of ? | What does protein form along with fundamental basis of structure ? | Function Of Life | 0.1775 |

The table above is a sample from the set of recorded observations evaluated by our network. It reports the similarity score between each question generated by the transformer and the corresponding manually annotated question from the survey. In this sample the scores range from 0.0474 to 0.6870.

Accuracy vs Loss Plot :

Owner
Rohan Mathur
3rd Year Undergrad | Data Science Enthusiast