An End-to-End Long-Text Summarization Model (CAIL 2020 Judicial Summarization Track)

Last update: Jan 08, 2023

SPACES

An end-to-end long-text summarization model (CAIL 2020 judicial summarization track).

Blog post: https://kexue.fm/archives/8046

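Meaning

We call our model SPACES, which happens to be one of the domain names of Scientific Spaces (https://spaces.ac.cn). The letters stand for:

S: Sparse Softmax;
P: Pretrained Language Model;
A: Abstractive;
C: Copy Mechanism;
E: Extractive;
S: Special Words.

As the name suggests, this is a word-level "extract-then-generate" summarization model that uses pretraining and a Copy mechanism, and it incorporates some of our latest research results on text generation.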
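The acronym names Sparse Softmax without defining it in this README. The sketch below assumes it means a softmax truncated to the k largest logits, with every other position set to zero; consult the linked blog post for the authors' actual definition. The function name `sparse_softmax` and the parameter `k` are illustrative and are not taken from the repository.

```python
import math

def sparse_softmax(logits, k):
    """Softmax over the k largest logits; every other position gets probability 0.

    Assumption: this top-k truncation is only one plausible reading of
    "Sparse Softmax"; it is not copied from the SPACES code.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

probs = sparse_softmax([2.0, 1.0, 0.1, -1.0], k=2)
print(probs)  # only the two largest logits receive nonzero probability
```

Setting `k` equal to the number of logits recovers the ordinary softmax, which makes the truncation easy to compare against the dense version.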

Running

Experimental environment: tensorflow 1.14 + keras 2.3.1 + bert4keras 0.9.7

(On Windows, use bert4keras>=0.9.8)
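The Windows note can be turned into a quick pre-flight check. This is a minimal sketch that only encodes the one rule stated above (bert4keras>=0.9.8 on Windows); the helper names are hypothetical, and it assumes purely numeric dotted version strings such as "0.9.7".

```python
def parse_version(version):
    """Turn a dotted version string such as '0.9.7' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def windows_requirement_met(installed, platform_name):
    """Check the README's Windows rule for bert4keras.

    platform_name is expected to look like sys.platform, which starts
    with 'win' on Windows. Other platforms are not constrained here.
    """
    if not platform_name.startswith("win"):
        return True
    return parse_version(installed) >= parse_version("0.9.8")

print(windows_requirement_met("0.9.7", "win32"))
print(windows_requirement_met("0.9.8", "win32"))
```

In practice the installed version could be read from `bert4keras.__version__` and the platform from `sys.platform`; both are left out of the sketch so that it runs without the library installed.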

First, edit the path settings in snippets.py; then run the code below.

Training code:

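#!/bin/bash

# Convert and vectorize the data for the extractive stage.
python extract_convert.py
python extract_vectorize.py

# Run extract_model.py once for each index 0..14.
for ((i=0; i<15; i++)); do
    python extract_model.py "$i"
done

# Convert the data for the seq2seq stage, then run the seq2seq model script.
python seq2seq_convert.py
python seq2seq_model.py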
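On systems without bash (for example a plain Windows shell), the same sequence of steps can be driven from Python. The sketch below is an assumption-laden convenience wrapper, not part of the repository: `training_commands` and `run_training` are hypothetical names, and the script is assumed to be started from the repository root so that the script paths resolve.

```python
import subprocess
import sys

def training_commands(num_runs=15, python=sys.executable):
    """Return the training steps, in order, as argv lists."""
    commands = [
        [python, "extract_convert.py"],
        [python, "extract_vectorize.py"],
    ]
    # extract_model.py is invoked once per index 0 .. num_runs - 1.
    commands += [[python, "extract_model.py", str(i)] for i in range(num_runs)]
    commands += [
        [python, "seq2seq_convert.py"],
        [python, "seq2seq_model.py"],
    ]
    return commands

def run_training(num_runs=15):
    """Run every step in order; check=True stops at the first failure."""
    for argv in training_commands(num_runs):
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_training()` executes the steps one after another. Because `subprocess.run` is called with `check=True`, a failing step raises `CalledProcessError` instead of letting later steps run on incomplete outputs.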

Prediction code:

from final import *

# `text` is the source document to summarize, as a string.
summary = predict(text, topk=3)
print(summary)

Contact

QQ group: 808623966. For the WeChat group, add the bot's WeChat ID: spaces_ac_cn

Links

Blog: https://kexue.fm
Zhuiyi: https://zhuiyi.ai/
Pretrained models: https://github.com/ZhuiyiTechnology/pretrained-models
WoBERT: https://github.com/ZhuiyiTechnology/WoBERT

Owner

苏剑林(Jianlin Su)
