CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

Overview

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985

赛题描述详见:https://www.datafountain.cn/competitions/474

文件说明

data: 存放训练数据和测试数据以及预处理代码

model_bert.py: 网络模型结构定义

adv_train.py: 对抗训练代码

run_bert_pse_adv.py: 运行bert-wwm + 对抗训练 + 伪标签模型

run_roberta_cls_pse_reinit_adv.py: 运行roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签模型

个人方案

我的baseline是将query和answer拼接后传入预训练好的bert进行特征提取,之后将提取的特征传入一个全连接层,最后接一个softmax进行分类。

其中尝试的预训练模型有bert(谷歌),bert_wwm(哈工大版本),roberta_large(哈工大版本),xlneternie等,其中效果较好的有bert-wwm和roberta-large。之后在baseline的基础上进行了各种尝试,主要尝试有以下:

模型 线上F1
bert-wwm 0.78
bert-wwm + 对抗训练 0.783
bert-wwm + 对抗训练 + 伪标签 0.7879
roberta-large 0.774
roberta-large + reinit + 对抗训练 0.786
roberta-large + reinit+对抗训练 + 伪标签 0.7871
roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签 0.7879

对抗训练

其基本的原理呢,就是通过添加扰动构造一些对抗样本,放给模型去训练,以攻为守,提高模型在遇到对抗样本时的鲁棒性,同时一定程度也能提高模型的表现和泛化能力。

参考链接:https://zhuanlan.zhihu.com/p/91269728

伪标签

将测试数据和预测结果进行拼接,之后当成训练数据传入到模型中重新进行训练。为了减少对训练数据的原始分布的影响并增加伪标签的置信度,我只在五个采用不同预训练模型的baseline预测一致的数据中采样了6000条测试数据加入到训练集进行训练。

重新初始化

参考链接:如何让Bert在finetune小数据集时更“稳”一点 https://zhuanlan.zhihu.com/p/148720604

大致思想是靠近底部的层(靠近input)学到的是比较通用的语义方面的信息,比如词性、词法等语言学知识,而靠近顶部的层会倾向于学习到接近下游任务的知识,对于预训练来说就是类似masked word prediction、next sentence prediction任务的相关知识。当使用bert预训练模型finetune其他下游任务(比如序列标注)时,如果下游任务与预训练任务差异较大,那么bert顶层的权重所拥有的知识反而会拖累整体的finetune进程,使得模型在finetune初期产生训练不稳定的问题。

因此,我们可以在finetune时,只保留接近底部的bert权重,对于靠近顶部的层的权重,可以重新随机初始化,从头开始学习。

在本次比赛中,我只对最后roberta-large的最后五层进行重新初始化。在实验中,我发现对于bert,重新初始化会降低效果,而roberta-large则有提升。

bert 不同embedding和cls组合

思路主要是参考 CCF BDCI 2019 互联网新闻情感分析 复赛top1解决方案

参考链接:https://github.com/cxy229/BDCI2019-SENTIMENT-CLASSIFICATION

即对bert不同embedding进行组合后传入全连接层进行分类。该方案尝试时间较晚,只实验last2embedding_cls这种组合,结果也确实有提升。

模型融合

对于单模,我采用五折交叉验证,对每一个单模的五个模型结果,我尝试了相加融合和投票的方式,结果是融合相加的线上f1较高

对于不同模型,我也只是采用的相加融合的方式(由于时间问题没有尝试投票和stacking的方式)。最后a榜效果最好的是bert-wwm + 对抗训练 + 伪标签、roberta-large + reinit+对抗训练 + 伪标签、roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签 三个模型的融合,线上F1有 0.7908 , 排名47;B榜我尝试只对两个效果最好的模型进行融合,即 bert-wwm + 对抗训练 + 伪标签last2embedding_cls + reinit + 对抗训练 + 伪标签,最终F1为0.80,排名72。

总结

本次参加比赛完全是数据挖掘课程要求,也是我第一次参加大数据比赛。因为我的研究方向是图像,所以基本可以说是从零开始,写这个github只是想记录一下这一个月自己从零开始的参赛经历,也希望对同样参加类似比赛的新人有帮助。最后,希望看到了顺手给star,万分感谢。

Owner
shuo
shuo
This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Splinter This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to

Ori Ram 88 Dec 31, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields

Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields [project page][paper][cite] Geometry-Consistent Neural Shape Represe

Yifan Wang 100 Dec 19, 2022
Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Bethge Lab 61 Dec 21, 2022
TruthfulQA: Measuring How Models Imitate Human Falsehoods

TruthfulQA: Measuring How Models Imitate Human Falsehoods

69 Dec 25, 2022
Chinese Named Entity Recognization (BiLSTM with PyTorch)

BiLSTM-CRF for Name Entity Recognition PyTorch version A PyTorch implemention of Bi-LSTM-CRF model for Chinese Named Entity Recognition. 使用 PyTorch 实现

5 Jun 01, 2022
Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention

Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention April 6, 2021 We extended segment-means to compute landmarks without requiri

Zhanpeng Zeng 322 Jan 01, 2023
Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Visualize, analyze, and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BER

Jay Alammar 1.6k Dec 25, 2022
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

Facebook Research 9.7k Jan 09, 2023
A high-level yet extensible library for fast language model tuning via automatic prompt search

ruPrompts ruPrompts is a high-level yet extensible library for fast language model tuning via automatic prompt search, featuring integration with Hugg

Sber AI 37 Dec 07, 2022
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

justCatTheFish 25 Dec 27, 2022
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Lewi Uberg 55 Nov 17, 2022
Adversarial Examples for Extreme Multilabel Text Classification

Adversarial Examples for Extreme Multilabel Text Classification The code is adapted from the source codes of BERT-ATTACK [1], APLC_XLNet [2], and Atte

1 May 14, 2022
Chinese Grammatical Error Diagnosis

nlp-CGED Chinese Grammatical Error Diagnosis 中文语法纠错研究 基于序列标注的方法 所需环境 Python==3.6 tensorflow==1.14.0 keras==2.3.1 bert4keras==0.10.6 笔者使用了开源的bert4keras

12 Nov 25, 2022
In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

Aymen Berriche 8 Apr 13, 2022
Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP

Stat4ML Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP This is the first course from our trio courses: Statistics Foundatio

Omid Safarzadeh 83 Dec 29, 2022
This is a really simple text-to-speech app made with python and tkinter.

Tkinter Text-to-Speech App by Souvik Roy This is a really simple tkinter app which converts the text you have entered into a speech. It is created wit

Souvik Roy 1 Dec 21, 2021
Fine-tuning scripts for evaluating transformer-based models on KLEJ benchmark.

The KLEJ Benchmark Baselines The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language und

Allegro Tech 17 Oct 18, 2022