A Simple Guide to Brat, an NLP Annotation Tool
2022-08-11 11:58:00 【sweet and spicy uu】
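Preface
Today I'd like to share a quick introduction to using Brat, an NLP annotation tool.
1. Background
Brat supports annotation tasks such as entity recognition, entity relations, and event extraction. It can also be adapted to annotating data for aspect-based sentiment analysis; see the post in the sentiment analysis series 《利用BRAT进行中文情感分析语料标注》 ("Annotating Chinese sentiment analysis corpora with BRAT") [1].
In addition, Brat can be adapted to Chinese annotation scenarios. Note that the installation environment must be OS X, Linux, or a Linux virtual environment.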
2. Installing and starting Brat
First, download the package 「brat-v1.3_Crunchy_Frog.tar.gz」 from [Brat rapid annotation tool][2], then extract and install it.
Note: extract it into a directory whose path contains no Chinese characters, such as 「dataLabeling」.
Extract
Extract command: tar -xf brat-v1.3_Crunchy_Frog.tar.gz
Install
Install command: ./install.sh -u
During installation you are prompted for a user name, a password, and an email address (used later mainly to log in on the annotation page).
-> dataLabeling ./install.sh -u
Please enter the user name that you want to use when logging into brat
xxx
Please enter a brat password (this shows on screen)
xxx
Please enter the administrator contact email
xxx
For the full installation process, see Installation - brat rapid annotation tool [3], which explains it very clearly.
3. Running Brat
Run command: python standalone.py
Note that the Python version here must be Python 2.
As shown in the figure below, once the server is running, click the URL (default http://127.0.0.1:8001) to open the online annotation page:
Click OK, and you will see Examples and tutorials, the two official sample collections.
To annotate, click brat in the top right corner of the page to log in, and enter the user name and password you set during installation:
4. Annotation configuration and labeling
To adapt Brat to your own annotation task, some configuration is required. The steps are:
Prepare the raw data files
Prepare the configuration file annotation.conf
Configure visual.conf so that the annotation interface displays Chinese labels
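(1) Preparing the raw data files
Depending on the annotation requirements, split the text into files by sentence, paragraph, or document, with one sample per file. Put all samples in one folder, then place that folder under the data directory of the Brat installation path.
Note: the text must be UTF-8 encoded, and each file must be named xxx.txt, where xxx contains only digits or English letters.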
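The preparation step above can be scripted. The following is a minimal Python sketch, not part of Brat itself; the function name write_samples and the purely numeric file names are assumptions made for illustration.

```python
from pathlib import Path


def write_samples(samples, folder):
    """Write each sample to <folder>/<n>.txt as UTF-8 and return the paths.

    `samples` is a list of strings (one sentence, paragraph, or document
    each). Files are named 1.txt, 2.txt, ... so names contain digits only.
    """
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(samples, start=1):
        path = out_dir / f"{index}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, calling write_samples(sentences, "data/event_demo") from the Brat installation directory would produce one file per sentence in the folder used in this article.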
Here we use an event extraction task as the example. All data files are placed in the 「event_demo」 folder under the data directory of the Brat installation path, and each sample file contains one sentence. The figure below shows the text in the file named 2.txt:
In addition, every sample file must have a corresponding empty ann file, which stores the annotation results generated automatically as you annotate. Without an ann file, the sample cannot be opened when you click it on the page.
Generating the ann files is simple. In the data directory, run:
find target_folder_name -name '*.txt' | sed -e 's|\.txt$|.ann|' | xargs touch
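On systems where find, sed, and xargs are not convenient, the same step can be done in Python. This is a minimal sketch under the assumption that the .ann file should sit next to its .txt file with the same base name; the function name create_empty_ann_files is hypothetical.

```python
from pathlib import Path


def create_empty_ann_files(folder):
    """Create an empty .ann next to every .txt under `folder` (recursively).

    Existing .ann files are left untouched, so finished annotations are
    never overwritten. Returns the list of newly created paths.
    """
    created = []
    for txt_path in sorted(Path(folder).rglob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if not ann_path.exists():
            ann_path.touch()
            created.append(ann_path)
    return created
```

Run from the data directory, create_empty_ann_files("event_demo") would cover the folder used in this article.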
(2) Setting up the configuration file
We again use the event extraction task as the example. First, we need to be clear about:
Which events are to be annotated, that is, the event types;
How each event is structured, that is, which event elements/arguments (roles) each event type has;
Which entity types each event argument may belong to;
Whether each event argument is required, optional, or may occur several times.
Once these points are clear, you can write the configuration file.
One point needs explaining: under the standard definition of the event extraction task, event arguments are entities, so we have to specify which entity types each argument of each event type may belong to.
The annotation.conf configuration file goes in the corresponding data folder, here event_demo. Below is a configuration example with 7 entity types: time (Time), location (Loc), organization (Organization), person (Person), job title (Job), number (Number), and sporting event name (Sport-Name).
In addition, 3 event types are configured: political meeting (Political-meeting), earthquake (Earthquake), and win (Win). Specifically:
The arguments of the political meeting event are: time (Time), place (Place), and participants (Participants);
The arguments of the earthquake event are: time, place, magnitude (Layer), focal depth (Distance), number of deaths (Die), and number of injured (Injure);
The arguments of the win event are: time, winner (Winner), loser (Loser), and competition name (Name).
Each argument under each event type also carries an entity type constraint. For example, the entity type of the political meeting's participants argument is <POJ>; combined with the entity configuration, we can see that participants may be any of three entity types: job title, organization, and person. The other arguments work the same way and are not repeated here.
Beyond the arguments themselves, the ?, *, and + suffixes restrict whether an argument is required and how many of it are allowed: in Brat's configuration syntax, ? means optional (at most one), * means any number including zero, and + means at least one.
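Since the configuration figure is not reproduced here, the following Python sketch generates a plausible annotation.conf from the types named above. Only the type names and the <POJ> constraint on Participants come from the article; the other entity type constraints, every cardinality suffix, and the placement of the macro in the [relations] section (modeled on Brat's bundled example configuration) are assumptions to verify against the official documentation.

```python
ENTITIES = ["Time", "Loc", "Organization", "Person", "Job", "Number", "Sport-Name"]

# Macro from the article: participants may be a person, organization, or job.
MACROS = {"<POJ>": ["Person", "Organization", "Job"]}

# Event -> list of (role, cardinality suffix, allowed entity types).
# Assumption: everything except Participants:<POJ> is illustrative.
EVENTS = {
    "Political-meeting": [
        ("Time", "?", "Time"),
        ("Place", "?", "Loc"),
        ("Participants", "+", "<POJ>"),
    ],
    "Earthquake": [
        ("Time", "?", "Time"),
        ("Place", "?", "Loc"),
        ("Layer", "?", "Number"),
        ("Distance", "?", "Number"),
        ("Die", "?", "Number"),
        ("Injure", "?", "Number"),
    ],
    "Win": [
        ("Time", "?", "Time"),
        ("Winner", "", "Organization"),
        ("Loser", "", "Organization"),
        ("Name", "?", "Sport-Name"),
    ],
}


def build_annotation_conf(entities, macros, events):
    """Return the text of an annotation.conf with the four Brat sections."""
    lines = ["[entities]"]
    lines += entities
    lines += ["", "[relations]"]
    lines += [f"{name}={'|'.join(types)}" for name, types in macros.items()]
    lines += ["", "[events]"]
    for event, args in events.items():
        arg_text = ", ".join(f"{role}{suffix}:{types}" for role, suffix, types in args)
        lines.append(f"{event}\t{arg_text}")
    lines += ["", "[attributes]"]
    return "\n".join(lines) + "\n"
```

The returned string can be written to annotation.conf inside the data folder (event_demo in this article); an argument without a suffix, such as Winner here, is required exactly once.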
(3) Annotation
On the annotation page, select the text to annotate and tick the corresponding label in the pop-up box.
Entity annotation
If the selected text is an entity, choose its entity type in the entity type column. For example, 「10月20日」 (October 20) has entity type time, and 「the United Arab Emirates delegation」 has entity type organization.
Trigger word and event type annotation
If the selected text is an event trigger word, choose its event type in the event type column. For example, the trigger word 「会晤」 (meeting) has event type political meeting.
Event argument annotation
Once the entities and event trigger words are annotated, all that remains is to link the entities to the event trigger word. This is easy: drag an arrow from the trigger word to the corresponding entity, then select in the pop-up box the role the entity plays in the event (the event argument). For example, 「the United Arab Emirates delegation」 is the participants argument of the political meeting triggered by 「会晤」.
(4) Chinese label configuration
As you can see, the labels shown on the annotation page in (3) are all in English. To make the tool friendlier for annotators, it is better for the annotation page to display Chinese. So how can we do that?
In practice, we found that displaying Chinese on the annotation page cannot be achieved by changing the configuration file directly, because that triggers all sorts of bugs. Instead, we can configure another file, visual.conf.
As shown in the figure below, we only need to configure the entity types, event types, and role types so that the related entity types, event types, and argument roles are displayed in Chinese in the labels on the page.
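As a rough illustration, the Python sketch below builds a [labels] section for visual.conf in the "type | label" form used by Brat's visual configuration. The mapping is a partial, assumed one: the Chinese label for Political-meeting (政治会晤) in particular is a guess, and the exact contents of the author's figure are not known.

```python
# Type name from annotation.conf -> Chinese display label (partial example).
LABELS = {
    "Time": "时间",
    "Loc": "地点",
    "Organization": "组织机构",
    "Person": "人物",
    "Number": "数字",
    "Earthquake": "地震",
    "Win": "获胜",
    "Political-meeting": "政治会晤",  # assumption: original label not shown
    "Participants": "参与者",
    "Winner": "胜者",
    "Loser": "败者",
}


def build_visual_labels(labels):
    """Return the text of a visual.conf [labels] section, one type per line."""
    lines = ["[labels]"]
    lines += [f"{type_name} | {label}" for type_name, label in labels.items()]
    return "\n".join(lines) + "\n"
```

The result should be saved as UTF-8, next to annotation.conf in the data folder.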
Note that Brat itself does not support Chinese, so you also need to change line 162 of server/src/projectconfig.py to:
n = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]','_', n)
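To see what this patch changes, the sketch below applies the patched pattern and an ASCII-only variant to a few names. The ASCII-only pattern is an assumption standing in for the unpatched line, and the helper name normalize is hypothetical; the point is only that the added range \u4e00-\u9fa5 stops CJK characters from being replaced with underscores.

```python
import re

# Pattern from the patched line: keeps ASCII letters, digits, "<>,_-",
# and CJK characters in the range U+4E00..U+9FA5.
PATCHED = re.compile(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]')

# ASCII-only variant, assumed here to stand in for the unpatched behavior.
ASCII_ONLY = re.compile(r'[^a-zA-Z<>,0-9_-]')


def normalize(name, pattern):
    """Replace every character the pattern rejects with an underscore."""
    return pattern.sub('_', name)
```

With the patched pattern, normalize("时间", PATCHED) leaves the label intact, while the ASCII-only pattern turns it into underscores.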
(5) Annotation
After configuring visual.conf, the annotation page finally looks as follows:
Entity annotation
Trigger word and event type annotation
Event argument annotation
(6) Annotation results
The two figures below show the completed annotations for the earthquake and win events, together with the annotation results that the system generated automatically in the corresponding ann files:
The file 9.txt contains 4 entities:
T1: 10月29日4时52分 - time
T2: Guizhou Bijie City Xianning County - location
T3: 3.2级 - number
T4: 18千米 - number
It contains 1 earthquake event, E1:
Event trigger word: 地震 (T5)
Event arguments:
Time argument: 10月29日4时52分
Place argument: Guizhou Bijie City Xianning County
Magnitude argument: 3.2级
Focal depth argument: 18千米
The file 16.txt contains 7 entities:
T1: 北京时间11月4日凌晨4点 - time
T2: 欧冠 Group B, third round - competition name
T3: 皇马 - organization
T4: 国际米兰 - organization
T5: 4分 - number
T6: 1分 - number
T7: 国米 - organization
It contains 1 win event, E1:
Event trigger word: 击败 (T8)
Event arguments:
Time argument: 北京时间11月4日凌晨4点
Competition name argument: 欧冠 Group B, third round
Winner argument: 皇马
Loser argument: 国际米兰
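The ann files use Brat's tab-separated standoff format, so results like those listed above can be read back programmatically. The following Python sketch parses text-bound (T) and event (E) lines only. The sample is a partial, hypothetical reconstruction of 9.txt's annotations: the character offsets are invented and the place entity is omitted.

```python
# Hypothetical sample: offsets are illustrative, not taken from 9.ann.
SAMPLE_ANN = (
    "T1\tTime 0 11\t10月29日4时52分\n"
    "T3\tNumber 25 29\t3.2级\n"
    "T4\tNumber 35 39\t18千米\n"
    "T5\tEarthquake 20 22\t地震\n"
    "E1\tEarthquake:T5 Time:T1 Layer:T3 Distance:T4\n"
)


def parse_ann(text):
    """Parse T (entity/trigger) and E (event) lines of a Brat .ann file."""
    entities, events = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        ann_id, body = line.split("\t", 1)
        if ann_id.startswith("T"):
            type_and_span, surface = body.split("\t", 1)
            fields = type_and_span.split(" ")
            entities[ann_id] = {
                "type": fields[0],
                "start": int(fields[1]),   # first offset of the span
                "end": int(fields[-1]),    # last offset of the span
                "text": surface,
            }
        elif ann_id.startswith("E"):
            trigger, *args = body.split()
            event_type, trigger_id = trigger.split(":")
            events[ann_id] = {
                "type": event_type,
                "trigger": trigger_id,
                "args": [tuple(arg.split(":")) for arg in args],
            }
    return entities, events


def describe_event(event, entities):
    """Map each role of an event to the text of the entity filling it."""
    return {role: entities[ref]["text"] for role, ref in event["args"]}
```

Other annotation kinds (relations, attributes, notes) are skipped by this sketch.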
5. Annotation and correction demo videos
Overall, Brat is very simple to use. The video below shows a complete annotation process:
(Video, duration 01:06)
Correcting mistakes and deleting annotations are just as simple; see the following video:
(Video, duration 00:33)
Note that to delete an entity that is already linked to an event, you first need to remove the link between the entity and the event, that is, delete the corresponding role arrow from the event, and only then delete the entity.
Summary
Finally, here is a summary of the Brat annotation tool:
Low setup cost: any computer that meets the operating system requirement can install and run it;
Simple to operate: once it is running and the business/annotation requirements are clear, anyone can annotate. Just select the relevant text and tick the corresponding label in the pop-up box;
Multi-task annotation: entities, entity relations, and events can all be annotated at the same time;
Quality assurance: to ensure quality, first configure the tool according to the requirements. You can specify which arguments each event must have and how many of each, which helps prevent incompletely annotated data. Without this, annotators would need to know the task schema, such as the event structure, very well.
Multi-annotator support: Brat also supports annotation by multiple people.