A Simple Guide to Using Brat, an NLP Annotation Tool

2022-08-11 11:58:00 sweet and spicy uu

Contents

Preface

1. Background

2. Installing and starting Brat

3. Running Brat

4. Annotation configuration and labeling

(1) Preparing the raw data files

(2) Configuration file setup

(3) Annotation

(4) Chinese label configuration

(5) Annotation

(6) Annotation results

5. Annotation and correction demo videos

Summary

Preface

Today I would like to share a simple walkthrough of Brat, an NLP annotation tool.

1. Background

Brat supports annotation tasks such as entity recognition, entity relations, and event extraction. It can also be adapted to annotate data for Aspect-Based Sentiment Analysis; see the article from the sentiment analysis series, 《利用BRAT进行中文情感分析语料标注》 (Using BRAT to annotate a Chinese sentiment analysis corpus) [1].

Brat can also be adapted to Chinese annotation scenarios. Note that it must be installed on OS X, Linux, or a Linux virtual environment.

2. Installing and starting Brat

First, download the installation package "brat-v1.3_Crunchy_Frog.tar.gz" from [Brat rapid annotation tool][2], then extract and install it.

Note: extract the package into a directory whose path contains no Chinese characters, such as "dataLabeling".

  1. Extract

Extract command: tar -xf brat-v1.3_Crunchy_Frog.tar.gz

  2. Install

Install command: ./install.sh -u

During installation, follow the prompts to enter a user name, a password, and an email address (these are mainly used later to log in on the annotation page).

-> dataLabeling  ./install.sh -u
Please enter the user name that you want to use when logging into brat
xxx
Please enter a brat password (this shows on screen)
xxx
Please enter the administrator contact email
xxx

For the full installation process, see Installation - brat rapid annotation tool [3], which explains it very clearly.

3. Running Brat

Run command: python standalone.py

Note that the Python version here must be Python 2.

Once the server has started successfully, open the URL it prints (http://127.0.0.1:8001 by default) to enter the online annotation page.

Click OK, and you will see examples and tutorials, the two sets of annotated samples provided officially.

To annotate, click "brat" in the top right corner of the page to log in, then enter the user name and password you set during installation.

4. Annotation configuration and labeling

To adapt Brat to your own annotation task, some configuration is needed. The steps are:

  • Prepare the raw data files

  • Prepare the configuration file annotation.conf

  • Configure visual.conf so that the annotation interface displays Chinese labels

(1) Preparing the raw data files

Depending on the annotation requirements, split the text by sentence, paragraph, or chapter, with one sample per file. Put all the sample files into one folder, then place that folder in the data directory under the Brat installation path.
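The splitting step above can be sketched in Python. This is a minimal illustration, assuming the samples are already available as a list of strings; the function name `write_samples` and the numbered file names are hypothetical choices, not part of Brat.

```python
from pathlib import Path


def write_samples(samples, target_dir):
    """Write each sample to <target_dir>/<n>.txt as UTF-8 and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(samples, start=1):
        # One sample per file; numeric names keep the file names ASCII-only.
        path = target / f"{index}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_samples(sentences, "data/event_demo")` from the Brat installation directory would produce 1.txt, 2.txt, and so on inside the data directory.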

Note: the text encoding must be UTF-8, and each file must be named xxx.txt, where xxx consists only of digits or English letters.
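Both constraints can be checked before loading the files into Brat. The sketch below is one possible check, not something Brat provides; `check_sample` and `NAME_PATTERN` are hypothetical names.

```python
import re
from pathlib import Path

# Digits or English letters, followed by the .txt extension.
NAME_PATTERN = re.compile(r"[A-Za-z0-9]+\.txt")


def check_sample(path):
    """Return a list of problems found for one sample file (empty if none)."""
    path = Path(path)
    problems = []
    if not NAME_PATTERN.fullmatch(path.name):
        problems.append("file name should be digits/English letters plus .txt")
    try:
        # Decoding the raw bytes fails if the file is not valid UTF-8.
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("content is not valid UTF-8")
    return problems
```

Running it over every file in the sample folder and printing the non-empty results gives a quick pre-flight report.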

This article uses an event extraction annotation task as the example. The data files are all placed in the "event_demo" folder inside the data directory under the Brat installation path, and each sample file (for example, 2.txt) contains one sentence.

In addition, each sample file must have a corresponding empty .ann file, which stores the annotation results that Brat generates automatically during annotation. Without the .ann file, the sample cannot be opened when you click it on the page.

Generating the .ann files takes a single command. In the data directory, run:

find <target-folder> -name '*.txt' | sed -e 's|\.txt$|.ann|' | xargs touch
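For readers who prefer not to rely on find, sed, and xargs, the same step can be sketched in Python. The function name `create_empty_ann_files` is hypothetical; the behavior is meant to mirror the shell command, except that an existing .ann file is left untouched rather than having its timestamp updated.

```python
from pathlib import Path


def create_empty_ann_files(folder):
    """Create an empty .ann next to every .txt under folder; return the new paths."""
    created = []
    for txt_path in sorted(Path(folder).rglob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        # Never overwrite an existing .ann, since it may already hold annotations.
        if not ann_path.exists():
            ann_path.touch()
            created.append(ann_path)
    return created
```

For the example in this article, the call would be `create_empty_ann_files("event_demo")`, run from the data directory.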

(2) Configuration file setup

Continuing with the event extraction task, we first need to be clear about:

  • Which events are to be annotated, that is, the event types;

  • How each event is structured, that is, which event elements/arguments (roles) each event type has;

  • Which entity types each event element may belong to;

  • Whether each event element is required or optional, and how many of it are allowed.

Once these points are settled, the configuration file can be written.

One point needs explaining: under the standard definition of the event extraction task, event elements are entities, so we must specify which entity types each element of each event type may belong to.

The annotation.conf file goes in the corresponding data folder, here event_demo. The example configuration defines 7 entity types: time (Time), location (Loc), organization (Organization), person (Person), job title (Job), number (Number), and sports event name (Sport-Name).

It also defines 3 event types: political meeting (Political-meeting), earthquake (Earthquake), and win (Win), where:

  • A political meeting event has the elements time (Time), place (Place), and participants (Participants);

  • An earthquake event has the elements time, place, magnitude (Layer), focal depth (Distance), death toll (Die), and number injured (Injure);

  • A win event has the elements time, winner (Winner), loser (Loser), and event name (Name).

Each element of each event type is also constrained to certain entity types. For example, the Participants element of a political meeting has the entity type <POJ>. Combined with the "or" definition over entity types, this means a participant may be a job title, an organization, or a person. The other elements work the same way and are not repeated here.

In addition, the ?, *, and + markers after an event element specify whether it is required and how many are allowed; see the Brat configuration documentation for their exact interpretation.
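The configuration described above can be approximated as follows. This is a sketch under stated assumptions: the entity types, event types, role names, and the <POJ> constraint on Participants come from the article, while the remaining entity-type constraints and every multiplicity marker are illustrative guesses, since the original configuration file is not reproduced in the text. As far as I understand Brat's format, "?" means zero or one, "*" zero or more, "+" one or more, and no marker exactly one. The helper `build_annotation_conf` is a hypothetical name used only to keep the example self-contained.

```python
# Entity types listed in the article.
ENTITIES = ["Time", "Loc", "Organization", "Person", "Job", "Number", "Sport-Name"]

# Event roles. Role names and the <POJ> constraint on Participants come from
# the article; the other type constraints and all multiplicity markers are
# assumptions made for illustration.
EVENTS = {
    "Political-meeting": ["Time?:Time", "Place?:Loc", "Participants+:<POJ>"],
    "Earthquake": [
        "Time?:Time", "Place?:Loc", "Layer?:Number",
        "Distance?:Number", "Die?:Number", "Injure?:Number",
    ],
    "Win": ["Time?:Time", "Winner:<POJ>", "Loser:<POJ>", "Name?:Sport-Name"],
}


def build_annotation_conf():
    """Return the text of a hypothetical annotation.conf for event_demo."""
    lines = ["[entities]", *ENTITIES, "", "[relations]", "", "[events]"]
    # Macro: <POJ> stands for any one of the three listed entity types.
    lines.append("<POJ>=Person|Organization|Job")
    for event_type, roles in EVENTS.items():
        # One event per line: type, a tab, then comma-separated ROLE:TYPE pairs.
        lines.append(event_type + "\t" + ", ".join(roles))
    lines += ["", "[attributes]"]
    return "\n".join(lines) + "\n"
```

Writing the returned text to data/event_demo/annotation.conf gives a starting point that should be checked against the Brat version in use before annotators rely on it.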

(3) Annotation

On the annotation page, select the text to annotate and tick the matching label in the dialog that pops up.

  • Entity annotation

If the selected text is an entity, choose its type in the entity column. For example, 「10月20日」 (October 20) has the entity type time, and "the United Arab Emirates delegation" has the entity type organization.

  • Trigger word and event type annotation

If the selected text is an event trigger word, choose the event type in the event type column. For example, the trigger word 「会晤」 (meeting) has the event type political meeting.

  • Event element annotation

After annotating the entities and the event trigger words, the remaining step is to link the entities to the trigger words. This is simple: drag an arrow from the trigger word to the corresponding entity, then choose in the dialog the role (event element) that the entity plays in the event. For example, "the United Arab Emirates delegation" is the Participants element of the political meeting triggered by 「会晤」.

(4) Chinese label configuration

As seen in (3), the annotation page displays everything in English. To make it friendlier for annotators, it is better to display Chinese on the annotation page. How can that be done?

In practice, putting Chinese directly into the annotation configuration file does not work, because it triggers various bugs. Instead, we can configure a separate file, visual.conf.

In visual.conf, we only need to configure the entity types, event types, and role types so that the corresponding tags are displayed in Chinese on the page.
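A minimal sketch of the label mapping follows. It assumes the `[labels]` section of visual.conf takes one `TYPE | display label` entry per line, and it covers only the types whose Chinese names appear in this article. The dictionary `LABELS` and the helper `build_visual_conf` are hypothetical names, and other sections such as drawing settings are deliberately left out.

```python
# Type name in annotation.conf -> Chinese label shown on the annotation page.
# Only names whose Chinese form is given in the article are included.
LABELS = {
    "Time": "时间",
    "Loc": "地点",
    "Organization": "组织机构",
    "Person": "人物",
    "Job": "职务",
    "Number": "数字",
    "Earthquake": "地震",
    "Win": "获胜",
    "Participants": "参与者",
    "Layer": "震级",
    "Distance": "震源深度",
    "Winner": "胜者",
    "Loser": "败者",
}


def build_visual_conf(labels):
    """Return the [labels] section of a hypothetical visual.conf."""
    lines = ["[labels]"]
    for type_name, display_label in labels.items():
        lines.append(f"{type_name} | {display_label}")
    return "\n".join(lines) + "\n"
```

The file should be saved as UTF-8 next to annotation.conf, in the same data folder. Keeping the type names in English and moving the Chinese text into labels is what avoids the problems described above.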

Note that Brat itself does not support Chinese here, so you also need to change line 162 of server/src/projectconfig.py to:

n = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]','_', n)
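The effect of the patched line can be seen in isolation. In the sketch below, `PATCHED` is the pattern from the line above, while `ASCII_ONLY` is an assumed stand-in for a pattern without the CJK range, included only for comparison; the helper `normalize` is a hypothetical name.

```python
import re

# Pattern from the patched line: every character that is NOT an ASCII letter,
# a digit, "<", ">", ",", "_", "-", or in the range U+4E00..U+9FA5 is matched.
PATCHED = re.compile(u"[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]")

# Assumed ASCII-only variant, used here purely for comparison.
ASCII_ONLY = re.compile(u"[^a-zA-Z<>,0-9_-]")


def normalize(name, pattern):
    """Replace every character matched by pattern with an underscore."""
    return pattern.sub("_", name)
```

With the patched pattern, `normalize("时间", PATCHED)` returns the label unchanged, whereas the ASCII-only variant turns it into two underscores, which is why Chinese labels would otherwise be lost.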

(5) Annotation

After visual.conf is configured, the annotation page displays Chinese labels for each step:

  • Entity annotation

  • Trigger word and event type annotation

  • Event element annotation

(6) Annotation results

The completed annotations for an earthquake event and a win event, together with the annotations that Brat automatically writes to the corresponding .ann files, are summarized below.

In 9.txt there are 4 entities:

  • T1: 10月29日4时52分 (04:52 on October 29), time

  • T2: Xianning County, Bijie City, Guizhou, location

  • T3: 3.2级 (magnitude 3.2), number

  • T4: 18千米 (18 km), number

There is 1 earthquake event, E1:

  • Event trigger word: 地震 (earthquake) (T5)

  • Event elements:

    • Time element: 10月29日4时52分

    • Place element: Xianning County, Bijie City, Guizhou

    • Magnitude element: 3.2级

    • Focal depth element: 18千米

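In 16.txt there are 7 entities:

  • T1: 北京时间11月4日凌晨4点 (4 a.m. on November 4, Beijing time), time

  • T2: 欧冠B组第三轮 (Champions League Group B, round 3), event name

  • T3: 皇马 (Real Madrid), organization

  • T4: 国际米兰 (Inter Milan), organization

  • T5: 4分 (4 points), number

  • T6: 1分 (1 point), number

  • T7: 国米 (Inter), organization

There is 1 win event, E1:

  • Event trigger word: 击败 (defeat) (T8)

  • Event elements:

    • Time element: 北京时间11月4日凌晨4点

    • Event name element: 欧冠B组第三轮

    • Winner element: 皇马

    • Loser element: 国际米兰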
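The T and E identifiers above correspond to lines in the .ann file. The following sketch reads such lines back, assuming Brat's tab-separated standoff layout (`ID<TAB>Type start end<TAB>text` for text-bound annotations and `ID<TAB>Type:Trigger Role:Arg ...` for events). The sample offsets are made up for illustration and do not come from the article's files; `parse_ann` is a hypothetical helper.

```python
# Illustrative lines only: the character offsets are hypothetical.
SAMPLE = (
    "T1\tTime 0 11\t10月29日4时52分\n"
    "T3\tNumber 30 34\t3.2级\n"
    "T5\tEarthquake 24 26\t地震\n"
    "E1\tEarthquake:T5 Time:T1 Layer:T3\n"
)


def parse_ann(text):
    """Parse text-bound (T) and event (E) lines; other line types are skipped."""
    entities, events = {}, {}
    for line in text.splitlines():
        ann_id, _, rest = line.partition("\t")
        if ann_id.startswith("T"):
            header, _, covered = rest.partition("\t")
            ann_type, _, offsets = header.partition(" ")
            # Offsets are kept as a raw string so discontinuous spans survive.
            entities[ann_id] = {"type": ann_type, "offsets": offsets, "text": covered}
        elif ann_id.startswith("E"):
            parts = rest.split()
            event_type, _, trigger = parts[0].partition(":")
            roles = [tuple(p.split(":", 1)) for p in parts[1:]]
            events[ann_id] = {"type": event_type, "trigger": trigger, "roles": roles}
    return entities, events
```

Note that the trigger word is itself stored as a T line whose type is the event type, which is why the earthquake trigger appears as T5 and the win trigger as T8 in the listings above.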

5. Annotation and correction demo videos

Overall, Brat is very simple to use. A complete annotation pass is demonstrated in a short video (duration 01:06).

Correcting a mistake or deleting an annotation is just as simple, as shown in a second video (duration 00:33).

Note that if the entity you want to delete is already linked to an event, you must first remove the link between the entity and the event (that is, delete the corresponding role arrow), and only then delete the entity.
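The deletion order can also be checked from the .ann file itself: an entity is safe to delete only when no event line refers to it. The sketch below assumes the tab-separated event line layout `ID<TAB>Type:Trigger Role:Arg ...`; the function name `events_referencing` and the sample lines (including their offsets) are hypothetical.

```python
def events_referencing(ann_text, entity_id):
    """Return the IDs of event lines whose trigger or arguments use entity_id."""
    hits = []
    for line in ann_text.splitlines():
        ann_id, _, rest = line.partition("\t")
        if not ann_id.startswith("E"):
            continue
        # Each item looks like "Role:T3"; keep only the part after the colon.
        targets = [item.partition(":")[2] for item in rest.split()]
        if entity_id in targets:
            hits.append(ann_id)
    return hits


# Illustrative annotation text with made-up offsets.
EXAMPLE = (
    "T3\tOrganization 20 22\t皇马\n"
    "T7\tOrganization 40 42\t国米\n"
    "T8\tWin 22 24\t击败\n"
    "E1\tWin:T8 Winner:T3\n"
)
```

Here `events_referencing(EXAMPLE, "T3")` reports E1, so the Winner arrow would have to be removed before T3 is deleted, while T7 is not referenced by any event.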

Summary

To close, here is a summary of Brat as an annotation tool:

  • Low setup cost: any computer with a supported operating system can install and run it;

  • Simple to operate: once it is running and the business/annotation requirements are clear, anyone can annotate. Just select the relevant text and tick the corresponding label in the dialog;

  • Multi-task annotation: entities, entity relations, and events can be annotated at the same time;

  • Quality assurance: to ensure quality, first configure the tool according to the requirements. You can specify which elements each event must have, how many of each are allowed, and which entity types each element accepts, which prevents some incomplete annotations. Without such constraints, annotators would need a thorough understanding of the annotation task's schema, such as the event structures;

  • Multi-user annotation: Brat also supports annotation by multiple people.

Original site

Copyright notice
This article was written by [sweet and spicy uu]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/223/202208111148198562.html