Understanding the SCIERC Corpus Format
2022-08-08 06:24:00 【hithithithithit】
1. Inspecting the corpus
{"clusters": [[[17, 20], [23, 23]]], "sentences": [["English", "is", "shown", "to", "be", "trans-context-free", "on", "the", "basis", "of", "coordinations", "of", "the", "respectively", "type", "that", "involve", "strictly", "syntactic", "cross-serial", "agreement", "."], ["The", "agreement", "in", "question", "involves", "number", "in", "nouns", "and", "reflexive", "pronouns", "and", "is", "syntactic", "rather", "than", "semantic", "in", "nature", "because", "grammatical", "number", "in", "English", ",", "like", "grammatical", "gender", "in", "languages", "such", "as", "French", ",", "is", "partly", "arbitrary", "."], ["The", "formal", "proof", ",", "which", "makes", "crucial", "use", "of", "the", "Interchange", "Lemma", "of", "Ogden", "et", "al.", ",", "is", "so", "constructed", "as", "to", "be", "valid", "even", "if", "English", "is", "presumed", "to", "contain", "grammatical", "sentences", "in", "which", "respectively", "operates", "across", "a", "pair", "of", "coordinate", "phrases", "one", "of", "whose", "members", "has", "fewer", "conjuncts", "than", "the", "other", ";", "it", "thus", "goes", "through", "whatever", "the", "facts", "may", "be", "regarding", "constructions", "with", "unequal", "numbers", "of", "conjuncts", "in", "the", "scope", "of", "respectively", ",", "whereas", "other", "arguments", "have", "foundered", "on", "this", "problem", "."]], "ner": [[[0, 0, "Material"], [10, 10, "OtherScientificTerm"], [17, 20, "OtherScientificTerm"]], [[23, 23, "Generic"], [29, 29, "OtherScientificTerm"], [31, 32, "OtherScientificTerm"], [42, 43, "OtherScientificTerm"], [45, 45, "Material"], [48, 49, "OtherScientificTerm"], [51, 51, "Material"], [54, 54, "Material"]], [[70, 71, "Method"], [86, 86, "Material"]]], "relations": [[], [[29, 29, 31, 32, "CONJUNCTION"], [48, 49, 51, 51, "FEATURE-OF"], [54, 54, 51, 51, "HYPONYM-OF"]], []], "doc_key": "J87-1003"}
2. Printing the corpus with the following code
import json

# train.json holds one JSON document per line.
with open('scierc_data/processed_data/json/train.json') as f:
    gold_docs = [json.loads(line) for line in f]

# print(gold_docs)
for doc in gold_docs:
    print('Clusters:', doc['clusters'])
    print('Sentences:', doc['sentences'])
    print('Entities:', doc['ner'])
    print('Relations:', doc['relations'])
    print('Document key:', doc['doc_key'])
    # Flatten the per-sentence token lists into one document-level token list.
    all_sentences = []
    for sentence in doc['sentences']:
        all_sentences += sentence
    break  # only inspect the first document

Output:
Clusters: [[[17, 20], [23, 23]]]
Sentences: [['English', 'is', 'shown', 'to', 'be', 'trans-context-free', 'on', 'the', 'basis', 'of', 'coordinations', 'of', 'the', 'respectively', 'type', 'that', 'involve', 'strictly', 'syntactic', 'cross-serial', 'agreement', '.'], ['The', 'agreement', 'in', 'question', 'involves', 'number', 'in', 'nouns', 'and', 'reflexive', 'pronouns', 'and', 'is', 'syntactic', 'rather', 'than', 'semantic', 'in', 'nature', 'because', 'grammatical', 'number', 'in', 'English', ',', 'like', 'grammatical', 'gender', 'in', 'languages', 'such', 'as', 'French', ',', 'is', 'partly', 'arbitrary', '.'], ['The', 'formal', 'proof', ',', 'which', 'makes', 'crucial', 'use', 'of', 'the', 'Interchange', 'Lemma', 'of', 'Ogden', 'et', 'al.', ',', 'is', 'so', 'constructed', 'as', 'to', 'be', 'valid', 'even', 'if', 'English', 'is', 'presumed', 'to', 'contain', 'grammatical', 'sentences', 'in', 'which', 'respectively', 'operates', 'across', 'a', 'pair', 'of', 'coordinate', 'phrases', 'one', 'of', 'whose', 'members', 'has', 'fewer', 'conjuncts', 'than', 'the', 'other', ';', 'it', 'thus', 'goes', 'through', 'whatever', 'the', 'facts', 'may', 'be', 'regarding', 'constructions', 'with', 'unequal', 'numbers', 'of', 'conjuncts', 'in', 'the', 'scope', 'of', 'respectively', ',', 'whereas', 'other', 'arguments', 'have', 'foundered', 'on', 'this', 'problem', '.']]
Entities: [[[0, 0, 'Material'], [10, 10, 'OtherScientificTerm'], [17, 20, 'OtherScientificTerm']], [[23, 23, 'Generic'], [29, 29, 'OtherScientificTerm'], [31, 32, 'OtherScientificTerm'], [42, 43, 'OtherScientificTerm'], [45, 45, 'Material'], [48, 49, 'OtherScientificTerm'], [51, 51, 'Material'], [54, 54, 'Material']], [[70, 71, 'Method'], [86, 86, 'Material']]]
Relations: [[], [[29, 29, 31, 32, 'CONJUNCTION'], [48, 49, 51, 51, 'FEATURE-OF'], [54, 54, 51, 51, 'HYPONYM-OF']], []]
Document key: J87-1003
3. Interpretation
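The printed output shows the following. The sentences field holds the document's sentences as a list, with one token list per sentence. Each entry in the entity field (ner) has three parts, (start position, end position, entity type), and locates an entity in the document. In this sample the positions are token offsets counted across the whole document rather than within each sentence, and the end position is inclusive, so [17, 20] covers "strictly syntactic cross-serial agreement". Each entry in the relations field has five parts: (subject start, subject end, object start, object end, relation type). Both the entity list and the relation list are grouped per sentence, in the same order as the sentences.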
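As a minimal sketch of how these positions can be resolved back to text, the snippet below flattens the sentences into one token list and looks up each span. It assumes, as the sample suggests, that positions are document-level token offsets with an inclusive end. The helper names `span_text` and `decode_doc` are hypothetical, and `doc` is a trimmed copy of the first two sentences of J87-1003.

```python
def span_text(tokens, start, end):
    """Return the text of tokens[start..end], with end inclusive."""
    return " ".join(tokens[start:end + 1])


def decode_doc(doc):
    """Resolve the entity and relation spans of one document to text."""
    # Spans index into the flattened, document-level token list.
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    entities = [
        (span_text(tokens, start, end), label)
        for sent_ner in doc["ner"]
        for start, end, label in sent_ner
    ]
    relations = [
        (span_text(tokens, s1, e1), label, span_text(tokens, s2, e2))
        for sent_rel in doc["relations"]
        for s1, e1, s2, e2, label in sent_rel
    ]
    return entities, relations


# Trimmed sample: only the first two sentences of the document shown earlier.
s1 = ("English is shown to be trans-context-free on the basis of "
      "coordinations of the respectively type that involve strictly "
      "syntactic cross-serial agreement .")
s2 = ("The agreement in question involves number in nouns and reflexive "
      "pronouns and is syntactic rather than semantic in nature because "
      "grammatical number in English , like grammatical gender in "
      "languages such as French , is partly arbitrary .")
doc = {
    "sentences": [s1.split(), s2.split()],
    "ner": [
        [[0, 0, "Material"], [10, 10, "OtherScientificTerm"],
         [17, 20, "OtherScientificTerm"]],
        [[23, 23, "Generic"], [29, 29, "OtherScientificTerm"],
         [31, 32, "OtherScientificTerm"], [42, 43, "OtherScientificTerm"],
         [45, 45, "Material"], [48, 49, "OtherScientificTerm"],
         [51, 51, "Material"], [54, 54, "Material"]],
    ],
    "relations": [
        [],
        [[29, 29, 31, 32, "CONJUNCTION"], [48, 49, 51, 51, "FEATURE-OF"],
         [54, 54, 51, 51, "HYPONYM-OF"]],
    ],
}

entities, relations = decode_doc(doc)
for head, label, tail in relations:
    print(head, label, tail)
# nouns CONJUNCTION reflexive pronouns
# grammatical gender FEATURE-OF languages
# French HYPONYM-OF languages
```

Flattening first keeps the lookup simple; if per-sentence offsets are needed instead, subtracting the sentence's starting offset from each position would be one way to get them.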
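Finally, the clusters field records the different positions at which mentions of the same entity appear in the document, each given as (start position, end position). In this sample, [17, 20] and [23, 23] form one cluster.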
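The cluster spans can be read the same way. The sketch below is illustrative only: `cluster_mentions` is a hypothetical helper, the positions are again assumed to be document-level token offsets with an inclusive end, and the second sentence of the sample is cut to its first four tokens to keep the example short.

```python
def cluster_mentions(doc):
    """Return, for each cluster, the text of every mention span in it."""
    # Cluster spans index into the flattened, document-level token list.
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    return [
        [" ".join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in doc["clusters"]
    ]


# Trimmed sample: the full first sentence plus the start of the second.
first = ("English is shown to be trans-context-free on the basis of "
         "coordinations of the respectively type that involve strictly "
         "syntactic cross-serial agreement .")
doc = {
    "clusters": [[[17, 20], [23, 23]]],
    "sentences": [first.split(), ["The", "agreement", "in", "question"]],
}

mentions = cluster_mentions(doc)
print(mentions)
# [['strictly syntactic cross-serial agreement', 'agreement']]
```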