๐Ÿงช Cutting-edge experimental spaCy components and features

Overview

spacy-experimental: Cutting-edge experimental spaCy components and features

This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.

Azure Pipelines pypi Version

Installation

Install with pip:

python -m pip install -U pip setuptools wheel
python -m pip install spacy-experimental

Using spacy-experimental

Components and features may be modified or removed in any release, so always specify the exact version as a package requirement if you're experimenting with a particular component, e.g.:

spacy-experimental==0.147.0

Then you can add the experimental components to your config or import from spacy_experimental:

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"

Components

Edit tree lemmatizer

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
# token attr to use as backoff with the predicted trees are not applicable; null to leave unset
backoff = "orth"
# prune trees that are applied less than this frequency in the training data
min_tree_freq = 2
# whether to overwrite existing lemma annotation
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
# try to apply at most the k most probable edit trees
top_k = 1

Trainable character-based tokenizers

Two trainable tokenizers represent tokenization as a sequence tagging problem over individual characters and use the existing spaCy tagger and NER architectures to perform the tagging.

In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline tokenizer to split each doc into individual characters and the trainable tokenizer is a pipeline component that retokenizes the doc. The pretokenizer needs to be configured manually in the config or with spacy.blank():

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)

The two tokenizers currently reset any existing tag or entity annotation respectively in the process of retokenizing.

Character-based tagger tokenizer

In the tagger version experimental_char_tagger_tokenizer, the tagging problem is represented internally with character-level tags for token start (T), token internal (I), and outside a token (O). This representation comes from Elephant: Sequence Labeling for Word and Sentence Segmentation (Evang et al., 2013).

This is a sentence.
TIIIOTIOTOTIIIIIIIT

With the option annotate_sents, S replaces T for the first token in each sentence and the component predicts both token and sentence boundaries.

This is a sentence.
SIIIOTIOTOTIIIIIIIT

A config excerpt for experimental_char_tagger_tokenizer:

[nlp]
pipeline = ["experimental_char_tagger_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_tagger_tokenizer]
factory = "experimental_char_tagger_tokenizer"
annotate_sents = true
scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}

[components.experimental_char_tagger_tokenizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.experimental_char_tagger_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false

[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2

Character-based NER tokenizer

In the NER version, each character in a token is part of an entity:

T	B-TOKEN
h	I-TOKEN
i	I-TOKEN
s	I-TOKEN
 	O
i	B-TOKEN
s	I-TOKEN
	O
a	B-TOKEN
 	O
s	B-TOKEN
e	I-TOKEN
n	I-TOKEN
t	I-TOKEN
e	I-TOKEN
n	I-TOKEN
c	I-TOKEN
e	I-TOKEN
.	B-TOKEN

A config excerpt for experimental_char_ner_tokenizer:

[nlp]
pipeline = ["experimental_char_ner_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}

[components.experimental_char_ner_tokenizer.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.experimental_char_ner_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_ner_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false

[components.experimental_char_ner_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2

The NER version does not currently support sentence boundaries, but it would be easy to extend using a B-SENT entity type.

Biaffine parser

A biaffine dependency parser, similar to that proposed in [Deep Biaffine Attention for Neural Dependency Parsing](Deep Biaffine Attention for Neural Dependency Parsing) (Dozat & Manning, 2016). The parser consists of two parts: an edge predicter and an edge labeler. For example:

[components.experimental_arc_predicter]
factory = "experimental_arc_predicter"

[components.experimental_arc_labeler]
factory = "experimental_arc_labeler"

The arc predicter requires that a previous component (such as senter) sets sentence boundaries during training. Therefore, such a component must be added to annotating_components:

[training]
annotating_components = ["senter"]

The biaffine parser sample project provides an example biaffine parser pipeline.

Architectures

None currently.

Other

Tokenizers

  • spacy-experimental.char_pretokenizer.v1: Tokenize a text into individual characters.

Scorers

  • spacy-experimental.tokenizer_scorer.v1: Score tokenization.
  • spacy-experimental.tokenizer_senter_scorer.v1: Score tokenization and sentence segmentation.

Bug reports and issues

Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.

Older documentation

See the READMEs in earlier tagged versions for details about components in earlier releases.

Comments
  • Coref Components

    Coref Components

    This is a continuation of https://github.com/explosion/spaCy/pull/7264, since we decided to add the coref components here first. It's still a work in progress.

    enhancement 
    opened by polm 15
  • Add experimental Span Suggesters

    Add experimental Span Suggesters

    This PR adds three new experimental suggester functions for the spancat component and a spaCy project showcasing how to use them in a config.cfg file.

    Subtree Suggester:

    • Uses annotations from the Tagger and Parser to suggests subtrees of individual tokens

    Chunk Suggester:

    • Uses annotations from the Tagger and Parser to suggest noun_chunks

    Sentence Suggester:

    • Uses sentence boundaries to suggest sentences

    These suggesters also come with the ngram functionality which allows users to set a list of sizes for suggesting individual ngrams

    The spaCy project covers:

    • How to source components from existing models
    • How to use frozen_components & annotating_components
    • How to use custom suggester functions registered in the registry
    enhancement 
    opened by thomashacker 12
  • Span Finder Suggester

    Span Finder Suggester

    This PR adds a new experimental component for learning span boundaries and a custom suggester function for spancat. It further adds a spaCy project showcasing how to use the SpanFinder component on 3 different datasets (Healthsea, ToxicSpans, Genia) with 2 configurations (tok2vec & transformer). The project also provides the possibility to train spancat with ngram and compare it to SpanFinder with a custom evaluation script that calculates the performance and overall coverage of the suggester functions.

    Features

    • spaCy project for comparing SpanFinder vs Ngram
    • SpanFinder model
    • SpanFinder component
    • SpanFinder suggester
    • Unit tests for component, model and suggester
    enhancement 
    opened by thomashacker 10
  • Fix handling of small docs in coref

    Fix handling of small docs in coref

    Docs with one or zero tokens fail in the coref component. This doesn't have a fix yet, just a failing test. (There is also a test for the span resolver, which does not fail.)

    bug 
    opened by polm 2
  • Fix issue with resolving final token in SpanResolver

    Fix issue with resolving final token in SpanResolver

    The SpanResolver seems unable to include the final token in a Doc in output spans. It will even produce empty spans instead of doing so.

    This makes changes so that within the model span end indices are treated as inclusive, and converts them back to exclusive when annotating docs. This has been tested to work, though an automated test should be added.

    bug 
    opened by polm 2
  • Make coref entry points work without PyTorch

    Make coref entry points work without PyTorch

    Before this PR, in environments without PyTorch, using spacy experimental can fail due to attempts to load entry points. This change makes it so the types required for class definitions (torch.nn.Module and torch.Tensor) are stubbed to object when torch is not available.

    opened by polm 2
  • Fix device issue with indices in coref

    Fix device issue with indices in coref

    It looks like Torch 1.13.0 has some changes in the way devices are handled and can result in subtle errors in code that worked previously. This explicitly specifies a device in one place, and may resolve https://github.com/explosion/spaCy/issues/11734. For another example of this issue, see https://github.com/pytorch/pytorch/issues/85450.

    The core problem is that a CPU tensor is being indexed using a non-CPU tensor.

    Leaving as a draft in case this doesn't resolve this issue, which might happen if we're doing the same thing somewhere else.

    bug 
    opened by polm 1
  • Add test step before PyTorch is installed

    Add test step before PyTorch is installed

    spacy-experimental is supposed to be safe to load without PyTorch, even if large parts of it aren't functional, but that wasn't checked in tests. This adds a check for that by simply running the tests before installing PyTorch and again afterwards.

    Given the current state of master, which doesn't have #23, this should fail.

    opened by polm 1
  • `Coref`: Optimize `SpanResolver.set_annotations`

    `Coref`: Optimize `SpanResolver.set_annotations`

    Coerce scalar tensors to native Python integers to avoid comparison overhead.

    With the above change (and another optimization to SpanGroups.copy; PR), we see a 90% reduction in execution time of the set_annotation pipeline phase.

    Before

    Screenshot_20220825_154448

    After

    Screenshot_20220825_154626

    opened by shadeMe 0
  • `Coref`: Optimize `create_gold_scores`

    `Coref`: Optimize `create_gold_scores`

    Copy the entire mentions array to the CPU, create tuple keys on-demand, pre-allocate output matrix as Float2d. Also remove unnecessary casts in CoreferenceResolver.update.

    Before

    Screenshot_20220824_153937

    After

    Screenshot_20220824_154002

    opened by shadeMe 0
  • MultiEmbed

    MultiEmbed

    Embedding component that is the deterministic version of MultiHashEmbed i.e.: each token gets mapped to an index unless they are not in the vocabulary in which case they get mapped to a learned unknown vector.

    The mechanism to initialize MultiEmbed is a bit strange. The Model gets created first with dummy Embed layers. Then when init gets called MultiEmbed expects the model.attrs["tables"] to be already set, which provides the mapping from token attributes to indices. During initialization the dummy Embed layers get replaced by ones that adjust their sizes to the number of symbols in the tables.

    A helper callback is provided in set_attr.py that should be placed in the initialize.before_init section in the config. It can be used to set the tables for MultiEmbed.

    Currently the token_map.py is a script that has the structure of the usual spacy init scrips.

    opened by kadarakos 0
  • Support lazy, recursive sentence splitting

    Support lazy, recursive sentence splitting

    We use sentence splitting in the biaffine parser to keep the O(n^2) biaffine attention model tractable. However, since the sentence splitter makes errors, the parser may not have the correct head available.

    This change adds another splitting strategy as the preferred splitting. The goal of this strategy is to split up a Doc into pieces that are as large as possible given a maximum n_max. This reduces the number of attachment errors as a result of incorrect sentence splits, while providing an upper bound on complexity (O(n_max^2)).

    The algorithm works as follows:

    • If the length |d| > max_length:
      • Find the highest-probability split in d according to senter.
      • Split d into d_1 and d_2 using the highest probability split.
      • Recursively apply this algorithm to d_1 and d_2.
    • Otherwise: do nothing

    Note: draft, requires functionality from PR https://github.com/explosion/spaCy/pull/11002, which targets spaCy v4.

    opened by danieldk 0
Releases(v0.6.1)
  • v0.6.1(Nov 4, 2022)

    • Coref: Docs of one or fewer tokens resulted in an error (#28).
    • Coref: resolved spans could not include the final token in a Doc (#27).
    • Biaffine parser: place tensors on the right device (#25).

    This release includes an updated trained pipeline for demonstration purposes. You can install it like this:

    pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl
    

    Downloads (wheel)

    Source code(tar.gz)
    Source code(zip)
    en_coreference_web_trf-3.4.0a2-py3-none-any.whl(467.55 MB)
  • v0.6.0(Sep 28, 2022)

    • new coreference components (#17)

    ~~This release includes an experimental English coref pipeline. You can install the pipeline by downloading it from the assets in this release page, or install it directly with the following command:~~

    Update 2022-11-07: Some issues in the coref implementation have been fixed in the v0.6.1 release of this package. While the pipeline below can still be installed and will be left up for posterity, note that it should only be used with 0.6.0, but by default will pull in the newer version. If you want to use the below package (which is not recommended), be sure to pip install spacy-experimental==0.6.0.

    pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.0/en_coreference_web_trf-3.4.0a0-py3-none-any.whl
    

    For further information about the coref components, see the example project or the API documentation. We'll also be providing more detailed explanations in an upcoming blog post, video, and elsewhere.

    Downloads (wheel)

    Source code(tar.gz)
    Source code(zip)
    en_coreference_web_trf-3.4.0a0-py3-none-any.whl(467.57 MB)
  • v0.5.0(Jun 10, 2022)

    • removed edit tree lemmatizer (#12, it's in core now)
    • biaffine parser updates (#9, #13)
    • add experimental Span Suggesters exploiting parser/tagger/sentence information (#11)
    • add SpanFinder: a new experimental component for learning span boundaries (#10)
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Mar 18, 2022)

Owner
Explosion
A software company specializing in developer tools for Artificial Intelligence and Natural Language Processing
Explosion
Shellcode antivirus evasion framework

Schrodinger's Cat Schrodinger'sCat is a Shellcode antivirus evasion framework Technical principle Please visit my blog https://idiotc4t.com/ How to us

idiotc4t 27 Jul 09, 2022
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
The Easy-to-use Dialogue Response Selection Toolkit for Researchers

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

GMFTBY 32 Nov 13, 2022
A simple version of DeTR

DeTR-Lite A simple version of DeTR Before you enjoy this DeTR-Lite The purpose of this project is to allow you to learn the basic knowledge of DeTR. P

Jianhua Yang 11 Jun 13, 2022
Toward a Visual Concept Vocabulary for GAN Latent Space, ICCV 2021

Toward a Visual Concept Vocabulary for GAN Latent Space Code and data from the ICCV 2021 paper Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Kl

Sarah Schwettmann 13 Dec 23, 2022
kochat

Kochat ์ฑ—๋ด‡ ๋นŒ๋”๋Š” ์„ฑ์— ์•ˆ์ฐจ๊ณ , ์ž์‹ ๋งŒ์˜ ๋”ฅ๋Ÿฌ๋‹ ์ฑ—๋ด‡ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋งŒ๋“œ์‹œ๊ณ  ์‹ถ์œผ์‹ ๊ฐ€์š”? Kochat์„ ์ด์šฉํ•˜๋ฉด ์†์‰ฝ๊ฒŒ ์ž์‹ ๋งŒ์˜ ๋”ฅ๋Ÿฌ๋‹ ์ฑ—๋ด‡ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋นŒ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. # 1. ๋ฐ์ดํ„ฐ์…‹ ๊ฐ์ฒด ์ƒ์„ฑ dataset = Dataset(ood=True) #

1 Oct 25, 2021
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
ConvBERT-Prod

ConvBERT ็›ฎๅฝ• 0. ไป“ๅบ“็ป“ๆž„ 1. ็ฎ€ไป‹ 2. ๆ•ฐๆฎ้›†ๅ’Œๅค็Žฐ็ฒพๅบฆ 3. ๅ‡†ๅค‡ๆ•ฐๆฎไธŽ็Žฏๅขƒ 3.1 ๅ‡†ๅค‡็Žฏๅขƒ 3.2 ๅ‡†ๅค‡ๆ•ฐๆฎ 3.3 ๅ‡†ๅค‡ๆจกๅž‹ 4. ๅผ€ๅง‹ไฝฟ็”จ 4.1 ๆจกๅž‹่ฎญ็ปƒ 4.2 ๆจกๅž‹่ฏ„ไผฐ 4.3 ๆจกๅž‹้ข„ๆต‹ 5. ๆจกๅž‹ๆŽจ็†้ƒจ็ฝฒ 5.1 ๅŸบไบŽInference็š„ๆŽจ็† 5.2 ๅŸบไบŽServ

yujun 7 Apr 08, 2022
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 09, 2022
NLTK Source

Natural Language Toolkit (NLTK) NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting

Natural Language Toolkit 11.4k Jan 04, 2023
Search for documents in a domain through Google. The objective is to extract metadata

MetaFinder - Metadata search through Google _____ __ ___________ .__ .___ / \

Josuรฉ Encinar 85 Dec 16, 2022
Pretrained Japanese BERT models

Pretrained Japanese BERT models This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face. Mod

Inui Laboratory 387 Dec 30, 2022
MRC approach for Aspect-based Sentiment Analysis (ABSA)

B-MRC MRC approach for Aspect-based Sentiment Analysis (ABSA) Paper: Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extracti

Phuc Phan 1 Apr 05, 2022
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 37 Jan 04, 2023
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
A script that automatically creates a branch name using google translation api and jira api

About google translation api์™€ jira api์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž๋™์œผ๋กœ ๋ธŒ๋žœ์น˜ ์ด๋ฆ„์„ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์Šคํฌ๋ฆฝํŠธ Setup ํ™˜๊ฒฝ๋ณ€์ˆ˜์— ๋‹ค์Œ 3๊ฐ€์ง€๋ฅผ ๋“ฑ๋กํ•ด์•ผ ํ•œ๋‹ค. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge Correlation Explanation (CorEx) is a topic model that yields rich topics tha

Greg Ver Steeg 592 Dec 18, 2022
A raytrace framework using taichi language

ti-raytrace The code use Taichi programming language Current implement acceleration lvbh disney brdf How to run First config your anaconda workspace,

่•‰ๅคช็‹ผ 73 Dec 11, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. ๊น€์œ ๋นˆ : ๋ชจ๋ธ๋ง / ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ / ํ”„๋กœ์ ํŠธ ์„ค๊ณ„ / back-end ๊น€์ข…์œค : ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ / ํ”„๋กœ์ ํŠธ ์„ค๊ณ„ / front-end Quick Start C

Eu-Bin KIM 94 Dec 08, 2022
Implementation of TTS with combination of Tacotron2 and HiFi-GAN

Tacotron2-HiFiGAN-master Implementation of TTS with combination of Tacotron2 and HiFi-GAN for Mandarin TTS. Inference In order to inference, we need t

SunLu Z 7 Nov 11, 2022