A Domain Specific Language (DSL) for building language patterns. These can later be compiled into spaCy patterns, pure regex, or any other format.

Overview

RITA DSL

RITA is a language, loosely based on Apache UIMA RUTA, for writing manual language rules. It compiles into either spaCy-compatible patterns or pure regex. These patterns can be used for manual NER as well as in other processes, such as retokenizing and pure matching.

Install

pip install rita-dsl

Simple Rules example

rules = """
cuts = {"fitted", "wide-cut"}
lengths = {"short", "long", "calf-length", "knee-length"}
fabric_types = {"soft", "airy", "crinkled"}
fabrics = {"velour", "chiffon", "knit", "woven", "stretch"}

{IN_LIST(cuts)?, IN_LIST(lengths), WORD("dress")}->MARK("DRESS_TYPE")
{IN_LIST(lengths), IN_LIST(cuts), WORD("dress")}->MARK("DRESS_TYPE")
{IN_LIST(fabric_types)?, IN_LIST(fabrics)}->MARK("DRESS_FABRIC")
"""

Loading in spaCy

import spacy
from rita.shortcuts import setup_spacy


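# spaCy v3 no longer supports the "en" shortcut; this assumes en_core_web_sm is installed
nlp = spacy.load("en_core_web_sm")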
setup_spacy(nlp, rules_string=rules)

And using it:

>>> r = nlp("She was wearing a short wide-cut dress")
>>> [{"label": e.label_, "text": e.text} for e in r.ents]
[{'label': 'DRESS_TYPE', 'text': 'short wide-cut dress'}]

Loading using Regex (standalone)

import rita

patterns = rita.compile_string(rules, use_engine="standalone")

And using it:

>>> list(patterns.execute("She was wearing a short wide-cut dress"))
[{'end': 38, 'label': 'DRESS_TYPE', 'start': 18, 'text': 'short wide-cut dress'}]
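
To give a feel for what "compiling to pure regex" means, here is a minimal hand-written sketch of the DRESS_FABRIC rule using only Python's `re` module. It is an illustration under assumptions, not RITA's generated output: the helper names `alternation` and `execute` are hypothetical, and the real standalone engine may build different expressions.

```python
import re

fabric_types = ["soft", "airy", "crinkled"]
fabrics = ["velour", "chiffon", "knit", "woven", "stretch"]


def alternation(words):
    # Build a non-capturing group that matches any one word from the list.
    # Longer words go first so a longer entry is tried before a shorter one.
    ordered = sorted(words, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in ordered) + ")"


# {IN_LIST(fabric_types)?, IN_LIST(fabrics)} -> optional first word, required second word
DRESS_FABRIC = re.compile(
    r"\b(?:" + alternation(fabric_types) + r"\s+)?" + alternation(fabrics) + r"\b",
    re.IGNORECASE,
)


def execute(text):
    # Yield one dict per match, shaped like the standalone results shown above.
    for m in DRESS_FABRIC.finditer(text):
        yield {"start": m.start(), "end": m.end(), "label": "DRESS_FABRIC", "text": m.group(0)}


results = list(execute("An airy chiffon dress"))
print(results)
```

The optional group mirrors the `?` operator in the rule, and the word boundaries keep a list entry from matching inside a longer word.
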
Comments
  • Jetbrains RITA Plugin not compatible with PyCharm 2020.2.1

    Plugin Version: 1.2 https://plugins.jetbrains.com/plugin/15011-rita-language/versions/

    Tested Version: PyCharm 2020.2.1 (Professional Edition)

    Installing the plugin from disk fails with an error.

    The plugin site https://plugins.jetbrains.com/plugin/15011-rita-language/versions/ says that the plugin is incompatible with all IntelliJ-based IDEs in version 2020.2:

    The list of supported products was determined by dependencies defined in the plugin.xml:

    • Android Studio: build 201.7223 to 201.*
    • DataGrip: 2020.1.3 to 2020.1.5
    • IntelliJ IDEA Ultimate: 2020.1.1 to 2020.1.4
    • Rider: 2020.1.3
    • PyCharm Professional: 2020.1.1 to 2020.1.4
    • PyCharm Community: 2020.1.1 to 2020.1.4
    • PhpStorm: 2020.1.1 to 2020.1.4
    • IntelliJ IDEA Educational: 2020.1.1 to 2020.1.2
    • CLion: 2020.1.1 to 2020.1.3
    • PyCharm Educational: 2020.1.1 to 2020.1.2
    • GoLand: 2020.1.1 to 2020.1.4
    • AppCode: 2020.1.2 to 2020.1.6
    • RubyMine: 2020.1.1 to 2020.1.4
    • MPS: 2020.1.1 to 2020.1.4
    • IntelliJ IDEA Community: 2020.1.1 to 2020.1.4
    • WebStorm: 2020.1.1 to 2020.1.4

    opened by rolandmueller 3
  • IN_LIST ignores OP quantifier

    Somehow I get this unexpected behaviour when using OP quantifiers (?, *, +, etc) with the IN_LIST element:

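    import rita

    rules = """
    list_elements = {"one", "two"}
    {IN_LIST(list_elements)?}->MARK("LABEL")
    """
    # Compile the rules; the result is kept in its own variable so `rules` is not overwritten
    compiled = rita.compile_string(rules)
    # The expected value is a spaCy-style pattern that keeps the "?" operator
    expected_result = "[{'label': 'LABEL', 'pattern': [{'LOWER': {'REGEX': '^(one|two)$'}, 'OP': '?'}]}]"
    print("expected_result:", expected_result)
    print("result:", compiled)
    assert str(compiled) == expected_result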
    

    Version: 0.5.0

    bug 
    opened by rolandmueller 3
  • Add module regex

    This feature would introduce the REGEX element as a module.

    Matches words based on a Regex pattern e.g. all words that start with an 'a' would be REGEX("^a")

    !IMPORT("rita.modules.regex")
    
    {REGEX("^a")}->MARK("TAGGED_MATCH")
    
    opened by rolandmueller 2
  • Feature/pluralize

    Adds a new module with a PLURALIZE tag. For a noun or a list of nouns, it matches both the singular and the plural form. Usage for a single word, e.g.:

    PLURALIZE("car")
    

    Usage for lists, e.g.:

    vehicles = {"car", "bicycle", "ship"}
    PLURALIZE(vehicles)
    

    This works even with the regex engine, or when spaCy's lemmatizer makes an error. It depends on the Python inflect package: https://pypi.org/project/inflect/

    opened by rolandmueller 2
  • Feature/regex tag

    This feature would introduce the TAG element as a module. It needs a new parser for the spaCy translation step. It would allow more flexible matching on detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").

    opened by rolandmueller 2
  • Feature/improve robustness

    In general: measure how long compilation takes and avoid situations where a pattern creates an infinite loop (this can happen when using regex).

    Closes: https://github.com/zaibacu/rita-dsl/issues/78

    opened by zaibacu 1
  • Add TAG_WORD macro to Tag module

    This feature would introduce the TAG_WORD element to the Tag module

    TAG_WORD is for generating TAG patterns with a word or a list.

    E.g. match "proposed" only when it is used as a verb in the sentence (and not as an adjective):

    !IMPORT("rita.modules.tag")
    
    TAG_WORD("^VB", "proposed")
    

    Or, e.g., match a list of words only when they are verbs:

    !IMPORT("rita.modules.tag")
    
    words = {"perceived", "proposed"}
    {TAG_WORD("^VB", words)}->MARK("LABEL")
    
    opened by rolandmueller 1
  • Add Orth module

    This feature would introduce the ORTH element as a module.

    ORTH ignores the case-insensitive configuration and checks words exactly as written, so matching is case-sensitive even when the configuration is case-insensitive. This is especially useful for acronyms and proper names.

    Works only with spaCy engine

    Usage:

    !IMPORT("rita.modules.orth")
    
    {ORTH("IEEE")}->MARK("TAGGED_MATCH")
    
    opened by rolandmueller 1
  • Add configuration for implicit hyphen characters between words

    Add a new configuration option, implicit_hyphon (default: false), that automatically adds hyphen characters (-) to the rules. Enabling implicit_hyphon disables implicit_punct. Rationale: implicit_punct is often too inclusive. It does cover the hyphen token, but (at least in my use case) it also adds unwanted tokens, such as parentheses, to the matches, especially for more complex rules. implicit_hyphon is therefore a stricter alternative to implicit_punct.

    opened by rolandmueller 1
  • Fix sequential optional

    Closes https://github.com/zaibacu/rita-dsl/issues/69

    It turns out this is a bug related to the - character, which in most cases is used as a splitter but in this case appears as a standalone word.

    opened by zaibacu 1
  • Method to validate syntax

    Currently it can be partially done:

    from rita.parser import RitaParser
    from rita.config import SessionConfig
    config = SessionConfig()
    p = RitaParser(config)
    p.build()
    result = p.parse(rules)
    if result is None:
        raise RuntimeError("... Something is wrong with syntax")
    

    But it would be nice to have a single method for that, and to get actual error information.

    enhancement 
    opened by zaibacu 0
  • Dynamic case sensitivity for Standalone Engine

    We want to be able to make a specific word inside a pattern case-sensitive while the rest of the pattern stays case-insensitive.

    It looks like this can be achieved with the regex feature of inline modifier groups, which requires Python 3.6+.

    enhancement 
    opened by zaibacu 0
  • JS rule engine

    It should work similarly to the standalone engine, and may even inherit most of it, but it should produce valid JavaScript code: preferably a single function that takes raw text and returns the multiple parsed entities.

    enhancement help wanted 
    opened by zaibacu 0
  • Allow LOAD macro to load from external locations

    Currently, the LOAD(file_name) macro searches for a text file in the current path.

    Reading from a local file is usually best, but it would be nice to be able to pass something like a GitHub Gist URL and load everything we need from there. This would be very useful for the demo page.

    good first issue 
    opened by zaibacu 0
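
The "Dynamic case sensitivity for Standalone Engine" request above mentions inline modifier groups. As a rough illustration of that regex feature in plain Python (3.6+), the sketch below keeps one word case-sensitive inside an otherwise case-insensitive pattern. The pattern and the `find` helper are made up for this example and are not taken from RITA's code.

```python
import re

# The whole pattern is case-insensitive, except the acronym: it is wrapped in a
# scoped inline modifier group, (?-i:...), which switches IGNORECASE off locally.
pattern = re.compile(r"\b(?-i:IEEE)\s+conference\b", re.IGNORECASE)


def find(text):
    # Return the matched text, or None when nothing matches.
    m = pattern.search(text)
    return m.group(0) if m else None


print(find("Accepted at the IEEE Conference"))  # acronym as written, rest in any case
print(find("accepted at the ieee conference"))  # lower-case acronym is rejected
```

The first call prints the matched span and the second prints `None`, because only the text outside the modifier group ignores case.
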
Releases(0.7.0)
  • 0.7.0(Feb 2, 2021)

    0.7.0 (2021-02-02)


    Features

    • standalone engine now will return submatches list containing start and end for each part of match #93

    • Partially covered https://github.com/zaibacu/rita-dsl/issues/70

      Allow nested patterns, like:

          num_with_fractions = {NUM, WORD("-")?, IN_LIST(fractions)}
          complex_number = {NUM|PATTERN(num_with_fractions)}
    
          {PATTERN(complex_number)}->MARK("NUMBER")
    

    #95

    • Submatches for rita-rust engine #96

    • Regex module which lets you specify a word pattern, e.g. REGEX("^a") means the word must start with the letter "a"

      Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #101

    • ORTH module which lets you specify a case-sensitive entry while the rest of the rules ignore case. Used for acronyms and proper names

      Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #102

    • Additional macro for the tag module, which makes it possible to tag a specific word or list of words

      Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #103

    • Added names module which generates variations of person names #105

    • spaCy v3 Support #109

    Fix

    • Optimizations for Rust Engine

      • No need for passing text forward and backward, we can calculate from text[start:end]

      • Grouping and sorting logic can be done in binary code #88

    • Fix NUM parsing bug #90

    • Switch from (^\s) to \b when doing IN_LIST. Should solve several corner cases #91

    • Fix floating point number matching #92

    • revert #91 changes. Keep old way for word boundary #94

  • 0.6.0(Aug 29, 2020)

    0.6.0 (2020-08-29)


    Features

    • Implemented ability to alias macros, eg.:
          numbers = {"one", "two", "three"}
          @alias IN_LIST IL
    
          IL(numbers) -> MARK("NUMBER")
    

    Now using "IL" will actually call "IN_LIST" macro. #66

    • Introduced the TAG element as a module. It needs a new parser for the spaCy translation step. It allows more flexible matching on detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").

      Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #81

    • Added a new module with a PLURALIZE tag. For a noun or a list of nouns, it matches both the singular and the plural form.

      Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #82

    • Added a new configuration option, implicit_hyphon (default: false), that automatically adds hyphen characters (-) to the rules.

      Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #84

    • Allow passing a custom regex implementation. By default, re is used #86

    • An interface to be able to use rust engine.

      In general it is identical to the standalone engine, but it differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost. It is proprietary because there are various caveats: the engine itself is a bit more fragile and needs to be tuned for a very specific case (e.g. a few long texts with many matches vs. many short texts with few matches). #87

    Fix

    • Fix a bug with the - character when it is used as a standalone word #71
    • Fix regex matching, when shortest word is selected from IN_LIST #72
    • Fix IN_LIST regex so that it wouldn't take part of word #75
    • Fix IN_LIST operation bug - it was ignoring them #77
    • Use list branching only when using spaCy Engine #80
  • 0.5.0(Jun 18, 2020)

    Features

    • Added the PREFIX macro, which attaches a word in front of list items or words #47

    • Variables can now be passed directly when calling compile and compile_string #51

    • Rules can now be compiled (and later loaded) with the rita CLI while using the standalone engine (spaCy is already supported) #53

    • Added the ability to import rule files into a rule file. Recursive import is supported as well. #55

    • Added possibility to define pattern as a variable and reuse it in other patterns:

      Example:

    ComplexNumber = {NUM+, WORD("/")?, NUM?}
    
    {PATTERN(ComplexNumber), WORD("inches"), WORD("Height")}->MARK("HEIGHT")
    {PATTERN(ComplexNumber), WORD("inches"), WORD("Width")}->MARK("WIDTH")
    

    #64

    Fix

    • Fix issue with multiple wildcard words using standalone engine #46
    • Don't crash when no rules are provided #50
    • Fix Number and ANY-OF parsing #59
    • Allow escape characters inside LITERAL #62
  • 0.4.0(Jan 25, 2020)

    0.4.0 (2020-01-25)


    Features

    • Support for deaccent. In general, if an accented version of a word is given, both the deaccented and the accented form are used for matching. To turn it off: !CONFIG("deaccent", "N") #38
    • Added shortcuts module to simplify injecting into spaCy #42

    Fix

    • Fix an issue with spaCy rules that use IN_LIST in case-sensitive mode. It was creating a regex pattern which is not a valid spaCy pattern #40
  • 0.3.2(Dec 19, 2019)

    Features

      • Introduced towncrier to track changes
      • Added linter flake8
      • Refactored code to match pep8 #32

    Fix

      • Fix WORD split by -

      • Split by (empty space) as well

      • Coverage score increase #35

  • 0.3.0(Dec 14, 2019)

    Now there's one global config and child config created per-session (one session = one rule file compilation). Imports and variables are stored in this config as well.

    Remove context argument from MACROS, making code cleaner and easier to read

  • 0.2.2(Dec 8, 2019)

    Features up to this point:

    • Standalone parser - can use internal regex rather than spaCy if you need to
    • Ability to do a logical OR in a rule, e.g. {WORD(w1)|WORD(w2),WORD(w3)} would result in two rules: {WORD(w1),WORD(w3)} and {WORD(w2),WORD(w3)}
    • Exclude operator: {WORD(w1), WORD(w2)!} would match w1 followed by anything but w2
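
The logical OR feature listed under 0.2.2 expands one rule with alternatives into several plain rules. A minimal sketch of that expansion, assuming a rule is represented as a list of slots where each slot holds its alternatives (a hypothetical representation, not RITA's internal one), can be written with a Cartesian product:

```python
from itertools import product


def expand_alternatives(rule):
    # `rule` is a list of slots; each slot is a list of alternative elements.
    # Picking one alternative per slot, in every combination, gives the plain rules.
    return [list(combo) for combo in product(*rule)]


# {WORD(w1)|WORD(w2),WORD(w3)}: the first slot has two alternatives, the second has one
rule = [["WORD(w1)", "WORD(w2)"], ["WORD(w3)"]]
expanded = expand_alternatives(rule)

for r in expanded:
    print("{" + ",".join(r) + "}")
```

This prints the two rules named in the release note. With more OR slots, the number of generated rules grows multiplicatively, which is worth keeping in mind for rules with many alternatives.
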
Owner
Šarūnas Navickas
Data Engineer @ TokenMill. Doing BJJ @ Voras-Bjj. Dad @ Home.