A practical and feature-rich paraphrasing framework to augment human intents in text form, for building robust NLU models for conversational engines. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Overview


Parrot

Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training of NLU models. A paraphrase framework is more than just a paraphrasing model.

Table of contents

  • Why Parrot?
  • Getting started
  • Scope
  • What makes a paraphraser a good augmentor for NLU? (Details)
  • Dataset for paraphrase model
  • Power of Augmentation - Metrics and Comparison
  • Current Features
  • Roadmap
  • Current Limitations/Known issues
  • References
  • Citation

Why Parrot?

Huggingface lists 16 paraphrase generation models (as of this writing), RapidAPI lists 7 freemium and commercial paraphrasers like QuillBot, Rasa has discussed an experimental paraphraser for augmenting text data here, Sentence-Transformers offers a paraphrase mining utility, and NLPAug offers word-level augmentation with PPDB (a multi-million paraphrase database). While these attempts at paraphrasing are great, there are still some gaps and paraphrasing is NOT yet a mainstream option for text augmentation in building NLU models. Parrot is a humble attempt to fill some of these gaps.

What is a good paraphrase? Almost all conditioned text generation models are validated on two factors: (1) whether the generated text conveys the same meaning as the original context (Adequacy), and (2) whether the text is fluent / grammatically correct English (Fluency). For instance, Neural Machine Translation outputs are tested for Adequacy and Fluency. But a good paraphrase should be adequate and fluent while being as different as possible from the original on the surface (lexical) form. With respect to this definition, the three key metrics that measure the quality of paraphrases are:

  • Adequacy (Is the meaning preserved adequately?)
  • Fluency (Is the paraphrase fluent English?)
  • Diversity (Lexical / Phrasal / Syntactical) (How much has the paraphrase changed the original sentence?)

Parrot offers knobs to control Adequacy, Fluency and Diversity as per your needs.

What makes a paraphraser a good augmentor? To train an NLU model we don't just need a lot of utterances; we need utterances annotated with intents and slots/entities. The typical flow is:

  • Given an input utterance + input annotations a good augmentor spits out N output paraphrases while preserving the intent and slots.
  • The output paraphrases are then converted into annotated data using the input annotations that we got in step 1.
  • The annotated data created out of the output paraphrases then makes the training dataset for your NLU model.

But a paraphraser, being a generative model, doesn't in general guarantee that the slots/entities are preserved. So the ability to generate high-quality paraphrases in a constrained fashion, without trading off the intents and slots for lexical dissimilarity, is what makes a paraphraser a good augmentor. A minimal sketch of this flow follows; more on this in the Details section below.
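Here is a minimal sketch of that flow (the slot check below is an illustration, not Parrot's internal logic; it assumes parrot has been initialised as in the Quickstart below):

def augment_with_slots(parrot, utterance, slot_values):
  """Generate paraphrases and keep only those that preserve every slot value."""
  para_phrases = parrot.augment(input_phrase=utterance, use_gpu=False)
  if para_phrases is None:  # nothing cleared the adequacy/fluency filters
    return []
  kept = []
  for para_phrase in para_phrases:
    # some Parrot versions return (phrase, diversity_score) tuples; take the text
    text = para_phrase[0] if isinstance(para_phrase, tuple) else para_phrase
    # a paraphrase is only useful for augmentation if every slot value survived
    if all(value.lower() in text for value in slot_values):
      kept.append(text)
  return kept

# e.g. augment_with_slots(parrot, "book a flight from charlotte to las vegas",
#                         ["charlotte", "las vegas"])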

Getting started

Install

pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

Quickstart

from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

''' 
uncomment to get reproducible paraphrase generations
def random_state(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

random_state(1234)
'''

#Init models (make sure you init ONLY once if you integrate this into your code)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")

phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
  print("-"*100)
  print("Input_phrase: ", phrase)
  print("-"*100)
  para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False)
  for para_phrase in para_phrases:
    print(para_phrase)
Sample output:

----------------------------------------------------------------------
Input_phrase: Can you recommend some upscale restaurants in Newyork?
----------------------------------------------------------------------
list some excellent restaurants to visit in new york city?
what upscale restaurants do you recommend in new york?
i want to try some upscale restaurants in new york?
recommend some upscale restaurants in newyork?
can you recommend some high end restaurants in newyork?
can you recommend some upscale restaurants in new york?
can you recommend some upscale restaurants in newyork?
----------------------------------------------------------------------
Input_phrase: What are the famous places we should not miss in Russia?
----------------------------------------------------------------------
what should we not miss when visiting russia?
recommend some of the best places to visit in russia?
list some of the best places to visit in russia?
can you list the top places to visit in russia?
show the places that we should not miss in russia?
list some famous places which we should not miss in russia?
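Note: augment can return None when no candidate clears the adequacy/fluency filters (see the TypeError: 'NoneType' object is not iterable comment further below), so it is safer to guard the loop:

para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False)
if para_phrases is None:
  # no paraphrase survived the adequacy/fluency filters for this phrase
  print("No paraphrases generated for:", phrase)
else:
  for para_phrase in para_phrases:
    print(para_phrase)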

Getting syntactic and phrasal diversity/variety in your paraphrases?

You can play with the do_diverse knob (check out the next section for more knobs). Consider this example, first with do_diverse = False (the default):

------------------------------------------------------------------------------
Input_phrase: How are the new Macbook Pros with M1 chips?
------------------------------------------------------------------------------
'how do you rate the new macbook pros? '
'how are the new macbook pros? '
'how is the new macbook pro doing with new chips? '
'how do you like the new macbook pro m1 chip? '
'what is the use of the new macbook pro m1 chips? '

And with do_diverse = True:

------------------------------------------------------------------------------
Input_phrase: How are the new Macbook Pros with M1 chips?
------------------------------------------------------------------------------
'what do you think about the new macbook pro m1? '
'how is the new macbook pro m1? '
'how are the new macbook pros? '
'what do you think about the new macbook pro m1 chips? '
'how good is the new macbook pro m1 chips? '
'how is the new macbook pro m1 chip? '
'do you like the new macbook pro m1 chips? '
'how are the new macbook pros with m1 chips? '

Other Knobs

 para_phrases = parrot.augment(input_phrase=phrase,
                               use_gpu=False,                  # set True to run the models on GPU
                               diversity_ranker="levenshtein", # strategy used to rank paraphrases by diversity
                               do_diverse=False,               # set True for more syntactic/phrasal variety
                               max_return_phrases=10,          # maximum number of paraphrases to return
                               max_length=32,                  # maximum length of the input/output text
                               adequacy_threshold=0.99,        # minimum adequacy score a paraphrase must clear
                               fluency_threshold=0.90)         # minimum fluency score a paraphrase must clear
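As a rule of thumb (an inference from how the filters behave, not an official guideline): raising adequacy_threshold and fluency_threshold yields fewer but more faithful and fluent paraphrases, while lowering them trades quality for volume. A looser, higher-recall configuration might look like this (illustrative values):

para_phrases = parrot.augment(input_phrase=phrase,
                              use_gpu=False,
                              do_diverse=True,          # favour more varied surface forms
                              max_return_phrases=20,    # ask for more candidates
                              adequacy_threshold=0.90,  # accept slightly less faithful paraphrases
                              fluency_threshold=0.80)   # accept slightly less fluent paraphrases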

Scope

In the space of conversational engines, knowledge bots are the ones we ask questions like "When was the Berlin wall torn down?", transactional bots are the ones we give commands like "Turn on the music please", and voice assistants are the ones that can both answer questions and act on our commands. Parrot mainly focuses on augmenting text typed into or spoken to conversational interfaces, for building robust NLU models. (People usually neither type out nor yell out long paragraphs to conversational interfaces; hence the pre-trained model is trained on text samples with a maximum length of 32.)

While Parrot predominantly aims to be a text augmentor for building good NLU models, it can also be used as a pure-play paraphraser.

What makes a paraphraser a good augmentor for NLU? (Details)

To enable automatic training data generation, a paraphraser needs to keep the slots intact. The end-to-end process can then take input utterances, augment them, and convert them into an NLU training format such as Goo et al. or the Rasa format (as shown below). The data generation process needs to look for the same slots in the output paraphrases to derive their start and end positions (as shown in the JSON below).

Ideally, the above process needs a UI like the one below to collect input utterances along with annotations (intents, slots and slot types), which can then be augmented and converted into training data.

Sample NLU data (Rasa format)

{
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "i would like to find a flight from charlotte to las vegas that makes a stop in st. louis",
                "intent": "flight",
                "entities": [
                    {
                        "start": 35,
                        "end": 44,
                        "value": "charlotte",
                        "entity": "fromloc.city_name"
                    },
                    {
                        "start": 48,
                        "end": 57,
                        "value": "las vegas",
                        "entity": "toloc.city_name"
                    },
                    {
                        "start": 79,
                        "end": 88,
                        "value": "st. louis",
                        "entity": "stoploc.city_name"
                    }
                ]
            },
            ...
        ]
    }
}
  • Original: I would like a list of round trip flights between indianapolis and orlando florida for the 27th
  • Paraphrase useful for augmenting: what are the round trip flights between indianapolis and orlando for the 27th
  • Paraphrase not-so-useful for augmenting: what are the round trip flights between chicago and orlando for the 27th.
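As an illustration of that conversion step, here is a minimal sketch (make_example is a hypothetical helper, not part of the library; it assumes Parrot's lowercased output and drops paraphrases that lose a slot value, like the "chicago" example above):

def make_example(paraphrase, intent, slot_values):
  """Convert one paraphrase into a Rasa-style common_example.

  slot_values maps entity names to surface values taken from the input
  annotations, e.g. {"fromloc.city_name": "indianapolis", ...}.
  Returns None when a slot value is missing, since such a paraphrase
  is not useful for augmenting.
  """
  entities = []
  for entity, value in slot_values.items():
    start = paraphrase.find(value.lower())
    if start == -1:  # slot dropped or altered by the paraphraser
      return None
    entities.append({"start": start,
                     "end": start + len(value),
                     "value": value,
                     "entity": entity})
  return {"text": paraphrase, "intent": intent, "entities": entities}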

Dataset for paraphrase model

The following datasets were analysed, and the paraphrase generation model prithivida/parrot_paraphraser_on_T5 has been fine-tuned on some of them.

Power of Augmentation - Metrics and Comparison

Intent Classification task:

Experimental setup: From each dataset, an increasing number of random utterances per intent were taken to form the raw training data. The same data was then augmented with the Parrot paraphraser N times (where N = 10 or 15 depending on the dataset) to form the augmented training data. Models were trained on both the raw data and the augmented data to compare performance. Since this is a multiclass classification task, weighted F1 was used as the metric. The experiment was repeated 4 times for each number of utterances and the F1 scores were averaged to remove randomness from the trend. I have used 6 prominent NLU datasets from across domains. The charts below reveal that with a "very modest" number of utterances plus paraphrase augmentation we can achieve good classification performance on day 1. "Very modest" means 4 to 6 utterances per intent for some datasets and 5 to 7 for others. The scoring step is sketched below.
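A minimal sketch of the scoring step, assuming scikit-learn (the labels are toy values, not results from the experiment):

from sklearn.metrics import f1_score

# toy intent labels for a held-out test set
y_true = ["flight", "flight", "ground_service", "airfare", "flight"]
y_pred = ["flight", "airfare", "ground_service", "airfare", "flight"]

# weighted F1 averages per-intent F1 scores, weighted by intent frequency
print(f1_score(y_true, y_pred, average="weighted"))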

Semantic slot-filling task:

TBD

Current Features

TBD

Roadmap

TBD

Current Limitations/Known issues

  • The diversity scores are not normalised; each of the diversity rankers scores paraphrases differently (see the illustration below)
  • Some command style input phrases generate less adequate paraphrases
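To illustrate the first point, here is a toy sketch of edit-distance based ranking (assuming the Levenshtein package; Parrot's own ranker implementation may differ). The raw distances are unnormalised character counts that grow with sentence length, so they are not directly comparable to scores from other rankers:

import Levenshtein

input_phrase = "can you recommend some upscale restaurants in new york?"
para_phrases = ["what upscale restaurants do you recommend in new york?",
                "recommend some upscale restaurants in new york?"]

# larger edit distance from the input = more "diverse" under this ranker
for p in sorted(para_phrases,
                key=lambda p: Levenshtein.distance(input_phrase, p),
                reverse=True):
  print(Levenshtein.distance(input_phrase, p), p)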

References

TBD

Citation

To cite Parrot in your work, please use the following bibtex reference:

@misc{prithivida2021parrot,
  author       = {Prithiviraj Damodaran},
  title        = {Parrot: Paraphrase generation for NLU.},
  year         = 2021,
  version      = {v1.0}
}
Comments
  • TypeError: 'NoneType' object is not iterable

    Hi,

    Why do I get the error when running the following code?

    `phrases = ["Can you recommend some upscale restaurants in Newyork?", "What are the famous places we should not miss in Russia?" ]

    for phrase in phrases: print("-"*100) print("Input_phrase: ", phrase) print("-"*100) para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False, do_diverse=True, diversity_ranker="levenshtein") for para_phrase in para_phrases: print(para_phrase)`

    Error:

    TypeError                                 Traceback (most recent call last)
    /home/user/Code/Parrot/main.ipynb Cell 4 in <cell line: 5>()
          8 print("-"*100)
          9 para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False, do_diverse=True, diversity_ranker="levenshtein")
    ---> 10 for para_phrase in para_phrases:
         11     print(para_phrase)

    TypeError: 'NoneType' object is not iterable

    opened by 0x11c11e 6
  • use_gpu=True Error

    In Google Colab.

    INSTALLED: !pip install -qqq git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

    MY CODE:

    from parrot import Parrot

    def random_state(seed):
      torch.manual_seed(seed)
      if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

    random_state(1234)

    parrot_gpu = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=True)

    phrases = ['i drive a ford pickup truck.', 'i am very conservative.', 'my family lives down the street from me.', 'i go to church every sunday.', 'i have three guns and love hunting.']

    para_phrases_gpu = parrot_gpu.augment(input_phrase=phrases[0], use_gpu=True, max_return_phrases = 10)

    ERROR:


    RuntimeError                              Traceback (most recent call last)
    <ipython-input> in <module>()
    ----> 1 para_phrases_gpu = parrot_gpu.augment(input_phrase=phrases[0], use_gpu=True, max_return_phrases = 10)

    /usr/local/lib/python3.7/dist-packages/parrot/parrot.py in augment(self, input_phrase, use_gpu, diversity_ranker, do_diverse, max_return_phrases, max_length, adequacy_threshold, fluency_threshold)
        128
        129
    --> 130 adequacy_filtered_phrases = self.adequacy_score.filter(input_phrase, paraphrases, adequacy_threshold, device)
        131 if len(adequacy_filtered_phrases) > 0 :
        132     fluency_filtered_phrases = self.fluency_score.filter(adequacy_filtered_phrases, fluency_threshold, device)

    /usr/local/lib/python3.7/dist-packages/parrot/filters.py in filter(self, input_phrase, para_phrases, adequacy_threshold, device)
         13 x = self.tokenizer(input_phrase, para_phrase, return_tensors='pt', max_length=128, truncation=True)
         14 self.adequacy_model = self.adequacy_model.to(device)
    ---> 15 logits = self.adequacy_model(**x).logits
         16 probs = logits.softmax(dim=1)
         17 prob_label_is_true = probs[:,1]

    [... intermediate frames through transformers/models/roberta/modeling_roberta.py (forward) and torch/nn/modules/module.py (_call_impl), ending in torch/nn/modules/sparse.py (Embedding.forward) ...]

    /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
       2181 # remove once script supports set_grad_enabled
       2182 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
    -> 2183 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
       2184
       2185

    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

    opened by Mario-RC 2
  • Understanding adequacy metric

    Hi, I have been using the filters file from this repo to experiment on evaluating some paraphrases I created using various different models, but I noticed that the adequacy score gives some unexpected results, so I was wondering if you could tell me some more about how it was trained?

    I noticed that if the paraphrase and the original are the exact same, the adequacy is quite low (around 0.7-0.80). If the paraphrase is shorter or longer than the original, it generally has a much higher score. Ex. Original: "I need to buy a house in the neighborhood" -> Paraphrase: "I need to buy a house" the paraphrase has a score of 0.98. Paraphrase: "I need to buy a house in the neighborhood where I want to live" results in an even higher score of .99, while the paraphrase "I need to buy a house in the neighborhood" (which is the same exact sentence as the original) gets a score of 0.7, and the same sentence with a period at the end gets 0.8.

    This makes me think that the adequacy model takes into account how much the new sentence has changed from the original, in addition to how well its meaning was preserved in some way. Since the ReadMe states that adequacy measures whether or not the paraphrase preserves the meaning of the original, it is confusing to me that using the same sentence for original and paraphrase does not get a high score. Could you clarify?

    opened by cegersdoerfer 2
  • error: legacy-install-failure

    When trying to install I get this:

    error: legacy-install-failure

    × Encountered error while trying to install package.
    ╰─> python-Levenshtein

    note: This error originates from a subprocess, and is likely not a problem with pip.

    How can I fix it?

    opened by hiddenchamp 2
  • Installation issue, torch >=1.6.0, Raspberry Pi OS 64 bit, Raspberry Pi 4b 8 gig

    Would love to get this working. Getting two errors - first error here is the last thing the install reports before exiting back to the terminal prompt:

    Collecting torch>=1.6.0 (from sentence-transformers->parrot==1.0)
      Could not find a version that satisfies the requirement torch>=1.6.0 (from sentence-transformers->parrot==1.0) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
    No matching distribution found for torch>=1.6.0 (from sentence-transformers->parrot==1.0)

    Tried running the quick start script anyway, get this error:

    Traceback (most recent call last):
      File "paraphrase.py", line 1, in <module>
        from parrot import Parrot
    ModuleNotFoundError: No module named 'parrot'

    Raspberry Pi 4B 8 Gig RAM version, running Raspberry Pi OS 64 Bit. Apologies if I have not provided sufficient info here - let me know how else I may help to figure out why this won't run on my build. Thanks

    opened by DaveXanatos 2
  • Parrot returns very similar description without paraphrasing for some sentences

    Hi Prithviraj,

    Good Day!

    Awesome work on building this library! I tried to use it for a personal project from the fashion domain and here's what I observed:

    [two screenshots comparing an input sentence and its Parrot paraphrase]

    Have a look at the two sentences above. I have provided the input sentence and the paraphrased sentence obtained using parrot. Except for some punctuation and contractions, there's not much that the model is able to do.

    Such is the case even for most of the descriptions that I have scraped from fashion retailers. Could you advise how I can use parrot to obtain better paraphrased suggestions please?

    Thanks & Regards, Vinayak Nayak.

    wontfix 
    opened by ElisonSherton 2
  • fix: add missing dependencies into requirements.txt

    • pandas is imported in a few places in the code
    • transformers needs to be installed with torchhub extras, otherwise it ends up with the following error:
    Traceback (most recent call last):
      File "/Users/yed/dev/foo/bar/phraser/phrase.py", line 114, in <module>
        parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")
      File "/Users/yed/dev/foo/bar/phraser/Parrot_Paraphraser/parrot/parrot.py", line 10, in __init__
        self.tokenizer = AutoTokenizer.from_pretrained(model_tag, use_auth_token=True)
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 628, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1775, in from_pretrained
        return cls._from_pretrained(
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1930, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5_fast.py", line 134, in __init__
        super().__init__(
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
        fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 1162, in convert_slow_tokenizer
        return converter_class(transformer_tokenizer).converted()
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 434, in __init__
        requires_backends(self, "protobuf")
      File "/Users/yed/.pyenv/versions/godot39/lib/python3.9/site-packages/transformers/utils/import_utils.py", line 967, in requires_backends
        raise ImportError("".join(failed))
    ImportError:
    T5Converter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
    installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
    that match your environment.
    
    opened by yedpodtrzitko 1
  • Failed to build tokenizers ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

    error: can't find Rust compiler

    If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.

    To update pip, run:

      pip install --upgrade pip
    

    and then retry package installation.

    If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.

    ERROR: Failed building wheel for tokenizers Successfully built parrot Failed to build tokenizers ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

    opened by jameswan 1
  • ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

    Tried installing Parrot_Paraphraser in Kaggle. It got successfully installed, but when I tried using it in code it shows ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

    My internet connection is also ON in Kaggle. Can you please check and tell? I have added a screenshot too.

    opened by ChiragM-Hexaware 1
  • Installation Issues on Windows

    It appears there is a broken dependency on python-Levenshtein. You should update your import to point to the Levenshtein package:

    Installing collected packages: python-Levenshtein
        Running setup.py install for python-Levenshtein ... error
        ERROR: Command errored out with exit status 1:
        command: 'C:\Users\gablanco\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\python.exe' -u -c 'import sys, setuptools, tokenize; ...' install --record 'C:\Users\gablanco\AppData\Local\Temp\pip-record-8_vcaac_\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\gablanco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Include\python-Levenshtein'
        cwd: C:\Users\gablanco\AppData\Local\Temp\pip-install-3g1ne2jo\python-levenshtein_9f5029b6aae44944bceb3f676daf71a1
        Complete output (28 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.9
        creating build\lib.win-amd64-3.9\Levenshtein
        copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.9\Levenshtein
        copying Levenshtein\__init__.py -> build\lib.win-amd64-3.9\Levenshtein
        running egg_info
        writing python_Levenshtein.egg-info\PKG-INFO
        writing dependency_links to python_Levenshtein.egg-info\dependency_links.txt
        writing entry points to python_Levenshtein.egg-info\entry_points.txt
        writing namespace_packages to python_Levenshtein.egg-info\namespace_packages.txt
        writing requirements to python_Levenshtein.egg-info\requires.txt
        writing top-level names to python_Levenshtein.egg-info\top_level.txt
        reading manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no previously-included files matching '*pyc' found anywhere in distribution
        warning: no previously-included files matching '*so' found anywhere in distribution
        warning: no previously-included files matching '.project' found anywhere in distribution
        warning: no previously-included files matching '.pydevproject' found anywhere in distribution
        adding license file 'COPYING'
        writing manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
        copying Levenshtein\_levenshtein.c -> build\lib.win-amd64-3.9\Levenshtein
        copying Levenshtein\_levenshtein.h -> build\lib.win-amd64-3.9\Levenshtein
        running build_ext
        building 'Levenshtein.levenshtein' extension
        error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
        ----------------------------------------
    ERROR: Command errored out with exit status 1: ... Check the logs for full command output.

    opened by gablans 1
  • Killed while running sample script

    Hi, I try to run this project on a VPS with 2GB RAM. When I run the line parrot = Parrot(...), the Python process is Killed. Can you help me please?

    Python 3.8.10 (default, Nov 26 2021, 20:14:08)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from parrot import Parrot
    >>> import torch
    >>> import warnings
    >>> warnings.filterwarnings("ignore")
    >>> def random_state(seed):
    ...   torch.manual_seed(seed)
    ...   if torch.cuda.is_available():
    ...     torch.cuda.manual_seed_all(seed)
    ...
    >>> random_state(1234)
    >>>
    >>> parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")
    Killed
    
    
    opened by kang-mus 1
  • TypeError: Descriptors cannot not be created directly.

    When running the Quickstart example from the readme, I get:

    Traceback (most recent call last):
      File "$HOME/src/try-parrot/demo.py", line 19, in <module>
        parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)
      File "$HOME/.local/lib/python3.8/site-packages/parrot/parrot.py", line 10, in __init__
        self.tokenizer = AutoTokenizer.from_pretrained(model_tag, use_auth_token=False)
      File "$HOME/.local/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 659, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "$HOME/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
        return cls._from_pretrained(
      File "$HOME/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "$HOME/.local/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5_fast.py", line 133, in __init__
        super().__init__(
      File "$HOME/.local/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
        fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
      File "$HOME/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1162, in convert_slow_tokenizer
        return converter_class(transformer_tokenizer).converted()
      File "$HOME/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 438, in __init__
        from .utils import sentencepiece_model_pb2 as model_pb2
      File "$HOME/.local/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 92, in <module>
        _descriptor.EnumValueDescriptor(
      File "$HOME/.local/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
        _message.Message._CheckCalledFromGeneratedFile()
    TypeError: Descriptors cannot not be created directly.
    If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
    If you cannot immediately regenerate your protos, some other possible workarounds are:
     1. Downgrade the protobuf package to 3.20.x or lower.
     2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
    
    More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
    
    opened by ct2034 2
  • Unable to use the model due to Huggingface updated API

    Hi. You might be unable to use the Parrot model due to an error something like this: ...is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'...

    To solve this I already created a pull request. Till then, you can open the Parrot library source code in your code editor and update these lines (lines 13 and 14, most probably):

    self.tokenizer = AutoTokenizer.from_pretrained(model_tag, use_auth_token = <your auth token>)
    self.model     = AutoModelForSeq2SeqLM.from_pretrained(model_tag, use_auth_token = <your auth token>)
    
    opened by LEAGUEDORA 1
  • added use auth token according to new hugging face

    Since hugging face has updated their API, you cannot access Parrot models without using an auth token. You may land into this issue

    ...is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'....

    To solve this I added a new parameter called use_auth_token.

    How to get a token from Huggingface

    1. Open Hugging Face and register/login with your credentials
    2. Navigate to the token settings page and create a write-permitted access token.
    3. Copy the token and pass it as a parameter to Parrot class while initiating.

    So the updated code will be

    from parrot import Parrot
    import torch
    import warnings
    warnings.filterwarnings("ignore")
    
    ''' 
    uncomment to get reproducible paraphrase generations
    def random_state(seed):
      torch.manual_seed(seed)
      if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    
    random_state(1234)
    '''
    
    #Init models (make sure you init ONLY once if you integrate this to your code)
    parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_auth_token = "<Your Hugging Face token>")
    
    phrases = ["Can you recommend some upscale restaurants in Newyork?",
               "What are the famous places we should not miss in Russia?"
    ]
    
    for phrase in phrases:
      print("-"*100)
      print("Input_phrase: ", phrase)
      print("-"*100)
      para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False)
      for para_phrase in para_phrases:
       print(para_phrase)
    
    opened by LEAGUEDORA 8
  • Update README.md

    Congratulations. Your project is featured on the kandi kit. kandi kits help developers shortlist reusable libraries and code snippets for specific topics or use cases. Add your kandi badge to help more developers discover and adopt your project easily. Thanks!

    opened by javadev984 0