MMDA - multimodal document analysis

Related tags

Text Data & NLP, mmda
Overview


This is work in progress...

Setup

conda create -n mmda python=3.8
conda activate mmda
pip install -r requirements.txt

Parsers

  • SymbolScraper - Apache 2.0

    • Quoted from their README: From the main directory, issue make. This will run the Maven build system, download dependencies, etc., compile source files and generate .jar files in ./target. Finally, a bash script bin/sscraper is generated, so that the program can be easily used in different directories.

Library walkthrough

1. Creating a Document for the first time

In this example, we use the SymbolScraperParser. Each parser implements its own .parse().

import os
from mmda.parsers.symbol_scraper_parser import SymbolScraperParser
from mmda.types.document import Document

outdir = '...'
outfname = '...json'

ssparser = SymbolScraperParser(sscraper_bin_path='...')
doc: Document = ssparser.parse(infile='...pdf', outdir=outdir, outfname=outfname)

Because we provided outdir and outfname, the document is also serialized for you:

assert os.path.exists(os.path.join(outdir, outfname))

2. Loading a serialized Document

Each parser implements its own .load().

doc: Document = ssparser.load(infile=os.path.join(outdir, outfname))

3. Iterating through a Document

The minimum requirement for a Document is its .text field, which is just a string.

But the real usefulness of this library comes when you have multiple different ways of segmenting the .text. For example:

for page in doc.pages:
    print(f'\n=== PAGE: {page.id} ===\n\n')
    for row in page.rows:
        print(row.text)

shows two nice aspects of this library:

  • Document provides iterables for different segmentations of text. Options include pages, tokens, rows, sents, blocks. Not every Parser will provide every segmentation, though. For example, SymbolScraperParser only provides pages, tokens, rows.

  • Each one of these segments (precisely, DocSpan objects) is aware of (and can access) other segment types. For example, you can call page.rows to get all Rows that intersect a particular Page. Or you can call sent.tokens to get all Tokens that intersect a particular Sentence. Or you can call sent.blocks to get the Block(s) that intersect a particular Sentence. These indexes are built dynamically when the Document is created and each time a new DocSpan type is loaded. In the extreme, one can do:

for page in doc.pages:
    for block in page.blocks:
        for sent in block.sents:
            for row in sent.rows:
                for token in row.tokens:
                    pass

4. Loading new DocSpan type

Not all Documents will have all segmentations available at creation time. You may need to load new definitions into an existing Document.

It's strongly recommended to create the full Document using a Parser.load(), but if you need to, you can build it up step by step using the DocSpan class and the Document.load() method:

from mmda.types.span import Span
from mmda.types.document import Document, DocSpan, Token, Page, Row, Sent, Block

doc = Document(text='I live in New York. I read the New York Times.')
page_jsons = [{'start': 0, 'end': 46, 'id': 0}]
sent_jsons = [{'start': 0, 'end': 19, 'id': 0}, {'start': 20, 'end': 46, 'id': 1}]

pages = [
    DocSpan.from_span(span=Span.from_json(span_json=page_json), 
                      doc=doc, 
                      span_type=Page)
    for page_json in page_jsons
]
sents = [
    DocSpan.from_span(span=Span.from_json(span_json=sent_json), 
                      doc=doc, 
                      span_type=Sent)
    for sent_json in sent_jsons
]

doc.load(sents=sents, pages=pages)

assert doc.sents
assert doc.pages
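
Once loaded, the new segmentations participate in the same cross-indexing described earlier. A small sketch (assuming, as with rows above, that each segment exposes .text):

for page in doc.pages:
    for sent in page.sents:
        print(sent.text)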

5. Changing the Document

We currently don't support any nice tools for mutating the data in a Document once it's been created, aside from loading new data. Do so at your own risk.

One note: if you're editing something (e.g. replacing some DocSpan in tokens), always call:

Document._build_span_type_to_spans()
Document._build_span_type_to_index()

to keep the indices up-to-date with your modified DocSpan.
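
For example, a minimal sketch (assuming a Document with tokens loaded; the replacement span is hypothetical, and the rebuild methods are assumed callable on the document instance):

# swap out the first token for a hypothetical replacement
new_token = DocSpan.from_span(
    span=Span.from_json(span_json={'start': 0, 'end': 1, 'id': 0}),
    doc=doc,
    span_type=Token,
)
doc.tokens[0] = new_token

# rebuild the indices so cross-segment lookups reflect the modified DocSpan
doc._build_span_type_to_spans()
doc._build_span_type_to_index()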

Comments
  • VILA predictor service


    https://github.com/allenai/scholar/issues/29184

    REST service for VILA predictors

    Any code that I didn't comment on in this PR is s2age-maker boilerplate.

    opened by rodneykinney 12
  • cleanup Annotation class: remove `uuid`, pull `id` out of `metadata`, remove `dataclasses`, add `getter/setter` for `text` and `type`, make `Metadata()` take args


    @soldni I'm trying to migrate off dataclasses, but the tests are failing at:

    ERROR tests/test_eval/test_metrics.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_internal_ai2/test_api.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_parsers/test_grobid_header_parser.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_parsers/test_override.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_parsers/test_pdf_plumber_parser.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_bibentry_predictor.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_dictionary_word_predictor.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_span_group_classification_predictor.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_vila_predictors.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_document.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_indexers.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_json_conversion.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_metadata.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_span_group.py - TypeError: add_deprecated_field only works on dataclasses
    

    Not sure how best to handle this.

    opened by kyleclo 7
  • Add attributes to API data classes


    This PR adds metadata to API data classes.

    API data classes and mmda types differ in a few significant aspects:

    • id and type (and text for SpanGroup) are stored in metadata for mmda types; in the APIs, they are part of the top-level attributes.
    • metadata can store arbitrary content in the mmda types; in the data API, all attributes that are not explicitly declared are dropped.
      • we expect applications that require specific fields to declare custom Metadata and SpanGroup/BoxGroup classes that inherit from their parent class. For an example, see tests/test_internal_ai2/test_api.py
      • metadata entries are mapped to an attributes field to match how data is stored in the Annotation Store
    opened by soldni 4
  • Egork/merge spans


    Adding a class to the utils for merging spans, with optional parameters x and y. x and y are distances added to the boundaries of the boxes to decide whether they overlap.

    Here is an example of tokens represented by a list of spans; the task is to merge them into a single span and box (see attached images).

    Result of merging tokens with x=0.04387334, y=0.01421097, the average size of a token in the document.

    Another example with the same x=0.04387334, y=0.01421097.
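
    For intuition, here is a simplified 1-D analogue of the merging logic (hypothetical sketch, not the PR's actual utility; the real version works on 2-D boxes with x and y tolerances):

    def merge_spans(spans, tol=0):
        """Merge (start, end) spans whose gaps are within tol."""
        merged = []
        for start, end in sorted(spans):
            if merged and start <= merged[-1][1] + tol:
                # close enough: extend the previous span
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    print(merge_spans([(0, 4), (5, 9), (20, 25)], tol=1))  # [(0, 9), (20, 25)]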

    opened by comorado 4
  • Bib Entry Parser/Predictor


    https://github.com/allenai/scholar/issues/32461

    Pretty standard model and interface implementation. You can see the result on http://bibentry-predictor.v0.dev.models.s2.allenai.org/ or

    curl -X 'POST' \
      'http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "instances": [
        {
          "bib_entry": "[4] Wei Zhuo, Qianyi Zhan, Yuan Liu, Zhenping Xie, and Jing Lu. Context attention heterogeneous network embed- ding. Computational Intelligence and Neuroscience , 2019. doi: 10.1155/2019/8106073."
        }
      ]
    }'
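
    Equivalently, a sketch of the same request from Python using the requests library (illustrative only; the bib_entry string is abbreviated here):

    import requests

    resp = requests.post(
        'http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations',
        json={'instances': [{'bib_entry': '[4] Wei Zhuo, ...'}]},
    )
    print(resp.json())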
    

    Tests: integration test passed and dev deployment works

    TODO: release as version 0.0.10 after merge

    opened by stefanc-ai2 4
  • Dockerized pipeline


    Pattern for wrapping services in lighter-weight containers.

    • Replace full sub-projects in the services directory with a Dockerfile plus an xxx_api.py file for each service.
    • Add a docker-compose file that builds the services, plus a python container for running a REPL or scripts.
    • Add pipeline.py, a sample end-to-end pipeline script using the services.
    • Add a run-pipeline.sh file to run the pipeline.

    @rauthur

    opened by rodneykinney 4
  • MMDA predictor evaluation


    • [x] Grobid implemented as a Parser
    • [x] Script to obtain S2-VLUE (check w/ shannons on location)
    • [x] Script to run S2-VLUE through an mmda.Predictor & obtain evaluation metrics
    • [x] Definition/implementation of end-to-end evaluation metrics.
    S2-VLUE evaluation looks like this:
    [('VILA', 'title'), ('a', 'title'), ('new', 'title')...] 
    and the evaluation associated with it in the VILA paper is token-level F1.
    
    But if we want to compare against GROBID or other systems that use different parsers (i.e. different ._symbols and .tokens), what we want for evaluation is
    {'title': 'VILA a new...', ...}
    
    Our evaluation metric is based on string match / edit-distance metrics.  
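
    A minimal sketch of that kind of scoring (illustrative only; difflib's ratio stands in for whatever string-match / edit-distance metric we settle on):

    from difflib import SequenceMatcher

    gold = {'title': 'VILA a new...'}
    pred = {'title': 'VILA a new...'}
    score = SequenceMatcher(None, gold['title'], pred['title']).ratio()
    print(f'title similarity: {score:.2f}')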
    

    Focus for now on title and abstract. Other S2-VLUE classes may not even exist in Grobid or the other tools we want to compare against. If we have time, other types of content-types we'd want are:

    • Section names
    • Author names
    • Bibliographies (split out; optionally, also-parsed)
    • Body text
    • Captions
    • Footnotes
    • Tables/Figures
    opened by kyleclo 4
  • Speed up vila pre-processing


    From earlier testing, I remember that convert_document_page_to_pdf_dict takes a significant fraction of the total prediction time for vila. Here's a simple change that does all the work in a single iteration over tokens instead of multiple list comprehensions. I tested this on some production PDFs and saw a 3x speed-up.
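
    Schematically, the change looks like this (hypothetical stand-in code, not the actual vila pre-processing):

    from dataclasses import dataclass

    @dataclass
    class Tok:  # hypothetical stand-in for a vila token
        text: str
        bbox: tuple

    toks = [Tok('VILA', (0, 0, 10, 10)), Tok('rocks', (12, 0, 30, 10))]

    # before: several list comprehensions, each a separate pass over the tokens
    words = [t.text for t in toks]
    bboxes = [t.bbox for t in toks]

    # after: a single pass collects everything at once
    words, bboxes = [], []
    for t in toks:
        words.append(t.text)
        bboxes.append(t.bbox)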

    @cmwilhelm @yoganandc

    https://github.com/allenai/scholar/issues/32695

    opened by rodneykinney 3
  • Kylel/2022 09/span group utils


    Minor PR: added tests for a pretty important piece of functionality that was already in use -- how to combine Spans that are next to each other into a single big Span. The key thing that was undocumented is that Boxes for the underlying Spans actually disappear after this merging, which the tests now capture.

    I don't think this is behavior we want to support in the future, but for now, this is how this utility is being used.

    opened by kyleclo 2
  • `Document._annotate_box_group` returns empty SpanGroups


    Very bizarre bug I've encountered when trying to annotate a document with blocks from layout parser. To reproduce, run the following code:

    from mmda.parsers.pdfplumber_parser import PDFPlumberParser
    from mmda.rasterizers.rasterizer import PDF2ImageRasterizer
    from mmda.predictors.lp_predictors import LayoutParserPredictor
    
    import torch
    import warnings
    from cached_path import cached_path        # must install via `pip install cached_path`
    
    
    pdfplumber_parser = PDFPlumberParser()
    rasterizer = PDF2ImageRasterizer()
    layout_predictor = LayoutParserPredictor.from_pretrained(
        "lp://efficientdet/PubLayNet"
    )
    
    path = str(cached_path('https://arxiv.org/pdf/2110.08536.pdf'))
    doc = pdfplumber_parser.parse(path)
    images = rasterizer.rasterize(input_pdf_path=path, dpi=72)
    doc.annotate_images(images)
    
    with torch.no_grad(), warnings.catch_warnings():
        layout_regions = layout_predictor.predict(doc)
        doc.annotate(blocks=layout_regions)
    
    # these asserts should fail, but because of the bug they pass
    assert doc.blocks[0].spans == []
    assert doc.blocks[0].tokens == []
    

    I've done a bit of poking around and it seems to stem from the following snippet of code in mmda/types/document.py:

    derived_span_groups = sorted(
        derived_span_groups, key=lambda span_group: span_group.start
    )
    

    In particular, it seems like the spans attribute for each SpanGroup gets emptied after sorting.

    No clue what would be causing this, but perhaps there's an explanation?

    opened by soldni 2
  • VILA models crashing when bounding boxes are not int


    Because of changes in #69, bounding boxes are now floats instead of ints, which VILA does not like:

      File "/Users/lucas/miniforge3/envs/pdod/lib/python3.10/site-packages/mmda/predictors/hf_predictors/vila_predictor.py", line 161, in predict
        model_outputs = self.model(**self.model_input_collator(model_inputs))
      File "/Users/lucas/miniforge3/envs/pdod/lib/python3.10/site-packages/mmda/predictors/hf_predictors/vila_predictor.py", line 178, in model_input_collator
        return {
      File "/Users/lucas/miniforge3/envs/pdod/lib/python3.10/site-packages/mmda/predictors/hf_predictors/vila_predictor.py", line 179, in <dictcomp>
        key: torch.tensor(val, dtype=torch.int64, device=self.device)
    TypeError: 'float' object cannot be interpreted as an integer
    

    This PR adds an explicit cast operation to get around this issue during pre-processing.
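
    A minimal sketch of that kind of cast (hypothetical, not the PR's exact code):

    bbox = [12.0, 7.5, 90.2, 21.9]            # float coords from the parser
    int_bbox = [int(round(v)) for v in bbox]  # ints satisfy torch.tensor(..., dtype=torch.int64)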

    opened by soldni 2
  • Bib predictor index error bug fix


    Attempt at a fix for part 1 of https://github.com/allenai/scholar/issues/34858. I think we can work around the index error that keeps popping up this way.

    tt verify integration test passes

    next steps:

    • [ ] tt push
    • [ ] update timo-services config for bib-predictor to use new code which will trigger new deployment
    opened by geli-gel 4
  • Add fontinfo to tokens without requiring word split


    This PR appends font information (font name and size) as metadata to 'tokens' on a Document without requiring tokens to be split on that information (i.e., "best" effort if a token contains many font names or sizes). The code subclasses the WordExtractor provided by PDFPlumber.

    Currently, in a default configuration of PDFPlumberParser, tokens are already extracted with font name and size information (although it is discarded and only used for splitting). This could be added to the metadata as-is; however, the method used is the extra_attrs argument of PDFPlumber's extract_words, which forces token splitting if font name and size do not match. I believe this is a bad default and, further, one that is not required (users can override this argument). The approach here guarantees this metadata will be captured.

    Re: "best effort" in the case of multiple name/size options for a token: I have maintained the logic of just taking the font name and size from the first character in the token. Another approach would provide the min, max, and average font sizes (or others) as well as a set of font names. This could be provided in the future or by adapting the append_attrs argument. For current use cases (section nesting prediction) this approach is sufficient.
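
    For reference, a sketch of the extra_attrs behavior discussed above, using pdfplumber's stock API (the filename is hypothetical):

    import pdfplumber

    with pdfplumber.open('paper.pdf') as pdf:
        # extra_attrs forces a word split whenever fontname or size changes
        words = pdf.pages[0].extract_words(extra_attrs=['fontname', 'size'])
        print(words[0]['fontname'], words[0]['size'])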

    opened by rauthur 0
  • Incomplete sentences in README


    I was going through the readme and noticed a couple of sentences that start but don't end. Since I'm new to the project, I don't know how to finish the sentences myself.

    [Two screenshots attached]
    opened by dmh43 0
  • cleanup JSON conversion for all data types


    Noticed JSON serialization had inconsistent behavior across various data types, especially in cases where certain fields were empty or None.

    This PR adds a set of comprehensive tests in tests/test_types/test_json_conversion.py that documents these behaviors. The PR also resolves the inconsistencies; for example, Metadata that's attached to a SpanGroup or BoxGroup will no longer get accidentally serialized as an empty dictionary.

    opened by kyleclo 0
  • adding relations


    This PR extends the library's functionality substantially by adding a new Annotation type called Relation. A Relation is a link between 2 annotations (e.g. a Citation linked to its Bib Entry). The input Annotations are called key and value.

    A few things needed to change to support Relations:

    Annotation Names

    Relations store references to Annotation objects. But we didn't want Relation.to_json() to also .to_json() those objects. We only want to store minimal identifiers of the key and value. Something short like bib_entry-5 or sentence-13. We call these short strings names.

    To do this, we added an optional attribute called field: str to the Annotation class, which stores this name. It's automatically populated when you run Document.annotate(new_field=list_of_annotations); each of those input annotations will have the new field name stored under .field.

    We also added a method name() that returns the name of a particular Annotation object, which is unique at the document level. Names are represented by a minimal class (AnnotationName) that basically stores .field and .id.

    In short, now after you annotate a Document with annotations, you can do stuff like:

    doc.tokens[15].name   ==   AnnotationName(field='tokens', id=15)
    str(annotation_name)  ==   'tokens-15'
    AnnotationName.from_str('tokens-15')  ==  AnnotationName(field='tokens', id=15)
    

    Lookups based on names

    To support reconstructing a Relation object given the names of its key and value, we need the ability to look up the Annotations involved. We introduce a new method to enable this:

    annotation_name = AnnotationName.from_str('paragraphs-99')
    a = document.locate_annotation(annotation_name)  # returns the specific Annotation object
    assert a.id == 99
    assert a.field == 'paragraphs'
    

    to and from JSON

    Finally, we need some way of serializing to JSON and reconstructing from JSON. For serialization, now that we have Names, the JSON is quite minimal:

    {'key': <name_of_key>, 'value': <name_of_value>, ...other stuff that all Annotation objects have,  like Metadata...}
    

    Reconstructing a Relation from JSON is trickier because a Relation is meaningless without a Document object. The Document must also store the specific Annotations correctly so we can perform the lookup based on these Names.

    The API for this is similar, but you must also pass in the Document object:

    relation = Relation.from_json(my_relation_dict, my_document_containing_necessary_fields)
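
    Putting it together, a hypothetical end-to-end sketch (the field names and the Relation constructor signature are assumptions based on the description above):

    citation = doc.citations[0]       # hypothetical annotated field
    bib_entry = doc.bib_entries[5]    # hypothetical annotated field
    relation = Relation(key=citation, value=bib_entry)
    relation.to_json()  # {'key': 'citations-0', 'value': 'bib_entries-5', ...}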
    
    opened by kyleclo 2
Releases: 0.2.7