A pipeline for making highlighted text stand-alone.

Overview
title emoji colorFrom colorTo sdk app_file pinned
decontextualizer
📤
green
gray
streamlit
main.py
false

Decontextualizer

As a second step in improving our content consumption workflows, I investigated a new approach to extracting fragments from a content item before saving them for subsequent surfacing. While the lexiscore deals with content items on a holistic level -- evaluating entire books, articles, and papers -- I speculated then that going granular is a natural next step in building tools which help us locate specific valuable ideas in long-form content. The decontextualizer is a stepping stone in that direction, consisting of a pipeline for making text excerpts compact and semantically self-contained. Concretely, the decontextualizer is a web app able to take in an annotated PDF and automatically tweak the highlighted excerpts so that they make more sense on their own, even out of context.

Read more...

Installation

The decontextualizer can either be deployed from source or using Docker.

Docker

To deploy the decontextualizer labeler using Docker, first make sure to have Docker installed, then simply run the following.

docker run -p 8501:8501 paulbricman/decontextualizer 

The tool should be available at localhost:8501.

From Source

To set up the decontextualizer, clone the repository and run the following:

python3 -m pip install -r requirements.txt
streamlit run main.py

The tool should be available at localhost:8501.

Screenshots

You might also like...
🐸   Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️
🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.
box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.

Box is a text-based visual programming language inspired by Unreal Engine blueprint function graphs. $ cat factorial.box ┌─ƒ(Factorial)───┐

Export solved codewars kata challenges to a text file.

Codewars Kata Exporter Note:this is not totally my work.i've edited the project to make more easier and faster for me.you can find the original work h

Extract knowledge from raw text

Extract knowledge from raw text This repository is a nearly copy-paste of "From Text to Knowledge: The Information Extraction Pipeline" with some cosm

AnnIE - Annotation Platform, tool for open information extraction annotations using text files.
AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

py-trans is a Free Python library for translate text into different languages.

Free Python library to translate text into different languages.

text-to-speach bot - You really do NOT have time for read a newsletter? Now you can listen to it

NewsletterReader You really do NOT have time for read a newsletter? Now you can listen to it The Newsletter of Filipe Deschamps is a great place to re

a python package that lets you add custom colors and text formatting to your scripts in a very easy way!
a python package that lets you add custom colors and text formatting to your scripts in a very easy way!

colormate Python script text formatting package What is colormate? colormate is a python library that lets you add text formatting to your scripts, it

Repository containing the code for An-Gocair text normaliser

Scottish Gaelic Text Normaliser The following project contains the code and resources for the Scottish Gaelic text normalisation project. The repo can

Comments
  • PDF Failures

    PDF Failures

    Hello!

    I attempted two different PDFs (which I can share in a DM or a email) — one was an older-pre computer PDF that had been OCRed professionally and hilighted. Another was a modern PDF, of a book published last year, also highlighted. Zotero was able to extract from both using pdfjs. When i use https://huggingface.co/spaces/paulbricman/decontextualizer i get:

    File "/home/user/.local/lib/python3.8/site-packages/streamlit/script_runner.py", line 354, in _run_script
        exec(code, module.__dict__)
    File "/home/user/app/main.py", line 17, in <module>
        components.add_section()
    File "/home/user/app/components.py", line 46, in add_section
        excerpts = pdf_to_excerpts(filename)
    File "/home/user/app/processing.py", line 48, in pdf_to_excerpts
        excerpt = extract_annot(annot, words)
    File "/home/user/app/processing.py", line 24, in extract_annot
        quad_count = int(len(quad_points) / 4)
    
    opened by brimwats1 4
Releases(v1.0.0)
Owner
Paul Bricman
Building tools which augment the mind.
Paul Bricman
Vector space based Information Retrieval System for Text Processing - Information retrieval

Information Retrieval: Text Processing Group 13 Sequence of operations Install Requirements Add given wikipedia files to the corpus directory. Downloa

1 Jan 01, 2022
Hspell, the free Hebrew spellchecker and morphology engine.

Hspell, the free Hebrew spellchecker and morphology engine.

16 Sep 15, 2022
JSON and CSV data for Swahili dictionary with over 16600+ words

kamusi JSON and CSV data for swahili dictionary with over 16600+ words. This repo consists of data from swahili dictionary with about 16683 words toge

Jordan Kalebu 8 Jan 13, 2022
Wikipedia Reader for the GNOME Desktop

Wike Wike is a Wikipedia reader for the GNOME Desktop. Provides access to all the content of this online encyclopedia in a native application, with a

Hugo Olabera 126 Dec 24, 2022
You can encode and decode base85, ascii85, base64, base32, and base16 with this tool.

You can encode and decode base85, ascii85, base64, base32, and base16 with this tool.

8 Dec 20, 2022
Add your new words to a text file and get them randomly.

Memorize-New-Words In this very very very little project, I've wrote a code to memorize new english words. Therefore you can add the words and their m

Mostafa 2 Jul 04, 2022
Text to ASCII and ASCII to text

Text2ASCII Description This python script (converter.py) contains two functions: encode() is used to return a list of Integer, one item per character

4 Jan 22, 2022
text-to-speach bot - You really do NOT have time for read a newsletter? Now you can listen to it

NewsletterReader You really do NOT have time for read a newsletter? Now you can listen to it The Newsletter of Filipe Deschamps is a great place to re

ItanuRomero 8 Sep 18, 2021
Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

Meaningful Word Generator Generates meaningful words from dictionary with given no. of letters and words. This might be useful for generating short li

Mohammed Rabil 1 Jan 01, 2022
TextStatistics - Get a text file wich contains English text

TextStatistics This program get a text file wich contains English text. The program analyses the text, and print some information. For this program I

2 Nov 15, 2021
Phone Number formatting for PlaySMS Platform - BulkSMS Platform

BulkSMS-Number-Formatting Phone Number formatting for PlaySMS Platform - BulkSMS Platform. Phone Number Formatting for PlaySMS Phonebook Service This

Edwin Senunyeme 1 Nov 08, 2021
box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.

Box is a text-based visual programming language inspired by Unreal Engine blueprint function graphs. $ cat factorial.box ┌─ƒ(Factorial)───┐

Pranav 104 Dec 24, 2022
A pipeline for making highlighted text stand-alone.

title emoji colorFrom colorTo sdk app_file pinned decontextualizer 📤 green gray streamlit main.py false Decontextualizer As a second step in improvin

Paul Bricman 26 Dec 17, 2022
Find a Doc is a free online resource aimed at helping connect the foreign community in Japan with health services in their native language.

Find a Doc - Localization Find a Doc is a free online resource aimed at helping connect the foreign community in Japan with health services in their n

Our Japan Life 18 Dec 19, 2022
ChirpText is a collection of text processing tools for Python 3.

ChirpText is a collection of text processing tools for Python 3. It is not meant to be a powerful tank like the popular NTLK but a small package which

Le Tuan Anh 5 Nov 30, 2022
Hamming code generation, error detection & correction.

Hamming code generation, error detection & correction.

Farhan Bin Amin 2 Jun 30, 2022
split Word file by chapter

split Word file by chapter we use the mircosoft word api to code this tool api url:https://docs.microsoft.com/zh-cn/dotnet/api/ if this tool is good f

wisdom under lemon trees 5 Nov 06, 2021
Text Summarizationcls app with python

Text Summarizationcls app This is the repo for the Text Summarization AI Project. It makes use of pre-trained Hugging Face models Packages Used The pa

Edem Gold 1 Oct 23, 2021
Etranslate is a free and unlimited python library for transiting your texts

Etranslate is a free and unlimited python library for transiting your texts

Abolfazl Khalili 16 Sep 13, 2022
CowExcept - Spice up those exceptions with cowexcept!

CowExcept - Spice up those exceptions with cowexcept!

James Ansley 41 Jun 30, 2022