A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • a local directory, recursively, processing every PDF file encountered

  • a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injecting metadata and keeping track of progress. The collection can be stored locally or on an S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. The service host and port can be changed in the client's config.json file.
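As a rough illustration of how the host and port settings might be read, here is a minimal sketch. The key names `software_mention_host` and `software_mention_port` and the helper `service_url` are assumptions made for this example, not confirmed names from the client; check your own config.json for the actual keys.

```python
import json


def service_url(config_path="./config.json"):
    """Build the base URL of the service from the client configuration."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Hypothetical key names: the real config.json may use different ones.
    host = config["software_mention_host"]
    port = config["software_mention_port"]
    return "http://{}:{}".format(host, port)
```

Calling `service_url("./config.json")` before a large batch is a cheap way to confirm which service instance the client would target.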

Install

cd software_mention_client/

It is advisable to first set up a virtual environment, to avoid falling into one of those gloomy Python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

To install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

By default, the logs are written to the file ./client.log, but their location can be changed in the configuration file (default ./config.json).

Processing local PDF files

To process a single file, with the resulting JSON written to the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json
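Once the output file exists, it can be inspected with a few lines of Python. The sketch below assumes that the JSON document holds the extracted mentions in a top-level list named `mentions`; that key, like the helper name `count_mentions`, is an assumption for illustration and should be checked against an actual output file.

```python
import json


def count_mentions(json_path):
    """Return the number of software mentions found in one result file."""
    with open(json_path, encoding="utf-8") as f:
        result = json.load(f)
    # Assumed layout: a top-level "mentions" list; missing key counts as zero.
    return len(result.get("mentions", []))
```

For the command above, `count_mentions("toto.json")` would give a quick sanity check that the service returned something for the file.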

When a directory of PDF files is processed recursively, the results are:

  • written to the MongoDB server and database indicated in the config file

  • and written as JSON files in the directory of PDF files, alongside each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but a different one can be specified with the --config parameter:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json
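When several directories need to be processed in sequence, the command line above can be driven from a small wrapper. This is only a sketch: the helper names `build_command` and `process_directories` are made up for the example, and it assumes the script is run from the software_mention_client/ directory. Only the documented `--repo-in` and `--config` options are used.

```python
import subprocess
import sys


def build_command(repo_in, config=None):
    """Assemble the client command line for one directory of PDF files."""
    cmd = [sys.executable, "software_mention_client.py", "--repo-in", repo_in]
    if config is not None:
        cmd += ["--config", config]
    return cmd


def process_directories(directories, config=None):
    """Run the client on each directory in turn, stopping on the first failure."""
    for directory in directories:
        subprocess.run(build_command(directory, config), check=True)
```

Keeping the command construction separate from the execution makes it easy to print and review the exact commands before launching a long run.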

Processing a collection of PDFs harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder create a collection manifest as an LMDB database to keep track of the harvesting of large collections of files. The resources can be stored on a local file system or on an AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched with the metadata records of the processed PDF and stored alongside each processed PDF in the data repository.

If the harvested collection is located on an S3 storage, the access information must be indicated in the client's configuration file config.json. The extracted software mentions will be written to a file with the extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json
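The naming convention in this listing makes it easy to check progress on a locally stored collection. The following sketch lists the PDF files that do not yet have a matching .software.json file next to them. The helper name `pending_pdfs` is hypothetical, and the sketch assumes that results always sit in the same directory as their PDF, as in the listing above.

```python
from pathlib import Path


def pending_pdfs(root):
    """Return the PDF files under root that have no .software.json result yet."""
    pending = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        # Example: <id>.pdf is matched with <id>.software.json
        result = pdf.with_name(pdf.stem + ".software.json")
        if not result.exists():
            pending.append(pdf)
    return pending
```

Such a list can help decide whether a run with --reprocess is worth launching.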

If MongoDB server access information is provided in the configuration file config.json, the extracted information will additionally be written to MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez
