Repositorio com arquivos processados da CPI da COVID para facilitar analise

Related tags

Miscellaneouscpi4all
Overview

cpi4all

Repositorio com arquivos processados da CPI da COVID para facilitar analise

Organização

No site do senado é possivel encontrar a lista de todos os documentos coletados pela CPI da COVID.

A tabela no site possui a seguinte estrutura:

No Arquivos Data de recebimento Remetente Origem Descrição Caixa Em Resposta
1 Link1 ... ... ... ... ... ...
2 Link2/link3 ... ... ... ... ... ...

Esses links levam ao download de arquivos PDF com os documentos em questão.

Nesse repositorio você podera encontrar a versão txt desses arquivos. O nome do arquivo nesse repositorio é formado por <No do documento>_<numero do link>. Por exemplo:

link1 = 1_1 porque ele é relativo ao arquivo No 1, e é o primeiro link.

link2 = 2_1 porque ele é relativo ao arquivo No 2, e é o primeiro link dessa linha.

link3 = 2_2 porque ele é relativo ao arquivo No 2, e é o segundo link da linha.

A versão texto de todos os documentos está na pasta database/txts/.

Exemplos:

Arquivo No 1, primeiro link: 1_1

Arquivo No 4, quarto link: 3_4

Nota 1: Nem todos os arquivos foram convertidos ainda

Nota 2: A conversão usa reconhecimento de imagem e pode ficar bem ruim as vezes, gerando erros ortograficos ou palavras sem nexo algum.

Para desenvolvedores

Os scripts funcionam na seguinte sequencia:

  1. extract_rows.py: Vai no site do senado e extrai as informações de cada linha da tabela. Todos os dados são salvos em database/rows.
  2. extract_headers.py: Para cada link em cada linha, esse script pega metadados do arquivo (tamanho, tipo) que vão ser uteis depois. Esses dados são salvos em database/headers.
  3. download_pdfs.py: Baixa todos os PDFs descritos em database/headers e salva em database/pdfs.
  4. convert_pdf_to_jpg.py: Converte todos os PDFs em database/pdfs para imagens em database/jpgs.
  5. convert_jpg_to_txt.py: Converte todos as imagens em database/jpgs para texto em database/txt.

Por motivos de performance, apenas as pastas database/rows, database/headers e database/txts sao salvas nesse repositorio.

TODO: 0. Melhorar esse readme :)

  1. Usar o githubpages para gerar um site estatico que permite pesquisar em todos os txt
  2. Terminar de converter todos os arquivos
  3. Investigar arquivos em que a conversão ficou pessima.
  4. Fazer extração automatica de datas e prover um json com a ordem cronologica dos arquivos.
Owner
Breno Rodrigues Guimarães
Breno Rodrigues Guimarães
Python project setup, updater, and launcher

pyLaunch Python project setup, updater, and launcher Purpose: Increase project productivity and provide features easily. Once installed as a git submo

DAAV, LLC 1 Jan 07, 2022
Collaboration project to creating bank application maded by Anzhelica Sakun and Yuriy Konyukh

Collaboration project to creating bank application maded by Anzhelica Sakun and Yuriy Konyukh

Yuriy 1 Jan 08, 2022
Exercise to teach a newcomer to the CLSP grid to set up their environment and run jobs

Exercise to teach a newcomer to the CLSP grid to set up their environment and run jobs

Alexandra 2 May 18, 2022
Junos PyEZ is a Python library to remotely manage/automate Junos devices.

The repo is under active development. If you take a clone, you are getting the latest, and perhaps not entirely stable code. DOCUMENTATION Official Do

Juniper Networks 623 Dec 10, 2022
A Trace Explorer for Reverse Engineers

Tenet - A Trace Explorer for Reverse Engineers Overview Tenet is an IDA Pro plugin for exploring execution traces. The goal of this plugin is to provi

1k Jan 02, 2023
Google Fit Sensor Component

Google Fit Sensor Component

Ivan Vojtko 21 Dec 20, 2022
A python program to detect rickrolls with just the youtube link.

rickroll_detector A python program to detect rickrolls with just the youtube link. Usage: clone this repo or download zip run the main.py file with py

Tricky 4 Nov 06, 2022
Provide Prometheus url_sd compatible API Endpoint with data from Netbox

netbox-plugin-prometheus-sd Provide Prometheus http_sd compatible API Endpoint with data from Netbox. HTTP SD is a new feature in Prometheus and not a

Felix Peters 66 Dec 19, 2022
Python-geoarrow - Storing geometry data in Apache Arrow format

geoarrow Storing geometry data in Apache Arrow format Installation $ pip install

Joris Van den Bossche 11 Mar 03, 2022
This is the Code Institute student template for Gitpod.

Welcome AnaG0307, This is the Code Institute student template for Gitpod. We have preinstalled all of the tools you need to get started. It's perfectl

0 Feb 02, 2022
A clipboard where a user can add and retrieve multiple items to and from (resp) from the clipboard cache.

A clipboard where a user can add and retrieve multiple items to and from (resp) from the clipboard cache.

Gaurav Bhattacharjee 2 Feb 07, 2022
Comprehensive OpenAPI schema generator for Django based on pydantic

🗡️ Djagger Automated OpenAPI documentation generator for Django. Djagger helps you generate a complete and comprehensive API documentation of your Dj

13 Nov 26, 2022
API Rate Limit Decorator

ratelimit APIs are a very common way to interact with web services. As the need to consume data grows, so does the number of API calls necessary to re

Tomas Basham 574 Dec 26, 2022
Set up a sidechain for the XRPL quickly and easily

Sidechain Launch Kit Introduction This directory contains python scripts to tests and explore side chains. This document walks through the steps to se

Xpring Engineering 15 Dec 08, 2022
IPython: Productive Interactive Computing

IPython: Productive Interactive Computing Overview Welcome to IPython. Our full documentation is available on ipython.readthedocs.io and contains info

IPython 15.6k Dec 31, 2022
WhyNotWin11 - Detection Script to help identify why your PC isn't Windows 11 Release Ready

WhyNotWin11 - Detection Script to help identify why your PC isn't Windows 11 Release Ready

Robert C. Maehl 5.9k Dec 31, 2022
Number calculator application.

Number calculator application.

Michael J Bailey 3 Oct 08, 2021
A program that analyzes data from inertia measurement units installeed in aircraft and generates g-exceedance curves

A program that analyzes data from inertia measurement units installeed in aircraft and generates g-exceedance curves

Pooya 1 Nov 23, 2021
Consolemenu on python with pynput

ConsoleMenu Consolemenu on python 3 with pynput Powered by pynput and colorama Description Модуль позволяющий сделать меню выбора с помощью стрелок дл

KrouZ_CZ 2 Nov 15, 2021
To check my COVID-19 vaccine appointment, I wrote an infinite loop that sends me a Whatsapp message hourly using Twilio and Selenium. It works on my Raspberry Pi computer.

COVID-19_vaccine_appointment To check my COVID-19 vaccine appointment, I wrote an infinite loop that sends me a Whatsapp message hourly using Twilio a

Ayyuce Demirbas 24 Dec 17, 2022