Repositorio com arquivos processados da CPI da COVID para facilitar analise

Related tags

Miscellaneouscpi4all
Overview

cpi4all

Repositorio com arquivos processados da CPI da COVID para facilitar analise

Organização

No site do senado é possivel encontrar a lista de todos os documentos coletados pela CPI da COVID.

A tabela no site possui a seguinte estrutura:

No Arquivos Data de recebimento Remetente Origem Descrição Caixa Em Resposta
1 Link1 ... ... ... ... ... ...
2 Link2/link3 ... ... ... ... ... ...

Esses links levam ao download de arquivos PDF com os documentos em questão.

Nesse repositorio você podera encontrar a versão txt desses arquivos. O nome do arquivo nesse repositorio é formado por <No do documento>_<numero do link>. Por exemplo:

link1 = 1_1 porque ele é relativo ao arquivo No 1, e é o primeiro link.

link2 = 2_1 porque ele é relativo ao arquivo No 2, e é o primeiro link dessa linha.

link3 = 2_2 porque ele é relativo ao arquivo No 2, e é o segundo link da linha.

A versão texto de todos os documentos está na pasta database/txts/.

Exemplos:

Arquivo No 1, primeiro link: 1_1

Arquivo No 4, quarto link: 3_4

Nota 1: Nem todos os arquivos foram convertidos ainda

Nota 2: A conversão usa reconhecimento de imagem e pode ficar bem ruim as vezes, gerando erros ortograficos ou palavras sem nexo algum.

Para desenvolvedores

Os scripts funcionam na seguinte sequencia:

  1. extract_rows.py: Vai no site do senado e extrai as informações de cada linha da tabela. Todos os dados são salvos em database/rows.
  2. extract_headers.py: Para cada link em cada linha, esse script pega metadados do arquivo (tamanho, tipo) que vão ser uteis depois. Esses dados são salvos em database/headers.
  3. download_pdfs.py: Baixa todos os PDFs descritos em database/headers e salva em database/pdfs.
  4. convert_pdf_to_jpg.py: Converte todos os PDFs em database/pdfs para imagens em database/jpgs.
  5. convert_jpg_to_txt.py: Converte todos as imagens em database/jpgs para texto em database/txt.

Por motivos de performance, apenas as pastas database/rows, database/headers e database/txts sao salvas nesse repositorio.

TODO: 0. Melhorar esse readme :)

  1. Usar o githubpages para gerar um site estatico que permite pesquisar em todos os txt
  2. Terminar de converter todos os arquivos
  3. Investigar arquivos em que a conversão ficou pessima.
  4. Fazer extração automatica de datas e prover um json com a ordem cronologica dos arquivos.
Owner
Breno Rodrigues Guimarães
Breno Rodrigues Guimarães
Automatically load and dump your dataclasses 📂🙋

file dataclasses Installation By default, filedataclasses comes with support for JSON files only. To support other formats like YAML and TOML, filedat

Alon 1 Dec 30, 2021
LSO, also known as Linux Swap Operator, is a software with both GUI and terminal versions that you can manage the Swap area for Linux operating systems.

LSO - Linux Swap Operator Türkçe - LSO Nedir? LSO, diğer adıyla Linux Swap Operator Linux işletim sistemleri için Swap alanını yönetebileceğiniz hem G

Eren İnce 4 Feb 09, 2022
Backend/API for the Mumble.dev, an open source social media application.

Welcome to the Mumble Api Repository Getting Started If you are trying to use this project for the first time, you can get up and running by following

Dennis Ivy 189 Dec 27, 2022
「📖」Tool created to extract metadata from a domain

Metafind is an OSINT tool created with the aim of automating the search for metadata of a particular domain from the search engine known as Google.

9 Dec 28, 2022
Module-based cryptographic tool

Cryptosploit A decryption/decoding/cracking tool using various modules. To use it, you need to have basic knowledge of cryptography. Table of Contents

/SNESE_AR\ 33 Nov 27, 2022
EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EasyBuild community 87 Dec 27, 2022
A script to generate NFT art living on the Solana blockchain.

NFT Generator This script generates NFT art based on its desired traits with their specific rarities. It has been used to generate the full collection

Rude Golems 24 Oct 08, 2022
Lightweight and Modern kernel for VK Bots

This is the kernel for creating VK Bots written in Python 3.9

Yrvijo 4 Nov 21, 2021
kurwa deska ADB

kurwa-deska-ADB kurwa-deska Запуск Linux -- python3 kurwa_deska.py Termux -- python3 kurwa_deska.py Встановлення cd kurwa_deska ADB і зразу запуск pyt

1 Jan 21, 2022
Script that creates graphical representations of Julia an Mandelbrot sets.

Julia and Mandelbrot Picture Maker This simple functions create simple plots of the Julia and Mandelbrot sets. The Julia set require the important par

Juan Riera Gomez 1 Jan 10, 2022
This repository containing cross-section cut and fill calculations using Python programming language.

cross-section This repository is containing cut and fill calculations for cross-section using Python programming language. This codes is made to calcu

3 Jun 15, 2022
If Google News had a Python library

pygooglenews If Google News had a Python library Created by Artem from newscatcherapi.com but you do not need anything from us or from anyone else to

Artem Bugara 1.1k Jan 08, 2023
A C-like hardware description language (HDL) adding high level synthesis(HLS)-like automatic pipelining as a language construct/compiler feature.

██████╗ ██╗██████╗ ███████╗██╗ ██╗███╗ ██╗███████╗ ██████╗ ██╔══██╗██║██╔══██╗██╔════╝██║ ██║████╗ ██║██╔════╝██╔════╝ ██████╔╝██║██████╔╝█

Julian Kemmerer 391 Jan 01, 2023
ThinkPHP全日志扫描工具,命令行版和BurpSuite插件版

ThinkPHP3和5日志扫描工具,提供命令行版和BurpSuite插件版,尽可能全的发掘网站日志信息 命令行版 安装 git clone https://github.com/r3change/TPLogScan.git cd TPLogScan/ pip install -r requireme

119 Dec 27, 2022
The docker-based Open edX distribution designed for peace of mind

Tutor: the docker-based Open edX distribution designed for peace of mind Tutor is a docker-based Open edX distribution, both for production and local

Overhang.IO 696 Dec 31, 2022
Script de monitoramento de telemetria para missões espaciais, cansat e foguetemodelismo.

Aeroespace_GroundStation Script de monitoramento de telemetria para missões espaciais, cansat e foguetemodelismo. Imagem 1 - Dashboard realizando moni

Vinícius Azevedo 5 Nov 27, 2022
This is a vscode extension with a Virtual Assistant that you can play with when you are bored or you need help..

VS Code Virtual Assistant This is a vscode extension with a Virtual Assistant that you can play with when you are bored or you need help. Its currentl

Soham Ghugare 6 Aug 22, 2021
An unofficial python API for trading on the DeGiro platform, with the ability to get real time data and historical data.

DegiroAPI An unofficial API for the trading platform Degiro written in Python with the ability to get real time data and historical data for products.

Jorrick Sleijster 5 Dec 16, 2022
Open Source defrag's mod code

Open Source defrag's mod code Goals: Code & License: Respect FOSS philosophy. Open source and community focus. Eliminate all traces of q3a-sdk licensi

sOkam! 1 Dec 10, 2022
A scuffed remake of Kahoot... Made by Y9 and Y10 SHSB

A scuffed remake of Kahoot... Made by Y9 and Y10 SHSB

Tobiloba Kujore 3 Oct 28, 2022