These scripts look for non-printable unicode characters in all text files in a source tree

Overview

find-unicode-control

These scripts look for non-printable unicode characters in all text files in a source tree. find_unicode_control.py should work with python2 as well as python3. It uses python-magic if available to determine file type, or simply spawns the file --mime-type command. They should be functionally the same and find_unicode_control.py could eventually get disposed.

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If unicode BIDI control characters or non-printable characters are found in a file, it will print output as follows:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

Using the -d flag, the output is more detailed, showing line numbers in files, but this mode is also slower:

find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow would be to do a quick scan through a source tree and if any issues are found, do a detailed scan on only those files.

Configuration file

If files need to be excluded from the scan, make a configuration file and define a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list with the list of mime types to ignore; this can again be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']

Notes

This script was quickly hacked together to scan repositories with mostly LTR, unicode content. If you have RTL content (either in comments, literals or even identifiers in code), it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude but a long term wish list (if this remains relevant, hopefully more sophisticated RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently.

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh
Siddhesh Poyarekar
Aggregating gridded data (xarray) to polygons

A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons.

Kevin Schwarzwald 42 Nov 09, 2022
Data Utilities e.g. for importing files to onetask

Use this repository to easily convert your source files (csv, txt, excel, json, html) into record-oriented JSON files that can be uploaded into onetask.

onetask.ai 1 Jul 18, 2022
A quick username checker to see if a username is available on a list of assorted websites.

A quick username checker to see if a username is available on a list of assorted websites.

Maddie 4 Jan 04, 2022
Pyfunctools is a module that provides functions, methods and classes that help in the creation of projects in python

Pyfunctools Pyfunctools is a module that provides functions, methods and classes that help in the creation of projects in python, bringing functional

Natanael dos Santos Feitosa 5 Dec 22, 2022
python script to generate color coded resistor images

Resistor image generator I got nerdsniped into making this. It's not finished at all, and the code is messy. The end goal it generate a whole E-series

MichD 1 Nov 12, 2021
Dice Rolling Simulator using Python-random

Dice Rolling Simulator As the name of the program suggests, we will be imitating a rolling dice. This is one of the interesting python projects and wi

PyLaboratory 1 Feb 02, 2022
A way to write regex with objects instead of strings.

Py Idiomatic Regex (AKA iregex) Documentation Available Here An easier way to write regex in Python using OOP instead of strings. Makes the code much

Ryan Peach 18 Nov 15, 2021
Script to generate a massive volume of data in sql, csv, json or xml format

DataGenerator Made with Python Open for pull requests 1. Dependencies To install required dependencies run pip install -r requirements.txt 2. Executi

icrescenti 3 Sep 20, 2022
Finds price floor for every single attribute in a given collection

Solana Solanart Scanner Enjoy the Free Code Steps to run Download VS Code

Dalton Nisbett 19 Oct 20, 2022
ULID implementation for Python

What is this? This is a port of the original JavaScript ULID implementation to Python. A ULID is a universally unique lexicographically sortable ident

Martin Domke 158 Jan 04, 2023
A simple dork generator written in python that outputs dorks with the domain extensions you enter

Dork Gen A simple dork generator written in python that outputs dorks with the domain extensions you enter in a ".txt file". Usage The code is pretty

Z3NToX 4 Oct 30, 2022
Simple script to export contacts from telegram into vCard file

Telegram Contacts Exporter Simple script to export contacts from telegram into vCard file Getting Started Prerequisites You must to put your Telegram

Pere Antoni 1 Oct 17, 2021
A random cats photos python module

A random cats photos python module

Fayas Noushad 6 Dec 01, 2021
A pythonic dependency injection library.

Pinject Pinject is a dependency injection library for python. The primary goal of Pinject is to help you assemble objects into graphs in an easy, main

Google 1.3k Dec 30, 2022
Michael Vinyard's utilities

Install vintools To download this package from pypi: pip install vintools Install the development package To download and install the developmen

Michael Vinyard 2 May 22, 2022
A simple and easy to use collection of random python functions.

A simple and easy to use collection of random python functions.

Diwan Mohamed Faheer 1 Nov 17, 2021
This is a tool to calculate a resulting color of the alpha blending process.

blec: alpha blending calculator This is a tool to calculate a resulting color of the alpha blending process. A gamma correction is enabled and the def

Igor Mikushkin 12 Sep 07, 2022
Course-parsing - Parsing Course Info for NIT Kurukshetra

Parsing Course Info for NIT Kurukshetra Overview This repository houses code for

Saksham Mittal 3 Feb 03, 2022
Software to help automate collecting crowdsourced annotations using Mechanical Turk.

Video Crowdsourcing Software to help automate collecting crowdsourced annotations using Mechanical Turk. The goal of this project is to enable crowdso

Mike Peven 1 Oct 25, 2021
Greenery - tools for parsing and manipulating regular expressions

Greenery - tools for parsing and manipulating regular expressions

qntm 242 Dec 15, 2022