🔎 Like Chardet. 🚀 Package for encoding & language detection. Charset detection.

Last update: Dec 31, 2022

Overview

Charset Detection, for Everyone 👋

^{The Real First Universal Charset Detector}

A library that helps you read text from an unknown charset encoding.
Motivated by chardet, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.

>>>>> 👉 Try Me Online Now, Then Adopt Me 👈 <<<<<

This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.

Feature	Chardet	Charset Normalizer	cChardet
`Fast`	❌	✔️	✔️
`Universal**`	❌	✔️	❌
`Reliable` without distinguishable standards	❌	✔️	✔️
`Reliable` with distinguishable standards	✔️	✔️	✔️
`Free & Open`	✔️	✔️	✔️
`License`	LGPL-2.1	MIT	MPL-1.1
`Native Python`	✔️	✔️	❌
`Detect spoken language`	❌	✔️	N/A
`Supported Encoding`	30	🎉 93	40

** : They are clearly using specific code for a specific encoding even if covering most of used one

⭐ Your support

Fork, test-it, star-it, submit your ideas! We do listen.

⚡ Performance

This package offer better performance than its counterpart Chardet. Here are some numbers.

Package	Accuracy	Mean per file (ms)	File per sec (est)
chardet	92 %	220 ms	5 file/sec
charset-normalizer	98 %	40 ms	25 file/sec

Package	99th percentile	95th percentile	50th percentile
chardet	1115 ms	300 ms	27 ms
charset-normalizer	460 ms	240 ms	18 ms

Chardet's performance on larger file (1MB+) are very poor. Expect huge difference on large payload.

Stats are generated using 400+ files using default parameters. More details on used files, see GHA workflows. And yes, these results might change at any time. The dataset can be updated to include more files. The actual delays heavily depends on your CPU capabilities. The factors should remain the same.

cchardet is a non-native (cpp binding) and unmaintained faster alternative with a better accuracy than chardet but lower than this package. If speed is the most important factor, you should try it.

✨ Installation

Using PyPi for latest stable

pip install charset-normalizer -U

If you want a more up-to-date unicodedata than the one available in your Python setup.

pip install charset-normalizer[unicode_backport] -U

🚀 Basic Usage

CLI

This package comes with a CLI.

usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.

normalizer ./data/sample.1.fr.srt

🎉 Since version 1.4.0 the CLI produce easily usable stdout result in JSON format.

{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}

Python

Just print out normalized text

from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

print(str(results.best()))

Normalize any text file

from charset_normalizer import normalize
try:
    normalize('./my_subtitle.srt') # should write to disk my_subtitle-***.srt
except IOError as e:
    print('Sadly, we are unable to perform charset normalization.', str(e))

Upgrade your code without effort

from charset_normalizer import detect

The above code will behave the same as chardet. We ensure that we offer the best (reasonable) BC result possible.

See the docs for advanced usage : readthedocs.io

😇 Why

When I started using Chardet, I noticed that it was not suited to my expectations, and I wanted to propose a reliable alternative using a completely different method. Also! I never back down on a good challenge!

I don't care about the originating charset encoding, because two different tables can produce two identical rendered string. What I want is to get readable text, the best I can.

In a way, I'm brute forcing text decoding. How cool is that ? 😎

Don't confuse package ftfy with charset-normalizer or chardet. ftfy goal is to repair unicode string whereas charset-normalizer to convert raw file in unknown encoding to unicode.

🍰 How

Discard all charset encoding table that could not fit the binary content.
Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding.
Extract matches with the lowest mess detected.
Additionally, we measure coherence / probe for a language.

Wait a minute, what is chaos/mess and coherence according to YOU ?

Chaos : I opened hundred of text files, written by humans, with the wrong encoding table. I observed, then I established some ground rules about what is obvious when it seems like a mess. I know that my interpretation of what is chaotic is very subjective, feel free to contribute in order to improve or rewrite it.

Coherence : For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought that intel is worth something here. So I use those records against decoded text to check if I can detect intelligent design.

⚡ Known limitations

Language detection is unreliable when text contains two or more languages sharing identical letters. (eg. HTML (english tags) + Turkish content (Sharing Latin characters))
Every charset detector heavily depends on sufficient content. In common cases, do not bother run detection on very tiny content.

👤 Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check issues page if you want to contribute.

📝 License

Comments

[Proposal] Add module creation with mypyc to speed up

Hello. I ran some tests to find bottlenecks and speed up the package. The easiest option, since you are already using mypy, is to compile the module during installation using mypyc. In this case the acceleration is about 2 times. Here are the results of the tests using your bin/performance.py file:

------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.03485252343844548s
   --> 99th: 0.2629306570015615s
   --> 95th: 0.14874039799906313s
   --> 50th: 0.02182378301222343s
------------------------------
--> Charset-Normalizer_m Conclusions (Charset-Normalizer, compiled with mypyc )
   --> Avg: 0.01605459922575392s
   --> 99th: 0.12211546800972428s
   --> 95th: 0.06977643301070202s
   --> 50th: 0.009204783011227846s
------------------------------
--> Chardet Conclusions
   --> Avg: 0.12291852888552735s
   --> 99th: 0.6617688919941429s
   --> 95th: 0.17344348499318585s
   --> 50th: 0.023028297000564635s
------------------------------
--> Cchardet Conclusions
   --> Avg: 0.003174804929368931s
   --> 99th: 0.04868195200106129s
   --> 95th: 0.008641656007966958s
   --> 50th: 0.0005420649977168068s

test_log.txt I think the acceleration would be greater if annotate all functions

enhancement

opened by deedy5 20

Don't inject unicodedata2 into sys.modules

I noticed charset_normalizer meddles with sys.modules, causing this:

>>> import charset_normalizer
>>> import unicodedata
>>> unicodedata
<module 'unicodedata2' from '.../site-packages/unicodedata2.cpython-39-darwin.so'>

This PR fixes that by using a fairly standard try: except ImportError: guard instead of the sys.modules hook.

>>> import charset_normalizer
>>> import unicodedata
>>> unicodedata
<module 'unicodedata' from '.../python3.9/lib-dynload/unicodedata.cpython-39-darwin.so'>

opened by akx 16

[Proposal] Incrase language coverage
Is your feature request related to a problem? Please describe. Not of a problem, more of an enhancement

Describe the solution you'd like Add other languages from other repos, assuming that they use the Unicode codepoint + n-grams model.

Describe alternatives you've considered

https://github.com/wooorm/franc/tree/master/packages/franc-all (JS, 401 languages)

Codepoints https://github.com/wooorm/franc/blob/master/packages/franc-all/expressions.js

Ngrams https://github.com/wooorm/franc/blob/master/packages/franc-all/data.json

https://github.com/cloudmark/language-detect (Python, 271 languages)

https://github.com/cloudmark/language-detect/tree/master/data/udhr

https://github.com/kapsteur/franco (Golang, 175 languages)

Codepoints https://github.com/kapsteur/franco/blob/master/expression_data.go

Ngrams https://github.com/kapsteur/franco/blob/master/script_data.go

https://github.com/patrickschur/language-detection (PHP, 110 languages)

https://github.com/patrickschur/language-detection/tree/master/resources

https://github.com/richtr/guessLanguage.js (JS, 100 languages)

Codepoints https://github.com/richtr/guessLanguage.js/blob/master/lib/guessLanguage.js

Ngrams https://github.com/richtr/guessLanguage.js/blob/master/lib/_languageData.js

https://github.com/saffsd/langid.py (Python, 97 languages)

Alternate https://github.com/saffsd/langid.c

Alternate https://github.com/saffsd/langid.js

Alternate https://github.com/carrotsearch/langid-java

https://github.com/feedbackmine/language_detector (Ruby, 96 languages)

https://github.com/feedbackmine/language_detector/tree/master/lib/training_data

https://github.com/jonathansp/guess-language (Golang, 94 languages)

Codepoints

https://github.com/jonathansp/guess-language/blob/master/data/blocks.go

https://github.com/jonathansp/guess-language/blob/master/data/languages.go

Ngrams

https://github.com/jonathansp/guess-language/blob/master/data/trigrams.go

https://github.com/abadojack/whatlanggo (Golang, 84 languages)

Codepoints

https://github.com/abadojack/whatlanggo/blob/master/script.go

https://github.com/abadojack/whatlanggo/blob/master/detect.go

Ngrams https://github.com/abadojack/whatlanggo/blob/master/lang.go

https://github.com/chattylabs/language-detector (JS, 73 language)

https://github.com/chattylabs/language-detector/tree/master/data/resources

https://github.com/optimaize/language-detector (Java, 71 languages)

https://github.com/endeveit/guesslanguage (Golang, 67 languages)

https://github.com/endeveit/guesslanguage/tree/master/models

https://github.com/dsc/guess-language (Python, 64 languages)

https://github.com/dsc/guess-language/tree/master/guess_language/trigrams

Co-reference https://github.com/kent37/guess-language

https://github.com/decultured/Python-Language-Detector (Python, 58 languages)

https://github.com/decultured/Python-Language-Detector/tree/master/trigrams

https://github.com/Mimino666/langdetect (Python, 55 languages)

Codepoints

https://github.com/Mimino666/langdetect/blob/master/langdetect/utils/unicode_block.py

https://github.com/Mimino666/langdetect/blob/master/langdetect/utils/messages.properties

https://github.com/Mimino666/langdetect/blob/master/langdetect/utils/ngram.py

Ngrams https://github.com/Mimino666/langdetect/tree/master/langdetect/profiles

https://github.com/pemistahl/lingua (Kotlin, 55 languages)

Codepoints https://github.com/pemistahl/lingua/blob/master/src/main/kotlin/com/github/pemistahl/lingua/internal/Alphabet.kt

Ngrams https://github.com/pemistahl/lingua/tree/master/src/main/resources/language-models

https://github.com/landrok/language-detector (PHP, 54 language)

https://github.com/landrok/language-detector/tree/master/src/LanguageDetector/subsets

https://github.com/shuyo/language-detection (Java, 53 languages)

https://github.com/newmsz/node-language-detection (JS, 53 languages)

Codepoints https://github.com/newmsz/node-language-detection/blob/master/index.js

Ngrams https://github.com/newmsz/node-language-detection/tree/master/profiles

https://github.com/pdonald/language-detection (C#, 53 languages)

https://github.com/pdonald/language-detection/tree/master/LanguageDetection/Profiles

https://github.com/malcolmgreaves/language-detection (Java, 53 languages)

https://github.com/FGRibreau/node-language-detect (JS, 52 languages)

Codepoints https://github.com/FGRibreau/node-language-detect/blob/master/data/unicode_blocks.json

Ngram https://github.com/FGRibreau/node-language-detect/blob/master/data/lang.json

https://github.com/webmil/text-language-detect (PHP, 52 languages)

Codepoints https://github.com/webmil/text-language-detect/blob/master/lib/data/unicode_blocks.dat

Ngram https://github.com/webmil/text-language-detect/blob/master/lib/data/lang.dat

https://github.com/pear/Text_LanguageDetect (PHP, 52 languages)

https://github.com/pear/Text_LanguageDetect/tree/master/data

https://github.com/Imaginatio/langdetect (Java, 50 languages)

https://github.com/Imaginatio/langdetect/tree/master/src/main/resources/profiles

https://github.com/dachev/node-cld (C++, 160 languages)

co-reference https://github.com/jtoy/cld

co-reference https://github.com/mzsanford/cld

co-reference https://github.com/jaukia/cld-js

co-reference https://github.com/vhyza/language_detection

Co-referecne https://github.com/ambs/Lingua-Identify-CLD

Co-reference https://github.com/jaukia/cld-js

https://github.com/CLD2Owners/cld2 (C++, 83 languages)

Co-reference https://github.com/rainycape/cld2

Co-reference https://github.com/dachev/node-cld

Co-reference https://github.com/ropensci/cld2

Co-reference https://github.com/fntlnz/cld2-php-ext

https://github.com/commoncrawl/language-detection-cld2 (Java)

https://github.com/lstrojny/php-cld (PHP)

enhancement good first issue
opened by DonaldTsang 13
charset_normalizer logging behavior

Hi @Ousret,

This is a bit of a continuation of #145. I wanted to start a discussion on the current logging levels and why they were chosen to better understand the use case/design decision. Most of that wasn't covered in the previous issue. I'd originally read this as being a DEBUG level log but realized I was mistaken, as it's INFO.

What do you envision the common case for logging these messages as INFO (there are more but we'll start here) [1][2][3][4]? What would the user be expected to do with the info provided? They seem like more of a stream of consciousness on what the hot path for the charset_normalizer is doing, rather than noting novel events. I'd personally not expect this to be relevant for general library usage. It probably becomes less relevant to libraries integrating with the project.

Currently, that would result in somewhere around 3 MB of logs per hour at 1 TPS which scales out to a couple gigabytes a month. While that's not huge, it's not trivial either. If you start to scale that up to 100s of TPS, we start recording closer to 250-500GB/mo. That's a lot of IO and potential disk space for long lived logs.
enhancement

opened by nateprewitt 9
Refactoring for potential performance improvements in loops
Experiments with ideas to potentially improve performance or code consistency without impacting readability (#111).

This PR:

defines caches and sets in cd.py

uses list comprehensions for language associations in cd.py

refactors duplicate code in md.py

Close #111
opened by adbar 9
Use unicodedata2 if available

https://pypi.org/project/unicodedata2/ is usually more up to date than even the latest cpython release.

iirc, using it is simply a matter of checking if unicodedata2 data version is higher than unicodedata, and if so sys.modules['unicodedata'] = unicodedata2 . Need to check that though
enhancement question

opened by jayvdb 8

Fixing some performance bottlenecks

pprofile tests

test.py

from glob import glob
from os.path import isdir
from charset_normalizer import detect

def performance_compare(size_coeff):
    if not isdir("./char-dataset"):
        print("This script require https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)
    for tbt_path in sorted(glob("./char-dataset/**/*.*")):
        with open(tbt_path, "rb") as fp:
            content = fp.read() * size_coeff            
        detect(content)

if __name__ == "__main__":
    performance_compare(1)

Before

pprofile --format callgrind --out cachegrind.out.original.test test.py

Time: 838.97 s. cachegrind.out.original.zip cachegrind out original test

Merged

pprofile --format callgrind --out cachegrind.out.commits.test test.py

Time: 716.45 s. cachegrind.out.commits.zip cachegrind out commits

opened by deedy5 7

Python 2 not yet supported

Traceback:
test/test_on_file.py:5: in <module>
    from charset_normalizer import CharsetNormalizerMatches as CnM
charset_normalizer/__init__.py:2: in <module>
    from charset_normalizer.normalizer import CharsetNormalizerMatches, CharsetNormalizerMatch
charset_normalizer/normalizer.py:3: in <module>
    import statistics
E   ImportError: No module named statistics

help wanted

opened by jayvdb 7

:wrench: Tweak/adjust the logging verbosity greater-eq to warning level
I understand that the latest release unexpectedly generated some noise for some people in specific environments.

The engagement I made with charset-normalizer given its wide deployments* still applies. Therefore, regarding :

https://github.com/spaam/svtplay-dl/issues/1445

https://github.com/home-assistant/core/issues/60615

https://github.com/Ousret/charset_normalizer/issues/145

With this PR I adjust the impact to a minimal impact while keeping backward compatibility. Fixes/Adress #145

*: Listening as broadly as possible regarding any side-effects to the community
enhancement bugfix release flourish
opened by Ousret 6
Revise the logger instanciation/initial handlers

I added the logging functionality described in the proposal. I also took care to make sure the explain argument would operate the same way. I left the behavior in api.py where if explain is not set, the logger will still log messages at the WARNING level. That behavior is really up to you as the package maintainer. It is as easy as removing that branch from the if statement and adding documentation to the repository that describes how a logger must be set via the handler if an application developer so desires.

I also added two simple tests that check whether the set_stream_handler function does what it should. Apologies if the tests are not in the correct style. Let me know if anything is in need of attention or you have changed your mind about the behavior change for logging. Thanks for the awesome library.

Close #134

opened by nmaynes 6
[BUG] Support for custom Python environment that ignore PEP 3120
Describe the bug With requests library using charset-normalizer I am getting an error when calling Python via User-Defined Transform in SAP BODS:

File "EXPRESSION", line 6, in <module> File "c:\program files\python39\lib\site-packages\requests\__init__.py", line 48, in <module> from charset_normalizer import __version__ as charset_normalizer_version File "c:\program files\python39\lib\site-packages\charset_normalizer\__init__.py", line 11 SyntaxError: Non-ASCII character '\xd1' in file c:\program files\python39\lib\site-packages\charset_normalizer\__init__.py on line 12, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details.

I am not able to define a source code encoding by placing a magic comment into the source files (either as a first or second line in the file) because the app probably modifies the script by itself (placing # -*- coding: utf-8 -*- doesn't help). The setting of environment variable PYTHONUTF8=1 doesn't help too.

To Reproduce I am not able to provide code to reproduce the issue, it arises when calling Python via User-Defined Transform in SAP BODS Please check: https://github.com/apache/superset/issues/15631 This could be the same problem: https://stackoverflow.com/questions/68594538/syntaxerror-non-ascii-character-xd1-in-file-charset-normalizer-init-py-i

Expected behavior No error - with requests version using chardet library there is no problem. Maybe avoiding non-ASCII characters in init.py could help...?

Logs Please see the bug description.

Desktop (please complete the following information):

OS: Windows 2016 Server

Python version 3.9.6

Package version 2.0.6

Requests version 2.26.0

Additional context N/A
bug help wanted
opened by kivhub 6
⬆️ Bump pypa/cibuildwheel from 2.11.2 to 2.11.4
Bumps pypa/cibuildwheel from 2.11.2 to 2.11.4.

Release notes

Sourced from pypa/cibuildwheel's releases.

v2.11.4

🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)

🛠 Updates CPython 3.11 to 3.11.1 (#1371)

🛠 Updates PyPy 3.7 to 3.7.10, except on macOS which remains on 7.3.9 due to a bug. (#1371)

📚 Added a reference to abi3audit to the docs (#1347)

v2.11.3

✨ Improves the 'build options' log output that's printed at the start of each run (#1352)

✨ Added a friendly error message to a common misconfiguration of the CIBW_TEST_COMMAND option - not specifying path using the {project} placeholder (#1336)

🛠 The GitHub Action now uses Powershell on Windows to avoid occasional incompabilities with bash (#1346)

Changelog

Sourced from pypa/cibuildwheel's changelog.

v2.11.4

24 Dec 2022

🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)

🛠 Updates CPython 3.11 to 3.11.1 (#1371)

🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug on that platform. (#1371)

📚 Added a reference to abi3audit to the docs (#1347)

v2.11.3

5 Dec 2022

✨ Improves the 'build options' log output that's printed at the start of each run (#1352)

✨ Added a friendly error message to a common misconfiguration of the CIBW_TEST_COMMAND option - not specifying path using the {project} placeholder (#1336)

🛠 The GitHub Action now uses Powershell on Windows to avoid occasional incompabilities with bash (#1346)

Commits

27fc88e Bump version: v2.11.4

a7e9ece Merge pull request #1371 from pypa/update-dependencies-pr

b9a3ed8 Update cibuildwheel/resources/build-platforms.toml

3dcc2ff fix: not skipping the tests stops the copy (Windows ARM) (#1377)

1c9ec76 Merge pull request #1378 from pypa/henryiii-patch-3

22b433d Merge pull request #1379 from pypa/pre-commit-ci-update-config

98fdf8c [pre-commit.ci] pre-commit autoupdate

cefc5a5 Update dependencies

e53253d ci: move to ubuntu 20

e9ecc65 [pre-commit.ci] pre-commit autoupdate (#1374)

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies github_actions
opened by dependabot[bot] 1
⬆️ Bump isort from 5.10.1 to 5.11.4
Bumps isort from 5.10.1 to 5.11.4.

Release notes

Sourced from isort's releases.

5.11.4

Changes

Remove obsolete toml import from the test suite (#1978) @mgorny

Stop installing documentation files to top-level site-packages (#2057) @mgorny

Only run release workflows for upstream (#2052) @hugovk

:package: Dependencies

Bump Poetry 1.3.1 (#2058) @staticdev

5.11.3

Changes

Renable portray (#2043) @timothycrosley

chore(ci): add minimum GitHub token permissions for workflows (#1969) @varunsh-coder

:beetle: Fixes

Fix packaging pypoetry (#2042) @staticdev

Fix settings for py3.11 (#2040) @staticdev

:construction_worker: Continuous Integration

General CI improvements (#2041) @staticdev

Add release workflow (#2026) @staticdev

v5.11.3

Changes

Renable portray (#2043) @timothycrosley

chore(ci): add minimum GitHub token permissions for workflows (#1969) @varunsh-coder

:beetle: Fixes

Fix packaging pypoetry (#2042) @staticdev

Fix settings for py3.11 (#2040) @staticdev

:construction_worker: Continuous Integration

General CI improvements (#2041) @staticdev

Add release workflow (#2026) @staticdev

5.11.2

Changes

Hotfix for --version. (#2035) @felixxm

5.11.1

Changes December 12 2022

... (truncated)

Changelog

Sourced from isort's changelog.

5.11.4 December 21 2022

Fixed #2038 (again): stop installing documentation files to top-level site-packages (#2057) @mgorny

CI: only run release workflows for upstream (#2052) @hugovk

Tests: remove obsolete toml import from the test suite (#1978) @mgorny

CI: bump Poetry 1.3.1 (#2058) @staticdev

5.11.3 December 16 2022

Fixed #2007: settings for py3.11 (#2040) @staticdev

Fixed #2038: packaging pypoetry (#2042) @staticdev

Docs: renable portray (#2043) @timothycrosley

Ci: add minimum GitHub token permissions for workflows (#1969) @varunsh-coder

Ci: general CI improvements (#2041) @staticdev

Ci: add release workflow (#2026) @staticdev

5.11.2 December 12 2022

Hotfix #2034: isort --version is not accurate on 5.11.x releases (#2034) @gschaffner

5.11.1 December 12 2022

Hotfix #2031: only call colorama.init if colorama is available (#2032) @tomaarsen

5.11.0 December 12 2022

Added official support for Python 3.11 (#1996, #2008, #2011) @staticdev

Dropped support for Python 3.6 (#2019) @barrelful

Fixed problematic tests (#2021, #2022) @staticdev

Fixed #1960: Rich compatibility (#1961) @ofek

Fixed #1945, #1986: Python 4.0 upper bound dependency resolving issues @staticdev

Fixed Pyodide CDN URL (#1991) @andersk

Docs: clarify description of use_parentheses (#1941) @mgedmin

Fixed #1976: black compatibility for .pyi files @XuehaiPan

Implemented #1683: magic trailing comma option (#1876) @legau

Add missing space in unrecoverable exception message (#1933) @andersk

Fixed #1895: skip-gitignore: use allow list, not deny list @bmalehorn

Fixed #1917: infinite loop for unmatched parenthesis (#1919) @anirudnits

Docs: shared profiles (#1896) @matthewhughes934

Fixed build-backend values in the example plugins (#1892) @mgorny

Remove reference to jamescurtin/isort-action (#1885) @AndrewLane

Split long cython import lines (#1931) @davidcollins001

Update plone profile: copy of black, plus three settings. (#1926) @mauritsvanrees

Fixed #1815, #1862: Add a command-line flag to sort all re-exports (#1863) @parafoxia

Fixed #1854: lines_before_imports appending lines after comments (#1861) @legau

Remove redundant multi_line_output = 3 from "Compatibility with black" (#1858) @jdufresne

Add tox config example (#1856) @umonaca

Docs: add examples for frozenset and tuple settings (#1822) @sgaist

Docs: add multiple config documentation (#1850) @anirudnits

Commits

98390f5 Merge pull request #2059 from PyCQA/version/5.11.4

df69a05 Bump version 5.11.4

f9add58 Merge pull request #2058 from PyCQA/deps/poetry-1.3.1

36caa91 Bump Poetry 1.3.1

3c2e2d0 Merge pull request #1978 from mgorny/toml-test

45d6abd Remove obsolete toml import from the test suite

3020e0b Merge pull request #2057 from mgorny/poetry-install

a6fdbfd Stop installing documentation files to top-level site-packages

ff306f8 Fix tag template to match old standard

227c4ae Merge pull request #2052 from hugovk/main

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies python
opened by dependabot[bot] 1
⬆️ Bump black from 22.10.0 to 22.12.0
Bumps black from 22.10.0 to 22.12.0.

Release notes

Sourced from black's releases.

22.12.0

Preview style

Enforce empty lines before classes and functions with sticky leading comments (#3302)

Reformat empty and whitespace-only files as either an empty file (if no newline is present) or as a single newline character (if a newline is present) (#3348)

Implicitly concatenated strings used as function args are now wrapped inside parentheses (#3307)

Correctly handle trailing commas that are inside a line's leading non-nested parens (#3370)

Configuration

Fix incorrectly applied .gitignore rules by considering the .gitignore location and the relative path to the target file (#3338)

Fix incorrectly ignoring .gitignore presence when more than one source directory is specified (#3336)

Parser

Parsing support has been added for walruses inside generator expression that are passed as function args (for example, any(match := my_re.match(text) for text in texts)) (#3327).

Integrations

Vim plugin: Optionally allow using the system installation of Black via let g:black_use_virtualenv = 0(#3309)

Changelog

Sourced from black's changelog.

22.12.0

Preview style

Enforce empty lines before classes and functions with sticky leading comments (#3302)

Reformat empty and whitespace-only files as either an empty file (if no newline is present) or as a single newline character (if a newline is present) (#3348)

Implicitly concatenated strings used as function args are now wrapped inside parentheses (#3307)

Correctly handle trailing commas that are inside a line's leading non-nested parens (#3370)

Configuration

Fix incorrectly applied .gitignore rules by considering the .gitignore location and the relative path to the target file (#3338)

Fix incorrectly ignoring .gitignore presence when more than one source directory is specified (#3336)

Parser

Parsing support has been added for walruses inside generator expression that are passed as function args (for example, any(match := my_re.match(text) for text in texts)) (#3327).

Integrations

Vim plugin: Optionally allow using the system installation of Black via let g:black_use_virtualenv = 0(#3309)

Commits

2ddea29 Prepare release 22.12.0 (#3413)

5b1443a release: skip bad macos wheels for now (#3411)

9ace064 Bump peter-evans/find-comment from 2.0.1 to 2.1.0 (#3404)

19c5fe4 Fix CI with latest flake8-bugbear (#3412)

d4a8564 Bump sphinx-copybutton from 0.5.0 to 0.5.1 in /docs (#3390)

2793249 Wordsmith current_style.md (#3383)

d97b789 Remove whitespaces of whitespace-only files (#3348)

c23a5c1 Clarify that Black runs with --safe by default (#3378)

8091b25 Correctly handle trailing commas that are inside a line's leading non-nested ...

ffaaf48 Compare each .gitignore found with an appropiate relative path (#3338)

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies python
opened by dependabot[bot] 1

Releases(3.0.1)

3.0.1(Nov 18, 2022)
3.0.1 (2022-11-18)

Fixed

Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)

Changed

Speedup provided using mypy/c 0.990 on Python >= 3.7

Source code(tar.gz)
Source code(zip)
3.0.0(Oct 20, 2022)
3.0.0 (2022-10-20)

Added

Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results

Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES

Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio

normalizer --version now specify if the current version provides extra speedup (meaning mypyc compilation whl)

Changed

Build with static metadata (not pyproject.toml yet)

Make language detection stricter

Optional: Module md.py can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1

Fixed

CLI with opt --normalize fail when using full path for files

TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha characters have been fed to it

Sphinx warnings when generating the documentation

Removed

Coherence detector no longer returns 'Simple English' instead returns 'English'

Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'

Breaking: Method first() and best() from CharsetMatch

UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)

Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches

Breaking: Top-level function normalize

Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch

Support for the backport unicodedata2

This is the last version (3.0.x) to support Python 3.6 We plan to drop it for 3.1.x
Source code(tar.gz)
Source code(zip)
charset_normalizer-3.0.0-cp310-cp310-macosx_10_9_universal2.whl(193.82 KB)
charset_normalizer-3.0.0-cp310-cp310-macosx_10_9_x86_64.whl(120.40 KB)
charset_normalizer-3.0.0-cp310-cp310-macosx_11_0_arm64.whl(117.62 KB)
charset_normalizer-3.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(190.53 KB)
charset_normalizer-3.0.0-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(203.79 KB)
charset_normalizer-3.0.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl(193.87 KB)
charset_normalizer-3.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(193.97 KB)
charset_normalizer-3.0.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(195.36 KB)
charset_normalizer-3.0.0-cp310-cp310-musllinux_1_1_aarch64.whl(187.48 KB)
charset_normalizer-3.0.0-cp310-cp310-musllinux_1_1_i686.whl(189.25 KB)
charset_normalizer-3.0.0-cp310-cp310-musllinux_1_1_ppc64le.whl(199.58 KB)
charset_normalizer-3.0.0-cp310-cp310-musllinux_1_1_s390x.whl(190.04 KB)
charset_normalizer-3.0.0-cp310-cp310-musllinux_1_1_x86_64.whl(187.73 KB)
charset_normalizer-3.0.0-cp310-cp310-win32.whl(86.69 KB)
charset_normalizer-3.0.0-cp310-cp310-win_amd64.whl(94.01 KB)
charset_normalizer-3.0.0-cp311-cp311-macosx_10_9_universal2.whl(191.40 KB)
charset_normalizer-3.0.0-cp311-cp311-macosx_10_9_x86_64.whl(119.25 KB)
charset_normalizer-3.0.0-cp311-cp311-macosx_11_0_arm64.whl(116.55 KB)
charset_normalizer-3.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(188.24 KB)
charset_normalizer-3.0.0-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(201.64 KB)
charset_normalizer-3.0.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl(192.50 KB)
charset_normalizer-3.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(192.04 KB)
charset_normalizer-3.0.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(193.20 KB)
charset_normalizer-3.0.0-cp311-cp311-musllinux_1_1_aarch64.whl(185.45 KB)
charset_normalizer-3.0.0-cp311-cp311-musllinux_1_1_i686.whl(187.14 KB)
charset_normalizer-3.0.0-cp311-cp311-musllinux_1_1_ppc64le.whl(197.37 KB)
charset_normalizer-3.0.0-cp311-cp311-musllinux_1_1_s390x.whl(188.18 KB)
charset_normalizer-3.0.0-cp311-cp311-musllinux_1_1_x86_64.whl(185.91 KB)
charset_normalizer-3.0.0-cp311-cp311-win32.whl(86.50 KB)
charset_normalizer-3.0.0-cp311-cp311-win_amd64.whl(93.59 KB)
charset_normalizer-3.0.0-cp36-cp36m-macosx_10_9_x86_64.whl(112.89 KB)
charset_normalizer-3.0.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(154.95 KB)
charset_normalizer-3.0.0-cp36-cp36m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(164.66 KB)
charset_normalizer-3.0.0-cp36-cp36m-manylinux_2_17_s390x.manylinux2014_s390x.whl(156.58 KB)
charset_normalizer-3.0.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(159.18 KB)
charset_normalizer-3.0.0-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(160.87 KB)
charset_normalizer-3.0.0-cp36-cp36m-musllinux_1_1_aarch64.whl(152.18 KB)
charset_normalizer-3.0.0-cp36-cp36m-musllinux_1_1_i686.whl(156.99 KB)
charset_normalizer-3.0.0-cp36-cp36m-musllinux_1_1_ppc64le.whl(161.33 KB)
charset_normalizer-3.0.0-cp36-cp36m-musllinux_1_1_s390x.whl(154.45 KB)
charset_normalizer-3.0.0-cp36-cp36m-musllinux_1_1_x86_64.whl(154.90 KB)
charset_normalizer-3.0.0-cp36-cp36m-win32.whl(83.28 KB)
charset_normalizer-3.0.0-cp36-cp36m-win_amd64.whl(89.19 KB)
charset_normalizer-3.0.0-cp37-cp37m-macosx_10_9_x86_64.whl(116.99 KB)
charset_normalizer-3.0.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(162.89 KB)
charset_normalizer-3.0.0-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(173.04 KB)
charset_normalizer-3.0.0-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl(164.25 KB)
charset_normalizer-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(166.55 KB)
charset_normalizer-3.0.0-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(168.13 KB)
charset_normalizer-3.0.0-cp37-cp37m-musllinux_1_1_aarch64.whl(159.29 KB)
charset_normalizer-3.0.0-cp37-cp37m-musllinux_1_1_i686.whl(164.25 KB)
charset_normalizer-3.0.0-cp37-cp37m-musllinux_1_1_ppc64le.whl(169.50 KB)
charset_normalizer-3.0.0-cp37-cp37m-musllinux_1_1_s390x.whl(161.36 KB)
charset_normalizer-3.0.0-cp37-cp37m-musllinux_1_1_x86_64.whl(162.41 KB)
charset_normalizer-3.0.0-cp37-cp37m-win32.whl(85.47 KB)
charset_normalizer-3.0.0-cp37-cp37m-win_amd64.whl(91.75 KB)
charset_normalizer-3.0.0-cp38-cp38-macosx_10_9_universal2.whl(191.68 KB)
charset_normalizer-3.0.0-cp38-cp38-macosx_10_9_x86_64.whl(119.23 KB)
charset_normalizer-3.0.0-cp38-cp38-macosx_11_0_arm64.whl(116.78 KB)
charset_normalizer-3.0.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(187.60 KB)
charset_normalizer-3.0.0-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(200.40 KB)
charset_normalizer-3.0.0-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl(190.65 KB)
charset_normalizer-3.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(190.81 KB)
charset_normalizer-3.0.0-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(193.14 KB)
charset_normalizer-3.0.0-cp38-cp38-musllinux_1_1_aarch64.whl(184.31 KB)
charset_normalizer-3.0.0-cp38-cp38-musllinux_1_1_i686.whl(188.71 KB)
charset_normalizer-3.0.0-cp38-cp38-musllinux_1_1_ppc64le.whl(196.42 KB)
charset_normalizer-3.0.0-cp38-cp38-musllinux_1_1_s390x.whl(188.83 KB)
charset_normalizer-3.0.0-cp38-cp38-musllinux_1_1_x86_64.whl(186.44 KB)
charset_normalizer-3.0.0-cp38-cp38-win32.whl(86.46 KB)
charset_normalizer-3.0.0-cp38-cp38-win_amd64.whl(93.31 KB)
charset_normalizer-3.0.0-cp39-cp39-macosx_10_9_universal2.whl(193.84 KB)
charset_normalizer-3.0.0-cp39-cp39-macosx_10_9_x86_64.whl(120.40 KB)
charset_normalizer-3.0.0-cp39-cp39-macosx_11_0_arm64.whl(117.66 KB)
charset_normalizer-3.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(190.93 KB)
charset_normalizer-3.0.0-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(203.82 KB)
charset_normalizer-3.0.0-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl(194.14 KB)
charset_normalizer-3.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(194.00 KB)
charset_normalizer-3.0.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(195.75 KB)
charset_normalizer-3.0.0-cp39-cp39-musllinux_1_1_aarch64.whl(187.57 KB)
charset_normalizer-3.0.0-cp39-cp39-musllinux_1_1_i686.whl(189.55 KB)
charset_normalizer-3.0.0-cp39-cp39-musllinux_1_1_ppc64le.whl(199.35 KB)
charset_normalizer-3.0.0-cp39-cp39-musllinux_1_1_s390x.whl(190.34 KB)
charset_normalizer-3.0.0-cp39-cp39-musllinux_1_1_x86_64.whl(187.73 KB)
charset_normalizer-3.0.0-cp39-cp39-win32.whl(86.74 KB)
charset_normalizer-3.0.0-cp39-cp39-win_amd64.whl(94.00 KB)
charset_normalizer-3.0.0-py3-none-any.whl(44.43 KB)
3.0.0rc1(Oct 18, 2022)
This is the last pre-release. If everything goes well, I will publish the stable tag.

3.0.0rc1 (2022-10-18)

Added

Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results

Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES

Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio

Changed

Build with static metadata using 'build' frontend

Make language detection stricter

Fixed

CLI with opt --normalize fail when using full path for files

TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha characters have been fed to it

Removed

Coherence detector no longer returns 'Simple English' instead returns 'English'

Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'

Source code(tar.gz)
Source code(zip)
3.0.0b2(Aug 21, 2022)
3.0.0b2 (2022-08-21)

Added

normalizer --version now specify if current version provide extra speedup (meaning mypyc compilation whl)

Removed

Breaking: Method first() and best() from CharsetMatch

UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)

Fixed

Sphinx warnings when generating the documentation

Source code(tar.gz)
Source code(zip)
2.1.1(Aug 19, 2022)
2.1.1 (2022-08-19)

Deprecated

Function normalize scheduled for removal in 3.0

Changed

Removed useless call to decode in fn is_unprintable (#206)

Fixed

Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from @aleksandernovikov (#204)

Source code(tar.gz)
Source code(zip)
3.0.0b1(Aug 15, 2022)
3.0.0b1 (2022-08-15)

Changed

Optional: Module md.py can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1

Removed

Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches

Breaking: Top-level function normalize

Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch

Support for the backport unicodedata2

Source code(tar.gz)
Source code(zip)
charset_normalizer-3.0.0b1-cp310-cp310-macosx_10_9_universal2.whl(181.06 KB)
charset_normalizer-3.0.0b1-cp310-cp310-macosx_10_9_x86_64.whl(110.69 KB)
charset_normalizer-3.0.0b1-cp310-cp310-macosx_11_0_arm64.whl(108.03 KB)
charset_normalizer-3.0.0b1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(179.39 KB)
charset_normalizer-3.0.0b1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(191.60 KB)
charset_normalizer-3.0.0b1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl(182.72 KB)
charset_normalizer-3.0.0b1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(181.86 KB)
charset_normalizer-3.0.0b1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(183.20 KB)
charset_normalizer-3.0.0b1-cp310-cp310-musllinux_1_1_aarch64.whl(176.64 KB)
charset_normalizer-3.0.0b1-cp310-cp310-musllinux_1_1_i686.whl(177.76 KB)
charset_normalizer-3.0.0b1-cp310-cp310-musllinux_1_1_ppc64le.whl(187.55 KB)
charset_normalizer-3.0.0b1-cp310-cp310-musllinux_1_1_s390x.whl(179.21 KB)
charset_normalizer-3.0.0b1-cp310-cp310-musllinux_1_1_x86_64.whl(176.61 KB)
charset_normalizer-3.0.0b1-cp310-cp310-win32.whl(77.59 KB)
charset_normalizer-3.0.0b1-cp310-cp310-win_amd64.whl(84.61 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-macosx_10_9_x86_64.whl(103.94 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(146.32 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(155.79 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-manylinux_2_17_s390x.manylinux2014_s390x.whl(147.94 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(149.84 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(151.61 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-musllinux_1_1_aarch64.whl(143.46 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-musllinux_1_1_i686.whl(147.85 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-musllinux_1_1_ppc64le.whl(152.79 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-musllinux_1_1_s390x.whl(145.71 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-musllinux_1_1_x86_64.whl(145.87 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-win32.whl(74.41 KB)
charset_normalizer-3.0.0b1-cp36-cp36m-win_amd64.whl(80.03 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-macosx_10_9_x86_64.whl(107.49 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(152.20 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(163.00 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl(154.13 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(156.30 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(157.62 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-musllinux_1_1_aarch64.whl(149.14 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-musllinux_1_1_i686.whl(153.88 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-musllinux_1_1_ppc64le.whl(159.77 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-musllinux_1_1_s390x.whl(151.54 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-musllinux_1_1_x86_64.whl(151.98 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-win32.whl(76.64 KB)
charset_normalizer-3.0.0b1-cp37-cp37m-win_amd64.whl(82.53 KB)
charset_normalizer-3.0.0b1-cp38-cp38-macosx_10_9_universal2.whl(179.39 KB)
charset_normalizer-3.0.0b1-cp38-cp38-macosx_10_9_x86_64.whl(109.85 KB)
charset_normalizer-3.0.0b1-cp38-cp38-macosx_11_0_arm64.whl(107.35 KB)
charset_normalizer-3.0.0b1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(176.27 KB)
charset_normalizer-3.0.0b1-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(188.29 KB)
charset_normalizer-3.0.0b1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl(179.73 KB)
charset_normalizer-3.0.0b1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(179.26 KB)
charset_normalizer-3.0.0b1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(180.84 KB)
charset_normalizer-3.0.0b1-cp38-cp38-musllinux_1_1_aarch64.whl(173.77 KB)
charset_normalizer-3.0.0b1-cp38-cp38-musllinux_1_1_i686.whl(177.20 KB)
charset_normalizer-3.0.0b1-cp38-cp38-musllinux_1_1_ppc64le.whl(184.60 KB)
charset_normalizer-3.0.0b1-cp38-cp38-musllinux_1_1_s390x.whl(177.90 KB)
charset_normalizer-3.0.0b1-cp38-cp38-musllinux_1_1_x86_64.whl(175.23 KB)
charset_normalizer-3.0.0b1-cp38-cp38-win32.whl(77.61 KB)
charset_normalizer-3.0.0b1-cp38-cp38-win_amd64.whl(84.11 KB)
charset_normalizer-3.0.0b1-cp39-cp39-macosx_10_9_universal2.whl(181.10 KB)
charset_normalizer-3.0.0b1-cp39-cp39-macosx_10_9_x86_64.whl(110.64 KB)
charset_normalizer-3.0.0b1-cp39-cp39-macosx_11_0_arm64.whl(108.04 KB)
charset_normalizer-3.0.0b1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl(179.37 KB)
charset_normalizer-3.0.0b1-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl(191.34 KB)
charset_normalizer-3.0.0b1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl(182.73 KB)
charset_normalizer-3.0.0b1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl(181.77 KB)
charset_normalizer-3.0.0b1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl(183.34 KB)
charset_normalizer-3.0.0b1-cp39-cp39-musllinux_1_1_aarch64.whl(176.56 KB)
charset_normalizer-3.0.0b1-cp39-cp39-musllinux_1_1_i686.whl(177.82 KB)
charset_normalizer-3.0.0b1-cp39-cp39-musllinux_1_1_ppc64le.whl(187.37 KB)
charset_normalizer-3.0.0b1-cp39-cp39-musllinux_1_1_s390x.whl(179.27 KB)
charset_normalizer-3.0.0b1-cp39-cp39-musllinux_1_1_x86_64.whl(176.50 KB)
charset_normalizer-3.0.0b1-cp39-cp39-win32.whl(77.61 KB)
charset_normalizer-3.0.0b1-cp39-cp39-win_amd64.whl(84.59 KB)
charset_normalizer-3.0.0b1-py3-none-any.whl(37.62 KB)
2.1.0(Jun 19, 2022)
2.1.0 (2022-06-19)

Added

Output the Unicode table version when running the CLI with --version (PR #194)

Changed

Re-use decoded buffer for single byte character sets from @nijel (PR #175)

Fixing some performance bottlenecks from @deedy5 (PR #183)

Fixed

Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)

CLI default threshold aligned with the API threshold from @oleksandr-kuzmenko (PR #181)

Removed

Support for Python 3.5 (PR #192)

Deprecated

Use of backport unicodedata from unicodedata2 as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)

Source code(tar.gz)
Source code(zip)
2.0.12(Feb 12, 2022)
2.0.12 (2022-02-12)

Fixed

ASCII miss-detection on rare cases (PR #170)

Source code(tar.gz)
Source code(zip)
2.0.11(Jan 30, 2022)
2.0.11 (2022-01-30)

Added

Explicit support for Python 3.11 (PR #164)

Changed

The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)

Source code(tar.gz)
Source code(zip)
2.0.10(Jan 4, 2022)
2.0.10 (2022-01-04)

Fixed

Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)

Changed

Skipping the language-detection (CD) on ASCII (PR #155)

Source code(tar.gz)
Source code(zip)
2.0.9(Dec 3, 2021)
2.0.9 (2021-12-03)

Changed

Moderating the logging impact (since 2.0.8) for specific environments (PR #147)

Fixed

Wrong logging level applied when setting kwarg explain to True (PR #146)

Source code(tar.gz)
Source code(zip)
2.0.8(Nov 24, 2021)
Changed

Improvement over Vietnamese detection (PR #126)

MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)

Efficiency improvements in cd/alphabet_languages from @adbar (PR #122)

call sum() without an intermediary list following PEP 289 recommendations from @adbar (PR #129)

Code style as refactored by Sourcery-AI (PR #131)

Minor adjustment on the MD around european words (PR #133)

Remove and replace SRTs from assets / tests (PR #139)

Initialize the library logger with a NullHandler by default from @nmaynes (PR #135)

Setting kwarg explain to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)

Fixed

Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)

Avoid using too insignificant chunk (PR #137)

Added

Add and expose function set_logging_handler to configure a specific StreamHandler from @nmaynes (PR #135)

Add CHANGELOG.md entries, format is based on Keep a Changelog (PR #141)

Source code(tar.gz)
Source code(zip)
2.0.7(Oct 11, 2021)
We arrived in a pretty stable state.

Changes:

Addition: :bento: Add support for Kazakh (Cyrillic) language detection #109

Improvement: :sparkle: Further improve inferring the language from a given code page (single-byte) #112

Removed: :fire: Remove redundant logging entry about detected language(s) #115

Miscellaneous: :wrench: Trying to leverage PEP263 when PEP3120 is not supported #116

While I do not think that this (116) will actually fix something, it will rather raise a SyntaxError (Not about ASCII decoding error) for those trying to install this package using a non-supported Python version

Improvement: :zap: Refactoring for potential performance improvements in loops #113 @adbar

Improvement: :sparkles: Various detection improvement (MD+CD) #117

Bugfix: :bug: Fix a minor inconsistency between Python 3.5 and other versions regarding language detection #117 #102

This version pushes forward the detection-coverage to 98%! https://github.com/Ousret/charset_normalizer/runs/3863881150 The great filter (cannot be better than) shall be 99% in conjunction with the current dataset. In future releases.
Source code(tar.gz)
Source code(zip)
2.0.6(Sep 17, 2021)
Changes:

Bugfix: :bug: Unforeseen regression with the loss of the backward-compatibility with some older minor of Python 3.5.x #100

Bugfix: :bug: Fix CLI crash when using --minimal output in certain cases #103

Improvement: :sparkles: Minor improvement to the detection efficiency (less than 1%) #106 #101

Source code(tar.gz)
Source code(zip)
2.0.5(Sep 14, 2021)

Changes:

Internal: :art: The project now comply with: flake8, mypy, isort and black to ensure a better overall quality #81 Internal: :art: The MANIFEST.in was not exhaustive #78 Improvement: :sparkles: The BC-support with v1.x was improved, the old staticmethods are restored #82 Remove: :fire: The project no longer raise warning on tiny content given for detection, will be simply logged as warning instead #92 Improvement: :sparkles: The Unicode detection is slightly improved, see #93 Bugfix: :bug: In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection #95 Bugfix: :bug: Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection #96 Improvement: :art: Add syntax sugar __bool__ for results CharsetMatches list-container see #91

This release push further the detection coverage to 97 % !
Source code(tar.gz)
Source code(zip)
2.0.4(Jul 30, 2021)
Changes:

Improvement: ❇️ Adjust the MD to lower the sensitivity, thus improving the global detection reliability (#69 #76)

Improvement: ❇️ Allow fallback on specified encoding if any (#71)

Bugfix: 🐛 The CLI no longer raise an unexpected exception when no encoding has been found (#70)

Bugfix: 🐛 Fix accessing the 'alphabets' property when the payload contains surrogate characters (#68)

Bugfix: 🐛 ✏️ The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (in #72)

Bugfix: 🐛 Submatch factoring could be wrong in rare edge cases (in #72)

Bugfix: 🐛 Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (in #72)

Internal: 🎨 Fix line endings from CRLF to LF for certain files (#67)

Source code(tar.gz)
Source code(zip)
2.0.3(Jul 16, 2021)
Changes:

Improvement: ✨ Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. #63 Fix #62

Improvement: ✨According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. #64 Complete #62

Be assured that this project is disposed to listen to any of your concerns you may have. I know the vast majority did not expect requests to switch Chardet to Charset-Normalizer. I am inclined to make this change worth it, only together that we can achieve great things. Do not hesitate to leave feedback or bug report, will answer them all!
Source code(tar.gz)
Source code(zip)
2.0.2(Jul 14, 2021)
Changes:

Bugfix: 🐛 Empty/Too small JSON payload miss-detection fixed (#59) Thanks @tseaver for the report

Improvement: 🎇 Don't inject unicodedata2 into sys.modules (#57) @akx

Source code(tar.gz)
Source code(zip)
2.0.1(Jul 13, 2021)
Minor bug fixes release.

Changes:

Bugfix: :bug: Make it work where there isn't a filesystem available, dropping assets frequencies.json #54 #55 original report by @sethmlarson

Improvement: :sparkles: You may now use aliases in cp_isolation and cp_exclusion arguments #47

Bugfix: :bug: Using explain=False permanently disable the verbose output in the current runtime #47

Bugfix: :bug: One log entry (language target preemptive) was not show in logs when using explain=True #47

Bugfix: :bug: Fix undesired exception (ValueError) on getitem of instance CharsetMatches #52

Improvement: :wrench: Public function normalize default args values were not aligned with from_bytes #53

Source code(tar.gz)
Source code(zip)
2.0.0(Jul 2, 2021)
This package is reaching its two years of existence, now is a good time for a nice refresh.

Changes: See PR #45

Performance: ⚡ 4x to 5 times faster than the previous 1.4.0 release.

Performance: ⚡ At least 2x faster than Chardet.

Performance: ⚡ Accent has been made on UTF-8 detection, should perform rather instantaneous.

Improvement: 🔙 The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.

Improvement: ❇️ The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)

Code: 🎨 The program has been rewritten to ease the readability and maintainability. (+Using static typing)

Tests: ✔️ New workflows are now in place to verify the following aspects: Performance, Backward-Compatibility with Chardet, and Detection Coverage in addition to currents tests. (+CodeQL)

Dependency: ➖ This package no longer require anything when used with Python 3.5 (Dropped cached_property)

Docs: ✏️ Performance claims have been updated, the guide to contributing, and the issue template.

Improvement: ❇️ Add --version argument to CLI

Bugfix: 🐛 The CLI output used the relative path of the file(s). Should be absolute.

Deprecation: 🔴 Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0

Improvement: ❇️ If no language was detected in content, trying to infer it using the encoding name/alphabets used.

Removal: 🔥 Removed support for these languages: Catalan, Esperanto, Kazakh, Baque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.

Improvement: ❇️ utf_7 detection has been reinstated.

Removal: 🔥 The exception hook on UnicodeDecodeError has been removed.

After much consideration, this release won't drop Python 3.5 in v2.
Source code(tar.gz)
Source code(zip)
1.4.1(May 28, 2021)
Changes :

Improvement: :art: Logger configuration/usage no longer conflict with others #44

Source code(tar.gz)
Source code(zip)
1.4.0(May 21, 2021)
Changes :

Thanks to @potiuk for his tests/ideas that permitted us to improve the quality of this project.

Dependency: ➖ Using standard logging instead of using the package loguru.

Dependency: ➖ Dropping nose test framework in favor of the maintained pytest.

Dependency: ➖ Choose to not use dragonmapper package to help with gibberish Chinese/CJK text.

Dependency: :wrench: ➖ Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.

Bugfix: 🐛 BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.

Improvement: 🎇 Return ASCII if given sequences fit. Given reasonable confidence.

Performance: ⚡ Huge improvement over the larges payload.

Change: 🔥 Stop support for UTF-7 that does not contain a SIG. (Contributions are welcome to improve that point)

Feature: 🎇 CLI now produces JSON consumable output.

Dependency: Dropping PrettyTable, replaced with pure JSON output.

Bugfix: 🐛 Not searching properly for the BOM when trying utf32/16 parent codec.

Other: ⚡ Improving the package final size by compressing frequencies.json.

This project no longer requires anything except for python 3.5. It is still supported even if passed EOL. Version 2.x will require Python 3.6+
Source code(tar.gz)
Source code(zip)
1.3.9(May 13, 2021)
Changes :

Bugfix: :bug: In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload #40

Source code(tar.gz)
Source code(zip)
1.3.8(May 12, 2021)
Changes :

Bugfix: :bug: Empty given payload for detection may cause an exception if trying to access the alphabets property. #39

Source code(tar.gz)
Source code(zip)
1.3.7(May 12, 2021)
Changes :

Bugfix: :bug: The legacy detect function should return UTF-8-SIG if sig is present in the payload. #38

Source code(tar.gz)
Source code(zip)
1.3.6(Feb 9, 2021)

Amend the previous release to allow prettytable 2.0 Thanks to @jayvdb #35
Source code(tar.gz)
Source code(zip)
1.3.5(Feb 8, 2021)
Changes :

Miscellaneous: 🔧 Dependencies refactor, add python 3.9 and 3.10 to the supported interpreters

Bugfix: 🐛 Fix error while using the package with a python pre-release interpreter #33

Small refresh to keep the project up and running until further dev. (Upcoming version 1.4.0) Thanks to the many adopters.
Source code(tar.gz)
Source code(zip)
1.3.4(Dec 16, 2019)
Changes :

Improvement/Bugfix : False positive when searching for successive upper, lower char. (ProbeChaos) (#31)

Source code(tar.gz)
Source code(zip)
1.3.3(Dec 16, 2019)
Changes :

Improvement : Noticeable better detection for jp #30

Source code(tar.gz)
Source code(zip)
1.3.2(Dec 13, 2019)
Changes :

Bugfix : Passing zero-length bytes to from_bytes (#29)

Source code(tar.gz)
Source code(zip)

🔎 Like Chardet. 🚀 Package for encoding & language detection. Charset detection.

Related tags

Overview

Charset Detection, for Everyone 👋

⭐ Your support

⚡ Performance

✨ Installation

🚀 Basic Usage

CLI

Python

😇 Why

🍰 How

⚡ Known limitations

👤 Contributing

📝 License

Comments

pprofile tests

Before

Merged

v2.11.4

v2.11.3

v2.11.4

v2.11.3

5.11.4

Changes

:package: Dependencies

5.11.3

Changes

:beetle: Fixes

:construction_worker: Continuous Integration

v5.11.3

Changes

:beetle: Fixes

:construction_worker: Continuous Integration

5.11.2

Changes

5.11.1

Changes December 12 2022

5.11.4 December 21 2022

5.11.3 December 16 2022

5.11.2 December 12 2022

5.11.1 December 12 2022

5.11.0 December 12 2022

22.12.0

Preview style

Configuration

Parser

Integrations

22.12.0

Preview style

Configuration

Parser

Integrations

Releases(3.0.1)

3.0.1(Nov 18, 2022)

3.0.1 (2022-11-18)

Fixed

Changed

3.0.0(Oct 20, 2022)

3.0.0 (2022-10-20)

Added

Changed

Fixed

Removed

3.0.0rc1(Oct 18, 2022)

3.0.0rc1 (2022-10-18)

Added

Changed

Fixed

Removed

3.0.0b2(Aug 21, 2022)

3.0.0b2 (2022-08-21)

Added

Removed

Fixed

2.1.1(Aug 19, 2022)

2.1.1 (2022-08-19)

Deprecated

Changed

Fixed