An extremely configurable markdown reverser for Python3.

Overview

🔄 Unmarkd


A markdown reverser.


Unmarkd is a BeautifulSoup-powered Markdown reverser written in Python and for Python.

Why

This was created as a dependency of StackSearch (one of my other projects). To build a better API, I needed a way to convert HTML back into Markdown, so I wrote this.

There are similar projects written in Ruby, but at first I could not find any written in (or for) Python. Later I found a popular library, html2text, but Unmarkd is still better. See comparison.

Installation

You know the drill

pip install unmarkd

Known issues

  • Nested lists are not properly indented (#4) Fixed in #11
  • Blockquote bug (#18) Fixed in #23

Comparison

TL;DR: Html2Text is fast. If you don't need much configuration, you could use Html2Text for the little speed increase.

Speed

TL;DR: Unmarkd < Html2Text

Html2Text is basically faster:

Benchmark

(The DOC variable used can be found here)

Unmarkd sacrifices speed for power.

Html2Text directly uses Python's html.parser module (in the standard library). On the other hand, Unmarkd uses the powerful HTML parsing library, beautifulsoup4. BeautifulSoup can be configured to use different HTML parsers. In Unmarkd, we configure it to use Python's html.parser, too.

But another layer of code means more code is run.

I hope that's a good explanation of the speed difference.
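
To make the "layers" point concrete, here is a minimal sketch of a converter driven directly by html.parser events, with no tree-building library in between. It is not how Html2Text or Unmarkd is actually implemented; the TinyReverser class, its DELIMITERS table, and the reverse helper are names made up for this illustration.

```python
from html.parser import HTMLParser


class TinyReverser(HTMLParser):
    """Toy HTML-to-Markdown converter driven directly by html.parser events."""

    # Hypothetical mapping for this sketch only: tag name -> Markdown delimiter.
    DELIMITERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Emit the opening delimiter (or nothing for unknown tags).
        self.parts.append(self.DELIMITERS.get(tag, ""))

    def handle_endtag(self, tag):
        # Emphasis delimiters are symmetric, so closing uses the same string.
        self.parts.append(self.DELIMITERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def reverse(html: str) -> str:
    parser = TinyReverser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(reverse("<b>I <i>love</i> markdown!</b>"))
```

Every tag is handled the moment the parser sees it. A tree-based approach first builds the whole document and then walks it, which costs more but makes context-dependent conversions easier to write.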

Correctness

TL;DR: Unmarkd == Html2Text

I actually found two html-to-markdown libraries. One of them was Tomd which had an incorrect implementation:

Actual results

It seems to be abandoned, anyway.

Now with Html2Text and Unmarkd:

Epic showdown

In other words, they work.

Configurability

TL;DR: Unmarkd > Html2Text

This is Unmarkd's strong point.

In Html2Text, you only have a limited set of options.

In Unmarkd, you can subclass the BaseUnmarker and implement conversions for new tags, etc. In my opinion, it's much easier to extend and configure Unmarkd.

Unmarkd was originally written as a StackSearch dependency.

Html2Text has no options for configuring parsing of code blocks. Unmarkd does.

Documentation

Here's an example of basic usage:

import unmarkd
print(unmarkd.unmark("<b>I <i>love</i> markdown!</b>"))
# Output: **I *love* markdown!**

or something more complex (shamelessly taken from here):

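import unmarkd

html_doc = R"""
<h1>Sample Markdown</h1>
<p>This is some basic, sample markdown.</p>
<h2>Second Heading</h2>
<ul>
<li>Unordered lists, and:
<ol>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ol>
</li>
<li>More</li>
</ul>
<blockquote>
<p>Blockquote</p>
</blockquote>
<p>And <strong>bold</strong>, <em>italics</em>, and even <em>italics and later <strong>bold</strong></em>. Even <del>strikethrough</del>. <a href="https://markdowntohtml.com">A link</a> to somewhere.</p>
<p>And code highlighting:</p>
<pre><code class="lang-js">var foo = 'bar';

function baz(s) {
   return foo + ':' + s;
}
</code></pre>
<p>Or inline code like <code>var foo = 'bar';</code>.</p>
<p>Or an image of bears</p>
<p><img src="http://placebear.com/200/200" alt="bears"></p>
<p>The end ...</p>
"""
print(unmarkd.unmark(html_doc))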

and the output:

    # Sample Markdown


    This is some basic, sample markdown.

    ## Second Heading



    - Unordered lists, and:
     1. One
     2. Two
     3. Three
    - More

    >Blockquote


    And **bold**, *italics*, and even *italics and later **bold***. Even ~~strikethrough~~. [A link](https://markdowntohtml.com) to somewhere.

    And code highlighting:


    ```js
    var foo = 'bar';

    function baz(s) {
       return foo + ':' + s;
    }
    ```


    Or inline code like `var foo = 'bar';`.

    Or an image of bears

    ![bears](http://placebear.com/200/200)

    The end ...

Extending

Brief Overview

Most functionality should be covered by the BasicUnmarker class defined in unmarkd.unmarkers.

If you need to reverse markdown from StackExchange (as in the case for my other project), you may use the StackOverflowUnmarker (or its alias, StackExchangeUnmarker), which is also defined in unmarkd.unmarkers.

Customizing

If the above two classes do not suit your needs, you can subclass the unmarkd.unmarkers.BaseUnmarker abstract class.

Currently, you can optionally override the following methods:

  • detect_language (parameters: 1)
    • Parameters:
      • html: bs4.BeautifulSoup
    • When a fenced code block is encountered, this function is called with the element the code block was detected from (i.e. pre), passed as a bs4.BeautifulSoup.
    • This function is responsible for detecting the programming language of the code block (or returning '' if none was detected).
    • Note: The default implementation differs from the one in unmarkd.unmarkers.BasicUnmarker: it is simpler and does less checking/filtering.
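
As a rough idea of what a language detector can look like, here is a standalone sketch that works on a plain list of class names rather than a bs4 element. The function name and the "language-" / "lang-" prefixes are assumptions made for this example, not Unmarkd's actual logic.

```python
def detect_language_from_classes(classes):
    """Return the language named by a list of CSS classes, or '' if none is found."""
    # Hypothetical prefixes for this sketch; real pages may use other conventions.
    prefixes = ("language-", "lang-")
    for name in classes or []:
        for prefix in prefixes:
            if name.startswith(prefix):
                return name[len(prefix):]
    return ""


print(detect_language_from_classes(["hljs", "language-python"]))
print(detect_language_from_classes(["lang-js"]))
```

In a real detect_language override you would first read the class list from the pre element (or the code element inside it) and then apply logic along these lines.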

But Unmarkd is more flexible than that.

Customizable constants

There are currently 3 constants you may override:

  • Formats: NOTE: Use the Format String Syntax
    • UNORDERED_FORMAT
      • The string format of unordered (bulleted) lists.
    • ORDERED_FORMAT
      • The string format of ordered (numbered) lists.
  • Miscellaneous:
    • ESCAPABLES
      • A container (preferably a set) of length-1 str that should be escaped
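
To show what "Format String Syntax" means here, the following standalone sketch renders list items with str.format. The placeholder names {index} and {item}, the default values, and the render_list helper are all hypothetical; check the library source for the fields the real constants expect.

```python
# Hypothetical placeholder names and defaults, for illustration only.
UNORDERED_FORMAT = "\n- {item}"
ORDERED_FORMAT = "\n{index}. {item}"


def render_list(items, ordered=False, unordered_format=UNORDERED_FORMAT,
                ordered_format=ORDERED_FORMAT):
    """Render items as a Markdown list using the given format strings."""
    lines = []
    for index, item in enumerate(items, start=1):
        if ordered:
            lines.append(ordered_format.format(index=index, item=item))
        else:
            lines.append(unordered_format.format(item=item))
    return "".join(lines)


print(render_list(["One", "Two"], ordered=True))
# Overriding the format switches the bullet style without touching the logic.
print(render_list(["One", "Two"], unordered_format="\n* {item}"))
```

Keeping the list style in a format string means a subclass can change the bullet character or numbering punctuation by overriding a single constant.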

Customize converting HTML tags

For an HTML tag some_tag, you can customize how it's converted to markdown by overriding a method like so:

from unmarkd.unmarkers import BaseUnmarker
class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, child) -> str:
        ...  # parse code here

To reduce code duplication, if your tag also has aliases (e.g. strong is an alias for b in HTML) then you may modify the TAG_ALIASES.

If you really need to, you may also modify DEFAULT_TAG_ALIASES. Be warned: if you do so, you will also need to implement the aliases (currently em and strong).
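
The tag_* naming plus an alias table is a common dispatch pattern. Below is a self-contained sketch of the idea; the TagDispatcher class and its convert method are invented for this example and only mimic the shape of the mechanism, not Unmarkd's internals.

```python
class TagDispatcher:
    """Toy dispatcher: looks up a tag_<name> method, resolving aliases first."""

    # Hypothetical alias table for this sketch: alias -> canonical tag name.
    TAG_ALIASES = {"strong": "b", "em": "i"}

    def convert(self, tag_name: str, text: str) -> str:
        canonical = self.TAG_ALIASES.get(tag_name, tag_name)
        handler = getattr(self, f"tag_{canonical}", None)
        if handler is None:
            return text  # unknown tags pass their text through unchanged
        return handler(text)

    def tag_b(self, text: str) -> str:
        return f"**{text}**"

    def tag_i(self, text: str) -> str:
        return f"*{text}*"


class WithStrikethrough(TagDispatcher):
    # Extend the alias table instead of writing a second handler for <s>.
    TAG_ALIASES = {**TagDispatcher.TAG_ALIASES, "s": "del"}

    def tag_del(self, text: str) -> str:
        return f"~~{text}~~"


print(WithStrikethrough().convert("s", "old"))
print(WithStrikethrough().convert("strong", "bold"))
```

One handler serves every alias, so supporting a new spelling of a tag is a one-line change to the table.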

Utility functions when overriding

You may use (when extending) the following functions:

  • __parse: 2 parameters:
    • html: bs4.BeautifulSoup
      • The html to unmark. This is used internally by the unmark method and is slightly faster.
    • escape: bool
      • Whether to escape the characters inside the string or not. Defaults to False.
  • escape: 1 parameter:
    • string: str
      • The string to escape and make markdown-safe
  • wrap: 2 parameters:
    • element: bs4.BeautifulSoup
      • The element to wrap.
    • around_with: str
      • The string to wrap around the element. WILL NOT BE ESCAPED
  • And, of course, tag_* and detect_language.
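
For intuition, here is a simplified standalone version of escape and wrap that works on plain strings. The real wrap takes a bs4 element, as listed above. The contents of ESCAPABLES below are an assumption picked for this sketch.

```python
# Hypothetical set of characters to escape; the library's own set may differ.
ESCAPABLES = {"*", "_", "`", "[", "]", "\\"}


def escape(string: str) -> str:
    """Backslash-escape every character that Markdown would treat as markup."""
    return "".join("\\" + ch if ch in ESCAPABLES else ch for ch in string)


def wrap(text: str, around_with: str) -> str:
    """Surround text with a delimiter; the delimiter itself is not escaped."""
    return around_with + text + around_with


print(wrap(escape("2 * 3"), "**"))
```

Escaping the inner text but not the delimiter is what keeps a literal asterisk from being read as emphasis while the surrounding ** still renders as bold.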

Comments
  • Nested lists of same type don't work

    Neither unordered nor ordered lists work when nested inside a list of the same type:

    Two nested ordered lists

    HTML:

    <ol>
        <li>Top level 1</li>
        <li>Top level 2
            <ol>
                <li>A</li>
                <li>B</li>
                <li>C</li>
            </ol>
        </li>
        <li>Top level 3</li>
    </ol>
    

    Output:

    1. Top level 1
     2. Top level 2
            
     1. A
     2. B
     3. C
     3. Top level 3
    

    Two nested unordered lists

    HTML:

    <ul>
        <li>Top level 1</li>
        <li>Top level 2
            <ul>
                <li>A</li>
                <li>B</li>
                <li>C</li>
            </ul>
        </li>
        <li>Top level 3</li>
    </ul>
    

    Output:

    - Top level 1
    - Top level 2
            
    - A
    - B
    - C
    - Top level 3
    
    bug good first issue reproduced 
    opened by sirnacnud 3
  • [ImgBot] Optimize images

    Beep boop. Your images are optimized!

    Your image file size has been reduced by 39% 🎉

    Details

    | File | Before | After | Percent reduction |
    |:--|:--|:--|:--|
    | /assets/correct.png | 372.04kb | 224.67kb | 39.61% |
    | /assets/tomd_cant_handle.png | 347.74kb | 210.22kb | 39.55% |
    | /assets/benchmark.png | 219.28kb | 141.36kb | 35.53% |
    | | | | |
    | Total : | 939.06kb | 576.25kb | 38.64% |


    opened by imgbot[bot] 1
  • Fix indent getting added to list children that weren't other lists

    I was running into an issue where list items using tags were getting indented when they shouldn't have been.

    Example:

    <ol>
        <li>A</li>
        <li>B</li>
        <li><b>C</b></li>
    </ol>
    

    Output:

    1. A
    2. B
    3.     **C**
    

    I added a test for this case as well. When doing the roundtrip style test, this indentation got lost, so I made the test compare the markdown output.

    opened by sirnacnud 1
  • Support for tables

    Unmarkd currently passes tables through as the HTML it was given. It would be nice if it converted them to Markdown tables:

    | Syntax      | Description |
    | ----------- | ----------- |
    | Header      | Title       |
    | Paragraph   | Text        |
    
    enhancement 
    opened by ThatXliner 1
  • Nested lists are not properly indented

    When the following HTML block is parsed:

    <ul>
        <li>Unordered lists, and:
            <ol>
                <li>One</li>
                <li>Two</li>
                <li>Three</li>
            </ol>
        </li>
        <li>More</li>
    </ul>
    

    The output is incorrect:

     * Unordered lists, and:
     0. One
     1. Two
     2. Three
     * More
    
    bug 
    opened by ThatXliner 1
  • Blockquote bug

    Apply this patch:

    diff --git a/tests/test_roundtrip.py b/tests/test_roundtrip.py
    index a836024..5c1e097 100644
    --- a/tests/test_roundtrip.py
    +++ b/tests/test_roundtrip.py
    @@ -1,10 +1,9 @@
     import unicodedata
     
     import markdown_it
    -from hypothesis import assume, example, given
    -from hypothesis import strategies as st
    -
     import unmarkd
    +from hypothesis import assume, example, given, reproduce_failure
    +from hypothesis import strategies as st
     
     md = markdown_it.MarkdownIt()
     
    @@ -17,6 +16,7 @@ def helper(text: str, func=unmarkd.unmark) -> None:
     
     
     @given(text=st.text(st.characters(blacklist_categories=("Cc", "Cf", "Cs", "Co", "Cn"))))
    +@reproduce_failure("6.10.1", b"AAEADgEADgEADgA=")
     def test_roundtrip_commonmark_unmark(text):
         assume(unicodedata.normalize("NFKC", text) == text)
         helper(text)
    
    
    
    

    Or add an example with text=">>>". Tests will fail.

    bug 
    opened by ThatXliner 0
  • Update README for better comparison

    1. html2text is fast but not very configurable (there are only so many options)
    2. Tomd sucks
    3. Add an unmarker (with html2text-style configuration) to prove that unmarkd's configurability is at least equal to html2text
    documentation 
    opened by ThatXliner 0
  • Use a more reliable markdown parser

    Instead of using commonmark, maybe https://github.com/executablebooks/markdown-it-py, https://github.com/trentm/python-markdown2, https://github.com/lepture/mistune, or https://github.com/Python-Markdown/markdown.

    Also, I found tomd which might render this project useless 😬

    tests 
    opened by ThatXliner 0
  • Cannot handle nested bold and italics

    When encountering input like <em>Italic and <strong>bold and italic</strong></em>, the output is wrong, usually shadowed by the outer tag (in this case, <em>)

    bug 
    opened by ThatXliner 0
  • Optimize code

    I've noticed that unmarkers.BaseUnmarker has been documented as an "abstract base class" when we're actually using it otherwise.

    Also, there's some dead code and we should actually sprinkle @staticmethod on some of them.

    Here's my idea:

    • Move all the tag_* methods in BaseUnmarker ➡️ BasicUnmarker
    • Rename: BaseUnmarker ➡️ AbstractUnmarker
    • Alias: BaseUnmarker ➡️ BasicMarker
    • Run shed on the whole codebase (with --refactor)

    Version bump: minor

    enhancement 
    opened by ThatXliner 0
  • Save CSS information

    1. Parse any css files or style tags found. Save it
    2. When a class attribute is found, try to resolve it to the css
    3. Add the resolved to the style attribute: convert to inline css
    enhancement 
    opened by ThatXliner 1
Releases(v0.1.9)
Owner
ThatXliner
I code Python. To me, programming is a logic puzzle. A fun one :D