Extract XML from the OS X dictionaries.

Overview

Before You Start

Apple-peeler was written using python 3.9 (but it should be trivial to support earlier versions of python 3.5+).

Installation

pip install apple-peeler

Dependencies

BeautifulSoup 4, lxml, and click

Usage

Apple likes to move around the dictionaries location from macOS version to macOS version. So if the dictionaries are no longer at the path below you can tell apple-peeler where to look by exporting DICT_BASE in your environment or using the --base option below.

export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"

After that, useage is straightforward.

Usage: apple-peeler [OPTIONS]

Extract XML from Apple Dictionary files.

Options:
--base DIRECTORY                The root directory of the OS X dictionaries.
                                (Default: /System/Library/AssetsV2/com_apple
                                _MobileAsset_DictionaryServices_dictionaryOS
                                X/) [Env var DICT_BASE]
--out DIRECTORY                 The path to place extracted XML files.
-d, --dictionary [
    all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
    Dutch - English|French|French - English|French - German|German - English|
    Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
    Italian - English|Korean|Korean - English|New Oxford American Dictionary|
    Norwegian|Oxford American Writer's Thesaurus|
    Oxford Dictionary of English|Oxford Thesaurus of English|
    Polish - English|Portuguese|Portuguese - English|Russian|
    Russian - English|Sanseido Super Daijirin|
    Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
    Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
    Spanish - English|Swedish|Thai|Thai - English|
    The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
    Traditional Chinese - English|Turkish|Vietnamese - English]
                                The dictionary to extract or 'all'.
                                (Default: all) [Accepts multiple]
--format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                (Default: False)
--debug                         Output debug information to STDERR.
                                (Default: False)
--help                          Show this message and exit.

Introduction

I need a ton of dictionary data for prototyping my learning a language tool, Parsnip, and licensing 40 dictionaries seems too expensive for a bootstrapper prototyping / working on an MVP (I look forward to the day this is no longer true). [Note: I am not planning to redistribute or otherwise use the data in an unlicensed manner.]

Parsnip uses Natural Language Processing and Dictionaries to decouple the word <-> sentence tug-of-war that's existed as long as flashcards have been used for language learning. I.e., should I make a word (concept) or a sentence (example) flashcard?

I care about what words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks down sentences into lemmas (or dictionary form of a word) and a database of example sentences that the words appear in. This resolves the conceptual tug-of-war for flashcards.

But by removing reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. JMDict is a great open source project for this, but that only covers a single language. So, I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

References

This is inspired by Reverse-Engineering Apple Dictionary. And the discussion on Hacker News Hacker News: Reverse-Engineering Apple Dictionary (2020). Special thanks to tim-- and enragedcacti who introduced me to binwalk. And dunham who mentioned the random bytes looking like ints of payload sizes.

Additionally, I've found these posts informative:

You might also like...
extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file.

GetTsite python Package extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install Get

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

eBook Reader Dictionaries Finally, decent dictionaries based on Wiktionary for your beloved eBook reader. Dictionaries Catalan 🚧 Ελληνικά (help welco

Generates password lists/dictionaries based on keywords written in python3.

dicbyru Introduction Generates password lists/dictionaries based on keywords. It uses the keywords and adds capital letters, numbers and special chara

This utility synchronises spelling dictionaries from various tools with each other.

This utility synchronises spelling dictionaries from various tools with each other. This way the words that have been trained on MS Office are also correctly checked in vim or Firefox. And vice versa of course.

Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": print(json.dumps(xmltod

Create Open XML PowerPoint documents in Python

python-pptx is a Python library for creating and updating PowerPoint (.pptx) files. A typical use would be generating a customized PowerPoint presenta

The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

PubMed Mapper: A Python library that map PubMed XML to Python object

pubmed-mapper: A Python Library that map PubMed XML to Python object 中文文档 1. Philosophy view UML Programmatically access PubMed article is a common ta

Simple app for visual editing of Page XML files

Name nw-page-editor - Simple app for visual editing of Page XML files. Version: 2021.02.22 Description nw-page-editor is an application for viewing/ed

PAGE XML format collection for document image page content and more
PAGE XML format collection for document image page content and more

PAGE-XML PAGE XML format collection for document image page content and more For an introduction, please see the following publication: http://www.pri

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

A repository that shares tuning results of trained models generated by TensorFlow / Keras. Post-training quantization (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization), Quantization-aware training. TensorFlow Lite. OpenVINO. CoreML. TensorFlow.js. TF-TRT. MediaPipe. ONNX. [.tflite,.h5,.pb,saved_model,tfjs,tftrt,mlmodel,.xml/.bin, .onnx]
A python script to convert an ucompressed Gnucash XML file to a text file for Ledger and hledger.

README 1 gnucash2ledger gnucash2ledger is a Python script based on the Github Gist by nonducor (nonducor/gcash2ledger.py). This Python script will tak

Json2Xml tool will help you convert from json COCO format to VOC xml format in Object Detection Problem.

JSON 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Json2Xml t

Txt2Xml tool will help you convert from txt COCO format to VOC xml format in Object Detection Problem.

TXT 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Txt2Xml too

Releases(v0.1.1)
Owner
Joshua Olson
Joshua Olson
Allows you to canibalize methods from classes effectively implementing trait-oriented programming

About This package enables code reuse in non-inheritance way from existing classes, effectively implementing traits-oriented programming pattern. Stor

1 Dec 13, 2021
A thing to simplify listening for PG notifications with asyncpg

A thing to simplify listening for PG notifications with asyncpg

ANNA 18 Dec 23, 2022
Link-tree - Script that iterate over the links found in each page

link-tree Script that iterate over the links found in each page, recursively fin

Rodrigo Stramantinoli 2 Jan 05, 2022
Check the basic quality of any dataset

Data Quality Checker in Python Check the basic quality of any dataset. Sneak Peek Read full tutorial at Medium. Explore the app Requirements python 3.

MalaDeep 8 Feb 23, 2022
🦩 A Python tool to create comment-free Jupyter notebooks.

Pelikan Pelikan lets you convert notebooks to comment-free notebooks. In other words, It removes Python block and inline comments from source cells in

Hakan Özler 7 Nov 20, 2021
一款不需要买代理来减少扫网站目录被封概率的扫描器,适用于中小规格字典。

PoorScanner使用说明书 -工具在不同环境下可能不怎么稳定,如果有什么问题恳请大家反馈。说明书有什么错误的地方也大家欢迎指正。 更新记录 2021.8.23 修复了云函数主程序 gitee上传文件接口写错了的BUG(之前把自己的上传地址写死进去了,没从配置文件里读) 更新了说明书 PoorS

14 Aug 02, 2022
A script to check for common mistakes in LaTeX source files of scientific papers.

LaTeX Paper Linter This script checks for common mistakes in LaTeX source files of scientific papers. Usage python3 paperlint.py file.tex [-i/x inc

Michael Schwarz 12 Nov 16, 2022
Simplex using Jordan exchanges taught in 236A

Simplex for 236A Python script to solve LP using simplex by Jordan exchanges taught in 236A. You will need python installed along with the 'numpy' and

Kunal Kishore 1 Nov 30, 2021
A Randomizer Oracle

Tezos Randomizer Tezod Randomizer "Oracle". It's a smart contract that you can call to get a random number between X and Y (for now). It uses entropy

Asbjorn Enge 19 Sep 13, 2022
Search, generate & deliver Msfvenom payloads in an quick and easy way

Goal Search, generate & deliver payloads in an quick and easy way Be as simple as possible BUT with all msfvenom payloads. Ever lost time searching th

2 Mar 03, 2022
Python implementation of Gorilla time series compression

Gorilla Time Series Compression This is an implementation (with some adaptations) of the compression algorithm described in section 4.1 (Time series c

Ghiles Meddour 19 Jan 01, 2023
Python utilities for writing cross-version compatible libraries

Python utilities for writing cross-version compatible libraries

Tyler M. Kontra 85 Jun 29, 2022
A pythonic dependency injection library.

Pinject Pinject is a dependency injection library for python. The primary goal of Pinject is to help you assemble objects into graphs in an easy, main

Google 1.3k Dec 30, 2022
Automatic generator of readmes for git repositories (Includes file' listing)

Readme Generator We are bored of write the same things once and once again. We trust in the comments made inside of our files, and we decided to autom

Natalia Vera Duran 6 Jul 20, 2021
Generates a random prnt.sc link and display image.

Generates a random prnt.sc link and display image.

Emirhan 3 Oct 08, 2021
PyHook is an offensive API hooking tool written in python designed to catch various credentials within the API call.

PyHook is the python implementation of my SharpHook project, It uses various API hooks in order to give us the desired credentials. PyHook Uses

Ilan Kalendarov 158 Dec 22, 2022
An URL checking python module

An URL checking python module

Fayas Noushad 6 Aug 10, 2022
A primitive Python wrapper around the Gromacs tools.

README: GromacsWrapper A primitive Python wrapper around the Gromacs tools. The library is tested with GROMACS 4.6.5, 2018.x, 2019.x, 2020.x, and 2021

Becksteinlab 140 Dec 28, 2022
Yet another retry utility in Python

Yet another retry utility in Python, avereno being the Malagasy word for retry.

Haute École d'Informatique de Madagascar 4 Nov 02, 2021
Napari plugin for loading Bitplane Imaris files .ims

napari-imaris-loader Napari plugin for loading Bitplane Imaris files '.ims'. Notes: For this plugin to work "File/Preferences/Experimental/Render Imag

Alan Watson 4 Dec 01, 2022