Maha is a text processing library specially developed to deal with Arabic text.

Last update: Nov 27, 2022

Overview

An Arabic text processing library intended for use in NLP applications

Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to join our Discord server. To submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation is hosted on Read the Docs.

Contributing

Maha welcomes and encourages contributions from everyone. See the contribution guidelines in the documentation to get started.

License

Maha is BSD-licensed.

Comments

Time: Add the ability to parse Hijri dates
What does this pull request change?

Closes #27.

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 6
Added distance to dimension parsing
What does this pull request change?

Resolves #15.

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

parsing highlight
opened by TRoboto 5
Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names
What does this pull request change?

This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 4
Add pyupgrade to pre-commit and upgrade to future-style type annotations
What does this pull request change?

Upgrades to new type annotations style.

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

maintenance
opened by TRoboto 3
Deprecate and remove `datasets` module and host datasets on Hugging Face instead
What does this pull request change?

Removes datasets module.

Datasets are now hosted on Hugging Face.

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

breaking changes deprecation
opened by TRoboto 3
Add the ability to parse names from text
What does this pull request change?

Adds #24. Depends on #40

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 3
Add a deprecation system
What does this pull request change?

Closes #23

Adds three deprecation decorators: one for functions, one for parameters, and one for default parameter values.
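As an illustration of what three such decorators might look like, here is a minimal sketch built only on the standard library. The names `deprecated_fn`, `deprecated_param`, and `deprecated_default` are hypothetical and are not taken from Maha's code base; the actual implementation in the pull request may differ.

```python
import functools
import inspect
import warnings


def _warn(message):
    # stacklevel=3 attributes the warning to the caller of the decorated function.
    warnings.warn(message, DeprecationWarning, stacklevel=3)


def deprecated_fn(reason):
    """Warn every time the decorated function is called (hypothetical name)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            _warn(f"{func.__name__}() is deprecated: {reason}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def deprecated_param(name, reason):
    """Warn when the caller explicitly passes the parameter `name`."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # bind_partial lists only the arguments the caller actually supplied.
            passed = signature.bind_partial(*args, **kwargs).arguments
            if name in passed:
                _warn(f"parameter '{name}' of {func.__name__}() is deprecated: {reason}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def deprecated_default(name, reason):
    """Warn when the caller relies on the default value of `name`."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            passed = signature.bind_partial(*args, **kwargs).arguments
            if name not in passed:
                _warn(f"the default of '{name}' in {func.__name__}() is deprecated: {reason}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Using `inspect.signature` rather than checking `kwargs` alone lets the parameter decorators detect arguments passed positionally as well as by keyword.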

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

development
opened by saedx1 3
Prepare for the next release of Maha (v0.3.0)
This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

Generated changelogs for release v0.3.0.

Bumped pypi version to v0.3.0.

Updated the citation information.
opened by github-actions[bot] 2
Ordinal: Add support to `بعد` in ordinal parsing
What does this pull request change?

Closes #48.

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature
opened by TRoboto 2
Numeral: Add support for hierarchical parsing
What does this pull request change?

Closes #25

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature
opened by TRoboto 2
Prepare for the next release of Maha (v0.2.0)
This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

Generated changelogs for release v0.2.0.

Bumped pypi version to v0.2.0.

Updated the citation information.
opened by github-actions[bot] 2
Update ci.yml
Check support for Python 3.10.

What does this pull request change? It checks whether the library supports Python 3.10.

...

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[ ] tox passes
opened by PAIN-BARHAM 1
[pre-commit.ci] pre-commit autoupdate
updates:

github.com/pre-commit/pre-commit-hooks: v4.3.0 → v4.4.0

github.com/psf/black: 22.6.0 → 22.12.0

github.com/pycqa/isort: 5.10.1 → 5.11.4

github.com/asottile/pyupgrade: v2.37.3 → v3.3.1
opened by pre-commit-ci[bot] 1
Add the option to ignore Harakat when removing or replacing
What problem are you trying to solve?

Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

Examples (if relevant)

Current:

>>> from maha.cleaners.functions import remove
>>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
>>> output
يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى

Suggested:

>>> from maha.cleaners.functions import remove
>>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
>>> output
يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
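One way the requested behavior could work is to build a pattern that tolerates any run of Harakat after each letter of the search expression. The sketch below uses only the standard `re` module and is not Maha's implementation. It assumes that "Harakat" means the combining marks U+064B to U+0652, it treats the expression as a literal string rather than a regular expression, and the function names are hypothetical.

```python
import re

# Assumption: Harakat are the Arabic combining marks U+064B (fathatan) to U+0652 (sukun).
HARAKAT = "\u064B-\u0652"
HARAKAT_RE = re.compile(f"[{HARAKAT}]")


def strip_harakat(text):
    """Return `text` with all Harakat removed."""
    return HARAKAT_RE.sub("", text)


def remove_ignoring_harakat(text, expression):
    """Remove every occurrence of `expression` from `text`, ignoring Harakat.

    Hypothetical helper: `expression` is matched literally, as a substring.
    """
    bare = strip_harakat(expression)
    if not bare:
        return text
    # Allow any number of Harakat after each letter of the expression.
    pattern = "".join(re.escape(ch) + f"[{HARAKAT}]*" for ch in bare)
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(cleaned.split())


sentence = "يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى"
print(remove_ignoring_harakat(sentence, "اللغة"))
```

Because the pattern matches substrings, an expression that also occurs inside a longer word would be removed from that word too; adding word boundaries would be a separate design decision.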

Definition of Done

It must adhere to the coding style used in the defined cleaner functions.

The implementation should cover most use cases.

Adding tests

feature request
opened by xaleel 1
Wrong parsed name using name dimension
What happened?

The name parser extracts words that are not names, such as بي and شكرا.

Example:
text: أريد البحث في سجل الإنفاق الخاص بي
output: [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

I expect only names from the names dataset to be extracted.

Python version

3.8

What operating system are you using?

Linux

Code to reproduce the issue

>>> from maha.parsers.functions import parse_dimension
>>> text = "أريد البحث في سجل الإنفاق الخاص بي"
>>> extracted = parse_dimension(text, names=True)
>>> extracted
[Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

Relevant log output

No response
bug parsing
opened by PAIN-BARHAM 0
Add feature to parse duration period
What problem are you trying to solve?

Parse a duration from text that mentions two dates, returning the difference between them.

Examples (if relevant)

>>> from maha.parsers.functions import parse_dimension
>>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
>>> output
DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
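The core of the request, turning a "between X and Y" phrase into a span, can be sketched with a plain regular expression. This is only an illustration and not how Maha's duration dimension is built: the function name `year_span` is hypothetical, and the sketch assumes both numbers are three- or four-digit years.

```python
import re

# "بين X و Y" ("between X and Y"); \d also matches Arabic-Indic digits such as ١٩٠٠.
RANGE_PATTERN = re.compile(r"بين\s+(\d{3,4})\s+و\s*(\d{3,4})")


def year_span(text):
    """Return the number of years between the two years in the text, or None.

    Hypothetical helper that assumes both numbers are years.
    """
    match = RANGE_PATTERN.search(text)
    if match is None:
        return None
    first, second = (int(group) for group in match.groups())
    return abs(second - first)


print(year_span("عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي"))
```

A full implementation would return the result as a duration value with a unit, as in the example output from the issue, instead of a bare integer.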

Definition of Done

It must adhere to the coding style used in the existing dimensions, in particular the duration dimension.

The implementation should cover most use cases.

Adding tests

feature request
opened by PAIN-BARHAM 1

Adding the parser functionality to Processors

What problem are you trying to solve?

Adding the parser functionality to Processors to parse different dimensions.

Examples (if relevant)

>>> from pathlib import Path
>>> import maha
>>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
>>> data = resource_path.read_text()
>>> print(data)

الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
لما حد يسالني بتختفي كتير لية =..
زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
#Windows11 is on the horizon. What feature are you looking forward to
Get vaccinate #savethesaviour
Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit

>>> from maha.processors import FileProcessor
>>> proc = FileProcessor(resource_path)
>>> parsed = proc.parse_dimension(time=True)
>>> parsed
[Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
 Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
 Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
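To make the idea concrete, the sketch below shows one possible shape for such a feature: a processor that streams a file line by line and applies a parsing function to each line. The class `LineParsingProcessor` and the helper `find_clock_times` are hypothetical stand-ins and do not use Maha's `FileProcessor`. With Maha installed, the parsing function could presumably be something like `lambda t: parse_dimension(t, time=True)`.

```python
import re
import tempfile
from pathlib import Path


class LineParsingProcessor:
    """Hypothetical processor that applies a parser to each line of a file."""

    def __init__(self, path):
        self.path = Path(path)

    def parse(self, parse_fn):
        """Lazily yield (line_number, results) for lines where parse_fn finds something."""
        with self.path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                results = parse_fn(line.rstrip("\n"))
                if results:
                    yield number, results


def find_clock_times(text):
    """Toy parser: return clock times such as 12:00 found in the text."""
    return re.findall(r"\b\d{1,2}:\d{2}\b", text)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_text("now 12:00\nnothing here\nat 1:30 and 2:45\n", encoding="utf-8")
    demo_results = list(LineParsingProcessor(sample).parse(find_clock_times))

print(demo_results)
```

Unlike the example output in the issue, where `start` and `end` are offsets into the whole file, this sketch reports results per line, which keeps memory use constant for large files.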

Definition of Done

It must adhere to the coding style.
The implementation should cover most use cases.
Adding tests.

good first issue feature request parsing

opened by PAIN-BARHAM 0

Releases(v0.3.0)

v0.3.0(Apr 4, 2022)

Check out the changelog for this release.
v0.2.0(Nov 16, 2021)

Check out the changelog for this release.
v0.1.2(Sep 23, 2021)
Quick fix:

Added readme badges

Fixed missing regex dependency


Owner

Mohammad Al-Fetyani

Machine Learning Engineer
