Python Stop Words

Overview

Get a list of common stop words in various languages in Python.

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Indonesian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian

Installation

stop-words is available on PyPI

http://pypi.python.org/pypi/stop-words

So it can easily be installed with pip:

$ pip install stop-words

Another way is to clone the stop-words git repository:

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install
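
After installing, a quick sanity check (just a sketch; 'en' is one of the language codes listed above) confirms that the package and its bundled word lists load:

from stop_words import get_stop_words

# Should print a non-zero number of English stop words if the
# languages.json mapping and word lists were installed correctly.
print(len(get_stop_words('en')))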

Basic usage

from stop_words import get_stop_words

# Both the short language code and the full language name are accepted.
stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

# safe_get_stop_words does not raise an error for unsupported languages;
# it returns an empty list instead.
stop_words = safe_get_stop_words('unsupported language')
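
The package only provides the word lists themselves; filtering a text is left to the caller. Below is a minimal sketch of one way to do it (the whitespace tokenization and lowercasing are illustrative assumptions, not part of the package):

from stop_words import get_stop_words

stop_words = set(get_stop_words('en'))

text = "The University of Waterloo Stratford Campus is located in Stratford Ontario Canada"
# Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
filtered = " ".join(word for word in text.split() if word.lower() not in stop_words)
print(filtered)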

Python compatibility

Python Stop Words is compatible with:

  • Python 2.7
  • Python 3.4
  • Python 3.5
  • Python 3.6
  • Python 3.7
Comments
  • Enforces packaging of eggs into folders.

    We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

    This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

    Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

    opened by hfjn 10
  • add indonesian stop word list

    Add a stop word list for the Indonesian language; the mapping was added to the JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

    opened by frankdevans 4
  • can you handle a text?

    Hello, there is no description of how to use this. I have a text: "The University of Waterloo Stratford Campus is located in Stratford, Ontario, Canada. It is one of the three satellite campuses of the University of Waterloo, a member of the U15 Group of Canadian Research Universities. Established in June 2009, the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo." How can I use python-stop-words to filter the stop words out and get a text without them?

    thank you very much!!

    question 
    opened by PapaMadeleine2022 2
  • Python 3 support

    List of improvements:

    • Tests
    • Python 3 support
    • Dev installation via zc.buildout
    • Continuous integration via Travis

    Can you make a new release once the branch is merged?

    Regards

    enhancement 
    opened by Fantomas42 2
  • languages.json is missing, if you don't git clone with `--recursive`

    languages.json is still missing if you don't clone with --recursive:

    $ git clone git://github.com/Alir3z4/python-stop-words.git
    $ cd python-stop-words
    $ python3 setup.py install
    Traceback (most recent call last):
      File "setup.py", line 5, in <module>
        version=__import__("stop_words").get_version(),
      File "./stop_words/__init__.py", line 9, in <module>
        with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
    FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

    opened by marcindulak 1
  • Update submodule to the latest

    Include the stop words for the newly added languages:

    https://github.com/Alir3z4/stop-words/pull/4
    https://github.com/Alir3z4/stop-words/pull/5
    https://github.com/Alir3z4/stop-words/pull/6
    https://github.com/Alir3z4/stop-words/pull/7

    enhancement 
    opened by norkans7 1
  • Decode error AND Add catalan language to LANGUAGE_MAPPING

    1. Add Catalan language to LANGUAGE_MAPPING. I previously added the file with the stop words to the "stop-words" project.

    2. Decode error

    stop_words = [line.strip().decode('utf-8')
                 for line in language_file.readlines()]
    

    strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the string contains non-ASCII characters, strip() causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can not decode byte 0xc3 in position 34: unexpected end of data).

    The workaround is to reorder the call:

    stop_words = [line.decode('utf-8').strip()
                 for line in language_file.readlines()]
    
    opened by dmiro 1
  • Defining custom stop words in NLTK

    Hi, I want to know how to define my own custom stop words. I'm currently developing sentiment analysis in my local language, in which I'm using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

    Hope you can help me, thanks. (A minimal sketch of one approach is included after this list of comments.)

    opened by AllikDaniel 0
  • Example not work on python 3.7.0

    It returns an empty list []:

    from stop_words import get_stop_words
    
    stop_words = get_stop_words('en')
    stop_words = get_stop_words('english')
    
    from stop_words import safe_get_stop_words
    
    stop_words = safe_get_stop_words('unsupported language')
    print(stop_words)
    
    opened by nadavvin 2
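
Regarding the comment above about defining custom stop words: since get_stop_words returns a plain Python list, one simple approach is to extend it with your own words. A minimal sketch (the added words are placeholders, not real stop words):

from stop_words import get_stop_words

# Start from the packaged English list and add project-specific words.
# 'foo' and 'bar' are placeholder examples only.
custom_stop_words = set(get_stop_words('en'))
custom_stop_words.update(['foo', 'bar'])

The resulting set can then be used for filtering exactly like the packaged list, for example with the sketch shown under Basic usage.
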
Releases (2018.7.23)
  • 2018.7.23(Jul 23, 2018)

    2018.7.23

    • Fixed #14: languages.json is missing, if you don't git clone with --recursive.
    • Feature: Support latest version of Python (3.7+).
    • Feature #22: Enforces packaging of eggs into folders.
    • Update the stop-words repository to get the latest languages.
    • Fixed failing Travis builds and tests caused by bootstrap.

    PyPI: https://pypi.org/project/stop-words/2018.7.23/

    To install:

    $ pip install stop-words==2018.7.23
    
  • 2015.2.23.1(Feb 23, 2015)

  • 2015.2.23(Feb 23, 2015)

    2015.2.23


    • Feature: Using the cache is optional
    • Feature: Filtering stopwords

    Special thanks to Taras Labiak @kissarat

    PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

  • 2015.2.21(Feb 21, 2015)

    2015.2.21


    • Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json
    • Fix: Made paths OS-independent

    PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

    Special thanks to Taras Labiak @kissarat

  • 2015.1.31(Feb 1, 2015)

  • 2015.1.22(Jan 22, 2015)

    2015.1.22


    • Feature: Tests
    • Feature: Python 3 support
    • Feature: Dev installation via zc.buildout
    • Feature: Continuous integration via Travis

    PyPI: https://pypi.python.org/pypi/stop-words/2015.1.22

  • 2015.1.19(Jan 19, 2015)

Owner
Alireza Savand
I am Alireza Savand, a Software Architect.