Python Stop Words

Overview
Get a list of common stop words in various languages in Python.

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Indonesian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian

Installation

stop-words is available on PyPI

http://pypi.python.org/pypi/stop-words

So it can easily be installed with pip:

$ pip install stop-words

Another way is to clone the stop-words git repository:

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

from stop_words import get_stop_words

stop_words = get_stop_words('en')       # by ISO 639-1 language code
stop_words = get_stop_words('english')  # or by full language name

from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('unsupported language')  # returns an empty list for unsupported languages
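
The returned value is a plain Python list, so it can be used directly to filter tokens out of a text. The sketch below is illustrative and not part of the library's API; the sample sentence and the naive whitespace tokenization are assumptions:

from stop_words import get_stop_words

stop_words = set(get_stop_words('en'))  # set membership checks are faster when filtering

text = "this is an example of removing stop words from a sentence"
tokens = text.split()  # naive whitespace tokenization, for illustration only
filtered = [word for word in tokens if word.lower() not in stop_words]
print(filtered)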

Python compatibility

Python Stop Words is compatible with:

  • Python 2.7
  • Python 3.4
  • Python 3.5
  • Python 3.6
  • Python 3.7
Comments
  • Enforces packaging of eggs into folders.

    We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

    This leads to the following error, where the initializer tries to open a file inside the egg as if the egg were a directory, when it is actually a zip archive:

    Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

    opened by hfjn 10
  • add indonesian stop word list

    Adds a stop word list for the Indonesian language and the corresponding mapping to the JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

    opened by frankdevans 4
  • can you handle a text?

    Hello, there is no description of how to use this. I have the following text: "The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo." How can I use python-stop-words to filter the stop words out of this text and get a text without them?

    Thank you very much!

    question 
    opened by PapaMadeleine2022 2
  • Python 3 support

    List of improvements:

    • Tests
    • Python 3 support
    • Dev installation via zc.buildout
    • Continuous integration via Travis

    Can you make a new release once the branch is merged?

    Regards

    enhancement 
    opened by Fantomas42 2
  • languages.json is missing, if you don't git clone with `--recursive`

    languages.json is still missing, if you don't clone with --recursive

    $ git clone git://github.com/Alir3z4/python-stop-words.git
    $ cd python-stop-words
    $ python3 setup.py install
    Traceback (most recent call last):
      File "setup.py", line 5, in <module>
        version=__import__("stop_words").get_version(),
      File "./stop_words/__init__.py", line 9, in <module>
        with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
    FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

    opened by marcindulak 1
  • Update submodule to the latest

    Include the stop words for the newly added languages:

    https://github.com/Alir3z4/stop-words/pull/4
    https://github.com/Alir3z4/stop-words/pull/5
    https://github.com/Alir3z4/stop-words/pull/6
    https://github.com/Alir3z4/stop-words/pull/7

    enhancement 
    opened by norkans7 1
  • Decode error AND Add catalan language to LANGUAGE_MAPPING

    1. Add Catalan language to LANGUAGE_MAPPING. I previously added the file with stop words to the "stop-words" project.

    2. Decode error

    stop_words = [line.strip().decode('utf-8')
                 for line in language_file.readlines()]
    

    strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the string contains non-ASCII characters, strip() causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data).

    The workaround is to reorder the call:

    stop_words = [line.decode('utf-8').strip()
                 for line in language_file.readlines()]
    
    opened by dmiro 1
  • Defining custom stop words in NLTK

    Hi, I want to know what the method is for defining our own custom stop words. I'm currently developing sentiment analysis in my local language, in which I'm using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

    Hope you can help me, thanks. (A minimal sketch of adding custom words to a stop word list appears after this comments list.)

    opened by AllikDaniel 0
  • Example not work on python 3.7.0

    It returns an empty []:

    from stop_words import get_stop_words
    
    stop_words = get_stop_words('en')
    stop_words = get_stop_words('english')
    
    from stop_words import safe_get_stop_words
    
    stop_words = safe_get_stop_words('unsupported language')
    print(stop_words)
    
    opened by nadavvin 2
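
On the custom stop words question above: get_stop_words() returns a plain Python list, so one straightforward approach is to extend that list with your own words before filtering. The sketch below assumes that approach; the extra words and the helper function are illustrative, not part of this library's API:

from stop_words import get_stop_words

# Start from the bundled English list and extend it with project-specific words.
custom_stop_words = list(get_stop_words('en'))
custom_stop_words.extend(['rt', 'via', 'amp'])  # hypothetical domain-specific additions

def remove_stop_words(tokens, stop_words=frozenset(custom_stop_words)):
    """Return only the tokens that are not in the combined stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]

print(remove_stop_words("I am quite new to this type of NLP project".split()))
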
Releases (2018.7.23)
  • 2018.7.23 (Jul 23, 2018)

    2018.7.23

    • Fixed #14: languages.json is missing, if you don't git clone with --recursive.
    • Feature: Support latest version of Python (3.7+).
    • Feature #22: Enforces packaging of eggs into folders.
    • Update the stop-words repository to get the latest languages.
    • Fixed failing Travis builds and tests due to bootstrap.

    PyPI: https://pypi.org/project/stop-words/2018.7.23/

    To install:

    $ pip install stop-words==2018.7.23
    
  • 2015.2.23.1 (Feb 23, 2015)

  • 2015.2.23 (Feb 23, 2015)

    2015.2.23


    • Feature: Using the cache is optional
    • Feature: Filtering stopwords

    Special thanks to Taras Labiak @kissarat

    PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

  • 2015.2.21 (Feb 21, 2015)

    2015.2.21


    • Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json
    • Fix: Made paths OS-independent

    PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

    Special thanks to Taras Labiak @kissarat

  • 2015.1.31 (Feb 1, 2015)

  • 2015.1.22 (Jan 22, 2015)

    2015.1.22


    • Feature: Tests
    • Feature: Python 3 support
    • Feature: Dev installation via zc.buildout
    • Feature: Continuous integration via Travis

    PyPI: https://pypi.python.org/pypi/stop-words/2015.1.22

  • 2015.1.19 (Jan 19, 2015)

Owner
Alireza Savand
I am Alireza Savand, a Software Architect.