Add Custom Stopwords using NLTK and Remove Them

Removing stopwords is a common text-processing task. Stopwords (such as "is," "the," and "at") usually contribute little to the meaning of a sentence and are often removed during the text preprocessing phase.

While NLTK provides a default set of stopwords for multiple languages, there are cases where you may need to add custom stopwords to tailor the list to your specific use case. In this article, we will demonstrate how to add custom stopwords to NLTK's existing list and remove them from your text.

Adding Custom Stopwords using NLTK Library

Step 1: Install and Import NLTK

Before proceeding, ensure you have NLTK installed. If not, install it using pip:

pip install nltk

Now, import the necessary modules:

Python

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

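Step 2: Download the Required NLTK Data

Download the stopwords corpus and the tokenizer data if you have not already done so:

Python

nltk.download('stopwords')   # stopword lists
nltk.download('punkt')       # tokenizer models used by older NLTK releases
nltk.download('punkt_tab')   # tokenizer data used by newer NLTK releases

Step 3: Define the Default Stopwords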

NLTK’s stopwords can be accessed for multiple languages. Here, we’ll use English stopwords:

Python

default_stopwords = set(stopwords.words('english'))
print(f"Default Stopwords: {len(default_stopwords)} words")

Output (the exact count may vary with the version of the NLTK stopwords data):

Default Stopwords: 179 words

Step 4: Add Custom Stopwords

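You can extend the default stopword list by adding your own words. Because the filtering step compares lowercased tokens against the list, keep custom stopwords in lowercase. For instance: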

Python

custom_stopwords = {'example', 'customword', 'anotherword'}
extended_stopwords = default_stopwords.union(custom_stopwords)

print(f"Extended Stopwords: {len(extended_stopwords)} words")

Output:

Extended Stopwords: 182 words
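
Custom words often come from user input or a file, so they may contain capital letters or stray whitespace and would then never match a lowercased token. A minimal sketch of normalizing them before the union is shown here. The helper name normalize_stopwords is hypothetical, and base_stopwords is a small stand-in for set(stopwords.words('english')) so that the example runs without NLTK data.

```python
def normalize_stopwords(words):
    """Lowercase and strip each word, dropping empty entries."""
    return {w.strip().lower() for w in words if w and w.strip()}

# Stand-in for set(stopwords.words('english')); an assumption for illustration only
base_stopwords = {'this', 'is', 'an', 'a', 'with'}

# Raw custom words as they might arrive from a config file or user input
raw_custom = ['Example', ' CustomWord ', 'anotherword', '']
custom_stopwords = normalize_stopwords(raw_custom)

# The union leaves base_stopwords unchanged and returns a new set
extended_stopwords = base_stopwords | custom_stopwords

print(sorted(custom_stopwords))
print(len(extended_stopwords))
```

With the real NLTK list, the same helper can be applied to the custom words before calling union.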

Step 5: Remove Stopwords from Text

Now, let’s see how to remove the extended stopwords from a sample text:

Python

sample_text = "This is an example text with a customword that needs to be removed."
tokens = word_tokenize(sample_text)

# Remove stopwords
filtered_text = [word for word in tokens if word.lower() not in extended_stopwords]

print("Original Text:", sample_text)
print("Filtered Text:", ' '.join(filtered_text))

Output:

Original Text: This is an example text with a customword that needs to be removed.
Filtered Text: text needs removed .
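
The filtered output above still ends with a period, because the tokenizer emits punctuation as separate tokens and "." is not a stopword. If that is not wanted, one option is to drop punctuation-only tokens as well. The sketch below works on a hard-coded token list that mirrors what word_tokenize returns for the sample sentence, and on a small stand-in stopword set, so it runs without NLTK data. The helper name remove_stopwords_and_punct is hypothetical.

```python
import string

def remove_stopwords_and_punct(tokens, stop_set):
    """Keep tokens that are neither stopwords nor made up only of punctuation."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_set:
            continue  # stopword (default or custom)
        if all(ch in string.punctuation for ch in tok):
            continue  # punctuation-only token such as "." or "--"
        kept.append(tok)
    return kept

# Tokens for the sample sentence, written out by hand (assumed tokenizer output)
tokens = ['This', 'is', 'an', 'example', 'text', 'with', 'a', 'customword',
          'that', 'needs', 'to', 'be', 'removed', '.']

# Stand-in for the extended stopword set built from NLTK's list
stop_set = {'this', 'is', 'an', 'with', 'a', 'that', 'to', 'be',
            'example', 'customword'}

filtered = remove_stopwords_and_punct(tokens, stop_set)
print(' '.join(filtered))
```

Checking each character against string.punctuation keeps tokens such as "don't" or "e-mail", which contain punctuation but are not made up of it.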

Step 6: Additional Functionality (Removing Stopwords Dynamically)

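To manage custom stopwords dynamically at runtime, wrap the filtering logic in a function that accepts an optional set of extra words:

Python

# Remove stopwords from text, optionally extending the default list with custom words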
def process_text(text, custom_words=None):
    if custom_words:
        current_stopwords = default_stopwords.union(custom_words)
    else:
        current_stopwords = default_stopwords
    
    tokens = word_tokenize(text)
    filtered = [word for word in tokens if word.lower() not in current_stopwords]
    return ' '.join(filtered)

# Example usage
custom_words = {'dynamicword', 'runtime'}
text = "This is a dynamicword that should be removed during runtime."
result = process_text(text, custom_words)
print("Processed Text:", result)

Output:

Processed Text: removed .
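
The function above only adds words to the default list. Some tasks also call for keeping words that the default list would drop; negations such as "not" often matter for sentiment analysis. A hedged sketch of a variant that works in both directions follows. The names build_stopwords, add, and keep are hypothetical, the base set is a small stand-in for NLTK's English list, and str.split stands in for word_tokenize to keep the example self-contained.

```python
def build_stopwords(base, add=None, keep=None):
    """Return a new set: base plus `add`, minus `keep`. `base` is not modified."""
    result = set(base)
    if add:
        result |= {w.lower() for w in add}
    if keep:
        result -= {w.lower() for w in keep}
    return result

def filter_tokens(tokens, stop_set):
    """Drop every token whose lowercase form is in stop_set."""
    return [t for t in tokens if t.lower() not in stop_set]

# Stand-in for set(stopwords.words('english')); an assumption for illustration only
base = {'this', 'is', 'not', 'a', 'the'}

# Add a custom stopword and keep the negation "not"
stop_set = build_stopwords(base, add={'runtime'}, keep={'not'})

tokens = "This is not a runtime problem".split()
print(filter_tokens(tokens, stop_set))
```

Building a fresh set on each call means one request's custom words cannot leak into the next.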

Complete Code

Python

# Import necessary modules
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')

# Define the default stopwords
default_stopwords = set(stopwords.words('english'))

# Print the default stopwords count
print(f"Default Stopwords: {len(default_stopwords)} words")

# Add custom stopwords
custom_stopwords = {'example', 'customword', 'anotherword'}
extended_stopwords = default_stopwords.union(custom_stopwords)

# Print the extended stopwords count
print(f"Extended Stopwords: {len(extended_stopwords)} words")

# Sample text
sample_text = "This is an example text with a customword that needs to be removed."

# Tokenize the text
tokens = word_tokenize(sample_text)

# Remove stopwords
filtered_text = [word for word in tokens if word.lower() not in extended_stopwords]

# Print the original and filtered text
print("Original Text:", sample_text)
print("Filtered Text:", ' '.join(filtered_text))

# Function to process text with dynamic custom stopwords
def process_text(text, custom_words=None):
    """
    Removes stopwords from the text, including any custom stopwords if provided.
    
    Args:
        text (str): Input text to process.
        custom_words (set, optional): Set of custom stopwords to add. Defaults to None.
        
    Returns:
        str: Text after removing stopwords.
    """
    if custom_words:
        current_stopwords = default_stopwords.union(custom_words)
    else:
        current_stopwords = default_stopwords

    tokens = word_tokenize(text)
    filtered = [word for word in tokens if word.lower() not in current_stopwords]
    return ' '.join(filtered)

# Example usage of the function
custom_words = {'dynamicword', 'runtime'}
text = "This is a dynamicword that should be removed during runtime."
result = process_text(text, custom_words)

# Print the processed text
print("Processed Text:", result)

Adding custom stopwords in NLTK gives you more flexibility when preprocessing text for specific use cases. By extending the default stopword list and managing it dynamically, you can refine your preprocessing pipeline, which may improve downstream tasks such as text classification, sentiment analysis, or information retrieval.
