Removing stopwords is a common text-processing task. These words (such as "is," "the," and "at") usually contribute little to the meaning of a sentence and are often removed during the text preprocessing phase.
While NLTK provides a default set of stopwords for multiple languages, there are cases where you may need to add custom stopwords to tailor the list to your specific use case. In this article, we will demonstrate how to add custom stopwords to NLTK's existing list and remove them from your text.
Adding Custom Stopwords using NLTK Library
Step 1: Install and Import NLTK
Before proceeding, ensure you have NLTK installed. If not, install it using pip:
pip install nltk
Now, import the necessary modules:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
You need to download the stopwords and tokenizer data if you haven't done so yet:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
Step 2: Define the Default Stopwords
NLTK's stopwords can be accessed for multiple languages. Here, we'll use English stopwords:
default_stopwords = set(stopwords.words('english'))
print(f"Default Stopwords: {len(default_stopwords)} words")
Output (the exact count may vary with the version of the NLTK stopwords data you have installed):
Default Stopwords: 179 words
Step 3: Add Custom Stopwords
You can extend the default stopword list by adding your custom words. For instance:
custom_stopwords = {'example', 'customword', 'anotherword'}
extended_stopwords = default_stopwords.union(custom_stopwords)
print(f"Extended Stopwords: {len(extended_stopwords)} words")
Output:
Extended Stopwords: 182 words
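Custom words often come from a config file or user input with inconsistent casing or stray whitespace. Because the filtering step lower-cases each token before the lookup, custom entries should be lower-case as well. The sketch below is one way to handle this; `normalize_stopwords` is a hypothetical helper, and `base_stopwords` is a small hand-written stand-in for `set(stopwords.words('english'))` so the snippet runs without NLTK data. It also shows that set difference can take words out of the list when your use case needs to keep them.

```python
def normalize_stopwords(words):
    """Lower-case and strip custom stopwords so they match lower-cased tokens."""
    return {w.strip().lower() for w in words if w.strip()}

# Assumption: a tiny stand-in for set(stopwords.words('english')).
base_stopwords = {"this", "is", "an", "a", "with", "that", "to", "be", "not"}

# Raw custom words with mixed case, padding, and an empty entry.
raw_custom = ["Example", " CustomWord ", "anotherword", ""]
custom_stopwords = normalize_stopwords(raw_custom)

# Union adds the custom words; set difference removes words you want to keep
# (for example, negations when doing sentiment analysis).
extended_stopwords = (base_stopwords | custom_stopwords) - {"not"}

print(sorted(custom_stopwords))
print(len(extended_stopwords))
```

In a real pipeline you would pass NLTK's English list in place of `base_stopwords`.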
Step 4: Remove Stopwords from Text
Now, let's see how to remove the extended stopwords from a sample text:
sample_text = "This is an example text with a customword that needs to be removed."
tokens = word_tokenize(sample_text)
# Remove stopwords
filtered_text = [word for word in tokens if word.lower() not in extended_stopwords]
print("Original Text:", sample_text)
print("Filtered Text:", ' '.join(filtered_text))
Output:
Original Text: This is an example text with a customword that needs to be removed.
Filtered Text: text needs removed .
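Notice that the trailing "." survives, because punctuation tokens are not part of the stopword list. If you also want to drop punctuation, one possible approach is to add a second condition to the filter. In this sketch, `is_punctuation` is a hypothetical helper, the token list is hard-coded to mirror what `word_tokenize` returns for the sample sentence, and `extended_stopwords` is a reduced stand-in so the snippet runs without NLTK data.

```python
import string

def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Assumption: hard-coded tokens standing in for word_tokenize(sample_text).
tokens = ["This", "is", "an", "example", "text", "with", "a", "customword",
          "that", "needs", "to", "be", "removed", "."]

# Assumption: a reduced stand-in for the extended stopword set built earlier.
extended_stopwords = {"this", "is", "an", "with", "a", "that", "to", "be",
                      "example", "customword", "anotherword"}

# Keep a token only if it is neither a stopword nor pure punctuation.
filtered_text = [
    word for word in tokens
    if word.lower() not in extended_stopwords and not is_punctuation(word)
]

print("Filtered Text:", ' '.join(filtered_text))  # Filtered Text: text needs removed
```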
Step 5: Additional Functionality (Removing Stopwords Dynamically)
If you want to manage custom stopwords dynamically at runtime:
# Remove stopwords from text, optionally extending the default list per call
def process_text(text, custom_words=None):
    if custom_words:
        current_stopwords = default_stopwords.union(custom_words)
    else:
        current_stopwords = default_stopwords
    tokens = word_tokenize(text)
    filtered = [word for word in tokens if word.lower() not in current_stopwords]
    return ' '.join(filtered)
# Example usage
custom_words = {'dynamicword', 'runtime'}
text = "This is a dynamicword that should be removed during runtime."
result = process_text(text, custom_words)
print("Processed Text:", result)
Output:
Processed Text: removed .
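The function above reads `default_stopwords` from the enclosing scope. A variant that takes the base list and the tokenizer as parameters can be easier to test and reuse. The following is a minimal sketch under stated assumptions: `remove_stopwords` and `simple_tokenize` are hypothetical names, `simple_tokenize` is a rough regex stand-in for `word_tokenize`, and `base` is a small hand-written stand-in for NLTK's English list.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords(text, base_stopwords, custom_words=None, tokenizer=simple_tokenize):
    """Remove base and optional custom stopwords from text."""
    # Lower-case custom words so they match the lower-cased tokens below.
    extra = {w.lower() for w in (custom_words or ())}
    current_stopwords = set(base_stopwords) | extra
    tokens = tokenizer(text)
    return ' '.join(t for t in tokens if t.lower() not in current_stopwords)

# Assumption: a tiny stand-in for set(stopwords.words('english')).
base = {"this", "is", "a", "that", "should", "be", "during"}

result = remove_stopwords(
    "This is a dynamicword that should be removed during runtime.",
    base,
    custom_words=["DynamicWord", "Runtime"],
)
print("Processed Text:", result)  # Processed Text: removed .
```

With NLTK available, you could pass `set(stopwords.words('english'))` as `base_stopwords` and `word_tokenize` as `tokenizer`.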
Complete Code
# Import necessary modules
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')
# Define the default stopwords
default_stopwords = set(stopwords.words('english'))
# Print the default stopwords count
print(f"Default Stopwords: {len(default_stopwords)} words")
# Add custom stopwords
custom_stopwords = {'example', 'customword', 'anotherword'}
extended_stopwords = default_stopwords.union(custom_stopwords)
# Print the extended stopwords count
print(f"Extended Stopwords: {len(extended_stopwords)} words")
# Sample text
sample_text = "This is an example text with a customword that needs to be removed."
# Tokenize the text
tokens = word_tokenize(sample_text)
# Remove stopwords
filtered_text = [word for word in tokens if word.lower() not in extended_stopwords]
# Print the original and filtered text
print("Original Text:", sample_text)
print("Filtered Text:", ' '.join(filtered_text))
# Function to process text with dynamic custom stopwords
def process_text(text, custom_words=None):
    """
    Removes stopwords from the text, including any custom stopwords if provided.

    Args:
        text (str): Input text to process.
        custom_words (set, optional): Set of custom stopwords to add. Defaults to None.

    Returns:
        str: Text after removing stopwords.
    """
    if custom_words:
        current_stopwords = default_stopwords.union(custom_words)
    else:
        current_stopwords = default_stopwords
    tokens = word_tokenize(text)
    filtered = [word for word in tokens if word.lower() not in current_stopwords]
    return ' '.join(filtered)
# Example usage of the function
custom_words = {'dynamicword', 'runtime'}
text = "This is a dynamicword that should be removed during runtime."
result = process_text(text, custom_words)
# Print the processed text
print("Processed Text:", result)
Adding custom stopwords in NLTK gives you more flexibility when preprocessing text for specific use cases. By extending the default stopword list and managing it dynamically, you can refine your text preprocessing pipeline, which can help downstream tasks like text classification, sentiment analysis, or information retrieval.