Text Clustering with TF-IDF in Python
An explanation of a simple pipeline for text clustering, with a full example and code
TF-IDF (term frequency-inverse document frequency) is a well-known and well-documented vectorization technique in data science. Vectorization is the process of converting data into a numerical format that a statistical model can interpret and use to make predictions.
In this article we will see how to convert a corpus of text into a numerical format and apply machine learning algorithms to bring out interesting patterns and anomalies.
Methodology
We will use a dataset provided by scikit-learn (sklearn) so that the corpus is reproducible. After that, we will use the KMeans algorithm to group the vectors generated by TF-IDF. We will then use Principal Component Analysis (PCA) to visualize our groups and bring out common or unusual characteristics of the texts in our corpus.
Here is what we’ll do:
- import the dataset
- apply preprocessing to our corpus to remove words and symbols which, when converted into numerical format, do not add value to our model
- use TF-IDF as a vectorization algorithm
- apply KMeans to group our data
- apply PCA to reduce the dimensionality of our vectors to 2 for…
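The steps listed above can be sketched end to end as follows. This is only an illustrative outline under a few assumptions: the six-sentence `corpus` is a hypothetical stand-in for the scikit-learn dataset, preprocessing is reduced to the lowercasing and English stop-word removal built into `TfidfVectorizer`, and the choice of two clusters is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; the article uses a scikit-learn dataset instead
corpus = [
    "The rocket launch was delayed by bad weather.",
    "NASA plans a new mission to the moon.",
    "The satellite reached orbit after launch.",
    "The team won the hockey game in overtime.",
    "The goalie made thirty saves in the game.",
    "Hockey playoffs start next week.",
]

# Preprocessing and vectorization: lowercase the text, drop English
# stop words, and weight the remaining terms with TF-IDF
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)

# Group the TF-IDF vectors; n_clusters=2 is an assumption for this toy corpus
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Reduce the vectors to 2 dimensions; the sparse matrix is densified
# first, which is fine for a small corpus but costly for a large one
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X.toarray())

for text, label, (x, y) in zip(corpus, labels, coords):
    print(f"cluster={label}  x={x:+.2f}  y={y:+.2f}  {text}")
```

The `coords` array holds one 2D point per document, so it can be passed directly to a scatter plot and colored by `labels`.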
