Text Clustering with TF-IDF in Python

A walkthrough of a simple text clustering pipeline, with a full example and code

8 min read · Nov 24, 2021

Photo by Andrew Wulf on Unsplash

TF-IDF (term frequency-inverse document frequency) is a well-known and well-documented vectorization technique in data science. Vectorization is the process of converting data into a numerical format that a statistical model can interpret and use to make predictions.

In this article we will see how to convert a corpus of text into numerical format and apply machine learning algorithms to bring out interesting patterns and anomalies.

Methodology

To keep the corpus reproducible, we will use a dataset provided by Sklearn. We will then use the KMeans algorithm to group the vectors generated by TF-IDF. Finally, we will use Principal Component Analysis (PCA) to visualize our groups and bring out common or unusual characteristics of the texts in our corpus.

Here is what we’ll do:

  • import the dataset
  • apply preprocessing to our corpus to remove words and symbols which, when converted into numerical format, do not add value to our model
  • use TF-IDF as a vectorization algorithm
  • apply KMeans to group our data
  • apply PCA to reduce the dimensionality of our vectors to 2 for…
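The steps listed above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the article's full code: the six-document corpus, the `preprocess` helper, the choice of two clusters, and the use of scikit-learn's built-in English stop word list are all illustrative stand-ins.

```python
import re

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real dataset.
corpus = [
    "The rocket launch was delayed by bad weather!",
    "NASA plans a new mission to the moon in 2025.",
    "Astronauts returned safely from the space station.",
    "The team won the hockey game in overtime.",
    "The goalie made 35 saves during the playoff match.",
    "Fans cheered as the season's final game ended.",
]


def preprocess(text):
    """Lowercase the text and keep only letters and single spaces."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# 1. Preprocessing: strip symbols and digits that add no value.
cleaned = [preprocess(doc) for doc in corpus]

# 2. Vectorization: TF-IDF, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)

# 3. Clustering: KMeans with an assumed number of groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# 4. Dimensionality reduction: project the vectors onto 2 components.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X.toarray())

for doc, label, (x, y) in zip(corpus, labels, coords):
    print(f"cluster={label}  x={x:+.3f}  y={y:+.3f}  {doc}")
```

The two columns of `coords` can be passed to any scatter plot function, with `labels` as the color, to inspect the groups visually. On a real corpus the number of clusters has to be chosen to fit the data.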


Written by Andrea D'Agostino

Data scientist. I write about data science, machine learning and analytics. I also write about career and productivity tips to help you thrive in the field.