Similarity Search, Part 2: Product Quantization

Learn a powerful technique to effectively compress large data

9 min read · May 10, 2023

Similarity search is the problem of finding, for a given query, the documents most similar to it among all documents in a database.

Introduction

In data science, similarity search often appears in NLP, search engines, and recommender systems, where the most relevant documents or items need to be retrieved for a query. There are many ways to improve search performance on massive volumes of data.

In the first part of this article series, we looked at kNN and the inverted file index for performing similarity search. As we learned, kNN is the most straightforward approach, while the inverted file index builds on top of it and offers a trade-off between search speed and accuracy. However, neither method compresses the data, which can lead to memory issues, especially with large datasets and limited RAM. In this article, we will address this issue by looking at another method called Product Quantization.
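To get a rough sense of the memory issue, here is a minimal sketch that estimates how much RAM an uncompressed collection of vectors needs. The dataset size, dimensionality, and 4-byte float width are illustrative assumptions, not figures from this series, and `raw_index_bytes` is a hypothetical helper name.

```python
def raw_index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store n_vectors uncompressed vectors of length dim.

    bytes_per_value defaults to 4, assuming 32-bit floats.
    """
    return n_vectors * dim * bytes_per_value


# Hypothetical dataset: one million 128-dimensional float32 embeddings.
size = raw_index_bytes(n_vectors=1_000_000, dim=128)
print(f"{size / 1e6:.0f} MB")  # 1,000,000 * 128 * 4 bytes = 512 MB
```

Under these assumptions, the footprint grows linearly with both the number of vectors and their dimensionality, so a collection a thousand times larger would need roughly 512 GB before any index overhead.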

Published in TDS Archive

Written by Vyacheslav Efimov