Similarity Search, Part 1: kNN & Inverted File Index

An introduction to similarity search with kNN and its acceleration with an inverted file index.

9 min read · Apr 28, 2023

Similarity search is the problem of finding, for a given query, the documents most similar to it among all documents in a database.

Introduction

In data science, similarity search often appears in NLP, search engines, and recommender systems, where the most relevant documents or items need to be retrieved for a query. Documents and items usually come in the form of text or images. However, machine learning algorithms cannot work directly with raw text or images, which is why documents and items are usually preprocessed and stored as vectors of numbers.
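As a rough illustration of what "stored as vectors of numbers" can mean, the sketch below turns a few short texts into word-count vectors (a bag-of-words representation). The toy corpus and the helper names `build_vocabulary` and `vectorize` are assumptions made for this example only; real systems typically rely on learned representations rather than raw counts.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct lowercase word to a fixed vector position."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # The vocabulary dict preserves insertion (alphabetical) order,
    # so every text is mapped onto the same sequence of positions.
    return [counts.get(word, 0) for word in vocabulary]


# Toy corpus, purely illustrative (not from the article).
documents = ["the cat sat", "the dog sat", "the cat saw the dog"]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]
```

Every document ends up as a list of the same length, one number per vocabulary word, which is the kind of fixed-size numeric input that machine learning algorithms can work with.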

Sometimes each component of a vector can store a semantic meaning. In this case, these representations are also called embeddings. Such embeddings can have hundreds of dimensions, and a database can contain millions of them! At this scale, any information retrieval system must be capable of rapidly retrieving relevant documents.

In machine learning, a vector is also referred to as an object or point.
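Treating vectors as points makes "most similar" measurable as "closest". The sketch below is a minimal illustration of that idea using Euclidean distance; the two-dimensional points, the document names, and the choice of metric are all assumptions made for this example.

```python
import math

# Hypothetical 2-dimensional database vectors (points), chosen for illustration.
database = {
    "doc_a": [1.0, 1.0],
    "doc_b": [4.0, 5.0],
    "doc_c": [0.0, 4.0],
}
query = [1.0, 2.0]

# Euclidean distance from the query to every stored point.
distances = {name: math.dist(query, point) for name, point in database.items()}

# The most similar document is the one whose point lies closest to the query.
nearest = min(distances, key=distances.get)
```

This comparison touches every stored vector, which is manageable for three points but becomes the bottleneck once a database holds millions of high-dimensional embeddings.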

Published in TDS Archive

Written by Vyacheslav Efimov