Binary (Lower Bound) Search in SQL

We explore an implementation of Binary Search in declarative SQL using recursive CTEs

6 min readDec 24, 2022

Binary Search is a very popular (and machine efficient) technique to locate data in a sorted array. It is usually implemented in imperative programming languages such as C++, Java, Python, etc… Today we shall explore the implementation of Binary Search in declarative SQL syntax.

More specifically, we shall look at the implementation of the lower_bound() function in SQL. lower_bound(value) returns the location of the first element in the sorted input range before which value can be inserted to preserve the sorted-ness of the data.

Previous Article: Maximum Subarray Sum using SQL

An implementation in C++

This is how lower_bound may be implemented in C++ (for an integer array).

// Returns a pointer into the array (begin, end) such that the pointer
// points to the first position where target_value can be safely
// inserted while preserving the sortedness of the range (begin, end+1).
int* lower_bound(int *begin, int *end, int target_value) {
  while (begin != end) {
    int *m = begin + (end - begin) / 2;
    if (target_value <= *m) {
      // Go Left
      end = m;
    } else {
      // Go Right
      begin = m + 1;
    }
  }
  return begin;
}

Input Table Schema

The input table has 2 columns:

idx: The array element index (if using arrays in C++ this is what you would address elements in the array with).
value: The stored value.

The table is assumed to be sorted (in non-descending order) on the value field.

CREATE TABLE array_data(idx SERIAL PRIMARY KEY, value INTEGER NOT NULL);

-- Insert 1000 randomly generated values in non-decreasing sorted order
-- into the table array_data.
INSERT INTO array_data(idx, value)

WITH RECURSIVE ins_query(value, n) AS (
  SELECT FLOOR(RANDOM() * 5000) + 1 AS value, 1000 AS n
  
  UNION ALL
  
  SELECT FLOOR(RANDOM() * 5000) + 1, n - 1
  FROM ins_query
  WHERE n - 1 > 0
)

SELECT
  ROW_NUMBER() OVER (ORDER BY value ASC) AS idx,
  value
FROM ins_query
ORDER BY idx ASC;

We use a recursive CTE (common table expression) to generate the 1000 random values to add into the table in sorted order.

We’ll be seeing more uses of recursive CTEs to simulate looping in SQL below.

SELECT * FROM array_data LIMIT 10;

A sample of 10 rows from the input table array_data (Image by author)

First Solution: Simple SQL using ORDER BY and LIMIT

We can simply solve this problem fairly efficiently in SQL since we already have a PRIMARY KEY index on the idx column.

SELECT
  idx
FROM array_data
WHERE value >= 181
ORDER BY value ASC, idx ASC
LIMIT 1;

Result of the query from the first solution (Image by author)

Estimated Cost: The estimated cost for this query on a table of 1000 rows is 22.

Get Dhruv Matani’s stories in your inbox

Join Medium for free to get updates from this writer.

Remember me for faster sign in

We expect the expected complexity of this solution to be O(log n), where “n” is the number of elements in our table array_data. This is because the default index type for PostgreSQL is B-Tree.

Second Solution: Explicit binary search-like execution

Since we’re trying to see how far we can go with SQL, we want to explicitly write our own search procedure.

Drawing inspiration from the C++ based solution above, this is what our SQL code looks like.

WITH RECURSIVE lower_bound(begin_idx, end_idx, target_value) AS (
    VALUES (1, (SELECT 1 + COUNT(1) FROM array_data), 181)
  UNION ALL
    (
      -- First, locate the mid point of the sorted range, and the
      -- element at that location so that we can compare it with the
      -- value being searched for. Later, we shall see how to decide
      -- whether to go left or right in our search.
      WITH element_at_idx AS (
        SELECT
          idx, value, target_value, begin_idx, end_idx,
          -- is_complete is TRUE if the incoming range [begin, end) is empty.
          -- If it's empty, we can exit the search.
          CASE WHEN begin_idx = end_idx THEN TRUE ELSE FALSE END AS is_complete
        FROM lower_bound INNER JOIN array_data
        ON idx = (begin_idx + (end_idx - begin_idx) / 2)
        LIMIT 1
      ),

      -- Due to the nature of SQL, we need to compute both ranges
      -- as if the value was found either on the left half of the
      -- original range...
      go_left AS (
        SELECT
          begin_idx, idx AS end_idx, target_value, is_complete
        FROM element_at_idx
        WHERE target_value <= value
        LIMIT 1
      ),
      
      -- ... or on the right half of the original range.
      go_right AS (
        SELECT
          idx + 1 AS begin_idx, end_idx, target_value, is_complete
        FROM element_at_idx
        WHERE target_value > value
        LIMIT 1
      ),
      
      -- and then we need to pick the side on which the value
      -- *actually* may lie by ignoring the incorrect side.
      one_of_2 AS (
        SELECT * FROM go_left
        UNION ALL
        SELECT * FROM go_right
      )
      
      SELECT
        begin_idx,
        end_idx,
        target_value
      FROM one_of_2
      -- We need to check is_complete (the incoming range) and not the
      -- outgoing range since if we check the outgoing range and return
      -- an empty result set, it will not be recorded in the output by
      -- the recursive CTE, and we will be left with an ambiguous output
      -- range.
      WHERE is_complete = FALSE
    )
)

SELECT * FROM lower_bound;

This is what the query output looks like. The value of end_idx for the row where begin_idx = end_idx is our answer (which is the same as the query above).

The result of the query from the second solution (Image by author)

As you can see, we have 11 rows in the result set. We recursed O(log n) times to find our answer. In a conventional programming language, an array access costs O(1) time, but in SQL, locating an element in a table via a PRIMARY KEY column costs O(log n) (assuming a B-Tree index). The code below shows where this array access is happening in our code.

WITH element_at_idx AS (
  SELECT
    idx, value, target_value, begin_idx, end_idx,
    -- is_complete is TRUE if the incoming range [begin, end) is empty.
    -- If it's empty, we can exit the search.
    CASE WHEN begin_idx = end_idx THEN TRUE ELSE FALSE END AS is_complete
  FROM lower_bound INNER JOIN array_data
  ON idx = (begin_idx + (end_idx - begin_idx) / 2)
  LIMIT 1
),

Hence, the cost of our solution becomes O(log² n).

Estimated Cost: The estimated cost for this query on a table of 1000 rows is 62.

SQL Fiddle

The SQL Fiddle link to all the solutions in this post can be found here.

Learning

Here’s what I learned while writing this solution:

Iteration can sometimes be replaced by recursion as supported by recursive CTEs in SQL.
Handling conditional logic and early returns is hard. One way to do it is to compute values for both code (all) paths in the conditional logic and then eliminate the part which isn’t needed. This needs to be done carefully since you need to ensure that the part that isn’t needed won’t end up throwing some sort of exception and crashing the query. CPU processors typically do something similar as part of specilative instruction execution. In some sense we are also being speculative, and then choosing the result from the actual path.
When writing recursive CTEs, we need to think of our intermediate table as storing all the arguments to the recursive function as well as the intermediate variables and final result from the method.

References

Using CTEs to do a binary search of large tables with non-indexed correlated data

Dev Genius