Longest Increasing Subsequence of an array in SQL

7 min readDec 26, 2022

The Longest Increasing Subsequence problem asks you to find the longest subsequence of elements in an array of integers such that they are strictly increasing in value from left to right. This problem has a well known O(n log n) solution in imperative programming languages such as C++, Java, Python, etc… We explore a solution purely in declarative SQL and present some learnings and new techniques.

Previous Article: Binary (Lower Bound) Search in SQL

Problem statement and online coverage

This problem is pretty popular in programming circles and has wide coverage online. Here are some resources to get you more familiar with the problem in case you aren’t already.

Input Table Schema

The input table has 2 columns:

idx: The array element index (if using arrays in C++ this is what you would address elements in the array with).
i: The stored value.

CREATE TABLE input_data(idx SERIAL PRIMARY KEY, i INTEGER);

INSERT INTO input_data(i) VALUES
(10), (22), (9), (33), (21), (50), (41), (60), (80);

Here’s what the table looks like when viewed.

SELECT * FROM input_data;

Contents of the input_data table (Image by author)

Longest Increasing Subsequence

The longest increasing subsequence of this sequence has length 6 and one of them is the following:

10, 22, 33, 50, 60, 80

There’s another increasing subsequence of the same length which looks like this:

10, 22, 33, 41, 60, 80

There could be more than 1 increasing subsequence of numbers such that they are all the longest increasing subsequence. i.e. the longest increasing subsequence may not be unique.

A note common to both solutions

Both the solutions below rely heavily on recursive CTEs (common table expressions). If you are unfamiliar with this concept, please consider reading up on it in the provided link.

First Solution: O(n⁴ log n)

Our first attempt at solving the problem starts with “n” sequences of length 1 each (just a single element) and attempt to double the length at every step. Converging to the final solution takes O(log n) steps. However, at each step, there are O(n²) rows to consider which each need to be matched up with O(n²) rows, and hence the O(n⁴) term in the final complexity. Each of the rows being considered has the following columns:

idx1: The start index of the subsequence
idx2: The end index of the subsequence
i1: The start element of the subsequence
i2: The end element of the subsequence
lis_length: The length of the increasing subsequence formed by elements between idx1 and idx2
iters: The number of iterations remaining to converge to the final solution

The rule to merge 2 sequences (assuming the LHS sequence is positionally before the RHS sequence) is simple:

Are the 2 LIS sequences such that the last element of the LHS sequence is the same element as the first element of the RHS sequence, in which case the LIS length is LIS(LHS) + LIS(RHS) — 1 (due to the overlapping element), OR
Are the 2 LIS sequences such that the last element of the LHS sequence is < the first element of the RHS sequence, in which case the LIS length is LIS(LHS) + LIS(RHS) (due to no overlapping element).

The SQL code for this solution is shown below.

WITH RECURSIVE base_table AS (
  SELECT idx AS idx1, idx AS idx2, i AS i1, i AS i2, 1 AS lis_length,
    (SELECT CEIL(LOG(COUNT(1)) / LOG(2.0)) + 2 FROM input_data) AS iters
  FROM input_data
),

lis_table(idx1, idx2, i1, i2, lis_length, iters) AS (
  SELECT
    *
  FROM base_table

  UNION ALL
  
  (
    WITH snapshot AS (
      SELECT * FROM lis_table
    )

    SELECT
      lhs.idx1 AS idx1,
      rhs.idx2 AS idx2,
      lhs.i1 AS i1,
      rhs.i2 AS i2,
      MAX(
        lhs.lis_length + rhs.lis_length + (
          CASE WHEN lhs.idx2 = rhs.idx1 THEN -1 ELSE 0 END
        )
      ) AS lis_length,
      lhs.iters - 1 AS iters
    FROM snapshot lhs INNER JOIN snapshot rhs
    ON (lhs.idx2 = rhs.idx1) OR (lhs.idx2 < rhs.idx1 AND lhs.i2 < rhs.i1)
    -- Stop the recursion (by returning an empty table) when iters = 0
    WHERE lhs.iters > 0
    GROUP BY 1, 2, 3, 4, 6
  )
)

-- We display the top 8 candidates to show that there can be
-- potentially a different number of sequences with the same
-- length (number of elements).
SELECT idx1, idx2, i1, i2, lis_length FROM lis_table
GROUP BY 1, 2, 3, 4, 5
ORDER BY lis_length DESC
LIMIT 8;

Output of the query for the first solution which runs in time O(n⁴ log n) (Image by author)

Estimated Cost: The estimated cost for this query on a table of 9 rows is 104M.

Second Solution: O(n²)

The more common solution to this problem looks at every element in the input sequence from back to front, and tries to see if the element being considered can extend one of the existing LIS sequences that start after the element being considered.

Get Dhruv Matani’s stories in your inbox

Join Medium for free to get updates from this writer.

Remember me for faster sign in

For example, if you are considering element at index x in the input array, you check if there is an LIS starting at indexes [x+1..n], and try to extend them all. Then record the longest such sequence against the index x.

The schema of the recursive table (lis_table) looks like this:

idx: The index of the array element
i: The value of the array element
lis_length: The length of the LIS starting at the array element at index idx
current_idx: The index of the element to process in the next recursive iteration. When current_idx = idx — 1, the element at index idx will be processed.

Finally, we show only those rows that were updated in any given iteration. Since a row will only be updated when current_idx = idx — 1, we only display those rows in the final result.

WITH RECURSIVE with_last_idx AS (
  SELECT idx, i, MAX(idx) OVER () AS last_idx
  FROM input_data
),

with_lis_length AS (
  SELECT
    idx,
    i,
    CASE WHEN idx = last_idx THEN 1 ELSE 0 END AS lis_length,
    last_idx - 1 AS current_idx
  FROM with_last_idx
),

lis_table(idx, i, lis_length, current_idx) AS (
  SELECT * FROM with_lis_length
  
  UNION ALL
  
  (
    -- We can't run an OUTER JOIN on a table that is participating
    -- in a recursive CTE, so we take a point in time snapshot and
    -- use that instead.
    WITH lis_with_tag AS (
      SELECT
        idx,
        i,
        lis_length,
        current_idx,
        -- When the incoming table's current_idx is 0, it means that we're
        -- going to process the row with idx = 0 in this iteration. Such a
        -- row is not part of the original input, so we mark this process
        -- as complete and tag is_complete = TRUE so that we don't consider
        -- this result of this iteration in the final result.
        --
        -- We need to be careful and ensure that we are able to process
        -- this round without an error or exception.
        CASE WHEN current_idx = 0 THEN TRUE ELSE FALSE END AS is_complete 
      FROM lis_table
    ),

    -- First extract the row containing the element that we are going
    -- to attempt to add to the LIS.
    row_to_process AS (
      SELECT
        idx,
        i,
        lis_length,
        current_idx,
        is_complete 
      FROM lis_with_tag
      WHERE idx = current_idx
    ),

    -- Then extract the result of the rows that are expected to be unchanged.
    rest_of_the_rows AS (
      SELECT
        idx,
        i,
        lis_length,
        current_idx - 1 AS current_idx,
        is_complete 
      FROM lis_with_tag
      WHERE idx <> current_idx
    ),

    -- The wavefront is the set of possible extensions of the LIS
    -- due to the element in the table row_to_process.
    wavefront AS (
      SELECT
        lhs.idx,
        lhs.i,
        MAX(COALESCE(rhs.lis_length + 1, 0)) AS lis_length,
        lhs.current_idx - 1 AS current_idx,
        lhs.is_complete
      FROM row_to_process lhs LEFT JOIN (
        SELECT * FROM rest_of_the_rows WHERE idx > current_idx
      ) rhs
      ON lhs.i < rhs.i
      GROUP BY lhs.idx, lhs.i, lhs.current_idx, lhs.is_complete
    ),

    -- Now UNION the unchanged rows with the new row that
    -- potentially extends the LIS.
    unioned AS (
      SELECT * FROM wavefront
      UNION ALL
      SELECT * FROM rest_of_the_rows
    )
    
    SELECT idx, i, lis_length, current_idx FROM unioned
    WHERE is_complete = FALSE
  )
) 

SELECT * FROM lis_table
-- Only keep rows that were processed in that iteration.
WHERE idx - 1 = current_idx
ORDER BY idx ASC, current_idx ASC;

The result printed by running the query for the second solution (Image by author)

The LIS is shown to have 6 elements, and it starts at the number 10. The numbers circled in orange are the ones that participate in an LIS of length 6.

Alternative ways to implement this solution

A peculiarity of this solution is that the recursive CTE table contains both the rows (indexes) to be processed as well as the rows that have already been processed (starting from the end). We could have implemented the solution so that this table contains only the already processed rows instead of the unprocessed rows. It would have involved fetching the new rows from a separate table. The final solution would still be O(n²), but the constants would be lower since we won’t be copying exactly “n” rows at every step, but will copy 1, 2, 3, …, n rows in our recursive steps.

SQL Fiddle

The SQL Fiddle link to all the solutions in this post can be found here.

Conclusion

One of the things I learnt after implementing this solution is how to think of solutions functionally. i.e. in a world where one can not update any state, but can only add to or aggregate existing state. An update can be thought of as an row addition followed by an aggregation of rows having the same keys. This same concept is leveraged in SSTables (a popular storage format for database systems) where updates and deletes are processed as inserts and row data gets aggregated as various SSTables at different levels containing the same row key get periodically merged together.

Dev Genius