Emulating Analytic (AKA Ranking) Functions with MySQL

于 2007-11-26 17:22:25 发布 · 249 阅读

收录于

MySQL

本文介绍如何使用MySQL模拟Oracle等数据库中的分析函数，如排名函数，通过实际案例展示如何实现跨部门薪资排名，并优化查询效率。

Emulating Analytic (AKA Ranking) Functions with MySQL

by Stéphane Faroult
03/29/2007

http://www.oreilly.com/pub/a/mysql/2007/03/29/emulating-analytic-aka-ranking-functions-with-mysql.html

One of the most hailed extensions brought to SQL in recent years has been these functions that Oracle calls analytic functions, DB2 calls OLAP functions, and SQL Server 2005 calls ranking functions--but which MySQL, so far, still lacks. The good news is that they can be (relatively) easily and efficiently emulated.

Oracle introduced these functions early (with version 8.1.6, back in 1999 or about) and still offers a greater number of such functions than the other major players (who usually provide the most basic functions, from which the others can be derived.) Therefore, I will take my references from Oracle, and will show how you can, in most cases, obtain the same result with MySQL.

What Are Analytic Functions?

Anyone who has dabbled with SQL is familiar with aggregate functions such as SUM() or COUNT(). But as their name implies, when you use these functions and perform a group by, the detail is lost in the aggregate. Very simply, analytic functions allow you to return the detail (simple, unaggregated rows) and at the same time, on the very same row, a result that is computed from operations on several rows that are related in one way or another to the current one. Let me give an example. Suppose we have the unavoidable Oracle EMP table:

SQL> desc emp Name Null? Type ----------------------------------------- -------- ---------------- EMPNO NOT NULL NUMBER(4) ENAME VARCHAR2(10) JOB VARCHAR2(9) MGR NUMBER(4) HIREDATE DATE SAL NUMBER(7,2) COMM NUMBER(7,2) DEPTNO NUMBER(2)

We can easily compute with a group by the total salary amount, department by department:

SQL> select deptno, sum(sal) 2 from emp 3 group by deptno;

DEPTNO SUM(SAL) ---------- ---------- 30 9400 20 10875 10 8750

By doing so, we lose the detail. However, if we use SUM() in an analytical way, we can return each row, and compute on the same row the total salary amount, specifying in the over clause that the SUM() function actually applies to all rows that refer to the same department number as the current one:

SQL> select deptno, ename, sal, sum(sal) over (partition by deptno) 2 from emp 3 order by 1, 3 4 /

DEPTNO ENAME SAL SUM(SAL)OVER(PARTITIONBYDEPTNO) ---------- ---------- ---------- ------------------------------- 10 MILLER 1300 8750 10 CLARK 2450 8750 10 KING 5000 8750 20 SMITH 800 10875 20 ADAMS 1100 10875 20 JONES 2975 10875 20 SCOTT 3000 10875 20 FORD 3000 10875 30 JAMES 950 9400 30 MARTIN 1250 9400 30 WARD 1250 9400 30 TURNER 1500 9400 30 ALLEN 1600 9400 30 BLAKE 2850 9400

14 rows selected.

SQL>

Such a result can be very useful in computing, for instance, what percentage of the salary mass of a department does the Pointy Haired Boss's salary represent? Needless to say, all we need to do with MySQL is to join a regular select to the aggregate to obtain a similar result:

mysql> select a.DEPTNO, a.ENAME, a.SAL, b.TOTSAL -> from EMP as a -> inner join (select DEPTNO, sum(SAL) TOTSAL -> from EMP -> group by DEPTNO) as b -> on a.DEPTNO = b.DEPTNO -> order by 1, 3; +--------+--------+------+--------+ | DEPTNO | ENAME | SAL | TOTSAL | +--------+--------+------+--------+ | 10 | MILLER | 1300 | 8750 | | 10 | CLARK | 2450 | 8750 | | 10 | KING | 5000 | 8750 | | 20 | SMITH | 800 | 10875 | | 20 | ADAMS | 1100 | 10875 | | 20 | JONES | 2975 | 10875 | | 20 | SCOTT | 3000 | 10875 | | 20 | FORD | 3000 | 10875 | | 30 | JAMES | 950 | 9400 | | 30 | WARD | 1250 | 9400 | | 30 | MARTIN | 1250 | 9400 | | 30 | TURNER | 1500 | 9400 | | 30 | ALLEN | 1600 | 9400 | | 30 | BLAKE | 2850 | 9400 | +--------+--------+------+--------+ 14 rows in set (0.00 sec)

mysql>

We must, though, be careful about one thing: the analytic function applies to the result set defined by the query in which it appears. This means that if we apply a restrictive condition, the sum as computed by the analytic function changes:

SQL> select deptno, ename, sal, sum(sal) over (partition by deptno) 2 from emp 3 where job not in ('PRESIDENT', 'MANAGER') 4 order by 1, 3 5 /

DEPTNO ENAME SAL SUM(SAL)OVER(PARTITIONBYDEPTNO) ---------- ---------- ---------- ------------------------------- 10 MILLER 1300 1300 20 SMITH 800 7900 20 ADAMS 1100 7900 20 SCOTT 3000 7900 20 FORD 3000 7900 30 JAMES 950 6550 30 WARD 1250 6550 30 MARTIN 1250 6550 30 TURNER 1500 6550 30 ALLEN 1600 6550

10 rows selected.

SQL>

If we want to have the same result in our MySQL implementation, we must apply the restrictive condition on the job description twice, once when computing the aggregate, and once when returning the individual rows. So, those highly praised analytic functions are just shortcuts for joining a table to an aggregate over the same table? Big deal! Well, not quite. Things get a little more complicated when we introduce a minor twist: relating rows in the "window" defined by the over clause to each other, which is usually performed by adding an order by to the clause, and optionally a range restriction.

`Adding Ordering to the OVER Clause`

SUM() is not the best function to apply an order by to, because whatever the order of the rows, the sum remains the same. It is a different matter when you want to compare rows within a window to each other, for instance ranking them with regard to a particular column. Oracle allows you to use regular aggregate functions in an analytical way, but it also introduces a number of purely analytic functions, of which RANK() can be considered to be the archetype. Actually, the typical question that calls for an analytic function is along the lines of "what are the top two salaries per department?". This is a question that calls for ranking rows relative to each other (hence the name of ranking function used with SQL Server 2005).

The per department in the previous question clearly tells us that we must partition by department. We could have omitted the partition by to apply the function to the whole of the table, but then a simple order by and a limit clause would have done the trick in MySQL. What allows us to assign a particular rank is the salary, which is what we must order by.

SQL> select deptno, empno, ename, sal, rank() over (partition by deptno 2 order by sal desc) "RANK" 3 from emp;

DEPTNO EMPNO ENAME SAL RANK ---------- ---------- ---------- ---------- ---------- 10 7839 KING 5000 1 10 7782 CLARK 2450 2 10 7934 MILLER 1300 3 20 7788 SCOTT 3000 1 20 7902 FORD 3000 1 20 7566 JONES 2975 3 20 7876 ADAMS 1100 4 20 7369 SMITH 800 5 30 7698 BLAKE 2850 1 30 7499 ALLEN 1600 2 30 7844 TURNER 1500 3 30 7654 MARTIN 1250 4 30 7521 WARD 1250 4 30 7900 JAMES 950 6

14 rows selected.

SQL>

We get a result set that we just need to "push" inside a subquery in the from clause. By applying to it a filter referring to the newly computed column RANK we get what we want:

SQL> select * 2 from (select deptno, empno, ename, sal, 3 rank() over (partition by deptno 4 order by sal desc) "RANK" 5 from emp) 6 where "RANK" <= 2 7 order by deptno, "RANK" 8 /

DEPTNO EMPNO ENAME SAL RANK ---------- ---------- ---------- ---------- ---------- 10 7839 KING 5000 1 10 7782 CLARK 2450 2 20 7788 SCOTT 3000 1 20 7902 FORD 3000 1 30 7698 BLAKE 2850 1 30 7499 ALLEN 1600 2

6 rows selected.

SQL>

Note that there is a tie in department 20.

How can we compute a rank with MySQL? A quick Web search will provide you with an answer: by using a subquery in the select list. Saying that Blake has the highest salary in department 30 only means that no one from this department earns more. Because Allen has the second highest salary, only one person in the same department, our friend Blake, gets a bigger bank transfer each month. Computing how many people from the same department earn more than the current employee and adding one to the result yields the desired answer. In the very same fashion as we have done it with Oracle, we can compute the rank for every row, push the query into the from clause and filter on the rank in MySQL. The only difference is that with MySQL, associating an alias to a subquery in the from clause is mandatory.

mysql> select * -> from (select a.DEPTNO, a.EMPNO, a.ENAME, a.SAL, -> (select 1 + count(*) -> from EMP b -> where b.DEPTNO = a.DEPTNO -> and b.SAL > a.SAL) RANK -> from EMP as a) as x -> where x.RANK <= 2 -> order by x.DEPTNO, x.RANK; +--------+-------+-------+------+------+ | DEPTNO | EMPNO | ENAME | SAL | RANK | +--------+-------+-------+------+------+ | 10 | 7839 | KING | 5000 | 1 | | 10 | 7782 | CLARK | 2450 | 2 | | 20 | 7902 | FORD | 3000 | 1 | | 20 | 7788 | SCOTT | 3000 | 1 | | 30 | 7698 | BLAKE | 2850 | 1 | | 30 | 7499 | ALLEN | 1600 | 2 | +--------+-------+-------+------+------+ 6 rows in set (0.01 sec)

mysql>

A lovely result. Are we happy? Well, if there is one thing that too many years spent with databases has taught me, it is that tables have a tendency to grow over time. That the nice little trick that gives exactly the result we want had better be tested against a table a tad bigger than 14 lines. I have therefore created, both in Oracle and MySQL, an EMPLOYEES table vaguely similar in structure to EMP, and I have dutifully populated it with a mere 10,000 rows. The is by no means an impressive number of rows, even if companies with 10,000 employees are big corporations. The employees are uniformly (but randomly) spread among six departments. I have asked Oracle to time its queries, and Oracle (10.2) and MySQL (5.0) are running on the very same server. Let's ask for the five biggest salaries for each department.

Here is what we get with Oracle:

SQL> select * 2 from (select deptno, empno, lastname, firstname, sal, 3 rank() over (partition by deptno 4 order by sal desc) "RANK" 5 from employees) 6 where "RANK" <= 5 7 order by deptno, "RANK" 8 /

DEPTNO EMPNO LASTNAME FIRSTNAME SAL RANK ---------- ---------- -------------------- -------------------- ---------- ---------- 10 4751 WOLFORD KATHY 7997.34 1 10 10517 HARRIS GARLAND 7993.74 2 10 8135 GRAHAM LIANA 7988.67 3 10 7028 PRATT GLORIA 7980.45 4 10 6694 GEARY SANDRA 7973.03 5 20 7158 WHALEY PATRICIA 7997.44 1 20 10008 JUNKER TOM 7992.63 2 20 1782 SCOTT JOHN 7991.65 3 20 4158 OLIVER BLANCHE 7989.13 4 20 3537 BUCHANAN LESLEY 7982.61 5 30 7892 VALENCIA EDWARD 7993.51 1 30 5491 PEREZ MARTIN 7990.48 2 30 5751 LARSON LEE 7990.4 3 30 5743 PACHECO YOLANDA 7989.91 4 30 7115 COVEY BRANDON 7984.36 5 40 5493 KOFFLER DENNIS 7998.95 1 40 4123 MURRAH MARY 7997.23 2 40 8103 MORGAN ASHLEY 7995.9 3 40 6171 WINSTEAD DOROTHY 7993.35 4 40 7098 FERRELL HECTOR 7987.78 5 50 9860 ORMAN GUADALUPE 7993.79 1 50 7446 KURLAND GARY 7989.36 2 50 8470 THOMAS JAMES 7989.11 3 50 9251 SOLANO JULIE 7988.24 4 50 2769 MILTON DEBORAH 7986.29 5 60 2899 FARMER FRANK 7981.1 1 60 5021 COCHRAN GARY 7978.5 2 60 6607 MOORE DONALD 7977.7 3 60 4013 BURMEISTER HATTIE 7973.04 4 60 1472 LADD JEFFREY 7952.5 5

30 rows selected.

Elapsed: 00:00:00.05 SQL>

MySQL now:

mysql>select * -> from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL, -> (select 1 + count(*) -> from EMPLOYEES b -> where b.DEPTNO = a.DEPTNO -> and b.SAL > a.SAL) RANK -> from EMPLOYEES as a) as x -> where x.RANK <= 5 -> order by x.DEPTNO, x.RANK; +--------+-------+------------+-----------+---------+------+ | DEPTNO | EMPNO | LASTNAME | FIRSTNAME | SAL | RANK | +--------+-------+------------+-----------+---------+------+ | 10 | 4751 | WOLFORD | KATHY | 7997.34 | 1 | | 10 | 10517 | HARRIS | GARLAND | 7993.74 | 2 | [snip] | 60 | 4013 | BURMEISTER | HATTIE | 7973.04 | 4 | | 60 | 1472 | LADD | JEFFREY | 7952.5 | 5 | +--------+-------+------------+-----------+---------+------+ 30 rows in set (5 min 29.12 sec)

Exactly the same result (<blush></blush>except for the slightly embarrassing line under the table.) Let's take a closer look at what we are doing. Our subquery that computes the rank is correlated: that means that it refers to the current row of EMPLOYEES, through both the DEPTNO and the SAL columns. We need them to compute how many employees in the same department earn a bigger salary than the current employee. In other words, we must fire the query for each of the 10,000 rows we inspect. Now, when I created my table, I defined EMPNO as the primary key, and created an additional index on LASTNAME, since I expected it to be a frequent search criterion. Neither DEPTNO nor SAL are indexed, in either database. Indexing DEPTNO, given the distribution, is pointless, because it isn't selective enough. Ten thousand times a full scan over a ten thousand row table: one hundred million rows inspected in all. Somewhat more visible than fourteen scans of a fourteen row table...

Let me be upfront about this, I dislike adding indexes. People often underestimate their impact on inserts, deletes, and (sometimes) updates, which can be very high. I don't want to fix a query and break the nightly batch program. There are also cases when I couldn't index. Suppose that instead of having departments, we have stockmarkets. Instead of having employees, we have securities, and instead of ranking salaries, we want to rank the difference between the current price and yesterday's closing price, as a percentage. There is a high update rate, and even if we wanted to, MySQL doesn't allow you to index a computed column. What would we do? Never mind, this isn't a real life assignment, just an ONLamp article. Let's just try to save face and index the two columns that we need to correlate the main query and the subquery.

mysql> create index tempidx on EMPLOYEES(DEPTNO, SAL); Query OK, 10000 rows affected (0.14 sec) Records: 10000 Duplicates: 0 Warnings: 0

When we run the query again, here is the bottom line:

30 rows in set (26.48 sec)

We have avoided having to commit hara-kiri (just) but compared to Oracle, it's still far from glorious. Shall we have to kowtow to the mighty Oracle? Let's not throw in the towel yet.

`Getting the Top Values Efficiently, Take One`

When we run into a performance problem, there is no such thing as RTFMing. We are trying to get with MySQL the same results as SQL extensions that Oracle has, and MySQL hasn't. Instead of using an otherwise squeaky clean SQL construct such as the correlated subquery in the select list, couldn't we find extensions that MySQL has and Oracle hasn't to help us? Perhaps it would be wise to restate the problem. We want to compare the employee in the current row to the employees in the same department. How would we do it manually? I don't know about you, but I don't think that I would compute for each employee in turn how many people make more in the same department. I'd rather establish a full list of employees in the same department, and sort them by salary (which is probably what Oracle does behind the scene); and I'd reuse the same list for each employee in the department instead of recomputing the list each time. What we need is a sorted list.

There is precisely a function that builds a list, GROUP_CONCAT, and interestingly it is a function that Oracle natively lacks (even if it is possible to create such a function as a user-written function). The MySQL documentation provides the full syntax as:

GROUP_CONCAT([DISTINCT] expr [,expr ...] [ORDER BY {unsigned_integer | col_name | expr} [ASC | DESC] [,col_name ...]] [SEPARATOR str_val])

Have you noticed the order by passed as a parameter? Doesn't it ring a bell? There is of course a catch: the resulting string has a limited length (although adjustable through the group_concat_max_len variable), 1024 characters by default. I shall return to this topic in a moment, but since for the time being I am only interested by the first five items in the list (since I want the people who have the top five salaries) the default maximum length is much more than I need.

I now have found how to construct my sorted list. But how shall I determine the position of the employee described by the current row inside this list? We have another function (which once again Oracle lacks) that fits our requirements, FIND_IN_SET(str,strlist). You should take note that in this function, the separator used in strlist is required to be a comma, which means that we must use a comma (the default) as the separator in our call to GROUP_CONCAT(). So, if whatever we need to store in our list may contain a comma, we must make use of REPLACE() to substitute this comma for an innocuous character whenever we add an item to the list, or search an item in the list.

Properly equipped with these two functions, I can now write and run my query (after having dropped the index on DEPTNO and SAL to be in the same context as Oracle):

mysql> select * -> from (select e.DEPTNO, -> e.EMPNO, -> e.LASTNAME, -> e.FIRSTNAME, -> e.SAL, -> find_in_set(e.SAL, x.SALLIST) RANK -> from EMPLOYEES as e, -> (select DEPTNO, group_concat(SAL -> order by SAL desc) SALLIST -> from EMPLOYEES -> group by DEPTNO) as x -> where e.DEPTNO = x.DEPTNO) as z -> where RANK <= 5 -> and RANK > 0 -> order by DEPTNO, RANK;

A word about the list we build: we haved simply concatenated the salaries by decreasing order, letting MySQL implicitly convert float values to character strings. Since we have about 1,600 employees per department, the default length of 1024 characters is grossly insufficient to build a complete list. However, as I have said previously, we don't need the complete list, we just need the first five values. Nevertheless, we must anticipate that salaries who are not in the top 120 or whereabout will fail to be stored, and therefore will not be found by the FIND_IN_SET() function (which will return 0.) Hence the necessity for an additional condition that RANK must be strictly positive. We get the same result as usual:

+--------+-------+------------+-----------+---------+------+ | DEPTNO | EMPNO | LASTNAME | FIRSTNAME | SAL | RANK | +--------+-------+------------+-----------+---------+------+ | 10 | 4751 | WOLFORD | KATHY | 7997.34 | 1 | | 10 | 10517 | HARRIS | GARLAND | 7993.74 | 2 | [snip] | 60 | 4013 | BURMEISTER | HATTIE | 7973.04 | 4 | | 60 | 1472 | LADD | JEFFREY | 7952.5 | 5 | +--------+-------+------------+-----------+---------+------+ 30 rows in set, 1 warning (0.30 sec)

Same result, except that this time the performance, without being quite as excellent as Oracle's, is very decent. We get a warning, in punishment of our sin of trying to build lists that are more than 1K in length, but otherwise we get precisely what we want, with an SQL usage that is both relatively elegant and efficient. We have in the GROUP_CONCAT()/FIND_IN_SET() combination a kind of MySQL crypto-analytic function that enables us to answer almost all the "what are the top (or bottom) n" questions, as long as n isn't too large, and can be used to emulate several Oracle analytic functions. For instance, Oracle can handle both rank() and dense_rank(); the former focuses on what is ranked, and the latter on how it is ranked. Let's suppose that Tom, Dick, and Harry have obtained the following scores at a game:

Tom

825

Dick

825

Harry

750

Harry's rank() is 3, because rank() focuses on what is ranked, the people, and there are two people who have a better score than Harry. Both Tom and Dick have rank 1. However, dense_rank() focuses on how people are ranked, which is to say scores. And here, Harry's dense_rank() is 2, since there is only one score value higher than 750; "dense" refers to the fact that there is no gap in rank numbers. If you want to emulate dense_rank(), all you need to do is to add a distinct to the expression in group_concat(). And if you want to emulate row_number(), that assigns a unique number within the group defined by the partition clause, all you need to do is build a list using unique identifiers. In our example, keeping in mind that we accept that we don't need more values than what the longest string returned by group_concat() is able to hold, we can emulate:

row_number() over (partition by deptno order by hiredate)

by using

FIND_IN_SET(EMPNO, EMPNOLIST) with EMPNOLIST built as GROUP_CONCAT(EMPNO ORDER BY HIREDATE) as EMPNOLIST

associated with the usual group by DEPTNO. A word of caution though, as with in our example with SUM(), any additional restrictive condition that Oracle would need only once will have to be added to the group by query as well with MySQL.

In my next article, we'll look at another way to do this same analytical query, as well as how to implement the LAG function and how to foresee values in queries.

Stéphane Faroult first discovered the relational model and the SQL language in 1983.

Return to ONLamp.com.

Can't wait for part two? Post your analysis here.
You must be logged in to the O'Reilly Network to post a talkback.

Showing messages 1 through 8 of 8.

Flow sum()? have you any ideas? 
2007-04-02 03:54:44  LionSoftware [Reply | View]

The idea to repeat analytic SUM() OVER (PARTITION) is GREAT! But I have no ideas how to repeat SUM() OVER (ORDER BY) on MySQL. Have you any ideas?

Because I trying to repeat this:

 

 Table transactions:
 id amount
 1 100
 2 200
 3 -50
 4 -10

 select id, amount, sum(amount) over (order by id) flowsum from mytable;
 id amount flowsum
 1 100 100
 2 200 300
 3 -50 250
 4 -10 240

Flow sum()? have you any ideas? 
2007-04-02 04:17:01  sfaroult [Reply | View]

There is a part 2 coming :-). It should give you some ideas.

Very good article 
2007-04-01 06:27:03  gmaxia [Reply | View]

Hi.
 Thanks for this very insightful article.
 You said that you were using MySQL 5.0. Could you be more specific about the exact version?
 If there is a public place where we can download the sample data used for your test, it would be greatly appreciated.

 Best regards

 Giuseppe Maxia

MySQL version 
2007-04-02 00:24:19  sfaroult [Reply | View]

If you want all the gory details the tests were performed with MySQL 5.0.27-standard and Oracle 10.2.0.1.0. Both were running on a double Xeon 2.4 GHz test machine under what initially was a Mandriva 2007.0 that is in a perpetual state of flux and that Oracle anyway believes to be a redhat-4.
 Sample data available on request :-).

MySQL version 
2007-04-04 20:10:40  jaypipes [Reply | View]

Hi there! Great article, Stephane! Reminds me of the examples I put into Chapter 8 of "Pro MySQL"...

 Couple things I wanted to note, however.

 First, the number one reason you see such a performance difference with adding an index *isn't* actually the (correct) point you make about the subquery being fired once for each subset row. The reason is because you are using the InnoDB storage engine, and InnoDB has a known issue with doing a SELECT COUNT(*) when there is no WHERE condition on an indexed field. InnoDB is forced to do a full table scan, as you point out, but for a different reason, each time a SELECT COUNT(*) is used in this manner. See http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html for more info on that.

 Second, is that you can get *much* better performance from your MySQL SELECTs by avoiding correlated subqueries and rewriting as a derived table... It's a bug in the MySQL optimizer unfortunately. Even better performance wou