…continued from Part 1 and Part 2.
OK, let’s return to original query before with no index in place and lets review the optimizer’s estimate of how many rows will be returned, and try to deduce where it came from. Note throughout this article, I will be referring to sections in Jonathan Lewis‘s essential book, Cost-Based Oracle Fundamentals (CBOF).
SQL> CREATE TABLE t (c NUMBER(2, 0) NOT NULL,
2* p VARCHAR2(4000));
Table T created.
SQL> INSERT /*+append*/ INTO t
2 SELECT
3 mod(ROWNUM, 100) c,
4 rpad('X', 4000) padding
5 FROM
6 dual
7 CONNECT BY
8* ROWNUM <= 100000;
100 000 rows inserted.
SQL> exec dbms_stats.gather_table_stats(null, 't', method_opt=>'for all columns size 1');
PL/SQL procedure successfully completed.
SQL> SELECT /*+gather_plan_statistics*/ DISTINCT trim(p) p FROM t WHERE
2 (c BETWEEN 0 AND 1) OR (c BETWEEN 2 AND 3) OR (c BETWEEN 4 AND 5) OR (c BETWEEN 6 AND 7) OR (c BETWEEN 8 AND 9)
3 OR (c BETWEEN 10 AND 11) OR (c BETWEEN 12 AND 13) OR (c BETWEEN 14 AND 15) OR (c BETWEEN 16 AND 17) OR (c BETWEEN 18 AND 19)
4 OR (c BETWEEN 20 AND 21) OR (c BETWEEN 22 AND 23) OR (c BETWEEN 24 AND 25) OR (c BETWEEN 26 AND 27) OR (c BETWEEN 28 AND 29)
5 OR (c BETWEEN 30 AND 31) OR (c BETWEEN 32 AND 33) OR (c BETWEEN 34 AND 35) OR (c BETWEEN 36 AND 37) OR (c BETWEEN 38 AND 39)
6 OR (c BETWEEN 40 AND 41) OR (c BETWEEN 42 AND 43) OR (c BETWEEN 44 AND 45) OR (c BETWEEN 46 AND 47) OR (c BETWEEN 48 AND 49)
7 OR (c BETWEEN 50 AND 51) OR (c BETWEEN 52 AND 53) OR (c BETWEEN 54 AND 55) OR (c BETWEEN 56 AND 57) OR (c BETWEEN 58 AND 59)
8 OR (c BETWEEN 60 AND 61) OR (c BETWEEN 62 AND 63) OR (c BETWEEN 64 AND 65) OR (c BETWEEN 66 AND 67) OR (c BETWEEN 68 AND 69)
9 OR (c BETWEEN 70 AND 71) OR (c BETWEEN 72 AND 73) OR (c BETWEEN 74 AND 75) OR (c BETWEEN 76 AND 77) OR (c BETWEEN 78 AND 79)
10 OR (c BETWEEN 80 AND 81) OR (c BETWEEN 82 AND 83) OR (c BETWEEN 84 AND 85) OR (c BETWEEN 86 AND 87) OR (c BETWEEN 88 AND 89)
11* OR (c BETWEEN 90 AND 91) OR (c BETWEEN 92 AND 93) OR (c BETWEEN 94 AND 95) OR (c BETWEEN 96 AND 97) OR (c BETWEEN 98 AND 99);
P
____
X
SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -bytes -cost -predicate');
...
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.18 | 100K| 100K|
| 1 | HASH UNIQUE | | 1 | 1 | 1 |00:00:01.18 | 100K| 100K|
| 2 | TABLE ACCESS FULL| T | 1 | 77857 | 100K|00:00:01.35 | 100K| 100K|
----------------------------------------------------------------------------------------------
I’m going to refer to the individual predicates as Pnn, so P1 is “c BETWEEN 0 AND 1”, P2 is “C BETWEEN 2 AND 3” and so on up to P50 which is “c BETWEEN 98 AND 99”. My aim is to show why the optimizer estimated 77,857 rows for (P1 OR P2 OR … OR P50).
Let’s first check what information the database has about the table T and column C from database statistics.
SELECT num_rows FROM user_tables WHERE table_name = 'T';
NUM_ROWS
___________
100000
SQL> SELECT histogram,
2 num_nulls,
3 num_distinct,
4 utl_raw.cast_to_number(low_value) low_value,
5 utl_raw.cast_to_number(high_value) high_value
6 FROM
7 user_tab_cols
8 WHERE
9 table_name = 'T'
10* AND column_name = 'C';
HISTOGRAM NUM_NULLS NUM_DISTINCT LOW_VALUE HIGH_VALUE
____________ ____________ _______________ ____________ _____________
NONE 0 100 0 99
I’ll be referring to value from NUM_ROWS, NUM_DISTINCT, LOW_VALUE and HIGH_VALUE throughout this article, in addition I will be rounding some numbers as we go along as the optimizer does.
We’re going to be applying a standard procedure of divide and conquer for these calculations, so let’s start with P1.
SQL> SELECT /*+gather_plan_statistics*/ DISTINCT trim(p) p FROM t WHERE (c BETWEEN 0 AND 1);
P
____
X
SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -bytes -cost');
PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________
SQL_ID 9cnx92h4yupdc, child number 0
-------------------------------------
SELECT /*+gather_plan_statistics*/ DISTINCT trim(p) p FROM t WHERE (c
...
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.64 | 100K| 100K|
| 1 | HASH UNIQUE | | 1 | 1 | 1 |00:00:00.64 | 100K| 100K|
|* 2 | TABLE ACCESS FULL| T | 1 | 2010 | 2000 |00:00:00.16 | 100K| 100K|
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(("C"<=1 AND "C">=0))
As you can see from the Predicate Information section, Oracle has split P1 into two predicates “C >= 0” and “C <= 1”. And as the optimizer knows that C >= LOW_VALUE should match all rows, it effectively ignores this and uses “C <= 1”.
CBOR (p53 case 4) gives us the formula for selectivity, and cardinality into which we can insert the values from the statistics.
Sel(P1) = Sel("C <= 1")
(1 - LOW_VALUE)/(HIGH_VALUE-LOW_VALUE)+1/NUM_DISTINCT
=(1-0)/(99-0)+1/100
=0.020101
Card(P1)=Sel(P1)*NUM_ROWS
=0.020101*100000
=2010
Let's move to P2.
SQL> SELECT /*+gather_plan_statistics*/ DISTINCT trim(p) p FROM t WHERE (c between 2 and 3);
P
____
X
SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -bytes -cost -predicate');
PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________
...
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.66 | 100K| 100K|
| 1 | HASH UNIQUE | | 1 | 1 | 1 |00:00:00.66 | 100K| 100K|
|* 2 | TABLE ACCESS FULL| T | 1 | 3010 | 2000 |00:00:01.17 | 100K| 100K|
----------------------------------------------------------------------------------------------
This is CBOR p53 case 6, so selectivity and cardinality are calculated as follows
Sel(P2) = (3-2)/(HIGH_VALUE-LOW_VALUE)+2/NUM_DISTINCT =1/99+2/100 =0.030101 Card(P2) =Sel(P2)*NUM_ROWS =0.030101*100000 =3010
So next we need Card(P1 OR P2). You might think that the optimizer would simply add the estimates of 2,010 rows from the P1 and 3,010 rows from the P2 but we can see that it doesn’t.
SQL> SELECT /*+gather_plan_statistics*/ DISTINCT trim(p) p FROM t WHERE (c BETWEEN 0 AND 1) OR (c BETWEEN 2 AND 3); P ____ X SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -bytes -cost -predicate'); ... ---------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | ---------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.68 | 100K| 100K| | 1 | HASH UNIQUE | | 1 | 1 | 1 |00:00:00.68 | 100K| 100K| |* 2 | TABLE ACCESS FULL| T | 1 | 4960 | 3000 |00:00:00.63 | 100K| 100K| ----------------------------------------------------------------------------------------------
The reason for this is that while it might be obvious to us that any rows matched by P1 will not be matched by P2, the optimizer doesn’t know this so it has to use the following calculations (CBOF p56)
Sel(P1 OR P2) =Sel(P1)+Sel(P2)-Sel(P1)*Sel(P2) =0.030101+0.020101-0.030101*0.020101 =0.049596 Card(P1 OR P2) =Sel(P1 OR P2)* NUM_ROWS =0.049596*100000 =4960
Calculation for P3 is pretty much the same as for P2 and we get SEL(P3)=0.030101
And also following the same pattern in P2 we get
Sel(P1 OR P2 OR P3) =Sel(P1 OR P2)+Sel(P3)-Sel(P1 OR P2)*Sel(P3) =0.049596+0.030101-0.049596*0.030101 =0.078204 Card(P1 OR P2 OR P3)= Sel(P1 OR P2 OR P3)*NUM_ROWS =0.078204*100000 =7821
This matches the estimate that the optimizer shows us for the query
---------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | ---------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.68 | 100K| 100K| | 1 | HASH UNIQUE | | 1 | 1 | 1 |00:00:00.68 | 100K| 100K| | 2 | TABLE ACCESS FULL| T | 1 | 7821 | 6000 |00:00:00.20 | 100K| 100K| ----------------------------------------------------------------------------------------------
This pattern repeats all the way up to P1 OR P2 … P49 (I used a spreadsheet) to give us
SEL(P1 OR ... P49) =0.77403 CARD(P1 OR ... P49) =0.77403*100000 =77403
This matches what the optimizer tells us it estimates for this query
-------------------------------------------- | Id | Operation | Name | E-Rows | -------------------------------------------- | 0 | SELECT STATEMENT | | | | 1 | HASH UNIQUE | | 1 | | 2 | TABLE ACCESS FULL| T | 77403 | --------------------------------------------
Finally we come to P50, and this is the same edge case that P1 was so we get SEL(P50)=0.020101 and the calcuations are as follows
SEL(P1 OR ... OR P49 OR P50) = SEL(P1 OR ... OR P49) + SEL(P50) - SEL(P1 OR ... P49)* SEL(P50) = 0.77403 + 0.020101 - 0.77403 * 0.020101 = 0.778572 CARD(P1 OR ... OR P49 OR P50) =SEL(P1 OR ... OR P49 OR P50)*NUM_ROWS = 0.778572*10000 = 77857
Fortunately this matches the optimizer has said is it’s estimated cardinality for the full query.
I think it’s useful at this point to compare how the estimated number of rows deviates from the actual number of rows matched as the number of predicates increases. This is shown in the chart below, and you can see that with less than 30 predicates the optimizer overestimates the number of rows returned, and that as the we get more than 30 predicates this turns into an under-estimation.

In the next article, we will compare this behaviour with the case where Cost Based Or Expansion occurs.