Denormalize tables to improve cardinality estimate

I was working on the example from an earlier blog post that showed how the optimizer could not correctly estimate the number of rows accessed on a table if the non-join condition was on a different table.  Number of rows = cardinality.  This example from the blog post mirrors a real situation we have where we have a small table joined to a large one and conditions on the small table resulting in incorrect cardinality estimates on the larger table.  Something like this:

select SUM(B.cost)
from DIVISION A,SALES B
where
a.DIVNUM=B.DIVNUM and
A.REGION=’North East’;

The optimizer has no idea how many sales rows are in the North East region because of the join.  Duh! It dawned on me that we just need to denormalize the two tables so that region is part of the sales table.  Like this:

select SUM(B.cost)
from SALES B
where
B.REGION=’North East’;

This seems simple, but the reasons are – at least to me – pretty cool and sophisticated.

To prove all this out I build a test case with the previous blog post as the first part of the script.  First here is the original schema:

```create table DIVISION (DIVNAME VARCHAR2(2000),DIVNUM NUMBER);
create table SALES (DIVNUM NUMBER);
create index SALESINDEX on SALES(DIVNUM);```

Next, here is the slow query that doesn’t use the index because of the cardinality estimate being way off:

```SQL> select B.DIVNUM
2  from DIVISION A,SALES B
3  where
4  a.DIVNUM=B.DIVNUM and
5  A.DIVNAME='Mayberry';

DIVNUM
----------
2

Elapsed: 00:00:00.11

Plan hash value: 480645376

-------------------------------------------------------
| Id  | Operation                  | Name     | Rows  |
-------------------------------------------------------
|   0 | SELECT STATEMENT           |          |       |
|*  1 |  HASH JOIN                 |          |   500K|
|*  2 |   TABLE ACCESS STORAGE FULL| DIVISION |     1 |
|   3 |   TABLE ACCESS STORAGE FULL| SALES    |  1000K|
-------------------------------------------------------```

Note that it says it will access 1 million sales rows when it really will access 1.  It uses a full table scan on sales because it thinks we will be accessing a large percentage of the rows.

Here is the new denormalized schema with the two tables merged into one:

```create table SALESWITHDIVNAME as
select d.divname,d.divnum
from
SALES s,
DIVISION d
where
s.divnum=d.divnum;

create index SWDNINDEX on SALESWITHDIVNAME(divname);```

And here is the faster query which chooses an index:

```SQL> select DIVNUM
2  from SALESWITHDIVNAME
3  where
4  DIVNAME='Mayberry';

DIVNUM
----------
2

Elapsed: 00:00:00.00

Plan hash value: 447456343

----------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  |
----------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |       |
|   1 |  TABLE ACCESS BY INDEX ROWID| SALESWITHDIVNAME |     1 |
|*  2 |   INDEX RANGE SCAN          | SWDNINDEX        |     1 |
----------------------------------------------------------------```

Notice that the optimizer knows that the query will only access 1 row in the SALESWITHDIVNAME table and so it chooses an index range scan which is more efficient than a full scan when accessing a small percentage of rows from a table.

This may seem simplistic, but the profound fact is that the optimizer can not know which div_num is associated with the DIVNAME=’Mayberry’ condition in the two table query.  It uses the column statistics on DIVISION.DIVNAME to figure out that only one DIVISION row meets the condition.  But, since it doesn’t know that DIVNAME=’Mayberry’ translates into DIVNUM=2 it can’t use the histogram on SALES.DIVNUM to figure out that only one row will be retrieved from SALES.  But, if you merge the two tables together then the histogram on SALESWITHDIVNAME.DIVNAME tells the optimizer it only will access one row and so it should use the index instead of a full scan.

So, this is getting me thinking about the benefits of denormalizing the tables.  Also, maybe if you smash your small tables into your large one you can use multi-column histograms on the various columns you added as well.  In our case we have small division and date dimension tables and a large sales fact table.  What if the query was something like this:

```select SUM(cost)
from SALES
where
REGION='North East' and
Quarter='2013Q1';```

You might have a multi-column histogram on region and quarter.  You couldn’t have this if the query was normalized like this:

```select SUM(cost)
from SALES S,
TIME T,
DIVISION D
where
D.REGION='North East' and
T.Quarter='2013Q1'and
D.DIVNUM=S.DIVNUM and
T.sdate=s.sdate;```

You can’t have a histogram across the three tables but you can across the columns of the one denormalized table.

– Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I have worked in the Information Technology field since 1989. I have a passion for Oracle database performance tuning because I enjoy challenging technical problems that require an understanding of computer science. I enjoy communicating with people about my work.
This entry was posted in Uncategorized. Bookmark the permalink.

10 Responses to Denormalize tables to improve cardinality estimate

1. Andrew says:

Would using the SQL Performance Analyser create a better sql profile that would give you the same result ?

• Bobby says:

I don’t think it does, but I don’t know why. This is on 11.2.0.2 by the way. Haven’t tried it on 12c. I’d have to run it through again to be sure. But you could build a profile using an outline that forces the index range scan. But the point is that the optimizer can’t get the cardinality between the two tables.

Look at this url:

This was my original test cases for a talk I did. The fourth test is the same type multi-table situation and a sql profile wasn’t generated in this case, but it was in my single table example, test3.

2. Narendra says:

Bobby,

Can’t this be solved with Extended Statistics feature (of 11g) as well?

• Bobby says:

No, I tried the extended statistics in my test case. It only works on one table, not across tables.

• Narendra says:

Bobby,

In your test case, won’t the extended statistics on DIVNAME and DIVNUM columns of DIVISION table be able to define the correlation between those 2 columns and affect the CBO?
I am sure you must have tried it but your statement “but the profound fact is that the optimizer can not know which div_num is associated with the DIVNAME=’Mayberry’ condition” made me wonder if extended statistics would help.

• Bobby says:

Narendra,

It is a valid question and I did try extended statistics on divname and divnum. I think that the extended statistics on those two columns only helps if you have conditions on both of those columns on the same table. I.e. divname=’Mayberry” and divnum =2. But, when the optimizer looks at the conditions on the sales table all it sees is the join back to the division table on the divnum column. So, even with a histogram on sales.divnum it can’t use it because it doesn’t know that 2 is the divnum that will be used. It seems like it could use the extended statistics to figure this out since it knows that (Mayberry, 2) only occurs once on the division table. But, it doesn’t. I think that the fundamental issue is how complicated do you want to make the optimizer and how much time do you want to spend parsing the query versus actually running it. There is a tradeoff between parse time and quality of the execution plans and I think that is really the limitation of SQL optimization. I.e. A simple and perfect method of parsing a query would be to just run the query with every possible plan and then pick the plan that ran fastest. But, the parse time will take thousands of years on a complex query with large tables. Maybe some sort of time warp would allow this to work efficiently. 🙂

– Bobby

• Anonymous says:

Bobby,

3. rajiviyer says:

Hi Bobby.
Good point. But usually in production systems, table design change is not at all an option. Changing the sql (which would be a cheaper option) using precompute_subquery hint may help optimizer get a good cardinality estimate:

select SUM(B.cost)
from SALES B
where
B.DIVNUM = (select /*+ precompute_subquery */ a.division_num from DIVISION A where A.REGION=’North East’);