Cheese in January and the limits of SQL optimization

Prepping for my Intro to SQL tuning talk on Monday I did a little further research on one of my examples.  It is a concept I’ve known about for a while and I feel like it is profound evidence for the limits of SQL optimization.

The sample query just wants a list of sales and products where the sales take place in January 2012 and the product is a type of cheese.

select 
sale_date, product_name, amount
from sales, products
where
sales.product_number=products.product_number and
sale_date between 
  to_date('01/01/2012','MM/DD/YYYY') and 
  to_date('01/31/2012','MM/DD/YYYY') and
product_type = 'Cheese';

The Oracle optimizer has no way of knowing how many sales rows are both in January and cheese.  None of the optimizer statistics relate data in one table that is joined to another.  I made some sample data so that half of the sales rows are in January and half are cheese but none are both.

create table sales
(
sale_date date,
product_number number,
customer_number number,
amount number
);

create table products
(
product_number number,
product_type varchar2(12),
product_name varchar2(12)
);

insert into sales values(to_date('01/02/2012','MM/DD/YYYY'),1,1,10);
insert into sales values(to_date('01/03/2012','MM/DD/YYYY'),1,1,20);
insert into sales values(to_date('02/04/2012','MM/DD/YYYY'),2,2,30);
insert into sales values(to_date('02/05/2012','MM/DD/YYYY'),2,2,40);

insert into products values (1,'Meat','Ground Beef');
insert into products values (2,'Cheese','Chedder');

Imagine that your company previously sold only meats and in February 2012 you just started selling cheese as well.  You know it, but the optimizer can’t.  Here is the plan showing that the optimizer thinks that half of the four sales rows meet the criteria:

-----------------------------------------------
| Id  | Operation          | Name     | Rows  |
-----------------------------------------------
|   0 | SELECT STATEMENT   |          |     2 |
|*  1 |  HASH JOIN         |          |     2 |
|*  2 |   TABLE ACCESS FULL| PRODUCTS |     1 |
|*  3 |   TABLE ACCESS FULL| SALES    |     3 |
-----------------------------------------------

This was run with histograms on all the columns, 100% sample size, and a multi-column histogram on sales_date and product number so the optimizer has about as much information as I can give it.  The example script for this and its log is in this zip.

This is a simple example, but my intuition tells me it has profound implications on how limited SQL optimization truly is.  Expand this example to hundreds of tables and columns with many complex real world relationships between the data.  What type of optimizer can quickly optimize an arbitrary query with all these complex interrelations?  I still have a lot to learn about all the things you can do to help the optimizer come up with the best plans without manually overriding it with hints or rewriting the code, but I think that ultimately manual intervention is required in many cases because of the fundamental limitations of SQL optimizer, not only in Oracle’s software, but in any possible SQL database implementation.  Of course there are many new features with each release of Oracle’s flagship RDBMS software and future research will probably come up with all kinds of clever ways to improve SQL optimization.  But my gut tells me that no optimizer that runs fast can ever be built that can always make the right decision when complex relationships exist between data in multiple tables.

– Bobby

 

About Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I have worked in the Information Technology field since 1989. I have a passion for Oracle database performance tuning because I enjoy challenging technical problems that require an understanding of computer science. I enjoy communicating with people about my work.
This entry was posted in Uncategorized. Bookmark the permalink.

4 Responses to Cheese in January and the limits of SQL optimization

  1. Yasser says:

    Extended stats OR Dynamic Sampling should be able to help Optimizer here.

    -Yasser

  2. Bobby says:

    Dom and Yasser,

    Thanks for your comments. I think that extended statistics are intended to handle cases like this where there are relationships between the data in two or more columns in the same table, but I don’t believe it can handle relationships across two or more tables as in my example.

    Still, in some future version of Oracle they may add a histogram that crosses tables but I think the bigger picture is that there must in any version of Oracle be cases where the optimizer doesn’t have the information it needs to create an efficient execution plan.

    – Bobby

  3. Pingback: Log Buffer #285, A Carnival of the Vanities for DBAs

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.