Optimizer bug fix makes a query run more than 3 times slower

I’m working on an 11.1.0.7 to 11.2.0.4 upgrade and found a handful of queries that run more than 3 times longer on 11.2.0.4 than 11.1.0.7. The data and optimizer statistics are very similar on the two test databases. I’m pretty sure that an optimizer bug fix caused this difference. So, the irony is that a fix to the optimizer that we get with the upgrade to the very stable 11.2.0.4 release is causing a 3x slowdown in the query.

For my testing I’m using the gather_plan_statistics hint and this query to dump out the plan after executing the query:

select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

I used an outline hint to force the 11.2.0.4 plan to run under 11.1.0.7 and then I looked at the estimated and actual row counts to find a discrepancy. I found one table with estimated row counts that did not look correct on 11.1.0.7 but made sense on 11.2.0.4.

11.1.0.7

---------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name                      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem |  O/1/M   |
---------------------------------------------------------------------------------------------------------------------------------------------------------
|* 29 |           TABLE ACCESS BY INDEX ROWID  | MY_TABLE                  |     17 |      1 |      0 |00:00:07.34 |   96306 |       |       |          |
|* 30 |            INDEX RANGE SCAN            | MY_TABLE_PK               |     17 |     16 |    102 |00:00:01.20 |   96255 |       |       |          |

11.2.0.4

---------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name                      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem |  O/1/M   |
---------------------------------------------------------------------------------------------------------------------------------------------------------
|* 29 |           TABLE ACCESS BY INDEX ROWID  | MY_TABLE                  |     17 |      8 |      0 |00:00:07.44 |   96306 |       |       |          |
|* 30 |            INDEX RANGE SCAN            | MY_TABLE_PK               |     17 |     17 |    102 |00:00:01.22 |   96255 |       |       |          |

After the range scan in step 30 of the 11.1.0.7 plan you have an estimate of 16 rows, but the table access in step 29 has an estimate of only 1. In 11.2.0.4 the estimate for step 29 is 8 rows. Given the optimizer statistics, the optimizer should have estimated 8 rows for step 29 in 11.1.0.7 as well. It appears that 11.2.0.4 includes a fix for a bug of this kind.

Here are the predicates for step 29 in the plan:

29 - filter(("PCP"."MY_FLAG"='M' OR "PCP"."MY_FLAG"='Y'))

So the column could have value M or Y. The optimizer statistics show 4 distinct values for the column and no histogram, so the optimizer should assume that 1/4 of the rows meet each criterion and that 1/4 + 1/4 = 1/2 of the rows from step 30 pass the filter in step 29. So, 17/2 = 8, rounding down. But in 11.1.0.7 it seems that the optimizer multiplied the rows from step 30 by 1/4 twice, giving 16*1/4*1/4 = 1, instead of adding the two selectivities and then multiplying. There is a known bug related to OR conditions in where clauses:
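The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the reasoning in this post, not Oracle's actual costing code: the function names are made up, the model assumes no histogram and that every distinct value is equally likely, and the rounding down simply follows the 17/2 = 8 calculation above.

```python
import math

def or_selectivity_same_column(num_distinct, num_values):
    """Selectivity of col = v1 OR col = v2 ... on a single column.

    Assumes no histogram, so each distinct value gets 1/num_distinct.
    The values are mutually exclusive, so the selectivities add.
    """
    return min(1.0, num_values / num_distinct)

def multiplied_selectivity(num_distinct, num_values):
    """What the 11.1.0.7 estimate looks like: 1/num_distinct applied once per value."""
    return (1 / num_distinct) ** num_values

# Step 30 estimates: 17 rows on 11.2.0.4, 16 rows on 11.1.0.7.
# MY_FLAG has 4 distinct values and the filter names 2 of them.
rows_11204 = math.floor(17 * or_selectivity_same_column(4, 2))
rows_11107 = math.floor(16 * multiplied_selectivity(4, 2))

print(rows_11204, rows_11107)
```

The two results line up with the E-Rows values shown for step 29 in the two plans, which is what makes the "multiplied instead of added" explanation plausible, though it does not prove that is what the optimizer did.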

Bug 10623119 – wrong cardinality with ORs and columns with mostly nulls (Doc ID 10623119.8)

Our 11.2.0.4 database includes this bug fix but I don’t know if this fix caused the difference in behavior that I saw. It seems possible that it did.

The interesting thing is that the real row count for step 29 is 0. So, the pre-bug fix plan in 11.1.0.7 actually estimated the row count more accurately by accident. It estimated 1 and the real count was 0. The correct estimate should have been 8, but that is not as close to 0 as 1 is. I think we just happened to have a few queries where the bug resulted in a more accurate estimate than a properly functioning optimizer. But, I’m only looking at the queries whose performance is worse after the upgrade. There may have been other queries that performed better because of this bug fix.

I ended up passing this and a similar query back to a senior SQL developer and he took one look at the query and described it as “ugly”. He fixed both of them in no time so that both queries now run just as fast or faster on 11.2.0.4 than they did on 11.1.0.7.

So, the original query ran faster when the optimizer was not working properly. A human developer simplified the query and then it ran faster when the optimizer was working properly. Maybe the moral of the story is to build simpler and cleaner SQL queries to begin with and if you find a query whose performance declines with better optimizer information then consider improving the query so that it works well with the better functioning optimizer.

Bobby

Update: I messed around with cardinality hints some more. The problem with blog posts is that once I’ve written one I start second guessing myself. Did I cover all of my bases? I finally found a cardinality hint on the problem table that forced the plan back to the 11.1.0.7 plan. But the weird thing is that I had to hint that the number of rows on the table was larger than reality. My hint was like this: cardinality(pcp,1000000). I expected that a smaller cardinality would change the plan!

The good thing about this test is that it brought me back to why I focused on this table in the first place. Most of the time in the query execution centered around this one table and its index range scan. Now that I know that changing the cardinality hint on this table switches the plan back, I feel good about the idea that this table's row count has something to do with the plan change in 11.2.0.4. But I'm not sure how to tie this change back to a specific bug fix.

Yet another update:

I ran the query on 11.2.0.4 with the parameter optimizer_features_enable='11.1.0.7' and the plan reverted to the 11.1.0.7 plan. At least I know that the slight differences in statistics between the two databases aren't causing this issue. It is something about 11.2.0.4. I also forced the 11.1.0.7 plan under 11.2.0.4 and its cost was less than that of the plan 11.2.0.4 was choosing, which suggests that the 11.2.0.4 optimizer is not even trying the better plan. If it had considered it, it would have chosen the lower cost plan. I tried setting optimizer_max_permutations=1000000000000 to get it to try more plans but it still didn't choose the lower cost 11.1.0.7 plan. So, I guess that my observations about the difference in the row counts on the table do not explain the slower plan. I am not sure how to diagnose why 11.2.0.4 is not trying the lower cost plan. It could be that this is a bug that 11.2.0.4 introduced.

Still I guess all of this research reinforces the main point. I’m tearing my hair out trying to apply all of this arcane Oracle SQL tuning information to understand why the queries ran slower on 11.2.0.4. But, a good SQL developer rewrote the queries in minutes so maybe I do not need to spend more time on it?

I guess if we run into a query that changes to a worse plan in the 11.2.0.4 upgrade we can always try setting optimizer_features_enable to '11.1.0.7' in a hint like this:

/*+ OPTIMIZER_FEATURES_ENABLE('11.1.0.7') */

About Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I've been working as an Oracle database administrator and PeopleSoft administrator since 1994. I'm very interested in Oracle performance tuning.

3 Responses to Optimizer bug fix makes a query run more than 3 times slower

  1. Nice one.

    There’s a fix control (v$system_fix_control) for that bug number from 11.2.0.3 onwards, so if you find you need some time to work around the problem or rewrite the SQL you could try setting the fix_control to zero to disable it, e.g.:

    alter session set "_fix_control"='10623119:0';

    or use the hint:

    /*+ OPT_PARAM('_fix_control' '10623119:0') */

    (I haven’t tested to see if either of those work, though – I’d have to model a suitable data set first.)

    Regards
    Jonathan Lewis

    • Bobby says:

      Thanks Jonathan. I tried it and got the same results. I guess it was not that bug fix that caused the change in behavior.

      • Bobby,

        Thanks for the feedback. It's possible that zero was the wrong value for the "level", but I couldn't try it out. The original bug note also says something about bitmap index effects (which may be a coincidence or an effect, rather than a cause), so it's possible that there's something particular about your case that doesn't quite match the bug.

        I may try building a model over the next couple of days.

        Regards
        Jonathan Lewis
