On December 26th, which is a holiday for my company this year, our team got a page that a plan had changed on an important 126.96.36.199 HP Unix platform Oracle database. The new plan was inefficient but did not cause any harm to the application. I added the SQL_ID value for the query to our plan change monitor exception list so that it would not page us again when it changes to slower plans because the slower plans were not a threat to the system. Normally I would just move on but this time I thought I would dig into why the query changed plan and see if there is anything we could do to prevent similar changes. If the new plan was a problem, I typically would force the faster plan with a SQL Profile. I was going to do that this morning, but the system was busy, and the profile script was hung up for a few minutes on a library cache lock, so I just backed it out. I do not really need to intervene since it is not hurting the system. We have enough CPU capacity to handle the slow plan. But I was still curious if I could find the underlying cause of the plan change and learn any lessons from it. That is what this post is about.
Like many queries that change plan and set off alerts, this problem query uses bind variables. The PL/SQL package that includes the queries passes in values in the where clause instead of using constants. I.e., it looks like this:
WHERE div = p_div AND cust_nbr = p_cust_nbr
and not this:
WHERE div = 123 AND cust_nbr = 456
You can see here where the query changed to the bad plan:
PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms --------------- --------------------- ---------------- ------------------ 504369030 25-DEC-22 04.00.45 AM 838 .478386635 504369030 25-DEC-22 07.00.40 PM 1599 .669989368 504369030 25-DEC-22 11.00.15 PM 1044 .595122605 504369030 26-DEC-22 01.00.16 AM 891 .558159371 504369030 26-DEC-22 02.00.43 AM 473 .453122622 2693825138 26-DEC-22 02.00.43 AM 311 3.61250804 2693825138 26-DEC-22 03.00.07 AM 779 2.91877792 2693825138 26-DEC-22 04.00.10 AM 1076 7.274671 2693825138 26-DEC-22 05.00.41 AM 1218 11.1912258
The bad plan is averaging 11 milliseconds instead of less than 1 millisecond so “bad” is relative. It still is not slow enough to affect the users.
I used my plan script to run the problem query with different constants in place of the two variables. I used my optimizer statistics scripts to find what the optimizer thought the range of possible values was for the two columns. Here are the ranges:
COLUMN_NAME LO HI ----------- ---- -------- DIV 228 5220 CUST_NBR 1675 74394502
I tried a variety of variable value combinations in and out of this range and most of the values outside these ranges resulted in the bad plan having a lower cost than the good one. It was especially sensitive to the values of DIV. It turns out that there are many possible DIV values outside this range. They are just not on this table. So likely this query is often run with variable values that are out of the range of what the optimizer knows is in the table.
I tried to find a way to make the out-of-range values pick the better plan. I loaded this data on a 19c database but got the same results. I tried different kinds of histograms, including one on (DIV, CUST_NBR), but they didn’t help.
Maybe when you write a query that takes variables you should try out its performance with atypical values for the variable values. In simple equals conditions like those in my problem query maybe check how the query runs on values that are outside of those found in the table. Since the query’s plan is not fixed you must be prepared for the query to run with all the plans that the data passed into its variables could generate. Also, the database will stick with any of those plans for a long time unless something forces it to be recalculated. So, you would have to test the query with atypical values and once the plan is in memory run all your typical data through that plan to see if it is still fast enough. If the plan from the atypical data is too slow then you would have to change the query or the data to handle those cases or resort to something like a hint to lock in the plan that runs well with the more typical variable values.
I just want to say how hard it is to write a blog post. What a pain. But the interaction with others on the internet is very valuable.
Someone commented that I could look at v$sql_shared_cursor to find why 1anm65yacs6ky changed plan. As expected, it changed plan because statistics were gathered on the table. Often statistics gathering causes plans to be recreated (hard parsing). But it was interesting that the reason from v$sql_shared_cursor was “Rolling Invalidate Window Exceeded”. A quick Google search led me to this article:
It claims that there is a 5 hour window after statistics are gathered for plans to be reparsed. This makes sense because I know statistics are gathered on the table around 22:00 but the plan is flipping between 01:00 and 02:00 the next day. Very cool.
But it does not really explain why the plan changes. It only explains why 1anm65yacs6ky was hard parsed. The plan changed because out of range data was passed into the bind variables during the hard parse and this resulted in a plan that was inefficient for in range data. On December 26th we got unlucky during the hard parse. This morning, January 9th, we got lucky, and we flipped back to the good plan.
What makes blogging about this stuff so hard is that there is so much I want to say about this, but I just don’t know if I have the energy or ability to put it all down. We use bind variables to minimize parsing. With the good plan this query runs in a couple of milliseconds. If it had to be hard parsed every time it runs, 14,000 times per hour, it would be very wasteful and slow. But the cost of using bind variables is that a plan can get locked in that isn’t efficient for many of the values passed into the variables. It is a no-win scenario. The bigger picture is that I think Oracle’s optimizer, and probably any conceivable SQL optimizer, has limits. Somehow all these years Oracle and others have produced SQL RDBMSs that people find useful. But based on my Oracle experience it seems like they are imperfect but useful.
In the case of my December 26th query that changed plan, our business is functioning just fine with the query’s plan flipping back and forth. If I did not have my plan change monitor setup, I would not even know that the plan was changing because it is not causing an issue that is visible to the users. So, it offends my perfectionist tendencies to have a query run an inefficient plan, but from a business perspective the inefficiency and imperfection that is fundamental to the design of Oracle’s optimizer with the tradeoffs of minimizing hard parses with bind variables but suffering with inefficient plans until the next hard parse is acceptable because the business purpose of the application is still being accomplished. Maybe in the bigger picture Oracle’s SQL and SQL in general have been useful to people all these years because they do well enough to meet people’s needs despite their fundamentally imperfect execution of SQL queries.
I found two ways to prevent the SQL statement from changing plans when it is hard parsed with variable values that are out of the range of the column statistics. One is to add a new index, and the other is to add a cardinality hint.
I had no idea if the index would help but I noticed that the query touched the same table 3 times after looking up rows using the only index. I looked at the subselects on the query and picked a new index that would cover all the columns used by the subselects and be efficient. This eliminated all but the one table lookup that we had to have. For some reason this plan is chosen even if the data in the variables is out of range. Maybe the new index made the plan so efficient that out-of-range data would not cause a plan flip.
Just now I got the idea of trying a cardinality hint like this:
/*+ cardinality(PSD 10) */
I put this on every subselect that had the main table. This caused the plan to act as if the table would return 10 rows and it choose the good plan even if the data in the variables was out of range.
These two approaches work but they require some query tuning knowledge when you are rolling out new SQL statements. I think part of the challenge of using SQL databases like Oracle is that you need developers and DBAs who understand the optimizer well enough to understand the challenges of bind variables and plan changes and how to prevent them or fix them.
See this Oracle support document:
Limitations of the Oracle Cost Based Optimizer (Doc ID 212809.1)
There is also this bug which may prevent us from using Adaptive Cursor Sharing on this application which has all its queries wrapped in PL/SQL:
Bug 8357294 : ADAPTIVE CURSOR SHARING DOESN’T WORK FOR STATIC SQL CURSORS FROM PL/SQL
It looks like none of our SQL is using ACS:
>select 2 IS_BIND_SENSITIVE, 3 IS_BIND_AWARE, 4 count(*) cnt 5 from 6 v$sql 7 group by 8 IS_BIND_SENSITIVE, 9 IS_BIND_AWARE 10 order by cnt desc; I I CNT - - ---------- N N 22970
Maybe our application design suppresses the use of ACS which results in more bad plans being locked in on queries with bind variables. If we bypassed PL/SQL and just ran queries against the database from the Java front end, we might use ACS to prevent a bad plan getting locked in based on certain oddball variable values.
Tired of writing PPS and all that. I am probably running out of steam. I cannot find anything that really applies generally to our system based on this example. I found this weird part of the plan when the variable values are out of range:
------------------------------------------------------------------------ | Id | Operation | Name | Rows | ------------------------------------------------------------------------ | 6 | MERGE JOIN CARTESIAN | | 1 | | 7 | TABLE ACCESS BY INDEX ROWID | PROD_ATLN_DTL | 1 | | 8 | INDEX RANGE SCAN | PROD_ATLN_DTL_0 | 1 | | 9 | BUFFER SORT | | 36 | | 10 | TABLE ACCESS FULL | XYZ_ENT | 36 |
It seems like for Id 6 the Rows value should be 36 instead of 1. I think the optimizer treats 0 rows as 1 row when it displays the plan so maybe it is multiplying 0 * 36 and then displaying 1 instead of 0 for id 6. But other parts of the plan treat 0 as if it was 1 and add up the cost of the steps of the plan. Both the good and bad plan have costs that are way off for the out of range variable values so maybe when you get down to this level the errors inherent in the optimizer make plan comparisons meaningless. But then what strategy can you use with queries that use bind variables and that have out-of-range data values passed into them? You can use hints, etc. but what can you do without intervening for every query like this?
Looks like the cartesian joins were caused by the constants being in the query in two places.
In the innermost subselect it was this:
WHERE DIV = 111 AND CUST_NBR = 222
Then in the final where clause:
WHERE PCD.DIV = 111 AND PCD.CUST_NBR = 222
DIV and CUST_NBR where part of the joins all through the query so these constants were only needed once. Ultimately this query was poorly constructed, so the optimizer had trouble locking down a good plan. Maybe the bigger picture is that the better job we do designing our queries, indexes, etc. the less likely plans are to change for the worse.