I want to document a bug I ran across in Oracle 22.214.171.124 on HP-UX Itanium and how it caused unexpected plan changes that would bring the database to its knees with many long running queries running the bad plan. I found that the problem table had a local index whose partition names did not match the partition names of the table and that recreating the index with partition names that matched the table resolved the issue.
We have a busy and important Oracle database that had queries that ran fine for weeks and months and then suddenly changed to a bad plan. They all had a similar pattern. The good plan did a simple range scan on a certain partition’s index. The bad plan did a fast full index scan.
PARTITION RANGE SINGLE | INDEX RANGE SCAN | PROBLEM_INDEX
PARTITION RANGE SINGLE | INDEX FAST FULL SCAN | PROBLEM_INDEX
The query accessed the table using the first two columns of PROBLEM_INDEX and the first column of the index is the partitioning column for the range partitioned table. So, call the first two columns of the index PARTITIONING_C and OTHER_C then the query was like:
select ... from problem_table where PARTITIONING_C = :bindvar1 and OTHER_C = :bindvar2 ...
I have known for a long time that some choice of bind variable values was causing the plan to flip to the fast full scan on the index partition. But I did not know that it had to do with the names of the index’s partitions. The table and index creation scripts must have looked something like this:
create problem_table ( PARTITIONING_C NUMBER, OTHER_C NUMBER, ... ) PARTITION BY RANGE (PARTITIONING_C) ( PARTITION P1000 VALUES LESS THAN (1000), PARTITION P2000 VALUES LESS THAN (2000), PARTITION P3000 VALUES LESS THAN (3000), PARTITION P4000 VALUES LESS THAN (4000), PARTITION P5000 VALUES LESS THAN (5000)); CREATE UNIQUE INDEX problem_index ON problem_table (PARTITIONING_C, OTHER_C , ...) LOCAL ( PARTITION P2000, PARTITION P3000, PARTITION P4000, PARTITION P5000, PARTITION P6000);
The index has the same number of partitions as the table and a lot of them have the same names, but they do not line up. I found this bug in the version of Oracle that we are on:
Bug 14013094 – DBMS_STATS places statistics in the wrong index partition
I am not sure that I am hitting that bug, but I am hitting some similar bug because 14013094 relates to index partitions names that do not match table partition names. For one partition of the problem index the statistics were set to 0 but its corresponding table partition had millions of rows. It would be as if partition P3000 on problem_table had 20,000,000 rows in stats and corresponding partition P4000 on problem_index had 0 rows. If I gathered index partition statistics on P4000 it correctly set stats to 20,000,000. If I gathered table partition statistics on P3000 it cleared the index partition stats on P4000 setting them to 0! (!!) Yikes! How weird is that? Seems obvious to me it is a bug, but maybe not exactly 14013094. I tried dropping and recreating the index leaving the partition names as they are, but it did not resolve the issue. Then I just created the index letting it default to matching partition names like this:
CREATE UNIQUE INDEX problem_index ON problem_table (PARTITIONING_C, OTHER_C , ...) LOCAL;
I’m not sure how the partition names got mismatched but it is a simple fix. It took me a while staring at the partition statistics to realize what was going on and then it took a while to prove out the fix. We do not yet have this in production, but I believe we have nailed down the cause of the plan changes. In the past I have been using SQL Profiles to lock in the plans of any new query that uses the problem table, but I plan to run without them after putting in the index. I kick myself for not researching this earlier, but it was not obvious so maybe it was not crazy to use SQL Profiles for a while. But it became a burden to keep using them and it left the system vulnerable to a plan change on any query on the problem table that did not already have a SQL Profile.
In a lot of ways this is a simple query tuning problem with bad zero stats on a partition with millions of rows. Any time the statistics are wildly inaccurate bad plans are likely. But tracking down the reason for the zero statistics was not simple. I should mention that some of the table partitions are empty and have 0 rows. In fact, to take our example, imagine that table partition P4000 has 0 rows just like index partition P4000. No one would suspect that the statistics are wrong unless you realize that P4000 on the index corresponds not to P4000 but to P3000 on the table and P3000 has 20,000,000 rows!
Bind variables lead to these unexpected plan changes when the plan is different when different values are passed into the variables. It is a fundamental tradeoff of using bind variables to reduce the parsing that constants would cause. If one set of bind variable values causes a bad plan that plan gets locked into memory and it is used for all the other values passed to the query until the plan is flushed out of memory. There are some cases where the optimizer will look at the values passed into the bind variables and choose between multiple plans but that did not occur in this problem.
So, what can someone get from this post if they are not on 126.96.36.199 and do not have locally partitioned indexes with similar partition names to the table but offset and mismatched? I think for me the point is after using a SQL Profile to resolve a problem query plan to dig deeper into the underlying reason for the bad plan. I expect bad plans because I think that the optimizer is limited in its ability to correctly run a query even if it has the best possible statistics. Also, since I support many databases, I do not have time to dig deeply into the underlying reason for every bad plan. But I need to dig deeper when similar queries keep getting the same kind of bad plans. In many cases bad statistics lead to bad plans and it is a lot better to fix the statistics once than to keep using SQL Profiles and hints repeatedly to fix similar queries with the same sort of switches to bad plans. In this case I had the bizarre fix of recreating the index with partition names that match the table and that ensured that the statistics on the index partitions were accurate and that I no longer need SQL Profiles to lock in the good plan.