Plan change monitor prevents user impact from bad plan

This morning a colleague of mine noticed an alert from our plan change monitor and prevented any impact on the users by forcing a SQL’s plan to its normal fast plan before the users could be affected.  I want to share the monitor script and describe how we use its output.

I know this is long but I want to paste the SQL text of the monitor script here.  Review my comments on each step.  I can not remember the details of the script myself even though I wrote it but the comments hopefully will help:

set linesize 160
set pagesize 1000
set echo on
set termout on
set trimspool on
set define off

-- status active
-- this plan isn't the most commonly
-- run plan.
-- average this plan in v$sql > 10 x average of
-- most common plan

-- status active sessions
-- sql_id and plan_hash_value
-- elapsed and executions
-- max elapsed for the sql
-- eliminate pl/sql which has
-- plan hash value of 0

drop table active_sql;

create table active_sql as
select
vs.sql_id,
sq.plan_hash_value,
max(sq.ELAPSED_TIME) elapsed,
max(sq.executions) executions
from
v$session vs,
v$sql sq
where
vs.sql_id=sq.sql_id and
vs.SQL_CHILD_NUMBER=sq.child_number and
vs.status='ACTIVE' and
sq.plan_hash_value <> 0
group by 
vs.sql_id,
sq.plan_hash_value;

-- to get the most frequently
-- used plan first get the number
-- of exections by plan
-- for each of the active sqls

drop table plan_executions;

create table plan_executions as
select
ss.sql_id,
ss.plan_hash_value,
sum(ss.executions_delta) total_executions
from
dba_hist_sqlstat ss,
active_sql a
where
a.sql_id=ss.sql_id
group by 
ss.sql_id,
ss.plan_hash_value;

-- use the previous table to get
-- the plans that are most frequently 
-- used. note that two plans could
-- have the same number of executions
-- but this is unlikely.

drop table most_frequent_executions;

create table most_frequent_executions as
select
pe1.sql_id,
pe1.plan_hash_value
from plan_executions pe1
where
pe1.total_executions =
(select max(pe2.total_executions)
from plan_executions pe2
where
pe1.sql_id=pe2.sql_id);

-- handle special case of two plans with
-- same number of executions.
-- pick one with highest plan value
-- just to eliminate dups.

drop table most_frequent_nodups;

create table most_frequent_nodups as
select
mfe1.sql_id,
mfe1.plan_hash_value
from most_frequent_executions mfe1
where
mfe1.plan_hash_value =
(select max(mfe2.plan_hash_value)
from most_frequent_executions mfe2
where
mfe1.sql_id=mfe2.sql_id);

-- get list of active sql that 
-- are not running the most
-- frequently executed plan

drop table not_most_freq;

create table not_most_freq as
select * from active_sql
where
(sql_id,plan_hash_value) not in
(select sql_id,plan_hash_value from most_frequent_nodups);

-- working on this part of the logic:
-- average this plan in v$sql > 10 x average of
-- most common plan

-- get average elapsed of most
-- frequently executed plans
-- add 1 to handle case of 0 executions

drop table avg_elapsed_most_f;

create table avg_elapsed_most_f as
select
ss.sql_id,
ss.plan_hash_value,
sum(ss.ELAPSED_TIME_DELTA)/(sum(ss.executions_delta)+1) 
avg_elapsed
from DBA_HIST_SQLSTAT ss,
most_frequent_nodups nd
where 
ss.sql_id = nd.sql_id and
ss.plan_hash_value = nd.plan_hash_value
group by
ss.sql_id,
ss.plan_hash_value;

-- get list of the sqls that are running
-- the plan that isn't most frequently 
-- executed and has an average elapsed 
-- more than 10 times the average of 
-- the most frequently executed plan
-- add 1 to executions to prevent 
-- divide by zero

drop table more_than_10x;

create table more_than_10x as
select 
n.sql_id,
n.plan_hash_value
from 
not_most_freq n,
avg_elapsed_most_f m
where
(n.elapsed/(n.executions+1)) > 10 * m.avg_elapsed and
n.sql_id=m.sql_id;

spool planchangemonitor.log

select name db_name from v$database;

-- The listed sql_id and plan_hash_value items correspond to 
-- sql statements that have plans that may be
-- inefficient and need to be investigated.
-- The session id and username are included if a
-- session is currently running the sql with the plan.

select
'CHANGED '||'PLAN' flag,
m.sql_id,
m.plan_hash_value,
s.sid,
s.username
from 
more_than_10x m,
v$session s,
v$sql q
where
m.sql_id=s.sql_id(+) and
m.plan_hash_value=q.plan_hash_value(+) and
s.sql_id=q.sql_id and
s.SQL_CHILD_NUMBER=q.child_number
order by
m.sql_id,
m.plan_hash_value,
s.sid,
s.username;

spool off

If I remember correctly I think the script looks for sessions running a plan whose current run time is 10 times that of the most frequently executed plan. This script is not perfect. The join to v$sql is not perfect and in some cases you can get duplicates.  People could quibble about the approach.  Why 10 times the previous run time?  I thought about more complex approaches but I just needed to get something in place.  But, on one database with a lot of small transactions we have made good use of this script, despite its flaws.

This morning my colleague noticed emails from the script that runs this SQL listing a query whose plan had changed.  The output looked like this:

FLAG         SQL_ID        PLAN_HASH_VALUE        SID USERNAME
------------ ------------- --------------- ---------- ---------
CHANGED PLAN 75ufmwrcmsuwz      2484041482         35 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        394 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        395 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        446 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        463 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        464 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        544 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        613 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        631 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        665 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        678 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        738 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        746 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        750 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482        752 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       1333 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       1416 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       1573 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       1943 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       1957 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       3038 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       3445 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz      2484041482       3816 YOURUSER

I changed the real user to YOURUSER.  This output indicates that 23 sessions were all running the same SQL – sql_id=75ufmwrcmsuwz – and that this SQL was running on a new plan that was causing the SQL to run at least 10 times the normal run time.  In fact it was about 30 times as long.

To resolve the issue my colleague used our script to find the history of plans for 75ufmwrcmsuwz.

     select ss.sql_id,
  2  ss.plan_hash_value,
  3  sn.END_INTERVAL_TIME,
  4  ss.executions_delta,
  5  ELAPSED_TIME_DELTA/(executions_delta*1000) "Elapsed Average ms",
  6  CPU_TIME_DELTA/(executions_delta*1000) "CPU Average ms",
  7  IOWAIT_DELTA/(executions_delta*1000) "IO Average ms",
  8  CLWAIT_DELTA/(executions_delta*1000) "Cluster Average ms",
  9  APWAIT_DELTA/(executions_delta*1000) "Application Average ms",
 10  CCWAIT_DELTA/(executions_delta*1000) "Concurrency Average ms",
 11  BUFFER_GETS_DELTA/executions_delta "Average buffer gets",
 12  DISK_READS_DELTA/executions_delta "Average disk reads",
 13  ROWS_PROCESSED_DELTA/executions_delta "Average rows processed"
 14  from DBA_HIST_SQLSTAT ss,DBA_HIST_SNAPSHOT sn
 15  where ss.sql_id = '75ufmwrcmsuwz'
 16  and ss.snap_id=sn.snap_id
 17  and executions_delta > 0
 18  and ss.INSTANCE_NUMBER=sn.INSTANCE_NUMBER
 19  order by ss.snap_id,ss.sql_id;

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ----------------------
75ufmwrcmsuwz      2297146707 30-AUG-15 09.00.54.472 AM              830         587.207673     49.9638554    541.617188                  0                      0                      0          5234.01928         125.083133              332.66747
75ufmwrcmsuwz      2297146707 30-AUG-15 10.00.46.872 AM             1178         578.205867     49.3972835    532.377174                  0                      0                      0          4870.22326         126.048387              324.50764
75ufmwrcmsuwz      2297146707 30-AUG-15 11.00.39.206 AM             1433         631.484713     49.1486392    585.826676                  0                      0                      0          4624.11305         125.446615              299.57083
75ufmwrcmsuwz      2297146707 30-AUG-15 12.00.32.247 PM             1620         592.593823     49.5987654     546.29731                  0                      0                      0          4744.17284         121.961728             312.735185
75ufmwrcmsuwz      2297146707 30-AUG-15 01.00.20.783 PM             1774         534.412339      51.059752     485.46836                  0                      0                      0          4983.44983         119.564825             326.067644
75ufmwrcmsuwz      2297146707 30-AUG-15 02.00.08.843 PM             1757         447.385105     44.9345475    404.415659                  0                      0                      0          4525.13147         107.277746             293.739328
75ufmwrcmsuwz      2297146707 30-AUG-15 03.00.57.308 PM             1626         431.718507      45.904059    388.200416                  0                      0                      0          4462.93296         118.027675             300.724477
75ufmwrcmsuwz      2297146707 30-AUG-15 04.00.44.209 PM             1080         375.905966      44.212963    334.434835                  0                      0                      0          4766.81574         109.157407             310.712037
75ufmwrcmsuwz      2297146707 30-AUG-15 05.00.36.753 PM              707         368.289475     44.3140028    327.166223                  0                      0                      0          4894.20509         108.050919             315.565771
75ufmwrcmsuwz      2297146707 30-AUG-15 06.00.26.449 PM              529         341.483588     39.6408318     305.47356                  0                      0                      0          4381.19849         96.2646503             288.030246
75ufmwrcmsuwz      2297146707 30-AUG-15 07.00.17.636 PM              356         380.733635     41.5168539    342.034876                  0                      0                      0           4553.4691         105.272472             292.283708
75ufmwrcmsuwz      2297146707 30-AUG-15 08.00.11.170 PM              313         435.449406     37.1565495    402.636489                  0                      0                      0          4144.30351         92.8690096             264.923323
75ufmwrcmsuwz      2297146707 30-AUG-15 09.00.07.894 PM              214         516.455509     44.5794393    477.020692                  0                      0                      0          4567.67757         114.415888             289.607477
75ufmwrcmsuwz      2297146707 30-AUG-15 10.00.59.991 PM              182         720.749681     44.3956044    684.439467                  0                      0                      0          3811.83516         95.2362637             239.027473
75ufmwrcmsuwz      2297146707 30-AUG-15 11.00.47.388 PM               83          1043.1503     43.7349398    1008.41358                  0                      0                      0          3575.96386         114.289157             250.120482
75ufmwrcmsuwz      2484041482 30-AUG-15 11.00.47.388 PM                6         25314.6558     4311.66667    22971.4913                  0                      0                      0          78533.8333         69813.3333             157.833333
75ufmwrcmsuwz      2484041482 31-AUG-15 12.00.36.033 AM               96         25173.7346     5105.20833    21475.9516                  0                      0                      0          135242.802         62433.3125             118.395833
75ufmwrcmsuwz      2484041482 31-AUG-15 01.00.29.070 AM               39         26877.0626     5540.51282    22977.6229                  0                      0                      0          139959.308         68478.1795             93.7179487
75ufmwrcmsuwz      2484041482 31-AUG-15 02.00.18.755 AM               38          26993.419     5998.15789    22768.4285                  0                      0                      0          153843.342              74492             149.342105
75ufmwrcmsuwz      2484041482 31-AUG-15 03.00.09.615 AM               29         25432.5074     4952.06897    22288.7966                  0                      0                      0          112813.552         69803.0345             187.689655
75ufmwrcmsuwz      2484041482 31-AUG-15 04.00.01.749 AM               34         27281.7339     4541.47059    24543.1609                  0                      0                      0             95144.5         69187.3824             135.676471
75ufmwrcmsuwz      2484041482 31-AUG-15 05.00.55.876 AM              146         30512.9976     5421.43836    26984.2559                  0                      0                      0          115531.801         71886.6644             136.321918
75ufmwrcmsuwz      2484041482 31-AUG-15 06.00.50.769 AM              405         24339.6641     4853.40741    20794.0472                  0                      0                      0           115490.01         62004.4642             229.106173
75ufmwrcmsuwz      2484041482 31-AUG-15 07.00.43.047 AM              838         27552.3731     4903.06683    23661.2101                  0                      0                      0          111654.558         58324.9511             176.058473
75ufmwrcmsuwz      2484041482 31-AUG-15 08.00.50.864 AM             1653         30522.8358     4550.56261    26526.2183                  0                      0                      0           93818.418         49865.4701             137.212341

Based on this output my colleague chose 2297146707 as the good plan.  She ran coe_xfr_sql_profile.sql which is found in SQLT’s utl directory like this:

coe_xfr_sql_profile.sql 75ufmwrcmsuwz 2297146707

Then she ran the generated script:

coe_xfr_sql_profile_75ufmwrcmsuwz_2297146707.sql

This forced the plan back to its earlier efficient plan and my coworker did all of this early this morning before we reached our peak time of usage and before the bad plan could affect our users.

Bobby

P.S. This example is from an 11.2.0.3 database running on HP-UX Itanium.

About Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I have worked in the Information Technology field since 1989. I have a passion for Oracle database performance tuning because I enjoy challenging technical problems that require an understanding of computer science. I enjoy communicating with people about my work.

This entry was posted in Uncategorized. Bookmark the permalink.

10 Responses to Plan change monitor prevents user impact from bad plan

  1. Hi Bobby,

    What a novel and creative approach. So what if it is not perfect in all cases, it works. . . .

    This stuff is solid gold!

    You really need to consider publishing you blog excerpts as a book. IMHO.

    • Bobby says:

      Don,

      Thanks for the comment! It is good to hear other people validate the approach. The script has helped us even though we get some false positives.

      You are kind to say that I should write a book. I’m not sure what I would put in a book without covering the same ground that other people cover. It would be fun to write about something that was more unique to me and that I felt enthusiastic about. I think I have a few creative scripts and ideas here and there but a lot of what I do I learned from other people and their books.

      Still, I enjoy doing the blog and I appreciate your positive feedback and comments about writing a book.

      Bobby

  2. tgascard says:

    Hello,
    I think you need to purge the sql_id from the shared pool before fixing the plan. I am not sure that put a sql profile will invalidate the cursor.

    • Bobby says:

      You don’t need to do anything with the shared pool. You just need to run the script generated by coe_xfr_sql_profile.sql. We have done this many times to fix a bad plan on the fly on busy systems.

  3. Kevin says:

    Would this work with Oracle 10gR2? From a quick scan of the code, it looks like that all the views and columns are supported.

    • Bobby says:

      I just did a quick run on a 10.2 system and it ran without errors. All the columns seem to be there. In our case we have it running on 11.1 and 11.2 systems so I know it works there. Probably will be fine on 10.2 but have not used it in production.

  4. Anonymous says:

    Hi Bobby, very nice script so thank you for publishing it. I’m having difficulty locating the script coe_xfr_sql_profile.sql – where exactly is it? I presume it’s an Oracle supplied script? Is this a 12c database?

    • Bobby says:

      This is an 11.2 database. coe_xfr_sql_profile.sql is part of SQLT which you can download from Oracle’s support site. This document has a link to the sqlt.zip download: All About the SQLT Diagnostic Tool (Doc ID 215187.1). coe_xfr_sql_profile.sql is in the utl subdirectory of the directory where you unzip sqlt.zip.

  5. Anonymous says:

    Hi Bobby,

    I’d suggest adding a comment on whether using the DBA view “dba_hist_sqlstat” is licensable (e.g. does it require diagnostics/tuning packs?).

    Phil

Leave a Reply