I resolved another production performance issue with a SQL Profile yesterday. I have several posts about SQL Profiles, so I do not want to be redundant, but this case was a little different because it was not clear that I had a better plan. I want to document the challenge that I had deciding if I had the best plan and show the resolution.
On September 21 when I think I was on vacation or otherwise not in the office there was a big performance issue on an older production database. The on-call DBA identified the sql_id of the top SQL statement as 30q69rbpn7g75. But he and an application support developer together could not connect that SQL statement back to the long running reports that were impacting the business. Eventually the issue went away later that night. Here is some of the execution history from that original issue:
SQL_ID PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75 1463081877 21-SEP-20 10.00.04.123 AM 5 420798.922 144724 209959.863 0 0 0 4824516.8 710748.4 0 870
30q69rbpn7g75 1732425686 21-SEP-20 10.00.04.123 AM 13 66.9426923 56.1538462 7.25523077 0 0 0 6410.23077 249.769231 0 414.538462
30q69rbpn7g75 592872749 21-SEP-20 10.00.04.123 AM 1 4144.437 3240 955.246 0 0 0 63878 29725 0 89
30q69rbpn7g75 4175108531 21-SEP-20 10.00.04.123 AM 11 172328.731 92788.1818 62448.1775 0 0 0 3463219 466919.273 0 610.090909
30q69rbpn7g75 2823019537 21-SEP-20 11.00.17.704 AM 19 332354.362 212357.895 22580.778 0 0 0 11145610.8 163882.105 0 303.526316
30q69rbpn7g75 2908690533 21-SEP-20 11.00.17.704 AM 1 23377.169 13070 11680.972 0 0 0 1582917 89479 0 258
30q69rbpn7g75 291832905 21-SEP-20 11.00.17.704 AM 2 40314.0255 15940 24490.36 0 0 0 1831813 128525 0 248
30q69rbpn7g75 1467059382 21-SEP-20 11.00.17.704 AM 1 20179.636 5760 16155.407 0 0 0 124599 84761 0 780
30q69rbpn7g75 1033740578 21-SEP-20 11.00.17.704 AM 1 1728.49 1570 2.906 0 0 0 1525 35 0 12
30q69rbpn7g75 4175108531 21-SEP-20 11.00.17.704 AM 8 147782.833 59617.5 65356.3268 0 0 0 2280007 245985.25 0 402
30q69rbpn7g75 3938646712 21-SEP-20 11.00.17.704 AM 2 139722.393 55905 86105.482 0 0 0 6080269 616766 0 1143.5
30q69rbpn7g75 2823019537 21-SEP-20 12.00.31.659 PM 48 238332.678 138706.875 19077.4738 0 0 0 6928661.85 99573.2708 0 145.395833
30q69rbpn7g75 2823019537 21-SEP-20 01.00.45.195 PM 64 147520.373 80835.1563 19092.0985 0 0 0 4148771.28 106131.016 0 79.890625
30q69rbpn7g75 2823019537 21-SEP-20 02.00.58.673 PM 58 180185.939 113102.931 14365.2987 0 0 0 5926129.21 123920.569 0 22.0344828
30q69rbpn7g75 2823019537 21-SEP-20 03.00.12.404 PM 37 307432.645 201436.216 22904.6901 0 0 0 10204978.9 158950.973 0 201.243243
30q69rbpn7g75 2823019537 21-SEP-20 04.00.26.543 PM 28 465140.082 326940.357 30687.9033 0 0 0 16715547.3 263153 0 460.571429
30q69rbpn7g75 2823019537 21-SEP-20 05.00.40.707 PM 14 934982.157 690958.571 41595.1995 0 0 0 34940770.8 365038.357 0 243.285714
30q69rbpn7g75 2823019537 21-SEP-20 06.00.54.453 PM 14 818768.534 640054.286 42596.9506 0 0 0 33547406.9 451864.786 0 471.5
30q69rbpn7g75 2823019537 21-SEP-20 07.00.08.229 PM 4 2329248.39 2013515 75722.5718 0 0 0 104343531 1027683.25 0 859
30q69rbpn7g75 2823019537 21-SEP-20 08.00.21.705 PM 1 4006478.22 3707840 88265.422 0 0 0 186157328 1082000 0 2744
30q69rbpn7g75 2823019537 21-SEP-20 09.00.34.774 PM 1 1818375.63 1771470 20586.628 0 0 0 88206433 374924 0 2692
30q69rbpn7g75 2823019537 21-SEP-20 10.00.49.484 PM 2 1742051 1399440 41061.122 0 0 0 68750135.5 335797 0 1479.5
30q69rbpn7g75 2823019537 21-SEP-20 11.00.02.543 PM 0 3552963.71 3183770 142948.208 0 0 0 154159601 633488 0 0
This output is from my sqlstat.sql script. Notice how earlier in the day there are several plans with widely varying elapsed times. Plan hash value 1732425686 averages 66 millisconds during the hour ending at 10 am. I do not know why the query has so many different plans. 2823019537 was the bad plan and it got locked in throughout the afternoon and into the night.
Later in that same week I reviewed the plans and found that the longest running plans used certain indexes on the top table and the shorter ones used another index. I looked at the “Segments by Logical Reads” section of an AWR report during the problem time and found that the top segment was index X6_WFCTOTAL. I used by getplans.sql to get all the plans for SQL id 30q69rbpn7g75 and found that the fastest ones used range scans against index X1_WFCTOTAL and the slow ones did range or skip scans against indexes X5_WFCTOTAL or X6_WFCTOTAL. So I picked one plan, 382022017, and used coe_xfr_sql_profile.sql to force 30q69rbpn7g75 to always run with plan 382022017 which used index X1_WFCTOTAL. Here is some execution history of the plan I picked:
SQL_ID PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75 382022017 24-SEP-20 09.00.13.658 AM 2 1713.2625 865 5.805 0 0 0 2135 76.5 0 141
30q69rbpn7g75 382022017 24-SEP-20 10.00.26.654 AM 2 355.329 120 265.0765 0 0 0 8183 1324.5 0 673
30q69rbpn7g75 382022017 24-SEP-20 11.00.39.519 AM 1 190.386 180 28.981 0 0 0 17505 1759 0 1878
30q69rbpn7g75 382022017 24-SEP-20 01.00.05.144 PM 4 217.37625 200 20.723 0 0 0 21009.25 392.25 0 1865
30q69rbpn7g75 382022017 24-SEP-20 04.00.45.160 PM 5 507.1578 114 432.2858 0 0 0 7076.6 936.2 0 407.8
30q69rbpn7g75 382022017 24-SEP-20 05.00.58.322 PM 3 47.5793333 10 35.1866667 0 0 0 504.333333 81.3333333 0 54
30q69rbpn7g75 382022017 24-SEP-20 06.00.11.202 PM 1 313.107 110 229.071 0 0 0 8178 399 0 396
30q69rbpn7g75 382022017 25-SEP-20 04.00.21.833 AM 3 30.7433333 16.6666667 15.0446667 0 0 0 927 89 0 110.666667
This plan was averaging less than 1800 milliseconds. But after putting in the supposedly better plan it seemed to have worse execution times that other plans. Also, the plan hash value was different than 382022017. Here is the execution history from when I put this SQL Profile in place:
SQL_ID PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75 653445232 28-SEP-20 12.00.42.470 PM 1 92139.015 26930 61332.901 0 20.998 0 2202073 2197909 0 171
30q69rbpn7g75 653445232 28-SEP-20 01.00.55.521 PM 9 83373.8866 25152.2222 56386.2852 0 14.6918889 0 2102264.11 2065530 0 2613.88889
30q69rbpn7g75 653445232 28-SEP-20 02.00.08.369 PM 5 101830.312 29264 70157.0998 0 9.1714 0 2439219.6 2432260.2 0 348
30q69rbpn7g75 653445232 28-SEP-20 03.00.21.759 PM 9 89705.846 26606.6667 61716.6917 0 14.2046667 0 2212464.67 2196829.33 0 825.111111
30q69rbpn7g75 653445232 28-SEP-20 04.00.34.724 PM 2 91173.826 26425 63832.838 0 22.1385 0 2181959 2195820.5 0 310.5
30q69rbpn7g75 653445232 28-SEP-20 05.00.47.682 PM 4 90240.0368 26430 62587.1345 0 16.558 0 2200555 2197764.5 0 168.75
30q69rbpn7g75 653445232 28-SEP-20 07.00.13.544 PM 3 107541.972 26816.6667 75418.071 0 16.164 0 2193977 2193579.33 0 39.3333333
30q69rbpn7g75 653445232 29-SEP-20 02.00.45.539 AM 1 92416.895 26410 66108.169 0 21.449 0 2194591 2193764 0 158
30q69rbpn7g75 653445232 29-SEP-20 03.00.30.438 AM 1 103773.265 26510 75455.905 0 13.887 0 2200242 2198725 0 122
30q69rbpn7g75 653445232 29-SEP-20 05.00.56.488 AM 13 74637.6784 20391.5385 51139.8206 0 16.9292308 0 1686133.69 1684935.15 0 24.2307692
30q69rbpn7g75 653445232 29-SEP-20 06.00.09.283 AM 10 105894.074 28882 72971.0734 0 20.1169 0 2418827.8 2417314.8 0 56
30q69rbpn7g75 653445232 29-SEP-20 07.00.22.558 AM 4 89408.2108 26080 61537.7755 0 10.0275 0 2174791 2169846 0 421.75
30q69rbpn7g75 653445232 29-SEP-20 08.00.35.596 AM 7 71644.7906 20574.2857 48465.1234 0 8.331 0 1697739 1694385.43 0 232.857143
30q69rbpn7g75 653445232 29-SEP-20 09.00.48.530 AM 2 113993.942 32580 72589.2515 0 6.2465 0 2672441 2667206 0 145
30q69rbpn7g75 653445232 29-SEP-20 10.00.01.174 AM 6 99793.2442 28600 69181.7687 0 4.65783333 0 2394135.83 2387505.17 0 246
30q69rbpn7g75 653445232 29-SEP-20 11.00.14.240 AM 10 94787.9044 28367 65535.8735 0 10.6895 0 2353904.6 2336951.3 0 476.8
30q69rbpn7g75 653445232 29-SEP-20 12.00.27.167 PM 5 89167.86 26462 61499.462 0 14.4808 0 2200557.8 2195895 0 201.8
Notice first that the plan hash value is 653445232 instead of 382022017. I usually see things like this when the plan has system generated table temporary table names, but I have not seen that in the plan. There must be something like that going on. Either that on the SQL Profile just locks in a slightly different plan. Anyway, 653445232 is the plan caused by the SQL Profile. Notice how the average elapsed time hangs around 90,000 to 100,000 milliseconds. But 382022017 has elapsed times under 1800 milliseconds. Seeing these results, I dropped the SQL Profile. Kind of like a doctor who swears to “do no harm” I did not want to slow down queries that were running very efficiently in my attempt to prevent another system slowdown or outage like we had on September 21. I dropped the SQL Profile on September 29.
Then yesterday, October 19, the problem recurred. This time the bad plan was 3908549093 and used a skip scan on X6_WFCTOTAL. I seem to see a lot of bad plans with skip scans. Here was the execution history before I put the SQL Profile back in:
SQL_ID PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75 3908549093 19-OCT-20 11.00.07.389 AM 16 832180.951 450528.75 68650.1103 0 0 0 21204921.8 495900.5 0 562
30q69rbpn7g75 3908549093 19-OCT-20 12.00.21.033 PM 33 392068.144 194336.364 47412.7197 0 0 0 9271475.06 221593.545 0 309.454545
30q69rbpn7g75 3908549093 19-OCT-20 01.00.34.570 PM 4 3543778.15 1793980 261653.391 0 0 0 82176276.8 1088971 0 1036.5
Between 12 and 1 pm the system was really bogged down with executions of 30q69rbpn7g75 taking almost an hour, 3543778 milliseconds. So, I put the SQL Profile back in. I had the script sitting there from the first time I tried it. These are all kept in the sqlt/utl directory. After putting it in things cleared up and the backlog of reports slowly emptied. I think we killed one long running session and had the user rerun the report with the SQL Profile in place.
SQL_ID PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75 653445232 19-OCT-20 02.00.48.739 PM 27 106287.519 26936.2963 61603.699 0 7.41166667 0 2226367.89 2204705.19 0 1358.25926
30q69rbpn7g75 3908549093 19-OCT-20 02.00.48.739 PM 0 9380042.72 6465310 308289.185 0 0 0 294469425 1581492 0 7067
30q69rbpn7g75 653445232 19-OCT-20 03.00.01.772 PM 80 103522.356 26259.25 67322.2418 0 21.1012125 0 2159835.1 2142347.26 0 914.2625
30q69rbpn7g75 653445232 19-OCT-20 04.00.14.862 PM 31 99252.0065 26990 66069.6995 0 24.5769355 0 2227363.45 2213021.26 0 885.709677
30q69rbpn7g75 653445232 19-OCT-20 05.00.27.768 PM 5 94587.0244 26988 64440.3338 0 24.1514 0 2223326.4 2204672.6 0 1553.6
30q69rbpn7g75 653445232 19-OCT-20 06.00.40.629 PM 5 93301.2074 26690 65105.9732 0 14.0782 0 2213653.6 2203033 0 736.4
30q69rbpn7g75 653445232 19-OCT-20 07.00.53.923 PM 3 101117.233 27193.3333 72020.9813 0 15.4373333 0 2225218.33 2207644 0 1623
30q69rbpn7g75 653445232 20-OCT-20 12.00.59.660 AM 1 92061.35 26550 64953.945 0 22.245 0 2210157 2206170 0 235
30q69rbpn7g75 653445232 20-OCT-20 02.00.25.964 AM 1 92872.242 26470 66092.822 0 11.999 0 2208305 2206231 0 158
30q69rbpn7g75 653445232 20-OCT-20 05.00.36.773 AM 2 88107.2095 26075 61670.129 0 17.2175 0 2205332.5 2203981 0 116
30q69rbpn7g75 653445232 20-OCT-20 06.00.49.657 AM 1 91007.493 26210 64276.474 0 9.972 0 2208516 2206310 0 177
30q69rbpn7g75 653445232 20-OCT-20 07.00.02.613 AM 4 101878.314 26940 65491.475 0 30.476 0 2210945.25 2204828.75 0 461.5
30q69rbpn7g75 653445232 20-OCT-20 08.00.15.526 AM 10 97441.3635 26496 67549.4579 0 3.178 0 2197412.4 2192467.5 0 478.1
30q69rbpn7g75 653445232 20-OCT-20 09.00.28.457 AM 5 59362.9672 17038 41573.7714 0 7.767 0 1416804.4 1416061 0 10.2
30q69rbpn7g75 653445232 20-OCT-20 10.00.41.653 AM 6 108681.505 30798.3333 75082.4997 0 10.5146667 0 2558854.67 2549531.83 0 405
30q69rbpn7g75 653445232 20-OCT-20 11.00.54.611 AM 1 186021.778 53440 131017.985 0 18.976 0 4285650 4202806 0 814
But with the SQL Profile in place it still in consistently around 90,000 milliseconds even today. There are no executions under 1800 milliseconds like before. It is a puzzle, but it got us through the issue. I am planning to leave this in place to prevent another production outage, but I suspect that these reports may be running longer than normal in many cases. But at least they are consistent and the business needs are being met.
So, this is another example of a SQL Profile to the rescue in a production database performance issue. But this was different because it was not clear that the one plan was always the best one. I went so far as to put it in and take it out and then put it in again. I ended up leaving it in because it resolved a performance issue that we had twice and do not want to have in the future. Hopefully any less than optimal performance outside of these peak times will be worth it since we are preventing an outage with significant impact to the business.
Bobby