I resolved another production performance issue with a SQL Profile yesterday. I have several posts about SQL Profiles, so I do not want to be redundant, but this case was a little different because it was not clear that I had a better plan. I want to document the challenge that I had deciding if I had the best plan and show the resolution.
On September 21 when I think I was on vacation or otherwise not in the office there was a big performance issue on an older production database. The on-call DBA identified the sql_id of the top SQL statement as 30q69rbpn7g75. But he and an application support developer together could not connect that SQL statement back to the long running reports that were impacting the business. Eventually the issue went away later that night. Here is some of the execution history from that original issue:
This output is from my sqlstat.sql script. Notice how earlier in the day there are several plans with widely varying elapsed times. Plan hash value 1732425686 averages 66 millisconds during the hour ending at 10 am. I do not know why the query has so many different plans. 2823019537 was the bad plan and it got locked in throughout the afternoon and into the night.
Later in that same week I reviewed the plans and found that the longest running plans used certain indexes on the top table and the shorter ones used another index. I looked at the “Segments by Logical Reads” section of an AWR report during the problem time and found that the top segment was index X6_WFCTOTAL. I used by getplans.sql to get all the plans for SQL id 30q69rbpn7g75 and found that the fastest ones used range scans against index X1_WFCTOTAL and the slow ones did range or skip scans against indexes X5_WFCTOTAL or X6_WFCTOTAL. So I picked one plan, 382022017, and used coe_xfr_sql_profile.sql to force 30q69rbpn7g75 to always run with plan 382022017 which used index X1_WFCTOTAL. Here is some execution history of the plan I picked:
This plan was averaging less than 1800 milliseconds. But after putting in the supposedly better plan it seemed to have worse execution times that other plans. Also, the plan hash value was different than 382022017. Here is the execution history from when I put this SQL Profile in place:
Notice first that the plan hash value is 653445232 instead of 382022017. I usually see things like this when the plan has system generated table temporary table names, but I have not seen that in the plan. There must be something like that going on. Either that on the SQL Profile just locks in a slightly different plan. Anyway, 653445232 is the plan caused by the SQL Profile. Notice how the average elapsed time hangs around 90,000 to 100,000 milliseconds. But 382022017 has elapsed times under 1800 milliseconds. Seeing these results, I dropped the SQL Profile. Kind of like a doctor who swears to “do no harm” I did not want to slow down queries that were running very efficiently in my attempt to prevent another system slowdown or outage like we had on September 21. I dropped the SQL Profile on September 29.
Then yesterday, October 19, the problem recurred. This time the bad plan was 3908549093 and used a skip scan. I seem to see a lot of bad plans with skip scans. Here was the execution history before I put the SQL Profile back in:
Between 12 and 1 pm the system was really bogged down with executions of 30q69rbpn7g75 taking almost an hour, 3543778 milliseconds. So, I put the SQL Profile back in. I had the script sitting there from the first time I tried it. These are all kept in the sqlt/utl directory. After putting it in things cleared up and the backlog of reports slowly emptied. I think we killed one long running session and had the user rerun the report with the SQL Profile in place.
But with the SQL Profile in place it still in consistently around 90,000 milliseconds even today. There are no executions under 1800 milliseconds like before. It is a puzzle, but it got us through the issue. I am planning to leave this in place to prevent another production outage, but I suspect that these reports may be running longer than normal in many cases. But at least they are consistent and the business needs are being met.
So, this is another example of a SQL Profile to the rescue in a production database performance issue. But this was different because it was not clear that the one plan was always the best one. I went so far as to put it in and take it out and then put it in again. I ended up leaving it in because it resolved a performance issue that we had twice and do not want to have in the future. Hopefully any less than optimal performance outside of these peak times will be worth it since we are preventing an outage with significant impact to the business.