I recently saw a dramatic example of result cache latch contention. I had just upgraded a database to 11.2.0.4 and several hours later processing pretty much came to a halt.
Of course I am telling you the end before I tell you the beginning. It all started the morning of July 22nd, 2017, a few weeks back. We had worked for a couple of months on an 11.2.0.4 upgrade. I mentioned some issues with the same upgrade in my earlier post. I spent several hours Saturday morning upgrading the database and it all went very smoothly. Then we kicked off the normal batch processing and things seemed fine. Around 1 am the next morning I got paged about an issue: all of the batch processing had ground to a halt on the upgraded database. Needless to say, I was freaking out and afraid that we would have to back out the upgrade, which would have been ugly and time-consuming.
At first I focused on the top SQL statements in the AWR report. Here are the top few from that morning:
| Elapsed Time (s) | Executions | Elapsed Time per Exec (s) | %Total | %CPU | %IO | SQL Id | SQL Module | SQL Text |
|---|---|---|---|---|---|---|---|---|
| 17,983.20 | 452,555 | 0.04 | 43.98 | 10.64 | 0.00 | 4yw0zv7ty2t47 | SQL*Plus | SELECT SLS_EXCL_RSN FROM ( SEL… |
| 3,643.96 | 0 | | 8.91 | 13.68 | 0.19 | 2fxnrcamtbcc2 | SQL*Plus | DECLARE return_code number := … |
| 3,637.60 | 0 | | 8.90 | 3.67 | 0.06 | 18pdd22fh15dc | SQL*Plus | INSERT /*+ APPEND PARALLEL(TGT… |
The top query dominated the others, taking 43.98% of the total run time, so it made sense that it was the problem. I expected that some queries would change plans with the upgrade. I had used SQL Profiles to lock in other plans that I knew were problematic, but I would not have been surprised to see new ones. But, looking at the top SQL, SQL ID 4yw0zv7ty2t47, the plan was the same as before the upgrade. Here is some execution history of the query before and after the July 22 upgrade:
SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------- --------------- ------------------------- ---------------- ------------------
4yw0zv7ty2t47       848116227 15-JUL-17 11.00.28.230 PM           192430         .822087045
4yw0zv7ty2t47       848116227 16-JUL-17 02.00.24.144 AM          2522505         .691503149
4yw0zv7ty2t47       848116227 16-JUL-17 03.00.42.580 AM          2787002         .793723611
4yw0zv7ty2t47       848116227 22-JUL-17 10.00.12.787 PM           220581         23.4686899
4yw0zv7ty2t47       848116227 22-JUL-17 11.00.30.251 PM           455887         20.1393904
4yw0zv7ty2t47       848116227 23-JUL-17 12.00.47.394 AM           445173         20.5407455
4yw0zv7ty2t47       848116227 23-JUL-17 01.00.05.711 AM           369333         28.3195315
4yw0zv7ty2t47       848116227 23-JUL-17 02.00.22.811 AM           465591         39.0232079
4yw0zv7ty2t47       848116227 23-JUL-17 03.00.40.758 AM           452555         39.7370594
4yw0zv7ty2t47       848116227 23-JUL-17 04.00.57.968 AM           458328         39.3421407
4yw0zv7ty2t47       848116227 23-JUL-17 04.09.32.144 AM            10055         39.1518787
4yw0zv7ty2t47       848116227 23-JUL-17 04.11.58.484 AM            18507         39.6002968
4yw0zv7ty2t47       848116227 23-JUL-17 04.15.24.661 AM             5215         39.4672715
4yw0zv7ty2t47       848116227 23-JUL-17 04.16.30.441 AM             8542         39.1123689
4yw0zv7ty2t47       848116227 23-JUL-17 05.00.15.147 AM           321635         39.9827259
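This kind of history comes straight out of the AWR repository. A minimal sketch of the sort of query that produces it, assuming the standard DBA_HIST_SQLSTAT and DBA_HIST_SNAPSHOT views (the column alias is mine; ELAPSED_TIME_DELTA is in microseconds, so dividing by 1000 per execution gives milliseconds):

-- Average elapsed ms per execution of one SQL_ID, per AWR snapshot
select ss.sql_id,
       ss.plan_hash_value,
       sn.end_interval_time,
       ss.executions_delta,
       ss.elapsed_time_delta/(ss.executions_delta*1000) "Elapsed Average ms"
from dba_hist_sqlstat ss, dba_hist_snapshot sn
where ss.sql_id = '4yw0zv7ty2t47'
and ss.snap_id = sn.snap_id
and ss.instance_number = sn.instance_number
and ss.executions_delta > 0
order by sn.end_interval_time;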
So, I stared at this for a couple of hours, getting more and more stressed because my cure-all SQL Profile was not going to help in this situation: the plan had not changed. I could not think, in my sleep-deprived state, of a way to resolve this issue that morning. Then, after a quick prayer for help, I noticed the %CPU and %IO columns in the query's line in the AWR report: 10% CPU, 0% I/O. The query was spending 90% of its time waiting and doing nothing. I had noticed some latch waits in the AWR report, but I had assumed they were just a side effect of a busy CPU.
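The same CPU-versus-wait picture is visible outside of an AWR report. Here is a minimal sketch against V$SQL, whose time columns are cumulative microseconds since the cursor was loaded; a big gap between ELAPSED_TIME and CPU_TIME that the listed wait columns do not explain points at waits like latch free, which fall in the "Other" wait class:

-- Cumulative time breakdown for one cursor; all columns in microseconds
select sql_id,
       elapsed_time,
       cpu_time,
       user_io_wait_time,
       concurrency_wait_time,
       application_wait_time
from v$sql
where sql_id = '4yw0zv7ty2t47';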
These were the top three events:
| Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
|---|---|---|---|---|---|
| latch free | 3,658,537 | 26.9K | 7 | 65.8 | Other |
| DB CPU | | 4,749 | | 11.6 | |
| db file scattered read | 71,549 | 142.1 | 2 | .3 | User I/O |
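With latch free taking 65.8% of DB time, the next question is which latch. You can ask live sessions directly; a minimal sketch, using the fact that for the generic "latch free" wait event the P2 parameter holds the latch number:

-- Which latch are sessions waiting on right now?
select s.sid, s.sql_id, l.name latch_name
from v$session s, v$latchname l
where s.event = 'latch free'
and s.p2 = l.latch#;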
But now I finally looked back at the latch waits. I went down to the latch section of the AWR report, and the "Result Cache: RC Latch" latch showed up big:
Latch Sleep Breakdown
Latch Name | Get Requests | Misses | Sleeps | Spin Gets |
---|---|---|---|---|
Result Cache: RC Latch | 6,742,176 | 4,142,777 | 4,143,709 | 669,430 |
row cache objects | 5,804,568 | 169,324 | 6,087 | 163,272 |
cache buffers chains | 107,393,594 | 36,532 | 5,859 | 30,976 |
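Outside of AWR, the cumulative numbers for this latch since instance startup are visible in V$LATCH; a minimal sketch:

-- Cumulative contention statistics for the result cache latch
select name, gets, misses, sleeps, spin_gets
from v$latch
where name like 'Result Cache%';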
I had noticed the RESULT_CACHE hint in some queries on our test database but had never seen the latch contention there. Here is the first part of the problem query with the RESULT_CACHE hint:
SQL_ID 4yw0zv7ty2t47
--------------------
SELECT SLS_EXCL_RSN FROM ( SELECT /*+ PARALLEL RESULT_CACHE */ DISTINCT
Here is part of the plan with the result cache step:
Plan hash value: 848116227

------------------------------------------
| Id  | Operation                        |
------------------------------------------
|  11 |  COUNT STOPKEY                   |
|  12 |   VIEW                           |
|  13 |    RESULT CACHE                  |
------------------------------------------
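Before changing anything, you can check whether the result cache is enabled and what it is holding. A minimal sketch using the DBMS_RESULT_CACHE package and the V$RESULT_CACHE_OBJECTS view, both standard in 11g:

-- Is the result cache on, and what does it contain?
select dbms_result_cache.status from dual;

select type, status, count(*)
from v$result_cache_objects
group by type, status;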
Early in the morning I got the idea of disabling the result cache. I ran this command:
alter system set result_cache_max_size=0 scope=both;
All of the running queries immediately died with this error:
ORA-00600: internal error code, arguments: [qesrcDO_AddRO],…
But when the jobs were rerun, they quickly ran to completion. I then compared the query's performance in three states: before the upgrade, after the upgrade with the result cache, and after disabling the result cache. Post upgrade without the result cache, the query ran at least ten times faster than it ever had with it:
SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------- --------------- ------------------------- ---------------- ------------------
4yw0zv7ty2t47       848116227 15-JUL-17 11.00.28.230 PM           192430         .822087045
4yw0zv7ty2t47       848116227 16-JUL-17 02.00.24.144 AM          2522505         .691503149
4yw0zv7ty2t47       848116227 16-JUL-17 03.00.42.580 AM          2787002         .793723611
4yw0zv7ty2t47       848116227 22-JUL-17 10.00.12.787 PM           220581         23.4686899
4yw0zv7ty2t47       848116227 22-JUL-17 11.00.30.251 PM           455887         20.1393904
4yw0zv7ty2t47       848116227 23-JUL-17 12.00.47.394 AM           445173         20.5407455
4yw0zv7ty2t47       848116227 23-JUL-17 01.00.05.711 AM           369333         28.3195315
4yw0zv7ty2t47       848116227 23-JUL-17 02.00.22.811 AM           465591         39.0232079
4yw0zv7ty2t47       848116227 23-JUL-17 03.00.40.758 AM           452555         39.7370594
4yw0zv7ty2t47       848116227 23-JUL-17 04.00.57.968 AM           458328         39.3421407
4yw0zv7ty2t47       848116227 23-JUL-17 04.09.32.144 AM            10055         39.1518787
4yw0zv7ty2t47       848116227 23-JUL-17 04.11.58.484 AM            18507         39.6002968
4yw0zv7ty2t47       848116227 23-JUL-17 04.15.24.661 AM             5215         39.4672715
4yw0zv7ty2t47       848116227 23-JUL-17 04.16.30.441 AM             8542         39.1123689
4yw0zv7ty2t47       848116227 23-JUL-17 05.00.15.147 AM           321635         39.9827259
4yw0zv7ty2t47       848116227 23-JUL-17 05.20.37.524 AM           329457          17.895581
4yw0zv7ty2t47       848116227 23-JUL-17 05.21.15.363 AM           205154         .050141323
4yw0zv7ty2t47       848116227 23-JUL-17 05.25.07.159 AM          1023657         .049949389
If you look through the output you will see that pre-upgrade, on July 16th, the query averaged about .69 to .79 milliseconds per execution. During the latch contention on July 23rd it averaged about 39 milliseconds, vastly worse. But after disabling the result cache it averaged .05 milliseconds, which is at least 10 times faster than it ran with the result cache even before the upgrade.
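For completeness: zeroing RESULT_CACHE_MAX_SIZE is not the only lever. The DBMS_RESULT_CACHE package can bypass and flush the cache without a parameter change. I did not try this that morning, but a sketch would look like this:

-- Bypass the result cache instance-wide, then flush its contents
execute dbms_result_cache.bypass(true);
execute dbms_result_cache.flush;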
So, it seems that the result cache hint on this query had always slowed it down, and the upgrade and the load afterward triggered intense latch contention that we had not seen before. Still, it is very cool that disabling the result cache actually made the query faster than it had ever been.
I don’t think it makes sense to put a RESULT_CACHE hint in a query that runs in .05 milliseconds without it; the overhead of the result cache alone made the query at least 10 times slower. Something about the upgrade turned that overhead into latch contention that caused a system-wide problem. We could have avoided the problem by leaving off the RESULT_CACHE hint, but it was nice that I could resolve it quickly with a single parameter change.
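If you want to find out ahead of time which statements in your shared pool carry this hint, you can scan the cursor cache. A minimal sketch (SQL_FULLTEXT is a CLOB, and this will also match statements that merely mention RESULT_CACHE in a comment, so review the hits by hand):

-- Find cursors whose text mentions RESULT_CACHE
select sql_id, executions, substr(sql_text, 1, 60) sql_start
from v$sql
where upper(sql_fulltext) like '%RESULT_CACHE%';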
Bobby
Thanks for your help, I have faced the same issue after a new instance installation, and your solution works!
That is great! I am glad that my post helped. Thank you for the comment.
Bobby