Result cache latch contention

I recently saw a dramatic example of result cache latch contention. I had just upgraded a database to 11.2.0.4 and several hours later processing pretty much came to a halt.

Of course I am telling you the end before I tell you the beginning. It all started the morning of July 22nd, 2017, a few weeks back. We had worked for a couple of months on an 11.2.0.4 upgrade. I mentioned some issues with the same upgrade in my earlier post. I spent several hours Saturday morning upgrading the database and it all went very smoothly. Then we kicked off the normal batch processing and things seemed fine. Around 1 am the next morning I get paged about an issue. All of the batch processing had ground to a halt on the upgraded database. Needless to say, I was freaking out and afraid that we would have to back out the upgrade. This would have been ugly and time-consuming.

At first I  focused on the top SQL statements in the AWR report. Here are the top few from that morning:

Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
17,983.20 452,555 0.04 43.98 10.64 0.00 4yw0zv7ty2t47 SQL*Plus SELECT SLS_EXCL_RSN FROM ( SEL…
3,643.96 0 8.91 13.68 0.19 2fxnrcamtbcc2 SQL*Plus DECLARE return_code number := …
3,637.60 0 8.90 3.67 0.06 18pdd22fh15dc SQL*Plus INSERT /*+ APPEND PARALLEL(TGT…

The top query dominates the others because it takes 43.98% of the total run time so it made sense that this query was the problem. I expected that some queries would change plan with the upgrade. I had used SQL Profiles to lock in other plans that I knew were problematic but I would not be surprised to see new ones. But, looking at the top SQL, sql id 4yw0zv7ty2t47, the plan was the same as before the upgrade. Here is some execution history of the query before and after the July 22 upgrade:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------- --------------- ------------------------- ---------------- ------------------
4yw0zv7ty2t47       848116227 15-JUL-17 11.00.28.230 PM           192430         .822087045
4yw0zv7ty2t47       848116227 16-JUL-17 02.00.24.144 AM          2522505         .691503149
4yw0zv7ty2t47       848116227 16-JUL-17 03.00.42.580 AM          2787002         .793723611
4yw0zv7ty2t47       848116227 22-JUL-17 10.00.12.787 PM           220581         23.4686899
4yw0zv7ty2t47       848116227 22-JUL-17 11.00.30.251 PM           455887         20.1393904
4yw0zv7ty2t47       848116227 23-JUL-17 12.00.47.394 AM           445173         20.5407455
4yw0zv7ty2t47       848116227 23-JUL-17 01.00.05.711 AM           369333         28.3195315
4yw0zv7ty2t47       848116227 23-JUL-17 02.00.22.811 AM           465591         39.0232079
4yw0zv7ty2t47       848116227 23-JUL-17 03.00.40.758 AM           452555         39.7370594
4yw0zv7ty2t47       848116227 23-JUL-17 04.00.57.968 AM           458328         39.3421407
4yw0zv7ty2t47       848116227 23-JUL-17 04.09.32.144 AM            10055         39.1518787
4yw0zv7ty2t47       848116227 23-JUL-17 04.11.58.484 AM            18507         39.6002968
4yw0zv7ty2t47       848116227 23-JUL-17 04.15.24.661 AM             5215         39.4672715
4yw0zv7ty2t47       848116227 23-JUL-17 04.16.30.441 AM             8542         39.1123689
4yw0zv7ty2t47       848116227 23-JUL-17 05.00.15.147 AM           321635         39.9827259

So, I stared at this for a couple of hours, getting more and more stressed because my cure-all SQL Profile was not going to help in this situation. The plan had not changed. I could not think, in my sleep deprived state, of a way to resolve this issue that morning. Then, after a quick prayer for help, I noticed the %CPU and %IO columns in the query’s line in the AWR report. 10% CPU, 0% I/O. The query was spending 90% of its time waiting and doing nothing. I had noticed some latch waits on the AWR report but I thought that the CPU on the system was just busy so we had some latch waits.

These were the top three events:

Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
latch free 3,658,537 26.9K 7 65.8 Other
DB CPU 4749 11.6
db file scattered read 71,549 142.1 2 .3 User I/O

But, now finally I looked back at the latch waits. I went down to the latch section of the AWR report and the Result Cache: RC Latch latch showed up big:

Latch Sleep Breakdown

Latch Name Get Requests Misses Sleeps Spin Gets
Result Cache: RC Latch 6,742,176 4,142,777 4,143,709 669,430
row cache objects 5,804,568 169,324 6,087 163,272
cache buffers chains 107,393,594 36,532 5,859 30,976

I had noticed the RESULT_CACHE hint in some queries on our test database but never saw the latch contention. Here is the first part of the problem query with the RESULT_CACHE hint.

SQL_ID 4yw0zv7ty2t47
--------------------
SELECT SLS_EXCL_RSN FROM ( SELECT /*+ PARALLEL RESULT_CACHE */ DISTINCT

Here is part of the plan with the result cache step:

Plan hash value: 848116227
------------------------------------------
| Id  | Operation                        |
------------------------------------------
|  11 |  COUNT STOPKEY                   |
|  12 |   VIEW                           |
|  13 |    RESULT CACHE                  |
------------------------------------------

Early in the morning I got the idea of disabling the result cache. I ran this command:

alter system set result_cache_max_size=0 scope=both;

All of the running queries immediately died with this error:

ORA-00600: internal error code, arguments: [qesrcDO_AddRO],…

But, when the jobs were rerun they quickly went to completion. I checked the query performance before and after disabling the result cache and before the upgrade as well and found that post upgrade without the result cache the query ran ten times faster than before.

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------- --------------- ------------------------- ---------------- ------------------
4yw0zv7ty2t47       848116227 15-JUL-17 11.00.28.230 PM           192430         .822087045
4yw0zv7ty2t47       848116227 16-JUL-17 02.00.24.144 AM          2522505         .691503149
4yw0zv7ty2t47       848116227 16-JUL-17 03.00.42.580 AM          2787002         .793723611
4yw0zv7ty2t47       848116227 22-JUL-17 10.00.12.787 PM           220581         23.4686899
4yw0zv7ty2t47       848116227 22-JUL-17 11.00.30.251 PM           455887         20.1393904
4yw0zv7ty2t47       848116227 23-JUL-17 12.00.47.394 AM           445173         20.5407455
4yw0zv7ty2t47       848116227 23-JUL-17 01.00.05.711 AM           369333         28.3195315
4yw0zv7ty2t47       848116227 23-JUL-17 02.00.22.811 AM           465591         39.0232079
4yw0zv7ty2t47       848116227 23-JUL-17 03.00.40.758 AM           452555         39.7370594
4yw0zv7ty2t47       848116227 23-JUL-17 04.00.57.968 AM           458328         39.3421407
4yw0zv7ty2t47       848116227 23-JUL-17 04.09.32.144 AM            10055         39.1518787
4yw0zv7ty2t47       848116227 23-JUL-17 04.11.58.484 AM            18507         39.6002968
4yw0zv7ty2t47       848116227 23-JUL-17 04.15.24.661 AM             5215         39.4672715
4yw0zv7ty2t47       848116227 23-JUL-17 04.16.30.441 AM             8542         39.1123689
4yw0zv7ty2t47       848116227 23-JUL-17 05.00.15.147 AM           321635         39.9827259
4yw0zv7ty2t47       848116227 23-JUL-17 05.20.37.524 AM           329457          17.895581
4yw0zv7ty2t47       848116227 23-JUL-17 05.21.15.363 AM           205154         .050141323
4yw0zv7ty2t47       848116227 23-JUL-17 05.25.07.159 AM          1023657         .049949389

If you look through the output you will see that pre-upgrade on July 16th the query averaged about .69 to .79 milliseconds. During the latch contention on July 23rd it averaged about 39 milliseconds, a lot worse. But, after disabling the result cache it averaged .05 milliseconds which is at least 10 times faster than with the result cache before the upgrade.

So, it seems that the result cache hint on this query has always slowed it down. But, the upgrade and the load afterward caused some intense latch contention that we had not seen before. But, it is very cool that disabling the result cache actually made the query faster than it has been in the past.

I don’t think that it makes sense to put a RESULT_CACHE hint in a query that will run in .05 milliseconds without it. The overhead of the result cache made the query run 10 times slower at least. Something about the upgrade resulted in latch contention that caused a system wide problem, but disabling the result cache made the query run faster than it ever had. We could have avoided this problem by leaving off the RESULT_CACHE hint, but it was nice that I could resolve the problem quickly by disabling the result cache using a parameter change.

Bobby

About Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I have worked in the Information Technology field since 1989. I have a passion for Oracle database performance tuning because I enjoy challenging technical problems that require an understanding of computer science. I enjoy communicating with people about my work.

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply