Shared servers prevents web outage

This weekend we got the most convincing evidence yet that our change from dedicated to shared servers, on a database that supports a farm of web servers, was the right move.  We have had some weekend outages caused by sudden bursts of database activity generated by the web servers.  In the past the CPU load would spike, log file sync (commit) waits would become about 20 times slower, and we would have to bounce the database and web servers to recover.  This Sunday we had a similar spike in database activity without any sort of outage.

Here is a section of the AWR report from the last weekend outage:

Top 5 Timed Events

Event                        Waits  Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
-------------------------  -------  -------  ------------  -----------------  -----------
log file sync              636,948  157,447           247                9.3  Commit
latch: library cache        81,521   98,589         1,209                5.8  Concurrency
latch: library cache pin    39,580   73,409         1,855                4.3  Concurrency
latch free                  42,929   45,043         1,049                2.7  Other
latch: session allocation   32,766   42,227         1,289                2.5  Other

Here is the same part of the AWR report for this weekend’s spike in activity:

Top 5 Timed Events

Event                        Waits  Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
-------------------------  -------  -------  ------------  -----------------  -----------
log file sync              630,867    6,802            11               43.1  Commit
CPU time                              5,221                             33.1
db file sequential read    604,450    4,498             7               28.5  User I/O
db file parallel write     213,913    3,661            17               23.2  System I/O
log file parallel write    522,021    1,168             2                7.4  System I/O

These are hour-long intervals, in both cases between 9 and 10 am Central on the first Sunday of a month (June and August).  The key is that in both cases there were around 600,000 commits in that hour.  During the outage in June each commit took 247 milliseconds, a quarter of a second.  This Sunday they took only 11 milliseconds.  Note that in both cases the disk I/O for commits (log file parallel write) was only 2 milliseconds, so the difference was CPU, primarily queuing for the CPU.  With dedicated servers we had roughly 20 times as much queuing for the CPU (247 ms/11 ms).  Note that during our normal peak processing log file sync is 3 milliseconds, so even the 11 milliseconds we saw this weekend represents some queuing.
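The averages quoted above can be reproduced from the Waits and Time(s) columns of the two AWR excerpts. Here is a minimal Python sketch of that arithmetic; the function and variable names are mine, just for illustration.

```python
def avg_wait_ms(total_wait_seconds, waits):
    """Average wait in milliseconds from the AWR Time(s) and Waits columns."""
    return total_wait_seconds * 1000.0 / waits

# log file sync figures taken from the two AWR excerpts
june_outage = avg_wait_ms(157_447, 636_948)
august_spike = avg_wait_ms(6_802, 630_867)

print(f"June outage:  {june_outage:.0f} ms per commit")
print(f"August spike: {august_spike:.0f} ms per commit")
print(f"Ratio: about {june_outage / august_spike:.0f}x")
```

The ratio comes out a little above 20, which is in line with the rough figure above.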

The key to this is that we have the number of shared server processes set to twice the number of CPUs.  When I say “CPUs” I mean from the Unix perspective.  They are probably cores, etc., but Unix thinks we have 16 CPUs.  We have 32 shared server processes.  This prevents CPU queuing because even if all 32 shared servers are running full out they probably won’t max out the CPU, since they will be doing things other than using CPU some of the time.  The ideal number may not be 2x CPUs.  It may be 1.5 or 2.3, but the point is that there is some number of shared server processes that under overwhelming load will allow the CPU to be busy but not allow a huge queue for the CPU.  Two times the number of CPUs is a good starting point, and this was what Tom Kyte recommended in my ten minute conversation with him that spawned this change to shared servers.
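As a rough illustration of the sizing rule above, here is a small Python sketch that turns a CPU count and a multiplier into the two shared server settings. The helper name is made up for this example, and SCOPE=BOTH assumes the instance uses an spfile; treat it as a starting point to adapt, not the exact commands we ran.

```python
def shared_server_settings(cpu_count, multiplier=2.0):
    """Build ALTER SYSTEM statements sizing shared servers from the CPU count.

    cpu_count is the number of CPUs as the operating system reports them.
    multiplier is the tuning knob discussed above (2x is only a starting point).
    """
    servers = max(1, round(cpu_count * multiplier))
    return [
        # Set the ceiling first so the fixed count never exceeds it.
        f"ALTER SYSTEM SET max_shared_servers={servers} SCOPE=BOTH;",
        f"ALTER SYSTEM SET shared_servers={servers} SCOPE=BOTH;",
    ]

# 16 CPUs as Unix sees them, 2x multiplier
for statement in shared_server_settings(16):
    print(statement)
```

Setting both parameters to the same value keeps the number of shared servers fixed, which matches the production settings listed at the end of this post.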

With dedicated servers we would have hundreds of processes and they could easily max out the CPU, and then the log writer process (LGWR) would split time waiting on the CPU equally with the hundreds of active dedicated server processes.  I think what was really happening with dedicated servers is that hundreds of sessions were hung up waiting on commits, and then the session pools from the web servers started spawning new connections, which themselves ate up CPU, and a downward spiral would occur that we could not recover from.  With shared servers the commits remained efficient and the web servers didn’t need to spawn so many new connections because they weren’t hung up waiting on commits.
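To make the CPU starvation argument concrete, here is a toy model in Python. It is not a measurement: it assumes the scheduler shares the CPUs equally among all runnable processes, that every server process is on the run queue at once, and it ignores priorities, I/O waits, and latching. The 400 dedicated servers figure and the function names are made up for illustration.

```python
def lgwr_cpu_share(cpus, busy_server_processes):
    """Fraction of one CPU that LGWR gets under equal time slicing."""
    runnable = busy_server_processes + 1  # the server processes plus LGWR itself
    return min(1.0, cpus / runnable)

def slowdown(cpus, busy_server_processes):
    """How many times longer LGWR's CPU work takes than on an idle machine."""
    return 1.0 / lgwr_cpu_share(cpus, busy_server_processes)

cpus = 16
for label, procs in [("32 shared servers", 32), ("400 dedicated servers", 400)]:
    print(f"{label}: LGWR CPU work takes {slowdown(cpus, procs):.1f}x as long")
```

Under these assumptions the log writer is slowed by about 2x with 32 shared servers and about 25x with 400 busy dedicated servers. Real shared servers spend some of their time off the CPU, so the 2x figure is a worst case.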

If you are supporting a database that has a lot of web server connections doing a lot of commits you might want to consider shared servers as an option to prevent the log writer from being starved for CPU.

Here are my previous posts related to this issue for reference:

http://www.bobbydurrettdba.com/2013/07/19/shared-servers-results-in-lower-cpu-usage-for-top-query-in-awr/

http://www.bobbydurrettdba.com/2013/06/26/testing-maximum-number-of-oracle-sessions-supported-by-shared-servers/

http://www.bobbydurrettdba.com/2012/08/30/faster-commit-time-with-shared-servers/

http://www.bobbydurrettdba.com/2012/03/21/reducing-size-of-connection-pool-to-improve-web-application-performance/

It may be tough to convince people to move to shared servers since it isn’t a commonly used feature of the Oracle database but in the case of hundreds of sessions with lots of commits it makes sense as a way of keeping the commit process efficient.

- Bobby

P.S.  Here are our production parameters related to the shared servers change, with the IP address removed.  We had to bump up the large pool and set local_listener in addition to setting the shared servers and dispatchers parameters.  I added newlines to the dispatchers and local_listener parameters to fit on this page.

NAME                                 VALUE
------------------------------------ -------------------
max_shared_servers                   32
shared_servers                       32
dispatchers                          (PROTOCOL=TCP)
                                     (DISPATCHERS=64)
max_dispatchers                      
local_listener                       (ADDRESS=
                                     (PROTOCOL=TCP)
                                     (HOST=EDITEDOUT)
                                     (PORT=1521))
large_pool_size                      2G

P.P.S.  This server is on HP-UX 11.11 and Oracle 10.2.0.3.

About Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I've been working as an Oracle database administrator and PeopleSoft administrator since 1994. I'm very interested in Oracle performance tuning.

6 Responses to Shared servers prevents web outage

  1. Cesar says:

    This is AWESOME information Bobby! Thank you for sharing.

    Cesar Torres
    Campus Crusade for Christ

  2. Bobby says:

    Thanks Cesar. Hopefully it will be helpful to others.

    - Bobby

  3. volodimir vololdimirovich p. says:

    //Note that in both cases the disk IO for commits – log file parallel write – was only 2 milliseconds

    don’t you test LGWR process with real time priority?

    • Bobby says:

      Volodimir,

      Thanks for your comment. I didn’t test changing LGWR’s priority but that is an option I was aware of. I wasn’t sure if there would be some negative consequence of setting the higher priority. The other option I am aware of that might help is changing the commits to be no wait which could also help prevent the log writer from being the bottleneck.

      - Bobby

  4. Marko Sutic says:

    Excellent case Bobby.

    Didn’t knew that you could help yourself with shared server processes to solve LGWR issues.

    We had similar problem before few years and then we were thinking about commit nowait because possible data loss could be tolerated. Even raising LGWR process priority was known for me (even though I’ve never messed with process priorities in production).

    This is new to me – thanks for sharing ;-)

    Regards,
    Marko

    • Bobby says:

      Marko,

      Thanks for your comment. I think there are advantages to using shared servers beyond just keeping the LGWR process from being starved for CPU but it is all I’ve really tested and convinced myself of. So, I haven’t proven it, but shared servers may have advantages over increasing the LGWR process priority or commit nowait besides the obvious ones such as increased priority causing issues or commit nowait losing updates in an outage. i.e. Maybe it would prevent latching from being the bottleneck? Remains to be seen.

      - Bobby
