Shared servers prevents web outage

This weekend we got the most convincing evidence yet that our change from dedicated to shared servers, on a database that supports a farm of web servers, was the right move. We have had some weekend outages caused by a sudden burst of database activity generated by the web servers. In the past the CPU load would spike, log file sync (commit) waits would be 20 times slower, and we would have to bounce the database and web servers to recover. On Sunday we had a similar spike in database activity without any sort of outage.

Here is a section of the AWR report from the last weekend outage:

Top 5 Timed Events

Event	Waits	Time(s)	Avg Wait(ms)	% Total Call Time	Wait Class
log file sync	636,948	157,447	247	9.3	Commit
latch: library cache	81,521	98,589	1,209	5.8	Concurrency
latch: library cache pin	39,580	73,409	1,855	4.3	Concurrency
latch free	42,929	45,043	1,049	2.7	Other
latch: session allocation	32,766	42,227	1,289	2.5	Other

Here is the same part of the AWR report for this weekend’s spike in activity:

Top 5 Timed Events

Event	Waits	Time(s)	Avg Wait(ms)	% Total Call Time	Wait Class
log file sync	630,867	6,802	11	43.1	Commit
CPU time		5,221		33.1
db file sequential read	604,450	4,498	7	28.5	User I/O
db file parallel write	213,913	3,661	17	23.2	System I/O
log file parallel write	522,021	1,168	2	7.4	System I/O

These are hour-long intervals, in both cases between 9 and 10 am Central on the first Sunday of a month (June and August). The key is that in both cases there were around 600,000 commits in that hour. During the outage in June the commits took 247 milliseconds each, a quarter of a second. This Sunday they took only 11 milliseconds. Note that in both cases the disk I/O for commits – log file parallel write – was only 2 milliseconds. So the difference was CPU, and primarily queuing for the CPU. With dedicated servers we had roughly 20 times as much queuing for the CPU (247 ms / 11 ms is about 22). Note that during our normal peak processing log file sync is 3 milliseconds, so even the 11 milliseconds we saw this weekend represents some queuing.
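The average waits quoted above can be rechecked from the raw AWR columns. This is a small sketch that divides total wait time by the number of waits for the two log file sync rows; the function and variable names are mine, not anything from AWR.

```python
# Average wait per event = total wait time / number of waits,
# using the log file sync rows from the two AWR excerpts above.

def avg_wait_ms(total_seconds: float, waits: int) -> float:
    """Average wait in milliseconds for one AWR wait event row."""
    return total_seconds * 1000.0 / waits

june_ms = avg_wait_ms(157_447, 636_948)    # dedicated servers, outage
august_ms = avg_wait_ms(6_802, 630_867)    # shared servers, no outage
slowdown = june_ms / august_ms

print(f"June:   {june_ms:.0f} ms per commit")
print(f"August: {august_ms:.0f} ms per commit")
print(f"Ratio:  {slowdown:.0f}x")
```

The unrounded ratio comes out a little above 20, which is where the "20 times" figure comes from.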

The key to this is that we have the number of shared server processes set to twice the number of CPUs. When I say “CPUs” I mean from the Unix perspective. They are probably cores, etc. But Unix thinks we have 16 CPUs, and we have 32 shared server processes. This prevents CPU queuing because even if all 32 shared servers are running flat out they probably won’t max out the CPU, since they will be doing things other than using CPU some of the time. The ideal number may not be 2x CPUs. It may be 1.5x or 2.3x, but the point is that there is some number of shared server processes that, under overwhelming load, will keep the CPU busy without allowing a huge queue for the CPU. Two times the number of CPUs is a good starting point, and this is what Tom Kyte recommended in my ten-minute conversation with him that spawned this change to shared servers.
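The sizing rule above is simple enough to write down. This is only a sketch of the "factor times CPUs" starting point; the helper name suggested_shared_servers is hypothetical, and the right factor for a given system still has to be found by testing.

```python
import os

def suggested_shared_servers(cpu_count: int, factor: float = 2.0) -> int:
    """Starting value for shared_servers: factor x CPUs as the OS reports them."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return max(1, round(cpu_count * factor))

# The server in this post reports 16 CPUs to Unix.
print(suggested_shared_servers(16))

# The machine running this script (os.cpu_count() can return None).
print(suggested_shared_servers(os.cpu_count() or 1))
```

With 16 CPUs and the default factor of 2 this gives the 32 shared servers we run in production.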

With dedicated servers we would have hundreds of processes, and they could easily max out the CPU. The log writer process (LGWR) would then have to wait its turn for the CPU on equal terms with the hundreds of active dedicated server processes. I think what was really happening with dedicated servers is that hundreds of sessions were hung up waiting on commits, and then the session pools on the web servers started spawning new connections, which themselves ate up CPU, and a downward spiral would occur that we could not recover from. With shared servers the commits remained efficient and the web servers didn’t need to spawn so many new connections because they weren’t hung up waiting on commits.
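To get a feel for why the process count matters, here is a deliberately crude fair-share model, not a measurement. It assumes every runnable process gets an equal slice of the CPUs and ignores I/O waits, priorities, dispatchers and other background processes. The figure of 400 busy dedicated servers is hypothetical; all I know is that there were "hundreds".

```python
def cpu_stretch(runnable: int, cpus: int) -> float:
    """Fair-share slowdown: how much longer a CPU-bound step takes
    when `runnable` processes compete for `cpus` CPUs."""
    return max(1.0, runnable / cpus)

CPUS = 16
LGWR = 1  # the log writer is one more runnable process

# Hypothetical: 400 busy dedicated server processes.
dedicated = cpu_stretch(400 + LGWR, CPUS)

# Worst case with our setting: all 32 shared servers runnable at once.
shared = cpu_stretch(32 + LGWR, CPUS)

print(f"dedicated: {dedicated:.1f}x, shared: {shared:.1f}x")
```

Under these assumptions LGWR's CPU work is stretched by an order of magnitude more with dedicated servers than with 32 shared servers, which is at least in the same direction as the commit times in the two AWR reports.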

If you are supporting a database that has a lot of web server connections doing a lot of commits you might want to consider shared servers as an option to prevent the log writer from being starved for CPU.

Here are my previous posts related to this issue for reference:

https://www.bobbydurrettdba.com/2013/07/19/shared-servers-results-in-lower-cpu-usage-for-top-query-in-awr/

https://www.bobbydurrettdba.com/2013/06/26/testing-maximum-number-of-oracle-sessions-supported-by-shared-servers/

https://www.bobbydurrettdba.com/2012/08/30/faster-commit-time-with-shared-servers/

https://www.bobbydurrettdba.com/2012/03/21/reducing-size-of-connection-pool-to-improve-web-application-performance/

It may be tough to convince people to move to shared servers since it isn’t a commonly used feature of the Oracle database but in the case of hundreds of sessions with lots of commits it makes sense as a way of keeping the commit process efficient.

– Bobby

P.S. Here are our parameters in production related to the shared servers change, with the IP address removed. We had to bump up the large pool and set local_listener in addition to setting the shared servers and dispatchers parameters. I added newlines to the dispatchers and local listener parameters to fit on this page.

NAME                                 VALUE
------------------------------------ -------------------
max_shared_servers                   32
shared_servers                       32
dispatchers                          (PROTOCOL=TCP)
                                     (DISPATCHERS=64)
max_dispatchers                      
local_listener                       (ADDRESS=
                                     (PROTOCOL=TCP)
                                     (HOST=EDITEDOUT)
                                     (PORT=1521))
large_pool_size                      2G
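Since the dispatchers and local_listener values are split across lines in the listing, here is a sketch that joins them back into single-line values and renders them as ALTER SYSTEM statements. The listing does not show how the parameters were applied, so SCOPE=SPFILE is an assumption, and max_dispatchers is left out because it has no value. The alter_system helper is hypothetical.

```python
# Values copied from the listing above, with the display newlines removed.
params = {
    "max_shared_servers": "32",
    "shared_servers": "32",
    "dispatchers": "(PROTOCOL=TCP)(DISPATCHERS=64)",
    "local_listener": "(ADDRESS=(PROTOCOL=TCP)(HOST=EDITEDOUT)(PORT=1521))",
    "large_pool_size": "2G",
}

def alter_system(name: str, value: str, scope: str = "SPFILE") -> str:
    """Render one ALTER SYSTEM statement (scope is an assumption)."""
    # Quote anything that is not a plain number or size such as 32 or 2G.
    needs_quotes = not value.isalnum()
    rendered = f"'{value}'" if needs_quotes else value
    return f"ALTER SYSTEM SET {name}={rendered} SCOPE={scope};"

statements = [alter_system(n, v) for n, v in params.items()]
print("\n".join(statements))
```

HOST stays as EDITEDOUT here, so the local_listener statement is a template rather than something to run as-is.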

P.P.S. This server is on HP-UX 11.11 and Oracle 10.2.0.3.

6 Responses to Shared servers prevents web outage

Cesar says:

August 5, 2013 at 4:38 pm

This is AWESOME information Bobby! Thank you for sharing.

Cesar Torres
Campus Crusade for Christ

Bobby says:

August 5, 2013 at 5:01 pm

Thanks Cesar. Hopefully it will be helpful to others.

– Bobby

volodimir vololdimirovich p. says:

August 6, 2013 at 1:10 am

//Note that in both cases the disk IO for commits – log file parallel write – was only 2 milliseconds

don’t you test LGWR process with real time priority?

- Bobby says:
  
  August 6, 2013 at 3:23 am
  
  Volodimir,
  
  Thanks for your comment. I didn’t test changing LGWR’s priority but that is an option I was aware of. I wasn’t sure if there would be some negative consequence of setting the higher priority. The other option I am aware of that might help is changing the commits to be no wait which could also help prevent the log writer from being the bottleneck.
  
  – Bobby
  
Marko Sutic says:

August 9, 2013 at 12:40 pm

Excellent case Bobby.

Didn’t know that you could use shared server processes to solve LGWR issues.

We had a similar problem a few years ago, and at the time we were thinking about commit nowait because possible data loss could be tolerated. Raising the LGWR process priority was also known to me (even though I’ve never messed with process priorities in production).

This is new to me – thanks for sharing 😉

Regards,
Marko

- Bobby says:
  
  August 10, 2013 at 4:34 pm
  
  Marko,
  
  Thanks for your comment. I think there are advantages to using shared servers beyond just keeping the LGWR process from being starved for CPU, but that is all I’ve really tested and convinced myself of. So, I haven’t proven it, but shared servers may have advantages over increasing the LGWR process priority or commit nowait beyond the obvious ones, such as increased priority causing issues or commit nowait losing updates in an outage. For example, maybe it would prevent latching from being the bottleneck? Remains to be seen.
  
  – Bobby

Bobby Durrett’s DBA Blog
