I saw a load of 44 on a node of our production Exadata and it worried me. The AWR report looks like this:
Host CPU
Load Average
CPUs Begin End %User %System %WIO %Idle
----- --------- --------- --------- --------- --------- ---------
16 10.66 44.73 68.3 4.3 0.0 26.8
So, why is the load average 44 and yet the CPU is 26% idle?
I started looking at ASH data and found samples with 128 processes active on the CPU:
select
2 sample_time,count(*)
3 from DBA_HIST_ACTIVE_SESS_HISTORY a
4 where
5 session_state='ON CPU' and
6 instance_number=3 and
7 sample_time
8 between
9 to_date('05-MAR-2015 01:00:00','DD-MON-YYYY HH24:MI:SS')
10 and
11 to_date('05-MAR-2015 02:00:00','DD-MON-YYYY HH24:MI:SS')
12 group by sample_time
13 order by sample_time;
SAMPLE_TIME COUNT(*)
---------------------------- ----------
05-MAR-15 01.35.31.451 AM 128
... lines removed for brevity
Then I dumped out the ASH data for one sample and found all the sessions on the CPU were running the same parallel query:
select /*+ parallel(t,128) parallel_index(t,128) dbms_stats ...
So, for some reason we are gathering stats on a table with a degree of 128 and that spikes the load. But, why does the CPU idle percentage sit at 26.8% when the load starts at 10.66 and ends at 44.73? Best I can tell load in DBA_HIST_OSSTAT is a point measurement of load. It isn’t an average over a long period. The 11.2 manual describes load in v$osstat in this way:
Current number of processes that are either running or in the ready state, waiting to be selected by the operating-system scheduler to run. On many platforms, this statistic reflects the average load over the past minute.
So, load could spike at the end of an hour-long AWR report interval and still CPU could average 26% idle for the entire hour? So it seems.
– Bobby
