sar -d on Linux

I started using sar -d to look at disk performance on a Linux system this week and had to look up what some of the returned numbers meant.  I’ve used sar -d on HP Unix but the format is different.

Here is an edited output from a Linux VM that we are copying files to:

$ sar -d 30 1
Linux 2.6.32-504.3.3.el6.x86_64 (myhostname)  04/01/2015      _x86_64_        (4 CPU)

05:26:55 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
05:27:25 PM  dev253-9   7669.55      2.44  61353.93      8.00     35.39      4.61      0.03     19.80

I edited out the real host name and I removed all the lines with devices except the one busy device, dev253-9.

Earlier today I got confused and thought that rd_sec/s meant read I/O requests per second but it is not.  Here is how the Linux man page describes rd_sec/s:

Number  of  sectors  read from the device. The size of a sector is 512
bytes.

In the example above all the activity is writing so if you look at wr_sec/s it is the same kind of measure of activity:

Number of sectors written to the device. The size of a sector  is  512
bytes.

So in the example you have 61353.93 512 byte sectors written per second.  Divide by 2 to get kilobytes = 30676 KB/sec.  Divide by 1024 and round-up to get 30 megabytes per second.

But, how many write I/O operations per second does this translate to?  It looks like you can’t tell in this listing.  You can get overall I/O operations per second including both reads and writes from the tps value which the man page defines as:

Total number of transfers per second  that  were  issued  to  physical
devices.   A transfer is an I/O request to a physical device. Multiple
logical requests can be combined into a  single  I/O  request  to  the
device.  A transfer is of indeterminate size.

Of course there aren’t many read requests so we can assume all the transfers are writes so that makes 7669.55 write IOPS.  Also, you can find the average I/O size by dividing rd_sec/s  + wr_sec/s by tps.  This comes out to just about 8 which is the same as avgrq-sz which the man page defines as

The  average size (in sectors) of the requests that were issued to the
device.

So, avgrq-sz is kind of superfluous since I can calculate it from the other values but it means that our average I/O is 8 * 512 bytes = 4 kilobytes.  This seems like a small I/O size considering that we are copying large data files over NFS.  Hmmm.

Also, the disk device is queuing the I/O requests but the device is only in use 19% of the time.  Maybe there are bursts of 4K writes which queue up and then gaps in activity?  Here are the definitions for the remaining items.

avgqu-sz

The average queue length of the  requests  that  were  issued  to  the
device.

await

The  average  time  (in  milliseconds)  for I/O requests issued to the
device to be served. This includes the time spent by the  requests  in
queue and the time spent servicing them.

svctm

The  average service time (in milliseconds) for I/O requests that were
issued to the device.

%util

Percentage of CPU time during which I/O requests were  issued  to  the
device  (bandwidth  utilization  for  the  device).  Device saturation
occurs when this value is close to 100%.

The service time is good – only .03 milliseconds – so I assume that the I/Os are writing to a memory cache.  But the total time is higher – 4.61 – which is mostly time spent waiting in the queue.  The average queue length of 35.39 makes sense given that I/Os spend so much time waiting in the queue.  But it’s weird that utilization isn’t close to 100%.  That’s what makes me wonder if we are having bursts of activity.

Anyway, I have more to learn but I thought I would pass along my thoughts on Linux’s version of sar -d.

– Bobby

P.S. Here is the output on HP-UX that I am used to:

HP-UX myhostname B.11.31 U ia64    04/02/15

11:27:14   device   %busy   avque   r+w/s  blks/s  avwait  avserv
11:27:44    disk1    1.60    0.50       3      95    0.00   10.27
            disk6    0.03    0.50       1       6    0.00    0.64
           disk15    0.00    0.50       0       0    0.00    3.52
           disk16  100.00    0.50     337    5398    0.00    5.52

r+w/s on HP-UX sar -d seems to be the equivalent of tps on Linux.  blks/s on HP-UX appears to be the same as rd_sec/s  + wr_sec/s on Linux.  The other weird difference is that in HP-UX avwait is just the time spent in the queue which I believe is equal to await – svctm on Linux.  I am more accustomed to the HP-UX tool so I needed to get up to speed on the differences.

Posted in Uncategorized | 5 Comments

DCLI to back up Oracle home and inventory before patch

I was preparing for my weekend patch of our Exadata system and I needed to back up all of our Oracle homes and inventories on our production system.  On our 2 node dev and qa clusters I just ran the backups by hand like this:

login as root

cd /u01/app/oracle/product/11.2.0.4

tar -cvf - dbhome_1 | gzip > dbhome_1-20150211.tgz

cd /u01/app

cp -r oraInventory oraInventory.20150211

But the production cluster has 12 nodes so I had to figure out how to use DCLI to run the equivalent on all 12 nodes instead of doing them one at a time.  To run a DCLI command you need go to the directory that has the list of database server host names.  So, first you do this:

login as root

cd /opt/oracle.SupportTools/onecommand

The file dbs_group contains a list of the database server host names.

Next, I wanted to check how much space was free on the filesystem and how much space the Oracle home occupied so I ran these commands:

dcli -g dbs_group -l root "df|grep u01"

dcli -g dbs_group -l root "cd /u01/app/oracle/product/11.2.0.4;du -ks ."

The first command gave me how much space was free on the /u01 filesystem on all database nodes.   The second command gave me how much space the 11.2.0.4 home consumed.  I should have done “du -ks dbhome_1″ since I’m backing up dbhome_1 instead of everything under 11.2.0.4, but there wasn’t much else under 11.2.0.4 so it worked out.

Now that I knew that there was enough space I ran the backup commands using DCLI.

dcli -g dbs_group -l root "cd /u01/app/oracle/product/11.2.0.4;tar -cvf - dbhome_1 | gzip > dbhome_1-20150316.tgz"

dcli -g dbs_group -l root "cd /u01/app;cp -r oraInventory oraInventory.20150316"

I keep forgetting how to do this so I thought I would post it.  I can refer back to this later and perhaps it will be helpful to others.

– Bobby

 

Posted in Uncategorized | Leave a comment

Loads of fun with DBA_HIST_OSSTAT

I saw a load of 44 on a node of our production Exadata and it worried me.  The AWR report looks like this:

Host CPU
            Load Average
 CPUs     Begin       End     %User   %System      %WIO     %Idle
----- --------- --------- --------- --------- --------- ---------
   16     10.66     44.73      68.3       4.3       0.0      26.8

So, why is the load average 44 and yet the CPU is 26% idle?

I started looking at ASH data and found samples with 128 processes active on the CPU:

     select
  2  sample_time,count(*)
  3  from DBA_HIST_ACTIVE_SESS_HISTORY a
  4  where
  5  session_state='ON CPU' and
  6  instance_number=3 and
  7  sample_time
  8  between
  9  to_date('05-MAR-2015 01:00:00','DD-MON-YYYY HH24:MI:SS')
 10  and
 11  to_date('05-MAR-2015 02:00:00','DD-MON-YYYY HH24:MI:SS')
 12  group by sample_time
 13  order by sample_time;

SAMPLE_TIME                    COUNT(*)
---------------------------- ----------
05-MAR-15 01.35.31.451 AM           128

... lines removed for brevity

Then I dumped out the ASH data for one sample and found all the sessions on the CPU were running the same parallel query:

select /*+  parallel(t,128) parallel_index(t,128) dbms_stats ...

So, for some reason we are gathering stats on a table with a degree of 128 and that spikes the load.  But, why does the CPU idle percentage sit at 26.8% when the load starts at 10.66 and ends at 44.73?  Best I can tell load in DBA_HIST_OSSTAT is a point measurement of load.  It isn’t an average over a long period.  The 11.2 manual describes load in v$osstat in this way:

Current number of processes that are either running or in the ready state, waiting to be selected by the operating-system scheduler to run. On many platforms, this statistic reflects the average load over the past minute.

So, load could spike at the end of an hour-long AWR report interval and still CPU could average 26% idle for the entire hour?  So it seems.

– Bobby

Posted in Uncategorized | Leave a comment

Arizona Oracle User Group Meeting March 18

I just found out that this meeting was cancelled.  We will have to catch the next one. :) – 3/16/2015

Sign up for the Arizona Oracle User Group (AZORA) meeting next week: signup url

The email that I received from the meeting organizer described the topic of the meeting in this way:

“…the AZORA meetup on March 18, 2015 is going to talk about how a local business decided to upgrade their Oracle Application from 11i to R12 and give you a first hand account of what went well and what didn’t go so well. ”

Description of the speakers from the email:

Becky Tipton

Becky is the Director of Project Management at Blood Systems located in Scottsdale, AZ. Prior to coming to Blood Systems, Becky was an independent consultant for Tipton Consulting for four years.

Mike Dill

Mike is the Vice President of Application Solutions at 3RP, a Phoenix consulting company. Mike has over 10 years of experience implementing Oracle E-Business Suite and managing large-scale projects.

I plan to attend.  I hope to see you there too. :)

– Bobby

Posted in Uncategorized | Leave a comment

Delphix User Group Presentation

My Delphix user group presentation went well today. 65 people attended.  It was great to have so much participation.

Here are links to my PowerPoint slides and a recording of the WebEx:

Slides: PowerPoint

Recording: WebEx

Also, I want to thank two Delphix employees, Ann Togasaki and Matthew Yeh.  Ann did a great job of converting my text bullet points into a visually appealing PowerPoint.  She also translated my hand drawn images into useful drawings.  Matthew did an amazing job of taking my bullet points and my notes and adding meaningful graphics to my text only slides

I could not have put the PowerPoint together in time without Ann and Matthew’s help and they did a great job.

Also, for the first time I wrote out my script word for word and added it to the notes on the slides.  So, you can see what I intended to say with each slide.

Thank you to Adam Leventhal of Delphix for inviting me to do this first Delphix user group WebEx presentation.  It was a great experience for me and I hope that it was useful to the user community as well.

– Bobby

Posted in Uncategorized | Leave a comment

Blog third anniversary

My first blog post was March 5, 2012, three years ago today.

I have enjoyed blogging.  Even though I am talking about topics related to my work blogging does not feel like work. The great thing about blogging is that it’s completely in my control.  I control the content and the time-table. I pay a small amount each year for hosting and for the domain name, but the entertainment value alone is worth the price of the site.  But, it also has career value because this blog has given me greater credibility both with my employer and outside the company.  Plus, I think it makes me better at my job because blogging forces me to put into words the technical issues that I am working on.

It’s been three good years of blogging.  Looking forward to more in the future.

– Bobby

Posted in Uncategorized | Leave a comment

Joined twitter

I joined twitter.  I don’t really know how to use it.  I’m setup as Bobby Durrett, @bobbydurrettdba if that means anything to you. :)

– Bobby

Posted in Uncategorized | 2 Comments

Different plan_hash_value same plan

I mentioned this same effect in an earlier post about SQL profiles: link

I get a different plan_hash_value values for a query each time I run an explain plan or run the query.  I see this in queries whose plan includes a system generated temporary segment like this:

|   1 |  TEMP TABLE TRANSFORMATION   |                             |
...
|  72 |    TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9D668C_764DD84C |

For some reason the system generated temporary table name gets included in the plan_hash_value calculation.  This makes plan_hash_value a less than perfect way to compare two plans to see if they are the same.

Last week I was using my testselect package to test the effect of applying a patch to fix bug 20061582.  I used testselect to grab 1160 select statements from production and got their plans with and without the patch applied on a development database.  I didn’t expect many if any plans to change based on what the patch does.  Surprisingly, 115 out of the 1160 select statements had a changed plan, but all the ones I looked at had the system generated temporary table names in their plan.

Now, I am going to take the queries that have different plans with and without the patch and execute them both ways.  I have a feeling that the plan differences are mainly due to system generated temp table names and their execution times will be the same with and without the patch.

I’ve run across other limitations of plan hash value as I mentioned in an earlier post: link

I’m still using plan_hash_value to compare plans but I have a list of things in my head that reminds me of cases where plan_hash_value fails to accurately compare two plans.

– Bobby

P.S. After posting this I realized that I didn’t know how many of the 115 select statements with plans that differed with and without the patch had system generated temp tables.  Now I know.  114 of the 115 have the string “TEMP TABLE TRANSFORMATION” in their plans.  So, really, there is only one select statement for which the patch may have actually changed its plan.

P.P.S. I reapplied the patch and verified that the one sql_id didn’t really change plans with the patch.  So, that means all the plan changes were due to the system generated name.  Also, all the executions times were the same except for one query that took 50 seconds to parse without the patch and 0 with the patch.  So, one of the queries with the system generated temp table name happened to benefit from the patch.  Very cool!

P.P.P.S This was all done on an 11.2.0.4 Exadata system.

Posted in Uncategorized | 6 Comments

Speaking at Delphix User Group Webex March 11

On March 11 at 10 am California time I will be speaking in a Delphix User Group Webex session.

Here is the sign up url: WebEx sign up.

Adam Leventhal, the Delphix CTO, will also be on the call previewing the new Delphix 4.2 features.

I will describe our experience with Delphix and the lessons we have learned.  It is a technical talk so it should have enough details to have value to a technical audience.  Hopefully I have put enough effort into the talk to make it useful to other people who have or are considering getting Delphix.

There will time for questions and answers in addition to our talks.

I really enjoy doing user group presentations.  I had submitted an abstract for this talk to the Collaborate 2015 Oracle user group conference but it was not accepted so I wont get a chance to give it there.  But, this WebEx event gives me a chance to present the same material, so I’m happy to have this opportunity.

If you have an interest in hearing about Delphix join the call.  It is free and there will be some good technical content.

– Bobby

P.S. If this talk interests you I also have some earlier blog posts that relate to some of the material I will be covering:

Delphix first month

Delphix direct i/o and direct path reads

Delphix table recovery

Also, I plan to post the slides after the talk.

Posted in Uncategorized | 2 Comments

A new index on a small table makes a big difference

A few weeks back on the weekend just before I went on call we got a complaint about slowness on an important set of reports.  I worried that the slowness of these reports would continue during my support week so I tried to figure out why they were slow.  I reviewed an AWR report for the 24 hours when the reports were running and found a simple query against a tiny table at the top of the “SQL ordered by Elapsed Time” report:

   SQL Id         Elapsed (s)        Execs
-------------   ------------- ------------
77hcmt4kkr4b6      307,516.33 3.416388E+09

I edited the AWR report to show just elapsed seconds and number of executions.  3.4 billion executions totaling 307,000 seconds of elapsed time.  This was about 90 microseconds per execution.

The previous weekend the same query looked like this:

   SQL Id         Elapsed (s)        Execs
-------------   ------------- ------------
77hcmt4kkr4b6      133,143.65 3.496291E+09

So, about the same number of executions but less than half of the elapsed time.  This was about 38 microseconds per execution.  I never fully explained the change from week to week, but I found a way to improve the query performance by adding a new index.

The plan was the same both weekends so the increase in average execution time was not due to a plan change.  Here is the plan:

SQL_ID 77hcmt4kkr4b6
--------------------
SELECT DIV_NBR FROM DIV_RPT_GEN_CTL WHERE RPT_NM = :B1 AND
GEN_STAT = 1

Plan hash value: 1430621991

------------------------------------------------------------------
| Id  | Operation                   | Name               | Rows  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                    |       |
|   1 |  TABLE ACCESS BY INDEX ROWID| DIV_RPT_GEN_CTL    |     1 |
|   2 |   INDEX RANGE SCAN          | DIV_RPT_GEN_CTL_U1 |     1 |
------------------------------------------------------------------

I found that the table only had 369 rows and 65 blocks so it was tiny.

The table’s only index was on columns RPT_NM and RPT_ID but only RPT_NM was in the query.  For the given value of RPT_NM the index would look up all rows in the table with that value until it found those with GEN_STAT=1.  I suspect that on the weekend of the slowdown that the number of rows being scanned for a given RPT_NM value had increased, but I can not prove it.

I did a count grouping by the column GEN_STAT and found that only 1 of the 300 or so rows had GEN_STAT=1.

SELECT GEN_STAT,count(*)
FROM DIV_RPT_GEN_CTL
group by GEN_STAT;

  GEN_STAT   COUNT(*)
---------- ----------
         1          1
         2        339
         0         29

So, even though this table is tiny it made sense to add an index which included the selective column GEN_STAT.  Also, since the reports execute the query billions of times per day it made sense to include the one column in the select clause as well, DIV_NBR.  By including DIV_NBR in the index the query could get DIV_NBR from the index and not touch the table.  The new index was on the columns RPT_NM, GEN_STAT, and DIV_NBR in that order.

Here is the new plan:

SQL_ID 77hcmt4kkr4b6
--------------------
SELECT DIV_NBR FROM DIV_RPT_GEN_CTL WHERE RPT_NM = :B1 AND
GEN_STAT = 1

Plan hash value: 2395994055

-------------------------------------------------------
| Id  | Operation        | Name               | Rows  |
-------------------------------------------------------
|   0 | SELECT STATEMENT |                    |       |
|   1 |  INDEX RANGE SCAN| DIV_RPT_GEN_CTL_U2 |     1 |
-------------------------------------------------------

Note that it uses the new index and does not access the table.  Here is the part of the AWR report for the problem query for last weekend:

   SQL Id         Elapsed (s)        Execs
-------------   ------------- ------------
77hcmt4kkr4b6       84,303.02 4.837909E+09

4.8 billion executions and only 84,000 seconds elapsed.  That is about 17.4 microseconds per execution.  That is less than half of what the average execution time was the weekend before the problem started.

The first Monday after we put the index in we found that one of the slow reports had its run time reduced from 70 minutes to 50 minutes.  It was great that we could improve the run time so much with such a simple fix.

It was a simple query to tune.  Add an index using the columns in the where clause and the one column in the select clause.  It was a tiny table that normally would not even need an index.  But, any query that an application executes billions of times in a day needs to execute in the most efficient way possible so it made sense to add the best possible index.

– Bobby

Posted in Uncategorized | 4 Comments