Elementary, my dear DBA!

I have read all of Arthur Conan Doyle's Sherlock Holmes stories and recently watched a couple of Star Trek: The Next Generation episodes in which the android Data plays Holmes, and that gave me the idea for this post about diagnosing an Oracle hang/bug that has been giving me fits lately. The logic involved in solving my Oracle problem reminds me of this often-quoted Sherlock Holmes saying:

“…when you have eliminated the impossible, whatever remains, however improbable, must be the truth…”

Sherlock Holmes

I have often had moments in diagnosing Oracle problems that resemble the idea behind this quote. I think this happens when the information I have in front of me does not directly lead to a solution. Most of the time it does. You get an error, you look it up, and you fix the problem. But sometimes things do not make sense. It looks like the database or hardware is not working correctly even though it is. Somehow, I have to discipline my mind to believe that the system is working correctly and follow the logical conclusions. Well, maybe I should talk about my example and it will be clearer.

I spent a week moving my daughter into her college dorm and then visiting my parents in Alabama. It was a lot of travel along with the COVID-19 fears, although it all went well. (I wore my mask everywhere.) But then I got back in the office (online from home) and found that the project I was working on had a batch job hung up. I was also on call last week, so it was a little stressful. The hung session was a query coordinator process for a parallel INSERT statement, and it was hung on a PX Deq: Join ACK wait. It was hung for several days until we killed it, so it wasn't just slow. We had already tested this job and every other batch job with no problems. Most were running significantly faster on our newer, faster x86-64 processors. But now we were doing a final batch schedule test that runs several jobs at the same time, trying to mimic production. No problem, I thought. I will just head to Oracle's support site, look up the wait event, and see what to do. But my search turned up empty. If I had been on 10.2 or some old version of RAC, maybe, but nothing for 19c non-RAC. I tried Google too, but nothing. Hmm. So, I opened an SR with Oracle. I made sure to get a systemstate dump and hanganalyze while the session was still hung up. I figured no problem: Oracle will peek at the dumps, give me a patch, and we will be on our way. But no deal. Oracle support was helpful, and I was getting them the information they asked for, but nothing jumped out at us. Plus, I was doing my own investigation, looking at logs, trace files, DBA_HIST views, etc. I was getting frustrated. This upgrade is supposed to go live soon, and now, all of a sudden, at the last minute, I am getting this weird wait event that no one seems to have heard of, even though we have tested on this system for months with nothing like this.
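
For anyone who has not captured these before, the dumps can be taken from a SQL*Plus session connected as SYSDBA with something like this (a sketch; the levels shown are typical values, not necessarily exactly what Oracle Support will ask for on a given SR):

oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump systemstate 258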

I think this is where the Sherlock Holmes logic of deduction applies. Underlying my frustration is an assumption about reality. My assumption was that this horrible unknown wait was showing up under the same conditions as all our successful testing. The idea that the current testing was in a radically different configuration fell into the "possible but improbable" category. In this case we have two database VMs set up with the same number of cores and RAM, the same version of Oracle including patches, and close to the same data. We have a new production database and a performance database, with the idea that development could do their final testing in a performance database that matches production. As it turns out, all the init.ora parameters on the two databases are the same except for the ones that size the SGA and PGA memory. We are moving from an HP Unix system with a lot more physical memory than we have on these VMs, so the SGA_TARGET and PGA_TARGET settings on that system together exceed the memory available on the new VMs, but we had copied the parameters from the HP system to the new Linux ones. Evidently, I correctly resized these down on the production system but never on the performance system. So, somewhere in our full-scale batch testing, with a bunch of parallel SQL statements overloading the system and a ton of memory swapping caused by the wrong parameters, some parallel server process died, leaving my query coordinator hung on PX Deq: Join ACK. I do not really know if this is true, but it fits the facts. I synced these memory parameters up with the working production database, we reran the problem job, and it ran like clockwork. It really helps to have free memory.
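
The check is simple once you think to make it. Something like this on each database would have exposed the mismatch (a minimal sketch; the real parameter behind my PGA_TARGET shorthand is PGA_AGGREGATE_TARGET):

select name, display_value
from v$parameter
where name in ('sga_target','pga_aggregate_target','memory_target');

Then compare the totals against the physical memory on the VM.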

In some ways my example is simple. You cannot set SGA_TARGET and PGA_TARGET to more than your total memory and expect good things to happen. DBA 101. But the more interesting thing, at least to me, is the mindset that I need to have when facing a difficult Oracle problem. Here are some of the truths in my mind during this problem:

  1. Our new production database works great and testing has gone very well for months
  2. Our new performance database is very similar to our new production database
  3. Our batch jobs ran fast and error free individually in production
  4. Run together on the performance database, our batch jobs hung up with a weird wait
  5. Lots of things I looked at including AWR reports, alert logs, etc. looked weird

I guess I would like my brain to more quickly jump to questioning my assumption that the performance database is like production. Sure, running a bunch of batch jobs at the same time will put more load on the system and may create problems. But suddenly hitting some weird bug that no one in the world has ever hit doesn’t really make sense. There must be some reasonable explanation that makes sense without assuming some terrible thing is wrong with our database despite months of success using this version of Oracle on this data and code. Other people would be reporting the same thing and blogging about it.

I am not sure how clear this is. I have something I am trying to say about logic and problem solving with Oracle databases but do not know if I have said it in a coherent way. A working database and hardware do not suddenly start doing weird things for no reason. After eliminating that virtual impossibility, you are left with possible solutions that you have not considered because they seem so unlikely. Elementary…

Bobby

P.S. I have a feeling that people are not that interested in this post, which is fine. It does not have any useful scripts in it. But I ran into another situation today that fits the same theme.

Here are the facts:

  1. Import and index build of large schema takes longer than on earlier runs
  2. Linux admin says there is a lot of I/O on the redo logs
  3. Storage admin says there was a bump in service times despite being on a nice array and SSD
  4. Import uses data pump and indexes built parallel nologging
  5. We have been experimenting with GGS on this system
  6. We are importing more data than in previous run

I guess eliminating the impossible meant letting go of the idea that the heavy redo log I/O was not really happening. It was occurring, no matter how much that conflicted with my model of reality. The improbable was that my parallel nologging index build and Data Pump import were somehow doing a boatload of log I/O. Enter the solution: somehow this database got set to force logging, probably during our work with GGS. It did not take me that long to figure out today, but I could have factored the redo log I/O into my thinking earlier. Anyway, I am having fun with this post even if it is not so helpful. Sherlock Holmes is not the sum of all knowledge, but there is something to be said for taking the evidence where it leads, even if it is not where you expect.
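
The check that nails this down is a one-liner. Turning force logging back off is shown only as an illustration; whether that is safe depends on what GGS and any standby databases require:

select force_logging from v$database;

-- only if force logging is truly not needed:
-- alter database no force logging;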


Simple Performance Tuning Example

I helped resolve an Oracle database performance problem yesterday and I thought I would capture what I did in this blog post. It is a “simple” problem in that the solution was easy once I understood the issue and because this sort of problem happens often. But the process to get to the solution is not completely simple. I think documenting the process may have more value than writing down the details of this specific problem.

Today is Tuesday. Last Thursday our development team was doing some testing, and performance of the system was terrible. I was busy working on another project that is my primary assignment right now, so I did not look at it. It was also the Thursday before the July 4th holiday weekend, and we were all mentally and physically checking out a bit early. Some key people were already off on vacation. So, we shut it all down until Monday (yesterday). Also, late Thursday there was some thought that there was a system issue, maybe with our virtual machines' host or networking. So, we let it go. Monday our dev team said they thought the slowness was caused by bad data, including duplicates that were throwing errors on a target system (non-Oracle) in the cloud. So, they cleaned up the data and started a new run. But it was slow as molasses, and the application front end was also super slow.

In retrospect, looking at the paragraph above, I bet a lot of DBAs with performance tuning experience would jump to the solution. I can think off the top of my head of a few types of things that might cause these symptoms. But I did not immediately think of it so I went through a process to get there. No leap of intuition was required except maybe at the very end.

Since the problem query was currently running, the first step was to look at the session monitor in Toad. I may have looked at HP Unix's Glance tool, which is like top, to see the CPU usage and top processes at the OS level. What I saw was a parallel query running with 8 parallel child processes, and the query had been running about 30 seconds.
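
The same picture is available outside of Toad from the data dictionary. A rough equivalent (a sketch, not the exact query I ran) is to list the parallel server sessions grouped under their query coordinator:

select qcsid, sid, serial#, degree, req_degree
from v$px_session
order by qcsid, sid;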

Step two was to take an AWR report. First, I ran a command like this to get a snapshot after the current activity:

exec dbms_workload_repository.create_snapshot;

Then I ran an AWR report for the last two snapshots.
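
The report itself can come from $ORACLE_HOME/rdbms/admin/awrrpt.sql or straight out of SQL*Plus. Here is a sketch of the latter; the begin and end snapshot ids are placeholders:

select output
from table(dbms_workload_repository.awr_report_text(
  (select dbid from v$database),
  (select instance_number from v$instance),
  12345,
  12346));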

The “SQL ordered by Elapsed Time” part of the report looked like this:

[AWR report screenshot: SQL ordered by Elapsed Time]

The top query, azrtxpx1g6mxy, was the one I saw in Toad. The second query is a subselect of the top one. The top query averaged 51.23 seconds per execution and spent 46.77% of its time on I/O, so I eventually went where I usually go with AWR reports: the “Segments by Logical Reads” section. Here is part of that section:

[AWR report screenshot: Segments by Logical Reads]

I hid the table names, but the point is that the top segment was doing 99.67% of the logical I/Os. So, the issue had to do with I/O. But looking at the “Top 5 Timed Foreground Events” section, I/O seemed very efficient.

[AWR report screenshot: Top 5 Timed Foreground Events]

The top wait was the I/O event “direct path read”, which can be deceptively low on systems that use asynchronous I/O. But this system uses HP Unix file system I/O, which does not support asynchronous I/O, so the direct path read number is correct. One of the possible causes on my list before I started was some sort of I/O contention, but this report pointed away from the I/O hardware.

So, the AWR report and my look at the Toad session monitor pointed to a SQL statement doing parallel full scans of a table, with that statement running tens of seconds per execution. So, I looked at the execution plan next. I like to use my getplans.sql SQL*Plus script to extract a query’s plan from the AWR using its SQL_ID. I ran this in SQL*Plus:

@getplans azrtxpx1g6mxy

Sure enough, the same top table was running a parallel full table scan in the plan.

Somewhere in this process I did some counts. I think the simplicity of select count(*) queries may keep people from using them, but count(*) queries are one of the biggest helps to me when tuning a SQL statement. I did two types of counts in this case. I did a select count(*) on the top table and found out that it had 160 million rows. The count took 45 seconds even though it was doing a fast full scan of an index and not the full table. Then I ran a count grouping by the column that joins the table to its parent table. This is a typical header-detail join where the child table joins the parent on the parent’s unique key. Then I got the idea that the query might run faster if there were an index on the joining column.
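
The counts were along these lines (a sketch; detail_table and parent_id are made-up names standing in for the real table and join column):

select count(*) from detail_table;

select parent_id, count(*)
from detail_table
group by parent_id
order by count(*) desc;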

I ran my optimizer statistics scripts to check the index columns and to look at the column statistics to see how selective the joining column is. I found that there was no index on the column that the header table used to join to the problem detail table. So, I recommended that the dev team add an index on that column, including the other columns in the where clause for good measure. They might have to try two or three different column choices and orders to get the best index.
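
The checks boil down to data dictionary queries something like these (a sketch; MYSCHEMA and DETAIL_TABLE are placeholders for the real owner and table):

select index_name, column_name, column_position
from dba_ind_columns
where table_owner = 'MYSCHEMA'
and table_name = 'DETAIL_TABLE'
order by index_name, column_position;

select column_name, num_distinct, num_nulls, density
from dba_tab_col_statistics
where owner = 'MYSCHEMA'
and table_name = 'DETAIL_TABLE';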

But the dev team was skeptical. This same query runs fine in production. It is not new, so what changed? I tried running my sqlstat.sql script to get a history of the execution of SQL_ID azrtxpx1g6mxy, but there was no history on our test system going back before July 2nd. I ran it on production, and there was no history at all. I found out that this SQL had not run for two months on production, and since we only keep 6 weeks of history in our AWR, any execution history of azrtxpx1g6mxy had fallen off the list.
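
For what it is worth, the AWR retention that let the history age out can be checked with a one-liner:

select snap_interval, retention from dba_hist_wr_control;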

Finally, I had a moment of insight. If I could have seen a faster plan on production, no insight would have been required. It would have been an almost mechanical process of comparing the good and bad plans. But I finally got the idea of running my same optimizer statistics scripts against production and looking at the indexes. Sure enough, production had an index on the joining column. Duh! I probably should have checked for differences between the tables and indexes in test and prod first, but I had no reason to suspect a difference. So, a coworker put the indexes back in test and everything was fine.

So, it is a simple thing. Indexes were dropped on a test system. Happens all the time. Not profound. But maybe the process of driving from problem statement through the different kinds of evidence and tools to the solution is more profound. It gets results. What does the AWR report point to? The top segments pointed to a table. The counts and column statistics suggested the need for an index. Finally, a compare of prod and test showed that an index was missing.

Bobby


Fixed issue with DBA_HIST_SQLSTAT script

I got rid of the old sqlstat.sql, sqlstat2.sql and sqlstat3.sql scripts in my GitHub repository and fixed an issue that was in sqlstat3.sql. Now the current version is just called sqlstat.sql. This one script should work well in all the situations where the other three were useful. (One script to rule them all…)

The old version of sqlstat3.sql was showing rows for plans that were not actually being run, so I added a condition to only show plans that had some elapsed time. Slowly getting more useful.

Old output:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA
------------- --------------- ------------------------- ----------------
g0wqzdhxy62mc      2903317857 14-APR-20 01.00.04.320 PM                0
g0wqzdhxy62mc      4109593498 14-APR-20 01.00.04.320 PM              440
g0wqzdhxy62mc      1569680226 14-APR-20 01.00.04.320 PM                0
g0wqzdhxy62mc      2845678604 14-APR-20 01.00.04.320 PM                0

New output:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA
------------- --------------- ------------------------- ----------------
g0wqzdhxy62mc      4109593498 14-APR-20 01.00.04.320 PM              440
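
For anyone who does not want to pull the full script from GitHub, the heart of it is a query against DBA_HIST_SQLSTAT joined to DBA_HIST_SNAPSHOT. A simplified sketch of that query, including the kind of elapsed-time condition described above, looks like this:

select ss.sql_id,
       ss.plan_hash_value,
       sn.end_interval_time,
       ss.executions_delta,
       ss.elapsed_time_delta/1000 elapsed_ms
from dba_hist_sqlstat ss, dba_hist_snapshot sn
where ss.sql_id = 'g0wqzdhxy62mc'
and ss.snap_id = sn.snap_id
and ss.dbid = sn.dbid
and ss.instance_number = sn.instance_number
and ss.elapsed_time_delta > 0
order by sn.end_interval_time;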

I really was not using the other two scripts anymore and was only using sqlstat3.sql. They are in the GitHub history if anyone needs to revisit them. Going forward I will use the new sqlstat.sql.

Bobby


How to email from Python

I got an email asking how to send an alert email from a Python script. Here is an example based on a script I have set up:

You need to replace “PUT YOUR SMTP SERVER HERE” with the host name of your SMTP server.

Bobby


January 2020 Patches Not In Recommended Patch Advisor

Every quarter I download the latest patch set from Oracle’s support site using the Recommended Patch Advisor (RPA), one of the options under the Patches and Updates tab. But the January 2020 patch sets did not show up; the RPA still recommends the October 2019 patch sets. I tried searching Oracle’s support site and Googling but did not find any answers. I am sure the information is out there somewhere, but the way I was searching did not find it. So, I thought I would post a quick note about it. The short answer is that you should use this Oracle support document to find the latest patch sets and not the Recommended Patch Advisor:

Master Note for Database Proactive Patch Program (Doc ID 756671.1)

To find this out I had to open an SR with Oracle Support. They quickly identified this document which explained the situation:

My Oracle Support: Patch and Health Recommendations No Longer Supported (Doc ID 2644769.1)

Note 2644769.1 seems to say that the Recommended Patch Advisor will continue to refer to the October 2019 patches until Oracle removes the RPA from their support site in May.

The funny thing is that I knew we had an internal document that a coworker and I had both contributed to that describes how to download the latest patch set and apply it to a database to create a tar or zip that we could use to deploy a new database. I decided to modify that document to refer to note 756671.1 and remove references to the RPA. To my surprise my coworker had already included note 756671.1 months ago in our instructions and not the RPA. I think that Oracle support must have recommended the note some time ago, but I was still using the advisor. I guess you learn something new every day.

Anyway, note to self: don’t use the Recommended Patch Advisor to find the quarterly patch updates. Use note 756671.1.

Bobby


Python Script to Scan Listener Log For Incoming Connections

Wrote a quick and dirty script to scan the listener log for incoming connections and produce a count of how many connections came from which host. Here is the URL:

https://github.com/bobbydurrett/miscpython/blob/master/gethostcounts.py

All it does is hack out the IP address from the second HOST= entry of the lines in the listener log that have HOST= in them. Then it uses the Python function socket.gethostbyaddr to get the hostname for those addresses that can be looked up this way.

It counts the number of lines for a given IP address and sorts the results from most to least connections. Might be useful to someone.

Bobby


Bounce Listener and Database After Adding oracle User to Group

I keep forgetting that I need to bounce both the listener and the database after changing the oracle user’s group membership, so I am writing a quick note to myself here. This is the second time I have been tripped up by something like this and I spent way too much time figuring it out again.

We have some PL/SQL code that writes to files on a database server. The most recent issue was using DBMS_XSLPROCESSOR.CLOB2FILE to write a clob to a file in a certain directory. This got the following error:

ORA-22288: file or LOB operation FILEOPEN failed
Permission denied

I tried writing to the same directory using UTL_FILE.FOPEN with write permission to create a new file. I got this error:

ORA-29283: invalid file operation: cannot open file [29435]
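
The write test was essentially this kind of PL/SQL block (a sketch; MY_DIR and the file name are placeholders for the real directory object and file):

declare
  f utl_file.file_type;
begin
  f := utl_file.fopen('MY_DIR', 'permtest.txt', 'w');
  utl_file.put_line(f, 'test line');
  utl_file.fclose(f);
end;
/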

I got a Linux admin to add the oracle Linux user to the correct group, and I bounced the database, but the error did not go away. Then I bounced the listener and the database, and the errors were resolved.

NOTE TO SELF: if you add the oracle user to a new group to allow PL/SQL to write to a directory using group permissions, bounce both the listener and the database!

Bobby

P.S. This is on Oracle 19c on Redhat Linux 7 but should apply to a wide range of Oracle versions and Unix like platforms.

P.P.S. Bounce the listener first.
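
Presumably this matters because dedicated server processes are spawned by the listener and inherit whatever group membership the listener process started with. So the sequence is roughly this (a sketch):

lsnrctl stop
lsnrctl start

sqlplus / as sysdba
shutdown immediate
startup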


Plan Change Monitor goes off on SQL Analyze query

Last night, the plan change monitor script we have set up on one of our production databases alerted us about a query with a SQL Analyze hint. It looked weird.

The query has sql_id 48tfz5d37xk28 and starts like this:

/* SQL Analyze(3141,1) */ select ...

The weird thing is the output of my sqlstat.sql script for this sql_id value:

It looks like one of the database’s built-in processes was running this query with various plans. Plan_hash_value 3992115155 tripped the alert; I guess it took 10 times as long as plan 3597180003. The plan change monitor script could be improved in many ways to eliminate false positives like this, but I just added a condition to exclude all queries run by the SYS user:

258c258,259
< m.sql_id not in ('dkqs29nsj23jq','58r2wgh5xx1bg','3m2su959r6tp8','3z60x9386n1v8','bx81s3m9zxutb','czwbdwxx5aqsk','001wbydnwaa7n','6quugay8yt73u')
---
> m.sql_id not in ('dkqs29nsj23jq','58r2wgh5xx1bg','3m2su959r6tp8','3z60x9386n1v8','bx81s3m9zxutb','czwbdwxx5aqsk','001wbydnwaa7n','6quugay8yt73u')
> and s.username <> 'SYS'

This is the first time I remember seeing something like this cause an alert, but the workaround was easy.

Bobby


AWR records top 30 SQLs by default

I forget that Oracle’s AWR only records the top 30 SQL statements in each snapshot by default. I am not sure how long this link will last, but here is a 19c manual page describing the default: 19c manual – see the topnsql setting. A lot of my query tuning assumes that the problem query is in the AWR, but a very efficient query on an active system may mysteriously disappear from, or never show up in, the AWR. It sometimes takes me a while to remember that the snapshots only include a fixed number of SQLs.
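
The current setting, and how to raise it, can be checked with something like this (raising it to 100 is just an illustration; capturing more SQLs per snapshot makes the AWR bigger):

select topnsql from dba_hist_wr_control;

exec dbms_workload_repository.modify_snapshot_settings(topnsql => 100);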

I use my sqlstat.sql query to look at a history of a particular sql_id’s executions. Often it shows the query running faster on one plan_hash_value than another. Then I look at why the sql_id changed plans. But what about when the good plan does not show up at all? Several times I have looked at sqlstat.sql output and thought that a query had never run efficiently in the past, even though it had. It had run so efficiently that it was not on the report, so it looked like the query was a new, slow SQL statement.

Often I will fix a query’s plan with a SQL Profile and rerun sqlstat.sql on a busy system after manually running dbms_workload_repository.create_snapshot to capture the most recent activity, and the problem query with the new plan will not show up. Usually I remember that it is not in the top 30 queries and that is why it is missing, but sometimes I forget. Here is a partial sqlstat.sql output showing a long-running SQL disappearing after I fixed its plan on Wednesday:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Ave ms
------------- --------------- --------------------- ---------------- ------------------
acn0557p77na2      3049654342 12-FEB-20 05.00.01 AM                1          16733.256
acn0557p77na2      3049654342 12-FEB-20 06.00.03 AM                2           49694.32
acn0557p77na2      3049654342 12-FEB-20 07.00.53 AM                6          47694.527
acn0557p77na2      3049654342 12-FEB-20 08.00.54 AM               11         50732.0651
acn0557p77na2      3049654342 12-FEB-20 09.00.33 AM               15         53416.5183
acn0557p77na2      3049654342 12-FEB-20 10.00.43 AM               21         86904.4385
acn0557p77na2      3049654342 12-FEB-20 11.00.02 AM               27          84249.859
acn0557p77na2      3049654342 12-FEB-20 12.00.20 PM               27         125287.757
acn0557p77na2      3049654342 12-FEB-20 01.00.36 PM               69         156138.176

Sometimes I query the V$ tables to verify it is currently running a good plan. Here is example output from vsqlarea.sql showing the good plan running today.

LAST_ACTIVE         SQL_ID        PLAN_HASH_VALUE Avg Elapsed ms
------------------- ------------- --------------- --------------
2020-02-14 16:11:40 acn0557p77na2       867392646             14
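
vsqlarea.sql is just a query against V$SQLAREA. A simplified sketch of the idea (not the exact script) is:

select last_active_time,
       sql_id,
       plan_hash_value,
       elapsed_time/1000/decode(executions,0,1,executions) avg_elapsed_ms
from v$sqlarea
where sql_id = 'acn0557p77na2';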

This is just a quick note to me as much as anyone else. A query that is missing from an AWR report or my sqlstat.sql report may not have run at all, or it may have run so well that it is not a top 30 query.

Bobby


DBMS_UTILITY.FORMAT_CALL_STACK Change in 12.2 and later

Quick note. During the 11.2.0.4 to 19c upgrade that I have been writing about, we found a difference in the behavior of DBMS_UTILITY.FORMAT_CALL_STACK. I tested it on several versions, and the behavior switched in 12.2. Now it includes the name of the procedure within the package in the stack.

Old output:

----- PL/SQL CALL STACK -----
  OBJECT      LINE  OBJECT
  HANDLE    NUMBER  NAME
0X15BFA6930         9  PACKAGE BODY MYUSER.MYPKG
0X10C988058         1  ANONYMOUS BLOCK

New output:

----- PL/SQL CALL STACK -----
  OBJECT      LINE  OBJECT
  HANDLE    NUMBER  NAME
0XA796DF28         9  PACKAGE BODY MYUSER.MYPKG.MYPROC
0X7ADFEEB8         1  ANONYMOUS BLOCK

Test code:

select * from v$version;

CREATE OR REPLACE PACKAGE MYPKG
AS

PROCEDURE MYPROC;

END MYPKG;
/
SHOW ERRORS

CREATE OR REPLACE PACKAGE BODY MYPKG
AS

PROCEDURE MYPROC
IS

BEGIN

DBMS_OUTPUT.PUT_LINE(UPPER(dbms_utility.format_call_stack));

END MYPROC;

END MYPKG;
/
SHOW ERRORS;

set serveroutput on
execute mypkg.myproc;
show errors;

Might be useful to someone else. We had some code that depended on the package name being the last thing on its line, but the new version includes the name of the procedure after the package name.

Bobby
