AZORA with Viscosity lunch and learn Thursday 2/13/20

Quick note. The next AZORA meetup will be next Thursday, 2/13/2020.

https://www.meetup.com/Arizona-Oracle-User-Group/events/268337993/

Be sure to RSVP on Viscosity’s web site: https://tips.viscosityna.com/modernize2020-phoenix/ (DOES NOT EXIST)

I don’t know about you but we are busy moving to or implementing on 19c so these sessions related to 19c and other topics should be valuable.

Bobby

Posted in Uncategorized | Leave a comment

Query Plan Change Diagnosis Example

This week I investigated an issue with a query that was suddenly a lot slower in one test environment than another. It runs about 2 seconds in the good case as well as in production. But it was now running more than 10 seconds which is painfully slow through the GUI and impacting the team’s ability to finish a software release. I was able to figure out that the query was running a worse plan and get it to run the good plan and helped the team finish their testing.

I thought I would write this blog post documenting the tools that I used and my reasoning while trying to resolve the issue. This post may be redundant with some of my other posts because they document similar tools and approaches, but I thought it could not hurt to put together a complete example. I considered just writing about one part of the process to focus in on that, but I thought I would write a longer post instead that shows the overall process with details along the way. This blog is my experience as an Oracle Database Administrator/Architect, and I work for an individual company so my experiences may not apply to everyone else’s situation. You may work somewhere that has licensed different versions or features of the Oracle database so your choices would be different. But I still can put out there what I am doing and let others apply it to their situation as they can. Also, there are lots of very sharp and hardworking Oracle people out there that write about these same topics and have contributed their own tools to the Oracle community so I wonder about the value of putting my tools on GitHub and writing about how I use them. Why not use their tools and their approaches? Maybe they will work better? Sure, but I know that my approach and tools work well for me. Also most of the ideas I use came from books, talks, and posts by the same sharp people that I have in mind so my approach really reflects my translation of ideas that are out there in the Oracle community with additions of my own so it is not a choice of me versus them. Anyway, I thought I would work through this example on this post so others can review it and find value where they can. Inevitably I will get a lot out of writing it down myself and it will help me remember what I was thinking at this time when I read this post in the future. Hopefully others will benefit as well.

Here is what I knew when I first looked at the problem. A coworker emailed me the SQL for the problem query and verified that the tables in both test environments were about the same size. Sometimes our test environments can be much smaller than production and different from each other in terms of data volume. Usually, when I get a performance question, I ask for two things: the name of the database and the date and time that the issue stopped and started, including time zone. In this case I knew the database name but not the time frame. But I figured that they had been working on it during the current day, so I just got an AWR report for that day from 8 am to present assuming they worked roughly in working hours.

Fortunately, the AWR report had the problem query as the top query so I did not have trouble finding it. It helped that I already knew the text of the SQL statement. The main thing I got from the AWR report at first is the SQL_ID value. I already had the SQL text, but it helps to get the SQL_ID to plug into other scripts I used.

Trimmed down version of AWR report showing problem SQL_ID 4n5ssud3fgsyq

My next step was to run my sqlstat.sql script to see if the problem query, 4n5ssud3fgsyq, has changed plans. Not sure why I did not use the updated version, sqlstat3.sql. I manually updated the query with the SQL_ID value:

where ss.sql_id = '4n5ssud3fgsyq'

and then I looked for plan changes in the output:

I use the PLAN_HASH_VALUE as a name for the plan. I verified that the good plan, 65249283, was run on the other test server that was fine and that it was used in production. So, I had to figure out why the plan changed to 2414495514 on this one test server.

At some point in this process I paused and considered using a SQL Profile to just force the plan back to the good plan. I have written a number of posts about SQL Profiles but this did not seem like a situation where one made sense. Why not spend a couple hours trying to figure out why the plan changed? If I use a SQL Profile, then I will have to migrate the profile to production. And if the query changes the tiniest bit in a future release the SQL Profile will no longer be valid. I have one situation where we have a partitioned table with some almost empty partitions and queries that access the table with the partitioning column passed into the where clause as bind variables. I have been using SQL Profiles to handle that situation on this particular set of production and test databases, but it is a big pain. The last release I had to redo 7-9 SQL Profiles. I’m probably going to have to bite the bullet and find another way around it rather than updating profiles when a release modifies the affected queries. Probably will have to fake the stats on the empty partitions or see if they can be removed. I doubt I will be able to get development to take the partitioning columns out of bind variables. I guess some set of hints might work as well. But anyway, the point is that in this situation that this particular post is about I decided to dig into it and try to find the underlying problem instead of using a SQL Profile band aid.

Having decided to move forward with figuring out why the plan changed I next looked at the two plans. I used my getplans.sql script to extract the plans from the AWR based on SQL_ID and PLAN_HASH_VALUE. I just ran this when logged into the problem database using sqlplus:

@getplans 4n5ssud3fgsyq

I noticed that the slow plan did full table scans on two tables that the fast plan did index lookups on.

SQL_ID 4n5ssud3fgsyq
--------------------
...
Plan hash value: 65249283 - GOOD PLAN

----------------------------------------------------------
| Id  | Operation                             | Name     |
----------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |
|  17 |      INDEX RANGE SCAN                 | TABLE1_2 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |
|  48 |            INDEX UNIQUE SCAN          | TABLE2_0 |


Plan hash value: 2414495514 - BAD PLAN

----------------------------------------------------------
| Id  | Operation                             | Name     |
----------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL                 | TABLE1   |
...						       
|  48 |           TABLE ACCESS FULL           | TABLE2 	 |
----------------------------------------------------------  
						       

I also looked at the Segments part of the AWR report and TABLE2 was at the top of the logical and physical reads reports.

Edited picture of AWR report. Top Table is TABLE2 in execution plans.

So, at this point in my attempt to understand why the plan changed I have the clue that the bad plan has full table scans, and that one of the two tables with the full scans has a substantial percentage of the overall I/O on the database. So, I had two ideas in mind. First, I wanted to check that the two tables had the same indexes as on the working system. Maybe the indexes got dropped during some development work. If the indexes were there, I could just look at the optimizer statistics for the two tables and compare them to what they have on the working test system and production.

I use my optimizer statistics scripts to look at indexes and information that the optimizer uses to choose the best plan. I know that there are other ways to get the same information, such as in a graphical program like Toad or OEM, but I like having scripts that give me exactly the information that I know is important. It is a little clunky to use but not difficult. The top level script is all.sql and I usually comment out the calls to scripts that produce an excessive amount of output. I probably should put all.sql out like this in GitHub:

@tablelist.sql
--@colpartstats.sql
@columnstats.sql
--@histograms.sql
@indexcolumns.sql
@indexstats.sql
--@indpartstats.sql
--@parthists.sql
@tablestats.sql
@tabpartstats.sql
@tabsubpartstats.sql

Seeing all the histogram information for every partition on a table with thousands of partitions is a lot of output and not really needed. If you have many subpartitions you might comment out tabsubpartstats.sql also. To use these scripts you just put your table owners and names in tablelist.sql and run all.sql. The script indexstats.sql listed out the indexes and statistics on them and indexcolumns.sql showed what columns the indexes were on. Unfortunately, in this case the indexes and their columns were the same for the two full scan tables on both test environments and on production. So the switch of the plan to full scans was not due to the lack of indexes. These tables were not partitioned so the tablestats.sql and columnstats.sql reports were the main ones I looked at to compare statistics. Maybe one of them had very out of date or even missing statistics. Unfortunately, the statistics on the two tables looked pretty similar on all three databases. So, my look at the indexes and statistics of the two tables with full scans on the bad plan did not explain the change in plan.

At some point in this process I decided to run the problem query to see if it gave me clues about the plan change. I am putting this post together remembering the steps I went through, but I may not have the exact chronological order. But I should be able to capture the main things I thought about in an order that makes sense. In order to test the query, I needed to come up with values for the bind variables. I used my bind2.sql script to extract some of the bind variable values on the problem test database. I edited bind2.sql to have the problem SQL_ID, 4n5ssud3fgsyq. I chose the most recent two sets of bind variable values to plug into the problem query. I extracted the query text using dumpsql.sql and used Toad to format it nicely. Then I plugged the formatted query into my test2.sql script. To use test2.sql you need to replace the select statement that is there on GitHub:

SELECT /*+gather_plan_statistics*/ * from dual where dummy <> ' ';

with your own query. But you have to make sure that the gather_plan_statistics hint stays in the select statement. There are different ways to handle bind variables and their values in a test script like this but in this case, I manually edited the test script replacing the bind variables with literals that I got from bind2.sql. I chose two sets because one set had nulls for a couple of values so I thought the nulls might cause the bad plan. There might be some odd choice of bind variable values on this test system that was causing the bad plan, or so I thought. But I ran the test2.sql script as described and it ran slow on the one test system for both sets of bind variable values and fast on the other.

My tests with the bind variable values showed that the problem was not related to those sets of bind variable values and that even with literals in place the problem query runs slow on the problem test database. The plan was not the same as the bad plan I had extracted for the original query, but it still had the two full scans. In the same way, my tests had index lookups on the two tables on the other test environment even with the literals replacing the bind variables. So, all this really proved so far was that the two sets of bind variable values that I chose did not affect the problem.

To go further I started looking at the output of test2.sql. The way I get the plan here shows estimated and actual rows. If I find discrepancies in estimated and actual row counts that could point me to where the problem lies. This query in test2.sql dumps out the plan in a format that shows estimated and actual rows:

select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

Here is the edited output for the bad plan for the two full scan tables:

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Starts | E-Rows | A-Rows |   A-Time   |
----------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL        | TABLE1  |      2 |   1593K|   3186K|00:00:01.36 |
...
|  48 |           TABLE ACCESS FULL  | TABLE2  |      2 |   1628K|   3344K|00:00:15.13 |
----------------------------------------------------------------------------------------

But I am not sure what this is telling me. For TABLE1 the optimizer expected 1593K rows and got 3186K. It expected 1.5 million rows but got 3 million. Similarly, about 1.6 million and 3.3. But how does this help me figure out what the problem is? I looked at the good plan in the same way but was equally puzzled:

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |      0 |      1 |      0 |
|* 17 |      INDEX RANGE SCAN                 | TABLE1_2 |      0 |      1 |      0 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |      0 |      1 |      0 |
|* 48 |            INDEX UNIQUE SCAN          | TABLE2_0 |      0 |      1 |      0 |
-------------------------------------------------------------------------------------

Here the estimated rows is 1 but actual is 0. But how do I use that to figure out what changed the plan on the one test environment? Then I got the key insight to solve this whole thing. It was not new to me, but I don’t do this every day and it was a good reminder and I wanted to blog about it. I considered blogging about just this one point and maybe I should edit this post down to just this idea. But then I wouldn’t put out the overall steps and tools I used to get to this point. The key insight was to use a full set of outline hints to force the bad plan and good plan to run in both environments and to compare the estimated and actual rows with each plan. I am not sure how this post will end but that is the most important point. Presumably the optimizer considered both plans and gave each of them a cost and chose the lowest cost plan to run. On our problem system the much slower plan has a lower cost than the faster one. The opposite is true on the working systems. What we want is to force the query to run the bad plan both places and see where the estimated rows differ. Then do the same for the good plan. A key idea that I learned about query tuning from people like Tom Kyte, Jonathan Lewis, and others is that the optimizer chooses a bad plan because of bad row estimates or bad estimates of the cost per row of certain operations. Most of the time it comes down to bad row estimates. So, which table is getting the bad row estimates and why? My way of finding it out in this case was to use outline hints to force the same plan to run on both databases and then use the test2.sql output to look at estimated and actual rows.

I was not sure how to extract the plans from the AWR and get the outline hints. I’m sure there is a way but in my hurry I ended up just using my plan.sql script to use EXPLAIN PLAN to get the plan and pull the outline hint from there. I checked the plans to make sure they had the same index versus full scan issues as I have been seeing. At the end of the output for plan.sql is an outline hint section that starts like this:

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA

You just copy and paste everything from /*+ through */ as your hint. I put this in my test2.sql script that I had used before with one set of the bind variable values as constants and I merged the gather_plan_statistics hint with the outline hint so the query started like this:

 SELECT /*+gather_plan_statistics 
      BEGIN_OUTLINE_DATA

This way the gather_plan_statistics hint collects the actual and estimated rows and the outline hint forces the plan to be used.

At first, I looked at the estimated and actual rows on the good and bad plans on both systems for the two tables that switched to full scans on the problem system.

Good DB Bad Plan Full Scan Tables

------------------------------------------------------------------------------------
| Id  | Operation                             | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL                 | TABLE1  |      1 |   1590K|   1593K|
...
|  48 |           TABLE ACCESS FULL           | TABLE2  |      1 |   1957K|   1957K|
------------------------------------------------------------------------------------

Bad DB Bad Plan Full Scan Tables

Plan hash value: 1397976319

------------------------------------------------------------------------------------
| Id  | Operation                             | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL                 | TABLE1  |      1 |   1593K|   1593K|
...
|  48 |           TABLE ACCESS FULL           | TABLE2  |      1 |   1628K|   1672K|
------------------------------------------------------------------------------------

Good DB Good Plan Full Scan Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |     16 |      1 |     18 |
|* 17 |      INDEX RANGE SCAN                 | TABLE1_2 |     16 |      1 |     18 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |     12 |      1 |     12 |
|* 48 |            INDEX UNIQUE SCAN          | TABLE2_0 |     12 |      1 |     12 |
-------------------------------------------------------------------------------------

Bad DB Good Plan Full Scan Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |     16 |      1 |     18 |
|* 17 |      INDEX RANGE SCAN                 | TABLE1_2 |     16 |      2 |     18 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |     12 |      1 |     11 |
|* 48 |            INDEX UNIQUE SCAN          | TABLE2_0 |     12 |      1 |     11 |
-------------------------------------------------------------------------------------

No obvious pattern shows up on these to me. It seems like the optimizer has the same estimates for these two tables on both databases for both plans. But then I got the second key idea: look at the other tables. It is typical when a plan switches to a full scan for there to be something wrong with that table’s stats. But nothing pointed to that in this situation. So, I got the idea of stepping back and looking at the other tables. That led to me finding the actual issue. Two tables that were not the ones with the full scans had different degrees of discrepancies between actual and estimated rows on the bad and good databases.

Good DB Bad Plan Other Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID    | TABLE3   |      1 |      3 |      6 |
|* 13 |         INDEX RANGE SCAN              | TABLE3_1 |      1 |      3 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID      | TABLE4   |      6 |     55 |     11 |
|* 15 |       INDEX RANGE SCAN                | TABLE4_1 |      6 |     55 |     11 |
...
-------------------------------------------------------------------------------------

Bad DB Bad Plan Other Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID    | TABLE3   |      1 |     22 |      6 |
|* 13 |         INDEX RANGE SCAN              | TABLE3_1 |      1 |     22 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID      | TABLE4   |      6 |    153 |     11 |
|* 15 |       INDEX RANGE SCAN                | TABLE4_1 |      6 |    153 |     11 |
...
-------------------------------------------------------------------------------------

Good DB Good Plan Other Tables

----------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID | TABLE3   |      1 |      3 |      6 |
|* 13 |         INDEX RANGE SCAN           | TABLE3_1 |      1 |      3 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID   | TABLE4   |      6 |     55 |     11 |
|* 15 |       INDEX RANGE SCAN             | TABLE4_1 |      6 |     55 |     11 |

----------------------------------------------------------------------------------

Bad DB Good Plan Other Tables

----------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID | TABLE3   |      1 |     22 |      6 |
|* 13 |         INDEX RANGE SCAN           | TABLE3_1 |      1 |     22 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID   | TABLE4   |      6 |    153 |     11 |
|* 15 |       INDEX RANGE SCAN             | TABLE4_1 |      6 |    153 |     11 |
...
----------------------------------------------------------------------------------

Looking at these two tables the bad database had significantly higher row estimates than the good database and the actual rows was the same on both. Errors in estimated rows multiply through the plan as joins are considered. So, at this point I thought it could be that the estimates on these tables (TABLE3 and TABLE4) led to the switch to full scans on TABLE1 and TABLE2. So, I went back to my optimizer statistics scripts and took TABLE1 and 2 out of tablelist.sql and put TABLE3 and 4 in and looked for differences. I found the difference in the columnstats.sql output. The NUM_BUCKETS value for the first column of TABLE3 and TABLE4 was 1 on the bad database and 254 or 255 on the good. So, column statistics had been gathered differently on these two tables on the bad database at some point. Maybe in the process of testing or developing on this system statistics were gathered in a non-standard way. The database in question uses Oracle’s default statistics job. (This is 11.2.0.3, HP Unix Itanium by the way). So, I thought I would try gathering statistics on the two tables with dbms_stats.gather_table_stats with default options because that would be similar to what the default stats job would do. I did something like this:

execute dbms_stats.gather_table_stats('MYSCHEMA','TABLE3');

execute dbms_stats.gather_table_stats('MYSCHEMA','TABLE4');

After gathering stats on the two tables I verified that the first column of each had a 254/5 bucket histogram. Then I just tested the problem query on both, and they ran the same fast plan. Then the development team tested through the application and verified that it was back to normal. I flushed the shared pool before turning it over to them in case the bad plan was cached.

This problem was a little tricky because the stats were wrong on two tables that were not the ones that switched to full scans. A lot of times the situation is simpler such as a table is missing statistics altogether. But I thought it would be helpful to work through a description of the process that I took to get to that point and the tools that I used. The key point is the one I put in bold above which is to look at the estimated and actual rows for the same plan on both databases and look for differences. Once I did that it was just a matter of finding the tables with the different estimates on the two dbs. That led me to look at their statistics and to regather them in a better way. It took a little time to figure this out, but it was nice to find the problem and do the least amount of change necessary to resolve the issue instead of just randomly trying things. Sometimes people will just gather stats on all the tables in the query. That probably would have worked here if they gathered them in the right way. Or you could just use a SQL Profile and forget trying to figure out why the new plan popped up. But it was worthwhile to understand why the plan changed in this case. For one thing I was concerned whether the same plan change would happen in production, so I wanted to know why it changed on the test environment. Knowing that the two tables had statistics gathered on them in a way that we would not use in production I have more confidence that prod is fine. There really is value in digging into why a query’s plan has changed but it does take some work. Often the bad plan is a result of a bad row estimate in some part of the plan and it makes sense to track down which table has the bad row estimate and dig into why the estimate was off. In many cases bad row estimates relate to how and when statistics were gathered on the problem table as was the case with our two tables in this situation.

P.S. sqlstat3.sql no longer exists. Here is my post about how I merged my sqlstat scripts:

https://www.bobbydurrettdba.com/2020/04/14/fixed-issue-with-dba_hist_sqlstat-script/

Posted in Uncategorized | 2 Comments

Datapump Import Partitioned Tables ORA-00600 qesmaGetPamR-NullCtx

I have not yet had time to build a test case and prove this out, but I wanted to document one last bug that we found so far in our 11.2.0.4 to 19c upgrade. We tried copying a bunch of partitioned tables on our source database to the new one using Datapump Import (impdp) over a database link. We got a boatload of errors like this:

ORA-00600: internal error code, arguments: [qesmaGetPamR-NullCtx], 

There are many Oracle bugs like this, but they seem to have been fixed in 11.2.0.4. For example:

Bug 12591399 – ORA-600[qesmagetpamr-nullctx] / ORA-14091 with distributed query with local partition table (Doc ID 12591399.8)

Puzzling. We ended up just exporting to disk and that has worked well so no big deal, but I wonder if this is some sort of recession of a fixed bug.

Anyway, I am off for the rest of the year. This should be my last post unless I mess with Nethack over vacation and post something about that. I hope everyone out there has a good new year.

Bobby

P.S. Created a simple partitioned table with 2 partitions and 100 rows in each one. I got the error importing over a link from 11.2.0.4 to 19c. It worked perfectly going from 11.2.0.4 to 11.2.0.4. Same source table. Parfile:

$ cat bobby_link_test.par
userid=MYUSER/MYPASSWORD
JOB_NAME=BOBBY_TEST
DIRECTORY=BOBBY_DIR
NETWORK_LINK=MYLINK
LOGFILE=bobby_link_test.log
tables=TEST

Table:

CREATE TABLE test
(
  PART_COL              NUMBER,
  data                  NUMBER
)
PARTITION BY RANGE (PART_COL)
(  
  PARTITION PART_COL_1 VALUES LESS THAN (100),  
  PARTITION PART_COL_2 VALUES LESS THAN (200)
)
;

PPS. Works fine going from 11.2.0.4 to 18c. Going to try a different 19c database just to be sure it isn’t the one that has the problem.

PPPS. Definitely a 19c bug. It fails on two different 19c databases but not on 18c. In every case source is same 11.2.0.4 database and same small partitioned table.

1/6/2020 update – created SR. Trying to be a good Oracle community citizen.

Bug is already fixed. Patch 30321076. Oracle is way ahead of me!

Posted in Uncategorized | 2 Comments

db_securefile PREFERRED results in ORA-60019 with small uniform extents

Last 19c upgrade issue. Working on our new 19c database, several things died off with errors like this:

SQL> execute DBMS_STATS.CREATE_STAT_TABLE ('MYSCHEMA','MYSTATTAB','MYTS');
BEGIN DBMS_STATS.CREATE_STAT_TABLE ('MYSCHEMA','MYSTATTAB','MYTS'); END;

*
ERROR at line 1:
ORA-60019: Creating initial extent of size 14 in tablespace of extent size 8
ORA-06512: at "SYS.DBMS_STATS", line 20827
ORA-06512: at "SYS.DBMS_STATS", line 20770
ORA-06512: at "SYS.DBMS_STATS", line 20765
ORA-06512: at line 1

Our tablespaces had small uniform extents and our 19c database had defaulted the parameter db_securefile to PREFERRED. We bumped our uniform extent sizes up to 1 megabyte and the problem went away. Setting db_securefile to PERMITTED also resolved the issue.

Oracle’s support site has a bunch of good information about this. This might be a relevant bug:

Bug 9477178 : ORA-60019: CREATING INITIAL EXTENT OF SIZE X IN TABLESPACE FOR SECUREFILES

Bobby

Posted in Uncategorized | 4 Comments

Datapump Import Fails on Tables With Extended Statistics

Quick post before I leave on vacation. We used Datapump to import a schema from an 11.2 HP-UX database to a 19c Linux database and got errors on a few tables like these:

ORA-39083: Object type TABLE:"MYSCHEMA"."TEST" failed to create with error:
ORA-00904: "SYS_STU0S46GP2UUQY#45F$7UBFFCM": invalid identifier

Failing sql is:
ALTER TABLE "MYSCHEMA"."TEST"  MODIFY ("SYS_STU0S46GP2UUQY#45F$7UBFFCM" NUMBER GENERATED
ALWAYS AS (SYS_OP_COMBINED_HASH("COL1","COL2","COL3")) VIRTUAL )

Workaround was to create the table first empty with no indexes, constraints, etc. and import. Today I was trying to figure out why this happened. Apparently, the table has extended statistics on the three primary key columns. I found a post by Jonathan Lewis that shows a virtual column like the one this table has with extended statistics. The error is on the datapump import, impdp, of the table that has extended statistics. This error is similar to some Oracle documented issues such as:

DataPump Import (IMPDP) Raises The Errors ORA-39083 ORA-904 Due To Virtual Columns Dependent On A Function (Doc ID 1271176.1)

But I could not immediately find something that says that extended statistics cause a table to not be importable using Datapump impdp.

If you want to recreate the problem, try added extended stats like this (which I derived from Jonathan Lewis’s post):

select dbms_stats.create_extended_stats(NULL,'TEST','(COL1, COL2, COL3)') name from dual;

select * from user_tab_cols where table_name='TEST';

Then export table from 11.2 and import to 19c database using datapump. Anyway, posting here for my own memory and in case others find it useful. Maybe this is a bug?

Bobby

1/8/2020 update

Simplified my test case table to this:

CREATE TABLE test
(
  COL1 NUMBER,
  COL2 NUMBER
)
;

select dbms_stats.create_extended_stats(NULL,'TEST','(COL1, COL2)') name from dual;

Datapump export of this table from 11.2.0.4 fails on import with ORA-00904 on 18c and 19c but not 12.2.

1/9/20

Oracle support verified that there are a couple of bugs already out there for this.

1/13/20

You only get this error if you use the SCHEMAS parameter in Datapump. It works if you use the TABLES parameter.

This in the parfiles fails:

SCHEMAS=MYSCHEMA
INCLUDE=TABLE:"IN ('TEST')"

This works:

TABLES=TEST

1/15/20

My bug number for what it’s worth: 30763851. Probably will be a duplicate since there are several already out there. I don’t think there is a patch yet.

6/3/20

I applied the supplied patch today and it works great. Boom!

Posted in Uncategorized | 2 Comments

Merge Always Updates Sequence Number

This is nothing new, but I wanted to throw out a quick post to document it. If you have a sequence.nextval in the insert part of a merge statement the merge calls nextval for all the updated rows as well.

Oracle has a bug report about this from a 9.2 issue, so this is nothing new:

Bug 6827003 : SEQUENCE # IN MERGE BEING UPDATED FOR BOTH INSERT AND UPDATE

I created a couple of testcases if you want to try them: sequencewithmerge.zip

Oracle’s bug report says you can work around the issue by encasing the sequence.nextval call in a function so I tried it and it works.

Anyway, you can’t count on the sequence only being advanced on inserted rows with merge statements if you include sequence.nextval in the insert part of the merge statement.

Bobby

Posted in Uncategorized | Leave a comment

Database Link to 9.2 Database from 19c

I have mentioned in previous posts that I am working on migrating a large 11.2 database on HP Unix to 19c on Linux. I ran across a database link to an older 9.2 database in the current 11.2 database. That link does not work in 19c so I thought I would blog about my attempts to get it to run in 19c. It may not be that useful to other people because it is a special case, but I want to remember it for myself if nothing else.

First, I’ll just create test table in my own schema on a 9.2 development database:

SQL> create table test as select * from v$version;

Table created.

SQL> 
SQL> select * from test;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
PL/SQL Release 9.2.0.5.0 - Production
CORE	9.2.0.6.0	Production
TNS for HPUX: Version 9.2.0.5.0 - Production
NLSRTL Version 9.2.0.5.0 - Production

Next, I will create a link to this 9.2 database from a 19c database. I will hide the part of the link creation that has my password and the database details, but they are not needed.

SQL> create database link link_to_92
... removed for security reasons ...

Database link created.

SQL> 
SQL> select * from test@link_to_92;
select * from test@link_to_92
                   *
ERROR at line 1:
ORA-03134: Connections to this server version are no longer supported.

So I looked up ways to get around the ORA-03134 error. I can’t remember all the things I checked but I have a note that I looked at this one link: Resolving 3134 errors. The idea was to create a new database link from an 11.2 database to a 9.2 database. Then create a synonym on the 11.2 database for the table I want on the 9.2 system. Here is what that looks like on my test databases:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
... removed for brevity ...

SQL> create database link link_from_112
... removed for security ...

Database link created.

SQL> create synonym test for test@link_from_112;

Synonym created.

SQL> 
SQL> select * from test;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production

Now that I have the link and synonym on the 11.2 middleman database, I go back to the 19c database and create a link to the 11.2 database and query the synonym to see the original table:

SQL> select * from v$version;

BANNER                                                                           ...
-------------------------------------------------------------------------------- ...
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production           ...
...										    

SQL> create database link link_to_112
...

Database link created.
...
SQL> select * from v$version@link_to_112;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
...

SQL> select * from test@link_to_112;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production

So far so good. I am not sure how clear I have been, but the point is that I could not query the table test on the 9.2 database from a 19c database without getting an error. By jumping through an 11.2 database I can now query from it. But, alas, that is not all my problems with this remote 9.2 database table.

When I first started looking at these remote 9.2 tables in my real system, I wanted to get an execution plan of a query that used them. The link through an 11.2 database trick let me query the tables but not get a plan of the query.

SQL> truncate table plan_table;

Table truncated.

SQL> 
SQL> explain plan into plan_table for
  2  select * from test@link_to_112
  3  /

Explained.

SQL> 
SQL> set markup html preformat on
SQL> 
SQL> select * from table(dbms_xplan.display('PLAN_TABLE',NULL,'ADVANCED'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Error: cannot fetch last explain plan from PLAN_TABLE

SQL> 
SQL> select object_name from plan_table;

OBJECT_NAME
------------------------------------------------------------------------------

TEST

Kind of funky but not the end of the world. Only a small number of queries use these remote 9.2 tables so I should be able to live without explain plan. Next, I needed to use the remote table in a PL/SQL package. For simplicity I will show using it in a proc:

SQL> CREATE OR REPLACE PROCEDURE BOBBYTEST
  2  AS
  3  ver_count number;
  4  
  5  BEGIN
  6    SELECT
  7    count(*) into ver_count
  8    FROM test@link_to_112;
  9  
 10  END BOBBYTEST ;
 11  /

Warning: Procedure created with compilation errors.

SQL> SHOW ERRORS;
Errors for PROCEDURE BOBBYTEST:

LINE/COL ERROR
-------- -----------------------------------------------------------------
6/3      PL/SQL: SQL Statement ignored
6/3      PL/SQL: ORA-00980: synonym translation is no longer valid

I tried creating a synonym for the remote table but got the same error:

SQL> create synonym test92 for test@link_to_112;

...

SQL> CREATE OR REPLACE PROCEDURE BOBBYTEST
  2  AS
  3  ver_count number;
  4  
  5  BEGIN
  6    SELECT
  7    count(*) into ver_count
  8    FROM test92;
  9  
 10  END BOBBYTEST ;
 11  /

Warning: Procedure created with compilation errors.

SQL> SHOW ERRORS;
Errors for PROCEDURE BOBBYTEST:

LINE/COL ERROR
-------- -----------------------------------------------------------------
6/3      PL/SQL: SQL Statement ignored
6/3      PL/SQL: ORA-00980: synonym translation is no longer valid

Finally, by chance I found that I could use a view for the remote synonym and the proc would compile:

SQL> create view test92 as select * from test@link_to_112;

View created.

...

SQL> CREATE OR REPLACE PROCEDURE BOBBYTEST
  2  AS
  3  ver_count number;
  4  
  5  BEGIN
  6    SELECT
  7    count(*) into ver_count
  8    FROM test92;
  9  
 10  END BOBBYTEST ;
 11  /

Procedure created.

SQL> SHOW ERRORS;
No errors.
SQL> 
SQL> execute bobbytest;

PL/SQL procedure successfully completed.

SQL> show errors
No errors.

Now one last thing to check. Will the plan work with the view?

SQL> explain plan into plan_table for
  2  select * from test92
  3  /

Explained.

SQL> select * from table(dbms_xplan.display('PLAN_TABLE',NULL,'ADVANCED'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Error: cannot fetch last explain plan from PLAN_TABLE

Sadly, the view was not the cure all. So, here is a summary of what to do if you have a procedure on a 19c database that needs to access a table on a 9.2 database:

  • Create a link on a 11.2 database to the 9.2 database
  • Create a synonym on the 11.2 database pointing to the table on the 9.2 database
  • Create a link on the 19c database to the 11.2 database
  • Create a view on the 19c database that queries the synonym on the 11.2 database
  • Use the view in your procedure on your 19c database
  • Explain plans may not work with SQL that use the view

Bobby

Posted in Uncategorized | 12 Comments

So Far So Good with Force Logging

I mentioned in my previous two posts that I had tried to figure out if it would be safe to turn on force logging on a production database that does a bunch of batch processing on the weekend: post1, post2. We know that many of the tables are set to NOLOGGING and some of the inserts have the append hint. We put in force logging on Friday and the heavy weekend processing ran fine last weekend.

I used an AWR report to check the top INSERT statements from the weekend and I only found one that was significantly slower. But the table it inserts into is set for LOGGING, it does not have an append hint, and the parallel degree is set to 1. So, it is a normal insert that was slower last weekend for some other reason. Here is the output of my sqlstatsumday.sql script for the slower insert:

It averaged 197062 milliseconds last weekend but 89684 the previous one. The target table has always been set to LOGGING so FORCE LOGGING would not change anything with it.

One of the three INSERT statements that I expected to be slowed by FORCE LOGGING was faster this weekend than without FORCE LOGGING last weekend:

It ran 2421533 milliseconds this weekend and 5672230 the prior one. So clearly FORCE LOGGING did not have much effect on its overall run time.

It went so well this weekend that we decided to leave FORCE LOGGING in for now to see if it slows down the mid-week jobs and the web-based front end. I was confident on Friday, but I am even more confident now that NOLOGGING writes have minimal performance benefits on this system. But we will let it bake in for a while. Really, we might as well leave it in for good if only for the recovery benefits. Then when we configure GGS for the zero downtime upgrade it will already have been there for some time.

The lesson for me from this experience and the message of my last three posts is that NOLOGGING writes may have less benefits than you think, or your system may be doing less NOLOGGING writes than you think. That was true for me for this one database. It may be true for other systems that I expect to have a lot of NOLOGGING writes. Maybe someone reading this will find that they can safely use FORCE LOGGING on a database that they think does a lot of NOLOGGING writes, but which really does not need NOLOGGING for good performance.

Bobby

Posted in Uncategorized | 1 Comment

Impact of Force Logging

I am working on upgrading an Oracle database from 11.2.0.4 to 19c and migrating it from HP Unix to Linux. This 15-terabyte database is too large to copy from the old to the new system during our normal weekend downtime window. It also has a ton of weekend batch updates that overlap the normal weekend change window so it would be best for our business processing if the cut over from the old to the new system was as fast as possible.

I want to use GoldenGate to minimize the downtime for the cutover using an approach similar to what is described in this Oracle document:

Zero Downtime Database Upgrade Using Oracle GoldenGate

You start GoldenGate collecting changes on the current production system and then take your time copying the 15 TB of data from the old to new system. Once you are done with the initial load you apply the changes that happened in the meanwhile. Finally, you cut over to the new system. You could even switch the direction of the replication to push changes on the new production system back to the old system to allow for a mid-week back out several days after your upgrade. Pretty cool. A teammate of mine successfully used this approach on an important database some years back.

But the database that I am working on now, unlike the one that my colleague worked on, has a lot of tables set to nologging. Under the right conditions inserts into tables set to nologging are not written to the redo logs and will be missed by GoldenGate. This Oracle article recommends setting your database to FORCE LOGGING so GoldenGate will not miss any updates:

In order to ensure that the required redo information is contained in the Oracle redo logs for segments being replicated, it is important to override any NOLOGGING operations which would prevent the required redo information from being generated. If you are replicating the entire database, enable database force logging mode.

Oracle GoldenGate Performance Best Practices

We could also switch all our application tables and partitions in the source system to logging but we have so many I think we would set the whole database to force logging.

But the big question which I touched on in my previous post is whether force logging will slow down our weekend batch processing so much that we miss our deadlines for weekend processing to complete and affect our business in a negative way. The more I investigate it the more convinced I am that force logging will have minimal impact on our weekend jobs. This is an unexpected and surprising result. I really thought that our batch processing relied heavily on nologging writes to get the performance they need. It makes me wonder why we are using nologging in the first place. It would be a lot better for backup and recovery to have all our inserts logged to the redo logs. Here is a nice Oracle Support document that lays out the pros and cons of using nologging:

The Gains and Pains of Nologging Operations (Doc ID 290161.1)

I have an entry in my notes for this upgrade project dated 8/26/19 in which I wrote “Surely force logging will bog the … DB down”. Now I think the opposite. So, what changed my mind? It started with the graph from the previous post:

Graph From Previous Post with Little Direct Writes I/O

I was really surprised that the purple line was so low compared to the other two. But I felt like I needed to dig deeper to make sure that I was not just misunderstanding these metrics. The last thing I want to do is make some production change that slows down our weekend processes that already struggle to meet their deadlines. I was not sure what other metrics to look at since I could not find something that directly measures non-logged writes. But then I got the idea of using ASH data.

In my “Fast way to copy data into a table” post I said that to copy data quickly between two Oracle tables “you want everything done nologging, in parallel, and using direct path”. I may have known then and forgotten but working on this now has me thinking about the relationship between these three ways of speeding up inserts into tables. I think there are the following two dependencies:

  • Nologging requires direct path
  • Parallel requires direct path

Oracle document “Check For Logging / Nologging On DB Object(s) (Doc ID 269274.1)” says the first one. In the second case if you have a target table set to parallel degree > 1 and you enable parallel DML you get direct path writes when you insert into the target table.

From all this I got the idea to look for direct path write waits in the ASH views. I could use ASH to identify insert statements that are using direct path. Then I could check that the target tables or partitions are set to nologging. Then I would know they are doing non-logged writes even if I did not have a metric that said so directly.

directwritesql.sql looked at all the SQL statements that had direct write waits over the entire 6 weeks of our AWR history. The output looks like this:

     select
  2  sql_id,count(*) active
  3  from DBA_HIST_ACTIVE_SESS_HISTORY a
  4  where
  5  event = 'direct path write'
  6  group by sql_id
  7  order by active desc;

SQL_ID            ACTIVE
------------- ----------
2pfzwmtj41guu         99
g11qm73a4w37k         88
6q4kuj30agxak         58
fjxzfp4yagm0w         53
bvtzn333rp97k         39
6as226jb93ggd         38
0nx4fsb5gcyzb         36
6gtnb9t0dfj4w         31
3gatgc878pqxh         31
cq433j04qgb18         25

These numbers startled me because they were so low. Each entry in DBA_HIST_ACTIVE_SESS_HISTORY represents 10 seconds of activity. So over 6 weeks our top direct path write waiter waited 990 seconds. Given that we have batch processes running full out for a couple of days every weekend 990 seconds over 6 weekends is nothing.

I took the top SQL ids and dumped out the SQL text to see what tables they were inserting into. Then I queried the LOGGING column of dba_tables and dba_tab_partitions to see which insert was going into a table or partition set to nologging.

select logging,table_name
from dba_tables
where owner='MYOWNER' and
table_name in
(
... tables inserted into ...
)
order by table_name;


select logging,table_name,count(*) cnt
from dba_tab_partitions
where table_owner='MYOWNER' and
table_name in
(
... tables inserted into ...
)
group by logging,table_name
order by table_name,cnt desc;

This simple check for LOGGING or NOLOGGING status eliminated several of the top direct path write waiters. This process reduced the list of SQL ids down to three top suspects:

SQL_ID            ACTIVE
------------- ----------
cq433j04qgb18         25
71sr61v1rmmqc         17
0u0drxbt5qtqk         11

These are all inserts that are not logged. Notice that the most active one has 250 seconds of direct path write waits over the past 6 weeks. Surely enabling force logging could not cause more than about that much additional run time over the same length of time.

I got the idea of seeing what percentage of the total ASH time was direct path write waits for each of these SQL statements. In every case it was small:

cq433j04qgb18

TOTAL_SAMPLE_COUNT DW_SAMPLE_COUNT DW_SAMPLE_PCT
------------------ --------------- -------------
              2508              25    .996810207
 
71sr61v1rmmqc

TOTAL_SAMPLE_COUNT DW_SAMPLE_COUNT DW_SAMPLE_PCT
------------------ --------------- -------------
              1817              17    .935608145

0u0drxbt5qtqk

TOTAL_SAMPLE_COUNT DW_SAMPLE_COUNT DW_SAMPLE_PCT
------------------ --------------- -------------
              8691              11    .126567714

TOTAL_SAMPLE_COUNT was all the samples for that SQL_ID value for the past 6 weeks. DW_SAMPLE_COUNT is the same count of samples that are direct write waits that we already talked about. DW_SAMPLE_PCT is the percentage of the total samples that were direct write wait events. They were all around 1% or lower which means that write I/O time was only about 1% of the entire run time of these inserts. The rest was query processing best I can tell.

Also I used my sqlstat script to look at the average run time for these inserts:

These queries run at most a couple of hours. If direct path writes are 1% of their total run time, I estimated that force logging would add about 1% to the elapsed time or about 2 minutes per execution.

The final step was to try to run one of these top nologging I/O inserts in a test environment with and without force logging to see if the test matches the expected performance slowdown. I was not able to run 0u0drxbt5qtqk without setting up a more elaborate test with the development team. My test of cq433j04qgb18 ran considerably faster with force logging than without it so I think other factors were hiding whatever effect force logging had. But 71sr61v1rmmqc had some nice results that matched my estimates well. This is on a Delphix clone of production so the data was up to date with prod but the underlying I/O was slower.

71sr61v1rmmqc results running 5 times normal 5 times force logging

The individual run times are in seconds and the averages are listed in seconds and in minutes. I ran the insert 5 times with no force logging and 5 times with it alternating. I dropped the primary key and unique index of the target table to keep from getting constraint errors. I rolled back the insert each time. It averaged about 1.2 minutes more out of 40 minutes of run time which is about a 3% increase. My estimate from ASH was about 1% so this test matches that well.

The final test remains. In some upcoming production weekend, I will put in a change to flip the database to force logging and see how it goes. My tests were run on a test system with a different storage system and with no other activity. We might see different results on a heavily loaded system with a queue for the CPU. But, after all this analysis and testing I feel confident that we won’t be able to tell that force logging is enabled. Unfortunately, we sometimes have performance issues anyway due to plan changes or data volume so the force logging switch might get blamed. But I feel confident enough to push for the final test and I think we ultimately will pass that test and be able to use force logging to enable GoldenGate to support a short cut over time for our migration and upgrade project.

Bobby

P.S. A good question came in as a comment about direct path write waits and asynchronous I/O. The system I am testing on does not support async I/O because it is HP Unix and a filesystem. This older blog post talks a bit about async and direct I/O on HP-UX:

https://www.bobbydurrettdba.com/2013/04/26/db_writer_processes-dbwr_io_slaves-with-no-asynch-io-on-hp-ux/

So, your mileage may vary (YMMV) if you do these same queries on a system with asynchronous writes. Linux filesystems support async writes and on HP-UX our RAC system on ASM supports it. It is one of the challenges of writing blog posts. Other people may be in different situations than I am.

Posted in Uncategorized | 12 Comments

Estimating how much write I/O is not logged

I am trying to figure out how much non-logged write I/O an Oracle database is doing. I want to run an ALTER DATABASE FORCE LOGGING command on the database so that I can use Oracle GoldenGate(GGS) which reads updates from Oracle’s logs. GGS will miss writes that are not logged. But if I turn on force logging it may slow down applications that depend on non-logged writes for good performance. So, I want to find some Oracle performance metrics that give me an idea about how much non-logged write I/O we have so I have an estimate of how much force logging will degrade performance.

I created SQL*Plus and PythonDBAGraphs reports based on DBA_HIST_IOSTAT_FUNCTION that gives some insight into the write I/O that is not logged. Here is the Python based graphical version of the report for one recent weekend:

Possible NOLOGGING Write I/O

The purple-blue line represents Direct Writes. These may or may not be logged. The red-orange line represents writes through the DBWR process. These are always logged. The light green line represents log I/O through the LGWR process. My theory is that if the purple line is above the green line the difference must be write I/O that is not logged. But if the green line is equal to or greater than the purple line you really do not know if there was any write I/O that was not logged. But if there is non-logged write I/O it cannot be more than the amount indicated by the purple line. So, this graph does not directly answer my question about how much write I/O was not logged but it does show some numbers that relate to the question.

I did some experiments with the V$IOSTAT_FUNCTION view that populates DBA_HIST_IOSTAT_FUNCTION to see what values it gives for Direct Writes, DBWR, and LGWR using different scenarios. Here is the zip of these scripts and their output: nologgingscriptsandlogs09122018.zip. I tested four scenarios:

  1. Insert append nologging
  2. Insert append logging
  3. Insert noappend logging
  4. Insert noappend nologging

1 and 2 did Direct Writes. 3 and 4 did DBWR writes. 2, 3, and 4 all did LGWR writes.

Here are the relevant sections of the output that correspond to these statements.

Insert append nologging:

FUNCTION_NAME      WRITE_DIFF_MEGABYTES
------------------ --------------------
Direct Writes                      4660
LGWR                                 46
DBWR                                 27

Insert append logging:

FUNCTION_NAME      WRITE_DIFF_MEGABYTES
------------------ --------------------
LGWR                               4789
Direct Writes                      4661
DBWR                                 37

Insert noappend logging:

FUNCTION_NAME      WRITE_DIFF_MEGABYTES
------------------ --------------------
DBWR                               6192
LGWR                               4528
Direct Writes                         2

Insert noappend nologging:

FUNCTION_NAME      WRITE_DIFF_MEGABYTES
------------------ --------------------
DBWR                               6213
LGWR                               4524
Direct Writes                         2

This pattern is similar to that in a Ask Tom post that I wrote about a while back. That post showed the different situations in which writes were logged or not. I also got some ideas about direct writes and logging from this Oracle support document:

Check For Logging / Nologging On DB Object(s) (Doc ID 269274.1)

It sounds like inserts into tables that go through the normal processing eventually get written to disk by DBWR but inserts with the append hint write directly to the datafiles and may or may not be logged and written out by LGWR.

These tests and documents gave me the idea of building a report and graph based on DBA_HIST_IOSTAT_FUNCTION showing the values for the Direct Writes, DBWR, and LGWR FUNCTION_NAME values. The graph above shows an example of a real system. I was surprised to see how high the DBWR and LGWR values were and how low the Direct Writes were. That made me think that it would be safe to try turning on FORCE LOGGING because it likely will have minimal impact on the overall weekend processing. It gave me enough evidence to push for approval to do a controlled test of enabling FORCE LOGGING in production over an upcoming weekend. I will update this post with the results if we move forward with the test.

Bobby

Posted in Uncategorized | 3 Comments