Background:
Customer upgraded 10204 CRS and ASM to 11202 Grid Infrastructure.
Customer upgraded his 10204 Databases to 11202 Database.
Customer verified all the jobs and application to be working fine.
Customer uninstalled OLD 10204 CRS, ASM & Database home.
Symptoms:
Their nightly export job failed with following error.
UDE-27546: operation generated ORACLE error 27546
ORA-27546: Oracle compiled against IPC interface version 3.2 found version 3.3
Further investigation revealed that even normal connections to 11202 database was failing with following error.
ORA-27546: Oracle compiled against IPC interface version 3.2 found version 3.3
Cause:
Further research revealed that the error was caused due to library mismatch their RAC Nodes for libskgxp10.so in $ORACLE_HOME/lib.
Apparently, this library was found to have different timestamp and size between node 1 and rest of the nodes in cluster which is caused by Bug: 5474623 - DEINSTALL OF ONE ORACLE HOME AFFECTS ANOTHER ORACLE HOME.
As per the bug, OUI picks up the environment ORACLE_HOME variable and cleans up the library file libskgxp10.so from that particular ORACLE_HOME irrespective of which ORACLE_HOME you are uninstalling.
Customer uninstalled 10204 CRS, ASM and DB home while ORACLE_HOME environment variable was set to 11202 Database home.
Solution:
Copy the good copy of libskgxp10.so from any other node (11202 ORACLE_HOME/lib) in the cluster to the problematic node (11202 ORACLE_HOME/lib) from where uninstall was started.
EXPLORING ORACLE
Monday, May 9, 2011
Wednesday, February 9, 2011
/etc/oratab & # line added by Agent
We upgraded our Oracle technology stack from 10204 to 11202. We are running Oracle Grid infrastructure 11202 on 4 node RAC, which has many 11202 database running it.
We use /etc/oratab extensively on all our scripts. For example: backup scripts, application scripts which needs to connect to database etc..
In a RAC environment /etc/oratab only contains the DB_UNIQUE_NAME for cluster database instead of ORACLE_SID And most of our scripts use ORACLE_SID instead of DB_UNIQUE_NAME. As result of this we modifed /etc/oratab to point to ORACLE_SID rather than DB_UNIQUE_NAME. Thiw worked well in 10g.
Since we upgraded to Oracle 11202, we started having new entry in our /etc/oratab with DB_UNIQUE_NAME like below.
TEST:/app/oracle/db/11.2/db_1:N # line added by Agent
PROD:/app/oracle/db/11.2/db_1:N # line added by Agent
DR:/app/oracle/db/11.2/db_1:N # line added by Agent
This cause most of our scripts to fail and as result I modified the /etc/oratab manually and removed those entries and surprisingly in couple of days those entries will come back and our script will fail all over again.
I looked up in google and metalink and found couple of bugs which said that its an expected behaviour but no one explained why this is happening or who updates the /etc/oratab at regular intervals.
Further research revealed that ORACLE agent updates the /etc/oratab everytime you restart the database.
Logfile /app/oracle/grid/11.2/grid_1/log/node01/agent/crsd/oraagent_oracle has the following entries to confirm that.
oraagent_oracle.l01:2011-02-10 08:35:52.745: [ora.test.db][1346505024] {1:35347:18516} [start] sUpdateOratab oratab is /etc/oratab
oraagent_oracle.l01:2011-02-10 08:35:52.760: [ora.test.db][1346505024] {1:35347:18516} [start] ConfigFile::updateInPlace file /etc/oratab is updated
oraagent_oracle.l01:2011-02-10 08:37:06.316: [ora.test.db][3187673408] {1:35347:18560} [start] sUpdateOratab oratab is /etc/oratab
oraagent_oracle.l01:2011-02-10 08:37:06.329: [ora.test.db][3187673408] {1:35347:18560} [start] ConfigFile::updateInPlace file /etc/oratab is updated
Based on that, we can say that,
The oratab file entry is also updated automatically by the oraagent - New process introduced in 11g (not to be confused with Oracle EM agent) when a database started or shutdown.
Oracle creates an entry for each Oracle RAC database in the oratab configuration file Database name or DB_UNIQUE_NAME and not the ORACLE_SID. The oratab file is created by the root.sh script during installation, and it is updated by the Database Configuration Assistant when creating or deleting a database.
We use /etc/oratab extensively on all our scripts. For example: backup scripts, application scripts which needs to connect to database etc..
In a RAC environment /etc/oratab only contains the DB_UNIQUE_NAME for cluster database instead of ORACLE_SID And most of our scripts use ORACLE_SID instead of DB_UNIQUE_NAME. As result of this we modifed /etc/oratab to point to ORACLE_SID rather than DB_UNIQUE_NAME. Thiw worked well in 10g.
Since we upgraded to Oracle 11202, we started having new entry in our /etc/oratab with DB_UNIQUE_NAME like below.
TEST:/app/oracle/db/11.2/db_1:N # line added by Agent
PROD:/app/oracle/db/11.2/db_1:N # line added by Agent
DR:/app/oracle/db/11.2/db_1:N # line added by Agent
This cause most of our scripts to fail and as result I modified the /etc/oratab manually and removed those entries and surprisingly in couple of days those entries will come back and our script will fail all over again.
I looked up in google and metalink and found couple of bugs which said that its an expected behaviour but no one explained why this is happening or who updates the /etc/oratab at regular intervals.
Further research revealed that ORACLE agent updates the /etc/oratab everytime you restart the database.
Logfile /app/oracle/grid/11.2/grid_1/log/node01/agent/crsd/oraagent_oracle has the following entries to confirm that.
oraagent_oracle.l01:2011-02-10 08:35:52.745: [ora.test.db][1346505024] {1:35347:18516} [start] sUpdateOratab oratab is /etc/oratab
oraagent_oracle.l01:2011-02-10 08:35:52.760: [ora.test.db][1346505024] {1:35347:18516} [start] ConfigFile::updateInPlace file /etc/oratab is updated
oraagent_oracle.l01:2011-02-10 08:37:06.316: [ora.test.db][3187673408] {1:35347:18560} [start] sUpdateOratab oratab is /etc/oratab
oraagent_oracle.l01:2011-02-10 08:37:06.329: [ora.test.db][3187673408] {1:35347:18560} [start] ConfigFile::updateInPlace file /etc/oratab is updated
Based on that, we can say that,
The oratab file entry is also updated automatically by the oraagent - New process introduced in 11g (not to be confused with Oracle EM agent) when a database started or shutdown.
Oracle creates an entry for each Oracle RAC database in the oratab configuration file Database name or DB_UNIQUE_NAME and not the ORACLE_SID. The oratab file is created by the root.sh script during installation, and it is updated by the Database Configuration Assistant when creating or deleting a database.
Monday, February 7, 2011
ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
Recently, I was involved in OS rolling upgrade for 3-node RAC running on 10204 with ASMLIB. I had performed quite a few OS rolling upgrades in past so I had my checklist or documentation handy on steps to re-configure ASMLIB.
The upgrade worked fine on node 1 and ASM re-configuration worked fine. However on node 2, after patching OS and re-configuring ASMLIB, asm disk group failed mount with folowing error.
ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
Based on my previous experience, I was positive that this error is related multipathing. Most probably the root cause is ORACLE_SCANORDER parameter in /etc/sysconfig/oracleasm file.
I verified the value and it was set correctly as below.
# ORACLEASM_SCANORDER: Matching patterns to order disk scanning
ORACLEASM_SCANORDER="dm"
My asm commands, querydisk, listdisk, scandisk were working fine.
However, ASM diskgroup failed to mount with above errors.
To further investigate, I checked /dev/oracleasm/disks and /proc/partitions
ls -l /dev/oracleasm/disks
total 0
brw-rw---- 1 oracle dba 65, 176 Feb 6 17:04 TESTORA2_ASM1
brw-rw---- 1 oracle dba 65, 192 Feb 6 17:04 TESTORA2_ASM2
brw-rw---- 1 oracle dba 65, 208 Feb 6 17:04 TESTORA2_ASM3
brw-rw---- 1 oracle dba 65, 224 Feb 6 17:04 TESTORA2_ASM4
cat /proc/partitions
....
....
....
65 176 209715200 sdab
65 192 209715200 sdac
65 208 209715200 sdad
65 224 209715200 sdae
....
....
....
253 9 209715200 dm-9
253 10 209715200 dm-10
253 11 209715200 dm-11
253 12 209715200 dm-12
....
....
....
If you compare above to output you can see that my ls -l /dev/oracleasm/disks output (major, minor) matches to cat /proc/paritions outpu (major, minor) sd* disks and not dm-* disks. Even though my ORACLEASM_SCANORDER="dm" set correctly, ASM is not looking at the multipath disks but a single disks.
I tried restarting ASM service couple of times and it didn't help. I tried changing the ORACLEASM_SCANORDER="sd" as well and still no luck. Based on this, I thought for some reason ASMLIB was not reading the /etc/sysconfig/oracleasm file at all.
I decided to raise the SR with Oracle support and luckily I got a call from very knowlegeble senior support analyst and he confirmed that using the strace on /etc/init.d/oracleasm start command.
In strace it was visible that instead of reading the /etc/sysconfig/oracleasm file it was reading the file called oracleasm-_dev_oracleasm.
We ran ls command on other sever to compare with this server.
On working server
------------------
ls -l /etc/sysconfig/oracleasm*
lrwxrwxrwx 1 root root 24 Feb 6 16:57 oracleasm -> oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 776 Feb 6 17:00 oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 863 May 12 2010 oracleasm.rpmsave
On a Problematic Server
------------------------
ls -l oracleasm*
-rw-r--r-- 1 root root 24 Feb 6 16:57 oracleasm
-rw-r--r-- 1 root root 776 Feb 6 17:00 oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 863 May 12 2010 oracleasm.rpmsave
You can see that a on a problematic server, oracleasm is a text file instead of symbolic link. Further investigation revealed that while cleanup of OLD ASMLIB, file oracleasm-_dev_oracleasm was left out and due to this new installtion failed to create the symbolic link.
I recreated the symbolic link manually and changed the ORACLEASM_SCANORDER="dm" and restarted the ASM service and all started working properly.
Hope this helps...
The upgrade worked fine on node 1 and ASM re-configuration worked fine. However on node 2, after patching OS and re-configuring ASMLIB, asm disk group failed mount with folowing error.
ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
Based on my previous experience, I was positive that this error is related multipathing. Most probably the root cause is ORACLE_SCANORDER parameter in /etc/sysconfig/oracleasm file.
I verified the value and it was set correctly as below.
# ORACLEASM_SCANORDER: Matching patterns to order disk scanning
ORACLEASM_SCANORDER="dm"
My asm commands, querydisk, listdisk, scandisk were working fine.
However, ASM diskgroup failed to mount with above errors.
To further investigate, I checked /dev/oracleasm/disks and /proc/partitions
ls -l /dev/oracleasm/disks
total 0
brw-rw---- 1 oracle dba 65, 176 Feb 6 17:04 TESTORA2_ASM1
brw-rw---- 1 oracle dba 65, 192 Feb 6 17:04 TESTORA2_ASM2
brw-rw---- 1 oracle dba 65, 208 Feb 6 17:04 TESTORA2_ASM3
brw-rw---- 1 oracle dba 65, 224 Feb 6 17:04 TESTORA2_ASM4
cat /proc/partitions
....
....
....
65 176 209715200 sdab
65 192 209715200 sdac
65 208 209715200 sdad
65 224 209715200 sdae
....
....
....
253 9 209715200 dm-9
253 10 209715200 dm-10
253 11 209715200 dm-11
253 12 209715200 dm-12
....
....
....
If you compare above to output you can see that my ls -l /dev/oracleasm/disks output (major, minor) matches to cat /proc/paritions outpu (major, minor) sd* disks and not dm-* disks. Even though my ORACLEASM_SCANORDER="dm" set correctly, ASM is not looking at the multipath disks but a single disks.
I tried restarting ASM service couple of times and it didn't help. I tried changing the ORACLEASM_SCANORDER="sd" as well and still no luck. Based on this, I thought for some reason ASMLIB was not reading the /etc/sysconfig/oracleasm file at all.
I decided to raise the SR with Oracle support and luckily I got a call from very knowlegeble senior support analyst and he confirmed that using the strace on /etc/init.d/oracleasm start command.
In strace it was visible that instead of reading the /etc/sysconfig/oracleasm file it was reading the file called oracleasm-_dev_oracleasm.
We ran ls command on other sever to compare with this server.
On working server
------------------
ls -l /etc/sysconfig/oracleasm*
lrwxrwxrwx 1 root root 24 Feb 6 16:57 oracleasm -> oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 776 Feb 6 17:00 oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 863 May 12 2010 oracleasm.rpmsave
On a Problematic Server
------------------------
ls -l oracleasm*
-rw-r--r-- 1 root root 24 Feb 6 16:57 oracleasm
-rw-r--r-- 1 root root 776 Feb 6 17:00 oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 863 May 12 2010 oracleasm.rpmsave
You can see that a on a problematic server, oracleasm is a text file instead of symbolic link. Further investigation revealed that while cleanup of OLD ASMLIB, file oracleasm-_dev_oracleasm was left out and due to this new installtion failed to create the symbolic link.
I recreated the symbolic link manually and changed the ORACLEASM_SCANORDER="dm" and restarted the ASM service and all started working properly.
Hope this helps...
Monday, January 17, 2011
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS Failed with ORA-01408: such column list already indexed
Recently I was engaged in exercise to partition a table in a database. I was using a DBMS_REDEFINITION package to partition the online.
As per documentation,
There may be times when you want to override the actions of DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS( ). For example, an index must be moved to a different tablespace. This can be achieved by making the relevant changes on the interim table, and then using DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT() to register the new version of the object. Then when DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS( ) is executed, it will ignore that object, in this case the index, which has already been defined.
However, during my test, I realized that this is not entirely true. It complained about my index which was supporting my primary key constraint.
Let's see that in following example:
Creating Source Table
---------------------
SQL> create table t1 as select * from all_users;
Table created.
SQL> alter table t1 add constraint t1_pk primary key(user_id);
Table altered.
SQL> select index_name from user_indexes;
INDEX_NAME
-----------------------------
T1_PK
Creating Interim Partitioned Table
----------------------------------
SQL> create table t2 partition by hash(user_id) partitions 8 as select * from t1 where 1=2;
Table created.
Moving the tablespace of INDEX on interim Table
-----------------------------------------------
SQL> create index t2_Pk on t2 (user_id) tablespace system;
Index created.
Start online re-definition
---------------------------
SQL> exec dbms_redefinition.can_redef_table( user, 'T1' );
PL/SQL procedure successfully completed.
SQL> exec dbms_redefinition.start_redef_table( user, 'T1', 'T2' );
PL/SQL procedure successfully completed.
SQL> EXEC DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT('TEST', 'T1', 'T2', DBMS_REDEFINITION.CONS_INDEX, 'TEST', 'T1_PK', 'T2_PK');
PL/SQL procedure successfully completed.
SQL> SET SERVEROUTPUT ON;
VARIABLE NERRORS NUMBER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS ( 'TEST', 'T1', 'T2', 1, TRUE, TRUE, TRUE , FALSE, NUM_ERRORS => :NERRORS );
END;
/
BEGIN
*
ERROR at line 1:
ORA-01408: such column list already indexed
ORA-06512: at "SYS.DBMS_REDEFINITION", line 984
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1727
ORA-06512: at line 2
Following query confirms that
------------------------------
SQL> select * From dba_redefinition_errors;
OBJECT_TYP OBJECT_OWNER OBJECT_NAME
---------- ------------------------------ ------------------------------
BASE_TABLE_OWNER BASE_TABLE_NAME
------------------------------ ------------------------------
DDL_TXT
--------------------------------------------------------------------------------
CONSTRAINT TEST T1_PK
TEST T1
ALTER TABLE "TEST"."T2" ADD CONSTRAINT "TMP$$_T1_PK0" PRIMARY KEY ("USER_ID")
Removing the Index
------------------
SQL> BEGIN
DBMS_REDEFINITION.unregister_dependent_object ( 'TEST', 'T1', 'T2', DBMS_REDEFINITION.CONS_INDEX, 'TEST', 'T1_PK', 'T2_PK');
END;
/
PL/SQL procedure successfully completed.
SQL> DROP INDEX T2_PK;
Index dropped.
No re-try the copy constraints
------------------------------
SQL> SET SERVEROUTPUT ON;
VARIABLE NERRORS NUMBER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS ( 'TEST', 'T1', 'T2', 1, TRUE, TRUE, TRUE , FALSE, NUM_ERRORS => :NERRORS );
END;
/
PL/SQL procedure successfully completed.
SQL> exec dbms_redefinition.finish_redef_table( user, 'T1', 'T2' );
PL/SQL procedure successfully completed.
Hope this helps...
As per documentation,
There may be times when you want to override the actions of DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS( ). For example, an index must be moved to a different tablespace. This can be achieved by making the relevant changes on the interim table, and then using DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT() to register the new version of the object. Then when DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS( ) is executed, it will ignore that object, in this case the index, which has already been defined.
However, during my test, I realized that this is not entirely true. It complained about my index which was supporting my primary key constraint.
Let's see that in following example:
Creating Source Table
---------------------
SQL> create table t1 as select * from all_users;
Table created.
SQL> alter table t1 add constraint t1_pk primary key(user_id);
Table altered.
SQL> select index_name from user_indexes;
INDEX_NAME
-----------------------------
T1_PK
Creating Interim Partitioned Table
----------------------------------
SQL> create table t2 partition by hash(user_id) partitions 8 as select * from t1 where 1=2;
Table created.
Moving the tablespace of INDEX on interim Table
-----------------------------------------------
SQL> create index t2_Pk on t2 (user_id) tablespace system;
Index created.
Start online re-definition
---------------------------
SQL> exec dbms_redefinition.can_redef_table( user, 'T1' );
PL/SQL procedure successfully completed.
SQL> exec dbms_redefinition.start_redef_table( user, 'T1', 'T2' );
PL/SQL procedure successfully completed.
SQL> EXEC DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT('TEST', 'T1', 'T2', DBMS_REDEFINITION.CONS_INDEX, 'TEST', 'T1_PK', 'T2_PK');
PL/SQL procedure successfully completed.
SQL> SET SERVEROUTPUT ON;
VARIABLE NERRORS NUMBER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS ( 'TEST', 'T1', 'T2', 1, TRUE, TRUE, TRUE , FALSE, NUM_ERRORS => :NERRORS );
END;
/
BEGIN
*
ERROR at line 1:
ORA-01408: such column list already indexed
ORA-06512: at "SYS.DBMS_REDEFINITION", line 984
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1727
ORA-06512: at line 2
Following query confirms that
------------------------------
SQL> select * From dba_redefinition_errors;
OBJECT_TYP OBJECT_OWNER OBJECT_NAME
---------- ------------------------------ ------------------------------
BASE_TABLE_OWNER BASE_TABLE_NAME
------------------------------ ------------------------------
DDL_TXT
--------------------------------------------------------------------------------
CONSTRAINT TEST T1_PK
TEST T1
ALTER TABLE "TEST"."T2" ADD CONSTRAINT "TMP$$_T1_PK0" PRIMARY KEY ("USER_ID")
Removing the Index
------------------
SQL> BEGIN
DBMS_REDEFINITION.unregister_dependent_object ( 'TEST', 'T1', 'T2', DBMS_REDEFINITION.CONS_INDEX, 'TEST', 'T1_PK', 'T2_PK');
END;
/
PL/SQL procedure successfully completed.
SQL> DROP INDEX T2_PK;
Index dropped.
No re-try the copy constraints
------------------------------
SQL> SET SERVEROUTPUT ON;
VARIABLE NERRORS NUMBER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS ( 'TEST', 'T1', 'T2', 1, TRUE, TRUE, TRUE , FALSE, NUM_ERRORS => :NERRORS );
END;
/
PL/SQL procedure successfully completed.
SQL> exec dbms_redefinition.finish_redef_table( user, 'T1', 'T2' );
PL/SQL procedure successfully completed.
Hope this helps...
Thursday, January 13, 2011
Drop tablespace waiting on 'reliable message'
Lats week, I droped a tablespace from production database.
SQL> select count(*) from dba_segments where tablespace_name='TEST';
COUNT(*)
----------
0
SQL> drop tablespace TEST including contents and datafiles;
Its just hung there for more than 2 hours.
I looked at the alert log...
Wed Jan 12 11:10:39 2011
drop tablespace test including contents and datafiles
Wed Jan 12 11:11:52 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.339.666449433
Wed Jan 12 11:12:12 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.1104.672507107
Wed Jan 12 11:12:29 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.1838.684692483
Wed Jan 12 11:12:45 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.1955.691083671
Wed Jan 12 11:39:48 2011
Thread 1 advanced to log sequence 94797 (LGWR switch)
Current log# 7 seq# 94797 mem# 0: +DGROUP2/PROD/redo07a.dbf
Wed Jan 12 11:39:50 2011
LNS: Standby redo logfile selected for thread 1 sequence 94797 for destination LOG_ARCHIVE_DEST_2
Wed Jan 12 11:42:35 2011
Looks like all the files have been dropped physically from ASM.
Checked the database...
SQL> select * from v$tablespace where name='TEST';
no rows selected
SQL> select file_name from dba_data_files where tablespace_name='TEST';
no rows selected
In spite of all this, my initial sqlplus session with drop tablespace was still hung...
Looking V$session_wait it was waiting for 'reliable message' for last 2 hours.
SQL> select event, p1, p2, state from v$session_wait where sid=836;
EVENT P1 P2 STATE
---------------------------------------------------------------- ----------
reliable message 6852580008 6968627176 WAITING
SQL> /
reliable message 6852580008 6968627176 WAITING
SQL> /
reliable message 6852580008 6968627176 WAITING
SQL> /
reliable message 6852580008 6968627176 WAITING
I gogled a bit, couldn't find anything. Metalink search got me few hits though...
Bug 6148054 RAC hang waiting for "reliable message"
Bug 4635062 drop tablespace session waiting for reliable message
Though, this confirmed that I was hitting either of the oracle bug, none of them had a solution on what to do for exisitng hung session. One off the bug did had a suggestion to kill the session from other session using alter system command. However, there wasn't enough information on if that resolved the issue or not.
AS I had no other option, I killed my drop tablespace session from other session using alter system command and drop tablespace worked.
SQL> select count(*) from dba_segments where tablespace_name='TEST';
COUNT(*)
----------
0
SQL> drop tablespace TEST including contents and datafiles;
Its just hung there for more than 2 hours.
I looked at the alert log...
Wed Jan 12 11:10:39 2011
drop tablespace test including contents and datafiles
Wed Jan 12 11:11:52 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.339.666449433
Wed Jan 12 11:12:12 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.1104.672507107
Wed Jan 12 11:12:29 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.1838.684692483
Wed Jan 12 11:12:45 2011
Deleted Oracle managed file +DGROUP1/PROD/datafile/test.1955.691083671
Wed Jan 12 11:39:48 2011
Thread 1 advanced to log sequence 94797 (LGWR switch)
Current log# 7 seq# 94797 mem# 0: +DGROUP2/PROD/redo07a.dbf
Wed Jan 12 11:39:50 2011
LNS: Standby redo logfile selected for thread 1 sequence 94797 for destination LOG_ARCHIVE_DEST_2
Wed Jan 12 11:42:35 2011
Looks like all the files have been dropped physically from ASM.
Checked the database...
SQL> select * from v$tablespace where name='TEST';
no rows selected
SQL> select file_name from dba_data_files where tablespace_name='TEST';
no rows selected
In spite of all this, my initial sqlplus session with drop tablespace was still hung...
Looking V$session_wait it was waiting for 'reliable message' for last 2 hours.
SQL> select event, p1, p2, state from v$session_wait where sid=836;
EVENT P1 P2 STATE
---------------------------------------------------------------- ----------
reliable message 6852580008 6968627176 WAITING
SQL> /
reliable message 6852580008 6968627176 WAITING
SQL> /
reliable message 6852580008 6968627176 WAITING
SQL> /
reliable message 6852580008 6968627176 WAITING
I gogled a bit, couldn't find anything. Metalink search got me few hits though...
Bug 6148054 RAC hang waiting for "reliable message"
Bug 4635062 drop tablespace session waiting for reliable message
Though, this confirmed that I was hitting either of the oracle bug, none of them had a solution on what to do for exisitng hung session. One off the bug did had a suggestion to kill the session from other session using alter system command. However, there wasn't enough information on if that resolved the issue or not.
AS I had no other option, I killed my drop tablespace session from other session using alter system command and drop tablespace worked.
Watch out when using RMAN custom backup piece naming convention via FORMAT
Yesterday, I was asked to investigate the backup restore issue of production database for one off my client. Customer was taking incremental level 0 backups on Sunday and cumulative incremental level 1 backups on Monday-Saturday. Customer had successful backup for all one month, no errors in the backup log files, RMAN Catalog list valid backups available for last one month.
Customer was trying to restore and recover a backup as of Monday morning. In spite successful incremental level 0 backup on Sunday night, restore as of Monday morning was failing. It was complaining about backup being not found for certain files in backup set of Sunday and it was trying to read the backup piece of previous Sunday and for some files 4 to 6 weeks old backup piece.
Customer had no clue on why RMAN was able to restore few files from last night (Sunday) incremental level 0 backup however, trying to read a backup piece 3 to 4 weeks old for other files.
Customer was using the following script to perform the backup.
run
{
allocate channel tdp1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
allocate channel tdp2 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
BACKUP INCREMENTAL LEVEL 0 DATABASE FORMAT='DB_LVL0_%d_%T%t_%p_%c' PLUS ARCHIVELOG NOT BACKED UP 2 TIMES FORMAT='LG_%d_%T%t_%p_%c';
}
Then I looked at the backup log files…
Starting backup at 10-JAN-2011 05:18:50
channel tdp1: starting incremental level 0 datafile backupset
channel tdp1: specifying datafile(s) in backupset
input datafile fno=00006 name=+DGROUP1/prod/datafile/PROD_data.1138.704640745
input datafile fno=00007 name=+DGROUP1/prod/datafile/PROD_index.1147.704640829
input datafile fno=00014 name=+DGROUP1/prod/datafile/PROD_data.2100.725724753
input datafile fno=00002 name=+DGROUP1/prod/datafile/undotbs1.1067.704635111
input datafile fno=00010 name=+DGROUP1/prod/datafile/PROD_data.1111.719594721
input datafile fno=00013 name=+DGROUP1/prod/datafile/PROD_data.1278.721149489
input datafile fno=00005 name=+DGROUP1/prod/datafile/users.1210.704635119
input datafile fno=00016 name=+DGROUP1/prod/datafile/PROD_index.1435.726672149
channel tdp1: starting piece 1 at 10-JAN-2011 05:18:51
channel tdp2: starting incremental level 0 datafile backupset
channel tdp2: specifying datafile(s) in backupset
input datafile fno=00004 name=+DGROUP1/prod/datafile/undotbs2.2251.704635119
input datafile fno=00015 name=+DGROUP1/prod/datafile/PROD_data.1080.725725677
input datafile fno=00017 name=+DGROUP1/prod/datafile/PROD_index.1090.734796557
input datafile fno=00012 name=+DGROUP1/prod/datafile/PROD_data.1951.721149479
input datafile fno=00009 name=+DGROUP1/prod/datafile/PROD_data.395.717544159
input datafile fno=00011 name=+DGROUP1/prod/datafile/PROD_data.1162.719594733
input datafile fno=00003 name=+DGROUP1/prod/datafile/sysaux.1922.704635113
input datafile fno=00001 name=+DGROUP1/prod/datafile/system.1752.704635103
input datafile fno=00008 name=+DGROUP1/prod/datafile/drsys.770.712926249
channel tdp2: starting piece 1 at 10-JAN-2011 05:18:51
channel tdp2: finished piece 1 at 10-JAN-2011 05:49:39
piece handle=DB_LVL0_prod_20110110740035131_1_1 tag=TAG20110110T051851 comment=API Version 2.0,MMS Version 5.5.1.0
channel tdp2: backup set complete, elapsed time: 00:30:48
channel tdp1: finished piece 1 at 10-JAN-2011 05:56:45
piece handle=DB_LVL0_prod_20110110740035131_1_1 tag=TAG20110110T051851 comment=API Version 2.0,MMS Version 5.5.1.0
channel tdp1: backup set complete, elapsed time: 00:37:54
Finished backup at 10-JAN-2011 05:56:45
If you look at the log files very carefully, it allocated 2 channels and distributes the all the datafiles in 2 sets and starts the backups. However, if you look at the backup piece name for both channels, it’s exactly same. As a result, when which ever channel finished lasts it overwrites the backup taken by earlier channel. So if you look at the above output, channel 1 td1 and channel 2 tdp2 started the backup but the channel 2 finished backup first with the name DB_LVL0_PROD_20110110740035131_1_1. Later on when channel 1 finished the backup it reused the same backup piece name DB_LVL0_PROD_20110110740035131_1_1. Due to this all the files backed by channel 2 tdp2 are completely lost. And the surprisingly, RMAN doesn’t complain at all about this. RMAN Backup log file looks clear with return code 0.
However, when it comes to restore, RMAN can only restore the set of files backup by channel 1 tdp1 which finished last and its has no backup available for files backed by channel 2 tdp2.
Let’s confirm this by looking at RMAN catalog output. We already saw RMAN backup log it says it backed up all datafiles without any error.
Let’s run Report Schema to list the datafiles in database.
RMAN> report schema;
using target database control file instead of recovery catalog
Report of database schema
List of Permanent Datafiles
===========================
File Size(MB) Tablespace RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1 530 SYSTEM *** +DGROUP1/prod/datafile/system.1752.704635103
2 13720 UNDOTBS1 *** +DGROUP1/prod/datafile/undotbs1.1067.704635111
3 970 SYSAUX *** +DGROUP1/prod/datafile/sysaux.1922.704635113
4 23395 UNDOTBS2 *** +DGROUP1/prod/datafile/undotbs2.2251.704635119
5 5 USERS *** +DGROUP1/prod/datafile/users.1210.704635119
6 20480 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1138.704640745
7 18960 PROD_INDEX *** +DGROUP1/prod/datafile/PROD_index.1147.704640829
8 20 DRSYS *** +DGROUP1/prod/datafile/drsys.770.712926249
9 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.395.717544159
10 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1111.719594721
11 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1162.719594733
12 8196 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1951.721149479
13 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1278.721149489
14 18692 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.2100.725724753
15 18992 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1080.725725677
16 8192 PROD_INDEX *** +DGROUP1/prod/datafile/PROD_index.1435.726672149
17 16384 PROD_INDEX *** +DGROUP1/prod/datafile/PROD_index.1090.734796557
List of Temporary Files
=======================
File Size(MB) Tablespace Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1 15360 TEMP 15360 +DGROUP1/prod/tempfile/temp.309.704635115
2 8192 TEMP 8192 +DGROUP1/prod/tempfile/temp.1141.735345399
Now let’s grab the backup tag from backup log file and list the backup piece and datafiles in that backup tag.
RMAN> list backup tag TAG20110110T051851;
List of Backup Sets
===================
BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ --------------------
5169802 Incr 0 93.70G SBT_TAPE 00:37:50 10-JAN-2011 05:56:41
BP Key: 5169806 Status: AVAILABLE Compressed: NO Tag: TAG20110110T051851
Handle: DB_LVL0_prod_20110110740035131_1_1 Media:
List of Datafiles in backup set 5169802
File LV Type Ckp SCN Ckp Time Name
---- -- ---- ---------- -------------------- ----
2 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/undotbs1.1067.704635111
5 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/users.1210.704635119
6 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.1138.704640745
7 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_index.1147.704640829
10 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.1111.719594721
13 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.1278.721149489
14 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.2100.725724753
16 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_index.1435.726672149
RMAN>
From the above output you can clearly see that there set of datafiles missing in the backup piece. That’s why customer was able to restore set of files from recent backup and for rest of the files RMAN was looking for backup piece which 3 to 4 weeks old.
So be cautious when you are using customized RMAN backup piece name using format keyword. Make sure you use %u which guarantees unique name for your backup piece.
Customer was trying to restore and recover a backup as of Monday morning. In spite successful incremental level 0 backup on Sunday night, restore as of Monday morning was failing. It was complaining about backup being not found for certain files in backup set of Sunday and it was trying to read the backup piece of previous Sunday and for some files 4 to 6 weeks old backup piece.
Customer had no clue on why RMAN was able to restore few files from last night (Sunday) incremental level 0 backup however, trying to read a backup piece 3 to 4 weeks old for other files.
Customer was using the following script to perform the backup.
run
{
allocate channel tdp1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
allocate channel tdp2 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
BACKUP INCREMENTAL LEVEL 0 DATABASE FORMAT='DB_LVL0_%d_%T%t_%p_%c' PLUS ARCHIVELOG NOT BACKED UP 2 TIMES FORMAT='LG_%d_%T%t_%p_%c';
}
Then I looked at the backup log files…
Starting backup at 10-JAN-2011 05:18:50
channel tdp1: starting incremental level 0 datafile backupset
channel tdp1: specifying datafile(s) in backupset
input datafile fno=00006 name=+DGROUP1/prod/datafile/PROD_data.1138.704640745
input datafile fno=00007 name=+DGROUP1/prod/datafile/PROD_index.1147.704640829
input datafile fno=00014 name=+DGROUP1/prod/datafile/PROD_data.2100.725724753
input datafile fno=00002 name=+DGROUP1/prod/datafile/undotbs1.1067.704635111
input datafile fno=00010 name=+DGROUP1/prod/datafile/PROD_data.1111.719594721
input datafile fno=00013 name=+DGROUP1/prod/datafile/PROD_data.1278.721149489
input datafile fno=00005 name=+DGROUP1/prod/datafile/users.1210.704635119
input datafile fno=00016 name=+DGROUP1/prod/datafile/PROD_index.1435.726672149
channel tdp1: starting piece 1 at 10-JAN-2011 05:18:51
channel tdp2: starting incremental level 0 datafile backupset
channel tdp2: specifying datafile(s) in backupset
input datafile fno=00004 name=+DGROUP1/prod/datafile/undotbs2.2251.704635119
input datafile fno=00015 name=+DGROUP1/prod/datafile/PROD_data.1080.725725677
input datafile fno=00017 name=+DGROUP1/prod/datafile/PROD_index.1090.734796557
input datafile fno=00012 name=+DGROUP1/prod/datafile/PROD_data.1951.721149479
input datafile fno=00009 name=+DGROUP1/prod/datafile/PROD_data.395.717544159
input datafile fno=00011 name=+DGROUP1/prod/datafile/PROD_data.1162.719594733
input datafile fno=00003 name=+DGROUP1/prod/datafile/sysaux.1922.704635113
input datafile fno=00001 name=+DGROUP1/prod/datafile/system.1752.704635103
input datafile fno=00008 name=+DGROUP1/prod/datafile/drsys.770.712926249
channel tdp2: starting piece 1 at 10-JAN-2011 05:18:51
channel tdp2: finished piece 1 at 10-JAN-2011 05:49:39
piece handle=DB_LVL0_prod_20110110740035131_1_1 tag=TAG20110110T051851 comment=API Version 2.0,MMS Version 5.5.1.0
channel tdp2: backup set complete, elapsed time: 00:30:48
channel tdp1: finished piece 1 at 10-JAN-2011 05:56:45
piece handle=DB_LVL0_prod_20110110740035131_1_1 tag=TAG20110110T051851 comment=API Version 2.0,MMS Version 5.5.1.0
channel tdp1: backup set complete, elapsed time: 00:37:54
Finished backup at 10-JAN-2011 05:56:45
If you look at the log files very carefully, it allocated 2 channels and distributes the all the datafiles in 2 sets and starts the backups. However, if you look at the backup piece name for both channels, it’s exactly same. As a result, when which ever channel finished lasts it overwrites the backup taken by earlier channel. So if you look at the above output, channel 1 td1 and channel 2 tdp2 started the backup but the channel 2 finished backup first with the name DB_LVL0_PROD_20110110740035131_1_1. Later on when channel 1 finished the backup it reused the same backup piece name DB_LVL0_PROD_20110110740035131_1_1. Due to this all the files backed by channel 2 tdp2 are completely lost. And the surprisingly, RMAN doesn’t complain at all about this. RMAN Backup log file looks clear with return code 0.
However, when it comes to restore, RMAN can only restore the set of files backup by channel 1 tdp1 which finished last and its has no backup available for files backed by channel 2 tdp2.
Let’s confirm this by looking at RMAN catalog output. We already saw RMAN backup log it says it backed up all datafiles without any error.
Let’s run Report Schema to list the datafiles in database.
RMAN> report schema;
using target database control file instead of recovery catalog
Report of database schema
List of Permanent Datafiles
===========================
File Size(MB) Tablespace RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1 530 SYSTEM *** +DGROUP1/prod/datafile/system.1752.704635103
2 13720 UNDOTBS1 *** +DGROUP1/prod/datafile/undotbs1.1067.704635111
3 970 SYSAUX *** +DGROUP1/prod/datafile/sysaux.1922.704635113
4 23395 UNDOTBS2 *** +DGROUP1/prod/datafile/undotbs2.2251.704635119
5 5 USERS *** +DGROUP1/prod/datafile/users.1210.704635119
6 20480 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1138.704640745
7 18960 PROD_INDEX *** +DGROUP1/prod/datafile/PROD_index.1147.704640829
8 20 DRSYS *** +DGROUP1/prod/datafile/drsys.770.712926249
9 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.395.717544159
10 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1111.719594721
11 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1162.719594733
12 8196 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1951.721149479
13 8192 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1278.721149489
14 18692 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.2100.725724753
15 18992 PROD_DATA *** +DGROUP1/prod/datafile/PROD_data.1080.725725677
16 8192 PROD_INDEX *** +DGROUP1/prod/datafile/PROD_index.1435.726672149
17 16384 PROD_INDEX *** +DGROUP1/prod/datafile/PROD_index.1090.734796557
List of Temporary Files
=======================
File Size(MB) Tablespace Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1 15360 TEMP 15360 +DGROUP1/prod/tempfile/temp.309.704635115
2 8192 TEMP 8192 +DGROUP1/prod/tempfile/temp.1141.735345399
Now let’s grab the backup tag from backup log file and list the backup piece and datafiles in that backup tag.
RMAN> list backup tag TAG20110110T051851;
List of Backup Sets
===================
BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ --------------------
5169802 Incr 0 93.70G SBT_TAPE 00:37:50 10-JAN-2011 05:56:41
BP Key: 5169806 Status: AVAILABLE Compressed: NO Tag: TAG20110110T051851
Handle: DB_LVL0_prod_20110110740035131_1_1 Media:
List of Datafiles in backup set 5169802
File LV Type Ckp SCN Ckp Time Name
---- -- ---- ---------- -------------------- ----
2 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/undotbs1.1067.704635111
5 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/users.1210.704635119
6 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.1138.704640745
7 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_index.1147.704640829
10 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.1111.719594721
13 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.1278.721149489
14 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_data.2100.725724753
16 0 Incr 7310305942 10-JAN-2011 05:18:51 +DGROUP1/prod/datafile/norkom_index.1435.726672149
RMAN>
From the above output you can clearly see that there set of datafiles missing in the backup piece. That’s why customer was able to restore set of files from recent backup and for rest of the files RMAN was looking for backup piece which 3 to 4 weeks old.
So be cautious when you are using customized RMAN backup piece name using format keyword. Make sure you use %u which guarantees unique name for your backup piece.
Tuesday, August 17, 2010
ORA-31619: invalid dump file for IMPDP
We had a EXPDP script to export the application schema on daily basis.
We were writing data pump files to WINDOWS files system mounted to LINUX system with CIFS option. We used parallel=2 and generated 2 files in parallel.
EXPDP always worked fine without any issues.
Last night I had to restore the EXPORT in DEV invironment and the IMPDP failed with following error.
ORA-39001: invalid argument value
ORA-39000: bad dump file specification
ORA-31619: invalid dump file "/app/edws/fileload/expdp/SYSMAN_201.dmp"
After breaking my head for a while, I found that if EXPDP has parallel set > 1 and if you write to 2 multiple files on CIFS mounted file system, it corrupts the file.
Your IMPDP throws above error. I got mixed resulst from Metalink.
Bug # 8313127 says CIFS file system is not supported for data pump.
Note: 444809.1 says its supported but not certified.
Based on my test, you write to one file, you are still safe. But if you use parallel option, you are files are corrupted and can't be used for IMPORT.
So for those who are using CIFS mount please be aware of that.
We were writing data pump files to WINDOWS files system mounted to LINUX system with CIFS option. We used parallel=2 and generated 2 files in parallel.
EXPDP always worked fine without any issues.
Last night I had to restore the EXPORT in DEV invironment and the IMPDP failed with following error.
ORA-39001: invalid argument value
ORA-39000: bad dump file specification
ORA-31619: invalid dump file "/app/edws/fileload/expdp/SYSMAN_201.dmp"
After breaking my head for a while, I found that if EXPDP has parallel set > 1 and if you write to 2 multiple files on CIFS mounted file system, it corrupts the file.
Your IMPDP throws above error. I got mixed resulst from Metalink.
Bug # 8313127 says CIFS file system is not supported for data pump.
Note: 444809.1 says its supported but not certified.
Based on my test, you write to one file, you are still safe. But if you use parallel option, you are files are corrupted and can't be used for IMPORT.
So for those who are using CIFS mount please be aware of that.
Subscribe to:
Posts (Atom)