I’ve been pulling my hair out over this one, so hopefully this post will prove useful to someone else experiencing similar problems with Data Guard traffic.
One of our Cloud hosted environments (IaaS) has an Oracle 11.2.0.4 Data Guard (physical standby) setup on Windows. Recently, the standby database started logging the following errors in it’s alert log:
Fri June 06 08:51:16 2016 RFS[1085]: Assigned to RFS process 8996 RFS[1085]: Opened log for thread 1 sequence 72899 dbid -2002036753 branch 876434118 CORRUPTION DETECTED: In redo blocks starting at block 135169count 2048 for thread 1 sequence 72899 Deleted Oracle managed file H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CMC1VNVP_.ARC RFS[1085]: Possible network disconnect with primary database
The logs were being transported across from the primary site, but the media recovery process was reporting corrupt blocks when trying to apply the archive redo log files, and so recovery stalled.
Validating the archive logs at the primary site showed us that the files were indeed valid at the source (primary):
rman target / validate archivelog sequence 72899; ... List of Archived Logs ===================== Thrd Seq Status Blocks Failing Blocks Examined Name ---- ------- ------ -------------- --------------- --------------- 1 72899 OK 0 350165 H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CM3533SG_.ARC Finished validate at 03-JUN-16
Attempting a dump of the log file contents would also demonstrate whether or not the log file was valid:
ALTER SYSTEM DUMP LOGFILE 'H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CM3533SG_.ARC';
So we know the logs are clean and intact at the primary site, which would suggest that something in the log transport process was corrupting the logs. Further, manually copying the files across, and re-registering would resolve the problem, until the next error occurred (not a sustainable work around):
ALTER DATABASE REGISTER LOGFILE 'H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CM3533SG_.ARC';
Oracle were quite helpful in suggesting we check the firewall(s) to ensure the follow features were disabled:
- SQLNet fixup protocol
- Deep Packet Inspection (DPI)
- SQLNet packet inspection
- SQL Fixup
- SQL ALG (Juniper firewall)
- Oracle DB-control component DOS
After further investigation, it would seem that the Cisco switches being used between our primary and standby sites had “SQL*Net inspection enabled” by default (deep packet inspection). As a result, because we were using the default 1521 listener port, packets were being scanned and reaching the standby site in a malformed/corrupted state.
Disable this feature wasn’t so straight forward unfortunately, so as a work around (and to avoid other 1521 port scanning protocols interfering), I opted to change the Data Guard listener port instead from 1521 to 1528 by adding another listener service:
SID_LIST_LISTENER = (SID_LIST = (SID_DESC = (SID_NAME = CLRExtProc) (ORACLE_HOME = E:\app\oracle\product\11.2.0.4) (PROGRAM = extproc) (ENVS = "EXTPROC_DLLS=ONLY:E:\app\oracle\product\11.2.0.4\bin\oraclr11.dll") ) ) LISTENER = (DESCRIPTION_LIST = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = win02-stby.vbox)(PORT = 1521)) ) ) ADR_BASE_LISTENER = E:\app\oracle # DG listener created to use port 1528, following SQL*Net packet inspection issues SID_LIST_LISTENER_DG = (SID_LIST = (SID_DESC = (GLOBAL_DBNAME = DATAMARTF_DGMGRL) # Data Guard Manager (ORACLE_HOME = E:\app\oracle\product\11.2.0.4) (SID_NAME = SNAPF) ) (SID_DESC = (GLOBAL_DBNAME = SNAPF) # Data Guard Broker Process (ORACLE_HOME = E:\app\oracle\product\11.2.0.4) (SID_NAME = SNAPF) ) ) LISTENER_DG = (DESCRIPTION_LIST = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = win02-stby.vbox)(PORT = 1528)) ) ) ADR_BASE_LISTENER_DG = E:\app\oracle
Sure enough, after starting up the new LISTENER_DG service, the corruption issues disappeared!
NOTE: Don’t forget to also change the port number at your primary site for your Data Guard TNS entries.
References:
MAA Best Practices – Oracle Database
Data Guard Redo Transport & Network Best Practices Oracle Database 10g Release 2
SQL*Net (A.K.A Oracle TNS) And Firewalls
Cisco ASA – Configuring Inspection of Database and Directory Protocols
Great Work. It saved my time 🙂