Wednesday, July 7, 2021

AQ Queue Monitor Does Not Change Message MSG_STATE from WAIT to READY on RAC

In RAC database, the Advanced Queue (AQ) message MSG_STATE (column of view AQ$<QUEUE_TABLE>) stays with value 'WAIT' and will not be changed to 'READY' any more. It usually happens after database was shut down with immediate/abort option followed by database patching/upgrading. 

The symptom looks as following
select MSG_ID,MSG_STATE,DELAY,ENQ_TIME
  from USER01.AQ$MYAPP_QUEUE
 where MSG_STATE='WAIT'
order by ENQ_TIME;

MSG_ID                           MSG_STATE        DELAY               ENQ_TIME
-------------------------------- ---------- ------------------- -------------------
C5F90797E0818689E0540010E0BB9924 WAIT       2021-06-30 05:06:23 2021-06-30 05:05:53
C5FB6A538C42CA36E0540010E0BB9924 WAIT       2021-06-30 08:12:34 2021-06-30 08:12:04
C60217EE4A871694E0540010E0BB9924 WAIT       2021-06-30 15:55:12 2021-06-30 15:54:42
C66133CE2AEFFB54E0540010E0BB94ED WAIT       2021-07-05 09:23:21 2021-07-05 09:22:51
C66D26A0AB5F8801E0540010E0BB9924 WAIT       2021-07-05 23:38:40 2021-07-05 23:38:10
C673D842FA79BB8EE0540010E0BB94ED WAIT       2021-07-06 07:37:50 2021-07-06 07:37:20
C6812C907420EEA0E0540010E0BB94ED WAIT       2021-07-06 23:33:59 2021-07-06 23:33:29
C6812C90741FEEA0E0540010E0BB94ED WAIT       2021-07-06 23:33:59 2021-07-06 23:33:29
C681902A5305101EE0540010E0BB94ED WAIT       2021-07-06 23:59:50 2021-07-06 23:59:20
C681A3D7FCE14DB5E0540010E0BB9924 WAIT       2021-07-07 00:05:20 2021-07-07 00:04:50
<

Queue table "MYAPP_QUEUE" is AQ queue table owned by user "USER01" and view "AQ$MYAPP_QUEUE" is created when creating queue table "MYAPP_QUEUE" and returns read only data from "MYAPP_QUEUE" with more meaningful/readable values for columns (i.e. MSG_STATE value 'WAIT' & 'READY' in view "AQ$MYAPP_QUEUE" instead of number 1 & 0 in column STATE of table "MYAPP_QUEUE").

From previous query, we can find that the first "WAIT" message was enqueued at 5:05am June 30, 2021 (column ENQ_TIME). Check alert log and trace file generated/created around that time.

It is found in alert log on first node,
2021-06-29 15:19:08.274000 -04:00
Shutting down ORACLE instance (immediate) (OS id: 33517)
Shutdown is initiated by oraagent.bin@rac01.dbaplus.ca (TNS V1-V.
Stopping background process SMCO
2021-06-29 15:19:09.384000 -04:00
Shutting down instance: further logons disabled
Stopping background process CJQ0
Process termination requested for pid 27892 [source = unknown], [info = 0] [request issued by pid: 33378, uid: 60004]
Stopping background process MMNL
2021-06-29 15:19:10.495000 -04:00
Stopping background process MMON
2021-06-29 15:19:13.709000 -04:00
alter pluggable database all close immediate
<

Process 33378 requested to terminate process 27892, here, process 33378 is AQ Master Class Process QM05, its trace file was created with name "orcl_1_qm05_33378.trc", and process 27892 is AQ Server Class Process Q005, it is found out from QM05 trace file,
Trace file /u01/app/oracle/diag/rdbms/orcl/orcl_1/trace/orcl_1_qm05_33378.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.9.0.0.0
Build label:    RDBMS_19.9.0.0.0DBRU_SOLARIS.SPARC64_200930
ORACLE_HOME:    /u01/app/oracle/product/19.0.0/dbhome_1
System name:    SunOS
Node name:      rac01.dbaplus.ca
Release:        5.11
Version:        11.3
Machine:        sun4v
Instance name: orcl_1
Redo thread mounted by this instance: 1
Oracle process number: 246
Unix process pid: 33378, image: oracle@rac01.dbaplus.ca (QM05)


*** 2021-06-29T15:19:09.456361-04:00 (CDB$ROOT(1))
*** SESSION ID:(1298.50814) 2021-06-29T15:19:09.456423-04:00
*** CLIENT ID:() 2021-06-29T15:19:09.456442-04:00
*** SERVICE NAME:(SYS$BACKGROUND) 2021-06-29T15:19:09.456460-04:00
*** MODULE NAME:() 2021-06-29T15:19:09.456479-04:00
*** ACTION NAME:() 2021-06-29T15:19:09.456498-04:00
*** CLIENT DRIVER:() 2021-06-29T15:19:09.456516-04:00
*** CONTAINER ID:(1) 2021-06-29T15:19:09.456535-04:00
 
ksvkillslaves: detected hung proc 48b4d2b50 [Unix process pid: 27892, image: oracle@rac01.dbaplus.ca (Q005)]
<

From QM05 trace file, we can see that background process Q005 was hung during the database shutdown. Therefore, Q005 could be killed abnormally by Oracle shutdown process, and leave AQ message process corrupted. It can be automatically fixed/recovered after database is restarted at most scenarios. Unfortunately, the scenarios do not include database patching/upgrading immediately following the database shutdown.

In RAC environment, the issue can be fixed by start AQ acrivity again on a different instance. Each Queue Table has owner instance, and can be assigned with primary instance and secondary instance. Running following sql to determine owner instance, primary instance and secondary instance
select owner, queue_table, object_type, owner_instance, primary_instance, secondary_instance
  from dba_queue_tables where queue_table='MYAPP_QUEUE';

OWNER           QUEUE_TABLE       OWNER_INSTANCE PRIMARY_INSTANCE SECONDARY_INSTANCE
--------------- ----------------- -------------- ---------------- ------------------
USER01          MYAPP_QUEUE                   1                0                  0
<

The owner instance is instance 1, and primary and secondary are not assigned yet. They can be changed by running
execute dbms_aqadm.alter_queue_table (queue_table => 'MYAPP_QUEUE',primary_instance => 2,secondary_instance => 1);
<

New setting is changed to following
select owner, queue_table, object_type, owner_instance, primary_instance, secondary_instance
  from dba_queue_tables where queue_table='MYAPP_QUEUE';

OWNER           QUEUE_TABLE       OWNER_INSTANCE PRIMARY_INSTANCE SECONDARY_INSTANCE
--------------- ----------------- -------------- ---------------- ------------------
USER01          MYAPP_QUEUE                   2                2                  1
<

Wait for a few minutes, and check the status of messages in the queue,
select MSG_ID,MSG_STATE,DELAY,ENQ_TIME
  from USER01.AQ$MYAPP_QUEUE
 where MSG_STATE='WAIT'
order by ENQ_TIME;

no rows selected
<

The 'WAIT' messages are gone (MSG_STATE is changed to 'READY').

2 comments:

Abdul hafeez kalsekar -- OCS/OCE/OCP said...

Thanks for sharing

Toon Koppelaars said...

Was this issue ever reported back to Oracle, and/or fixed in the mean time?