DBA Plus Workshop: ASMB process crashed the instance after 'PMON failed to acquire latch'

Environment:
Oracle Database: Enterprise Edition 11.2.0.2.12
Operating System: AIX 6.1

Symptom:

Database alert log:

Fri Dec 26 07:23:04 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:24:04 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:25:36 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:26:36 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:28:07 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:28:52 2014
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/oradb/ORADB/trace/ORADB_asmb_5439902.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
Errors in file /U01/app/oracle/diag/rdbms/oradb/ORADB/trace/ORADB_asmb_5439902.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
ASMB (ospid: 5439902): terminating the instance due to error 15064
Fri Dec 26 07:28:52 2014

PMON trace:

*** 2014-12-26 07:23:04.963
        Location from where latch is held: kgh.h LINE:6387 ID:kghalo:
        Context saved from call: 0
        state=busy [holder orapid=16] wlstate=free [value=0]
    waiters [orapid (seconds since: put on list, posted, alive check)]:
     64 (103, 1419596584, 0)
        ... ...
     61 (4, 1419596584, 0)
     waiter count=17
    gotten 38993536 times wait, failed first 285766 sleeps 71131
    gotten 0 times nowait, failed: 0
Short stack dump:
    ... ... 
possible holder pid = 16 ospid=5439902
----------------------------------------
SO: 0x7000002fec6df78, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x7000002fec6df78, name=process, file=ksu.h LINE:12451 ID:, pg=0
(process) Oracle pid:16, ser:1, calls cur/top: 0x7000002ff83fd00/0x7000002ff83fd00
          flags : (0x6) SYSTEM
          flags2: (0x0),  flags3: (0x0)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 35
              last post received-location: ksr2.h LINE:603 ID:ksrpublish
              last process to post me: 7000002fec731a8 2 0
              last post sent: 1051 0 2
              last post sent-location: ksl2.h LINE:2293 ID:kslpsr
              last process posted by me: 7000002fec69db8 1 6
  (latch info) wait_event=0 bits=84
    holding    (efd=4) 700000000107940 Child shared pool level=7 child#=1
        Location from where latch is held: kgh.h LINE:6387 ID:kghalo:
        Context saved from call: 0
        state=busy [holder orapid=16] wlstate=free [value=0]
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         64 (103, 1419596584, 0)
             ... ...
         61 (4, 1419596584, 0)
         waiter count=17
    holding    (efd=4) 700000000047900  ASM map headers level=2
        Location from where latch is held: kffm2.h LINE:306 ID:kffmAllocate:
        Context saved from call: 0
        state=busy [holder orapid=16] wlstate=free [value=0]
  Process Group: DEFAULT, pseudo proc: 0x7000002fee6c858
  O/S info: user: oragrid, term: UNKNOWN, ospid: 5439902
  OSD pid info: Unix process pid: 5439902, image: oracle@host01.dbaplus.ca (ASMB)

ASMB trace:

*** 2014-12-26 07:28:52.409
NOTE: ASMB terminating
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+644<-kjzdssdmp()+444<-kjzduptcctx()+272<-kjzdicrshnfy()+96<-ksuitm()+1284<-ksbrdp()+4508<-opirip()+1216<-opidrv()+1156<-sou2o()+192<-opimai_real()+308<-ssthrdmain()+340<-main()+216<-__start()+112
----- End of Abridged Call Stack Trace -----

*** 2014-12-26 07:28:52.412
ASMB (ospid: 5439902): terminating the instance due to error 15064
ksuitm: waiting up to [5] seconds before killing DIAG(8519832)

AWR report:

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     6,464M     6,464M  Std Block Size:        16K
           Shared Pool Size:     5,152M     5,152M      Log Buffer:     7,172K

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   97.27    In-memory Sort %:  100.00
            Library Hit   %:   73.34        Soft Parse %:   63.01
         Execute to Parse %:   45.26         Latch Hit %:   99.74
Parse CPU to Parse Elapsd %:    4.75     % Non-Parse CPU:   92.21

  
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
latch: shared pool                      836         866   1035   38.4 Concurrenc
db file sequential read             227,015         281      1   12.5 User I/O
DB CPU                                              280          12.4
latch free                                6         161  26794    7.1 Other
library cache lock                        3          89  29759    4.0 Concurrenc

Check duplicate SQL:

select PLAN_HASH_VALUE,count(*) from v$sql
 where EXECUTIONS=1 group by PLAN_HASH_VALUE order by 2 desc;

     251958376      34281
    3946703927      33694
    1997326701       2603
    3410233426       1221
    3460305098        304
     ... ...
    3150591124          1	 

select substr(sql_text,1,50),count(1) from v$sql
 where PLAN_HASH_VALUE=251958376 group by substr(sql_text,1,50);

Obviously, the application is not designed with consideration of bind variables, the duplicate SQLs (using literals instead of bind variables) caused massive shared pool fragmentation.

Therefore, ASMB process can not find a suitable sized chunk of memory before it was considered unresponsive and thereafter killed.

Workaround:

Since application can not be changed (system too old to find developer), tried CURSOR_SHARING=FORCE and no crash happened after that.

Pages

Friday, December 26, 2014

ASMB process crashed the instance after 'PMON failed to acquire latch'

No comments: