Friday, December 26, 2014

ASMB process crashed the instance after 'PMON failed to acquire latch'

Environment:
Oracle Database: Enterprise Edition 11.2.0.2.12
Operating System:  AIX 6.1

Symptom:

Database alert log:
Fri Dec 26 07:23:04 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:24:04 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:25:36 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:26:36 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:28:07 2014
PMON failed to acquire latch, see PMON dump
Fri Dec 26 07:28:52 2014
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/oradb/ORADB/trace/ORADB_asmb_5439902.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
Errors in file /U01/app/oracle/diag/rdbms/oradb/ORADB/trace/ORADB_asmb_5439902.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
ASMB (ospid: 5439902): terminating the instance due to error 15064
Fri Dec 26 07:28:52 2014

PMON trace:
*** 2014-12-26 07:23:04.963
Location from where latch is held: kgh.h LINE:6387 ID:kghalo:
Context saved from call: 0
state=busy [holder orapid=16] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
64 (103, 1419596584, 0)
... ...
61 (4, 1419596584, 0)
waiter count=17
gotten 38993536 times wait, failed first 285766 sleeps 71131
gotten 0 times nowait, failed: 0
Short stack dump:
... ...
possible holder pid = 16 ospid=5439902
----------------------------------------
SO: 0x7000002fec6df78, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x7000002fec6df78, name=process, file=ksu.h LINE:12451 ID:, pg=0
(process) Oracle pid:16, ser:1, calls cur/top: 0x7000002ff83fd00/0x7000002ff83fd00
flags : (0x6) SYSTEM
flags2: (0x0), flags3: (0x0)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 0 0 35
last post received-location: ksr2.h LINE:603 ID:ksrpublish
last process to post me: 7000002fec731a8 2 0
last post sent: 1051 0 2
last post sent-location: ksl2.h LINE:2293 ID:kslpsr
last process posted by me: 7000002fec69db8 1 6
(latch info) wait_event=0 bits=84
holding (efd=4) 700000000107940 Child shared pool level=7 child#=1
Location from where latch is held: kgh.h LINE:6387 ID:kghalo:
Context saved from call: 0
state=busy [holder orapid=16] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
64 (103, 1419596584, 0)
... ...
61 (4, 1419596584, 0)
waiter count=17
holding (efd=4) 700000000047900 ASM map headers level=2
Location from where latch is held: kffm2.h LINE:306 ID:kffmAllocate:
Context saved from call: 0
state=busy [holder orapid=16] wlstate=free [value=0]
Process Group: DEFAULT, pseudo proc: 0x7000002fee6c858
O/S info: user: oragrid, term: UNKNOWN, ospid: 5439902
OSD pid info: Unix process pid: 5439902, image: oracle@host01.dbaplus.ca (ASMB)

ASMB trace:
*** 2014-12-26 07:28:52.409
NOTE: ASMB terminating
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 152 Serial number: 9
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+644<-kjzdssdmp()+444<-kjzduptcctx()+272<-kjzdicrshnfy()+96<-ksuitm()+1284<-ksbrdp()+4508<-opirip()+1216<-opidrv()+1156<-sou2o()+192<-opimai_real()+308<-ssthrdmain()+340<-main()+216<-__start()+112
----- End of Abridged Call Stack Trace -----

*** 2014-12-26 07:28:52.412
ASMB (ospid: 5439902): terminating the instance due to error 15064
ksuitm: waiting up to [5] seconds before killing DIAG(8519832)

AWR report:
Cache Sizes                       Begin        End
~~~~~~~~~~~ ---------- ----------
Buffer Cache: 6,464M 6,464M Std Block Size: 16K
Shared Pool Size: 5,152M 5,152M Log Buffer: 7,172K

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 100.00 Redo NoWait %: 100.00
Buffer Hit %: 97.27 In-memory Sort %: 100.00
Library Hit %: 73.34 Soft Parse %: 63.01
Execute to Parse %: 45.26 Latch Hit %: 99.74
Parse CPU to Parse Elapsd %: 4.75 % Non-Parse CPU: 92.21


Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg
wait % DB
Event Waits Time(s) (ms) time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
latch: shared pool 836 866 1035 38.4 Concurrenc
db file sequential read 227,015 281 1 12.5 User I/O
DB CPU 280 12.4
latch free 6 161 26794 7.1 Other
library cache lock 3 89 29759 4.0 Concurrenc

Check duplicate SQL:
select PLAN_HASH_VALUE,count(*) from v$sql
where EXECUTIONS=1 group by PLAN_HASH_VALUE order by 2 desc;

251958376 34281
3946703927 33694
1997326701 2603
3410233426 1221
3460305098 304
... ...
3150591124 1

select substr(sql_text,1,50),count(1) from v$sql
where PLAN_HASH_VALUE=251958376 group by substr(sql_text,1,50);

Obviously,  the application is not designed with consideration of bind variables, the duplicate SQLs (using literals instead of bind variables) caused massive shared pool fragmentation.

Therefore, ASMB process can not find a suitable sized chunk of memory before it was considered unresponsive and thereafter killed.

Workaround:

Since application can not be changed (system too old to find developer), tried CURSOR_SHARING=FORCE and no crash happened after that.

 

No comments: