Sunday, May 17, 2020

12.2.0.1 init.ohasd does not start clusterware stack after April 2020 RU applied

The init.ohasd process does not start clusterware stack after the 12.2.0.1 Grid Infrastructure home is applied April 2020 Release Update. The scenario is not a typical Grid Infrastructure startup issue. It is a human error and happened when DBA, by accident, ran roothas.sh instead of rootcrs.sh while patching GI cluster home. Before knowing it is human error, the diagnose process is done as following.

The processes init.ohasd and ohasd.bin status,
[root@host02]# ps -ef | grep has
root      3313     1  0 11:00 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
grid     30923 17981  0 12:01 pts/9    00:00:00 grep has

Process ohasd.bin is not started automatically.

Make sure the autostart of crs is not disabled,
[root@host02]# /u01/app/12.2.0/grid/bin/crsctl config crs
CRS-4622: Oracle High Availability Services autostart is enabled.

The command shows that the autostart is enabled. Alternative method to check autostart status is to display the content of file ohasdstr,
[root@host02]# cat /etc/oracle/scls_scr/host02/root/ohasdstr
enable

Try to start crs manually,
[root@host02]# /u01/app/12.2.0/grid/bin/crsctl start crs 
CRS-4123: Oracle High Availability Services has been started. 
[root@host02]# ps -ef | grep has
root      3313     1  0 11:00 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root     23329     1  0 12:07 ?        00:00:04 /u01/app/12.2.0/grid/bin/ohasd.bin reboot
grid     30923 17981  0 12:27 pts/9    00:00:00 grep has
[root@host02]# /u01/app/12.2.0/grid/bin/crsctl stat res -t 
CRS-4535: Cannot communicate with Cluster Ready Services 
CRS-4000: Command Status failed, or completed with errors. 
[root@host02#  ps -ef | grep 'd\.bin'

The ohasd.bin is started, but the clusterware stack is still not started.

Check trace file $ORACLE_BASE/crsdata/<node-name>/output/ohasdOUT.trc
[root@host02# tail /u01/app/grid/crsdata/host02/output/ohasdOUT.trc
Redirected Oracle Clusterware daemon standard output file /u01/app/grid/crsdata/host02/output/ohasdOUT.trc
Initializing Oracle Clusterware OHASD daemon with OS process ID 27459 on host host02 at local time 2020/04/21-22:02:29.433
   Oracle Home is: /u01/app/12.2.0/grid
   Oracle Base is: /u01/app/grid
   NLS_LANG is:    AMERICAN_AMERICA.AL32UTF8
Redirected Oracle Clusterware daemon standard output file /u01/app/grid/crsdata/host02/output/ohasdOUT.trc
Initializing Oracle Clusterware OHASD daemon with OS process ID 23329 on host host02 at local time 2020/05/13-12:17:31.952
   Oracle Home is: /u01/app/12.2.0/grid
   Oracle Base is: /u01/app/grid
   NLS_LANG is:    AMERICAN_AMERICA.AL32UTF8
OSD_main waitforWatchdogMonitor() : Blocking on FIFO

Check trace file  $ORACLE_BASE/diag/crs/<node-name>/crs/trace/ohasd.trc
[root@host02# tail /u01/app/grid/diag/crs/host02/crs/trace/ohasd.trc
Trace file /u01/app/grid/diag/crs/host02/crs/trace/ohasd.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
    CLSB:3669426272: Argument count (argc) for this daemon is 2
    CLSB:3669426272: Argument 0 is: /u01/app/12.2.0/grid/bin/ohasd.bin
    CLSB:3669426272: Argument 1 is: reboot
2020-05-13 12:17:32.060 : default:3669426272:  Created alert : (:OHAS00117:) :  TIMED OUT WAITING FOR OHASD MONITOR
2020-05-13 12:17:32.060 :OHASDMAIN:3669426272:  OHASD Daemon waiting for the watchdog monitor script to start

Check alert log file  $ORACLE_BASE/diag/crs/<node-name>/crs/trace/alert.log
[root@host02# tail /u01/app/grid/diag/crs/r7-toyo/crs/trace/alert.log
2020-05-12 17:10:56.891 [CLSECHO(29775)]ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
2020-05-12 17:10:56.975 [CLSECHO(29794)]ACFS-9156: Detecting control device '/dev/ofsctl'.
2020-05-12 17:10:59.878 [CLSECHO(30496)]ACFS-9309: ADVM/ACFS installation correctness verified.
2020-05-12 17:11:00.587 [CLSCFG(30626)]CRS-1810: Node-specific configuration for node r7-toyo in Oracle Local Registry was patched to patch level 2554537810.
2020-05-13 12:17:31.979 [OHASD(23329)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 23329
2020-05-13 12:17:32.053 [OHASD(23329)]CRS-0715: Oracle High Availability Service has timed out waiting for init.ohasd to be started.

It seems that something is wrong with script init.ohasd. Review the script and find following lines,
######### Instantiated Variables #########
ORA_CRS_HOME=/u01/app/12.2.0/grid
export ORA_CRS_HOME

HAS_USER=grid
SCRBASE=/etc/oracle/scls_scr

Here, HAS_USER is supposed to be root for Grid Infrastructure instead of grid which is for standalone Grid Infrastructure (Oracle Restart). Looks like some Oracle Restart configuration script changed the script init.ohasd. By checking the patch applying log, it is found that DBA ran roothas.sh instead of rootcrs.sh during patching. Luckily, it is a two-node cluster and the other node was patched properly.

Copy init.ohasd script from healthy node to replace current script, and reboot the node. Everything comes back,
[root@host02]# ps -ef |grep has
root     23526     1  0 May13 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root     33529     1  0 May13 ?        01:04:09 /u01/app/12.2.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.AL32UTF8
oracle   29897 35245  0 22:04 pts/0    00:00:00 grep has
[root@host02# ps -ef | grep 'd\.bin'
root     10858     1  0 May13 ?        01:05:09 /u01/app/12.2.0/grid/bin/ohasd.bin reboot
grid     11145     1  0 May13 ?        00:17:16 /u01/app/12.2.0/grid/bin/mdnsd.bin
grid     11147     1  0 May13 ?        01:07:38 /u01/app/12.2.0/grid/bin/evmd.bin
grid     11182     1  0 May13 ?        00:21:10 /u01/app/12.2.0/grid/bin/gpnpd.bin
grid     11240     1  0 May13 ?        01:19:11 /u01/app/12.2.0/grid/bin/gipcd.bin
grid     11311     1  0 May13 ?        01:20:05 /u01/app/12.2.0/grid/bin/ocssd.bin
root     11508     1  0 May13 ?        01:10:20 /u01/app/12.2.0/grid/bin/octssd.bin reboot
root     11752     1  1 May13 ?        04:27:06 /u01/app/12.2.0/grid/bin/osysmond.bin
root     11773     1  0 May13 ?        01:36:37 /u01/app/12.2.0/grid/bin/crsd.bin reboot
[root@host02]# /u01/app/12.2.0/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.ASMNET2LSNR_ASM.lsnr
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.DATA.dg
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.chad
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.net1.network
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.ons
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
ora.proxy_advm
               ONLINE  ONLINE       host01                  STABLE
               ONLINE  ONLINE       host02                  STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       host01                  STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       host02                  STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       host02                  STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       host02                  169.254.76.22 192.16
                                                             8.7.2 192.168.7.10,S
                                                             TABLE
ora.asm
      1        ONLINE  ONLINE       host01                  Started,STABLE
      2        ONLINE  ONLINE       host02                  Started,STABLE
ora.db01.dbsvc01.svc
      1        ONLINE  ONLINE       host02                  STABLE
ora.db01.db
      1        ONLINE  ONLINE       host02                  Open,HOME=/u01/app/ora
                                                             cle/product/12.2.0/dbh
                                                             ome_1,STABLE
ora.cvu
      1        ONLINE  ONLINE       host02                  STABLE
ora.mgmtdb
      1        ONLINE  ONLINE       host02                  Open,STABLE
ora.qosmserver
      1        ONLINE  ONLINE       host02                  STABLE
ora.host01.vip
      1        ONLINE  ONLINE       host01                  STABLE
ora.host02.vip
      1        ONLINE  ONLINE       host02                  STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       host01                  STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       host02                  STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       host02                  STABLE
--------------------------------------------------------------------------------

No comments: