Sunday, May 16, 2021

Grid Infrastructure 12.2: restore OCR, Voting File and mgmtdb when the disk group is corrupted

For some reason, Oracle desupported placing OCR and voting files directly on a shared file system starting with Grid Infrastructure 12.2, and only rescinded that desupport for Standalone Clusters in 19.3. Therefore, when GI 12.2 is installed, the OCR, voting file, and OCR backup location are configured in an ASM disk group by default, most likely in the same disk group. When that disk group cannot be mounted for any reason, the cluster can no longer be brought up.

Technically, GI would have to be re-configured as if it were a new installation, which could be a big job. However, it may not be that bad. Although 12.2 does not allow you to place the OCR backup location outside an ASM disk group, a DBA can still copy the OCR backup file from the ASM disk group to a file system with the 'asmcmd cp' command. I am going to demonstrate how to maximize the chance of bringing the cluster back up without rebuilding or reinstalling it.
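
For example, the latest automatic OCR backup shown later in this article could be copied out to a local directory like this (run asmcmd as the grid user; the destination directory /backup/ocr is only an example):

  asmcmd cp +DATA/cluster01/OCRBACKUP/backup00.ocr.272.1072571013 /backup/ocr/backup00.ocr.272.1072571013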

1. Retrieve OCR backup from corrupted disk group


If you do not have an OCR backup outside the corrupted disk group, or the backup is too old, try to find the latest OCR backup inside the corrupted disk group as follows,

* Find the disk devices that make up the disk group from the ASM alert log file
Log file name "<DIAGNOSTIC_DEST>/diag/asm/+asm/<ASM_INSTANCE_NAME>/trace/alert_<ASM_INSTANCE_NAME>.log"

Sample contents from the alert file "/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log"
2021-05-15T20:14:16.982466-04:00
NOTE: Diskgroup used for Voting files is:
  DATA
NOTE: Diskgroup used for OCR is:DATA
NOTE: cache registered group DATA 1/0x7536B4AB
NOTE: cache began mount (first) of group DATA 1/0x7536B4AB
NOTE: Assigning number (1,1) to disk (/dev/mapper/ASMDATA2)
NOTE: Assigning number (1,0) to disk (/dev/mapper/ASMDATA1)

It shows that disk group DATA is used for OCR and Voting files, and that DATA consists of the following two disks

   /dev/mapper/ASMDATA2
   /dev/mapper/ASMDATA1
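
If the alert log is large, one way to locate these entries is to search for the disk group and disk assignment messages, for example (using the alert log path from the sample above):

  grep -E 'Diskgroup used for|Assigning number' /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log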

* Find the OCR backup files by running the following command as root, whether or not the Clusterware is running

  <GI_HOME>/bin/ocrconfig -showbackup

Sample output
# /u01/app/12.2.0/grid/bin/ocrconfig -showbackup
PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy

rac01    2021/05/15 00:23:38     +DATA:/cluster01/OCRBACKUP/backup00.ocr.272.1072571013     1211615871
rac01
rac01    2021/05/14 20:23:30     +DATA:/cluster01/OCRBACKUP/backup01.ocr.269.1072556603     1211615871
rac01
rac01    2021/05/14 20:23:30     +DATA:/cluster01/OCRBACKUP/day.ocr.270.1072556611     1211615871
rac01
rac01    2021/05/14 20:23:30     +DATA:/cluster01/OCRBACKUP/week.ocr.271.1072556611     1211615871

The latest backup was created at 2021/05/15 00:23:38 in disk group "DATA" with file number "272".

* Extract the OCR backup from the corrupted disk group by running the following command as root

  <GI_HOME>/bin/amdu -diskstring '<disk-device-name>' -extract <disk-group-name>.<file-number>

Here, <disk-device-name>, <disk-group-name> and <file-number> were identified in the previous steps.

Sample output
# /u01/app/12.2.0/grid/bin/amdu -diskstring '/dev/mapper/ASMDATA2' -extract DATA.272
amdu_2021_05_16_11_02_47/
AMDU-00208: File directory block not found. Cannot extract file DATA.272.
ERROR: no read quorum in group: required 1, found 0 disks

# /u01/app/12.2.0/grid/bin/amdu -diskstring '/dev/mapper/ASMDATA1' -extract DATA.272
amdu_2021_05_16_11_04_01/
AMDU-00216: 0xbadfda7a written to 1560 blocks of file DATA_272.f

# ls -l amdu_2021_05_16_11_04_01
total 12392
-rw-r--r-- 1 root root 12681216 May 16 11:04 DATA_272.f
-rw-r--r-- 1 root root     4365 May 16 11:04 report.txt

The "amdu" found the OCR backup from disk "/dev/mapper/ASMDATA1" (first disk of disk group DATA) and created two files,

  DATA_272.f     copy of OCR backup
  report.txt     Summary of the extract operation

File 'DATA_272.f' will be used for OCR restore.
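
Optionally, you can sanity-check that the extracted file is a readable OCR backup before proceeding by dumping it with ocrdump as root (the path below assumes the extract directory was created under /u01, as in the restore step later in this article):

  <GI_HOME>/bin/ocrdump -stdout -backupfile /u01/amdu_2021_05_16_11_04_01/DATA_272.f | more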

2. Create new disk group


Before creating the new disk group, follow the Oracle installation documentation to get the new disks ready. Then re-create the disk group as follows,

* Stop Oracle Clusterware by running the following command as root on all nodes of the cluster

  <GI_HOME>/bin/crsctl stop crs

If the command returns an error due to OCR corruption, run the following command as root on all nodes instead,

  <GI_HOME>/bin/crsctl stop crs -f

* Start the Oracle Clusterware stack on one node in exclusive mode by running the following command as root

  <GI_HOME>/bin/crsctl start crs -excl -nocrs

The -nocrs option ensures that the CRSD process and OCR do not start with the rest of the Oracle Clusterware stack. Ignore any errors that display.

Check whether CRSD is running by running the following command

  <GI_HOME>/bin/crsctl status resource ora.crsd -init


If CRSD is running, then stop it by running the following command as root,

  <GI_HOME>/bin/crsctl stop resource ora.crsd -init

Sample output
[root@rac01]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.evmd' on 'rac01'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac01'
CRS-2676: Start of 'ora.mdnsd' on 'rac01' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac01'
CRS-2676: Start of 'ora.gpnpd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac01'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac01'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac01' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac01'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac01'
CRS-2676: Start of 'ora.diskmon' on 'rac01' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac01'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac01'
CRS-2676: Start of 'ora.ctssd' on 'rac01' succeeded
CRS-2676: Start of 'ora.drivers.acfs' on 'rac01' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac01'
CRS-2676: Start of 'ora.asm' on 'rac01' succeeded

[root@rac01]# crsctl status resource ora.crsd
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@rac01]#
[root@rac01]# crsctl status resource ora.crsd -init
NAME=ora.crsd
TYPE=ora.crs.type
TARGET=OFFLINE
STATE=OFFLINE

* Check whether the ASM instance is running with the command "ps -ef | grep asm_pmon"
[root@rac01]# ps -ef | grep asm_pmon | grep -v grep
grid  17207     1  0 12:36 ?        00:00:00 asm_pmon_+ASM1

ASM instance "+ASM1" is running as user "grid", the OS process id is 17207.

Log in to the system as "grid" and check the GI home from which the ASM instance is running,

   pwdx <pid>

If the pwdx command does not exist on your system, use the following instead,

  ls -l /proc/<pid>/cwd

Sample output
[grid@rac01]$ pwdx 17207
17207: /u01/app/12.2.0/grid/dbs

[grid@rac01]$ ls -l /proc/17207/cwd
lrwxrwxrwx 1 grid oracle 0 May 16 12:45 /proc/17207/cwd -> /u01/app/12.2.0/grid/dbs

The GI home is "/u01/app/12.2.0/grid".

* Connect to instance "+ASM1" with sqlplus as sysasm
[grid@rac01]$ export ORACLE_SID='+ASM1'
[grid@rac01]$ export ORACLE_HOME=/u01/app/12.2.0/grid
[grid@rac01]$ export PATH=$ORACLE_HOME/bin:$PATH

[grid@rac01]$ sqlplus / as sysasm

SQL*Plus: Release 12.2.0.1.0 Production on Sun May 16 13:03:53 2021

Copyright (c) 1982, 2016, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL>

* List the candidate disks that can be used to create the disk group, in sqlplus
SQL> col path for a30
SQL> select group_number,path,state,header_status,mount_status from v$asm_disk;

GROUP_NUMBER PATH                           STATE    HEADER_STATU MOUNT_S
------------ ------------------------------ -------- ------------ -------
           0 /dev/mapper/NEWDATA1           NORMAL   CANDIDATE    CLOSED
           0 /dev/mapper/NEWDATA2           NORMAL   CANDIDATE    CLOSED

If your new disks are not listed, check and make sure that the value of the parameter "asm_diskstring" includes the path of your new disks.

Example value of "asm_diskstring"
SQL> show parameter asm_diskstring

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      AFD:*, /dev/mapper/*

Oracle will search the path /dev/mapper and AFD disks for candidate disks. Therefore, the following new disks can be listed in view v$asm_disk,

  /dev/mapper/NEWDATA1
  /dev/mapper/NEWDATA2

Note: if the new disk path is not covered by "asm_diskstring" and you decide to configure it manually with "alter system set asm_diskstring", make sure the value will be kept across ASM instance restarts. If the original ASM instance saved its spfile in the same disk group as the OCR, that spfile is not accessible either. Most likely, the currently running ASM instance was started with default values, without an spfile or pfile. If that is the case, it is better to create a temporary pfile 'init+ASM.ora' under the directory <GI_HOME>/dbs (a minimal example follows).
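
A minimal temporary pfile could look like the following; the asm_diskstring value must match your own disk path, and the remaining parameters are only illustrative:

  # /u01/app/12.2.0/grid/dbs/init+ASM.ora -- temporary pfile, example values
  *.instance_type='asm'
  *.asm_diskstring='AFD:*','/dev/mapper/*'
  *.asm_power_limit=1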

* Create disk group DATA
SQL> create diskgroup DATA external redundancy
  2   disk '/dev/mapper/NEWDATA1' size 200g
  3   disk '/dev/mapper/NEWDATA2' size 200g;

Diskgroup created.

SQL> select g.name,d.path,d.state,d.header_status,d.mount_status
  2   from v$asm_diskgroup g, v$asm_disk d
  3  where g.group_number(+)=d.group_number;

NAME       PATH                           STATE    HEADER_STATU MOUNT_S
---------- ------------------------------ -------- ------------ -------
DATA       /dev/mapper/NEWDATA1           NORMAL   MEMBER       CACHED
DATA       /dev/mapper/NEWDATA2           NORMAL   MEMBER       CACHED

I am using external redundancy because the disks are multipath SAN disks, so ASM mirroring is not necessary. If your disks are not RAID-protected, ASM mirroring, at least NORMAL REDUNDANCY, is strongly recommended.
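
For reference, a normal-redundancy version of the same disk group could be created like this (the failure group names fg1/fg2 are only examples):

SQL> create diskgroup DATA normal redundancy
  2   failgroup fg1 disk '/dev/mapper/NEWDATA1' size 200g
  3   failgroup fg2 disk '/dev/mapper/NEWDATA2' size 200g;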

Make sure the newly created disk group is mounted,
SQL> select name,state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED

3. Restore OCR to disk group DATA


* Restore the OCR by running the following command as root

  <GI_HOME>/bin/ocrconfig -restore <Retrieved-OCR-backup>

Here, <Retrieved-OCR-backup> is the OCR backup file (/u01/amdu_2021_05_16_11_04_01/DATA_272.f) extracted from the corrupted disk group in the first step.

Example output
[root@rac01]# /u01/app/12.2.0/grid/bin/ocrconfig -restore /u01/amdu_2021_05_16_11_04_01/DATA_272.f
[root@rac01]# 

You may have to repeat step 1 to extract another OCR backup file from the corrupted disk group if the command returns errors like the following
[root@rac01]# /u01/app/12.2.0/grid/bin/ocrconfig -restore /u01/amdu_2021_05_16_11_04_01/DATA_272.f
PROT-40: The specified file '/u01/amdu_2021_05_16_11_04_01/DATA_272.f' has an invalid format
PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]

Note: If the new disk group is named differently from the corrupted one, "ocrconfig -restore" will fail with errors like the following
[root@rac01]# /u01/app/12.2.0/grid/bin/ocrconfig -restore /u01/amdu_2021_05_16_11_04_01/DATA_272.f
PROT-35: The configured OCR locations are not accessible.
PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]

As a workaround, you can manually update the OCR location information by editing the file "/etc/oracle/ocr.loc" as follows

Back up the file with the command

  cp -p /etc/oracle/ocr.loc /etc/oracle/ocr.loc.bak

Edit the file /etc/oracle/ocr.loc, and replace the "ocrconfig_loc" line with the following line

  ocrconfig_loc=<new_disk_group>

If the new disk group is named "DATANEW", sample output
[root@rac01]# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA/cluster01/OCRFILE/registry.255.1072473645
local_only=false

[root@rac01]# cp -p /etc/oracle/ocr.loc /etc/oracle/ocr.loc.bak

[root@rac01]# sed -i '/ocrconfig_loc=/c\ocrconfig_loc=+DATANEW' /etc/oracle/ocr.loc

[root@rac01]# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATANEW
local_only=false

After the file has been changed with the new disk group name, re-run the "ocrconfig -restore" command.

Note: The same file also needs to be edited on all other cluster nodes if the new disk group name is different from the corrupted disk group's name.

* Verify the integrity of the newly restored OCR by running the following command as root

  <GI_HOME>/bin/ocrcheck

Example output
[root@rac01]# /u01/app/12.2.0/grid/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     409568
         Used space (kbytes)      :      11948
         Available space (kbytes) :     397620
         ID                       : 1063869578
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

         Cluster registry integrity check succeeded

         Logical corruption check succeeded

It shows that neither an integrity issue nor logical corruption was found. The OCR restore succeeded.

* Stop Oracle Clusterware

  <GI_HOME>/bin/crsctl stop crs -f

Example output
[root@rac01]# /u01/app/12.2.0/grid/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac01'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac01'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac01'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac01'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac01'
CRS-2677: Stop of 'ora.evmd' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac01'
CRS-2673: Attempting to stop 'ora.asm' on 'rac01'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac01' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'rac01' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'rac01' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rac01' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac01'
CRS-2677: Stop of 'ora.cssd' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac01'
CRS-2677: Stop of 'ora.gipcd' on 'rac01' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac01' has completed
CRS-4133: Oracle High Availability Services has been stopped.

* If the OCR was restored to a new disk group whose name is different from the corrupted disk group, edit the file "/etc/oracle/ocr.loc" on all other nodes as you did on the first node (see the example below).
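
For example, assuming passwordless ssh as root and a second node named rac02 (a hypothetical node name), the updated file can simply be copied over:

  scp -p /etc/oracle/ocr.loc rac02:/etc/oracle/ocr.loc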

Note: Do not start Clusterware on any nodes.

4. Recreate Voting files


* Run the following command as root on only one node to start the Oracle Clusterware stack in exclusive mode, which does not require the voting files to be present or usable

  <GI_HOME>/bin/crsctl start crs -excl

Example output
[root@rac01]# /u01/app/12.2.0/grid/bin/crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.evmd' on 'rac01'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac01'
CRS-2676: Start of 'ora.mdnsd' on 'rac01' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac01'
CRS-2676: Start of 'ora.gpnpd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac01'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac01'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac01' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac01'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac01'
CRS-2676: Start of 'ora.diskmon' on 'rac01' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'rac01'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2676: Start of 'ora.ctssd' on 'rac01' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac01'
CRS-2676: Start of 'ora.asm' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac01'
CRS-2676: Start of 'ora.storage' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac01'
CRS-2676: Start of 'ora.crsd' on 'rac01' succeeded

* Verify the currently defined voting files

  <GI_HOME>/bin/crsctl query css votedisk

Example output
[root@rac01]# /u01/app/12.2.0/grid/bin/crsctl query css votedisk
Located 0 voting disk(s).

No voting disk (file) is found because the original voting file was located in the corrupted disk group. Therefore, the Clusterware cannot be started normally until the voting file is re-created.

* Recreate the Voting files by running the following command as root

  <GI_HOME>/bin/crsctl replace votedisk <disk_group>

Example output
[root@rac01]# /u01/app/12.2.0/grid/bin/crsctl replace votedisk +DATA
Successful addition of voting disk 0a66a3368a0e4fb2bfef7de9764d3d8b.
Successfully replaced voting disk group with +DATA.
CRS-4266: Voting file(s) successfully replaced

* Re-verify the currently defined voting files
[root@rac01]# /u01/app/12.2.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name              Disk group
--  -----    -----------------                ---------------------- ---------
 1. ONLINE   0a66a3368a0e4fb2bfef7de9764d3d8b (/dev/mapper/newDATA1) [DATA]
Located 1 voting disk(s).

* Stop the Clusterware stack as root

  <GI_HOME>/bin/crsctl stop crs -f

Example output (here "crsctl stop has" was used, which likewise shuts down the whole stack on the node)
[root@rac01]# /u01/app/12.2.0/grid/bin/crsctl stop has
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac01'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac01'
CRS-2677: Stop of 'ora.crsd' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.storage' on 'rac01'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac01'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac01'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac01'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac01' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'rac01' succeeded
CRS-2677: Stop of 'ora.storage' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'rac01'
CRS-2677: Stop of 'ora.mdnsd' on 'rac01' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac01'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac01'
CRS-2677: Stop of 'ora.ctssd' on 'rac01' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac01'
CRS-2677: Stop of 'ora.cssd' on 'rac01' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac01'
CRS-2677: Stop of 'ora.gipcd' on 'rac01' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac01' has completed
CRS-4133: Oracle High Availability Services has been stopped.

* Restart the Oracle Clusterware stack in normal mode by running the following command as root on all nodes

  <GI_HOME>/bin/crsctl start crs
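
Once the stack is up on all nodes, you can verify the overall cluster state and the voting file, for example:

  <GI_HOME>/bin/crsctl check cluster -all
  <GI_HOME>/bin/crsctl query css votedisk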

5. Create ASM spfile in new ASM disk group


If the original ASM spfile was stored in the corrupted disk group, a new spfile has to be created in the new disk group as follows,

* Start sqlplus as the GI_HOME owner (usually "grid") and connect to the ASM instance as sysasm

Example output
[grid@rac01]$ ps -ef | grep asm_pmon | grep -v grep
grid  24084     1  0 15:53 ?        00:00:00 asm_pmon_+ASM1
[grid@rac01]$
[grid@rac01]$ . oraenv
ORACLE_SID = [grid] ? +ASM1
The Oracle base remains unchanged with value /u01/app/grid
[grid@rac01]$
[grid@rac01]$ sqlplus / as sysasm

SQL*Plus: Release 12.2.0.1.0 Production on Sun May 16 15:59:59 2021

Copyright (c) 1982, 2016, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL>

* Confirm that the instance is not using spfile
SQL> show parameter spfile

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string

* Create spfile in new disk group
SQL> create spfile='+DATA' from pfile='/u01/app/12.2.0/grid/dbs/init+ASM.ora';

File created.

Note: make sure the pfile init+ASM.ora has been created with the proper value for the parameter "asm_diskstring".

* Restart ASM on each node of the cluster

  <GI_HOME>/bin/srvctl stop asm -force
  <GI_HOME>/bin/srvctl start asm

* Confirm that the instance is using spfile
SQL> show parameter spfile

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      +DATA/cluster01/ASMPARAMETERFILE/regi
                                                 stry.253.1072713895

The ASM instance is now using spfile '+DATA/cluster01/ASMPARAMETERFILE/registry.253.1072713895'.
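
As an additional check, asmcmd can show the spfile location registered in the GPnP profile; it should point at the new disk group. Run as grid:

  asmcmd spget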

6. Re-create mgmtdb as GI_HOME owner (grid)


This step is only needed if mgmtdb was configured before the disk group corruption and cannot be started after the Clusterware is restored.

Note: if you have access to My Oracle Support, download mdbutil.pl from Doc ID 2065175.1 and use it to re-create mgmtdb; it is much easier than manual creation. If you cannot download the script, continue with this article to re-create mgmtdb manually.
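
If you do use the script, the re-create invocation is roughly of the form below; I am quoting the flags from memory of the MOS note rather than from this restore, so verify them against the note before running:

  ./mdbutil.pl --addmdb --target=+DATA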

Confirm whether mgmtdb was configured before with the command,

  <GI_HOME>/bin/srvctl status mgmtdb

If the command returns output like the following,
[grid@rac01]$ srvctl status mgmtdb
Database is enabled
Database is not running.

"Databse is enalbed" - means original Clusterware was configured with mgmtdb

"Database is not running" - for some reason, mgmtdb is not running, try to start it with command following command

  <GI_HOME>/bin/srvctl start mgmtdb

If the database was not created in the corrupted disk group, it should be able to start. If it cannot be started, follow the instructions in this step to re-create it.

* Make sure listener mgmtlsnr is not running

  <GI_HOME>/bin/srvctl status mgmtlsnr

If the listener is running, stop it

  <GI_HOME>/bin/srvctl stop mgmtlsnr

Example output
[grid@rac01]$ srvctl status mgmtlsnr
Listener MGMTLSNR is enabled
Listener MGMTLSNR is running on node(s): rac01

[grid@rac01]$ srvctl stop mgmtlsnr

[grid@rac01]$ srvctl status mgmtlsnr
Listener MGMTLSNR is enabled
Listener MGMTLSNR is not running

* Start Oracle Database QoS Management server resource (qosmserver) if it is enabled and not running

Check the status of qosmserver,

  <GI_HOME>/bin/srvctl status qosmserver

If the service is enabled but not running, start it with the command

  <GI_HOME>/bin/srvctl start qosmserver

Example output
[grid@rac01]$ srvctl status qosmserver
QoS Management Server is enabled.
QoS Management Server is not running.

[grid@rac01]$ srvctl start qosmserver

[grid@rac01]$ srvctl status qosmserver
QoS Management Server is enabled.
QoS Management Server is running on node rac01.

* Remove mgmtdb resource from OCR

  <GI_HOME>/bin/srvctl remove mgmtdb -force

Note: If the resource is not removed, dbca will fail.

* Create mgmtdb with dbca as follows

  <GI_HOME>/bin/dbca -silent -createDatabase -createAsContainerDatabase true \
        -templateName MGMTSeed_Database.dbc -sid -MGMTDB -gdbName _mgmtdb \
        -storageType ASM -diskGroupName <disk_group> \
        -datafileJarLocation <GI_HOME>/assistants/dbca/templates \
        -characterset AL32UTF8 -autoGeneratePasswords -skipUserTemplateCheck

Example command output
[grid@rac01]$ dbca -silent -createDatabase -createAsContainerDatabase true \
> -templateName MGMTSeed_Database.dbc -sid -MGMTDB -gdbName _mgmtdb \
> -storageType ASM -diskGroupName +DATA \
> -datafileJarLocation $ORACLE_HOME/assistants/dbca/templates \
> -characterset AL32UTF8 -autoGeneratePasswords -skipUserTemplateCheck
Registering database with Oracle Grid Infrastructure
5% complete
Copying database files
6% complete
7% complete
22% complete
36% complete
Creating and starting Oracle instance
37% complete
39% complete
43% complete
44% complete
45% complete
49% complete
53% complete
54% complete
55% complete
58% complete
Completing Database Creation
59% complete
62% complete
65% complete
68% complete
Executing Post Configuration Actions
100% complete
Look at the log file "/u01/app/grid/cfgtoollogs/dbca/_mgmtdb/_mgmtdb4.log" for further details.

* Configure mgmtdb with command

  <GI_HOME>/bin/mgmtca -local
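
After mgmtca completes, you can confirm that the management database and its listener are back with:

  <GI_HOME>/bin/srvctl status mgmtdb
  <GI_HOME>/bin/srvctl status mgmtlsnr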

At this point, the Clusterware has been restored successfully. The next step is to restore/recover the databases, which every DBA knows how to do.
