Friday, December 31, 2021

EM 12c/13c Send Email Notification When OMS or Repository Database Down

In Oracle Enterprise Manager 12c/13c deployment, normally the OMS is responsible for sending all metric alert notifications, but in cases where the OMS or repository database is down, it is not able to do this. Therefore,  the Out of Band email notification is introduced as a backup mechanism for sending email notification when the OMS or repository database itself is down.

When it happens, the email notification will be generated by the agent which monitors the OMS and Repository target (oracle_emrep) target. Therefore this method is only applicable to the agent which is installed on the same machine as the OMS (also known as the chained agent). In the case of a multi-OMS setup, the Agent on the primary OMS node is responsible for monitoring the oracle_emrep target.

The Out of Band email notification is implemented by configuring Agent property on OMS host or primary OMS node of multi-OMS setup. Configuring steps as following,

1. Stop agent on the OMS host

  <AGENT_INST_HOME>/bin/emctl stop agent

2. Set agent properties

Add following lines to agent property files <AGENT_INST_HOME>/sysman/config/emd.properties
emd_email_gateway=<email_SMTP_server>
emd_email_address=<email_address_of_notification_receivers>
emd_from_email_address=<email_address_of_notification_sender>
Line 2 set email address(es) to which the email notification will be sent. You can specify more than one email address separated by commas. For example, in my lab system, it is configures as following
emd_email_gateway=mail.lab.dbaplus.ca
emd_email_address=dba@lab.dbaplus.ca,sa@lab.dbaplus.ca
emd_from_email_address=oemadmin@lab.dbaplus.ca
3. Start agent

  <AGENT_INST_HOME>/bin/emctl start agent

4. Validate the properties with following commands

  <AGENT_INST_HOME>/bin/emctl getproperty agent -name emd_email_gateway
  <AGENT_INST_HOME>/bin/emctl getproperty agent -name emd_email_address
  <AGENT_INST_HOME>/bin/emctl getproperty agent -name emd_from_email_address

The commands should return the values configured in step 2.

EM 13c Agent is unreachable as its first severity has not yet come after blackout end

The agent status shows "Agent Unreachable (Post Blackout)" in Oracle Enterprise Manager 13c console. On the agent home page, it shows following error message,

  Agent is unreachable as its first severity has not yet come after blackout end.

It is seen in EM 13c, can can be fixed by running following commands on agent host,

  ${AGENT_INST_HOME}/bin/emctl stop agent
  ${AGENT_INST_HOME}/bin/emctl  clearstate agent
  ${AGENT_INST_HOME}/bin/emctl start agent

Thursday, December 30, 2021

RMAN left defunct server processes and NetBackup nborautil processes

RMAN jobs can successfully backup database to tape with NetBackup, but leave server processes (RMAN channels) and NetBackup nborautil processes running. The processes look like defunct processes.

The issue is observed in Oracle Database 19c RMAN backup job with NetBackup 8.1.2 on AIX 7.2, it may also happen on other platform with NetBackup 8.1.2 or 8.2. When it happens, nborautil processes can be seen running like following
$ ps -ef | grep nborautil
  oracle 17236328  6489154  31 12:14:13      - 129:18 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil
  oracle 32506230  5440300  26 12:47:03      - 113:20 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil

$ ps -ef | grep 6489154
  oracle 17236328  6489154  30 12:14:13      - 129:44 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil
  oracle  6489154        1   0   Dec 27      - 11:47 oracleORCL (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$
$ ps -ef | grep 5440300
  oracle 32506230  5440300  24 12:47:03      - 113:44 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil
  oracle  5440300        1   0   Dec 28      - 10:19 oracleORCL (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
Two NetBackup nborautil processes seem defunct, their parent process id are 6489154 and 5440300 respectively. Both parent processes are server processes of Oracle Database Instance ORCL. Check the session status of the server processes,
sys@ORCL> select s.sid,s.serial#,s.username,s.machine,s.program,s.event
  2  from v$session s, v$process p
  3  where s.paddr=p.addr and p.spid in (5440300,6489154);

  SID    SERIAL# USERNAME  MACHINE  PROGRAM                   EVENT
----- ---------- --------- -------- ------------------------- ------------------
  300      44503 SYS       host01   rman@host01 (TNS V1-V3)   Backup: MML shutdown
  178      17334 SYS       host01   rman@host01 (TNS V1-V3)   Backup: MML shutdown
The server processes are waiting for event "Backup: MML shutdown", it means waiting for NetBackup nborautil process to complete, and nborautil never exits, though Veritas Support claims it only takes longer time than expected not hang. As I saw, every RMAN job will leave new nborautil processes running, and eventually the defunct processes will use up all CPU resource and hang the system.

Veritas Support claims that this is bug of NetBackup on version 8.1.2/8.2. This issue will occur on all Oracle backups using RMAN script if the database is comprised of many datafiles.

The Oracle backups will  have an unusually long delay between when the last child job completes and when the parent job completes.  They delay may extend to several hours, even days.   

There are no performance or slow behaviour observed prior to the completion of the data transfer.  The delay is only during the meta data cataloging after the data transfer jobs have completed.

Smaller databases and those with fewer datafiles may not experience this delay.

So far, Veritas does not release any fix for it. If possible, upgrade NetBackup to higher version (e.x. 9). As a workaround, reducing backup datafiles in single job may help.

Friday, December 10, 2021

Mount DVD/CDROM with command on Linux CentOS / Red Hat / Oracle Linux / Ubuntu

 Minimal install is an important security for deploying Linux. It means only essential packages are installed. Therefore, GUI is usually not available. It becomes helpful to know how to mount CDROM/DVD when you need to install packages/software from DVD.

Basically, different Linux distributions and different versions may have different commands to do that. Here, I am trying to demonstrate a way which can be used on most of the Linux distributions/versions. It includes three steps,

* Identify DVD/CDROM device name (drive name)
* Create mount point (directory)
* Mount the DVD/CDROM on created mount point

Step 1. Identify DVD/CDROM device name (drive name)

This is most important step, and the only chanllege to mount the drive. Many commands/utilities can display DVD/CDROM device name, such as blkid, lsscsi, dmesg. But they are not always available. Most time, I use following command

  cat /proc/sys/dev/cdrom/info

It lists the information of the DVD/CDROM drive, and includes drive name. It proves to be working, at least, on CentOS, Red Hat Enterprise Linux, Oracle Linux, Ubuntu.

Sample output,
[root@RHEL4]# cat /proc/sys/dev/cdrom/info
CD-ROM information, Id: cdrom.c 3.20 2003/12/17

drive name:             sr1     sr0
drive speed:            1       32
drive # of slots:       1       1
Can close tray:         1       1
Can open tray:          1       1
Can lock tray:          1       1
Can change speed:       0       1
Can select disk:        0       0
Can read multisession:  1       1
Can read MCN:           1       1
Reports media changed:  1       1
Can play audio:         1       1
Can write CD-R:         0       0
Can write CD-RW:        0       0
Can read DVD:           0       1
Can write DVD-R:        0       0
Can write DVD-RAM:      0       0
Can read MRW:           0       1
Can write MRW:          0       1
Can write RAM:          0       1
The the example, the server has two DVD/CDROM drives sr1 and sr0. The full path of their device names are respectively /dev/sr1 and /dev/sr0.

Step 2. Create mount point

Mount point is a directory where you are going to mount DVD/CDROM. You can create your preferred directory with command

mkdir -p <directory>

For example
[root@RHEL4]# mkdir -p /media/cdrom
Step 3. Mount DVD/CDROM on the mount point with command,

  mount <device name> <mount point>

For example,
[root@RHEL4]# mount /dev/sr0 /media/cdrom
mount: block device /dev/sr0 is write-protected, mounting read-only

Sunday, December 5, 2021

Oracle 19c Installation on Red Had Enterprise Linux 8 / Oracle Linux 8 – [INS-08101] Unexpected error while executing the action at state: ‘supportedOSCheck’

When installing Oracle 19c Grid Infrastructure (GI) or 19c Database (runInstaller), got following error,
[INS-08101] Unexpected error while executing the action at state: 'supportedOSCheck'
Error details,
Cause - No additional information available.

Action - Cantact Oracle Spport Services or refer to the software manual.

Summary
 - java.lang.NullPointerException
The reason is that Oracle 19c is certified with Red Hat Enterprise Linux 8 / Oracle Linux 8 from 19.7, the base installation 19.3 is not certified.

The error can be fixed by downloading Release Update (RU) 19.7 or higher and starting installation with applyRU option as following,
./gridSetup.sh -applyRU <GI_RU_UNZIP_DIR>    - Install Grid Infrastructure
./runInstaller -applyRU <GI_RU_UNZIP_DIR>    - Insall Database
Note: Use Grid Infrastructure RU insted of Database RU to install database (runInstaller), it will apply both Database Release Update and OCW Release Update, and OCW RU includes the fix of the insallation errors.

If you cannot download required RU, as a workaround, you can export the variable
export CV_ASSUME_DISTID=OEL8
Then restart installation gridSetup.sh (for GI) or runInstaller (for Database).

You may ask why I am using OEL8 as the value of the variable CV_ASSUME_DISTID. No reason, I just randomly pick it. No matter what the value is, even the value is null (empty), the installation will be fixed.

The root cause of the installation error is that the installation utility (gridSetup.sh, runInstaller) is java program. When the OS is not supported by the installation, it will try to read value of variable CV_ASSUME_DISTID. The variable can be OS environment variable, just as I did before with export command, or variable and value pair in file $ORACLE_HOME/cv/admin/cvu_config. By default, the variable line is commented out in the file and OS variable is not set, and java program cannot find the variable and raise exeption java.lang.NullPointerException. It should be considered a bug.

Wednesday, December 1, 2021

Use oradebug To List All Internal Events Set In Oracle Database

Oracle internal events can be set at both sesson level and system level as following,

A. Session level
Session level events only affect current session. Users, who has ALTER SESSION privilege granted, can set event at session level with SQL command,

    alter session set events '<event context>' ;

For example
# -- Set sql_trace (event 10046) to level 12
user01@orcl> alter session set events '10046 trace name context forever, level 12';

# -- Unset/disable sql_trace (event 10046)
user01@orcl> alter session set events '10046 trace name context off';

# -- Set event for ORA-1652: unable to extend temp segment by %s in tablespace %s
user01@orcl> alter session set events '1652 trace name errorstack';

# -- Unset event for ORA-1652
user01@orcl> alter session set events '1652 trace name context off';
B. System level
Events set at systm level affect all database sessions. Users, who has both ALTER SYSTEM and ALTER SESSION privileges granted, can set events at system level with SQL command,

    alter system set events '<event context>';

For example
# -- Set sql_trace (event 10046) to level 12
user01@orcl> alter system set events '10046 trace name context forever, level 12';

# -- Unset/disable sql_trace (event 10046)
user01@orcl> alter system set events '10046 trace name context off';

# -- Set event for ORA-1652: unable to extend temp segment by %s in tablespace %s
user01@orcl> alter system set events '1652 trace name errorstack';

# -- Unset event for ORA-1652
user01@orcl> alter system set events '1652 trace name context off';
Note: ALTER SYSTEM SET EVENTS will set the events at system level for all sessions, as well as session level for current session. Therefore, the user has to be granted both ALTER SYSTEM and ALTER SESSION privileges to run this command.

In order to find out / list all events set at system level or session level, following steps is needed

1. Run SQL*Plus and connect to database as sysdba

The oradebug command will be run in SQL*Plus.

2. Identify the session, this step is only needed if you want to list session level events set in the sessions other than current session

Find out the session's Oracle process identifier (PID), Operating system process identifier (SPID) and Oracle process name (PNAME) with following SQL,
select p.pid, p.spid, p.pname, s.username, s.machine, s.program
  from v$process p, v$session s 
 where p.addr = s.paddr;
Sample output
PID SPID   PNAME  USERNAME   MACHINE         PROGRAM
--- ------ ------ ---------- --------------- ----------------------
 45 24432         USER01     workstation01   sqlplus.exe
The the sample output shows that USER01 is running sqlplus.exe on machine workstation01, which session PID is 45 and SPID is 24432.

3. Attached to the identified session

A. If you want to list system level events or session level events in current session, run oradebug command ,

   oradebug setmypid

For example
sys@orcl> oradebug setmypid
Statement processed.
B. If you want to list events set in the session other than current session, run one of following oradebug commands,

   oradebug setorapid      <PID>
   oradebug setospid       <SPID>
   oradebug setorapname    <PNAME>

For example, attaching the session found in step 2
sys@sys@orcl> oradebug setorapid 45
Oracle pid: 45, Unix process pid: 24432, image: oracle@dbhost01
Or
sys@sys@orcl> oradebug setospid 24432
Oracle pid: 45, Unix process pid: 24432, image: oracle@dbhost01
Here, oradebug setorapname cannot be used to attach the session because the process of the session does not have PNAME (PNAME is null). 

4. List the events

A. List events set at system level, run command

   oradebug eventdump system   -- 10g or newer
   oradebug dump events 4      -- 8i/9i

Sample output
sys@orcl> oradebug eventdump system
1652 trace name errorstack
The trace is enabled for ORA-1652 at system level.

B. List events set at session/process level, run command

   oradebug eventdump session   -- 10g or newer
   oradebug dump events 1          -- 8i/9i for session
   oradebug eventdump process  -- 10g or newer
   oradebug dump events 2          -- 8i/9i for process

Sample output
sys@orcl> oradebug eventdump process
1652 trace name errorstack
sys@orcl> oradebug eventdump session
sql_trace level=12
1652 trace name errorstack
The trace is set for ORA-1652 at both process and session level. And sql_trace is set at session level, the trace level is 12.

Tuesday, November 30, 2021

Identify SQL statements consuming temporary tablespace

You may have to find out the SQL statements which are using temporary tablespace when you get storage pressure for temporary tablespace.

Monday, November 15, 2021

Oracle 19c root.sh not working while gridSetup.sh switchGridHome to software-only installation

When the command "griddSetup.sh -switchGridHome" is used to perform Oracle database 19c out of place (OOP) patching, it will prompt to run root.sh on each cluster node. And root.sh will switch Grid Home from old one (without new patches) to new one (applied with new patches) on each node and also patch Grid Infrastructure Management Repository (GIMR) database on last node of the cluster.

Unfortunately, roo.sh could only work on first node, and not on others if you deploy the software/patches in following steps,

Step 1. Install new GI home with software only option with command,

   ./gridSetup.sh -applyRU <RU_Patch_Location> -applyOneOffs <comma_separated_one-off_patches_location>

At the end of the installation, you will be asked to run root.sh on each node. This software-only version root.sh does not touch old GI home and any running cluster processes (listener, asm, etc.). The new home is applied with new patctes (specified by command option applyRU and/or applyOneOffs).

Step 2. Switch GI from old home to new patched home with command,

   ./gridSetup.sh -switchGridHome

This installation/configuration utility will create a new version of rootconfig.sh under directory "$ORACLE_HOME/crs/config" with home-switching and patching options enabled on local node (known as fist node), the affected options as following,
   Software-only version            New version
   ---------------------------      --------------------------
   PATCH_HOME=false                 PATCH_HOME=true
   MOVE_HOME=false                  MOVE_HOME=true
   TRANSPARENT_MOVE_HOME=false      TRANSPARENT_MOVE_HOME=true
   SWONLY_MULTINODE=true            SWONLY_MULTINODE=false
The rootconfig.sh is called by root.sh. Technically, the utility gridSetup.sh should copy this new version of rootconfig.sh to all other nodes. However, it does not always happen and you will see nothing being changed by root.sh on nodes other than first node where the utility is running. This issue is experienced when new home is patches with Oracle Release Update 19.13.0.0.211019.

If root.sh does not work on other nodes, you can fix it by manually copying new version of rootconfig.sh from first node to other nodes.

Note: Always make a backup of existing rootconfig.sh before overriding it with new rootconfig.sh copied from first node and it is also a good practice to compare two versions of rootconfig.sh and verify what the differences are. 

Saturday, November 6, 2021

EM 12c/13c Agent Unreachable (Agent is running but is currently not read to accept client requests)

EM 12c/13c create Agent Unreachable alerts as follows: 
Host=host01.dbaplus.ca 
Target type=Agent 
Target name=host01.dbaplus.ca:3872 
Categories=Availability 
Message=Agent Unreachable (REASON = The agent is running but is currently not ready to accept client requests). Host is reachable. 
Severity=Critical 
Event reported time=Nov 6, 2021 12:36:48 PM EDT 
Operating System=SunOS
Platform=sparc
Event Type=Target Availability 
Event name=Status 
Availability status=Agent Unreachable
Update Details:
Agent Unreachable (REASON = The agent is running but is currently not ready to accept client requests). Host is reachable.
It usually happens when the agent is monitoring many targets or heavily loaded.  Agent cannot respond to OMS heartbeat, it cause oms to generate Agent Unreachable alerts.

Check the value of agent property MaxInComingConnections with following command
${AGNET_INSTANCE_HOME}/bin/emctl getproperty agent -name MaxInComingConnections
Sample output if the property is configured,
$ emctl getproperty agent -name MaxInComingConnections
Oracle Enterprise Manager Cloud Control 13c Release 5  
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
MaxInComingConnections=50
If the property is not configured, the output looks like following,
$ emctl getproperty agent -name MaxInComingConnections
Oracle Enterprise Manager Cloud Control 13c Release 5  
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
MaxInComingConnections is not a valid configuration property
The property MaxInComingConnections can be configured by adding following line in agent configuration file "${AGENT_INSTANCE_HOME}/sysman/config/emd.properties",
MaxInComingConnections=200
If the property already exists, you can find this line (the value may be different depending on your system) in the file, and there is no need to add new line,  just update the value to a bigger number (e.g. 200).

Note, it is a best practice to do a backup before editing the configuration file.

After saving the change, restart the agent.

Friday, November 5, 2021

Javascript/CSS: Add Copy-to-Clipboard Button in HTML Code Example

When feeling tired to select the text before copying them, I decide to add a COPY button to every code example block in my blog. The button will copy the whole example block to clipboard. I reached it through 4 steps.

Wednesday, November 3, 2021

Javascript/CSS: Add Line Number to HTML Source Code Block

 When I shared the source code in my blog before, I used to reference source code by highlighting the code lines with HTML tag. It was not an issue until I had many references to different lines of a long clode block. It could take too much of my time to highlight or mark the reference when I post an artical. Therefore, I added the line number to the source code in my posts. The line-numbered code block looks like,

Sunday, October 24, 2021

EM 12c/13c The listener is down: Failed to run lsnrctl

The listener is up and running. However, the lisener target shows as Down in the Oracle Enterprise Manager Cloud Control Console and the incident message as following,
The listener is down: Failed to run lsnrctl.

Friday, October 1, 2021

Change SYS password in Oracle Data Guard

In Oracle Data Guard configuration, redo transport uses Oracle Net sessions to transport redo data. These redo transport sessions are authenticated using either the Secure Socket Layer (SSL) protocol or a remote login password file. Most time, remote login password file is adopted because of the difficulty of meeting SSL authentication requirements.

Monday, August 30, 2021

Database startup failed with ORA-600 [dbgripmg_2: infinite init action] [ADR_CONTROL_AUX]

Oracle database startup failed with ORA-00600 as following
SQL> startup
ORACLE instance started.

Total System Global Area 8589878792 bytes
Fixed Size                 12854792 bytes
Variable Size            5402263552 bytes
Database Buffers         3154116608 bytes
Redo Buffers               20643840 bytes
ORA-00600: internal error code, arguments: [dbgripmg_2: infinite init action],
[11], [ADR_CONTROL_AUX], [], [], [], [], [], [], [], [], []

Errors ORA-00700 [dbgrmblcp_corrupt_page]/ORA-00600 [dbgrmblgp_get_page_1] can be found from alert log as following,
Dumping diagnostic data in directory=[cdmp_20210830103319], requested by (instance=1, osid=85656173 (M000)), summary=[incident=186167].
2021-08-30 10:33:19.773000 -04:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/db01/DB01/trace/DB01_m000_11797188.trc  (incident=186169) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/INCIDENT.ams], [11], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/db01/DB01/incident/incdir_186169/DB01_m000_11797188_i186169.trc
Errors in file /u01/app/oracle/diag/rdbms/db01/DB01/trace/DB01_ora_59376292.trc  (incident=186150) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/ADR_CONTROL.ams], [11], [], [], [], [], [], [], [], [], []
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/ADR_CONTROL.ams], [11], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/db01/DB01/incident/incdir_186150/DB01_ora_59376292_i186150.trc
Dumping diagnostic data in directory=[cdmp_20210830103320], requested by (instance=1, osid=85656173 (M000)), summary=[incident=186168].
2021-08-30 10:33:21.211000 -04:00
Errors in file /u01/app/oracle/diag/rdbms/db01/DB01/trace/DB01_m000_11797188.trc  (incident=186170) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/INCIDENT.ams], [11], [], [], [], [], [], [], [], [], []
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/INCIDENT.ams], [11], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/db01/DB01/trace/DB01_m000_11797188.trc  (incident=186171) (PDBNAME=CDB$ROOT):
ORA-00600: internal error code, arguments: [dbgrmblgp_get_page_1], [11], [0], [0], [], [], [], [], [], [], [], []
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/INCIDENT.ams], [11], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/db01/DB01/incident/incdir_186171/DB01_m000_11797188_i186171.trc
Errors in file /u01/app/oracle/diag/rdbms/db01/DB01/trace/DB01_ora_59376292.trc  (incident=186151) (PDBNAME=CDB$ROOT):
ORA-00600: internal error code, arguments: [dbgrmblgp_get_page_1], [11], [0], [0], [], [], [], [], [], [], [], []
ORA-00700: soft internal error, arguments: [dbgrmblcp_corrupt_page], [/u01/app/oracle/diag/rdbms/db01/DB01/metadata/ADR_CONTROL.ams], [11], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/db01/DB01/incident/incdir_186151/DB01_ora_59376292_i186151.trc

ADRCI is accessing ams files which are corrupted. In the example alert, corrupted files are INCIDENT.ams and ADR_CONTROL.ams under $ADR_HOME. The corrupted ams file may be different in different scenarios.

Solution is that deleting the corrupted files and restart database.

Wednesday, July 7, 2021

AQ Queue Monitor Does Not Change Message MSG_STATE from WAIT to READY on RAC

In RAC database, the Advanced Queue (AQ) message MSG_STATE (column of view AQ$<QUEUE_TABLE>) stays with value 'WAIT' and will not be changed to 'READY' any more. It usually happens after database was shut down with immediate/abort option followed by database patching/upgrading. 

Friday, June 25, 2021

EM 12c/13c Update Oracle Home Path of Targets with SQL

 As out-of-place patching can dramatically help minimize database downtime, especially when multiple databases run out of same home, it does not require all databases shut down at same time, more and more DBA adopt this method. However, if you are patching a home which host multiple databases, you may feel frustrated to update the databases' Oracle Home property in Oracle Enterprise Manage by clicking through different windows and changing it one by one. Now, the first question you will ask is how I can update all databases at one time. The answer is to run SQL (PL/SQL) in EM repository database.

Wednesday, June 23, 2021

EM 12c/13c Change Lifecycle Status with SQL

The Lifecycle Status property of EM targets is often used to prioritize the notification of incident rule. For example, DBA will be paged while database which Lifecycle Status is 'Production' crashed, and only emailed for 'Test' database. This post is going to show how to use sql scripts to find out targets with different Lifecycle Status and how to change the Lifecycle Status.

Tuesday, June 22, 2021

EM 12c/13c How to find out new discovered targets with SQL

 When agent is deployed to new host, or new targets is installed or created on existing hosts, Oracle Enterprise Manger (EM) Cloud Control can discover the targets automatically. However, EM does not promote the new discovered targets automatically.

Sunday, June 20, 2021

Script Run datapatch against all running instance in parallel

This script can be used to run datapatch against all currently running instances in parallel to apply sql patches after patches are applied on Oracle homes. It is helpful for post patching operations. It works for different instances running out of different Oracle homes and different home with different owners.

The script accepts one optional parameter as log file path, it will save log files under /tmp if no parameter is presented while the script is started.

In order to have the script succeed, please be advised about following limitations,

1. The script is only tested on Linux and AIX, it does not work on Solaris.

2. Only root or Oracle database home owner is supposed to run this script. If the instances are running out of different Oracle homes which are owned by different OS users, root user is recommended. Otherwise, the instances running out of Oracle home which owner is different from current user will be excluded.

3. Oracle database home is 12c or higher which supports datapatch

4. The script runs following command as root to retrieve Oracle home path
   /bin/ls
   Therefore, if Oracle home owner (normally oracle) runs this script, sudo has to be configure to grant the user privilege to run '/bin/ls' as root without being asked for password.

Friday, June 18, 2021

Install PostgreSQL on RHEL/OL/CentOS 6/7/8

There are many ways to install PostgreSQL on Linux box. Here, I am going to use yum to install different version of PostgreSQL on Red Hat Enterprise Linux (RHEL) 7/8, it also works for Oracle Linux (OL) / CentOS 7/8.

Thursday, June 17, 2021

Java Stored Procedure failed with java.lang.OutOfMemoryError

Oracle database Jave stored procedure failed with "java.lang.OutOfMemoryError", the error stack looks like
ORA-29532: Java call terminated by uncaught Java exception: java.lang.OutOfMemoryError
ORA-06512: at "USER01.MYJAVAPROC", line 28
ORA-06512: at line 7

This error is thrown when there is insufficient space to allocate an object in the Java heap. In this case, The Java garbage collector cannot make space available to accommodate a new object, and the heap cannot be expanded further.  

Wednesday, June 16, 2021

Oracle 19.11 roothas.sh failed with "Out of memory" on AIX

When applying Oracle GI Release Update 19.11.0.0.210420 on AIX, command "roothas.sh -postpatch" failed with "Out of memory" as following
[root@host01]# /u01/app/oracle/product/19.0.0/grid_1/crs/install/roothas.sh -postpatch
Using configuration parameter file: /u01/app/oracle/product/19.0.0/grid_1/crs/install/crsconfig_params
The log of current session can be found at:
  /u01/app/oracle/crsdata/host01/crsconfig/hapatch_2021-06-15_01-53-27PM.log
Out of memory!
Out of memory!
Out of memory!
/u01/app/oracle/product/19.0.0/grid_1/crs/install/roothas.sh[137]: 7930494 Segmentation fault(coredump)
The command '/u01/app/oracle/product/19.0.0/grid_1/perl/bin/perl -I/u01/app/oracle/product/19.0.0/grid_1/perl/lib -I/u01/app/oracle/product/19.0.0/grid_1/crs/install -I/u01/app/oracle/product/19.0.0/grid_1/xag /u01/app/oracle/product/19.0.0/grid_1/crs/install/roothas.pl -postpatch' execution failed

Tuesday, June 15, 2021

Configure yum with proxy server on RHEL/OL/Fedora

When using yum to maintain packages on hosts running Red Hat Enterprise Linux (RHEL), Oracle Linux (OL) or Fedora, you may have to configure proxy option if the host is running behind firewall.

In order to do that, add following lines to yum configuration file /etc/yum.conf

proxy=http://<proxy server>:<port>
proxy_username=<user name>
proxy_password=<password>

Here, 

  <proxy server> is the host name or IP address of proxy server
  <user name> user name if proxy server requires
  <password>  password if proxy server requires

File /etc/yum.conf  example
gpgcheck=1
installonly_limit=3
clean_requirements_on_remove=True
best=True
skip_if_unavailable=False
proxy=http://proxy.dbaplus.ca:8080
proxy_username=user01
proxy_password=user01pwd

Thursday, June 3, 2021

How to Permanently Change PowerShell Prompt

 The PowerShell command prompt indicates that PowerShell is ready to run a command,
PS C:\>

It is determined by the built-in Prompt function and can be customized by running following command to re-define Prompt function,

  function prompt {"<Script block>"}

Here, <Script block> is script block which will determine/generate prompt value (character string). For example, following have prompt include current date and time
PS C:\>
PS C:\> function Prompt {"$(Get-Date)> "}
06/03/2021 11:39:23>
06/03/2021 11:39:25>

The change is only valid for current session, the prompt of new session is still default value. In order to keep new prompt for all sessions, you have to create your own Prompt function and saving it in your PowerShell profile as following,

1. Find out your PowerShell profile file with "$profile"
PS C:\> $profile
C:\Users\admin\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1
PS C:\>

2. Add customized Prompt function to profile

If the file or directory does not exist, you can manually create it, then add customized Prompt function to the file. For example, add following lines
function Prompt { "PS [" + ${ENV:USERNAME} + "@" + ${ENV:COMPUTERNAME} + "]> "}

The prompt will be in the format "PS [user-name@computer-name]> " in all new PowerShell sessions.

Sunday, May 16, 2021

Grid Infrastructure 12.2 restore OCR, Voting File and mgmtdb when disk group corrupted

For some reason, Oracle desupported the placement of OCR and voting files directly on a shared file system from Grid Infrastructure 12.2 until rescinding the desupport for Standalone Clusters from 19.3. Therefore, when GI 12.2 is installed, OCR, Voting file, and OCR backup location are, by default, configured in the ASM disk group, and mostly likely in the same disk group. When the disk group cannot be mounted because of any issues, the cluster will not be brought up anymore.

Technically, GI has to be re-configured as like a new installation, that could be a big job. However, it may not be that bad. Although 12.2 does not allow you have OCR backup location out of ASM disk group, DBA still can copy the OCR backup file from ASM disk group to file system with command 'asmcmd cp'. I am going to demonstrate how we can maximize the opportunity of bringing the cluster backup without rebuild/reinstall it.

Sunday, May 9, 2021

Windows How to log all output on the console to text file

Windows Command Prompt console does not have built-in facility to log the console output to file. In order to implement logging function, PowerShell has to be used instead of normal command console.

To enable logging, running following command under PowerShell prompt,

  Start-Transcript [-Path] "<file-name>" [-Append]

Here, 

  -Path parameter is used to specify the file "<file-name>" and "<file-name>" is full path of log file in which output messages will be saved

  -Append Indicates that the logging text will be added to the end of an existing file instead of overriding.

To stop logging, running command

  Stop-Trasnscript

Note: PowerShell can be started on Windows by clicking "Windows PowerShell" applicaiton or running command "PowerShell" in normal "Command Prompt" console, but DO NOT start PowerShell in "Windows Terminal". If the PowerShell is started from "Window Terminal", Start-Transcript may not be able to log everything. For example, non-Windows built-in command (e.g. sqlplus) output cannot be completely logged.
  
Example

PS> Start-Transcript -Path "C:\temp\test.log"
Transcript started, output file is C:\temp\test.log
PS>
PS> sqlplus system/oracle@orcl

SQL*Plus: Release 12.2.0.1.0 Production on Sun May 9 20:30:15 2021

Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Last Successful login time: Tue Apr 13 2021 16:46:29 -04:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

system@orcl> select host_name from v$instance;

HOST_NAME
----------------------------------------------------------------
host01

system@orcl> exit
Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
PS>
PS> Stop-Transcript
Transcript stopped, output file is C:\temp\test.log

The content of file C:\temp\test.log

**********************
Windows PowerShell transcript start
Start time: 20210509202857
Username: dbaplus
RunAs User: dbaplus
Configuration Name: 
Machine: wkstn01 (Microsoft Windows NT 10.0.19042.0)
Host Application: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
Process ID: 15732
PSVersion: 5.1.19041.906
PSEdition: Desktop
PSCompatibleVersions: 1.0, 2.0, 3.0, 4.0, 5.0, 5.1.19041.906
BuildVersion: 10.0.19041.906
CLRVersion: 4.0.30319.42000
WSManStackVersion: 3.0
PSRemotingProtocolVersion: 2.3
SerializationVersion: 1.1.0.1
**********************
Transcript started, output file is C:\temp\test.log
PS> sqlplus system/oracle@orcl

SQL*Plus: Release 12.2.0.1.0 Production on Sun May 9 20:30:15 2021

Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Last Successful login time: Tue Apr 13 2021 16:46:29 -04:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

system@orcl> select host_name from v$instance;

HOST_NAME
----------------------------------------------------------------
host01

system@orcl> exit
Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
PS> Stop-Transcript
**********************
Windows PowerShell transcript end
End time: 20210509203043
**********************

Wednesday, May 5, 2021

OEM 12c/13c Database Target Discovered and Named with host name as suffix

 When EM discovers database targets, it generates default target name for the database in one of following formats,

  db_unique_name.db_domain   if both parameters db_unique_name and db_domain are set
  db_unique_name             if parameter db_unique_name is set, but db_domain is not set
  db_name.db_domain          if parameter db_unique_name is not set, but db_domain is set
  db_name                    if neither db_unique_name nor db_domain is set

EM discovery script retrieves these parameters' values from parameter file (pfile or spfile) of the database, not from running instance. If the discovery process has issue to locate or process parameter file, EM will name database target in format,

  <sid>_<hostname>
  
Therefore, when you find the new discovered database target is named in this format, it means EM agent discovery script has issue with parameter file. The details can be found from agent trace file "<AGENT_INST_HOME>/sysman/log/emagent_perl.trc". Most popular error messages look like

  ERROR:  initParameterFileUtl::convertSPFileToPFile: Failed to convert spfile
  
or

  ERROR:  initParameterFileUtl::getParameterFile: Cannot find any init parameter file for instance <instancename> in oracle home  <oracle_home>

For example, when Oracle Restart (standalone Grid Infrastructure) is installed and ASM storage is configured on server host1.dbaplus.ca, DBCA creates a database orcl (sid & db_name) in ASM diskgroup, the spfile will also be saved in ASM diskgroup and no parameter file (pfile/spfile) will be created under directory <ORACLE_HOME>/dbs. The database will be discovered by EM with default name orcl_host1.dbaplus.ca and you will see errors in agent trace file "emagent_perl.trc",

oracledb.pl: 2021-05-04 04:03:08,923: INFO:  DB_LISTENER_DISCOVERY:  processing sid="orcl"
oracledb.pl: 2021-05-04 04:03:08,926: ERROR:  initParameterFileUtl::getParameterFile: Cannot find any init parameter file for instance orcl in oracle home /u01/app/oracle/product/19.9.0/dbhome_1/dbs
oracledb.pl: 2021-05-04 04:03:08,930: ERROR:  initParameterFileUtl::getParameterFile: Cannot find any init parameter file for instance orcl in oracle home /u01/app/oracle/product/19.9.0/dbhome_1/dbs
oracledb.pl: 2021-05-04 04:03:09,065: ERROR:  initParameterFileUtl::getParameterFile: Cannot find any init parameter file for instance orcl in oracle home /u01/app/oracle/product/19.9.0/dbhome_1/dbs
oracledb.pl: 2021-05-04 04:03:09,069: ERROR:  initParameterFileUtl::getParameterFile: Cannot find any init parameter file for instance orcl in oracle home /u01/app/oracle/product/19.9.0/dbhome_1/dbs

The reason is that discovery script cannot find parameter file from <ORACLE_HOME>/dbs. To fix this problem, create a pfile 'initorcl.ora' under <ORACLE_HOME>/dbs with following content,

   spfile='<full path of spfile saved in diskgroup>'

Note: DO NOT leave any SPACE character at the beginning of the line (before word "spfile"). If you did, you will not see any errors in the trace file, but the target name will be <sid>_<hostname>.

Sample init file

$ srvctl config database -db orcl | grep spfile
Spfile: +DATA/orcl/PARAMETERFILE/spfile.919.1071658047
$
$ echo "spfile='+DATA/orcl/PARAMETERFILE/spfile.919.1071658047'" > $ORACLE_HOME/dbs/initorcl.ora
$
$ cat $ORACLE_HOME/dbs/initorcl.ora
spfile='+DATA/orcl/PARAMETERFILE/spfile.919.1071658047'

Sunday, May 2, 2021

Oracle 19.11 deinstall failed with "ERROR: oracle/rat/tfa/util/ManageTfa"

After Oracle database 19c home is applied Release Update 19.11.0.0.210420, the home cannot be de-installed anymore.

The deinstall utility will fail with errors,
######################## DEINSTALL CLEAN OPERATION START ########################
## [START] Preparing for Deinstall ##
Setting LOCAL_NODE to thanos
Setting CRS_HOME to false
Setting oracle.installer.invPtrLoc to /tmp/deinstall2021-05-02_07-34-15PM/oraInst.loc
Setting oracle.installer.local to false

ERROR: oracle/rat/tfa/util/ManageTfa
Exited from program.


############# ORACLE DEINSTALL TOOL END #############

In deinstall error log,
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at oracle.install.db.deinstall.wrapper.Deinstall.callCleanImpl(Deinstall.java:1876)
        at oracle.install.db.deinstall.wrapper.Deinstall.main(Deinstall.java:907)
Caused by: java.lang.NoClassDefFoundError: oracle/rat/tfa/util/ManageTfa
        at oracle.install.db.deinstall.core.PrepForOUIDeinstall.cleanConfig(PrepForOUIDeinstall.java:187)
        ... 6 more
Caused by: java.lang.ClassNotFoundException: oracle.rat.tfa.util.ManageTfa
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 7 more

Looks like something related to TFA is missing after RU 19.11 is applied.

Workaround,

Roll back RU 19.11, then re-run deinstall utility.

Oracle 19c runInstaller failed with "undefined reference to 'jox_eujs_nowait_'"

Updated on Thursday, Oct 21, 2021 with following,

The error is also found with RU 19.13 (19.13.0.0.211019).
I believe it is same thing with 19.12, though I did not test it.

When installing Oracle 19c with patch apply of Database RU 19.11 and OJVM RU 19.11, runInstaller failed with message

   Error in invoking target 'irman ioracle idrdactl idrdalsnr idrdaproc' of makefile '/u01/app/oracle/product/19.11.0/dbhome_1/rdbms/lib/ins_rdbms.mk'. See /u01/app/oraInventory/logs/InstallActions2021-05-01_08-40-24PM/installActions2021-05-01_08-40-24PM.log for details.

Saturday, May 1, 2021

OEM 13c Software Standardization Advisor Returns Collection Errors

In EM 13c console, access "Software Standardization Advisor" as following,

 Targets -> Databases -> Administration -> Software Standardization Advisor
 
You could see a number beside "Collection Errors" for database and/or Grid infrastructure home. The issue can be found in EM 13.3, 13.4 and 13.5.

Click the number beside "Collection Errors", all Oracle home targets which have "Metric Collection Errors" are listed with Path, Host and Owner information. 

Click the name of Oracle Home Target to show home page of the target, at the bottom of "Summary" section, click link "Found 1 Metric Collection Error(s)" beside "Reason".

Metric "Files affected by a Patch" has following "Message"

     java.lang.UnsupportedOperationException: Collection Result Maximum Flood Control Level Exceeded

The issue can be reproduced / verified by run following command on the host where the Oracle home exists,

  <AGENT_HOME>/bin/emctl control agent runCollection <Target name of the Oracle home>:oracle_home oracle_home_config

$ emctl control agent runCollection OraHome12_host01:oracle_home oracle_home_config
Oracle Enterprise Manager Cloud Control 13c Release 5  
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
EMD runCollection completed successfully

Following lines are found from the log file "<AGENT_INST_HOME>/sysman/log/gcagent.log"
2021-04-29 20:25:18,725 [668:D9EFD971:HTTPListener--668 (DispatchRequests emdctl@28474@amber.corp.toronto.ca=>[161974231840001])] INFO - >>> Dispatching request: RunCollectionRequest <<<
2021-04-29 20:25:22,986 [1129:A9983BB8:GC.OnDemand.8 (oracle_home:OraHome12_host01:Real-time collection oracle_home_config) (oracle_home:OraHome12_host01:oracle_home_config:PatchFixedBug)] WARN - Result set exceeding min flood control level
2021-04-29 20:25:22,992 [1129:A9983BB8:GC.OnDemand.8 (oracle_home:OraHome12_host01:Real-time collection oracle_home_config) (oracle_home:OraHome12_host01:oracle_home_config:PatchFixedBug)] WARN - Result set exceeding min flood control level
2021-04-29 20:25:24,398 [1129:A9983BB8:GC.OnDemand.8 (oracle_home:OraHome12_host01:Real-time collection oracle_home_config) (oracle_home:OraHome12_host01:oracle_home_config:PatchFixedBug)] WARN - Result set exceeding min flood control level
2021-04-29 20:25:24,406 [1129:GC.OnDemand.8 (oracle_home:OraHome12_host01:Real-time collection oracle_home_config) (oracle_home:OraHome12_host01:oracle_home_config:PatchedFile)] ERROR - Result set exceeded max flood control level
2021-04-29 20:25:24,408 [1129:GC.OnDemand.8 (oracle_home:OraHome12_host01:Real-time collection oracle_home_config) (oracle_home:OraHome12_host01:oracle_home_config:PatchedFile)] ERROR - oracle_home:OraHome12_host01:oracle_home_config:PatchedFile
java.lang.UnsupportedOperationException: Collection Result Maximum Flood Control Level Exceeded
at oracle.sysman.emSDK.agent.datacollection.CollectionResult.performFloodControl(CollectionResult.java:459)
at oracle.sysman.emSDK.agent.datacollection.CollectionResult.addCollectionRow(CollectionResult.java:662)
at oracle.sysman.gcagent.addon.fetchlet.osfetchlet.BaseOSFetchlet.getOSMetric(BaseOSFetchlet.java:1157)
at oracle.sysman.gcagent.addon.fetchlet.osfetchlet.BaseOSFetchlet.getMetric(BaseOSFetchlet.java:476)
at oracle.sysman.gcagent.target.interaction.execution.FetchletFactory.getMetric(FetchletFactory.java:437)
at oracle.sysman.gcagent.target.interaction.execution.ExecuteTask.executeQueryDescriptor(ExecuteTask.java:1284)
at oracle.sysman.gcagent.target.interaction.execution.ExecuteTask.runTask(ExecuteTask.java:3167)

From the log, we can find two exceptions:

  1. When collectiong metric PatchFixedBug (Bugs fixed by Patch), got WARN message "Result set exceeding min flood control level"
  2. When collectiong metric PatchedFile (Files affected by Patch), got ERROR message "Result set exceeded max flood control level"

The ERROR can be also confirmed by collecting specific metric "Files affected by Patch" with command

  <AGENT_HOME>/bin/emctl getmetric agent <Oracle Home target name>,oracle_home,PatchedFile

$ emctl getmetric agent OraHome12_host01,oracle_home,PatchedFile
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
EMD getmetric error: Collection Result Maximum Flood Control Level Exceeded


Solution

1.  Agent side settings:

  * Set Min flood control value "CollectionResults.MaximumRowsFloodControlMin" to remove WARN message

    <AGENT_HOME>/bin/emctl setproperty agent -allow_new -name CollectionResults.MaximumRowsFloodControlMin -value 5000

  * Set Max flood control value "CollectionResults.MaximumRowsFloodControlMax" to fix ERROR issue
  
    <AGENT_HOME>/bin/emctl setproperty agent -allow_new -name CollectionResults.MaximumRowsFloodControlMax -value 50000

  * Verify the new value of properties with commands
  
    <AGENT_HOME>/bin/emctl getproperty agent -name CollectionResults.MaximumRowsFloodControlMin
    <AGENT_HOME>/bin/emctl getproperty agent -name CollectionResults.MaximumRowsFloodControlMax

    Or reviewing agent property file "<AGENT_INST_HOME>/sysman/config/emd.properties"

    grep 'CollectionResults' <AGENT_INST_HOME>/sysman/config/emd.properties

  * Manually start metric collection to reset the error status or you can wait until next scheduled metric collection job running (by default, every 24 hours)

    <AGENT_HOME>/bin/emctl control agent runCollection <Oracle home target name>:oracle_home oracle_home_config

  * In case you change your mind (or want to see the errors) the new value of properties can be cleared/removed with commands

    <AGENT_HOME>/bin/emctl clear_property agent -name CollectionResults.MaximumRowsFloodControlMin
    <AGENT_HOME>/bin/emctl clear_property agent -name CollectionResults.MaximumRowsFloodControlMax

Sample output
$ emctl getproperty agent -name CollectionResults.MaximumRowsFloodControlMin
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
CollectionResults.MaximumRowsFloodControlMin is not a valid configuration property

$ emctl getproperty agent -name CollectionResults.MaximumRowsFloodControlMax
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
CollectionResults.MaximumRowsFloodControlMax is not a valid configuration property

$ emctl setproperty agent -allow_new -name CollectionResults.MaximumRowsFloodControlMin -value 5000
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
EMD setproperty succeeded

$ emctl setproperty agent -allow_new -name CollectionResults.MaximumRowsFloodControlMax -value 50000
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
EMD setproperty succeeded

$ emctl getproperty agent -name CollectionResults.MaximumRowsFloodControlMin
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
CollectionResults.MaximumRowsFloodControlMin=5000

$ emctl getproperty agent -name CollectionResults.MaximumRowsFloodControlMax
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
CollectionResults.MaximumRowsFloodControlMax=50000

$ grep 'CollectionResults' /u01/app/oracle/em13.5/agent/agent_inst/sysman/config/emd.properties
CollectionResults.MaximumRowsFloodControlMin=5000
CollectionResults.MaximumRowsFloodControlMax=50000

2. OMS side settings:

  * Setting the min value:
    
    <OMS_HOME>/bin/emctl set property -name CollectionResults.MaximumRowsFloodControlMin -value 5000

  * Setting the max value:
    
    <OMS_HOME>/bin/emctl set property -name CollectionResults.MaximumRowsFloodControlMax -value 50000

  * Verify the new value of properties
  
    <OMS_HOME>/bin/emctl get property -name CollectionResults.MaximumRowsFloodControlMin
    <OMS_HOME>/bin/emctl get property -name CollectionResults.MaximumRowsFloodControlMax
    
  * Delete new properties if they are not needed

    <OMS_HOME>/bin/emctl delete property -name CollectionResults.MaximumRowsFloodControlMin
    <OMS_HOME>/bin/emctl delete property -name CollectionResults.MaximumRowsFloodControlMax

Sample output
[oracle@oms]$ emctl get property -name CollectionResults.MaximumRowsFloodControlMin
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
SYSMAN password:
Value for property CollectionResults.MaximumRowsFloodControlMin for oms All Management Servers is null

[oracle@oms]$ emctl get property -name CollectionResults.MaximumRowsFloodControlMax
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
SYSMAN password:
Value for property CollectionResults.MaximumRowsFloodControlMax for oms All Management Servers is null

[oracle@oms]$ emctl set property -name CollectionResults.MaximumRowsFloodControlMin -value 5000
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
SYSMAN password:
Property CollectionResults.MaximumRowsFloodControlMin has been set to value 5000 for all Management Servers
OMS restart is not required to reflect the new property value

[oracle@oms]$ emctl set property -name CollectionResults.MaximumRowsFloodControlMax -value 50000
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
SYSMAN password:
Property CollectionResults.MaximumRowsFloodControlMax has been set to value 50000 for all Management Servers
OMS restart is not required to reflect the new property value

[oracle@oms]$ emctl get property -name CollectionResults.MaximumRowsFloodControlMin
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
SYSMAN password:
Value for property CollectionResults.MaximumRowsFloodControlMin at Global level is 5000

[oracle@oms]$ emctl get property -name CollectionResults.MaximumRowsFloodControlMax
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation.  All rights reserved.
SYSMAN password:
Value for property CollectionResults.MaximumRowsFloodControlMax at Global level is 50000

Sunday, April 25, 2021

OEM 13c EMGC_ADMINSERVER & EMGC_OMS1 target status show DOWN when emctl secure wls with cut certificate

 When custom certificate is configured for OEM 13c, WebLogic Servers installed as part of Enterprise Manager Cloud control (Administration Server and Managed Servers) can be secured with custom certificate using following command,

   $OMS_HOME/bin/emclt secure wls

However, the WebLogic Servers and their deployments could show down in OEM console after secured with custom certificate, though they are still running well.

The reason is that the CA involved in issuing the custom certificate for OMS is not "well known", at least it is not accepted by Oracle as default trusted CA. When agent running on OMS server communicates with WebLogic Servers (WLS), WLS is using the custom certificate as his own identification, but agent cannot find trusted certificates of CA involved in issuing the custom certificate from agent's local keystore. Therefore, agent cannot verify the validation of WLS's certificate, and stops the communication with WLS.

The quick fix is to import the certificate of each CA involved in issuing the ticket into agent local keystore with following command,

    $AGENT_HOME/bin/emctl secure add_trust_cert_to_jks -trust_certs_loc <ca_certificate_file> -alias <certificate_alias> [-password <keystore_pwd>]

Here, <certificate_alias> is used to identify the certificates saved in the keystore, they must be unique for each certificate, <keystore_pwd> is the password of the keystore, the default value is welcome.

For example, I have installed a CA in my lab network, and the CA issued certificate to my OMS server. The two of my CA server certificates (root certificate & intermediate certificate) has to be imported into agent keystore as following,

 $AGENT_HOME/bin/emctl stop agent

 $AGENT_HOME/bin/emctl secure add_trust_cert_to_jks -password welcome -alias dbaplus-root -trust_certs_loc /home/oracle/Root_CA_Certificate.txt

 $AGENT_HOME/bin/emctl secure add_trust_cert_to_jks -password welcome -alias dbaplus-intermediate -trust_certs_loc /home/oracle/Intermediate_CA_Certificate.txt

 $AGENT_HOME/bin/emctl start agent

List the certificates imported into agent monitor keystore,

 $AGENT_HOME/jdk/bin/keytool -list -alias <certificate_alias> -keystore   $AGENT_INSTANCE_HOME/sysman/config/montrust/AgentTrust.jks -storepass welcome -v

If needed, the certificates can be removed from keystore as following

 $AGENT_HOME/jdk/bin/keytool -delete -alias <certificate_alias> -keystore   $AGENT_INSTANCE_HOME/sysman/config/montrust/AgentTrust.jks -storepass welcome -v

Saturday, April 24, 2021

OEM 13c Target "EM Jobs Service" shown as down in EM Console while all associated targets are up

"EM Jobs Service" target status is showing down in Enterprise Manager Cloud Control(EM) console even though all associated targets are up and running. It could be an issue with the metric collection definition. Usually it is seen at post blackout of associated targets.

The status of EM Jobs Service is aggregated target status, it is calculated based on the status of the associated targets. The associated targets and calculation logic are defined by default when the system is installed, and you can change it later.

The issue can be fixed by changing/restoring Availability Definition of the service as following,

1. In EM Console navigate to the following menu

     Targets > Services > Click on "EM Jobs Service" target

2. In "EM Jobs Service" home page, click on the tab "Monitoring Configuration" and then click on the link "Availability Definition"

3. Take a screen shot of the "Availability Definition" configurations, change the definition to a different option and click OK to save it.

    For Instance, If Availability Definition is to consider "All key components are up" (default definition), change it to "At least one key component is up" and save change.

4. Now revert "Availability Definition" of the service back to original configration by following the same procedure.

    For instance, change and save "Availability Definition" to "All key components are up"

The target status shows up as all components are up.

Wednesday, April 14, 2021

OEM 12c/13c Agent Deployment fails with "Remote Validatons: Shell Path Validation Failed"

When deploying agent on OEM 12c/13c using 'Add Host Targets' wizard, the deployment fails with

Remote Validations:  Shell Path validation failed

Cause:  Shell path is incorrect or not defined.:/bin/bash(SH_PATH),-c(SH_ARGS) on host <host name> 

Recommendation:  Check the property values in the following files in this order, ssPaths_<plat>.properties or sPaths.properties or Paths.properties, in "/u01/app/oracle/em13.4/middleware/oui/prov/resources" directory. If the property values are correct, then ensure the login user account is enabled for remote logins.For more details, refer to the Oracle Enterprise Manager Basic Installation Guide.

Most common reason why it happened could be one of following

1. Shell (sh, bash & ksh) location is different from OEM defined location
OEM defined shell location can be found from file 'ssPaths_<platform>.properties' under directory '$OMS_HOME/oui/prov/resources'. For example, if the errors happens on deploying agent to AIX host, type the content of file 'ssPaths_aix.properties' which looks like following
SH_PATH=/bin/bash
SH_ARGS=-c
SHELL_PATH=/bin/bash
SHELL_ARGS=-c
KSH_PATH=/usr/bin/ksh
RMDIR_ARGS=
#the date should be in the format of year:month:date:hour:minute:second
DATE_ARGS=-u +%y:%m:%d:%H:%M:%S
PING_PATH=/usr/sbin/ping
SSH_KEYGEN_PATH=/usr/bin/ssh-keygen
TAR_EXCLUDE_ARGS=X
TAR_INCLUDE_ARGS=-I
DF_COL_NAME=avail
SSH_HOST_KEY_LOC=/etc/ssh

On the host where agent is going to be installed, check if the executables/shell exist and are located at same place as in the OEM file 'ssPaths_<platform>.properties'. In previous example files, the executables/shell are

/bin/bash
/usr/bin/ksh
/usr/bin/ssh-keygen

If it does not exist, you have to install it. If it exists but is located at different directory, edit the OEM file and replace the shell/executable path with the directory where the shell/executable is.

2. Incorrect user name or password configured in Named Credential which is used to deploy the agent

If incorrect user name or password is used, the error could also happen. If you do not have the password of the user defined in Named Credential, the issue can be confirmed by checking following log file on oms server,
  
$OMS_INSTANCE_BASE/em/EMGC_OMS1/sysman/agentpush/<timestamp>/applogs/<host_name>_deploy.log

For example, the failed deployment log is

  /u01/app/oracle/em13.4/gc_inst/em/EMGC_OMS1/sysman/agentpush/2021-04-13_12-58-49-PM/applogs/host01.lab.dbaplus.ca_deploy.log

And following message is found in the log
2021-04-13_12-59-55:INFO:===VALIDATION===:Checking SH_PATH on target nodes
2021-04-13_12-59-55:INFO:isWrongShPath:remotePathPropertiesLoc:/u01/app/oracle/em13.4/middleware/oui/prov/resources Platform id:212
2021-04-13_12-59-55:INFO:NODES=host01.lab.dbaplus.ca
2021-04-13_12-59-55:INFO:Running cmd /bin/bash -c /bin/true on node host01.lab.dbaplus.ca
2021-04-13_12-59-55:INFO:Action description Execution of command /bin/bash -c /bin/true  on host host01.lab.dbaplus.ca
2021-04-13_12-59-55:INFO:Attempt :1 pty required false  with no inputs
2021-04-13_12-59-56:INFO:/bin/bash -c /bin/true execution failed on host host01.lab.dbaplus.ca
2021-04-13_12-59-56:INFO: OUT null
2021-04-13_12-59-56:INFO: ERR WARNING: Your password has expired.
Password change required but no TTY available.

We can see that the password has expired, ask system administrator to reset the password and also update the password for Named Credential.

The easist way to eliminate this error because of user name or password issue is to ask system administrator to test the login manually out of OEM.

Sunday, April 11, 2021

OEM 13c Discovering WebLogic Domain failed to save Node Manger target with error 'This target requires a local Management Agent'

When discovering or refreshing a WebLogic Domain or Fusion Middleware Farm in Enterprise Manager (EM) 13.4 Cloud Control, the Node Manager target is not saved. The error is shown in EM:

Failed to save NM_xxx_x(Oracle WebLogic Node Manager) on host <IP/host name>. This target requires a local Management Agent, but a local Management Agent was not found.  In order to add this target, you need to install a Management Agent on the same host as the target and then perform a "Refresh WebLogic Domain" operation.

The agent has been installed on the host. The errors happened because of difference between Listen Address of Node Manager configuration and host name of EM Agent URL. As a solution, the Listen Address Node Manager should be changed to host name of EM Agent URL.

Oracle explains it as incorrect configuration of Oracle WebLogic Node Manager. Therefore, it could happen on all release of EM 13c. However, I can only reproduce the problem in EM 13.1 and 13.4 when Listen Address of WebLogic Node Manager is configured with IP address instead of host name which is used by EM Agent URL and there is no problem with EM 13.2. Anyway, having both configuration use same host name is not bad idea.

Find out host name of EM agent URL with command <AGENT_HOME>/bin/emctl status agent
$ /u01/app/oracle/em13.4/agent/agent_13.4.0.0.0/bin/emctl status agent
Oracle Enterprise Manager Cloud Control 13c Release 4
Copyright (c) 1996, 2020 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version          : 13.4.0.0.0
OMS Version            : 13.4.0.0.0
Protocol Version       : 12.1.0.1.0
Agent Home             : /u01/app/oracle/em13.4/agent/agent_inst
Agent Log Directory    : /u01/app/oracle/em13.4/agent/agent_inst/sysman/log
Agent Binaries         : /u01/app/oracle/em13.4/agent/agent_13.4.0.0.0
Core JAR Location      : /u01/app/oracle/em13.4/agent/agent_13.4.0.0.0/jlib
Agent Process ID       : 76282
Parent Process ID      : 76240
Agent URL              : https://host01.lab.dbaplus.ca:3872/emd/main/
Local Agent URL in NAT : https://host01.lab.dbaplus.ca:3872/emd/main/
Repository URL         : https://oms.lab.dbaplus.ca:4903/empbs/upload
Started at             : 2021-04-07 17:53:56
Started by user        : oracle
Operating System       : Linux version 4.1.12-124.46.4.1.el7uek.x86_64 (amd64)
...
---------------------------------------------------------------
Agent is Running and Ready

Change Listen Address of Node Manager to the host name of EM Agent URL in the WebLogic Admin Console as following,
1. Go to Node Manger configuraiton page

   Environment > Machines > [Machine Name] > Configuration > Node Manager

2. Click 'Lock & Edit' to enable edit mode

3. Set the value of "Listen Address" property to the host name given by previous command 'emctl status agent'

4. Click 'Save', then click 'Activate Changes'

Refresh or rediscover the domain, the Node Manger will be discovered successfully.

Wednesday, March 31, 2021

OEM 13.4 AgentPull.sh failed with '0403-011 The specified substitution is not valid for this command'

When using Oracle Enterprise Manager 13.4 AgentPull.sh to deploy Agent 13.4 on AIX server, got following error,
./AgentPull.sh[270]: platform=${platform/\)/%29}: 0403-011 The specified substitution is not valid for this command.

Line 270 in AgentPull.sh is
platform=${platform/\)/%29}

It is a bash statement which tries to replace right parenthesis with code '%29', it works for bash but not Bourne shell (sh). The problem is that the script has directive to have Unix/Linux to run the script under bourne shell
$ head AgentPull.sh
#!/bin/sh
#
# $Header: emcore/install/unix/AgentPull.sh.template /main/30 2019/12/19 21:29:10 vbhaagav Exp $
#
# AgentPull.sh
#
# Copyright (c) 2011, 2019, Oracle and/or its affiliates. All rights reserved.
#
#    NAME
#      AgentPull.sh - <one-line expansion of the name>

The script works on most Linux servers because the /bin/sh is a symbolic link file linked to /bin/bash on these servers. Unfortunately, AIX is honest :), sh is sh and bash is bash, they are different.

As workaround, replace first line of AgentPull.sh '#!/bin/sh' with '#!/bin/bash'.

Tuesday, March 30, 2021

OEM 13.4 New Imported RuleSet False Evaluation of Lifecycle Status

If Rule Set is defined for targets based on Lifecycle Status, and the Rule Set is exported and imported into Enterprise Manager 13.4, the Rule Set will be always activated regardless of Lifecycle Status.

For example, two Rule Sets are created in source Enterprise Manager,

   Rule Set A for targets which Lifecycle Status is Production
   Rule Set B for targets which Lifecycle Status is Development

Both A and B are exported from source EM and imported into EM 13.4, and one target (no matter what value of Lifecycle Status) is brought down, both A and B will send out notification.

It can be fixed by manually editing the Rule Set. You do not need really to change anything of the Rule Set, just select the Rule Set and click Edit, then click Save.

OEM 13.2 Exported Ruleset Failed with 'Error in parsing properties for RuleSet from the XML' when Importing into EM 13.x

When importing ruleset in Enterprise Manager 13.x (confirmed in 13.2 and 13.4) from xml file exported from Enterprise Manager 13.2, got following error,

  Error in parsing properties for RuleSet from the XML

If the destination EM version is 13.2, following message will be found from trace file emoms.trc under directory <OMS_INSTANCE_BASE>/em/EMGC_OMS1/sysman/log
2021-03-29 22:49:12,394 [[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)'] ERROR exportimport.ImportHelper logp.251 - Error in parsing properties for RuleSet from the XML
oracle.sysman.emSDK.app.exception.EMSystemException
     at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesModelUtil.throwEMSystemException(RulesModelUtil.java:1486)
     at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesFwkEntityImpl.postChanges(RulesFwkEntityImpl.java:116)
     ...
    ... 100 more
Caused by: oracle.jbo.JboException: JBO-29000: Unexpected exception caught: oracle.jbo.DMLConstraintException, msg=JBO-26048: Constraint "EM_RULE_EXPRESSIONS_UK1" is violated during post operation "Insert" using SQL statement "INSERT INTO EM_RULE_EXPRESSIONS(SELECTION_OBJECT_ID,EXPRESSION_ID,EXPRESSION_GROUP_ID,ATTR_ID,OPERATOR_CODE,ATTR_VALUE,SEL_OBJ_TYPE,EXPRESSION_TYPE,UI_GROUP_ID,EXCLUSION_GROUP_ID) VALUES (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10)".
    at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesFwkEntityImpl.doDML(RulesFwkEntityImpl.java:201)
    at oracle.jbo.server.EntityImpl.postChanges(EntityImpl.java:7271)
    at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesFwkEntityImpl.postChanges(RulesFwkEntityImpl.java:97)
    ... 104 more
Caused by: oracle.jbo.DMLConstraintException: JBO-26048: Constraint "EM_RULE_EXPRESSIONS_UK1" is violated during post operation "Insert" using SQL statement "INSERT INTO EM_RULE_EXPRESSIONS(SELECTION_OBJECT_ID,EXPRESSION_ID,EXPRESSION_GROUP_ID,ATTR_ID,OPERATOR_CODE,ATTR_VALUE,SEL_OBJ_TYPE,EXPRESSION_TYPE,UI_GROUP_ID,EXCLUSION_GROUP_ID) VALUES (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10)".
    at oracle.jbo.server.OracleSQLBuilderImpl.doEntityDML(OracleSQLBuilderImpl.java:565)
    at oracle.jbo.server.EntityImpl.doDML(EntityImpl.java:9098)
    at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesFwkEntityImpl.doDML(RulesFwkEntityImpl.java:196)
    ... 106 more
Caused by: java.sql.SQLIntegrityConstraintViolationException: ORA-00001: unique constraint (SYSMAN.EM_RULE_EXPRESSIONS_UK1) violated

If the destination EM version is 13.4 (or 13.3), following message will be found from trace file emoms.trc
2021-03-29 22:04:29,284 [[ACTIVE] ExecuteThread: '70' for queue: 'weblogic.kernel.Default (self-tuning)'] ERROR exportimport.ImportHelper logp.251 - Error in parsing properties for RuleSet from the XML
oracle.sysman.emSDK.app.exception.EMSystemException
    at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesModelUtil.throwEMSystemException(RulesModelUtil.java:1509)
    at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesFwkEntityImpl.postChanges(RulesFwkEntityImpl.java:106)
    ...
    at weblogic.work.ExecuteThread.execute(ExecuteThread.java:420)
    at weblogic.work.ExecuteThread.run(ExecuteThread.java:360)
Caused by: oracle.jbo.RowAlreadyDeletedException: JBO-25019: Entity row with key oracle.jbo.Key[BEB7FD94E2112377E053DA570F89CC8F ] is not found in EmRuleSetsEO.
    at oracle.jbo.server.OracleSQLBuilderImpl.doEntitySelectForAltKey(OracleSQLBuilderImpl.java:811)
    at oracle.jbo.server.BaseSQLBuilderImpl.doEntitySelect(BaseSQLBuilderImpl.java:554)
    at oracle.jbo.server.EntityImpl.doSelect(EntityImpl.java:9133)
    at oracle.jbo.server.EntityImpl.lock(EntityImpl.java:6612)
    at oracle.jbo.server.EntityImpl.beforePost(EntityImpl.java:7150)
    at oracle.jbo.server.EntityImpl.postChanges(EntityImpl.java:7384)
    at oracle.sysman.core.event.rules.uimodel.rulesfwk.RulesFwkEntityImpl.postChanges(RulesFwkEntityImpl.java:97)
    ... 95 more

It is a bug of EM 13.2, it can be fixed by apply OMS one-off patch 25986453 on source EM 13.2. The patch is also needed for destination EM if the version is 13.2.

Monday, March 29, 2021

OEM 13c: Oracle Coherence Cache Targets Showing Down after WebLogic Server PSU applied

 After WebLogic Server (WLS) Patch Set Update (PSU) is applied on WLS home installed as part of Oracle Enterprise Manager (EM) 13c, two of Oracle Coherence Cache objects in WebLogic Domain '/EMGC_GCDomain/GCDomain' are shown Down in EM console.

The issues, at least, was seen after applying WLS PSU 12.2.1.3.201217 on EM 13.4.0.9 home, and WLS PSU 12.2.1.4.210330 on EM 13.5.0.0.0 home.

It can be fixed by removing and re-adding the targets as following,

1.  Remove the Oracle Coherence Cache Targets showing Down from OEM

    * In the OEM Console,  navigate to 'Targets' > 'Middleware'
    * On left panel expand 'Target Type', then expand 'Coherence', Select Oracle Coherence Cache. All Oracle Coherence Cache targets will be listed on the right window.
    * Right click on the name of the targets showing Down, click pop-up menu 'Target Setup' > 'Remove Target'
   
    The Down targets are deleted from OEM console.
   
2.  Refresh the domain '/EMGC_GCDomain/GCDomain' to add removed/missing Oracle Coherence Cache objects

    * In the OEM Console,  navigate to 'Targets' > 'Middleware'
    * On left panel, clear the selection you did while removing the targets. The Oracle WebLogic Domain '/EMGC_GCDomain/GCDomain' will be listed in right window
    * Click the domain name '/EMGC_GCDomain/GCDomain'
    * Navigate to the  'Refresh Weblogic Domain' in the Drop-Down menu on the home page of target '/EMGC_GCDomain/GCDomain'
    * New Screen will appear, click on "Add and Update Targets..."
    * Following the prompt to add new found targets
 
The removed targets will be added with new target names.

Wednesday, March 10, 2021

DBCA does not list ASM diskgroup for storage option

Oracle 19c DBCA does not list ASM disk groups in "Select Database Storage Option" window.
Messages in dbca trace file $ORACLE_BASE/cfgtoollogs/dbca/trace.log_<TIMESTAMP>
[DBStorageOption.flowWorker] [ 2021-03-08 13:24:45.969 EST ] [KfodUtil.kfodOutput:375]  /u01/app/19.0.0/grid/bin/kfod
[DBStorageOption.flowWorker] [ 2021-03-08 13:24:45.969 EST ] [KfodUtil.kfodOutput:375]  nohdr=TRUE
[DBStorageOption.flowWorker] [ 2021-03-08 13:24:45.969 EST ] [KfodUtil.kfodOutput:375]  OP=GROUPS
[DBStorageOption.flowWorker] [ 2021-03-08 13:24:45.969 EST ] [KfodUtil.kfodOutput:375]  status=true
[DBStorageOption.flowWorker] [ 2021-03-08 13:24:45.969 EST ] [KfodUtil.kfodOutput:375]  asmcompatibility=true dbcompatibility=true
[DBStorageOption.flowWorker] [ 2021-03-08 13:24:46.045 EST ] [KfodUtil.kfodOutput:386]  Kfod result
Error 49802 initializing ADR
ERROR!!! could not initialize the diag context

[DBStorageOption.flowWorker] [ 2021-03-08 13:24:46.046 EST ] [ASMUtils.loadDiskGroups:1221]  Loading  the diskgroups. exception using kfodError retrieving diskgroup using kfod utility, null
INFO: Mar 08, 2021 1:24:46 PM oracle.assistants.common.lib.FileSystemInfo getSharedStoragePaths
INFO: Getting default shared storage path.

DBCA failed with command "/u01/app/19.0.0/grid/bin/kfod", manually run the command
[oracle@host01]$ /u01/app/19.0.0/grid/bin/kfod nohdr=TRUE OP=GROUPS status=true asmcompatibility=true dbcompatibility=true
Error 49802 initializing ADR
ERROR!!! could not initialize the diag context

Debug the command with truss or strace as following
  On Linux:  strace -o /tmp/kfod.out /u01/app/19.9.0/grid/bin/kfod nohdr=TRUE OP=GROUPS status=true asmcompatibility=true dbcompatibility=true
  On Solaris/AIX: truss -o /tmp/kfod.out /u01/app/19.0.0/grid/bin/kfod nohdr=TRUE OP=GROUPS status=true asmcompatibility=true dbcompatibility=true
Example on Solaris
[oracle@host01]$ truss -o /tmp/kfod.out /u01/app/19.0.0/grid/bin/kfod nohdr=TRUE OP=GROUPS status=true asmcompatibility=true dbcompatibility=true
Error 49802 initializing ADR
ERROR!!! could not initialize the diag context
[oracle@host01]$
[oracle@host01]$ tail /tmp/kfod.out
stat("/u01/app/19.0.0/grid/log/diag/kfod/host01/kfod/log", 0xFFFFFFFF7FFFBF80) Err#2 ENOENT
stat("/u01/app/19.0.0/grid/log/diag/kfod/host01/kfod", 0xFFFFFFFF7FFFBF80) Err#2 ENOENT
stat("/u01/app/19.0.0/grid/log/diag/kfod/host01", 0xFFFFFFFF7FFFBF80) Err#2 ENOENT
stat("/u01/app/19.0.0/grid/log/diag/kfod", 0xFFFFFFFF7FFFBF80) Err#2 ENOENT
stat("/u01/app/19.0.0/grid/log/diag", 0xFFFFFFFF7FFFBF80) = 0
getuid()     = 104 [104]
getgid()     = 112 [112]
getuid()     = 104 [104]
getuid()     = 104 [104]
mmap(0x00010000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFFFFFFFF73900000
getuid()     = 104 [104]
getuid()     = 104 [104]
open("/system/volatile/name_service_door", O_RDONLY) = 6
fcntl(6, F_SETFD, 0x00000001)   = 0
door_info(6, 0xFFFFFFFF75D8B5B0)  = 0
door_call(6, 0xFFFFFFFF7FFFBBB8)  = 0
ioctl(1, TCGETA, 0xFFFFFFFF7FFF8F7C)  = 0
fstat(1, 0xFFFFFFFF7FFF8F10)   = 0
write(1, " E r r o r   4 9 8 0 2  ".., 29) = 29
write(1, " E R R O R ! ! !   c o u".., 47) = 47
close(5)     = 0
_exit(1)

From the trace file /tmp/kfod.out, kfod failed with accessing following directories
    /u01/app/19.0.0/grid/log/diag/kfod/host01/kfod/log
    /u01/app/19.0.0/grid/log/diag/kfod/host01/kfod
    /u01/app/19.0.0/grid/log/diag/kfod/host01
    /u01/app/19.0.0/grid/log/diag/kfod
    /u01/app/19.0.0/grid/log/diag

Check the permission of the directory
[oracle@host01]$ ls -ld /u01/app/19.0.0/grid/log/diag
drwxr-x---   4 grid     oinstall       5 Dec 12 23:17 /u01/app/19.0.0/grid/log/diag
[oracle@host01]$ id
uid=104(oracle) gid=112(oinstall)

oracle, who is running kfod, is member of group oinstall, but group oinstall does not have write permission on directory. Trying to grant write permission as root or grid (owner of GI home)
[grid@host01]$ chmod g+w /u01/app/19.0.0/grid/log/diag
[grid@host01]$ ls -ld /u01/app/19.0.0/grid/log/diag
drwxrwx---   4 grid     oinstall       5 Dec 13 02:45 /u01/app/19.0.0/grid/log/diag

Re-try kfod as oracle
[oracle@host01]$ /u01/app/19.0.0/grid/bin/kfod nohdr=TRUE OP=GROUPS status=true asmcompatibility=true dbcompatibility=true
    204528     188110 EXTERN REDO 10.1.0.0.0 12.1.0.0.0
    613584     388124 EXTERN FRA  10.1.0.0.0 12.1.0.0.0
   4091624     470712 EXTERN DATA 10.1.0.0.0 12.1.0.0.0

Now, dbca can list ASM disk groups.

Monday, March 1, 2021

19c runcluvfy.sh faile with PRVF-7596 PRVG-2002

When upgrading Oracle Infrastructure 12.2 to 19c, runcluvfy.sh failed with following messages
Verifying OCR Integrity ...FAILED
host02: PRVF-7596 : CSS is probably working with a non-clustered, local-only
          configuration on node "host02"
Verifying resolv.conf Integrity ...FAILED
host02: PRVG-2002 : Encountered error in copying file "/etc/resolv.conf" from
          node "host02" to node "host01"
          protocol error: filename does not match request
Verifying DNS/NIS name service ...FAILED
host02: PRVG-2002 : Encountered error in copying file "/etc/netsvc.conf" from
          node "host02" to node "host01"
          protocol error: filename does not match request

Set log location and re-run runcluvfy.sh
# runcluvfy.sh will write log files to directory /u01/tmp
export CV_TRACELOC=/u01/tmp
# re-run runcluvfy.sh
./runcluvfy.sh stage -pre crsinst -upgrade -rolling -src_crshome /u01/app/12.2.0/grid_1 -dest_crshome /u01/app/19.0.0/grid_1 -dest_version 19.0.0.0.0 -fixup -verbose
# Log files created
cd /u01/tmp
ls -l
total 15200
-rw-r--r--    1 grid     oinstall     952467 Feb 24 16:52 cvuhelper.log.0
-rw-r--r--    1 grid     oinstall          0 Feb 24 16:48 cvuhelper.log.0.lck
-rw-r--r--    1 grid     oinstall    6824223 Feb 24 16:52 cvutrace.log.0

The log file cvutrace.log.0 shows messsage
[Worker 0] [ 2021-02-24 16:47:49.784 EST ] [UnixSystem.remoteCopyFile:848]  UnixSystem: /usr/bin/scp -p host02:'/tmp/CVU_19.0.0.0.0_grid/scratch/getFileInfo3605304.out' /tmp/host02.getFileInfo3605304.out
[Worker 0] [ 2021-02-24 16:47:49.798 EST ] [RuntimeExec.runCommand:294]  runCommand: Waiting for the process
[Thread-83] [ 2021-02-24 16:47:49.798 EST ] [StreamReader.run:62]  In StreamReader.run
[Thread-82] [ 2021-02-24 16:47:49.798 EST ] [StreamReader.run:62]  In StreamReader.run
[Thread-83] [ 2021-02-24 16:47:50.201 EST ] [StreamReader.run:66]  ERROR>protocol error: filename does not match request
...
[main] [ 2021-02-24 16:51:46.747 EST ] [UnixSystem.remoteCopyFile:848]  UnixSystem: /usr/bin/scp -p host02:'/etc/resolv.conf' /tmp/CVU_19.0.0.0.0_grid/scratch/resolv.conf_host02
[main] [ 2021-02-24 16:51:46.760 EST ] [RuntimeExec.runCommand:294]  runCommand: Waiting for the process
[Thread-1060] [ 2021-02-24 16:51:46.760 EST ] [StreamReader.run:62]  In StreamReader.run
[Thread-1059] [ 2021-02-24 16:51:46.760 EST ] [StreamReader.run:62]  In StreamReader.run
[main] [ 2021-02-24 16:51:47.148 EST ] [RuntimeExec.runCommand:296]  runCommand: process returns 1
[Thread-1060] [ 2021-02-24 16:51:47.149 EST ] [StreamReader.run:66]  ERROR>protocol error: filename does not match request
...
[main] [ 2021-02-24 16:51:47.548 EST ] [UnixSystem.remoteCopyFile:848]  UnixSystem: /usr/bin/scp -p host02:'/etc/netsvc.conf' /tmp/CVU_19.0.0.0.0_grid/scratch/nssw_conf_host02
[Thread-1066] [ 2021-02-24 16:51:47.559 EST ] [StreamReader.run:62]  In StreamReader.run
[main] [ 2021-02-24 16:51:47.559 EST ] [RuntimeExec.runCommand:294]  runCommand: Waiting for the process
[Thread-1065] [ 2021-02-24 16:51:47.560 EST ] [StreamReader.run:62]  In StreamReader.run
[main] [ 2021-02-24 16:51:47.965 EST ] [RuntimeExec.runCommand:296]  runCommand: process returns 1
[Thread-1066] [ 2021-02-24 16:51:47.973 EST ] [StreamReader.run:66]  ERROR>protocol error: filename does not match request

All three errors occurred when scp is executed to copy file from remote node host02 to local node.
Interesting thing is that all three commands are using single quotation marks to enclose remote file name, I guess Oracle programmer left them there by accident because there is no reason to use them. Let's verify if it is the culprit by manually running scp
[grid@host01]$ scp -p host02:"'/etc/resolv.conf'" /tmp/CVU_19.0.0.0.0_grid/scratch/resolv.conf_host02
protocol error: filename does not match request
[grid@host01]$
[grid@host01]$ scp -T -p host02:"'/etc/resolv.conf'" /tmp/CVU_19.0.0.0.0_grid/scratch/resolv.conf_host02
resolv.conf                                             96    51.3KB/s   00:00
[grid@host01]$

The scp fails with same error "protocol error: filename does not match request", but succeeds if extra option -T is used.

The -T option was introduced by OpenSSH 8.0 released in April 2019. In earlier version of OpenSSH, when copying files from a remote system to a local directory, scp did not verify that the file names that the server sent matched those requested by the client. This could allow a hostile server to create or clobber unexpected local files with attacker-controlled content. OpenSSH 8.0 fixed this security issue and scp, by default, verifies the file name on client side, and also introduced -T option to provide capacity to disable the verification.

Although OpenSSH officially claims that the fix is introduced in 8.0, runcluvfy.sh shipped with Grid Infrastructure 19.3 also fails with same reason on AIX with OpenSSH 7.5p1 and it is where the errors used in this article happened.

This version of runcluvfy.sh sends remote file name with single quotation marks, but remote server returns file name without quotation. Technically, they are same thing, but they are visually different. Therefore, old version scp worked because it did not verify them, but current scp fails it with "filename does not match".

It can be resolved by downloading new standalone version Cluster Verification Utility from My Oracle Support(Patch 30839369). Note: the executable is cluvfy instead of runcluvfy.sh. If download is not the option. As a temporary workaround, we can rename scp and create a new scp
# Rename the original scp
mv /usr/bin/scp /usr/bin/scp.bak
# Create a new file scp
echo "/usr/bin/scp.orig -T $*" > /usr/bin/scp
# Make the file executable
chmod a+rx /usr/bin/scp

Now, the errors are gone. After successfully installing GI, remember to restore original scp
# Delete interim scp
rm /usr/bin/scp
# Restore the original scp.
mv /usr/bin/scp.bak /usr/bin/scp