Thursday, December 30, 2021

RMAN left defunct server processes and NetBackup nborautil processes

RMAN jobs back up the database to tape through NetBackup successfully, but leave server processes (RMAN channels) and NetBackup nborautil processes running afterwards. The leftover processes look like defunct processes.

The issue was observed in an Oracle Database 19c RMAN backup job with NetBackup 8.1.2 on AIX 7.2; it may also happen on other platforms with NetBackup 8.1.2 or 8.2. When it happens, nborautil processes can be seen running as follows:
$ ps -ef | grep nborautil
  oracle 17236328  6489154  31 12:14:13      - 129:18 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil
  oracle 32506230  5440300  26 12:47:03      - 113:20 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil

$ ps -ef | grep 6489154
  oracle 17236328  6489154  30 12:14:13      - 129:44 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil
  oracle  6489154        1   0   Dec 27      - 11:47 oracleORCL (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$
$ ps -ef | grep 5440300
  oracle 32506230  5440300  24 12:47:03      - 113:44 -bprdtype 2 -use_stdin -client host01 -bprd -noxmloutput -ignorenamespace -jsonoutput 26 -eoichar /usr/openv/netbackup/bin/nborautil
  oracle  5440300        1   0   Dec 28      - 10:19 oracleORCL (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
Two NetBackup nborautil processes seem to be defunct. Their parent process IDs are 6489154 and 5440300 respectively, and both parents are server processes of Oracle Database instance ORCL. Check the session status of the server processes:
sys@ORCL> select s.sid,s.serial#,s.username,s.machine,s.program,s.event
  2  from v$session s, v$process p
  3  where s.paddr=p.addr and p.spid in (5440300,6489154);

  SID    SERIAL# USERNAME  MACHINE  PROGRAM                   EVENT
----- ---------- --------- -------- ------------------------- ------------------
  300      44503 SYS       host01   rman@host01 (TNS V1-V3)   Backup: MML shutdown
  178      17334 SYS       host01   rman@host01 (TNS V1-V3)   Backup: MML shutdown
The server processes are waiting on the event "Backup: MML shutdown", which means they are waiting for the NetBackup nborautil process to complete. As far as I could see, nborautil never exits, though Veritas Support claims that it only takes longer than expected and does not hang. Every RMAN job leaves new nborautil processes running, and eventually the leftover processes use up all CPU resources and hang the system.
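
The manual steps above (find the nborautil processes, take their parent PIDs, look those up in v$process) can be sketched as a small shell helper. This is only an illustration, not something from Veritas or Oracle: the function name nborautil_parents is made up, and it assumes the `ps -ef` column layout shown earlier, where the third column is the parent PID.

```shell
# Hypothetical helper (name is an assumption): read `ps -ef` output on stdin
# and print the parent PIDs of all nborautil processes as a comma-separated
# list, ready to paste into the IN (...) clause of the v$process query.
nborautil_parents() {
  # The [n] bracket keeps this awk command's own ps entry from matching.
  awk '/[n]borautil/ { print $3 }' | sort -u | paste -sd, -
}
```

Typical use would be `ps -ef | nborautil_parents`, then pasting the result into `p.spid in (...)`.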

Veritas Support claims that this is a bug in NetBackup versions 8.1.2 and 8.2. The issue occurs on all Oracle backups that use an RMAN script if the database comprises many datafiles.

The Oracle backups have an unusually long delay between when the last child job completes and when the parent job completes. The delay may extend to several hours, even days.

No performance problems or slow behaviour are observed before the data transfer completes. The delay occurs only during the metadata cataloging after the data transfer jobs have completed.

Smaller databases and those with fewer datafiles may not experience this delay.

So far, Veritas has not released a fix for it. If possible, upgrade NetBackup to a higher version (e.g. 9). As a workaround, reducing the number of datafiles backed up in a single job may help.
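
One possible reading of that workaround is to split the datafiles into smaller batches and back up each batch separately. The sketch below is a guess at how that might look, not a tested fix: the function name rman_batches and the default batch size of 50 are assumptions, and the source does not confirm that this is enough to avoid the delay. It expects datafile numbers (for example from `select file# from v$datafile`) one per line on stdin.

```shell
# Hypothetical helper (name and default batch size are assumptions): read
# datafile numbers, one per line, on stdin and print one RMAN
# "backup datafile ..." command per batch of $1 files (default 50).
rman_batches() {
  awk -v n="${1:-50}" '
    # Append each non-empty line to the current batch.
    NF { list = (list == "" ? $1 : list "," $1); count++ }
    # Batch is full: emit the command and start a new batch.
    count == n { print "backup datafile " list ";"; list = ""; count = 0 }
    # Emit whatever is left over as the final, smaller batch.
    END { if (list != "") print "backup datafile " list ";" }
  '
}
```

Each printed command could then be run as its own RMAN job, or placed in separate run blocks, depending on how the backup policy is organised.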
