Twitter a better Oracle patch reporting tool

Tired of struggling with Oracle patches, I developped three years ago and I based it on a new feature at that time which was implemented after the opatch -all_nodes option was deprecated; this new feature was a new report -remote option to opatchauto. I must confess that I like new things and this is why I decided to go this new feature instead of basing on that good old buddy opatch lspatches.

Three years later, I can clearly say that it was a bad decision; indeed many people (and I) had issues mainly because:
  • Oracle clearly dropped the 11g support with this -- but people still have tons of 11g (which may not be a bad idea :))
  • This opatchauto -remote option had many issues just not working properly
  • This option has even been removed and replaced by a -host
Even if this is kind of usual with oracle and the new features, at the end of the day, was most likely not to work due to the points discussed above.

So I restarted from scratch and developped a new script based on opatch lspatches which is working forever and also has the advantage of showing the description of the patches which is always good to have. The script is now more reliable, works perfectly with 11g, less code, more simple... in short: better.

How that works ? checks the homes from /etc/oratab, runs opatch lspatches on all the nodes of a cluster and makes a nice table, one column by node, showing a green x if the patch is installed on that specific node and a red Missing if a patch is missing as you can see below:
You can then very easily find a missing or extra patch installed on one or many nodes. You can also note that the return code of the script is the number of missing patches so if used by an automatic script, you can easily know if all is good (return code 0) or not OK (return code > 0) and even the number of missing patches.

Below a successful output:

Is it Exadata specific ?

No, it works on any RAC, even on standalone systems.

Is it working for GI and databases Homes ?

Yes, it does work for GI and databases Homes.

Which user to execute the script with ?

opatch lspatches must be run by oracle (not root) so if you run it as root, the script will execute the opatch lspatches commands as oracle (or anothger user if the --oracleuser option is used to specify another "oracle" user) so you can use root or oracle, the script will manage automatically.

Some options

Feel free to check the help of the script ( -h or --help) for the possible options; it is pretty simple and self-exaplanatory:
# ./ --help

NAME - Show nice tables of the installed and/or missing patches for some GI/DB Oracle Homes

        ./ [-g] [-c] [-G] [-v] [-s] [-u] [-h]

DESCRIPTION Based on oratab, show nice tables of the installed and/or missing patches for some GI/DB Oracle Homes
                       You can then quickly find a missing patch across a RAC system will by default check all the nodes of a cluster (based on olsnodes) which requires ASM to be running
                       and oraenv to be working with the ASM aslias defined in oratab; If you have no ASM alias in oratab,
                       you may suffer from
                       You can specify a comma separated list of host or a file containing one host per line by default checks all the homes defined in oratab, you can use --grep/--home and --ungrep/--ignore to limit your home selection (see examples below) relies on opatch lspatches which must run as oracle user (and not root); if the script is started as root,
                       the opatch lspatches commands will be run after su - oracle (see -u | --oracleuser for more on this)

        -g | --groupfile            ) A group file containing a list of hosts
        -c | --commalist  | --hosts ) A comma separated list of hosts
        -G | --grep | --oh | --home ) Pattern to grep from /etc/oratab
        -v | --ungrep | --ignore    ) Pattern to grep -v (ignore) from /etc/oratab
        -s | --showhomes | --show   ) Just show the homes from oratab resolving the grep/ungrep combinations
        -u | --oracleuser           ) User to use to run opatch lspatches if the script is started as root, default is oracle
        -h | --help                 ) Shows this help

       ./                                                       # Analyze and show all the homes of nodes of a cluster
       ./ --show                                                # Show the homes from oratab (only show, dont do anything else)
       ./ --grep grid                                           # Analyze the grid home
       ./ --grep db --ungrep 12                                 # Only the DB homes but not the 12 ones
       ./ --grep db --ungrep 12 --groupfile ~/dbs_group         # Same as above on the hosts contained in the ~/dbs_group file
       ./ --home db --ignore 12 --hosts exa01,exa06             # Same as above but only on hosts exa02 and exa06
       ./ --home db --ignore 12 --hosts exa01,exa06 -u oracle2  # Same as above but started as root; will then su - oracle2 automatically

We now finally have a robust script to show the installed patches across many nodes, enjoy ! You can freely download the script here.

So you have removed a trace file and your FS is still filling (very fast)

So you have a FS filling very fast on a server, someone connects, finds some big trace files growing very fast and delete them which frees some space on the FS and few minutes later, the FS is back to a dangerous threshold but you are unable to see the trace files any more (hey no, they have been deleted !) and a few moment later, your FS is full which is not what we want. Oftently, people panic and finish by rebooting the application(s) and/or server(s) supposed to fill these trace files -- which is also not what anyone wants but this is often seen as the best (less worse ?) way of "saving the system".
cat /dev/null > tace_file would obviously have been better here but what is done is done and you have to deal with the situation, you cannot unrm the file. Having said that, in a specific case I faced, it was due to a 19c oracle bug (what ? an Oracle bug ? yeah it happens from time to time . . . :D) where the oracle processes are tracing too much (you bet !) and only a very aggressive (every minute) cronjob to empty these trace files may fix the issue (if this can be considered as a fix) so the below solution still applies the time you can organize applying a patch (which would obviously more or less need a downtime . . .).

Back to the situations df can show that the FS is full:
[root@linux01]# df -h /u01
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbOra1  237G  227G     0 100% /u01
but not du (you can do the maths, according to du, the FS is far from being full):
[root@linux01]# du -sh *
144G    app
4.0K    crashfiles
16K     lost+found
612M    oracle.ExaWatcher
547M    tmp
Before jumping into the solution, we need to understand how writing into a file works on Un*x systems and what file descriptors are.

Let's start with this simple ls example which shows a file and when redirected to /dev/null shows nothing -- obviously, the output is redirected to /dev/null.
$ ls
$ ls > /dev/null
Now let's do the same as above but with a syntax error:
$ ls -z > /dev/null
ls: invalid option -- 'z'
Try 'ls --help' for more information.
$ ls -z > /dev/null 2>&1
Here, we can see that the error is not redirected to /dev/null when doing > /dev/null but only when doing > /dev/null 2>&1 !? This is because > /dev/null redirects the "1" file descriptor (stdout -- your screen) to /dev/null but not the "2" file descriptor which is stderr and also your screen by default.

If everyone knows more or less these special 1 and 2 file descriptors, you can create and use any other file descriptors and read/write from it; lets create a "3" file descriptor and write in it:
$ exec 3>fd3.log
$ echo "to fd3" >&3
$ cat fd3.log
to fd3
To kind of simulate a process (an Oracle process or any other) writing into a file through its file descriptor, let's run this while true command inserting the current date every 5 seconds:
$ while true; do echo $(date)" to fd 3 again" >&3; sleep 5; done
The trace file is well updated every 5 seconds:
$ tail -f fd3.log
to fd3
Thu Mar 18 14:28:03 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:08 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:13 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:18 AEDT 2021 to fd 3 again
Now let's remove that file:
$ ls -ltr | grep fd
-rw-r--r-- 1 fred fred 535 Mar 18 14:28 fd3.log
$ date; rm fd3.log
Thu Mar 18 14:28:28 AEDT 2021   <== the date I have removed the file
$ ls -ltr fd3.log
ls: cannot access 'fd3.log': No such file or directory    <=== the file does not exist anymore
And if I go back to my session tailing the logfile:
$ tail -f fd3.log
Thu Mar 18 14:28:03 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:08 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:13 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:18 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:23 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:18 AEDT 2021 to fd 3 again  
Thu Mar 18 14:28:23 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:28 AEDT 2021 to fd 3 again  <=== file has been deleted on Thu Mar 18 14:28:28 AEDT 2021
Thu Mar 18 14:28:33 AEDT 2021 to fd 3 again  <=== still things are inserted in it !
Thu Mar 18 14:28:38 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:43 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:48 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:53 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:58 AEDT 2021 to fd 3 again
Thu Mar 18 14:29:03 AEDT 2021 to fd 3 again
Thu Mar 18 14:29:08 AEDT 2021 to fd 3 again
I can see that my file is still being filled by the while true I started !? this is because the file descriptor is still opened and then the process is still writing in it -- but you cannot see the file as his physical enveloppe (wow this is becoming mystic ! :D) does not exist any more -- this is the situation I was describing earlier.

Indeed, the full process of writing into a file is:
1/ open the file (open a file descriptor)
2/ write in it (write into a file descriptor)
3/ close the file (close the file descriptor)
Now that we know this and as my example as shown, it looks obvious that we have a (far) better (and easier and safer and better . . . name it . . .) solution than restarting the application or rebooting the database or server, right ?

Let's move further into the investigation (as you also most likely don't know which trace files have been removed) and see the correct solution later on.

First of all, we need to find the biggest deleted files thanks to lsof +aL1 (see man lsof):
[root@linux01 u01]# lsof +aL1 /u01 sort -nk 7 | tail -10
oracle    263797 oracle  271w   REG  252,2    4915555     0 1059311 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_263797.trm (deleted)
diskmon    25690 oracle  363w   REG  252,2    5736792     0  942591 /u01/app/oracle/diag/crs/linux01/crs/trace/diskmon_68.trm (deleted)
oracle    316046 oracle   68w   REG  252,2    8536422     0 1488777 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_316046.trc (deleted)
oracle    116391 oracle   90w   REG  252,2    9324380     0 4873097 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_116391.trc (deleted)
oracle    116379 oracle   89w   REG  252,2    9621504     0 4876403 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_116379.trc (deleted)
oracle    116340 oracle   86w   REG  252,2    9664646     0 4873104 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_116340.trc (deleted)
diskmon    25690 oracle  362w   REG  252,2   27203987     0  942489 /u01/app/oracle/diag/crs/linux01/crs/trace/diskmon_68.trc (deleted)
oracle    263797 oracle  270w   REG  252,2   36228291     0 1059304 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_263797.trc (deleted)
ora_p02l_ 371508 oracle   49w   REG  252,2  209219823     0 5767946 /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trm (deleted)
ora_p02l_ 371508 oracle   48w   REG  252,2 3959405542     0 5767945 /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trc (deleted)
[root@linux01 u01]#
We can find from /proc the process ID and the file descriptors of these files:
[root@linux01 u01]# ls -ltr /proc/371508/fd | grep SID02
l-wx------ 1 oracle dba 64 Nov  4 21:32 49 -> /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trm (deleted)
l-wx------ 1 oracle dba 64 Nov  4 21:32 48 -> /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trc (deleted)
[root@linux01 u01]#
We then just have to apply the 3rd step on how to write into a file: close these file descriptors ! then the process would no mode write into it -- easy !

If it is easy to do in a program you write, we have here to close a file descriptor into another running PID (which is still doing tons of stuff we do not want to stop). For this purpoise, we will be using gdb which is the GNU debugger (default on any Linux):
[root@linux01 u01]# gdb -p 371508
. . .
(gdb) call close(48)
$1 = 0
(gdb) call close(49)
$2 = 0
(gdb) quit
A debugging session is active.

        Inferior 1 [process 371508] will be detached.

Quit anyway? (y or n) y
Detaching from program: /u01/app/oracle/product/, process 371508
[Inferior 1 (process 371508) detached]
[root@linux01 u01]#
We can verify that these file descriptors have been well closed
[root@linux01 u01]# ls -ltr /proc/371508/fd | grep SID02
[root@linux01 u01]#

You have then properly closed these file descriptors, no more space issue on your system, problem fixed !

IB Switches: May the --force be with you !

Patching Infiniband Switches is usually really hassle free but you may one day face a (very) reluctant to be patched IB Switch. Note that this blog is part of a more general Exadata patching troubleshooting blog.
This journey started with some failed IB Switches pre-requisites:
FAILED : DONE: Initiate pre-upgrade validation check on InfiniBand switch(es).
ERROR : FAILED run of command:/patches/ -ibswitches /root/ib_group -upgrade -ibswitch_precheck
INFO : upgrade attempted on nodes in file /root/ib_group: [exa-ib1 exa-ib2 exa-ib3]
Looking at patchmgr.trc, I could find:
[patchmgr_send_notification_to_all_nodes][702]  Arguments: Failed 808 ibswitch
And in upgradeIBSwitch.trc:
[TRACE][/patches/ - 1740][copyToIBSwitch][1740]   Arguments: exa-ib1 xcp /usr/local/bin/xcp
[WARNING][/patches/ - 1749][copyToIBSwitch][]  [CMD: scp xcp root@\[exa-ib1\]:/usr/local/bin/xcp] [CMD_STATUS: 1]
    ----- START STDERR -----
    xcp: No such file or directory
    ----- END STDERR -----
[TRACE][/patches/ - 1740][copyToIBSwitch][1740]   Arguments: exa-ib1 /usr/local/lib/
[WARNING][/patches/ - 1749][copyToIBSwitch][]  [CMD: scp root@\[exa-ib1\]:/usr/local/lib/] [CMD_STATUS: 1]
    ----- START STDERR ----- No such file or directory
This was looking like if patchmgr was unable to copy xcp and to the switches so it could be a SSH passwordless connectivity issue (patchmgr tries to connect back to the database node used to patch it which may be impossible depending on what is in the switch /etc/hosts or your SSH security config defined in /etc/ssh/sshd_config -- you can find notes like this one Exadata: Patchmgr fails during the InfiniBand patching precheck. (Doc ID 2356026.1) on MOS about this. But well I know this and I was able SSH to my Switch which also could SSH back properly on any network interfaces to the DB node I was using to patch; I could also manually scp the famous xcp and All was supposed to be OK.

Checking further, I found that this reluctant switch was with a very old version:
[root@exadb01 ~]# ./ -I ~/ib_group
       Cluster is a X5-2 Quarter Rack HC 8TB
         -- Infiniband Switches
       exa-ib1        exa-ib2        exa-ib3
       2.1.8-1       2.2.15-1        2.2.15-1
[root@exadb01 ~]#
Indeed, if you look into Note 888828.1, you will find that Switch firmware 2.1.8-1 - Supplied with Exadata; being released in April 2016 (keep in mind that patchmgr has been released in version which was shipping IB Switch version 2.2.4-3 then ** after ** this 2.1.8-1 -- more on that later); It was then indeed an old version and it also meant that I was not really the first one facing this issue which then had not been resolved before :)

After investigating all of this with Oracle support which basically wanted to be sure that the SSH config was working, I have been pointed to the manual way of patching an IB Switch (which was the way of patching a Switch before patchmgr) described here : It is pretty straightforward: you load the package, the switch installs and reboot and that's it; so I gave it a go (you cannot directly upgrade to 2.2.16-1 which was my target version but you first have to upgrade to 2.2.7-2):
-> load -source fhttp://
Downloading firmware image. This will take a few minutes.
Error: Couldn't connect to server
-> load -source
Downloading firmware image. This will take a few minutes.
Error: Couldn't connect to server
which failed miserably; I then realized that this could not work, there is no FTP nor HTTP running on my database server where I want to load that package to the switch. The (very good) MOS engineer told me there was no other way: FTP or HTTP -- wow, FTP ? really ? you mean that old buddy running on port 21 ? no way I can have this to run on my DB node; FTP is a bit like Nokia 3210 -- you remember it was great but no way you can use it nowadays :D

As I could obviously not install a FTP or a HTTP server anywhere close to that switch, I tried to scp the package to the switch itself and load it from there -- locally:
-> load -source
Error: Insufficient disk space/memory. Firmware update requires minimum of
120 MB space in /tmp directory
80 MB space in / filesystem
120 MB of free memory 
It also failed miserably as obviously there is not enough space on the Switch to save a 180M package -- this started to be tough:
  • patchmgr fails at pre-requisites
  • No way to have a FTP server to load the package to the switch
  • No way to have a HTTP server to load the package to the switch
  • Not enough space to copy the package to the switch to load it locally
  • Not sure how patchmgr manages it but he can somehow load the package to the switch as this is what he usually does (I have to check patchmgr code to see what is that magic trick)

But still well, I had to patch this switch even if it was actually looking a bit like that to me:

So we (MOS engineer and I) thought that as patchmgr was released after this switch version, it may just not be aware of this switch version and then the pre-requisites could fail just because he didnt't know that version -- so I tried the upgrade (of that switch only) ignoring the pre-requisites with the -- force option !:
[root@exadb01 patch_switch_20.]# ./patchmgr -ibswitches ~/ib1 -upgrade --force yes
. . .
[INFO     ] Package will be downloaded at firmware update time via scp  <== a clue about how patchmgr does it -- but where does it find the disk space ? this is another story :)
[SUCCESS  ] Execute plugin check for Patching on exa-ib1
[INFO     ] Starting upgrade on exa-ib1 to 2.2.7_2. Please give upto 15 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
[INFO     ] Additional firmware load required. Starting secondary firmware load. DO NOT INTERRUPT or HIT CTRL+C
[INFO     ] Rebooting exa-ib1 to complete the firmware update. Wait for 15 minutes before continuing. DO NOT MANUALLY REBOOT THE INFINIBAND SWITCH
. . . looking good so far . . . 
[INFO     ] Validating the current firmware on the InfiniBand Switch
[SUCCESS  ] Firmware verification on InfiniBand switch exa-ib1
[INFO     ] Finished post-update validation on exa-ib1
[FAIL     ] Post-update validation on exa-ib1
[ERROR    ] Failed to upgrade exa-ib1 to 2.2.7-2. Cannot proceed with upgrading switch to 2.2.16_1
[FAIL     ] Update switch exa-ib1 to 2.2.16_1
[INFO     ] Aborting the process. Not going to try anymore switches. Retry after resolving the problems.
[FAIL     ] Overall status
OK so here it seems that the upgrade to 2.2.7-2 was OK but the post steps were KO -- may be also because of the fact that the original version of the Switch was too old; I could verify the version which was now good:
[root@exa-ib1 ~]# version
SUN DCS 36p version: 2.2.7-2 <================ looks good
Build time: Nov 2 2017 09:21:37
. . .
[root@exa-ib1 ~]#
OKay, now I could run the upgrade to 2.2.16-1 pre-requisites on Switch -- which were OK:
----- InfiniBand switch update process ended 2021-02-12 12:19:33 +1100 -----
2021-02-12 12:19:33 +1100 1 of 1 :SUCCESS: Initiate pre-upgrade validation check on InfiniBand switch(es).
2021-02-12 12:19:33 +1100 :SUCCESS: Completed run of command: /patches/ -ibswitches /root/ib1 -upgrade -ibswitch_precheck
2021-02-12 12:19:33 +1100 :INFO : upgrade attempted on nodes in file /root/ib1: [exa-ib1] 
And then I could upgrade all the 3 switches in a row, patchmgr now taking care of the version difference to upgrade all these switches to the target version: 2.2.16-1 which worked like a charm !

I could then finally patch this reluctant to be patched Infiniband Switch thanks to the -- force option because he was older than patchmgr -- May the -- force be with you !

I finally found my top Orace 12c-19c database feature !

Every Oracle database version comes with tons of new features (or renamed features to kind of re release a feature which was not really well implemented in a previous version :)) which are advertised a lot but when you think back about these features, which ones were really the top new feature of each version ?

I can easily name what I consider (this is indeed very subjective) to be the best feature of each version up to 11g but honnestly, for the 12c family (12, 18 and 19), I had nothing in mind (also may be because I have been doing less database work for years now). Let's start by my top feature for each version:
  • Oracle 7: CBO -- I haven't worked that much with Oracle 7 though but well, CBO is a huge feature

  • Oracle 8: RMAN -- We finally had something more integrated and more efficient than BEGIN/END BACKUP which was kind of a hassle. I also remind funny stories when interviewing to recruit people: "Me: do you know RMAN ?", "Candidate: Hermann Maier ? yes he is very good at ski racing", "Me: OK, we'll call you" :D

  • Oracle 8i: if 8i is a different version than 8 (some were saying that at that time), I would say Java in the database ! haha joking obviously; not sure this was the best idea ever as we still cannot really patch it online 20 years later and it is still full of vulnerabilities we have to fix every single month :) I would then say Partitionning -- 8 had partitionning but it was very basic, it started to be really usable in 8i; if 8i is not a different version than 8 then I go with RMAN for 8/8i

  • Oracle 9i: dbms_metadata.get_ddl -- how awesome it was when get_ddl was released ! what a pain it was to generate a DDL in the previous versions ! query tons of different data dict views, doing export and clean the DDL from the export file, using some graphical tools, ... there was no easy and 100% reliable way, get_ddl was clearly a relief, long live get_ddl ! :)

  • Oracle 10g: AWR -- no need to argue here, there was clearly a before and an after in term of performance investigations and tuning thanks to AWR

  • Oracle 11g: Snapshot Standby -- we could already do this in 10g with restore points, I remember I had scripted this in the past but Snapshot Standbys made it easy (and official !) to open read write a copy of a production to developper and resync it at night, excellent feature; I could also have said Exadata ! but at the time Exadata was released, it was still kind of confidential, it took time for people to use / trust Exadata so for me, Snapshot Standby stays the top 11g feature

Then 12c came and I couldn't find anything really appealling to me. I was thinking about the online datafiles move which is indeed a very cool feature but Oracle has mainly addressed these kind of datafiles placement issues in previous versions with OMF, db_file_create_dest, etc ... so even if this is a very cool feature, it would have had a bigger impact if implemeted in 8 than in 12. One would say CDB/PDB; it is indeed the major architecture change but I personnaly don't really like it and making it default makes everything more complicated for 99% for may be 1% who would really need this feature so for me, it is not really the best 12c family feature. Also, it is not totally integrated with CRS which is why I cannot show the PDBs in It should be coming with GI 21c (as GI 20c has not been released) -- we'll see.

And suddenly, with 19c, came my top 12c family feature:
  • Services can now automatically failback to the preferred node using -failback yes !!
Oh boy, I have always been waiting for this feature ! For everyone who have done patching / upgrades (everyone ?), you know how messy it is with the services: they restart on the available nodes when you patch the preferred node, application teams want to run on a specific node because they application is still not RAC-friendly or whatever reason, etc... You can easily know what services have moved during a maintenance using rac-status but when it comes to rebalance everything to the preferred node(s), this is another story. Indeed, CRS does not show the preferred / available nodes information with crsctl and you then have to svrctl config service to get this information knowing that you need to use the srvctl command from the database home version (and not CRS !) for this as a wrong srvctl version used for this would fail, .. in short, it is a big mess for something which should be easy in my opinion.

But this was before -failback yes feature ! you just have to set it up on a service like this:
srvctl config service -d DB_NAME -s SERVICE_NAME -failback yes
Then at next restart of the server, GI or instance, the service will automatically been rebalanced to the preferred node -- gracefully, without disconnecting anyone -- very cool, right ?!

To help setting that up, I have then wrote 2 scripts to be able to easily take advantage of this feature:
Enjoy ! a better Oracle patch reporting tool

Tired of struggling with Oracle patches, I developped three years ago and I based it on a new feature at that time which was ...