Exadata: make a cell or a DB node blink !

Hardware issues happen, and when they happen on Exadata systems, you need an Oracle Field Engineer to come into your datacenter and replace the faulty part. What also happens is that you have many Exadatas in your datacenter, your CMDB is not really up to date, and as a result the Field Engineer has trouble locating the Exadata that holds the faulty component.
Fortunately, the ILOMs have a very cool feature: you can make a cell or a DB node blink, which makes it easy to locate!

First of all, connect to the ILOM and check the /SYS/LOCATE property:
-> show /SYS/LOCATE
/SYS/LOCATE
    Targets:
    Properties:
        type = Indicator
        ipmi_name = LOCATE
        value = Off            <=== OFF
    Commands:
        cd
        set
        show
->
Let's make it blink:
-> set /SYS/LOCATE value=fast_blink
Set 'value' to 'fast_blink' [Fast Blink]
-> show /SYS/LOCATE
/SYS/LOCATE
    Targets:
    Properties:
        type = Indicator
        ipmi_name = LOCATE
        value = Fast Blink     <=== It blinks !
    Commands:
        cd
        set
        show
->
Once the Field Engineer has located your blinking component, it is time to stop the blinking -- I know it is very fun, so leave it on a couple more minutes and then stop it:
-> set /SYS/LOCATE value=off
Set 'value' to 'off' [Off]
-> show /SYS/LOCATE
/SYS/LOCATE
    Targets:
    Properties:
        type = Indicator
        ipmi_name = LOCATE
        value = Off            <=== OFF
    Commands:
        cd
        set
        show
->
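If you prefer to stay on the host OS rather than logging into the ILOM, the locator LED can also be driven with ipmitool. This is just a minimal sketch, assuming ipmitool is available on the node; note that the standard IPMI identify command blinks the LED at its own pace, which may differ from the ILOM fast_blink:
# Turn the identify/locator LED on (default duration is around 15 seconds)
ipmitool chassis identify
# Keep it on until it is explicitly switched off
ipmitool chassis identify force
# Switch it off
ipmitool chassis identify 0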

This is very useful; I use it a lot!

Exadata: Hack patchmgr

Do not try this at home, this blog is for educational (and fun) purposes only.

Let's take as an example that weird patchmgr behavior when patching Exadata:
  • 1/ patchmgr uses the /etc/hosts IP of a host to start the patching
  • 2/ patchmgr uses the DNS ip of a host when waiting for a host to reboot
So if it happens that the DNS IP is blocked by a firewall, for example (this would most likely be due to a wrong configuration, but let's assume this is the situation we are in), then patchmgr will wait forever for the host to come back after its reboot -- and you will be in trouble.

To work around this, you could comment the DNS server(s) out of the /etc/resolv.conf file on the host you start patchmgr from (with no DNS server, patchmgr could not use the DNS IP to ping the host it is waiting for after reboot and would use the /etc/hosts IP instead). But by doing that, the whole host would be unable to use DNS during the patching session, and this is not what you want; indeed, if you use an external server to start patchmgr, you will most likely create an incident on that system.

Another way is to ... "adapt" patchmgr so that it does not use the DNS IP when pinging the patched host to wait for it to come back online after reboot. And this is fairly easy, as patchmgr is made of shell scripts -- great news, right?
Looking into the patchmgr code, you will find that it uses the host command to resolve the target server IP:
host_name_or_ip=$(host -t A $target | awk '/has address/ {print $NF; exit}')
And if you look at man host, you'll see that host resolves using DNS:
host - DNS lookup utility
So this is where the described issue comes from; we would not have this issue if patchmgr resolved the host from the /etc/hosts file, using for example a simple ping:
host_name_or_ip=$(ping -qc1 $target | head -1 | awk -F "[()]" '{print $2}')
Note that resolving from /etc/hosts before DNS is the default, as you can see in /etc/nsswitch.conf:
# grep hosts /etc/nsswitch.conf
hosts:          files dns    <=== we resolve using files before DNS
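To see the difference in practice before touching patchmgr, you can compare what each resolution path returns; a quick check, with dbnode1 being a hypothetical target host name here:
# DNS-only answer -- this is what the original host-based patchmgr line relies on
host -t A dbnode1
# nsswitch-ordered answer (files first, then dns) -- this is what ping and getent use
getent hosts dbnode1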
All that said, we can update patchmgr to resolve using ping instead of host:
#host_name_or_ip=$(host -t A $target | awk '/has address/ {print $NF; exit}')
host_name_or_ip=$(ping -qc1 $target | head -1 | awk -F "[()]" '{print $2}')
And let's run a test (just a precheck) to see how it goes:
# ./patchmgr -dbnodes ~/dbs_group -precheck -nomodify_at_prereq -target_version 19.3.12.0.0.200905 -iso_repo ../p31720221_193000_Linux-x86-64.zip -allow_active_network_mounts
2020-11-05 15:22:45 +1100        :ERROR  : Incorrect md5sum of /patches/dbserver_patch_20.200911/patchmgr
#
Hey, patchmgr has detected that we have modified it -- clever! Indeed, patchmgr checks the md5 of all its files before starting a patching session; these md5 values are saved in the md5sum_files.lst file, so we just have to get the md5 of our "patched" patchmgr and update md5sum_files.lst with it:
# md5sum patchmgr
8e75bf3c1cae3e2d75229c85e84f8e0e  patchmgr
# cp md5sum_files.lst md5sum_files.lst.orig
# vi md5sum_files.lst
# grep patchmgr md5sum_files.lst
8e75bf3c1cae3e2d75229c85e84f8e0e  patchmgr
40acc292fec697492dd40a6938fb60c4  patchmgr_functions
#
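If you prefer not to edit the file by hand with vi, the same update can be scripted; this is just a sketch, assuming md5sum_files.lst keeps the standard "md5sum  filename" layout shown above:
# Compute the md5 of the modified patchmgr
new_md5=$(md5sum patchmgr | awk '{print $1}')
# Replace the old patchmgr entry (the backup copy of md5sum_files.lst was taken above)
sed -i "s|^[0-9a-f]\{32\}  patchmgr$|${new_md5}  patchmgr|" md5sum_files.lst
# Check the result
grep " patchmgr$" md5sum_files.lst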

And you are good to go: you can now run your patched patchmgr!

Again, do not try this at home, this blog is for educational (and fun) purposes only. Nothing here would be supported by Oracle -- but it is still good to know! :)

Exadata: Repair a corrupted / broken RPM database

Storage servers and DB nodes are Linux machines that use RPM as their package manager. RPM has a database to store the installed packages, etc. It has happened a few times that this database became corrupted after an Exadata patching, and you usually discover it during your next patching session with this kind of complaint from the pre-requisites:
2020-11-12 13:12:15 +1100        :FAILED : Check space and state of cell services.
You can note that the error message is Check space and state of cell services, so the first thing is to check whether you have a space issue (/ needs 3 GB for patching -- the dcli check is shown below); once you have verified that space is not the problem, you know the issue is with the state of your component (cell or DB node).
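The space check can be run on all the cells at once with dcli (cell_group being your usual file listing the cells):
dcli -g ~/cell_group -l root "df -h /"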

You then have to check the hostname.log of the culprit in the patchmgr directory (usually all the "hostname.log" logfiles have a very similar size, so if one is bigger, a simple ls will most likely point you to the culprit), and you may find:
cel02: [ERROR] Can not continue. Runtime configuration is not consistent with values configured in /opt/oracle.cellos/cell.conf.
cel02: [ERROR] Run ipconf to correct the inconsistencies. Failed check: /root/_cellupd_dpullec_/_p_/ipconf -check-consistency -at-runtime -semantic -verbose
This is not excessively verbose, but it gives you a good hint and the command that failed; you can then re-execute this command and see how it goes:
[root@cel02 ~]# /root/_cellupd_dpullec_/_p_/ipconf -check-consistency -at-runtime -semantic -verbose
error: rpmdb: BDB0113 Thread/process 23845/140082556999744 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
error: rpmdb: BDB0113 Thread/process 23845/140082556999744 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages database in /var/lib/rpm
[Info]: ipconf command line: /root/_cellupd_dpullec_/_p_/ipconf.pl -check-consistency -at-runtime -semantic -verbose -nocodes
Logging started to /var/log/cellos/ipconf.log
[Warning]: File not found /etc/ntp.conf
. . . much more output, not useful in this scenario . . .

We can clearly see 2 issues here:
  • Missing /etc/ntp.conf: this can be ignored, it is documented in note 2689297.1
  • An RPM issue

We can confirm the RPM issue just by querying the RPM database:
[root@cel02 ~]# rpm -qa
error: rpmdb: BDB0113 Thread/process 23845/140082556999744 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
error: rpmdb: BDB0113 Thread/process 23845/140082556999744 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages database in /var/lib/rpm
[root@cel02 ~]#
So here we have to rebuild this corrupted RPM database, and the good news is that it can be done 100% online with no disruption:
[root@cel02 ~]# mkdir /var/lib/rpm/backup
[root@cel02 ~]# cp -a /var/lib/rpm/__db* /var/lib/rpm/backup/
[root@cel02 ~]# rm -f /var/lib/rpm/__db*
[root@cel02 ~]# rpm --rebuilddb
[root@cel02 ~]#
Easy, right? You can now verify the good health of your RPM database:
[root@cel02 ~]# rpm -qa | wc -l
455
[root@cel02 ~]#
You can revalidate the whole configuration:
[root@cel02 ~]# cellcli -e alter cell validate configuration
Cell cel02 successfully altered
[root@cel02 ~]#

And you are good to go: everything is now fixed and clean, and your pre-requisites and future patching sessions will now work!
Note that this example is with a cell but it works the same way with a database node (and any Linux server).
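If you want to proactively catch a corrupted RPM database across the whole fleet before your next patching session, a simple dcli sweep does the job; this is just a sketch, using the usual cell_group and dbs_group host list files -- any host reporting RPM_DB_BROKEN is a candidate for the rebuild shown above:
# Cells
dcli -g ~/cell_group -l root "rpm -qa > /dev/null 2>&1 && echo RPM_DB_OK || echo RPM_DB_BROKEN"
# DB nodes
dcli -g ~/dbs_group -l root "rpm -qa > /dev/null 2>&1 && echo RPM_DB_OK || echo RPM_DB_BROKEN"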
