IB Switches: May the --force be with you !

Patching InfiniBand switches is usually hassle-free, but one day you may face an IB switch that is (very) reluctant to be patched. Note that this post is part of a more general Exadata patching troubleshooting blog.
This journey started with some failed IB Switches pre-requisites:
FAILED : DONE: Initiate pre-upgrade validation check on InfiniBand switch(es).
ERROR : FAILED run of command:/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/patchmgr -ibswitches /root/ib_group -upgrade -ibswitch_precheck
INFO : upgrade attempted on nodes in file /root/ib_group: [exa-ib1 exa-ib2 exa-ib3]
Looking at patchmgr.trc, I could find:
[patchmgr_send_notification_to_all_nodes][702]  Arguments: Failed 808 ibswitch
And in upgradeIBSwitch.trc:
[TRACE][/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/upgradeIBSwitch.sh - 1740][copyToIBSwitch][1740]   Arguments: exa-ib1 xcp /usr/local/bin/xcp
[WARNING][/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/upgradeIBSwitch.sh - 1749][copyToIBSwitch][]  [CMD: scp xcp root@\[exa-ib1\]:/usr/local/bin/xcp] [CMD_STATUS: 1]
    ----- START STDERR -----
    xcp: No such file or directory
    ----- END STDERR -----
[TRACE][/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/upgradeIBSwitch.sh - 1740][copyToIBSwitch][1740]   Arguments: exa-ib1 libxcp.so.1 /usr/local/lib/libxcp.so.1
[WARNING][/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/upgradeIBSwitch.sh - 1749][copyToIBSwitch][]  [CMD: scp libxcp.so.1 root@\[exa-ib1\]:/usr/local/lib/libxcp.so.1] [CMD_STATUS: 1]
    ----- START STDERR -----
    libxcp.so.1: No such file or directory
It looked as if patchmgr was unable to copy xcp and libxcp.so.1 to the switches, so it could have been a passwordless SSH connectivity issue. patchmgr tries to connect back to the database node used for patching, which may be impossible depending on what is in the switch's /etc/hosts or on the SSH security configuration defined in /etc/ssh/sshd_config; MOS has notes about this, such as Exadata: Patchmgr fails during the InfiniBand patching precheck. (Doc ID 2356026.1). But I knew about this: I was able to SSH to my switch, which could also SSH back properly, on any network interface, to the DB node I was using for patching, and I could manually scp the famous xcp and libxcp.so.1. Everything was supposed to be OK.
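For what it's worth, the SSH side can be ruled out quickly with a non-interactive test. The snippet below is only a minimal sketch, not something patchmgr ships: the function name check_passwordless_ssh is hypothetical, and it assumes a standard OpenSSH client (the BatchMode and ConnectTimeout options) on the node you run it from.

```shell
# Hypothetical helper (not part of patchmgr): succeeds only when key-based
# SSH to the given host works without any password prompt.
check_passwordless_ssh() {
    # -n keeps ssh from swallowing stdin when called inside a read loop.
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true 2>/dev/null; then
        echo "OK: passwordless SSH to $1"
    else
        echo "KO: passwordless SSH to $1"
        return 1
    fi
}

# Example: check every switch listed in a group file (one hostname per line).
# while read -r sw; do check_passwordless_ssh "$sw"; done < /root/ib_group
```

The same function can be run from the switch towards the DB node to test the way back, which is the direction that notes such as Doc ID 2356026.1 are concerned with.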

Checking further, I found that this reluctant switch was running a very old version:
[root@exadb01 ~]# ./exa-versions.sh -I ~/ib_group
       Cluster is a X5-2 Quarter Rack HC 8TB
         -- Infiniband Switches
       exa-ib1        exa-ib2        exa-ib3
----------------------------------------------------
       2.1.8-1       2.2.15-1        2.2.15-1
----------------------------------------------------
[root@exadb01 ~]#
Indeed, Note 888828.1 lists switch firmware 2.1.8-1 as "Supplied with Exadata 12.1.2.3.x", and 12.1.2.3 was released in April 2016. Keep in mind that patchmgr was released with version 12.2.1.1.0, which shipped IB switch firmware 2.2.4-3, so ** after ** this 2.1.8-1 (more on that later). It was therefore indeed an old version, and it also meant that I was not really the first one to face this issue, which had then not been resolved before :)
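To spot such a switch before starting, the firmware version can be compared with the first one patchmgr shipped. This is a minimal sketch under assumptions: version_lt is a hypothetical helper, it relies on GNU sort -V, and the 2.2.4-3 threshold is simply the version quoted above, not an official cut-off.

```shell
# Hypothetical helper: returns 0 (true) when version $1 is strictly older
# than version $2. Relies on GNU sort -V (version sort).
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# 2.2.4-3 is the firmware shipped with the first patchmgr release per the
# text above; the current version would come from `version` on the switch.
current="2.1.8-1"
if version_lt "$current" "2.2.4-3"; then
    echo "$current predates patchmgr"
fi
```

A plain string comparison would not be enough here, since 2.2.15-1 has to sort after 2.2.4-3; version sort handles the numeric fields properly.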

After investigating all of this with Oracle Support, who basically wanted to be sure that the SSH configuration was working, I was pointed to the manual way of patching an IB switch (which was how switches were patched before patchmgr), described here: https://docs.oracle.com/cd/E76424_01/html/E76431/z400029a1775330.html#scrolltoc. It is pretty straightforward: you load the package, the switch installs it and reboots, and that's it. So I gave it a go (you cannot upgrade directly to 2.2.16-1, which was my target version; you first have to upgrade to 2.2.7-2):
-> load -source fhttp://10.11.12.13/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/sundcs_36p_repository_upgrade_2.1_to_2.2.7_2.pkg
Downloading firmware image. This will take a few minutes.
Error: Couldn't connect to server
-> load -source http://10.11.12.13/patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/sundcs_36p_repository_upgrade_2.1_to_2.2.7_2.pkg
Downloading firmware image. This will take a few minutes.
Error: Couldn't connect to server
Both attempts failed miserably. I then realized that this could not work: there is neither an FTP nor an HTTP server running on the database server from which I wanted to load that package to the switch. The (very good) MOS engineer told me there was no other way: FTP or HTTP. Wow, FTP? Really? You mean that old buddy running on port 21? No way I can have this running on my DB node; FTP is a bit like the Nokia 3210: you remember it was great, but no way you can use it nowadays :D

As I could obviously not install an FTP or HTTP server anywhere close to that switch, I tried to scp the package to the switch itself and load it from there, locally:
-> load -source ftp://10.20.21.22/tmp/sundcs_36p_repository_2.2.7_2.pkg
Error: Insufficient disk space/memory. Firmware update requires minimum of
120 MB space in /tmp directory
80 MB space in / filesystem
120 MB of free memory 
It also failed miserably, as there is obviously not enough space on the switch to store a 180M package. This was getting tough:
  • patchmgr fails at pre-requisites
  • No way to have an FTP server to load the package to the switch
  • No way to have an HTTP server to load the package to the switch
  • Not enough space to copy the package to the switch to load it locally
  • Not sure how patchmgr manages it, but it can somehow load the package to the switch, as this is what it usually does (I have to check the patchmgr code to see what that magic trick is)
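The disk space part of that error can be checked before copying anything to the switch. The snippet below is a rough sketch: free_mb and has_room are hypothetical helpers, the 120 MB and 80 MB thresholds are the ones printed in the error above, and it assumes the switch's shell has df and awk available. The free memory requirement is not covered.

```shell
# Hypothetical helper: free space, in MB, of the filesystem holding a path.
# -P forces one line per filesystem, -k reports 1 KB blocks.
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Hypothetical helper: succeeds when path $1 has at least $2 MB free.
has_room() {
    [ "$(free_mb "$1")" -ge "$2" ]
}

# Example with the thresholds from the switch's error message:
# if has_room /tmp 120 && has_room / 80; then
#     echo "enough room for the firmware update"
# else
#     echo "not enough room: /tmp=$(free_mb /tmp) MB, /=$(free_mb /) MB"
# fi
```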

But still, I had to patch this switch, however badly things were looking at that point.

So we (the MOS engineer and I) thought that, as patchmgr was released after this switch version, it may simply not be aware of it, and the pre-requisites could then fail just because patchmgr didn't know that version. So I tried the upgrade (of that switch only), ignoring the pre-requisites with the --force option:
[root@exadb01 patch_switch_20.1.6.0.0.210113]# ./patchmgr -ibswitches ~/ib1 -upgrade --force yes
. . .
[INFO     ] Package will be downloaded at firmware update time via scp  <== a clue about how patchmgr does it -- but where does it find the disk space ? this is another story :)
[SUCCESS  ] Execute plugin check for Patching on exa-ib1
[INFO     ] Starting upgrade on exa-ib1 to 2.2.7_2. Please give upto 15 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
[INFO     ] Additional firmware load required. Starting secondary firmware load. DO NOT INTERRUPT or HIT CTRL+C
[INFO     ] Rebooting exa-ib1 to complete the firmware update. Wait for 15 minutes before continuing. DO NOT MANUALLY REBOOT THE INFINIBAND SWITCH
. . . looking good so far . . . 
[INFO     ] Validating the current firmware on the InfiniBand Switch
[SUCCESS  ] Firmware verification on InfiniBand switch exa-ib1
[INFO     ] Finished post-update validation on exa-ib1
[FAIL     ] Post-update validation on exa-ib1
[ERROR    ] Failed to upgrade exa-ib1 to 2.2.7-2. Cannot proceed with upgrading switch to 2.2.16_1
[FAIL     ] Update switch exa-ib1 to 2.2.16_1
[INFO     ] Aborting the process. Not going to try anymore switches. Retry after resolving the problems.
[FAIL     ] Overall status
So here it seems that the upgrade to 2.2.7-2 was OK but the post-upgrade steps were KO, maybe also because the original version of the switch was too old. I could verify that the version was now good:
[root@exa-ib1 ~]# version
SUN DCS 36p version: 2.2.7-2 <================ looks good
Build time: Nov 2 2017 09:21:37
. . .
[root@exa-ib1 ~]#
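When patchmgr reports a failure like this one, the same check can be scripted rather than eyeballed. This is a minimal sketch: parse_fw_version is a hypothetical helper, and the "SUN DCS 36p version:" line format is taken from the output above, so other switch models may print something different.

```shell
# Hypothetical helper: extract the firmware version from the switch's
# `version` output, read on stdin.
parse_fw_version() {
    sed -n 's/^SUN DCS 36p version: *\([^ ]*\).*/\1/p'
}

# Example (hostname and expected version are illustrative):
# actual=$(ssh root@exa-ib1 version | parse_fw_version)
# if [ "$actual" = "2.2.7-2" ]; then
#     echo "firmware OK"
# else
#     echo "unexpected firmware: $actual"
# fi
```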
OK, now I could run the pre-requisites for the upgrade to 2.2.16-1 on the switch, and they were OK:
----- InfiniBand switch update process ended 2021-02-12 12:19:33 +1100 -----
2021-02-12 12:19:33 +1100 1 of 1 :SUCCESS: Initiate pre-upgrade validation check on InfiniBand switch(es).
2021-02-12 12:19:33 +1100 :SUCCESS: Completed run of command: /patches/20.1.6.0.0/patch_switch_20.1.6.0.0.210113/patchmgr -ibswitches /root/ib1 -upgrade -ibswitch_precheck
2021-02-12 12:19:33 +1100 :INFO : upgrade attempted on nodes in file /root/ib1: [exa-ib1] 
I could then upgrade all 3 switches in a row, with patchmgr now taking care of the version differences to bring all of them to the target version, 2.2.16-1, which worked like a charm!

I could finally patch this reluctant-to-be-patched InfiniBand switch thanks to the --force option, needed because the switch was older than patchmgr. May the --force be with you!
