It does not seem possible to me to write an exhaustive blog about troubleshooting an Exadata patching session gone bad (or it would be incredibly pretentious). Indeed, an Exadata stack is a complete and complex mix of software and hardware which can, on top of that, be configured very differently depending on each company's needs, norms, compliance rules, etc.
The best way, in my humble opinion, to be able to efficiently troubleshoot a failure during an Exadata patching session is to:
- Know vi and grep to check the logfiles :)
- Have the full picture of the Exadata patching procedure
- Keep in mind that, with rolling patches, even a crash or a non-responsive server does not impact the uptime of the applications as everything is (at least) redundant, so take your time to troubleshoot and stay cool (as a cucumber)
- Have access to well documented procedures from the real life which help set up and/or manage Exadata components; indeed, the main reasons for problems during Exadata patching sessions are:
  - Failed pre-requisites:
    - Due to hardware alerts; open an SR to have the failed hardware fixed (a quick pre-check example follows this list)
    - Due to a misconfiguration of one of the components; you'll find the procedure below
  - A crash / timeout / non-responsive server during the patching:
    - A space issue, but these are usually detected by the pre-requisites
    - Usually, rebooting the server will be needed; you'll find the procedure below
    - If an SR has to be opened, an ILOM snapshot will be needed; you'll find the procedure below
- Again, this is not exhaustive and this is the beauty of it ! This is why this blog is a living blog and will be updated when new issues and solutions appear.
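Speaking of failed pre-requisites, a quick sanity pass over open hardware alerts and free space before (re)starting a session can save a lot of time; here is a minimal sketch, assuming root access on the servers (the grep filter and the mount points are illustrative, adapt them to your environment):

```
# On a storage cell: look for unresolved hardware alerts
cellcli -e list alerthistory | grep -i critical

# On a database node (dbmcli ships with the recent Exadata software versions)
dbmcli -e list alerthistory | grep -i critical

# Quick space check on the filesystems a patching session usually needs
df -h / /u01 /boot
```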
Troubleshooting a failed patch starts with checking the logfiles to get more information about the issue you just faced; below is a list of the most common logfiles (a quick way of scanning them is sketched right after the list):
- patchmgr.log -- the main patchmgr logfile
- patchmgr.trc -- a more detailed patchmgr output
- nodename.log -- the patchmgr output dedicated to that specific node
- /var/log/cellos/dbnodeupdate.log -- located on the node (not in the patchmgr directory), detailed log of the patch application on this specific node
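A minimal way of scanning these logfiles for the first sign of trouble (the grep patterns are illustrative, not exhaustive):

```
# From the patchmgr working directory: the first occurrences of trouble
grep -inE "error|fail|fatal" patchmgr.log | head -20

# A more verbose view, handy to follow a live run
tail -f patchmgr.trc

# On the node being patched: the dbnodeupdate details
grep -inE "error|fail" /var/log/cellos/dbnodeupdate.log | tail -50
```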
Each procedure listed below has been executed on real-life production Exadatas at least once (many have been used far more than once); a few of them are also sketched as command examples right after the list.
- ILOMs
- InfiniBand Switches ILOM -- stop, start, restart, status
- How to take an ILOM snapshot with the command line (example after this list)
- Set up / fix DNS configuration
- Set up / fix NTP configuration
- Change ILOM hostname
- Database nodes
- How to reboot a database server using its ILOM (same procedure applies for a storage server; example after this list)
- How to re-image an Exadata database server
- Reinstall a broken system RPM
- Repair a corrupted/broken RPM database (example after this list)
- Make your DB node blink ! (example after this list)
- dbnodeupdate.sh backup failed on one or more nodes
- Cells
- Shutdown or reboot a cell without impacting ASM (example after this list)
- How to reboot a storage cell using its ILOM (same procedure applies for a database node)
- Restart SSH on a storage cell with no SSH access
- How to re-image an Exadata cell storage server
- Reinstall a broken system RPM
- Repair a corrupted/broken RPM database
- Make your cell blink !
- Switches
- Patchmgr
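To give a flavour of the procedures listed above, here is the general shape of an ILOM snapshot taken from the command line; a minimal sketch, assuming you can SSH to the ILOM as root and that an SFTP server is reachable (hostnames, credentials and paths are illustrative):

```
# SSH to the ILOM of the affected server
ssh root@myserver-ilom

# From the ILOM prompt: collect a "normal" dataset and push it via SFTP
-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://user:password@mysftphost/tmp

# Check the progress until the collection shows as complete
-> show /SP/diag/snapshot
```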
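In the same spirit, rebooting a non-responsive database or storage server through its ILOM usually looks like the sketch below (the server name is illustrative; make sure ASM can tolerate the outage first, as shown in the cell example further down):

```
# SSH to the ILOM of the non-responsive server
ssh root@myserver-ilom

# Power cycle the host and watch the boot from the ILOM console
-> reset /SYS
-> start /SP/console

# ipmitool alternative from another node, if you prefer
ipmitool -I lanplus -H myserver-ilom -U root -P <ilom-password> chassis power cycle
```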
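The heart of the "shutdown or reboot a cell without impacting ASM" procedure is to verify that ASM can survive the grid disks going offline; a condensed sketch (run as root on the cell, and be patient with the resync at the end):

```
# 1. Check that ASM can cope with the cell going down: every row must say Yes
cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome

# 2. Offline all the grid disks, then shut down (or reboot) the cell
cellcli -e alter griddisk all inactive
shutdown -h now

# 3. Once the cell is back, reactivate the grid disks ...
cellcli -e alter griddisk all active

# 4. ... and wait until they are all ONLINE again before touching the next cell
cellcli -e list griddisk attributes name,asmmodestatus
```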
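For the broken RPM and corrupted RPM database items, the classic sequence is worth keeping at hand; a sketch, assuming stale Berkeley DB lock files are the culprit (the package name is a placeholder, and back everything up first):

```
# Back up the RPM database before touching it
cp -a /var/lib/rpm /var/lib/rpm.backup.$(date +%Y%m%d)

# Remove the stale lock files and rebuild the database
rm -f /var/lib/rpm/__db.*
rpm --rebuilddb

# Sanity check, then force-reinstall the broken package if needed
rpm -qa | wc -l
rpm -Uvh --force the_broken_package.rpm
```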
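And the fun one to finish: making a DB node or a cell blink just toggles its locator LED, which is very handy for the field engineer who has to find the right server in a full datacenter; a sketch via the ILOM, with an ipmitool alternative:

```
# From the ILOM prompt
-> set /SYS/LOCATE value=Fast_Blink
-> set /SYS/LOCATE value=Off

# Or from the host itself with ipmitool (blink for 255 seconds)
ipmitool chassis identify 255
```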
Hope it helps !