
cell-status.sh: An overview of your Exadata cell and grid disks

You may already know rac-status.sh, which gives you an overview of your RAC/GI resources at a glance; here is now cell-status.sh, which gives you the status of the cell disks and the grid disks of an Exadata !
It turns out to be very useful on a daily basis, so let's check how it looks straight away !

A sample output:



Let's describe this first screenshot:
  • On top of the output, and as usual on all my Exadata scripts, the Exadata model is shown
  • Then the first table shows the cell disks, with the first column listing the analyzed cells
  • Then 2 sets of columns, one for the FlashDisks and one for the HardDisks, each containing 3 columns:
    • Nb : number of disks
    • Normal : number of cell disks with the status "Normal"; it appears in red if the number of "Normal" disks is less than the number of disks -- you will then quickly see if some disks are not back online after a patch for example
    • Errors: Number of errors on the cell disks
  • A second table shows the status of the grid disks, with the first column listing the analyzed cells
  • 1 set of columns per diskgroup, each containing 3 columns:
    • Nb : number of disks
    • Online: number of "Online" grid disks; a red xx is shown if some disks are Offline -- you will then quickly see if you have any issue with your configuration
    • Errors: Number of errors on the grid disks
  • A legend under the tables, which is self-explanatory (the cellcli queries these numbers come from are sketched just below)
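
Under the hood, these numbers come from cellcli on each cell. A minimal sketch of the kind of queries involved (run on a cell, or through ssh from a database node; the exact attribute lists the script uses may differ slightly):

# Cell disks: status and error counter per disk -- what feeds the first table
cellcli -e list celldisk attributes name,status,size,errorcount,disktype
# Grid disks: status, ASM mode status and error counter per disk -- what feeds the second table
cellcli -e list griddisk attributes name,status,asmmodestatus,errorcount,asmdeactivationoutcome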

An Extreme Flash configuration

In an Extreme Flash configuration, you will have only 1 set of columns in the cell disks table, as it contains only Flash Disks.

An asmDeactivationOutcome issue


As having the asmDeactivationOutcome parameter set to anything other than Yes is really something you don't want, the script shows it with a red background, as you can see on the above screenshot. You can then quickly spot any problem related to this parameter, investigate it and fix it ASAP.
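
If you want to quickly double check this outside of the script, a minimal sketch using the standard dcli and cellcli tooling (the cell_group file path and the user are illustrative) could be:

# List any grid disk whose asmDeactivationOutcome is not Yes, across all the cells
dcli -g ~/cell_group -l root "cellcli -e list griddisk attributes name,asmDeactivationOutcome where asmDeactivationOutcome != 'Yes'"

No line returned means you are good; any grid disk listed here deserves an investigation.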

List of failed disks (-v option)

The above outputs are cool, but you may also want to know which disks have issues; this is the purpose of the -v option, which adds, after the tables, details of the failed cell disks and grid disks like the ones below:
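
An illustrative invocation (the hostname is just an example):
[root@exadb01]# ./cell-status.sh -v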


Users

To fit the needs of every configuration you may have, I made the way of executing this script and of connecting to the cells flexible, knowing that the user who executes the script must have passwordless SSH connectivity to the cells; here is how it works:
  • If cell-status.sh is executed as root, then root is used to connect to the cells (it is defined on top of the script by USER="root")
  • If cell-status.sh is executed as a non root user, then cellmonitor is used to connect to the cells (it is defined on top of the script by NONROOTUSER="cellmonitor")
  • You can change this behavior by forcing the use of a specific user with the -u option (see the examples below)
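
A few illustrative invocations to make this concrete (hostnames are examples, not part of the script):
[root@exadb01]# ./cell-status.sh                    # executed as root => connects to the cells as root
[oracle@exadb01]$ ./cell-status.sh                  # executed as a non root user => connects as cellmonitor
[oracle@exadb01]$ ./cell-status.sh -u cellmonitor   # force a specific user with the -u option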

List of cells

The list of cells to report about can also be customized as below:
  • If cell-status.sh is executed as root, it uses ibhosts to build the list of cells to connect to
  • If cell-status.sh is executed as a non root user, it uses the databasemachine.xml file to build the list of cells to connect to
  • You can also specify a specific list of cells to analyze using the -c option and a cell_group file (see the example below)
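
A cell_group file is simply a text file listing the cells to connect to, one per line (the same kind of file you would use with dcli); an illustrative example with made-up cell names:
[root@exadb01]# cat ~/cell_group
exacel01
exacel02
exacel03
[root@exadb01]# ./cell-status.sh -c ~/cell_group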

Option -h for help

Feel free to check the help using the -h option:
[root@exadb01]# ./cell-status.sh -h

When patching

This script is very useful when patching Exadata, and it is now fully integrated in my Exadata patching procedure.
I check the status of the cells during the pre-requisites phase to be sure I am going to patch a healthy system, as well as right before patching the cells:
[root@exadb01]# ./cell-status.sh | tee -a cell_status_before_patching
I then re-check this status after the cells have been patched:
[root@exadb01]# ./cell-status.sh | tee -a cell_status_after_patching
And a simple diff shows any issue that could have happened during the cells patching, like disks that are not properly back online:
[root@exadb01]# diff cell_status_before_patching cell_status_after_patching
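
If you script your patching procedure, here is a minimal sketch of how these three steps could be tied together (file names are illustrative, not part of the script):

# Snapshot the cell status before patching the cells
./cell-status.sh | tee cell_status_before_patching
# ... patch the cells ...
# Snapshot the cell status after patching and compare with the "before" one
./cell-status.sh | tee cell_status_after_patching
if ! diff cell_status_before_patching cell_status_after_patching; then
  echo "WARNING: cell status differs after patching, investigate before moving forward"
fi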

The code

You can download the code from my github repo.

Enjoy !

13 comments:

  1. As usual, you always rock with new scripts for Exadata.
    Looks like the error shown by your script is false, or maybe I am missing something.
    When I checked the flashdisk status, I could not find any error.


    Cell Disks | HardDisk | FlashDisk |
    | Nb | Normal | Errors | Nb | Normal | Errors |
    ---------------------------------------------------------------------------
    Exalabcel1 | 12 | 12 | 0 | 16 | 16 | 0 |
    Exalabcel2 | 12 | 12 | 0 | 16 | 16 | 0 |
    Exalabcel3 | 12 | 12 | 0 | 16 | 16 | 0 |
    Exalabcel4 | 12 | 12 | 0 | 4 | 4 | 0 |
    Exalabcel5 | 12 | 12 | 0 | 4 | 4 | 130 |
    Exalabcel6 | 12 | 12 | 0 | 4 | 4 | 0 |
    ---------------------------------------------------------------------------

    Failed Cell Disks details
    Cell | Name | Status | Size | Nb_Error | Disktype |
    ---------------------------------------------------------------------------------------------
    Exalabcel5 | FD_03_Exalabcel5 | normal | 1.455474853515625T | 130 | FlashDisk |
    ---------------------------------------------------------------------------------------------

    1. Thanks !

      What do you mean by you found no error ? 130 errors are reported on the FD_03_Exalabcel5 device; it is the number of errors reported by "cellcli -e list celldisk attributes name,status,size,errorcount,disktype". Can you check with this command on Exalabcel5 ?

    2. You are right, there are some errors in that particular flash disk. Let me check how to get rid of those.
      Thank You.

    3. You should do a sundiag of the cell and send to support for analysis. Some errors are ignorable, they will let you know.

      I am working on trying to find a way to add this in the cell-status.sh script -- far easier said than done -- maybe one day :)

    4. Yes, I opened an SR yesterday with Support and they confirmed that the error count is > 0 even though the status is normal.
      "There is no impact on the functionality and from Hardware perspective there is no clear indication of a possible failure now or in the near future.
      However, we recommend to keep the error counter under monitoring for the next couple of days and if a rapid increase of errors is observed, please let us know."
      So, I have requested them to create an SR with the EEST group to check from a software point of view.

    5. Good to hear; it would be good to find a way to reset this error counter to 0 so that everything red would mean something newly wrong and not old errors.

      In the meantime, you can monitor this number easily with cell-status !

  2. This comment has been removed by the author.

  3. Hi,

    I am getting this error while executing on X8M. Help to resolve.


    ibwarn: [358611] mad_rpc_open_port: client_register for mgmt 1 failed
    src/ibnetdisc.c:784; can't open MAD port ((null):0)
    Error: No cells specified.

    Cluster is a X8M-2 Eighth Rack HC 14TB

    Cell Disks |
    |
    ---------------------
    ---------------------


    1. Please use:

      ./cell-status.sh -c ~/cell_group

      You have to specify the cell_group file as there is no way to have this list dynamically with X8M and the new ROCE switches

  4. This comment has been removed by the author.

  5. Hi, can this script be adapted to ExaC@C Gen2? Thanks and regards.

    1. I am afraid not as you cannot access the cells with ExaCC.

    2. And what if we were able to access the cells with ExaCC Gen2? :)

