An Unknown DBA blog: rac-mon.sh: a quick and efficient GI 11g,12c,18c monitoring tool based on rac-status.sh

When I recently published a rac-status.sh update, I was asked whether rac-status.sh could also detect a resource that has moved to another node, which would mean that something has gone wrong.
I like to keep things simple, and I really like the following principles from the Unix philosophy:

Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
Expect the output of every program to become the input to another

I therefore took the opportunity to develop another tool, rac-mon.sh, which uses the rac-status.sh output to monitor a cluster and warn when something goes wrong!

1/ Overview

rac-mon.sh compares the current status of a cluster with a status previously saved in a reference file. If any difference is found (meaning something has changed), the differences are displayed and/or sent by email (you can then page someone).
The "reference file" holds what is considered the "good" status of your cluster (all the resources up and running on the nodes you want, with the status you want). Do not worry, rac-mon.sh creates it by itself the first time it runs.
rac-mon.sh uses rac-status.sh -a behind the scenes.
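
The comparison described above can be sketched in a few lines of shell. This is only an illustration of the idea, not the actual rac-mon.sh code: the function name compare_with_reference and its arguments are hypothetical, and it assumes the status command prints the cluster state on standard output.

```shell
#!/bin/bash
# Minimal sketch of the reference-file comparison (NOT the actual rac-mon.sh code).
# compare_with_reference is a hypothetical helper:
#   $1              = path of the reference file
#   other arguments = command that prints the current status on stdout
compare_with_reference() {
    local reference="$1"; shift
    local current changes rc=0
    current=$(mktemp) || return 2
    "$@" > "$current"                     # capture the current status

    if [ ! -f "$reference" ]; then
        # First run: the current status becomes the "good" status
        echo "No reference file found at $reference, creating it"
        cp "$current" "$reference"
    fi

    # diff <current> <reference>: lines starting with '<' come from the
    # current status, lines starting with '>' from the reference
    if changes=$(diff "$current" "$reference"); then
        echo "No change has been identified, all good !"
    else
        echo "The below changes have been identified:"
        echo "$changes"
        rc=1
    fi
    rm -f "$current"
    return "$rc"
}
```

Under these assumptions, it would be called as compare_with_reference ~/rac-status_reference ~/rac-status.sh -a, returning 0 when nothing has changed and 1 otherwise.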

2/ Prerequisites

The only prerequisite is to have rac-status.sh downloaded. It is also a good idea to check that rac-status.sh works properly.
You can save rac-status.sh anywhere; rac-mon.sh just has to know where it is, so please update the variable below if you save rac-status.sh somewhere other than the home directory of the user who executes it.

RACSTATUS=~/rac-status.sh                        # The rac-status.sh script

As mentioned earlier, rac-mon.sh uses a reference file to save the "good" status of a cluster. The default is shown below; feel free to modify this path to fit your needs:

REFERENCE=~/rac-status_reference                 # The reference file where is saved the good status of your cluster

3/ A first execution

When you first execute rac-mon.sh, it creates the reference file itself and then checks whether the current status of the cluster differs from the reference:

[oracle@exadatadb01]$ ./rac-mon.sh
        No reference file found at /home/oracle/rac-status_reference, creating it . . . OK
        No change has been identified across the cluster, all good !
[oracle@exadatadb01]$ echo $?
0
[oracle@exadatadb01]$

Indeed, the first execution will always be a successful one, since the reference has just been created from the current status.
It is also worth mentioning that the script exits with a value of 0 when no issue is discovered. You can take advantage of this to schedule rac-mon.sh from any monitoring tool you have already deployed and integrate it into your monitoring armada fairly quickly.

4/ Errors detected

You may experience some issues during the life of your cluster; if any differences are detected, they are displayed:

[oracle@exadatadb01]$ ./rac-mon.sh
        The below changes have been identified across the cluster:
8c8
<  LISTENER      | TCP:1521          |          -         |       Online       |   Listener   |
---
>  LISTENER      | TCP:1521          |       Online       |       Online       |   Listener   |
31c31
<  proddb        | app               |         -          |       Online       |
---
>  proddb        | app               |       Online       |       Online       |
[oracle@exadatadb01]$ echo $?
1
[oracle@exadatadb01]$

In the above example, you can quickly see that the LISTENER listener was Online on the first node but no longer is. The production app service is also no longer Online on node 1. You can then quickly fix the issue because you know where to look, which is always better than having to investigate every component of a cluster.
It is also worth mentioning that the script exits with a value of 1 when an issue is discovered, so a monitoring tool that schedules rac-mon.sh can rely on the exit code to raise an alert.
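
As an illustration of how the exit code could be consumed, here is a minimal sketch of a wrapper that maps it to the usual Nagios-style plugin return codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN). The check_cluster function and the RACMON variable are hypothetical names used for this example; they are not part of rac-mon.sh.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of rac-mon.sh) translating its exit code
# into Nagios-style plugin return codes.
# RACMON is an assumed location; adjust it to where you saved rac-mon.sh.
RACMON="${RACMON:-$HOME/rac-mon.sh}"

check_cluster() {
    local output rc=0
    output=$("$RACMON" 2>&1) || rc=$?     # keep the output and the exit code
    case "$rc" in
        0) echo "OK - no change identified across the cluster"
           return 0 ;;
        1) echo "CRITICAL - cluster status differs from the reference"
           echo "$output"
           return 2 ;;
        *) # anything else (script missing, not executable, ...) is unexpected
           echo "UNKNOWN - rac-mon.sh exited with code $rc"
           echo "$output"
           return 3 ;;
    esac
}
```

Treating any exit code other than 0 or 1 as UNKNOWN is a design choice of this sketch: it avoids reporting a healthy cluster when rac-mon.sh itself could not run.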

5/ Warn someone

Finding issues is nice, but warning someone to investigate them is better. rac-mon.sh can send emails (with the -e option) when issues are found:

[oracle@exadatadb01]$ ./rac-mon.sh -e
        The below changes have been identified across the cluster:
8c8
<  LISTENER      | TCP:1521          |          -         |       Online       |   Listener   |
---
>  LISTENER      | TCP:1521          |       Online       |       Online       |   Listener   |
31c31
<  proddb        | app               |         -          |       Online       |
---
>  proddb        | app               |       Online       |       Online       |
        Sending en email to     dbaoncall@company.com        . . .  OK
[oracle@exadatadb01]$

I have also implemented a -s option if you want to get an email even if no issue is found:

[oracle@exadatadb01]$ ./rac-mon.sh -s
        No change has been identified across the cluster, all good !
        Sending en email to     dbaoncall@company.com        . . .  OK
[oracle@exadatadb01]$

If you do not want to bother with these options, you can change the defaults to always send emails (on failure and/or success) by setting these variables to "Yes":

EMAIL_ON_FAILURE="No"                  # Default behavior to send an email if an error is detected (-e option) - put Yes to always send emails
EMAIL_ON_SUCCESS="No"                  # Default behavior to send an email even if no error is detected (-s option) - put Yes to always send emails

You also need to set the email address(es) you want the alerts sent to:

EMAILTO="youremail@company.com"        # The email to send the alert to

You can also modify the subject of the emails:

FAILURE_SUBJECT="Error : Cluster status at "`date`      # Subject of the email sent
SUCCESS_SUBJECT="OK : Cluster status at "`date`         # Subject of the email sent

6/ Recreate the reference file

As rac-mon.sh relies on a reference file which contains the good status of your cluster, you have to update this file whenever the good status changes on purpose: a service has been deliberately moved to another node, another database has been added to the cluster, etc. Indeed, you do not want to page anyone for a newly added database. In these cases, just recreate the reference file manually:

[oracle@exadatadb01]$ ./rac-status.sh -a > ./rac-status_reference
[oracle@exadatadb01]$
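
If you would rather keep a copy of the previous reference before overwriting it, the same step could be wrapped as in the sketch below. The refresh_reference function is hypothetical (it is not shipped with rac-mon.sh); RACSTATUS and REFERENCE mirror the variables shown earlier, and the sketch assumes rac-status.sh returns a non-zero exit code when it fails.

```shell
#!/bin/bash
# Hypothetical helper (not part of rac-mon.sh): back up the current reference
# file, then recreate it from a fresh rac-status.sh -a run.
RACSTATUS="${RACSTATUS:-$HOME/rac-status.sh}"
REFERENCE="${REFERENCE:-$HOME/rac-status_reference}"

refresh_reference() {
    if [ -f "$REFERENCE" ]; then
        # Keep the previous "good" status, suffixed with a timestamp
        cp -p "$REFERENCE" "${REFERENCE}.$(date +%Y%m%d_%H%M%S)" || return 1
    fi
    # Write to a temporary file first so that a failed run (assuming a
    # non-zero exit code) does not leave a truncated reference behind
    if "$RACSTATUS" -a > "${REFERENCE}.new"; then
        mv "${REFERENCE}.new" "$REFERENCE"
    else
        rm -f "${REFERENCE}.new"
        return 1
    fi
}
```

The timestamped copies make it possible to compare the new "good" status with the previous one using diff, should a later alert look suspicious.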

And then test rac-mon.sh quickly to double check that everything looks good:

[oracle@exadatadb01]$ ./rac-mon.sh
        No change has been identified across the cluster, all good !
[oracle@exadatadb01]$

7/ Cron rac-mon.sh

You may want to schedule rac-mon.sh on a regular basis so as not to miss any issue:

*/5 * * * * /home/oracle/rac-mon.sh -e >> /var/log/rac-mon.log 2>&1

Remember to use logrotate to purge this log.
Note that you only need to schedule rac-mon.sh on one node of your cluster.
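
Since the log is appended to every 5 minutes, it can help to know when each line was written. The sketch below is one possible way to do this and is not part of rac-mon.sh: timestamp_lines, run_racmon_logged and the RACMON variable are hypothetical names.

```shell
#!/bin/bash
# Hypothetical cron wrapper (not part of rac-mon.sh): prefixes each line of
# the rac-mon.sh output with a timestamp so that runs can be told apart.
# RACMON is an assumed location; adjust it to where you saved rac-mon.sh.
RACMON="${RACMON:-$HOME/rac-mon.sh}"

timestamp_lines() {
    local line
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

run_racmon_logged() {
    # pipefail (set in a subshell so it does not leak) makes the pipeline
    # return the rac-mon.sh exit code instead of the one from timestamp_lines
    ( set -o pipefail; "$RACMON" "$@" 2>&1 | timestamp_lines )
}
```

Saved as a script that ends with a call such as run_racmon_logged -e, it could be scheduled in place of the direct rac-mon.sh call in the crontab entry above, with the same redirection to the log file.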

8/ Option -h for help

As usual, a little bit of documentation is always welcome:

[oracle@exadatadb01]$ ./rac-mon.sh -h
NAME
        rac-mon.sh - A quick and efficient RAC/GI 12c monitoring tool based on rac-status.sh (https://goo.gl/LwQC1N)

SYNOPSIS
        ./rac-mon.sh [-e] [-s] [-h]

DESCRIPTION
        rac-mon.sh needs the rac-status.sh script to be downloaded and working (https://goo.gl/LwQC1N)

        rac-mon.sh executes rac-status.sh and compares it with a previously taken good status of your cluster
        If no previous status exists, you will be prompted to create it with the command to do so.

        If rac-mon.sh finds differences betwen the current status of the cluster and the good status in the reference file,
        you will be told about and rac-mon.sh will exit 1. If no difference found, you will be told about and rac-mon.sh will exit 0.

        rac-mon.sh can also send emails about this depending on the -e and -s option as well as the EMAIL_ON_FAILURE and EMAIL_ON_SUCCESS variables.

OPTIONS
        -e      Sends an email to the email(s) defined in the EMAILTO parameter if an issue has been detected in the cluster
        -s      Sends an email to the email(s) defined in the EMAILTO parameter on success (even if no error has been detected)

                If you want to modify the script default to always send emails and not have to specify -e or -s,
                just change the values of these parameters on top of the script like this:
                        EMAIL_ON_FAILURE="Yes"
                        EMAIL_ON_SUCCESS="Yes"

        -h      Show this help
[oracle@exadatadb01]$

9/ Download rac-mon.sh

You can find rac-mon.sh code here.

This is a pretty easy and cool way to monitor all the resources of a RAC/GI 12c/18c cluster. Let me know if you like it!

1 comment:

Anonymous, March 5, 2019 at 2:05 AM
rac-mon.sh is a great tool. We have just rolled it out to our production environment, and have it running in all our pre-production areas.
One of our pain points is when services move without our knowledge, and this will help quickly identify when that happens.
