In Part 3 of this series, we covered the use of separate Policies as a reasonably targeted way of suppressing Alerts. But as mentioned in that article, that method left two potential use cases unaddressed:
- We only used the method to suppress alerts, rather than tune Alerts.
- We conceded that the method described in Part 3 did not offer a practical solution to granular tuning of Alerts.
So, consider the previous article as a basis for the next level of Alert tuning: Editing Alert Symptoms.
This article will explore two use cases for tuning Alerts by editing Symptoms:
- Threshold Tuning
- Selective Alert Suppression using Property Symptoms
Before we begin, I recommend users who are not familiar with vROPs Alert structures review the critical relationship between Alerts and Symptoms (they are NOT the same thing). See Chapter 4 of the vRealize Operations Manager Customization and Administration Guide for details.
But first, let’s review some more real-world examples.
Real World Example of the Need for Symptom Tuning
In the previous post, we cited two common examples for Alerts that operators may wish to suppress under certain conditions:
- Stress Alerts
- Disk Space Alerts (for disks known to be at or near full capacity by design)
These can also be examples of alerts requiring symptom tuning, or – as we will see in this article – more granular suppression by the use of symptoms.
But let’s consider another example from the real world: Several weeks ago, I had been paying a series of scheduled visits to a Health Care customer in Illinois, discussing many of these topics. Between two of those visits – as happens too often – the customer had a serious outage in the overnight hours.
The root cause of this outage was an “oldie but goodie”: SCSI conflicts between LUNs in their SAN. This caused intermittent and sporadic problems for VMs trying to access storage.
This customer was only operating vROPs with the vSphere Management Pack installed, which would NOT detect the root cause of this outage. The “smoking gun” in this situation would have been in the ESXi logs (which is a good illustration of how vROPs and vRealize Log Insight could have worked together to speed the resolution of a serious problem).
The customer understood the explanation of the vROPs vSphere Management Pack, but asked a very valid question: Even if vROPs – as configured – would not have detected the root cause, shouldn’t the out-of-the-box symptoms (i.e., the “effect”) have raised an alert? In this case, shouldn’t vROPs have detected out-of-the-ordinary I/O latency for the affected resources?
This was an excellent question, and it caused us to engage in a bit of sleuthing. We reviewed the data within vROPs during the relevant time frame, and sure enough, in both the Alerts View and the Timeline View, we saw several latency Alerts from vROPs for objects affected by this outage.
But this only illustrated a deeper issue: With the out-of-the-box vROPs settings in most user environments, I/O latency Alerts (for Read and Write) are very common. This is because the out-of-the-box Policies in the vSphere Management Pack and the pre-configured Alerts are based on vSphere Best Practices, which prescribe a tolerance for Latency up to 20-30 milliseconds. The fact is, latency at this level is not uncommon in customer environments. (See Figure 2 for a typical example.)
So, these alert criteria DID result in Alerts related to the SCSI outage, but the Alerts were lost in a sea of other similar Alerts. What was different about the latencies caused by the SCSI issue was the scale of the latency: in this case up to and beyond 1500 milliseconds.
Most virtualization and storage admins would agree there is a huge difference between latency issues of 20-30 ms (barely above tolerable) and latency up to 500, 1000, or 1500 ms.
NOTE: There is an argument in this example for the value of the Anomalies analytics in vROPs. Over time, even latencies of 20-30 ms that trigger hard threshold alerts would eventually be marked as “Normal” by vROPs Analytics. The Anomalies badge would then fire when truly abnormal latencies would be observed. This, too, was noted in this example, but the customer had not yet incorporated Anomaly behaviors into their troubleshooting efforts. See Figure 3 for a similar — albeit much less severe — example.
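To make the contrast between a hard threshold and learned “normal” behavior concrete, here is a minimal Python sketch. It is not the vROPs analytics algorithm (which this article does not describe); it simply assumes that “normal” means the mean of recent samples plus or minus three standard deviations. The sample values, the 20 ms hard threshold, and the function names are all illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical hard threshold, in milliseconds, chosen only for illustration.
HARD_THRESHOLD_MS = 20


def dynamic_band(history, k=3.0):
    """Return (low, high) bounds: mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return max(0.0, mu - k * sigma), mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """True when the value falls outside the band learned from history."""
    low, high = dynamic_band(history, k)
    return value < low or value > high


# Made-up latency samples (ms) for an environment that routinely sits at 20-30 ms.
history = [22, 25, 28, 24, 30, 26, 23, 27, 29, 25]

# Every sample breaches the hard threshold, so hard-threshold alerts are constant.
always_breaching = all(v > HARD_THRESHOLD_MS for v in history)

# A typical reading is "normal" for this environment; a 1500 ms reading is not.
typical_is_anomalous = is_anomalous(27, history)
outage_is_anomalous = is_anomalous(1500, history)
```

With this history, every sample breaches the hard threshold, yet only a reading on the scale of the outage falls outside the learned band.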
Anomalies aside, this experience illustrated a need for this customer to adjust vROPs for an entirely different kind of Alert notification than what was configured at installation. In my recommendation, I suggested a new variant of the Latency Alerts, called “Super Critical Disk Latency” that would only fire when latency values rose significantly above the 20-30ms layer that happens all too often. I arbitrarily recommended a threshold for this alert at about 500ms. Such occurrences should be very rare, but would be very critical in almost any instance they occurred.
So in addition to the suppression examples from our previous article, we will use the above example as our template for threshold tuning:
Use Case 1: Threshold Tuning in Symptoms
Before you begin, watch the following video which provides a good overview of the basics of vROPs Alerts and Symptoms.
vROPs Alert and Symptoms Overview Video (Duration: 5 Minutes)
We will use the Latency issue above as an example for Hard Threshold tuning in Alert Symptoms. To note, we will NOT create this alert from scratch. We will use the existing Latency Alert definition as the basis for the newly-tuned “Super Critical Latency” Alert definition.
- First Task: Copy and Edit the New Alert Definition in the Alerts Section
- Second Task: Add the New Alert Definition to the Appropriate Policies
First Task: Copy and Edit New Alert Definition
In this task we will use the Out of the Box Alert for Latency as the basis for an entirely new alert that will detect and notify for much more severe latency conditions:
1. Make sure you are logged into vROPs as an Admin user.
2. Click on the Content Quicklink at the top of the Navigation Pane:
a. If you reviewed the brief video in the above link, you will know that Alerts are made up of one or more symptoms. In our use case (dealing with Latency), we wish to use an existing symptom as the basis for a new, more critical performance Symptom. So that is where we will start.
3. Click on the “Symptom Definitions” Section
a. This will take you to the Library of All Existing Symptoms in the vROPs Alert Library. Note the categories (Adapter Type and Object Type). These categorizations will help you understand the scope of the Symptoms you create or edit. This is important, since there are similar alerts for different resource types. In this case, we want to capture severe Latency for Virtual Machines.
b. There are several Symptom types. For this case, make sure you have selected “Metric/Supermetric Symptom Definitions.”
4. To find Latency Symptoms, type “latency” into the Symptoms filter.
a. You will likely see several Symptom definitions related to Latency. You may need to scroll through the examples, to find the two that are related to Virtual Machines. (One each for Read Latency and Write Latency — See Figure 5).
5. Click on the Symptom: “Virtual Machine has high read latency” to select it.
6. Click on the “Clone” Icon at the top of the screen to create a copy of the Symptom.
7. This will open the Symptom Editor.
a. For this example, we will change three items in the Symptom:
- Description (The new symptom should be self-explanatory)
- Severity (The level of Alert Severity is much higher than the source Alert)
- Threshold (The metric condition must be higher to match the level of Severity)
b. Configure the following new values for this new Symptom:
- Description: “Virtual Machine has CRITICALLY HIGH read latency”
- Severity: Critical (Changed from Warning)
- Threshold: 500 (Changed from 15)
c. Note that the metric for the symptom (Virtual Machine: Aggregate of all instances|Read Latency (ms)) does not need to be specified, because it was already chosen as part of the cloned Symptom. When you create symptoms from scratch, the metric you select can be any metric collected by vROPs.
d. The value of the Threshold (500) is somewhat arbitrary. The goal is to set it high enough that when the Symptom and Alert fire, the condition warrants immediate action and will not be dismissed as a “false positive”, but not so high that it misses actual conditions that would be deemed critical. Keep in mind that in this exercise we are not deleting the original Warning Alert that fires for Latency conditions of 15 ms. The final value of the threshold is a matter of personal judgement.
e. Do not change the Wait Cycle or the Cancel Cycle values at this time.
8. Click the “Save” Button to save this symptom.
9. Repeat Steps 1-8, but in this instance, clone the Symptom: “Virtual Machine has high write latency”, and configure the cloned Symptom with the values below, so it appears as in Figure 8:
- Description: “Virtual Machine has CRITICALLY HIGH write latency”
- Severity: Critical (Changed from Warning)
- Threshold: 500 (Changed from 15)
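The relationship between the original Symptom and its clone can be sketched in a few lines of Python. This is only a model of the idea, not vROPs code or its API: the `MetricSymptom` class and the `triggered_severities` function are hypothetical, and the “greater than” comparison is an assumption about how the threshold condition is configured.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MetricSymptom:
    """Hypothetical stand-in for a metric Symptom definition."""
    description: str
    metric: str
    threshold_ms: float
    severity: str

    def is_triggered(self, value_ms: float) -> bool:
        # Assumption: the condition is "metric is greater than threshold".
        return value_ms > self.threshold_ms


# The out-of-the-box Symptom, using the values quoted in the steps above.
original = MetricSymptom(
    description="Virtual Machine has high read latency",
    metric="Read Latency (ms)",
    threshold_ms=15,
    severity="Warning",
)

# "Cloning" keeps the metric and changes only the three edited fields.
critical = replace(
    original,
    description="Virtual Machine has CRITICALLY HIGH read latency",
    threshold_ms=500,
    severity="Critical",
)


def triggered_severities(value_ms, symptoms=(original, critical)):
    """Return the severity of every Symptom the value would trigger."""
    return [s.severity for s in symptoms if s.is_triggered(value_ms)]
```

In this model a 25 ms reading triggers only the Warning Symptom, while a 1500 ms reading triggers both, because the original Symptom is left in place alongside the clone.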
Now that we have our Symptoms, it’s time to package them into new Alerts. Before we proceed, let’s first review the basic parts of Alerts:
We have just completed our new symptoms and, as we will see in a moment, we will make some changes in our new Alerts as well:
To create our first new Alert (for CRITICALLY HIGH Read Latency), complete the following steps:
1. Make sure you are logged in to vROPs as an Admin User.
2. Navigate to the “Content” Quicklink.
3. Click on “Alert Definitions”.
a. This will bring us to the Library of existing Alerts.
b. Just as we already had latency Symptoms we could clone as the basis for our new Symptoms, we also have existing latency Alerts we can use for the same purpose.
4. In the Alert Definitions window, type “latency” into the filter to find existing Alert definitions for latency.
a. Note the two existing Disk I/O latency alerts for Virtual Machines. These are based on the original symptoms with the lower latency thresholds that we cloned earlier. We will clone these Alerts using a similar method.
5. Select the Alert “Virtual Machine has Disk I/O read latency problem”.
6. Click on the “Clone” icon to create a copy of the Alert. This opens the Alert Definition Workspace.
a. You should note that the Alert Definition Workspace has 5 steps. We will only need to change a few of them to create our new alert. See Figure 11:
7. Make sure you have selected “Step 1: Name and Description” of the Alert Definition Workspace. Edit the Name of the Alert to read “Virtual Machine has CRITICAL Disk I/O read latency problem”.
a. OPTIONAL: Edit the Description of the Alert to read something similar.
8. Click on “Step 4: Add Symptom Definitions” of the Alert Definition Workspace.
a. Note the existing Symptoms of this Alert. As you can see, this is an example of an Alert that requires more than one symptom to fire the Alert. There is an existing read latency symptom we will replace. But there are additional symptoms for Co-Stop and CPU swap wait. For the purposes of this exercise, it is up to you whether you would prefer to leave these as criteria for our new alert, or remove them, leaving only our new symptom.
9. Click on the “X” next to the Symptom called “Virtual Machine has high read latency” to remove it from the Alert Definition.
10. In the list of “Symptom Definitions” at the left of the Alert Definition Workspace, type the word “CRITICALLY” to help find the new symptoms we created in the previous task.
11. When you find the Symptom “Virtual Machine has CRITICALLY HIGH read latency”, drag it to the “Symptoms” area of the Workspace.
12. Click on “Step 5: Add Recommendations” of the Alert Definition Workspace.
a. You will note that the original Alert Definition contains several possible recommendations for remediation of this condition. Since our new, more critical Alert is more severe, we may wish to add an additional recommendation to this Alert Definition. This is where your own IT Operational Procedures can help. You and your staff may have procedures in place to address issues of this type. This is your opportunity to add your own best practices to an Alert.
13. Click the plus sign “+” to add a new Recommendation.
14. For my example, I simply added a Recommendation to “NOTIFY THE SAN RAPID RESPONSE TEAM IMMEDIATELY” to the Alert. You can add something similar or something else more in line with your own procedures.
a. When complete, your Alert Definition should look something similar to Figure 12 (changes noted in red):
15. Click the “Save” button to save the Alert.
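The steps above amount to copying an Alert definition, swapping one Symptom, and appending a Recommendation. The Python sketch below models that under stated assumptions: `AlertDefinition` and `clone_and_swap` are hypothetical names, not vROPs objects; the Co-Stop and CPU swap wait Symptoms and the existing Recommendations are left out because their exact names are not given here; and the rule that all listed Symptoms must be active is an assumption, since the match setting depends on how the Alert is configured.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical stand-in for an Alert definition."""
    name: str
    symptoms: tuple
    recommendations: tuple = ()

    def fires(self, active_symptoms) -> bool:
        # Assumption: ALL listed Symptoms must be active for the Alert to fire.
        return all(s in active_symptoms for s in self.symptoms)


def clone_and_swap(alert, new_name, old_symptom, new_symptom, extra_recommendation):
    """Copy an Alert, replace one Symptom, and append one Recommendation."""
    symptoms = tuple(new_symptom if s == old_symptom else s for s in alert.symptoms)
    return replace(
        alert,
        name=new_name,
        symptoms=symptoms,
        recommendations=alert.recommendations + (extra_recommendation,),
    )


# Simplified original: only the read latency Symptom is modeled.
original_alert = AlertDefinition(
    name="Virtual Machine has Disk I/O read latency problem",
    symptoms=("Virtual Machine has high read latency",),
)

critical_alert = clone_and_swap(
    original_alert,
    new_name="Virtual Machine has CRITICAL Disk I/O read latency problem",
    old_symptom="Virtual Machine has high read latency",
    new_symptom="Virtual Machine has CRITICALLY HIGH read latency",
    extra_recommendation="NOTIFY THE SAN RAPID RESPONSE TEAM IMMEDIATELY",
)
```

The original definition is untouched by the clone, which mirrors the fact that the out-of-the-box Alert keeps firing at its lower threshold.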
So, now we have our CRITICAL Read Latency Alert. You can also create your new Write latency alert by repeating Steps 1 through 15 to clone the existing Alert “Virtual Machine has Disk I/O write latency problem”. Edit it so it appears similar to Figure 13:
FINAL STEP: Putting the New Alerts into Action.
Alerts do not become active until they are added to a Monitoring Policy and that Policy is assigned to the Groups to which you want the Alert to apply. We covered the basics of enabling and disabling Alerts within Policies in Part 3 of this series. Review those steps and make sure these two new Alerts are enabled for the resources to which you want them to apply. See Figure 14 for an example of what this should look like.
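As a rough mental model of why an Alert definition does nothing until a Policy enables it, consider the Python sketch below. Every policy, group, and VM name in it is hypothetical, and it deliberately ignores policy priority and inheritance; it only assumes that a VM outside every listed group falls back to a default policy.

```python
# All policy, group, and VM names below are hypothetical.
POLICY_ALERTS = {
    # The new Alert exists in the library but is not enabled here.
    "Default Policy": set(),
    "Critical Latency Policy": {
        "Virtual Machine has CRITICAL Disk I/O read latency problem",
    },
}

# Which Policy is assigned to which Group, and which VMs each Group contains.
GROUP_POLICY = {"Production VMs": "Critical Latency Policy"}
GROUP_MEMBERS = {"Production VMs": {"vm-app-01", "vm-db-01"}}


def enabled_alerts(vm_name, default_policy="Default Policy"):
    """Return the Alert definitions enabled for a VM via its Group's Policy."""
    for group, members in GROUP_MEMBERS.items():
        if vm_name in members:
            return POLICY_ALERTS[GROUP_POLICY[group]]
    # Assumption: VMs outside every listed Group use the default policy.
    return POLICY_ALERTS[default_policy]


NEW_ALERT = "Virtual Machine has CRITICAL Disk I/O read latency problem"
active_for_prod = NEW_ALERT in enabled_alerts("vm-app-01")
active_for_other = NEW_ALERT in enabled_alerts("vm-test-99")
```

Here the new Alert can fire for `vm-app-01` but not for `vm-test-99`, even though the definition is identical in both cases.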
Testing the New Alert
If you wish to verify the behavior of these new Alerts, you should be able to do so using a tool that generates I/O load, such as Iometer.
Use Case 2: Selective Suppression of Alerts Using Property Symptoms
So, now we know how to tune Alerts using Symptoms and adjusting thresholds. The next technique is to build on the Alert suppression method we described in Part 3 of this series, but doing so in a much more granular fashion.
In the next series of steps, we will demonstrate how to suppress specific Alerts on a resource by resource basis. We will do this using two key pieces of data:
- A specific feature of vROPs Alerts known as Property Symptoms
- vSphere Tags in vCenter
As we just demonstrated, Alerts in vROPs are based upon one or more Symptoms. Most of those symptoms are based upon performance metrics. But those are not the only kinds of symptoms that vROPs Alerts can recognize. In fact there are several additional symptom categories. Here is the complete list:
- Metric/Supermetric Symptoms
- Property Symptoms
- Message Event Symptoms
- Fault Symptoms
- Metric Event Symptoms
For the purposes of this exercise, the key Symptom type is Property Symptoms.
Our use case here is to manage Alerts by exception. We want to have a simple, straightforward method to view common alerts as they occur, and based on our needs, exempt or allow specific resources (VMs, Hosts, Datastores, etc) to continue or stop receiving events of this type.
For this example, we will return to the Stress Alerts we covered in Part 3 of this series.
Instead of suppressing Stress alerts for entire Clusters or other Groups of systems as we demonstrated in that article, we may wish to make realtime decisions to suppress the Stress Alert for individual VMs.
This is where Property Symptoms can help us. We can add some simple logic to the Stress Alerts to suppress it by specific configurations found within the VM.
So, let’s take a look by making this simple change:
1. Make sure you are logged into vROPs as an Admin user.
2. In the Navigation Pane, click on the “Content” Quicklink.
3. Click on the “Alert Definitions” link. You will be returned to the existing Alert definitions.
4. We are going to make an edit to the Stress Alerts, so in the filter, type “stress”.
5. Select either of the Stress Alert definitions (CPU or Memory).
a. The Alert Definitions Workspace opens.
7. Click on “Step 4: Add Symptom Definitions”.
8. Note the Drop Down Menu “Symptom Definition Type”. From that menu, select “Property”.
a. This will display all of the Property Types that vROPs collects for Virtual Machines from vCenter.
b. You will note there are many choices here: OS, VM Name, CPU and Memory configurations, etc. It might be our first instinct to choose VM Name as a method of exempting specific VMs from Alerts, but this would be inefficient from an administrative perspective. It would require you to visit the Alert editor for every exception.
10. The Property we are looking for here is “vSphere Tag”. Select “vSphere Tag” and the Symptom Criteria will appear. Complete the Criteria so it appears as it does in Figure 17:
a. To be clear about what we are trying to accomplish, let’s explain the logic in this symptom: We want to add a condition to an Alert so that the Alert is allowed only if a specific tag is not found on the object in vCenter. So by default, if you change nothing in vCenter, this Alert behaves as configured. If you add the tag “Exempt from Stress” to a VM in vCenter, however, any Alert with this Symptom will not fire for that VM.
11. Click Save, to save the Property Symptom.
12. Now we need to add the Property Symptom to the Stress Alert(s). The new Symptom should appear in your list of Property Symptoms. Simply drag it to the appropriate Symptoms area of the Stress Alert as shown in Figure 18.
13. Now, Save the Stress Alert Definition.
14. If you wish, you can repeat this for the other Stress Alert, or any other Alerts you would like to suppress using this vSphere Tag.
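The suppression logic configured above can be summarized in a short Python sketch. It is a model only, not vROPs code: the function names are hypothetical, the exact criteria in Figure 17 are assumed to be “tag is not present”, and the Alert is assumed to require all of its Symptoms to be active before it fires.

```python
EXEMPT_TAG = "Exempt from Stress"


def tag_symptom_active(vm_tags, tag=EXEMPT_TAG):
    """Property Symptom: active when the exemption tag is NOT on the VM."""
    return tag not in vm_tags


def stress_alert_fires(stress_symptom_active, vm_tags):
    """Assumption: the Alert fires only when ALL of its Symptoms are active."""
    return stress_symptom_active and tag_symptom_active(vm_tags)


# An untagged VM under stress still alerts, exactly as before the change.
untagged_fires = stress_alert_fires(True, vm_tags=set())

# Adding the tag in vCenter suppresses the Alert for that VM only.
tagged_fires = stress_alert_fires(True, vm_tags={"Exempt from Stress"})

# A VM that is not under stress never alerts, tagged or not.
idle_fires = stress_alert_fires(False, vm_tags=set())
```

Because the exemption lives in the tag rather than in the Alert definition, adding or removing an exception never requires another visit to the Alert editor.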
So, that’s it for configuration in vROPs.
To tag a Virtual Machine so that this Alert is suppressed, you simply open the VM in the vSphere Web Client. The good news is that you can do that from within vROPs itself!
To suppress Stress Alert for a specific VM:
1. Find the target VM in the vROPs UI.
2. From the Actions Menu, select “Open Virtual Machine in vSphere Client”.
3. Click on “Manage”, then click on “Tags”.
4. Click on the “New Tag” Icon.
5. Add the Tag “Exempt from Stress” and a Category.
6. That’s it. With this Tag applied, a VM will not receive Alerts whose definitions include this Property Symptom.
There are many different ways this can be applied. Of course, having a handful of tags already configured in vSphere can make this even easier. Then you can go into the vROPs Alert definitions and add the appropriate symptoms to the Alerts you wish to suppress on an exception basis.
Advantages vs. Disadvantages of this Method
This method of Alert Suppression has several advantages:
- It requires NO CHANGES to Monitoring Policies or creation of Custom Groups, although combining this method with those simply adds another level of administrative control.
- It is ideal for one-off exceptions.
- Once the preparation of Alerts (i.e., adding the appropriate Property Symptoms) is complete, this method is relatively easy and intuitive.
- If we are fortunate, we will soon have tools available to automate the application of vSphere Tags to many resources at once.
There are only a few disadvantages:
- There is some administrative overhead at the beginning (adding the Property Symptoms to the Alerts you want to suppress). However, you are not likely to need to apply this method to every Alert type; only to those that fire frequently, but that you don’t wish to suppress altogether.
- The method will not work for Datastores unless you are running vROPs 6.0.2 or later. Earlier versions of vROPs did not recognize vSphere tags for Datastores. Fortunately, this has been addressed in the 6.0.2 release.
Hopefully, by now, you are building a powerful set of tools and methods to tune and configure vROPs to work most effectively in your environment. For most customers, the methods described in this series thus far should be sufficient. But we can take this even further. So, stay tuned!
COMING UP NEXT: “Zero Alert Baseline” Using vROPs Property Inheritance