10 visual signs your monitoring may not be effective
There are ten visual ways to check, almost instantly, whether your IT monitoring needs to be improved, so you can reduce performance and availability issues and control operational costs.
You have made the (incredibly wise) decision to do something about monitoring efficacy in your business IT environment. But what if you don't know where to start? You could use various assessments (outcome-based, functionality-based, etc.) to measure monitoring and observability, and you should. But these can take time to perform.
In the meantime, there are some visual signs that there is almost certainly scope for improving the effectiveness of your monitoring.
Here are the 10 visual checks you can perform at a glance. These are not listed in any particular order.
1. Do you see screens that are packed with lots of data?
Too much data causes confusion. You need detailed, low-level data that allows you to get to the cause of a specific issue and remediate it. But you also need high-level data to understand the context of the cause. This way you'll know the impact of the issue, so you can mitigate the effect. Screens full of too much data do not provide this context.
Screens full of data can also be the result of a lack of decision-making about what data to collect. So, you can't see the important data because it's mixed in with all the other data that happened to be available.
It's hard to create simple, effective monitoring; it's easy to create complicated, confusing monitoring.
2. Do you see a screen with maps of your physical architecture?
If so, you may be missing out on seeing more valuable data.
Do you really need to see architectural mappings for a simple system? No. Use the space for more valuable data instead.
Do you need to see architectural mapping for complicated systems? Yes, a classic example being network topology. But check whether this belongs on the primary monitoring screen. Often a physical depiction of components is not required on a primary screen; there is far more valuable information you need before you need to see the physical architecture. It is more likely to be required on the second or third drill-down of an investigation.
Another question to ask is whether the component location is only required as metadata. If so, then yes, this data is needed, but is it needed on the primary screen?
Note: We are referring to physical architecture, not logical or business workflows. These can provide extremely valuable context on a primary screen.
3. Do you see a screen full of automatically discovered items?
If so, you may be missing out on seeing more valuable data.
Auto discovery is brilliant. It is brilliant if you don't know what you have. It is brilliant for delivering quick time to value. But it's not the end goal of your monitoring.
Auto discovery typically provides basic monitoring based on device types, so acts as a great starter. But you'll need to add your high value monitoring requirements on top.
If you see auto-discovered items, check that they are delivering all your monitoring requirements.
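To make that check concrete, here is a minimal sketch (in Python, with hypothetical host and check names) of comparing what auto discovery has set up against the monitoring you actually require, so coverage gaps stand out:

```python
# Hypothetical example: compare auto-discovered monitoring against your own
# list of required checks, to spot coverage gaps left by discovery defaults.

# What auto discovery set up for us (device-type defaults).
discovered_checks = {
    "web-server-01": {"cpu", "memory", "disk"},
    "db-server-01": {"cpu", "memory", "disk"},
}

# What the business actually needs monitored (our own requirements).
required_checks = {
    "web-server-01": {"cpu", "memory", "disk", "http_response_time"},
    "db-server-01": {"cpu", "memory", "disk", "replication_lag", "backup_age"},
}

for host, required in required_checks.items():
    missing = required - discovered_checks.get(host, set())
    if missing:
        print(f"{host}: auto discovery does not cover {sorted(missing)}")
```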
4. Do you always have a lot of red or yellow alerts? Or none?
If you look at your screens on various days over one or several weeks, do you always see a lot of red or yellow? Or do you never have any? If so, you may have poor alert governance.
This check is not about whether you have a stable application or system. This check is about why you always have the same number of alerts and what you are doing about them.
There are two extremes:
a) You always have too many alerts.
This can be a symptom of poor alert governance: for example, a lack of empowerment to get the right alerts, alerts that are poorly configured, or no clear workflow for dealing with alerts (a rough sketch of a better-governed alert rule follows these two extremes).
b) You never have any alerts.
This could mean that you don't know what you should be alerted on. This is surprisingly common, and the answer is not always obvious.
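As an illustration of the kind of governance that addresses the first extreme, here is a minimal sketch, with hypothetical names and thresholds, of an alert rule that has an explicit threshold, a persistence rule to reduce noise, a named owner, and a runbook link so there is a clear workflow for dealing with it:

```python
from dataclasses import dataclass

# Hypothetical sketch of basic alert governance: every alert has an explicit
# threshold, a named owner, and a runbook (workflow), and it only fires after
# the condition has persisted, to avoid noise.

@dataclass
class AlertRule:
    name: str
    threshold: float
    persist_samples: int   # how many consecutive breaches before alerting
    owner: str             # the team accountable for acting on it
    runbook_url: str       # the agreed workflow for dealing with it

def evaluate(rule: AlertRule, recent_values: list[float]) -> bool:
    """Fire only if the last N samples all breach the threshold."""
    window = recent_values[-rule.persist_samples:]
    return len(window) == rule.persist_samples and all(v > rule.threshold for v in window)

queue_depth_alert = AlertRule(
    name="order-queue-depth-high",
    threshold=500,
    persist_samples=3,
    owner="application-support",
    runbook_url="https://wiki.example.com/runbooks/order-queue",  # placeholder
)

if evaluate(queue_depth_alert, [480, 530, 560, 610]):
    print(f"ALERT {queue_depth_alert.name} -> {queue_depth_alert.owner}: {queue_depth_alert.runbook_url}")
```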
5. Do you see a screen full of graphs, candlestick charts and so on?
If so, you may not even be seeing monitoring, but analysis instead.
These types of data widgets are brilliant for communicating the relationship between multiple data items, and they can be highly effective for communicating monitored data. A classic example is comparing multiple values of the same data item and depicting whether the relationship is good or bad for the "live" value.
However, there are many, many more use cases for such tools when performing post-event analysis. Post-event analysis is not monitoring. Post-event analysis happens after the incident has occurred and been fixed, when you want to learn from what happened.
So, if you see lots of these types of widgets on your monitoring screen, check that the data you see is for monitoring. You would be amazed at how often you'll find you are looking at an analysis screen.
If it is a monitoring screen, see #1 – “lots of data.”
6. Do you see a screen with only one system being monitored on it?
If so, you probably have monitoring gaps.
There are very few, if any, IT systems, applications, services etc. that have no dependencies on other systems. So, unless you have a very specialized team with very specific monitoring requirements, you can assume you have gaps.
7. Do you see only screens full of tables of data?
If so, you may not have the full picture of what is happening.
Tables are not a great medium for providing context, which is important for understanding the impact, relationship and priorities of an issue.
Without context to go with the specific alert, you may be slower to remediate or at worst be unable to prevent a knock-on issue.
8. Do you see no comparison of what is happening now to what normally happens?
If so, you may be missing out on predicting an issue.
Most live data can be correlated with something else to make an operational judgement. For example, you know that if a process goes down, it's bad. You know that a business throughput metric is dependent on application runtime limits.
But for service-, business- or performance-related data, comparing what is happening now against a fixed rule does not always work. It's not always clear whether something is bad. A human or AI judgement on the behavior of a metric is needed. Highlighting when something is not normal allows that judgement to be made, or, at the very least, tells you there is a risk because things are not normal.
So, if you see no comparisons to the norm on your screens, check whether you need them.
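Here is a minimal sketch, using made-up numbers, of what a "normal vs. now" comparison can look like: the live value of a metric is compared against its recent history and flagged when it sits well outside the norm, so a human or AI can make the judgement:

```python
import statistics

# Minimal sketch of a "normal vs. now" comparison (assumed data and thresholds):
# compare the live value of a metric against its historical values for the same
# period and flag it when it drifts well outside the norm.

# Hypothetical history: orders processed per minute at this time of day.
history = [120, 115, 130, 125, 118, 122, 127, 119]
live_value = 62

mean = statistics.mean(history)
stdev = statistics.pstdev(history)

if stdev and abs(live_value - mean) > 3 * stdev:
    print(f"Not normal: {live_value} vs typical {mean:.0f} (±{stdev:.0f})")
else:
    print("Within normal range")
```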
9. Do your different user groups all look at the same views or data?
If so, your users may not be getting the data they need.
There are many teams using monitoring: business support, client support, DevOps, application support, revenue ops, security ops, and the list goes on.
Their requirements are different. What is important for one team is not important to another. This also applies when every team is monitoring the same system!
You should check if the users are getting what they need. There is a good chance they've been given a one-size-fits-all solution, so will not be getting what they need.
10. Do you see a monitoring solution built on nothing but log files?
If so, you may be missing out on valuable data and using the wrong tools.
Log file data, especially the whole file and not just messages, is brilliant for post-event analysis. But for real-time monitoring, log files should only be a part of the solution, not the whole solution.
Log files contain useful data for monitoring, such as keywords, messages, condition states, auditing etc. However, log files are only as good as the data in them. Do you trust this enough for your operational resilience?
Reasons why you should not only rely on log files:
a) Log files are typically designed around the application, not so much around what is happening outside the application.
b) Log files can contain bad data - it's a message of an event, not the event itself.
c) Log files contain secondary data (unless it's directly related to the app). For example, if it's important for an application to have a connection to another component, then monitor the primary source: monitor the connection itself. Don't rely on the app to tell you (see the sketch after this list).
d) There are often more fit-for-purpose methods to monitor the data.
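To illustrate point c), here is a minimal sketch, with a placeholder host and port, of monitoring the primary source directly: checking that a dependency can actually be reached, rather than waiting for the application's log to mention a lost connection:

```python
import socket

# Minimal sketch of monitoring the primary source rather than a log message:
# instead of waiting for the application to log "connection lost", check the
# dependency directly. Host and port below are placeholders.

def dependency_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the dependency can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not dependency_reachable("payments-db.internal.example.com", 5432):
    print("Dependency unreachable: raise an alert from the primary source")
```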
Conclusion
We hope you find the checks helpful. The fact that you can do these simple checks visually, and with no questions asked, makes them super easy to try.
Learn more
Webinar: What does a good monitoring strategy look like?