Monitoring the monitoring
What is monitoring?
IT monitoring ensures the availability of business services by checking, at regular intervals, the underlying IT estate that enables those services. The estate may include infrastructure (hardware, networks, storage), middleware, databases and applications, as well as the end-user experience.
Most financial services companies that run mission-critical services deploy enterprise monitoring tools to monitor their complete IT estate.
Why is monitoring the monitoring important?
In financial services, especially for stock exchanges, banks, brokers and clearing houses, the availability and performance of business services play a key role in the reputation of a platform and, hence, in retaining and winning clients.
Organizations rely ever more heavily on IT services, and with each incremental modernization the potential for service disruption, and the associated cost, increases. Any disruption or outage can also have stark implications for the reputation of the firm concerned: it then faces not only an uphill task in regaining trust but also regulatory scrutiny and fines.
In view of the above, IT service monitoring plays a very important role in avoiding outages. When an outage does happen, the monitoring-related reasons typically cited are:
- The service was not being monitored (monitoring not configured or outdated).
- Monitoring was configured but not running (script error / config error).
- There were no alerts even though monitoring was in place.
- The alert didn’t catch the attention of the operator.
- The alert was lost in a sea of alerts that day.
Thus, it is critical to monitor the health of the monitoring system itself, so that monitoring does not become one of the root causes of an outage.
How to monitor the monitoring
We recommend the following five basic checks to monitor the monitoring.
Examples of how to do this in Geneos are given in the screenshots below.
1. Is monitoring working?
Check the sampling status of all monitoring services to confirm that they are working at all times. This can be done by applying a simple severity rule to the sampling status of every service being monitored.
The sampling status then shows whether each service is actually being monitored.
You can monitor all the gateway sampling statuses in one place. This can be done using a combination of Gateway Sharing and the “Gateway-Sql” plugin.
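As a tool-independent illustration of the severity rule described above, the following minimal Python sketch flags any sampler whose status is not healthy, or whose last sample is older than a threshold. It is not Geneos configuration: the row fields (`gateway`, `sampler`, `status`, `last_sample`), the `"OK"` status value and the five-minute staleness threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed healthy value; a real monitoring tool has its own status vocabulary.
HEALTHY_STATUS = "OK"


def find_unhealthy_samplers(rows, now, max_age=timedelta(minutes=5)):
    """Return (gateway, sampler, reason) for every failing or stale sampler."""
    problems = []
    for row in rows:
        if row["status"] != HEALTHY_STATUS:
            # The sampler itself reports a problem.
            problems.append((row["gateway"], row["sampler"], "status " + row["status"]))
        elif now - row["last_sample"] > max_age:
            # The status looks fine but no sample has arrived recently.
            problems.append((row["gateway"], row["sampler"], "stale"))
    return problems


# Hypothetical consolidated view of sampling statuses from two gateways.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    {"gateway": "gw-a", "sampler": "cpu", "status": "OK",
     "last_sample": now - timedelta(seconds=20)},
    {"gateway": "gw-a", "sampler": "fix-log", "status": "ERROR",
     "last_sample": now - timedelta(seconds=20)},
    {"gateway": "gw-b", "sampler": "disk", "status": "OK",
     "last_sample": now - timedelta(minutes=30)},
]

problems = find_unhealthy_samplers(rows, now)
for gateway, sampler, reason in problems:
    print(f"{gateway}/{sampler}: {reason}")
```

Treating an old sample as a failure is a design choice in this sketch: a sampler that has silently stopped can otherwise keep showing its last good status.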
2. Are all applications and physical/virtual servers covered?
Check if all the configured application services are covered. This can be done by using the “Gateway-managedEntitiesData” plugin as shown below. Note that there may be more than one application service (Managed Entity) on a single server.
A rule can be added on the “probeStatus” field to check if a probe’s status is “Up”.
Check if all the physical/virtual servers are covered. This can be done by using the “Gateway-probeData” plugin as shown below.
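The coverage check itself is a comparison between what should be monitored and what is monitored. A minimal Python sketch of that comparison follows, assuming the probe and managed-entity data have already been exported as rows. Only the `probeStatus` field and its `"Up"` value come from the text above; the `host` and `entity` field names, and the idea of an inventory list to compare against, are assumptions for illustration.

```python
def coverage_gaps(expected_servers, probe_rows, expected_apps, entity_rows):
    """Compare an inventory against the probes and managed entities in use."""
    probe_hosts = {row["host"] for row in probe_rows}
    down_hosts = {row["host"] for row in probe_rows if row["probeStatus"] != "Up"}
    entity_names = {row["entity"] for row in entity_rows}
    return {
        # Servers in the inventory that have no probe at all.
        "servers_without_probe": sorted(set(expected_servers) - probe_hosts),
        # Servers that have a probe, but the probe is not running.
        "probes_not_up": sorted(down_hosts),
        # Applications in the inventory with no managed entity.
        "apps_without_entity": sorted(set(expected_apps) - entity_names),
    }


# Hypothetical inventory and exported monitoring data.
expected_servers = ["srv01", "srv02", "srv03"]
expected_apps = ["order-router", "risk-engine", "settlement"]
probe_rows = [
    {"host": "srv01", "probeStatus": "Up"},
    {"host": "srv02", "probeStatus": "Down"},
]
entity_rows = [
    # Two application services on the same server.
    {"entity": "order-router", "host": "srv01"},
    {"entity": "risk-engine", "host": "srv01"},
]

gaps = coverage_gaps(expected_servers, probe_rows, expected_apps, entity_rows)
for name, items in gaps.items():
    print(name, items)
```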
3. Check license validity
License validity can be checked by using the “Gateway-licenceUsage” plugin.
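One way to turn license validity into an alert is to grade the number of days left before expiry. The Python sketch below assumes that "validity" means an expiry date, and the 30-day and 7-day thresholds are illustrative values rather than recommendations from any vendor.

```python
from datetime import date


def licence_severity(expiry, today, warn_days=30, critical_days=7):
    """Grade a licence by the number of days left before it expires."""
    days_left = (expiry - today).days
    if days_left <= critical_days:
        return "CRITICAL"  # expired, or about to expire
    if days_left <= warn_days:
        return "WARNING"   # renewal should already be in progress
    return "OK"


# Hypothetical expiry date, checked on three different days.
expiry = date(2024, 3, 1)
print(licence_severity(expiry, date(2024, 1, 1)))
print(licence_severity(expiry, date(2024, 2, 10)))
print(licence_severity(expiry, date(2024, 2, 25)))
```

Raising a warning well ahead of expiry leaves time for a renewal before monitoring is affected.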
4. Visibility of monitoring health
Various troubleshooting decisions, such as restarting a process, restarting a module or failing over to backup, are taken during an incident. It is therefore important that the health status of the monitoring estate is visible to everyone who takes these decisions.
This can be done by reserving a place on the mission-critical dashboard for the underlying monitoring health (“Sampling Status”). The decision maker then knows whether they are relying on correct monitoring data or whether a break in monitoring services may itself be causing the alert.
Additionally, a date and time display that ticks every second confirms that the dashboard state is up to date and not affected by a local workstation issue.
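The ticking clock amounts to a freshness check: if the last tick received is too old, the dashboard should not be trusted. A minimal Python sketch of that logic follows; the function name and the five-second tolerance are assumptions for the example.

```python
from datetime import datetime, timedelta


def dashboard_is_live(last_tick, now, tolerance=timedelta(seconds=5)):
    """True when the most recent tick is no older than the tolerance."""
    return now - last_tick <= tolerance


now = datetime(2024, 1, 1, 9, 30, 0)
fresh = dashboard_is_live(now - timedelta(seconds=1), now)
frozen = dashboard_is_live(now - timedelta(minutes=2), now)
print(fresh, frozen)  # True False
```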
5. Monitoring reporting & audit
It is important that the monitoring team publish information to stakeholders regarding the following:
- What servers and applications are being monitored.
- Existing issues.
- Critical and warning alerts per application and/or server.
- Alerts disabled or snoozed.
- Alert recipients configured (email and mobile).
Stakeholders can then flag any gaps.
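Part of such a report can be generated from an export of current alerts. The Python sketch below counts active alerts per application and severity, and lists snoozed alerts separately so that they stay visible. The record fields (`application`, `name`, `severity`, `snoozed`) and the sample data are hypothetical.

```python
from collections import Counter


def build_alert_report(alerts):
    """Summarise active alerts per application and list snoozed ones."""
    active = Counter()
    snoozed = []
    for alert in alerts:
        if alert["snoozed"]:
            snoozed.append((alert["application"], alert["name"]))
        else:
            active[(alert["application"], alert["severity"])] += 1
    lines = [f"{app}: {count} {severity}"
             for (app, severity), count in sorted(active.items())]
    lines += [f"{app}: '{name}' is snoozed" for app, name in sorted(snoozed)]
    return lines


# Hypothetical export of current alerts.
alerts = [
    {"application": "matching-engine", "name": "latency",
     "severity": "critical", "snoozed": False},
    {"application": "matching-engine", "name": "queue depth",
     "severity": "warning", "snoozed": False},
    {"application": "matching-engine", "name": "cpu",
     "severity": "warning", "snoozed": False},
    {"application": "clearing", "name": "disk",
     "severity": "warning", "snoozed": True},
]

report = build_alert_report(alerts)
for line in report:
    print(line)
```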