Using monitoring to strengthen operational resilience
With the right performance monitoring practices, you can boost risk management strategies to ensure regulatory compliance and create a better end-user experience.
As the saying goes, the only certainty in life is uncertainty. Recent unexpected events, from the credit crisis to the Covid-19 pandemic, have demonstrated this clearly. The financial services industry has taken a beating in these events (and more), prompting regulators to double down on requiring firms to prove their operational resilience and risk management practices.
Operational resilience for the finance sector focuses heavily on user experience, considering factors such as downtime and business continuity - but also risk. But what exactly does that mean?
At its core, operational resilience is about keeping your business in business by understanding your critical services across the enterprise and adapting to any disruptions in real time. But it goes beyond recovery to sustainability of the business during disruptions.
Four stages of operational resilience
The goal in operational resilience is to identify potential problems before they happen and devise a plan to either mitigate the effects or to allow the organization to quickly recover. There are four stages to this:
1. Anticipating problems before they occur
Businesses that determine which events are most likely to happen, and most likely to block the organization's ability to do business, are in a better position to recover quickly.
2. Producing preventive strategies
Once a potential risk is identified, a plan for handling that risk can be established. The level of planning required varies, from a simple event, such as an IT systems failure, all the way to large-scale security risks that call for more in-depth planning.
3. Responding and recovering
When an event happens, how long does it take your organization to identify and enact the proper strategy for mitigation and recovery?
4. Post-incident strategy
After an event has occurred and been successfully contained, it’s important to thoroughly examine what worked according to plan - and what might need to be tweaked in the future.
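For the response and post-incident stages above, one simple way to answer "how long does it take?" is to record three timestamps per incident and compute time to detect and time to recover. The sketch below is a minimal illustration; the `Incident` class, its field names, and the sample incidents are hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class Incident:
    started: datetime    # when the disruption actually began
    detected: datetime   # when monitoring (or a person) noticed it
    resolved: datetime   # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_recover(self) -> timedelta:
        return self.resolved - self.started


def mean_minutes(durations: Iterable[timedelta]) -> float:
    """Average a collection of durations and express the result in minutes."""
    items = list(durations)
    return sum(d.total_seconds() for d in items) / len(items) / 60


# Made-up incidents, for illustration only.
INCIDENTS = [
    Incident(datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 9, 4), datetime(2023, 1, 5, 9, 40)),
    Incident(datetime(2023, 2, 9, 14, 0), datetime(2023, 2, 9, 14, 10), datetime(2023, 2, 9, 15, 0)),
]

if __name__ == "__main__":
    print("mean time to detect (min):", mean_minutes(i.time_to_detect for i in INCIDENTS))
    print("mean time to recover (min):", mean_minutes(i.time_to_recover for i in INCIDENTS))
```

Tracking these two figures from one review to the next gives the post-incident stage a concrete measure of whether the plan is improving.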
Bar set higher for financial services
While most companies require protection against things like reduced access to capital or equity, the financial sector is under even greater pressure. It also has to protect against decreases in net interest income and against credit losses, a risk magnified by the sheer number of daily transactions. In 2021, nearly 18 billion transactions were processed in the US alone.
The finance sector is also under greater scrutiny by regulators seeking compliance, as well as higher levels of application performance and customer protections. This scrutiny increased after the 2007-2009 credit crisis and recession, as well as several recent high-profile security breaches affecting tens of millions of customers. These factors call for the type of concurrent monitoring that offers a big-picture assessment of problems from multiple locations at the same time for faster problem resolution.
Another must-have is high reliability. Simultaneous checks from multiple locations give you a comprehensive view of the uptime, performance, and function for your websites, APIs, and servers. The result is you get more data and faster alerting, which translates to better user experiences.
For example, Neogrid, a SaaS provider that offers automated SCM solutions, found this out first-hand when an optical fibre link break cut off communications out of Brazil. The previous monitoring company that it used (before switching to Uptrends) didn’t have any checkpoints in Brazil and was not aware of the problem.
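The idea of simultaneous checks from multiple locations can be sketched in a few lines. This is a minimal illustration, not how any vendor implements it: a real setup runs probes on machines in each region, whereas here the checkpoint names and `fake_probe` are hypothetical stand-ins so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    checkpoint: str     # hypothetical checkpoint label, e.g. "sao-paulo"
    ok: bool            # did the check succeed from this location?
    latency_ms: float


def run_concurrent_checks(
    checkpoints: List[str], probe: Callable[[str], CheckResult]
) -> List[CheckResult]:
    """Run the probe for every checkpoint at the same time, one thread each."""
    with ThreadPoolExecutor(max_workers=max(1, len(checkpoints))) as pool:
        return list(pool.map(probe, checkpoints))


def classify(results: List[CheckResult]) -> str:
    """Separate a problem seen everywhere from one seen only at some checkpoints."""
    failed = sorted(r.checkpoint for r in results if not r.ok)
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "global outage"
    return "regional outage: " + ", ".join(failed)


def fake_probe(checkpoint: str) -> CheckResult:
    # Stand-in for a real probe: pretends only one checkpoint cannot reach the service.
    return CheckResult(checkpoint, ok=(checkpoint != "sao-paulo"), latency_ms=120.0)


if __name__ == "__main__":
    results = run_concurrent_checks(["amsterdam", "new-york", "sao-paulo"], fake_probe)
    print(classify(results))
```

With only one checkpoint, a local failure is indistinguishable from a total outage, or is missed entirely if that checkpoint sits outside the affected region.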
Tools for better monitoring
Simulating business-critical customer journeys: It’s a given that the user journey in business-critical transactions comprises many steps, each equally susceptible to occasional failures. From logins and balance checks to deposits and money transfers, there are numerous opportunities along the clickpath for black holes to appear and halt the journey toward customer satisfaction.
Enter transaction monitoring. In order to monitor these critical workflows, it’s important to put them in a script that can be run over and over again to check if everything still works as expected. This is where you can derive important data regarding service levels, system availability and more.
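A scripted journey of this kind can be sketched as an ordered list of named steps that is replayed on a schedule. The example below is a minimal sketch under stated assumptions: the step functions are hypothetical placeholders, where a real script would drive a browser or call the service's APIs.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A step is a name plus an action that raises an exception when it fails.
Step = Tuple[str, Callable[[], None]]


@dataclass
class JourneyReport:
    passed: bool
    failed_step: Optional[str]     # name of the first failing step, if any
    timings_ms: Dict[str, float]   # duration of every step that was attempted


def run_journey(steps: List[Step]) -> JourneyReport:
    """Run the steps in order, timing each one and stopping at the first failure."""
    timings: Dict[str, float] = {}
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
        except Exception:
            timings[name] = (time.perf_counter() - start) * 1000
            return JourneyReport(False, name, timings)
        timings[name] = (time.perf_counter() - start) * 1000
    return JourneyReport(True, None, timings)


# Hypothetical placeholder steps, for illustration only.
def login() -> None:
    pass


def check_balance() -> None:
    pass


def transfer_money() -> None:
    raise RuntimeError("transfer endpoint timed out")


JOURNEY: List[Step] = [
    ("login", login),
    ("balance", check_balance),
    ("transfer", transfer_money),
]

if __name__ == "__main__":
    report = run_journey(JOURNEY)
    print("passed:", report.passed, "| failed step:", report.failed_step)
```

Recording which step failed, and how long each step took, is what turns a simple pass/fail check into data about service levels and availability.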
Real user monitoring: Operational resilience doesn’t just factor in performance data, anomalies, and planning for core processes (website uptime, API availability, alert and notification routing, etc.). Without data from users in real time, you cannot develop the models of behaviour that allow you to author procedures for incident mitigation and recovery.
Solutions like real user monitoring (RUM) capture your actual users’ experience, collecting and quantifying website performance and user data directly from your site’s visitors, in the actual locations from which they access your services.
Analysing this data by location yields metrics that can be vital in formulating mitigation and recovery processes:
• Know website speed per country.
• See exactly where local load times can be improved.
• Collect rich data, including DOM duration, render duration, time to first byte and page ready time.
• Track browsers and operating systems used to access your website and how fast your website loads for each of them.
• Monitor actual mobile experience; inspect load times from visitors accessing your websites from mobile devices.
• Spot trends in your charts and quickly see your load times during peak business hours.
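Several of the breakdowns listed above come down to grouping RUM measurements by one dimension (country, browser, device) and summarising a metric. The sketch below assumes each page view arrives as a small record; the field names and sample values are hypothetical and made up for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List

# Made-up page-view records; field names are assumptions, not a real RUM schema.
SAMPLE_BEACONS: List[dict] = [
    {"country": "NL", "browser": "Firefox", "ttfb_ms": 90, "page_ready_ms": 800},
    {"country": "NL", "browser": "Chrome", "ttfb_ms": 110, "page_ready_ms": 900},
    {"country": "NL", "browser": "Chrome", "ttfb_ms": 100, "page_ready_ms": 1000},
    {"country": "BR", "browser": "Chrome", "ttfb_ms": 400, "page_ready_ms": 2100},
    {"country": "BR", "browser": "Safari", "ttfb_ms": 500, "page_ready_ms": 2700},
]


def median_by(beacons: List[dict], dimension: str, metric: str) -> Dict[str, float]:
    """Group records by one dimension and return the median of a metric per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for beacon in beacons:
        groups[beacon[dimension]].append(beacon[metric])
    return {key: median(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(median_by(SAMPLE_BEACONS, "country", "page_ready_ms"))
    print(median_by(SAMPLE_BEACONS, "browser", "ttfb_ms"))
```

The median is used here rather than the mean so that a handful of very slow page views does not dominate the figure for a country or browser; a percentile such as the 95th would be a reasonable alternative.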
Top to bottom, full-stack monitoring
Active monitoring tools can take action to correct incidents before they become problems, but automated monitoring solutions can also offer financial firms a full picture of any given IT estate — from legacy to cloud-based systems. Only ITRS Opsview gives you access to information that allows you to mitigate operational, reputational, and financial risk, shorten issue detection and resolution time, and comply with operational resilience regulations.
Monitor operating systems, networks, cloud, VMs, containers, databases, applications, and more. With over 200 ITRS Opsview-supported Opspacks and 4,500+ plug-ins via Nagios Exchange, you have complete coverage where you need it most. Moreover, it’s flexible, scalable, and user-friendly.
You don’t have to be a financial powerhouse to take full advantage of a rock-solid monitoring heritage used by the world’s smartest operations and DevOps teams.
You can read the original article here.