How to bring Capacity Planning practices into the Big Data age
The Facts:
Did you know the typical data centre server only operates at 12 to 18 percent of its capacity?[1] This is staggeringly low, yet the technical staff managing and maintaining large data centres are often more concerned with keeping the lights on than with efficiency. Throwing more servers or processors at a problem is, after all, easier than the more complex task of optimising workloads across an IT estate.
Ironically though, while most data centres run at woefully low efficiencies, outages related to capacity are still commonplace. For example, Black Friday and Cyber Monday are the perfect test for retailers. As shoppers flock online, retailers’ websites need to be armed and ready to cope with the deluge of web visitors and increased rate of purchases.
In 2016, major retailer sites, including QVC, Macy’s, and Walmart, all experienced outages on Black Friday due in part to the impact of increased traffic on their servers. It is also common for e-commerce stores to see a significant increase in page load times at these peaks. While not fatal, this is hardly ideal at a time when buyers are rushing to purchase items.
Managing capacity is therefore a fine balance. IT executives must ensure that data centre strategy keeps a close eye on hardware and plans for the sudden peaks that can bring systems to a halt.
If it ain’t broke, anticipate it
Capacity planning is still a surprisingly arcane process for most enterprises. Modelling data centre capacity often involves throwing metrics into an Excel spreadsheet, which is labour-intensive and prone to errors and inaccuracies.
Capacity planning in the modern age requires absorbing capacity-related metrics from monitoring tools and using complex modelling techniques. Only then can users understand current resource utilisation and plan any necessary improvements such as a migration to the cloud.
The old adage of ‘if it ain’t broke, don’t fix it’ doesn’t apply to the world of IT operations. Anticipating problems before they happen is a far more cost-effective option than firefighting after a capacity incident.
Avoid Garbage In, Garbage Out
However, the success of any modelling technique will in part come down to the quality of the data inputs. A heavily simplified view of resource demand will result in an inaccurate forecast or recommendation. The key is to use a variety of metrics to predict demand, combining resource metrics (CPU, memory and storage) with business-level transaction metrics. The ultimate goal is to see how what’s going on in the business drives resource consumption in the data centre, and how shifts and spikes in business activity or strategy may affect it.
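To make this concrete, here is a minimal, illustrative Python sketch (the sample figures are invented, not benchmark data) of how business-level transaction volumes can be tied to resource consumption, so that a forecast of business activity translates into a forecast of CPU demand:

```python
import numpy as np

# Hypothetical hourly samples: business transactions per hour and the
# CPU utilisation (%) observed on the same host over the same window.
transactions = np.array([1200, 1800, 2500, 3100, 4000, 5200, 6100])
cpu_percent = np.array([22, 29, 38, 45, 55, 68, 79])

# Fit a simple linear model: cpu ~ slope * transactions + intercept.
slope, intercept = np.polyfit(transactions, cpu_percent, 1)

# Translate a business forecast (say, a Black Friday peak of 9,000
# transactions per hour) into an expected CPU utilisation.
peak_transactions = 9000
predicted_cpu = slope * peak_transactions + intercept
print(f"Predicted CPU at {peak_transactions} tx/hour: {predicted_cpu:.0f}%")
```

In this toy fit the prediction comes out above 100 per cent utilisation, which is exactly the kind of early warning a business-driven capacity model is meant to surface before the peak arrives.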
Using Sophisticated Tools in a post-Excel Age
Once the right data collection procedures are in place, it is time to select appropriate analytical techniques to make sense of the data. An accurate forecasting algorithm should account for numerous factors, including cyclicality or seasonality and hardware or software changes. The models should build a risk score for each virtual machine or host, predicting the probability of upcoming capacity incidents. There also needs to be a way to present the results of the predictive models: reporting and visualisation functionality should allow technicians to easily slice and dice the capacity data across any dimension and publish reports for different audiences.
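As an illustration only, and not a description of any particular vendor’s algorithm, the sketch below shows one way such a risk score could be computed in Python, assuming daily utilisation samples, a weekly seasonal pattern and roughly normal residuals:

```python
import numpy as np
from math import erf, sqrt

def capacity_risk_score(daily_util, threshold=85.0, horizon_days=7):
    """Toy risk score: probability that forecast utilisation breaches
    `threshold` within the horizon. Expects at least a few weeks of
    daily history; assumes weekly seasonality and roughly normal
    residuals (illustrative assumptions only)."""
    daily_util = np.asarray(daily_util, dtype=float)
    days = np.arange(len(daily_util))

    # Trend component: a simple linear fit over the history.
    slope, intercept = np.polyfit(days, daily_util, 1)
    residual = daily_util - (slope * days + intercept)

    # Seasonal component: average residual for each day of the week.
    weekly = np.array([residual[d::7].mean() for d in range(7)])
    noise_sd = (residual - weekly[days % 7]).std()

    # Forecast the horizon and score the worst day against the threshold.
    future = np.arange(len(daily_util), len(daily_util) + horizon_days)
    forecast = slope * future + intercept + weekly[future % 7]
    worst = float(forecast.max())

    # P(utilisation > threshold) under the normal-noise assumption.
    return 1 - 0.5 * (1 + erf((threshold - worst) / (noise_sd * sqrt(2))))
```

Hosts or virtual machines could then be ranked by this score, with the reporting layer surfacing the riskiest ones to each audience in whatever cut of the data they need.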
Making the Case for Capacity Planning
Although capacity outages are relatively common, costly and damaging to longer-term reputation, IT teams may find it difficult to get senior executives on board.
The best solution is to transparently present the numbers on how much money the enterprise could save by using a capacity planning tool. Using industry data or the company’s internal data, the IT team can work out the average monetary loss attributable to an outage and multiply it by the number of outages expected in a year.
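As a back-of-the-envelope illustration (the figures below are invented placeholders, not industry benchmarks), the arithmetic might look like this:

```python
# Purely illustrative numbers; substitute the organisation's own figures.
avg_loss_per_outage = 250_000       # lost revenue plus remediation per incident
expected_outages_per_year = 4       # from internal incident history or industry data
tool_cost_per_year = 100_000        # licensing plus the staff time to run the tool

expected_annual_loss = avg_loss_per_outage * expected_outages_per_year
net_saving = expected_annual_loss - tool_cost_per_year
print(f"Expected annual outage cost: {expected_annual_loss:,}")
print(f"Net saving from capacity planning: {net_saving:,}")
```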
After simple calculations, the case for capacity planning soon becomes clear: enterprises can save hundreds of thousands, even millions in some cases, by preventing outages and decommissioning underutilised hardware.
[1] According to a survey conducted by Anthesis Consulting Group
By Jay Patani, Tech evangelist, ITRS Group.