infoTECH Feature

March 31, 2016

The System That Cried Wolf

The story is all too real: a CIO receives notification from the vice president of operations that a mission-critical application is down. The VP is already reeling under pressure to get the application back up and running, which translates into pressure on the operations team to reduce mean time to repair (MTTR). In hindsight, the outage could have been prevented. It started with a single alarm. But as one alarm went off, so did another, and another, and another, until the actual cause of the incident was buried in a sea of red.

The snowballing effect quickly escalates from application performance degradation to a major outage. By the time the incident is brought under control, the company may have lost, on average, as much as $750,000 for a 90-minute outage, according to the Ponemon report “Cost of Data Center Outages,” in addition to the loss of face and damage to brand value.

The post-mortem is just as messy. The current approach, in which subject matter experts gather in a war room and comb through multiple product consoles and logs to identify the cause of an incident, with all the finger-pointing and buck-passing inherent in the process, puts companies at a severe disadvantage. This article highlights how machine learning-based root-cause analytics and predictive analytics are helping organizations prevent such incidents, reduce mean time to repair, and protect brand reputation.

Too Many Cooks in the Kitchen

In today’s digital world, organizations must approach the design, resourcing, and deployment of IT with a digital-first mentality. This requires teams to manage unparalleled amounts of data while predicting and preventing outages in real time and maintaining and delivering agile, reliable applications. The problem is that most organizations must tap several different siloed vendor tools to assist in the monitoring, identification, mitigation, and remediation of incidents – and hope that the tools speak to each other, which traditionally hasn’t happened.

As IT continues to transform from physical to hybrid and multi-cloud environments and to new architectures, it is becoming impossible for IT administrators to keep up with the multitude of objects and the thousands of metrics they generate in near-real time.

To ensure the availability, reliability, performance, and security of applications in today’s digital, virtualized, and hybrid-cloud environments, new approaches that provide intelligence must be employed. Automated, self-learning solutions that analyze and provide insight into ever-changing application and infrastructure topologies are essential to this transformation.

What Did You Mean By That?

Terms like “big data” and “machine learning” are seeing a lot of play in marketing materials these days because organizations understand that these capabilities can help them tackle the complex and dynamic needs of application performance. However, what vendors say they have and what they actually mean by those phrases don’t always jibe. Let’s define our terms:

  • Big Data Architecture: The ability to handle masses of structured and unstructured data in an automated, highly scalable way using open source technologies.
  • Machine Learning: This technology means much more than visualizing data. Vendors are throwing the term “machine learning” around, but it’s often a mischaracterization. Machine learning refers to self-learning algorithms, supervised or unsupervised, that can be based on neural networks, statistics, or digital signal processing, among other techniques (see the sketch after this list).
  • Domain Knowledge: What just happened? What caused it? How do we remediate it? How do we not have it happen again? The domain knowledge in TechOps and DevOps helps answer these questions.
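
To make the machine learning definition concrete, here is a minimal, illustrative sketch (in Python) of unsupervised anomaly detection on a single metric stream: it learns a rolling statistical baseline and flags values that deviate sharply from it. This is a toy under simple assumptions – a rolling z-score on one metric, with invented names, window sizes, and thresholds – not how any particular product works; commercial platforms use far richer models.

    from collections import deque
    from statistics import mean, stdev

    class RollingAnomalyDetector:
        """Toy unsupervised detector: flags samples that deviate sharply
        from a baseline learned over a rolling window of recent history."""

        def __init__(self, window=60, z_threshold=3.0):
            self.samples = deque(maxlen=window)  # recent history is the learned baseline
            self.z_threshold = z_threshold       # std deviations that count as anomalous

        def observe(self, value):
            """Return True if value is anomalous relative to the current window."""
            anomalous = False
            if len(self.samples) >= 30:          # wait until the baseline is stable
                mu, sigma = mean(self.samples), stdev(self.samples)
                if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                    anomalous = True
            self.samples.append(value)
            return anomalous

    # Steady latency around 100 ms, then a sudden spike at the end.
    detector = RollingAnomalyDetector()
    stream = [100 + (i % 5) for i in range(60)] + [400]
    for t, latency_ms in enumerate(stream):
        if detector.observe(latency_ms):
            print(f"t={t}: anomalous latency {latency_ms} ms")  # fires only on the spike

The point is that nothing here is a hand-set threshold on the metric itself; the baseline is learned from the data, which is the minimum bar a vendor’s “machine learning” claim should clear.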

Best Practices for Preventing and Managing Incidents

Application outage is far too important a business issue to leave to chance or slick marketing. Before a company moves forward with a solution, there are a few points to keep in mind:

  • Big and Flexible: The solution needs to scale to handle millions of objects; legacy solutions are not adequate for today’s big data.
  • Automated: It’s easy to forget, but it isn’t the VPs of operations or the CIOs of the world who handle the day-to-day tasks – it’s admins and IT support staff. The solution needs to be automated in a way that quickly pinpoints the root cause of a problem and identifies how to fix it, rather than relying on expensive domain experts.
  • Working Smart vs. Working Hard: Siloed tools and a huge operations team are no silver bullet. IT workers become fatigued and apathetic – so when the system is actually disrupted, no one is paying attention. Instead of just alarms, identify solutions that provide answers and help determine exactly what needs immediate attention.
  • Preventing Outages: Traditional monitoring tools trigger alerts only after rules and preset thresholds have been violated – that is, after a problem has already occurred. The key to preventing outages is to predict issues before they become problems. Look for a solution that can alert you to anomalous trends or potentially dangerous issues before they impact your application (a simple approach is sketched after this list).
  • Remediation: When incidents do happen, remediating by pooling the tribal knowledge of in-house subject matter experts is difficult and time-consuming. Having access to vendor knowledge bases, discussion forums, and the latest state-of-the-art technologies is important. Look for solutions that can curate tribal knowledge for repeatability but also integrate crowdsourced knowledge into the mix.
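
As promised under “Preventing Outages” above, here is a minimal, hypothetical sketch of trend-based prediction: fit a straight line to recent samples of a metric and estimate how long until it crosses its limit, so the alert can fire hours before the outage instead of after it. The metric (disk usage) and the 90 percent limit are invented for the example.

    def hours_until_breach(samples, limit, interval_hours=1.0):
        """Least-squares slope over samples; projected hours until limit,
        or None if the trend is flat or improving."""
        n = len(samples)
        x_mean = (n - 1) / 2
        y_mean = sum(samples) / n
        denom = sum((x - x_mean) ** 2 for x in range(n))
        slope = sum((x - x_mean) * (y - y_mean)
                    for x, y in enumerate(samples)) / denom
        if slope <= 0:
            return None                          # nothing to predict
        steps = (limit - samples[-1]) / slope    # samples until projected crossing
        return max(steps, 0.0) * interval_hours

    # Disk usage (%) climbing about half a point per hour toward a 90% limit.
    usage = [70 + 0.5 * h for h in range(24)]
    eta = hours_until_breach(usage, limit=90)
    print(f"Projected to hit 90% in about {eta:.0f} hours")  # warns well in advance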

Charting a New Course

Traditional IT operations tools trigger a cascade of alarms and cannot distinguish critical, service-impacting events from false positives. This “false alarm” overload leads to alarm fatigue, which in turn can lead to outages or breaches. Next-generation data centers require next-generation solutions that deliver the answers teams need in real time, along with actionable recommendations on how to fix the problem – rather than a “sea of red” of alarms.
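
By way of illustration, here is one deliberately simplistic way to thin that sea of red: group alarms that fire close together in time and surface the earliest as the candidate root cause. Real root-cause analytics also weigh topology and learned dependencies between objects; the alarms and the 60-second window below are invented for the example.

    # Alarms as (seconds since start, message), already sorted by time.
    alarms = [
        (0,   "db-node-3: disk latency high"),
        (2,   "app-tier: query timeouts"),
        (3,   "web-tier: 5xx rate climbing"),
        (300, "backup-job: finished late"),
    ]

    WINDOW = 60  # alarms within 60 s of the previous one join the same incident

    incidents, current = [], [alarms[0]]
    for prev, cur in zip(alarms, alarms[1:]):
        if cur[0] - prev[0] <= WINDOW:
            current.append(cur)        # part of the same cascade
        else:
            incidents.append(current)  # time gap: close out the incident
            current = [cur]
    incidents.append(current)

    for group in incidents:
        _, first_msg = group[0]
        print(f"{len(group)} alarms collapsed into one incident; "
              f"candidate root cause: {first_msg}")

Instead of four separate pages, the operator sees two incidents, and the cascade of three is led by the database alarm that actually started it.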

About the Author:

Dr. Akhil Sahai is an accomplished management and technology leader with 20+ years of experience at large enterprises and startups. Akhil comes to Perspica from HP Enterprise, where, as Sr. Director of Product Management, he envisaged, planned, and managed the Enterprise-wide Solutions Program. At Dell, Akhil was head of Product Strategy and Management for Dell’s Converged Infrastructure product line. He also led Gale Technologies as VP of Products to its successful acquisition by Dell. At VMware, he oversaw vCloud product and strategy with a focus on the applications and virtual appliances product line. He has authored and edited multiple books and holds 16 technology patents.

Edited by Stefania Viscusi