The story is all too real: a CIO receives a notification from the vice president of operations about a mission-critical application outage. The VP is already reeling under pressure to get the application back up and running, which translates into pressure on the operations team to reduce the mean time to repair (MTTR). The outage could have been prevented, in hindsight. It all started with alarms going off. But as one alarm went off, so did another, and another and another, until the actual cause of the incident was buried in the sea of red.
The snowballing effect quickly turned application performance degradation into a major outage. By the time such an incident is brought under control, the company may have lost, on average, as much as $750,000 for a 90-minute outage, according to the Ponemon report “Cost of Data Center Outages,” in addition to loss of face and damage to brand value.
The post-mortem is just as messy. The current method of gathering subject matter experts in a war room to comb through multiple product consoles and logs for the cause of an incident, with all the finger-pointing and buck-passing inherent in the process, puts companies at a severe disadvantage. This article highlights how deep machine learning-based root-cause analytics and predictive analytics technologies are helping organizations prevent such incidents, reduce mean time to repair and protect brand reputation.
Too Many Cooks in the Kitchen
In today’s digital world, organizations must approach the design, resourcing and deployment of IT with a digital-first mentality. This requires teams to manage unparalleled amounts of data while predicting and preventing outages in real time and delivering agile, reliable applications. The problem is that most organizations must rely on several siloed vendor tools for the monitoring, identification, mitigation and remediation of incidents, and hope that those tools speak to each other, which traditionally hasn’t happened.
As IT continues to transform from physical to hybrid and multi-cloud environments, and to new architectures, it is becoming impossible for IT administrators to keep up with the multitude of objects, with thousands of metrics generating data in near-real time.
In order to ensure availability, reliability, performance and security of applications in today’s digital, virtualized and hybrid-cloud environments, new approaches must be employed to provide intelligence. Automated, self-learning solutions that analyze and provide insight into ever-changing applications and infrastructure topologies are essential in this transformation.
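To make “self-learning” a little more concrete, here is a minimal sketch in Python of the simplest form the idea can take: a baseline learned from a metric’s own recent history rather than a fixed threshold. It is an illustration only, not the deep machine learning approach discussed in this article; the function name `is_anomalous`, the sample latency values and the threshold of 3 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def is_anomalous(history, value, threshold=3.0):
    """Flag a value that deviates strongly from the metric's own recent history.

    history: recent observations of one metric (hypothetical sample data).
    threshold: how many standard deviations count as abnormal (an assumption).
    """
    if len(history) < 2:
        return False  # not enough data to learn a baseline yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # a perfectly flat history: any change stands out
    return abs(value - mu) / sigma > threshold

# Hypothetical response times in milliseconds for one application endpoint.
recent_latency_ms = [100, 102, 98, 101, 99]
print(is_anomalous(recent_latency_ms, 100))
print(is_anomalous(recent_latency_ms, 150))
```

Because the baseline is recomputed from whatever history is passed in, the same check adapts as the metric’s normal range drifts, which is the property that static thresholds lack.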
What Did You Mean By That?
Terms like “big data” and “machine learning” are seeing a lot of play in marketing materials these days because organizations understand that these technologies can help them tackle the complex and dynamic needs of application performance. However, what vendors say they have and what they actually mean by those phrases don’t always jibe. Let’s define our terms:
Best Practices for Preventing and Managing Incidents
An application outage is far too important a business issue to leave to chance or slick marketing. Before a company moves forward with a solution, there are a few points it should keep in mind:
Charting a New Course
Traditional IT operations tools trigger a cascade of alarms and cannot distinguish between critical, service-impacting events and false positives. This kind of “false alarm” overload can lead to alarm fatigue, which can in turn lead to outages or breaches. Next-generation data centers require next-generation solutions that deliver answers in real time, along with actionable recommendations on how to fix the problem, rather than a “sea of red” alarms.
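As a rough illustration of how a “sea of red” can be narrowed down, the Python sketch below suppresses every alarm whose component depends, directly or indirectly, on another component that is also alarming, leaving only the probable root causes. It is a deliberately simple topology-based stand-in, not the machine learning-based root-cause analytics described in this article; the dependency map `DEPENDS_ON`, the component names and both function names are hypothetical.

```python
# Hypothetical topology: each component lists the components it depends on.
DEPENDS_ON = {
    "web-frontend": ["app-server"],
    "app-server": ["database", "cache"],
    "database": ["storage-array"],
    "cache": [],
    "storage-array": [],
}

def transitive_dependencies(component, depends_on):
    """Collect everything a component depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen

def probable_root_causes(alarming, depends_on):
    """Keep only alarming components with no alarming dependency beneath them."""
    alarming = set(alarming)
    return sorted(
        component
        for component in alarming
        if not (transitive_dependencies(component, depends_on) & alarming)
    )

# Four alarms fire at once, but only one component has nothing failing beneath it.
alarms = ["web-frontend", "app-server", "database", "storage-array"]
print(probable_root_causes(alarms, DEPENDS_ON))
```

In this made-up scenario the operator would see one alarm to act on instead of four. Real environments change their topology constantly, which is why the article argues for solutions that learn these relationships automatically rather than relying on a hand-maintained map like the one above.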
About the Author:
Dr. Akhil Sahai is an accomplished management and technology leader with 20+ years of experience at large enterprises and startups. Akhil comes to Perspica from HP Enterprise where, as Sr. Director of Product Management, he envisaged, planned and managed the Enterprise-wide Solutions Program. At Dell, Akhil was head of Product Strategy and Management of Dell’s Converged Infrastructure product line. He also led Gale Technologies as VP of Products to its successful acquisition by Dell, where he oversaw vCloud product and strategy with focus on applications and virtual appliances product line at VMware. He has authored and edited multiple books and holds 16 Technology Patents.