infoTECH Feature

September 15, 2016

Hidden Lessons of Incident Management

By Special Guest
Jason Hand, DevOps Evangelist at VictorOps

One of the most common early goals of implementing DevOps best practices is a deep understanding of your systems in a stable state. However, this objective is not a “one and done” effort. It is important to continuously circle back in some form as changes are introduced. It’s an ongoing exercise for an entire organization as processes, tools and teams improve continuously over time.

In many cases during these beginning stages of DevOps transformations, agreeing on a starting point is where much of the time is spent. An unfortunate consequence of this is that without confidence in understanding where to start, oftentimes we never start at all. “Analysis paralysis” is a very real thing, especially for big organizational changes, and those who are typically risk-averse unfortunately fall victim to this far too easily.

A good starting point in any organization’s efforts to dip their toe into the DevOps pool is with on-call scheduling, incident management and monitoring improvements. Understanding your organization’s existing methods of identifying and responding to abnormalities is one of the easiest and most stimulating first steps.

The immediate benefits of modern DevOps on-call practices are easy to identify and agree on:

  • Anomalies are detected in real-time.
  • The correct operators and engineers are alerted to actionable issues as quickly as possible.
  • Critical context on what’s taking place gives responders exactly what they need in that moment, shaving time and cognitive load.
  • A collaborative space to discuss context, diagnosis and efforts towards repair means reduced “time to repair” and increased situational awareness across teams and the organization.
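
As a rough illustration of the second and third points, the sketch below routes an actionable alert to whoever is on call for the affected service and attaches context to the page. It is a minimal sketch, not a description of any particular product: the `Alert` and `Shift` types, the severity labels, and the `route` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Alert:
    """A monitoring alert (hypothetical shape, for illustration only)."""
    service: str
    severity: str  # assumed labels: "critical", "warning", "info"
    summary: str
    context: dict = field(default_factory=dict)  # e.g. runbook link, recent deploy


@dataclass
class Shift:
    """One on-call shift for one service (hypothetical shape)."""
    engineer: str
    service: str
    start: datetime
    end: datetime


def is_actionable(alert: Alert) -> bool:
    # Assumption: only "critical" and "warning" alerts should page a human.
    return alert.severity in {"critical", "warning"}


def find_on_call(schedule: list, service: str, now: datetime) -> Optional[str]:
    # Return the engineer whose shift covers `now` for this service, if any.
    for shift in schedule:
        if shift.service == service and shift.start <= now < shift.end:
            return shift.engineer
    return None


def route(alert: Alert, schedule: list, now: datetime) -> Optional[dict]:
    """Build a page for the on-call engineer, or None if nobody should be paged."""
    if not is_actionable(alert):
        return None
    engineer = find_on_call(schedule, alert.service, now)
    if engineer is None:
        return None
    # The page carries the context so the responder does not have to hunt for it.
    return {"to": engineer, "summary": alert.summary, "context": alert.context}


# Example usage with made-up data.
example_now = datetime(2016, 9, 15, 12, 0, tzinfo=timezone.utc)
example_schedule = [
    Shift("alice", "checkout",
          example_now - timedelta(hours=4), example_now + timedelta(hours=4)),
]
example_page = route(
    Alert("checkout", "critical", "Error rate above threshold",
          {"last_deploy": "hypothetical build id"}),
    example_schedule,
    example_now,
)
```

Returning `None` for informational alerts, or for services with no one on shift, keeps the sketch short. A real setup would more likely escalate an uncovered alert than drop it.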

However, what about the concerns and opportunities that aren’t obvious or immediate? What else is at stake? Can more gains be made simply by improving the way companies monitor and manage on-call and incident management processes?

Opportunity to learn

If there is a large gap between identifying a problem and solving it, learning becomes more difficult. The ability to identify contributing factors becomes increasingly problematic the longer time passes. The trail to identifying everything involved with a disruption in service begins to go cold as operators, engineers, and the systems themselves move on to new tasks. Because of this, it becomes very difficult to learn and any opportunity for improvement is missed.

Snowball effect

What may seem like a small or non-critical problem can quickly become a large one if left alone. As time ticks away, seemingly insignificant issues accumulate and grow into large, complex problems that have dangerous long-term impacts and are much more difficult to diagnose and repair.

Stay on track

Many teams follow Agile development principles and operate in short development cycles. Shortened sprints are designed and planned to prevent disruptions and context switching, which can be very detrimental to efficiency. However, sprint planning is meant to establish targets and goals, with the caveat that the team can quickly change course if the need arises. By responding to disruptions quickly, we have the greatest chance of achieving those goals.

Leveraging monitoring, alerting, and smart incident management software means having a pulse on your systems. That feedback loop helps everyone stay on track for the greater good of the services you are engineering, even if that means having to change course quickly and often. That is – after all – what Agile and DevOps are designed to provide you with.


The quality of your service is extremely important not only to end users but to the business as a whole. The service you provide IS the brand of the company, and failing to make quality of service a top priority can have severe consequences. System resiliency and reliability as a means of “high availability” are paramount in establishing credibility. Consumers of your product have very little tolerance for regular or lengthy outages. Being consistent is one of the most important things to focus on for any organization. Customers are paying attention to that consistency. Are you?

Downstream consequences

The approach you and your organization use to take on incident management is a key indicator of how much you value continuous improvement. If your team or company culture does not place a high value on learning and striving for improvements in processes, tools, and individuals in a continuous manner, then any efforts to roll out DevOps will fail. This is why the ‘culture of DevOps’ comes up so frequently and why it frustrates many who strongly hold on to ‘old-view’ methods of managing development and operations.

Continuous improvement is at the heart of it all. Empathizing with end users and those involved in engineering and maintaining systems means that nothing is ever “done” or “good enough.” Everything must continuously improve. Establishing a deep understanding of your systems provides insight into where to focus improvement efforts.

Failing to place understanding and learning as the highest priority means imminent failure of the organization and the products or services it provides.

About the Author

Jason Hand is a DevOps Evangelist at VictorOps, co-organizer of DevOpsDays – Rockies, and author of “ChatOps for Dummies” and “ChatOps: Managing Infrastructure in Group Chat.” Jason is also the host of a number of DevOps-related meetups in the Denver/Boulder/San Francisco areas. Jason has spent the last 15 months presenting and giving workshops on a number of DevOps topics, such as blameless post-mortems, ChatOps, alerting and the value of context within incident management.

Edited by Alicia Young
