Incidents and Accidents: Examining Failure Without Blame

Posted on Thursday, Jan 15, 2015
Dave Zwieback, VP of Engineering at Next Big Sound, and Mike Rembetsy, VP of Technical Operations at Etsy, discuss learning from the unexpected and examining failure without blame. With practical tips about technical tools and philosophical insights into the human factors and cognitive biases in play, these industry experts offer useful guidance for the thorny questions around the topic of failure.

Show Notes

Dave is at Next Big Sound, which does analytics for creative industries. He has seen a few organizations handle failure well and a lot handle it poorly. He got interested in blameless postmortems and human factors through discussions with John Allspaw of Etsy, who influenced him to read the work of David Woods and Sidney Dekker on human factors. He is writing a book for O’Reilly called Being Blameless.

MCR works at Etsy now, but has spent a lot of time consulting at various firms where he’s seen failure handled with blame. He points to what retired Lieutenant Colonel Scott Snook said in Friendly Fire, a book about the accidental shootdown of two US helicopters: failure is part of complex systems.

MCR: “I work at Etsy, and that’s what we do - we examine failure as a learning opportunity.”

Dave is running his next workshop on Awesome Postmortems in NYC on February 12th.

Dave: “Sidney Dekker’s Field Guide to Understanding Human Error is probably the most important book for people like us, meaning people that are in the IT world - it’s very accessible and gives lots of examples from fields outside of IT, but they’re very relevant to what we do.”

MCR: “Failure is gonna happen. It’s not a matter of if something is going to fail, it’s a matter of when it is going to fail.”

MCR mentions the different categories of failures: those that “fail closed”, which are easy to detect, like a disk filling up, and those that “fail open”, the surprises. He mentions some of the tools Etsy uses to resolve an immediate issue, such as an IRC war room and Vidyo video chat. After the immediate issue is solved, the learning begins.

MCR: “We celebrate failure as much as we celebrate success here. […] The three-armed sweater is given to the person who most spectacularly impacted the website in the year.”

On the topic of why to do a blameless postmortem, MCR points out that it’s for learning, and there are both technical and human factors. Dave points out that blaming a person short-circuits the learning. Claiming that a person is the cause of the outage feels like a good story, but it’s not true.

Dave discusses root cause and mentions Allspaw’s excellent blog, specifically a post arguing that there is no such thing as a root cause. Dave disagrees: he believes that outages are caused by change, and the systems with which we work are fundamentally changeable. “The impermanence of systems is the reason that they both function and malfunction.” Mike counters by asking, “Is there really a root cause for something that failed? If a hard drive dies, it’s the same hard drive. It hasn’t changed.” They both agree that it’s a philosophical rabbit hole.

MCR notes that as Etsy grows, they’ve found that user-impacting, service-degrading issues are the ones that call for postmortems, and even when an issue is not user-impacting, it’s worth doing one if they can learn from the failure. Dave says, “The more we learn about the complex systems within which we work, the better we’re able to operate them.”

According to Dave, common practice is to hold the postmortem within a week or two. MCR mentions that it’s important to write down the timeline almost immediately, definitely within a day or two, but doing the postmortem while someone’s amygdala is still triggered (and they are upset) is too soon. Dave points out that the facilitator of a postmortem sets the tone, including reminding people of hindsight bias, and at Next Big Sound they use a specific framework document which Dave will share. He also mentions defusing stress with empathy and humor.

On the topic of evaluating anything you do, MCR mentions that Etsy created Morgue because any department across Etsy can apply these techniques to learn. Dave points out they do retrospectives as well as prospective reviews at Next Big Sound. MCR says Etsy does both an architectural review and an operability review ahead of time. Dave mentions that answers in prospective reviews can be biased in a positive way, whereas in a “premortem” we imagine things going badly and try to determine what could lead to that: in essence, harnessing hindsight bias to work for us.

Bridget forgets what decade it is and claims to have seen a presentation at devopsdays 2003. That would have been a nifty trick, since the first one was in 2009. :)

Check Outs

Dave:
Mike:
Bridget:
Trevor:
  • I was on vacation and delightfully disconnected. It’s been pretty awesome. Got a new Kindle and have been reading Game of Thrones before Matt accidentally (though at this point it’s my fault) spoils something.
  • Set up kegbot at our new office, will be doing its grand opening later today :) Metrics about office beer / root beer consumption to come!
Matt:

Guests

Dave Zwieback

Dave Zwieback is VP, Engineering at Next Big Sound. Dave is the author of The Human Side of Postmortems and Being Blameless: The Best Way To Learn From Failure (and Success), coming in 2015 from O’Reilly Media. Follow Dave @mindweather or read his blog at mindweather.com.

Michael Rembetsy

Michael Rembetsy has worked in technical operations for over ten years in the web, healthcare, online media and financial sectors. He started out in the help desk area, but moved to operations shortly thereafter, and has been building and running data center and operations teams ever since. In previous jobs he worked for NBC Universal, iVillage and McDonald’s online game, Monopoly. Currently, Michael is the VP, Technical Operations for Etsy.

Hosts

Matt Stratton

Matt Stratton (he/him)

Matty Stratton is the Director of Developer Relations at Aiven, a well-known member of the DevOps community, and a global organizer of the DevOpsDays set of conferences.

Matty has over 20 years of experience in IT operations and is a sought-after speaker internationally, presenting at Agile, DevOps, and cloud engineering focused events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.

He lives in Chicago and has three awesome kids, whom he loves just a little bit more than he loves Diet Coke.

Trevor Hess

Trevor Hess is a Senior Product Manager at Progress Software working on Chef Software. He currently works on the Chef Application Delivery, Compliance and Infrastructure offerings.

Coming from a background in .NET Software Development and consulting, he has worked with several large multinational organizations to help kick start their journey to the cloud and the world of DevOps practices and principles. He is excited to engage in new experiences and learning opportunities.

Trevor enjoys having hearty discussions about DevOps as well as organizational change and transformation.

Bridget Kromhout

Bridget Kromhout is a Principal Program Manager at Microsoft Azure, focusing on the open source cloud native ecosystem. Her CS degree emphasis was in theory, but she now deals with the concrete (if ‘cloud’ can be considered tangible). After years on call for production (from enterprise to research to startups) and a couple of customer-facing adventures, she now herds cats and wrangles docs on the product side of engineering. In the wider tech community, she has done much conference speaking and organizing, and advises the global devopsdays organization after leading it for over five years. Living in Minneapolis, she enjoys snowshoeing in the winter and bicycling in the summer (with winter cycling as a stretch goal).


pagerduty

redgate

10thmagnitude