From Fragile to Agile: Reliability Engineering

by Mike Angstadt

January 6, 2020

Our engineering mission at H-E-B Digital is to build secure, performant, reliable software that enables us to delight our customers—no matter how they choose to shop. And considering our scale and the wide range of services we need to keep everything running, that can be a little challenging. Factor in that we're a company that's been building and evolving our technology for over a century, and it becomes one tough engineering problem to solve. 

As reliability engineers, our job is to collaborate with teams of other tenacious digital engineers to ensure world-class performance. Here are some of the principles we focus on as we work diligently to ensure the reliability of H-E-B Digital products.

As we do this work, our first task is to banish fragile thinking.

No one is to blame 

The critical ingredient for success is a blameless retrospective culture. 

Complex systems exhibit emergent behaviors that are only discoverable through experimentation; no amount of pre-production testing can expose certain defects. You don't know what you don't know until you run it in prod.

When we embrace the idea that no complex system failure has a single root cause or a single person to blame, we eliminate the urge to assign blame and can focus instead on what is actually happening.

For any newly observed or interesting production incident, we conduct a team incident review where we meet face-to-face and dive into what happened and how we responded. From there, we identify concrete action items that can prevent the problem from happening again. These actions can be technical (for example, adding warning alerts on memory consumption), but are often procedural or behavioral. We've learned that when we focus on actions that teams can intrinsically own and then prioritize them at the top of the backlog, the results are mind-blowing.

Next stop, observability


To make sure we understand how our systems are behaving in production, we need to properly observe them. We treat production environments like a continuously running scientific experiment: collect the data to draw sound conclusions, prove or disprove hypotheses, and use that information to guide our teams' priorities for how we can iteratively improve.

It starts with basic logging and instrumenting the application code, leads to instituting solid monitoring and alerting, and follows all the way through to enabling detailed forensic introspection capabilities after (or during) an incident. These details help to both prevent and manage incidents better in the future.  
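As a minimal sketch of that first instrumentation step (the function name `add_to_cart` and the logger name are hypothetical examples, not H-E-B code), a decorator can record latency and outcome for every call, giving monitoring something to alert on and leaving a trail for forensics:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def instrumented(fn):
    """Record latency and success/failure for each call so the data
    is available for monitoring, alerting, and post-incident review."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            log.info("%s ok latency_ms=%.1f", fn.__name__,
                     (time.monotonic() - start) * 1000)
            return result
        except Exception:
            # Failures are logged with timing too; slow failures are
            # often the most interesting signal during an incident.
            log.error("%s failed latency_ms=%.1f", fn.__name__,
                      (time.monotonic() - start) * 1000)
            raise
    return wrapper

@instrumented
def add_to_cart(item_id):
    return {"item": item_id, "status": "added"}
```

In practice this role is usually filled by a metrics library or tracing agent rather than hand-rolled logging, but the shape of the data collected is the same.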

Our observability philosophy follows four basic steps: 

  • Look: Can we see what’s happening at all? What could we measure?

  • Measure: Which indicators meaningfully represent service quality to our customers?

  • Define: What are the bounds of good and bad behavior for this indicator over time?

  • Improve: As we iteratively improve these indicators, how can we revise defined performance bounds?
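The first three steps above can be sketched in a few lines. This is an illustrative example with made-up numbers, not H-E-B's tooling: an availability SLI (look, measure) checked against an SLO bound (define):

```python
def availability_sli(good_events, total_events):
    """Service level indicator: fraction of requests served successfully."""
    if total_events == 0:
        return 1.0
    return good_events / total_events

def within_slo(sli, slo_target):
    """Define the bound: behavior is 'good' while the SLI meets the SLO."""
    return sli >= slo_target

# Measured over some window: 99,320 good requests out of 100,000.
sli = availability_sli(good_events=99_320, total_events=100_000)
print(f"SLI = {sli:.4f}, within 99.9% SLO: {within_slo(sli, 0.999)}")
```

The "improve" step is then just revisiting `slo_target` as the measured SLI trends upward over successive windows.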

We've made great strides developing observability requirements based on measuring what frustrates our customers, and we've been intentional about improving those indicators early rather than late. Because we iterate, our service level indicators and service level objectives improve at increasing velocity, particularly when we empower our teams to have real-time conversations directly with our customers.


Bringing it full circle with error budgets

So now we've got a solid handle on how to identify improvement actions from production incidents, and we know how to gather great data to support a scientific approach to iterate on enhancements—but how do we fold this reliability work into our agile-scrum product-delivery workflows and balance it with new feature work? 

Welcome our friend, the error budget. 

An error budget is the inverse of the availability target: a 99.9% SLO leaves a 0.1% budget for failures. Outperforming a service's SLO earns more error budget to spend on risky operations like shipping new code. We're leveraging error budgets to guide when we should prioritize stability work for high-priority systems versus building new features.
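As a back-of-the-envelope sketch (illustrative numbers, not H-E-B data), the error budget arithmetic looks like this:

```python
def error_budget_fraction(slo_target):
    """Inverse of the availability target: a 99.9% SLO allows 0.1% failures."""
    return 1.0 - slo_target

def remaining_budget(slo_target, good_events, total_events):
    """Failed requests still allowed in the window: allowance minus spend."""
    allowed = error_budget_fraction(slo_target) * total_events
    spent = total_events - good_events
    return allowed - spent

# A window with 100,000 requests under a 99.9% availability SLO
# allows 100 failed requests; 60 have failed so far.
left = remaining_budget(0.999, good_events=99_940, total_events=100_000)
print(f"remaining budget: {left:.0f} failed requests")
```

Once `remaining_budget` hits zero, the written policy kicks in and the team shifts from feature work to reliability work.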

Writing down the team's error budget policy allows engineers, product owners, and stakeholders to agree on what happens once the error budget is exhausted before it's actually exhausted. This typically includes suspension of new-feature development to invest in increasing reliability. After all, if the software isn't running, it doesn't matter how many new features are shipped. Leveraging this policy also lets us know when we've made something stable enough to move onto other work. 

Avoiding failure simply forfeits the ability to learn, so the error budget gives us permission to break things safely while improving stability. In fact, applying this policy to normalize heb.com latency during a recent migration let us scale to 2x traffic while improving performance.

Repeat the above

We've made incredible progress on our reliability journey thus far at H-E-B Digital, but we're just getting started! As we continue to scale our native mobile offerings, enhance our supply chain, and build world-class business systems, the opportunities to make our fragile systems more resilient by embracing failure with agility are limitless. And, if fixing all this fragility sounds fun to you, we’re hiring.

Mike Angstadt is Director of Engineering. You can connect with him on LinkedIn.

All we’re missing is you.