Loon’s Production Engineering/SRE team was founded by Google SRE alumni, so it is no surprise that the team instituted a culture of blameless postmortems that became a key feature of Loon’s approach to incident response. Blameless postmortems trace their roots in part to mid-20th-century aerospace practice, so it was particularly fitting that they came full circle at a company that melded cutting-edge aerospace work with the development of a communications platform and the world’s first stratospheric temporospatial software-defined network. The use of postmortems became a standardizing factor across Loon’s teams, from avionics and manufacturing, to flight operations, to software platforms and network service. This blog post discusses how Loon moved from a heterogeneous approach to postmortems to a standardized practice shared across the organization, a shift that helped the company move from R&D to commercial service in 2020.

Background

Postmortems

Many industries have adopted the use of postmortems— they are fairly common in high-risk fields where mistakes can be fatal or extremely expensive. Postmortems are also widespread in industries and projects where bad processes or assumptions can incur expensive project development costs and avoiding repeat mistakes is a priority. Individual industries and organizations often develop their own postmortem standards or templates so that postmortems are easier to create and digest across teams.

Blameless postmortems likely originated in the healthcare and aerospace industries in the mid-20th century. Because of the high cost of failure, these industries needed to create a culture of transparency and continuous improvement that could only come from openly discussing failure. As the original SRE book states, blameless postmortems are key to “an environment where every ‘mistake’ is seen as an opportunity to strengthen the system.”

The goal of a postmortem is to document an incident or event in order to foster learning from it, both among the affected teams and beyond. The postmortem usually includes a timeline of what happened, the solutions implemented, the incident’s impact, the investigation into root causes, and changes or follow-ups to stop it from happening again. To facilitate learning, SRE’s postmortem format includes both what went well— acknowledging the successes that should be maintained and expanded— and what went poorly and needs to be changed. In this way, postmortem action items are key to prioritizing work that ensures the same failures don’t happen again.
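The sections listed above map naturally onto a small record type. The following is only an illustrative sketch, not Loon’s or Google’s actual template: the class and field names (`Postmortem`, `TimelineEntry`, `ActionItem`, `open_action_items`) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime


# All names below are hypothetical; they mirror the sections described
# in the text, not any real postmortem template.
@dataclass
class TimelineEntry:
    at: datetime
    note: str


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    title: str
    impact: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    solutions: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def open_action_items(self) -> list[ActionItem]:
        """Return follow-ups that have not been completed yet."""
        return [item for item in self.action_items if not item.done]
```

Keeping action items as structured entries with an owner, rather than free text, makes it straightforward to list the follow-ups that are still open.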

Loon

Loon aimed to supply internet access to unserved and underserved populations around the world by providing connectivity via stratospheric balloons. These high-altitude “flying cell towers” covered a much wider footprint than a terrestrial tower, and could be deployed (and repositioned) into the most remote corners of the earth without expensive overland transportation and installation. As the first company to attempt anything like this, Loon dealt with a number of systems that were complex, challenging, or novel: superpressure balloons designed to stay aloft for hundreds of days, wind-dependent steering, a software-defined network consisting of constantly moving nodes, and extremes of temperature and weather at 20 km above Earth’s surface.

Prod Team

The initial high-risk operations of Loon’s mission were avionic: could we launch and steer balloons carrying a networking payload long enough to reach and serve the targeted region? As such, the earliest failure reports within Loon (which weren’t officially called “postmortems” at the time) mostly involved balloon construction or flight, and drew on the experience of team members who had worked in the Avionics, Reliability Engineering, and/or Flight Safety fields. As Loon’s systems evolved and matured, they started to require operational reliability, as well. Just before graduating from a purely R&D project in Google’s “moonshot factory” incubator X to a company with commercial goals, Loon started building a Site Reliability Engineering (SRE) team known internally as Prod Team.

To offer internet connectivity to users effectively, Loon had to solve network serving failures with the same rigor as hardware failures. Prod Team took the lead on a number of practices to improve network reliability, and had three primary goals:

  • Ensure that the fleet’s automation, management, and safety-critical systems were built and operated to meet the high safety bar of the aviation industry.
  • Lead the integration of the communications services (e.g., LTE) end to end.
  • Own the mission of fielding and providing a reliable commercial service (Loon Library) in the real world.

Postmortems at Loon

The Early Days

Postmortems were one tool for reaching Prod Team’s (SRE’s) goals. Prod Team often interacted with SREs in other infrastructure support teams that the Loon service connected to, such as the team developing the Evolved Packet Core (EPC), our telco partner counterparts, and teams that handle edge network connectivity. Postmortems provided a common tool for sharing incident information across all these teams, and could even span multiple companies when upstream problems impacted customers.

At Loon, postmortems served the following goals:

  • Document and transcribe the events, actions, and remedies related to an incident.
  • Provide a feedback loop to rectify problems.
  • Indicate where to build better safeguards and alerts.
  • Break down silos between teams in order to facilitate cross-functional knowledge sharing and accelerate development.
  • Identify macro themes and blind spots over the longer term.

The combination of aerospace and high tech brought two strong practices of writing postmortems, but also the challenge of how to own, investigate, or follow up on problems that crossed those boundaries, or when it wasn’t clear where the system fault lay.

Loon’s teams across hardware, software, and operations orgs used postmortems, as was standard practice in their fields for incident response. The Flight Operations Team, which handled the day-to-day operations of steering launched balloons, captured in-flight issues in a tracking system. The tracking system was part of the anomaly resolution system devised to identify and resolve root cause problems. Seeking to complement the anomaly resolution system, the Flight Operations Team incorporated the SRE software team’s postmortem format for incidents that needed further investigation— for example, failure to avoid a storm system, deviations from the simulated (expected) flight path that led to an incident, and flight operator actions that directly or indirectly caused an incident. Given that most incidents spanned multiple teams (e.g., when automation failed to catch an incorrect command sent by a flight operator, which resulted in a hardware failure), utilizing a consistent postmortem format across teams simplified collaboration.

The Aviation and Systems Safety Team, which focused on safety related to the flight system and flight process, also brought their own tradition and best practices of postmortems. Their motto, “Own our Safety”, brought a commitment to continually improving safety performance and building a positive safety culture across the company. This was one of the strengths of Loon’s culture: all the organizations were aligned not just on our audacious vision to “connect people everywhere”, but also on doing so safely and effectively. However, because industry standards for postmortems and how to handle different types of problems varied across teams, there was some divergence in process. We proactively encouraged teams to share postmortems between teams, between orgs, and across the company so that anyone could provide feedback and insight into an incident. In that way, anyone at Loon could contribute to a postmortem, see how an incident was handled, and learn about the breadth of challenges that Loon was solving.

Challenges

While everyone agreed that postmortems were an important practice, in a fast-moving startup culture it was a struggle to follow through comprehensively on action items. This probably comes as no surprise to developers in similar environments: when the platform or services that require investment are rapidly changing or being replaced, it’s hard to spend resources on not repeating the same mistakes. Ideally, we would have prioritized postmortems that focused on best practices and learnings that were applicable to multiple generations of the platform, but those weren’t easy to identify at the time of each incident.

Even though the company was not especially large, the novelty of Loon’s platform and the interconnectedness of its operations made it difficult to determine which team was responsible for writing a postmortem and investigating root causes. For example, a 20-minute service disruption on the ground might be caused by a loss of connectivity from the balloon to the backhaul network, a pointing error with the antennae on the payload, insufficient battery levels, or wind that temporarily blew the balloon out of range. Actual causes could be quite nuanced, and were often attributable to interactions between multiple sub-systems. Thus, we had a chicken-and-egg problem: which team should start the postmortem and investigation, and when should they hand off the postmortem to the teams that likely owned the faulty system or process? Not all teams had a culture of postmortems, so the process could stall depending on the system where the root cause originated. For that reason, Loon’s Prod Team/SREs advocated for a company-wide blameless postmortem culture.

Much of how Loon used postmortems, especially in software development and Prod Team, was in line with SRE industry standards. In the early days of Loon, however, there were no service level objectives or agreements (SLO/As). As Loon was an R&D project, we wrote postmortems when a test network failed to boot after launch, or when performance didn’t meet the team’s predictions, rather than for “service outages”. Later on, when Loon supplied commercial service in disaster relief areas in Peru and Kenya, the Prod Team could more clearly identify the types of user-facing incidents that required postmortems due to failure to meet SLAs.

Improving and Standardizing Loon’s Postmortem Processes

Moving Loon from an R&D model to the model of reliability and safety necessary for a commercial offering required more than simply performing postmortems. Sharing the postmortems openly and widely across Loon was critical to building a culture of continuous improvement and addressing root causes.

To increase cross-team awareness of incidents, in 2019 we instituted a Postmortem Working Group. In addition to reading and discussing recent postmortems from across the company, the goals of the working group were to make it easier to write postmortems, promote the practice of writing postmortems, increase sharing across teams, and discuss the findings of these incidents in order to learn the patterns of failure. Its founding goal was to “Cultivate a postmortem culture in Loon to encourage thoughtful risk taking, to take advantage of mistakes, and to provide structure to support improvement over time.” While the volume of postmortems could ebb and flow across weeks and months, over multiple years of commercial service we expected to be able to identify macro-trends that needed to be addressed with the cooperation of multiple teams.

In addition to the Postmortem Working Group, we also created a postmortem mailing list and a repository of all postmortems, and presented a “Lunch & Learn” on blameless postmortems (see example slide below). Prod Team and several other teams’ meetings had a standing agenda item to review postmortems of interest from across the company, and we sent a semi-annual email celebrating Loon’s “best-of” recent incidents: the most interesting or educational outages.

Once we had a standardized postmortem template in place, we could adopt and reuse it to document commercial service field tests. By recording a timeline and incidents, defining a process and space to determine root causes of problems, recording measurements and metrics, and providing the structure for action item tracking, we brought the benefits of postmortem retrospectives to prospective tasks.

When Loon began commercial trials in countries like Peru and Kenya, we conducted numerous field tests. These tests required engineers from Loon and/or the telco partner to travel to remote locations to measure the strength of the LTE signal on the ground. Prod Team proactively used the postmortem template to document the field tests. It provided a useful format to record the log of test events, results that did and did not match expectations, and links to further investigations into those failures. Loon was a cutting-edge project in a highly variable operating environment, and using the postmortem template as our default testing template was an acknowledgement that we were in a state of constant and rapid iteration and improvement. These trials took place in early to mid-2020, under the sudden specter of COVID-19 and the subsequent shift towards working from home (WFH). The structured communications at the core of Loon’s postmortem process were particularly helpful as we moved from in-person coordination rooms to WFH.

What Loon Learned from Standardizing Postmortems

Postmortems are widely used in various industries because they are effective. At Loon, we saw that even fast-moving startups and R&D projects should invest early in a transparent and blameless postmortem culture. That culture should include a clear process for writing postmortems, clear guidelines for when to conduct a postmortem, and a staffed commitment to follow up on action items.

Meta-reviews across postmortems and outages revealed several trends.

The many points of failure we observed across the range of postmortems were indicative of both the complexity of Loon’s systems and the complexity of some of its supporting infrastructure. Postmortems are as adept at finding flaky tests and fragile processes as they are at finding hardware failures or satellite network outages. These are complexities familiar to many startups, where postmortems can help manage the tradeoff between making changes safely and moving quickly while trying many new things.

Loon was still operating a superhero culture: across a wide range of issues, a small set of experts was repeatedly called upon to fix the system. This dynamic is common in startups, and the term is not meant as a pejorative, but it was markedly different from the system maturity that many Prod Team members/SREs were used to. Once we identified this pattern, our plan for commercial service was to staff a 24×7 oncall rotation, complemented by Program Managers driving intentional processes to de-risk production.

Postmortems provided a space to ask questions like, “What other issues could pop up in this realm?”, which prompted us to solve for the broader case of problems rather than specific problems we’d already seen. This practice also stopped people from brushing off problems in the name of development speed, or from dismissing issues because they “just concerned a prototype”.
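A meta-review of this kind can be as simple as tallying recurring themes across the postmortem repository. The sketch below assumes each postmortem has been tagged by hand with contributing factors. The tag names, the postmortem IDs, and the `recurring_themes` helper are all hypothetical and do not describe Loon’s tooling.

```python
from collections import Counter


def recurring_themes(postmortems, min_count=2):
    """Return (tag, count) pairs for tags seen in at least min_count postmortems.

    `postmortems` maps a postmortem ID to its list of contributing-factor
    tags. Both the IDs and the tags are hypothetical labels.
    """
    counts = Counter()
    for tags in postmortems.values():
        # Count each tag once per postmortem, even if it was listed twice.
        counts.update(sorted(set(tags)))
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    repo = {
        "pm-1": ["flaky_test", "single_expert"],
        "pm-2": ["single_expert", "satellite_outage"],
        "pm-3": ["single_expert", "flaky_test"],
    }
    for tag, n in recurring_themes(repo):
        print(f"{tag}: {n}")
```

A tag such as the hypothetical `single_expert` recurring across unrelated incidents is the kind of signal that pointed to the superhero pattern described above.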

Tips and Takeaways

While the specifics of Loon’s journey to standardize postmortems tell the story of one company, we have some tips and takeaways that should be applicable at most organizations.

Tip 1: Adopting a blameless postmortem culture requires everyone to participate

Although the initiative of writing postmortems often originates with a software team, if you want every team to adopt the practice, we suggest trying the following:

  • Give a talk about postmortems and how and why they could benefit all.
  • Form a postmortem working group.
  • Invite people representing different teams to be part of the postmortem working group. They will give insights into what could work better for their respective teams.
  • Don’t make the postmortem working group responsible for writing the postmortems— this approach doesn’t scale. Reviewing and consulting on postmortems may be in scope of their duties, especially while new teams are adopting this practice.

Tip 2: Define a lightweight postmortem process

Especially during adoption, you want teams to see the benefits of postmortems, not the burden of writing them. Creating a postmortem template with minimum requirements can be helpful.

Tip 3: Define a clear owner for postmortems

Who should write a postmortem and when? For software teams with an oncall rotation, the answer is clear: the person who was oncall during the incident is the owner, and the team writes a postmortem when a service interruption breaches SLOs. But when the service has no SLOs, or when a team doesn’t have an oncall rotation, you need defined criteria. Bonus points if the outage involves multiple systems and teams. The following exercises can help in this area:

  • Reflect on these topics from the point of view of each team, and from the point of view of the interaction between teams.
  • For each team, define what type of incident(s) should trigger a postmortem.
  • Within the team, define who should own writing each postmortem. Avoid putting the entire burden on the same person frequently; consider forming a rotation.
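The last two exercises above can be written down explicitly, which avoids debating ownership during an incident. The following is a minimal sketch under stated assumptions: the team names, incident kinds, and the `needs_postmortem` and `OwnerRotation` helpers are hypothetical examples, not Loon’s actual criteria.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Incident:
    team: str
    kind: str  # hypothetical label, e.g. "slo_breach"


# Hypothetical per-team lists of incident kinds that trigger a postmortem.
TRIGGERS = {
    "prod": {"slo_breach", "service_outage"},
    "flight_ops": {"storm_avoidance_failure", "flight_path_deviation"},
}


def needs_postmortem(incident: Incident) -> bool:
    """True if the incident kind is on the owning team's trigger list."""
    return incident.kind in TRIGGERS.get(incident.team, set())


class OwnerRotation:
    """Round-robin assignment so one person does not write every postmortem."""

    def __init__(self, members):
        if not members:
            raise ValueError("rotation needs at least one member")
        self._members = cycle(list(members))

    def next_owner(self) -> str:
        return next(self._members)
```

For example, `OwnerRotation(["ana", "bo"])` hands out `ana`, `bo`, `ana` on successive calls to `next_owner()`. A team with no entry in `TRIGGERS` never triggers a postmortem in this sketch, which makes it easy to spot teams whose criteria are still undefined.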

Tip 4: Encourage blameless postmortems and make people proud of them

Consider some activities that can help foster the blameless postmortem culture:

  • Write a report of the best postmortems over a given period and circulate them broadly.
  • Conduct training on how to write postmortems.
  • Train managers and encourage them to prioritize postmortems on their teams.

Conclusion

When Loon shut down, addressing all these points was still a work in progress. We don’t have a teachable moment of “this postmortem process will solve your failures”, because postmortems don’t do that. However, we could see where postmortems stopped us from needing to deal with the same failures repeatedly, and where we sometimes did experience repeat incidents because the action items from the first postmortem weren’t prioritized enough. And so this piece of writing, effectively a postmortem on Loon’s postmortems, serves up a familiar lesson: postmortems work, but only to the extent that they are widely accepted and adhered to.

By: Danielle van Dyke (Site Reliability Engineer, Loon) and Giselle Font (Site Reliability Engineer, Google Cloud)
Source: Google Cloud Blog
