Sean Kelly is FlightAware’s Senior Director of IT Operations and Reliability
Recently at FlightAware, we began converting our traditional IT Operations team and practices into a Site Reliability Engineering organization. In this post, we'll take a quick look at what SRE is, why we chose this direction, and how the journey has changed our incident response processes for the better.
First, we should cover the basics and make sure we’re clear on what Site Reliability Engineering (SRE) even is. It is a discipline where software engineering practices are used to solve what would otherwise be traditional Operations problems. For example, an SRE might automate the testing, deployment, validation, and rollback of a software release. At the opposite extreme, an Operations engineer might log in to a server, run an installer by hand, and then call the developer if it breaks. That contrast is deliberately stark, but it illustrates the difference in problem-solving mindset.
SRE is still fairly new, so many organizations play fast and loose with what it means to them or which pieces they apply. At FlightAware, we’ve opted to model our practice on the methodologies Google describes in its SRE book, within reason. We operate at a smaller scale, with fewer servers and less cloud, so some aspects are harder to apply while others just need scaling down. Some of our practices were already aligned with the SRE approach, but officially adopting it has brought more intent and direction to our trajectory.
Incident response is another important piece of SRE. It covers both the incident itself and the postmortem process that follows. Google writes about incident response in their book, and other organizations like PagerDuty and Atlassian also discuss it at length. This increased structure around responding to problems is the aspect of SRE we’ve embraced the most thus far.
As FlightAware has grown, the way customers use our services has also grown. We now have customers that use us as a key function of their operations. They want increasingly higher reliability, since our service disruptions can have a significant impact on them. Our uptime has always been quite good, but we’ve never had processes or practices in place to hold us to that high standard as teams got larger. What was easy to do through institutional knowledge with a team of ten becomes a lot harder with 12+ teams of varying sizes and focus areas. You need a way to spread best practices within a growing Operations team while continuing to push the bar higher.
When I joined FlightAware in 2012, I was hired on as the IT Operations person. While my past is very Ops-focused, I also have experience writing software and approaching problems from a programming perspective. This is true for many of the folks making up our IT Operations team. SRE is a developer-oriented approach to Operations problems, so we were well suited to pivot the existing team to this new approach. FlightAware is a software company, and this allowed us to embrace that existing talent in the team without having to re-hire or extensively re-train.
Given all of this and its increased prominence in the industry, SRE was an obvious choice for us.
Houston, We Have an Incident
Before adopting SRE, we had already begun to work on improving our incident response processes. However, the methodologies laid out in various SRE texts gave us much more guidance and direction for improving how we handle incidents. Historically, in an outage, the on-call would get a robocall and dig in. If they got stuck, they’d escalate to colleagues. After it was all over, we often did brief write-ups on what actually broke. That was about the extent of it from start to finish. Much of this was fine when the company was small but started to falter over time, especially as both internal and external stakeholders wanted a better understanding of what broke and how we were working to mitigate it.
Before we go any further, I’d like to set some context for future examples in this post. You may know FlightAware as a website for tracking flights. While this is absolutely true, we also have an entire portfolio of other products and services. One of our products is called Firehose. Firehose streams JSON-encoded flight events to TCP connected clients, allowing real-time ingestion of events into customers’ systems. Put more simply, Firehose sends messages like departures, arrivals, and flight positions to customers so they can integrate FlightAware into their products.
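To make that concrete, here’s a minimal sketch of what consuming a Firehose-style stream of newline-delimited JSON events might look like. The event fields and the parsing helper here are illustrative assumptions, not Firehose’s actual schema or client API:

```python
import json

def parse_firehose_events(stream_lines):
    """Parse newline-delimited, JSON-encoded flight events.

    In a real client, stream_lines would be lines read off a TCP
    socket; here it is any iterable of strings, which keeps the
    sketch free of network code. Returns (event_type, event) pairs.
    """
    events = []
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue  # skip keepalive blank lines, if any
        event = json.loads(line)
        events.append((event.get("type"), event))
    return events

# A tiny sample of made-up departure/position/arrival events.
sample = [
    '{"type": "departure", "ident": "UAL123", "orig": "KIAH"}',
    '{"type": "position", "ident": "UAL123", "lat": 29.9, "lon": -95.3}',
    '{"type": "arrival", "ident": "UAL123", "dest": "KORD"}',
]
for kind, event in parse_firehose_events(sample):
    print(kind, event["ident"])
```

The real service speaks a richer protocol, but the shape is the same: one JSON object per event, decoded as it streams in.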
Throughout this post, I’ll be abusing Firehose by running it through all kinds of failure scenarios to illustrate how we handle incidents. Don’t worry, no Firehoses were harmed in the publishing of this post.
Let’s break Firehose now as a hypothetical example and see how our pre-SRE Systems Engineers would respond.
Firehose Is on Fire
It is 3:07AM and Archer’s phone jolts him out of bed. He logs in to Slack and Zabbix to see what is going on. He sees that there are 60 Zabbix alarms being triggered which seem to essentially boil down to Firehose not passing data to customers.
After confirming the situation, Archer runs the program for notifying customers. Notifications are generated and sent, so Archer continues trying to figure out what is broken.
He finds through a bit of investigation that none of the Firehose services are receiving messages from their upstream feed service. He logs into the upstream service and investigates. Eventually he tries restarting the service to no avail.
Archer is stuck. It appears Firehose isn’t actually broken, but rather an upstream service. He knows Lana, who is on that team, knows a lot about the service, and always answers the phone. He gives her a call at 3:25AM, waking her up for the third week in a row. He doesn’t know it, but she is actually on vacation. Despite that, she still answers and offers to help.
Lana jumps on, putting her vacation on pause, and they both continue to investigate. A few times, they discover that they are investigating the same aspect of the problem. Once, they both restart the same services within seconds of each other. They try to coordinate on Slack, but their primary focus is resolving the problem, so coordination isn’t firing on all cylinders.
Finally, between the two of them, they get things going again. Archer sighs in relief as he sends a notice to customers that service is restored. In the morning, he’ll write up exactly what broke so there is a record of what happened and something to work off of in the event customers ask questions. He’ll also need to reach out to Lana to capture what she did and work out which changes actually cleared the issue.
How We Have Changed
These days, we treat every failure as an incident. Full service outages and minor internal-only issues are all incidents. To prevent treating everything like a raging inferno, we classify based on severity and impact:
- SEV-1: Critical issue that warrants public notification and liaison with executive teams. An example would be that all our Firehose instances are down across all data centers.
- SEV-2: Critical system issue actively impacting many customers’ ability to use the product. An example here could be that no new customer connections can be established to Firehose, but existing ones are still working. This example also has a high likelihood of escalating to a SEV-1 as the impact grows due to customer connection churn.
- SEV-3: Stability or minor customer-impacting issues that require immediate attention from service owners. This could be the total failure of Firehose at a single data center. Customers can still connect to Firehose at other locations and, in many cases, will already have an active redundant connection at another site.
- SEV-4: Minor issues requiring action, but not affecting customer ability to use the product. This could be a single failed Firehose instance. The customers can still connect to other ones.
- SEV-5: Issues or bugs not affecting customer ability to use the product. This could be the loss of a single disk in a Firehose server. It doesn’t impact the service, but we are at an increased risk of failure.
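The ladder above can be read as a cascade from blast radius down to latent risk. As a hypothetical illustration only (real triage is a human judgment call, and these parameter names are invented for the sketch):

```python
def classify_severity(all_sites_down=False, blocks_customer_use=False,
                      single_site_down=False, needs_action=False):
    """Map incident impact to a SEV level, mirroring the ladder above.

    Parameter names are illustrative assumptions, not part of any
    real triage tooling.
    """
    if all_sites_down:
        return "SEV-1"  # e.g. Firehose down across all data centers
    if blocks_customer_use:
        return "SEV-2"  # e.g. no new customer connections possible
    if single_site_down:
        return "SEV-3"  # e.g. Firehose down at one data center
    if needs_action:
        return "SEV-4"  # e.g. a single failed Firehose instance
    return "SEV-5"      # e.g. one failed disk; latent risk only

print(classify_severity(all_sites_down=True))    # SEV-1
print(classify_severity(single_site_down=True))  # SEV-3
```

The ordering matters: an incident is rated by the worst condition that holds, which is also why a SEV-2 can escalate to a SEV-1 as connection churn spreads the impact.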
Any SEV-1s and SEV-2s automatically require a trip through our postmortem process, but more on that later.
When the on-call gets summoned, they take on the responsibility for resolving the issue or escalating the incident response process to get more eyes on the problem. While working an incident, we also have several defined roles to keep responsibilities clear. By default, the on-call owns all roles until they delegate one or more out. The roles we have are as follows:

- Incident Commander (IC): Owns the incident, tracks progress, and ensures responders aren’t doing conflicting or duplicated work.
- Communications: Keeps customers and internal stakeholders updated on the status of the incident.
- Scribe: Documents what is happening in real time, producing a timeline for the postmortem.
- Worker: Investigates and resolves the technical problem.
In our scenario above, Archer was the incident commander, communications, scribe, and worker. When Lana got on the scene, she also joined in a worker role.
A problem you may have noticed was that Archer just reached out to somebody he knew and trusted on another team for help. Given 100 issues of this nature, it is likely that Archer would reach out to Lana all 100 times, effectively making her a permanent, ad hoc on-call.
To solve this problem, we traded in our homemade on-call software in favor of PagerDuty. PagerDuty made it a breeze to implement on-call rotations for key teams throughout Engineering that may need to be called on for their expertise. So now, Lana knows when she can expect a call, and she’s not alone in the rotation.
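Our actual integration details aren’t shown here, but as a sketch of the mechanics: PagerDuty’s Events API v2 accepts a JSON event with a routing key identifying the service, and a dedup key that lets a storm of related alarms (say, 60 Zabbix triggers) roll up into a single incident. The helper below only builds the payload; the integration key and dedup key values are placeholders:

```python
import json

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build a PagerDuty Events API v2 'trigger' payload.

    routing_key is the integration key tied to the target service's
    escalation policy. Reusing the same dedup_key groups subsequent
    alarms into the existing incident instead of opening new ones.
    """
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # critical, error, warning, or info
        },
    }
    if dedup_key:
        event["dedup_key"] = dedup_key
    return event

# To send, POST this body as JSON to PAGERDUTY_EVENTS_URL (omitted
# here to keep the sketch free of network calls).
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder
    summary="Firehose not passing data to customers",
    source="zabbix",
    dedup_key="firehose-feed-outage",    # hypothetical grouping key
)
print(json.dumps(event, indent=2))
```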
So, with just these changes, let’s take a look at our incident again.
Firehose Is on Fire: Again From the Top
It is 3:07AM and Archer’s phone jolts him out of bed. He logs in to Slack and PagerDuty to see what is going on. He sees that there is a Firehose incident rolling up from 60 Zabbix alarms. They seem to essentially boil down to Firehose not passing data to customers. But the PagerDuty incident actually indicates the issue isn’t Firehose, but the upstream feed generator. Archer is Incident Commander, along with all other roles. He marks the incident as a SEV-1 since all Firehose customers are impacted.
Because customers are impacted, Archer uses the same procedure as before to notify customers of the incident. In the new parlance, Archer is acting as the Communications role while doing this. Because this is a SEV-1, he also escalates a notification to leadership using the Response Play feature in PagerDuty.
Like before, Archer spends a few minutes looking into the root cause of the problem, using the PagerDuty data as a guide. Once he realizes he’s got a big problem, he uses another Response Play to summon another SRE. Woodhouse shows up on the scene and takes on the Worker role.
In his Incident Commander (IC) role, Archer keeps tabs on what Woodhouse is doing. He’s also still wearing the Communications hat, so he keeps customers and internal stakeholders updated on the status. He may also seek further backup so somebody can take the Scribe role, documenting what is happening in real time. When Woodhouse tells Archer that the problem is upstream and they need help, Archer escalates to the right team to get a developer involved. Before, he would have just dialed Lana since she is very responsive. However, now there is an on-call rotation that is leveraged instead. That is where Pam comes in, pulled in programmatically as the active member in that team’s rotation.
Pam and Woodhouse collaborate on the problem, keeping Archer in the loop on what is being done. Archer is responsible for ensuring they are making progress and not doing conflicting or duplicated work. Unlike before, Archer ensures they aren’t both restarting the same services.
After Pam and Woodhouse resolve the problem, Archer again uses his Communications powers to notify customers and internal stakeholders that they are out of the danger zone. Finally, as the IC, he creates a Jira to launch the postmortem process in the morning and includes the timeline from the Scribe.
Incidents Are Better
This process provides a lot more structure and a clear delineation of who is responsible for what during an incident. It also has the potential to yield far better documentation of the incident, which helps with the postmortem and improves communication to customers. By expanding the on-call pool, we’ve also made it possible to programmatically escalate to the right people, making it easier to work an incident while also signaling for help. No time is spent war dialing from a company directory; just press the button and let the cloud robots do the work in the background. Since there is an IC role tracking progress, this also helps eliminate the problem where an engineer digs into a problem to the detriment of communications or identifying other paths to resolution.
The second most important part of incident response is the postmortem process. (If you didn't guess, the first is fixing the problem.) In a postmortem, we not only diagnose what happened but also discuss how to keep it from happening again and how our response itself could have been better.
The postmortem isn’t just about identifying what broke and what was done to fix it. It is an opportunity to reflect, improve, and iterate. Sometimes there aren’t any clear action items. Sometimes bug fixes may need to be done in software releases. Occasionally, we find that something we’ve been doing for a long time is not desirable anymore and we need to make changes. Nothing is sacred, and everything is on the table.
FlightAware and I have two requirements for postmortems:
- They are blameless. Nobody is at fault for the incident. Humans make mistakes, so you need to accept that it will happen and design systems that are reasonably tolerant of human error. Nobody is put on the spot or deemed to be the cause of an incident. The time I accidentally shut down the wrong PostgreSQL server and caused an outage wasn’t an opportunity to admonish me, but rather to figure out how to make that mistake harder to make. And yes, I did this.
- Postmortems are not punitive. This dovetails with the point above, but is also a harder one to land. When somebody is on the hook to write up a document outlining what happened, it has the potential to feel like a homework assignment. The right atmosphere and communications have to be put in place to instill the value of the postmortem. Writing one actually gives the author a powerful voice in recommending improvements. It is not, and never should be, a punishment.
Fortunately, the blameless approach has always been a core value of FlightAware engineering. It is a value I communicate as much as possible.
Our postmortem process is still evolving, but currently consists of the following phases:
- Initial writeup: The incident commander gathers all of the facts and data about the incident. This includes the timeline, cause of the incident, what changes were made to resolve it, start and end time, who was involved, etc. We aim to have this completed and peer reviewed within two business days. If action items are identified that need to be addressed immediately, they are also generated and prioritized.
- Deep dive: Now that the facts are understood and agreed upon, the SRE team does more of an introspective pass. A collaborative discussion is carried out to identify any process weaknesses, patterns, and greater or underlying issues that may need to be addressed. Some medium- and long-term action items may also be captured.
- Review: Every two weeks, we have a scheduled incident response meeting attended by the Vice President of Engineering, all Engineering group leads, and our SRE lead. Everyone who participated in an incident response since the last meeting is invited for a discussion and review of the incident. This goes beyond Site Reliability Engineering and has participation from developers as well. By this time, much of the incident is ironed out, understood, and has associated action items. This is a last pass for questions, concerns, feedback, and a general way to keep tabs on how we’re doing with incident response.
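The facts the initial writeup gathers can be sketched as a simple record with a completeness check before the biweekly review. The field names below are illustrative, not our actual template:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Postmortem:
    """Facts gathered by the incident commander for the initial writeup.

    Field names are hypothetical; they mirror the items listed above
    (timeline, cause, remediation, start/end, responders, actions).
    """
    severity: str           # e.g. "SEV-1"
    start: str              # incident start time (ISO 8601)
    end: str                # incident end time
    cause: str              # what actually broke
    customer_impact: str    # also feeds external communication
    timeline: List[str] = field(default_factory=list)    # from the Scribe
    remediation: List[str] = field(default_factory=list) # changes made
    responders: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def ready_for_review(self):
        """Rough completeness check before the review meeting."""
        return bool(self.cause and self.timeline and self.customer_impact)
```

The check encodes the minimum bar from the initial-writeup phase: no review happens until the cause, the timeline, and the customer impact are written down and agreed upon.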
Through this process, we endeavor to make our software and services more reliable and resilient. By understanding what leads to a problem and going beyond a basic root cause analysis, we ensure that we are always moving in the right direction and guarding against future problems. This process can also help feed external customer communication, since our write-up includes a description of the customer impact.
Implementing new incident response and postmortem processes at FlightAware has brought much more rigor and structure to handling outages. We’ve gone from an ad hoc process to having defined roles and responsibilities. We’ve implemented postmortem processes to capture what broke but, more importantly, to identify the actions needed to guard against future failures.
By implementing everything outlined here, we continually iterate on minimizing impact, maximizing lessons learned, and improving software. So far, we’ve seen response times shorten (especially for escalations), visibility for stakeholders increase, and much more cross-team collaboration and ownership during and after postmortems. The impact is clear and positive, and I’m looking forward to how we can continually push the bar higher over time.