Murphy's Endgame: It'll Definitely Go Wrong If We Never Even Try To Fix It
Southwest had a major outage at the end of December 2022 that ruined holiday travel plans for tens of thousands of customers. Then, at the beginning of January 2023, the FAA had to ground all flights because of an outage of their own. Coincidence?? ...well, yes. But also: commonalities.
So there were some interesting oopsies in the kind-of-important-not-to-have-oopsies airline industry at the end of 2022 and the beginning of 2023. I'm going to highlight two of them that you probably heard about: the Southwest meltdown that caused something like 65% of flights to be canceled over the holidays, and the failure of the FAA's Notice to Air Missions (or NOTAM) system. That one caused a 2-ish-hour “pause on all domestic departures.” These were both bad, for different reasons. But as we will see, they share some unfortunate commonalities, not only with each other, but with a great many other organizations and businesses around the globe.
TL;DR: Both of these systems are admittedly complex. But complexity is not what caused the problems. The problems happened because the organizations responsible for these systems de-prioritized them and then generally forgot about them. Let's take them one by one.
Part 1: The Southwest Kablooie
Let's get this out of the way first: Running an airline is complex. Southwest is more complex than normal, though. Simply put, their planes do not fly the same kinds of patterns that other national airlines use. Other airlines have a hub and spoke model. They organize some number of flights from, say, Denver. Then they have spoke flights that go out to an end point, and back to the hub. The end points can be other hubs, but they're only ever one link long. The flights are out-and-back, out-and-back. The busier the route, the more regularly it gets flown.
Southwest operates in a point-to-point model. Every flight is a direct flight from one city to the next, and the planes do not fly out and back; they fly in loops. So a plane could be scheduled to start in SFO, go to DEN, then to CLT, then back to SFO. And based on needs, this loop could change every day. Like I said... complicated. Add in the human element (planes can't fly themselves, and humans need to sleep, ideally at home, whenever possible) and you can see how it can get messy, fast.
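To make the difference concrete, here is a deliberately tiny toy model in Python. It is my own sketch, not how SkySolver or any real scheduling system works: the `simulate` function, the one-plane schedules, and the "closed airport" rule are all assumptions made up for illustration. A leg is canceled if either end is closed, or if the plane is not where the leg starts.

```python
def simulate(legs, closed):
    """Return the legs a single aircraft actually flies, in order.

    legs   -- list of (origin, destination) pairs, in scheduled order
    closed -- set of airport codes that are shut (e.g. by weather)

    Toy rules (assumptions, not real airline logic):
    a leg is skipped if the plane is not at its origin,
    or if either its origin or destination is closed.
    """
    position = legs[0][0]  # the plane starts where its first leg starts
    flown = []
    for origin, dest in legs:
        if position != origin:
            continue  # plane is out of position, leg canceled
        if origin in closed or dest in closed:
            continue  # airport closed, leg canceled
        flown.append((origin, dest))
        position = dest
    return flown


# Out-and-back from a Denver hub: DEN-SFO-DEN, then DEN-CLT-DEN.
hub_and_spoke = [("DEN", "SFO"), ("SFO", "DEN"), ("DEN", "CLT"), ("CLT", "DEN")]
# The loop from the paragraph above: SFO to DEN to CLT and back to SFO.
loop = [("SFO", "DEN"), ("DEN", "CLT"), ("CLT", "SFO")]

hub_flown = simulate(hub_and_spoke, closed={"SFO"})
loop_flown = simulate(loop, closed={"DEN"})

print(f"hub and spoke: {len(hub_flown)} of {len(hub_and_spoke)} legs flown")
print(f"loop:          {len(loop_flown)} of {len(loop)} legs flown")
```

In this toy, closing one spoke costs the out-and-back plane only the legs that touch that spoke, because it never left the hub. Closing one stop on the loop strands the plane at its starting point, so every later leg is lost too. To be fair, closing the hub itself would ground the out-and-back plane just as thoroughly; the sketch only shows why a looped schedule has more places where one closure takes out a whole day.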
Luckily, Southwest has a system to handle this. The system is called “SkySolver,” and it is built from a commercial off-the-shelf package that is sold by GE and was “heavily customized” by Southwest. SkySolver was first offered by GE in 2006, and while I can't find a definitive date, my guess is that Southwest's version is from somewhere between then and 2010.
And the system works… pretty well. The software is slow, and if it breaks, gate attendants are reduced to working reroutes via a glorified phone tree. But it works… when the weather is good. I saw someone online call Southwest “The best fair-weather airline in the world,” which, as you'll see, is quite the backhanded compliment. The problem with this system is that if there is an incident anywhere, it has the capacity to affect traffic everywhere. And if there are problems in more than one place? Forget about it.
And what happened over this past holiday season? The weather was bad. In more than one place. The system then... stopped working. The end result was approximately 17,000 flights being canceled between Dec 21st and the end of the year, which, in parlance, is bad.
The problem was simple: SkySolver couldn't keep up, and eventually it crashed completely. There are reports from pilots and gate attendants, going back as far as 2015, that SkySolver was inadequate for the Southwest network. The issues occur on a small scale regularly. This is how we know about the phone tree: employees complain about it all the time online. And it's not even that this December's incident was an outlier. There was a SkySolver-based spate of flight cancellations just around this time last year, for exactly the same reasons.
There have been no serious efforts to replace SkySolver, which is problematic because SkySolver is end-of-life. All there have been are patches and vague statements that they would 'do better.' The head of the pilots' union didn't mince words, saying: “The company has had its head buried in the sand when it comes to its operational processes and IT,” which is probably a familiar refrain to anyone who's worked with enterprise-level IT.
This is a combination of tech debt and skin-flintery when it comes to hiring enough staff. Not only was this failure a predictable occurrence, many Southwest employees warned that it was going to happen, and the business did nothing. But it's ok though: if you check Twitter you'll see that they're 'really, really sorry.' Lost revenue, refunds, and a goodwill credit program are estimated to end up costing Southwest nearly a billion dollars. That's 11 months' profit wiped out because of the “cost savings” of not aggressively replacing SkySolver before it was too late. Ouch.
The really sad thing about this? CEO Bob Jordan is a freaking 34-year veteran of Southwest. He has a computer science degree and worked for real as a programmer and tech lead at HP and at Southwest before becoming an executive. If ANYONE was positioned in such a way as to understand something so basic as “Tech Debt is Bad,” you would think it would have been him.
Alas. On to Part 2. Let's talk about the FAA's failure from the beginning of 2023.
Part 2: The FAA's NOTAM Imbroglio
As most people probably know, the FAA is intimately connected to the airline industry. In short, the FAA basically owns the skies. They provide clearances and approve flight paths, and generally work to make sure planes are not taking off, landing, or flying anywhere near any other planes. To do this they rely on a lot of systems.
One of these systems is NOTAM. This stands for Notice to Air Missions, and it provides real-time updates to pilots about abnormal situations or hazards. These can range from big obvious things like a disabled plane causing the closure of a runway, down to big non-obvious hazards like changes in the path of a flock of birds. Pilots are legally not allowed to take off until they're up to date with the relevant NOTAM alerts. For busy airports, there can be as many as 200 NOTAM alerts that have to be read and acknowledged. So if you're ever wondering why the plane is sitting still doing nothing instead of taking off? It could very well be a big ol' pile of NOTAMs.
The system started off as a phone-based way to notify airports about issues, and was eventually migrated online. How it was brought online is still a matter of debate, but it appears that it was a home-grown solution. If the rumors are true, the NOTAM system that failed was written in Ada and running on olllllllld hardware.
So it appears that a “damaged database file” was introduced into the system. This caused the system to go into “read only” mode, and no new NOTAMs could be added. No new NOTAMs = FAA holding every domestic departure in the US for something like 2 ½ hours.
The speculation on what “a damaged database file” means is running rampant: Was it a corrupted input of a previous NOTAM entry? Was it a bad block that was backed up before it was read by the system? Was it Oracle-esque in the sense that a software bug inadvertently caused the system to corrupt its own data? We simply don't know. All we know is that the damaged file was there long enough to be in the standard backup. The NOTAM primary system was rebooted and “not properly validated,” and when it came back up it was unable to be updated.
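For anyone wondering what "validating" a backup can look like at its most basic, here is a minimal Python sketch. To be loud about it: this is not the FAA's process, and nothing here reflects how the NOTAM system actually works. The file name `notices.db`, the helper names, and the idea of recording a checksum while the file is known to be good are all my own assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_to_restore(backup_path, known_good_checksum):
    """Hypothetical gate: only restore a backup whose hash still matches."""
    return sha256_of(backup_path) == known_good_checksum


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical database file, invented for this example.
    db_file = Path(tmp) / "notices.db"
    db_file.write_bytes(b"runway 27 closed\n")

    # Record the checksum while the file is known to be healthy.
    good = sha256_of(db_file)
    ok_before = safe_to_restore(db_file, good)

    # Simulate damage: one byte in the file gets mangled.
    db_file.write_bytes(b"runway 27 clo\x00ed\n")
    ok_after = safe_to_restore(db_file, good)

print("healthy copy passes:", ok_before)
print("damaged copy passes:", ok_after)
```

The catch is that a check like this only detects damage that happens after the checksum was recorded. If the file was already damaged when it was backed up, which is one of the possibilities floated above, the checksum would happily match the damaged copy, and you would need the application itself to read and sanity-check the restored data.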
The solution to the problem? I kid you not... it basically boiled down to “turn it off and turn it back on again.”
(And yes, according to people in the industry, there was a NOTAM released about NOTAM being down. Obviously it didn't go out on time, but you do have to love the commitment to the process.)
Now, one likely reason this occurred is that there are multiple systems in the NOTAM ecosystem. There's the old system (USNS), and there is the “new” system (FNS). I just did gigantic air quotes because FNS was implemented in 2013. The systems are both running in parallel at the moment, as the migration to FNS is still not complete. Oh, and there's even a new new system, called SWIM, that's slated to be active in 2025. (I'm not gonna go into what those acronyms mean because it's not really relevant.) But the failure appears to have occurred in the older USNS system. Which raises the question: was USNS still getting updates at the same rate and with the same documentary rigor as the other systems? Again, we simply don't know.
What is unquestionable is that there are a lot of problems with validation, updates, and backups in the system. Clearly there were a lot of steps that were missed, and because there is an ongoing (9 years' worth of ongoing, apparently) migration happening, there are probably a lot of kludges that were introduced and never documented.
What kind of commonalities can be drawn from this? To me it's simple.
The problem in both of these cases is organizational. They're both cash-based and what I'd like to call toxically risk-averse. In Southwest's case, they chose not to fund the updating/replacement of SkySolver. In the FAA's case, they likely barely have enough funding to stay afloat. In both cases they had a solution that was, to put it kindly, rickety. And those are the worst kind. You get to a point where you're too scared to touch it, because if it breaks you don't know that you'll be able to fix it. This means no new updates, no new patches, no nothing.
But these two incidents also show the risks of doing nothing. Yes, you solved a crazy complex technical problem. But if you don't update as things change, then now you have another problem. One that's gonna be even harder to solve, the longer you ignore it.