Managing a Technical Transformation (Part 1)

Jonathan Cone is Vice President of Engineering at FlightAware. He has been leading the organization through a technology transformation as it evolves both its products and technical stacks.

Where Did We Start?

When I joined FlightAware almost seven years ago, I knew from the interview process that they used a scripting language called TCL. Before my interviews, I’d never heard of TCL, but I investigated the language before I started, and it seemed straightforward. Sure enough, after joining I became effective in TCL within a few short weeks, but it would be a couple of years before I could say I had reached an expert level. This journey was a common one at FlightAware, as most of the technical stack is written in TCL, and you are unlikely to hire engineers with experience in the language. So, ramping up on the language was baked into the onboarding experience.

Now, it isn’t quite true to say that TCL was the only language in use when I joined FlightAware. There was of course the website, and while the backends were all written in TCL, the frontend was a mix of jQuery, JavaScript, CSS and HTML. There was some early work going on with React, and that would be increasingly utilized over the next few years. We also used Java for some feed ingestion where a customer was providing a WebSphere MQ interface. So not everything was written in TCL, but around 90% was (not a fully accurate number, but illustrative of the amount of TCL code).

If you are wondering why the company had built its stack on top of TCL, the answer is straightforward. TCL was the language that the founders knew best and could use to quickly develop a flight tracking application back in 2005 when they were launching the site. And it turns out you can do quite a bit with TCL. It’s an extensible language, and the company had been able to create libraries where needed to meet its own needs while also supporting the wider community. In fact, if you go and look at FlightAware’s public GitHub account you will see that most of the repositories listed there are various TCL extensions and libraries. And the language can be very performant (at least for a scripting language), allowing FlightAware to process its global data feed in real time using a variety of TCL programs within a data processing pipeline.

The Transformation Starts

This blog post isn’t about TCL and how everyone should be using it, though; rather, it is about FlightAware’s shift away from TCL. So first let me say, I have no bones to pick with TCL. I have used it for years, and it is still a language I’ll utilize for certain tasks (especially small analyses where I know all the library calls off the top of my head). But, as the volume of data we are processing has increased, and our need for greater performance, scalability, and tooling has expanded over these past seven years, we have reached the limits of TCL’s capabilities. Because of that ceiling, we have increasingly found ourselves utilizing new languages to solve technical problems at FlightAware. One of my first blog posts was about our shift to C++ for Firehose (FlightAware’s streaming API) because of performance requirements for that service. This was back in 2018, and did not mark any seismic shifts at FlightAware, but was an early indication of things to come.

Within a year or two of that Firehose rewrite, we would see the inclusion of Python as an officially supported language. One of our Engineering directors at the time lobbied for that change following the development of our Machine Learning (ML) ETA models. This was a case where we did not really consider TCL for the application, as the ML community had already built out the tooling for those applications in Python (and other languages), so the investment needed for TCL to support the same would have been huge. That ML work demonstrated the value of using programming languages supported by a wider community. With Python, there is almost always at least one popular library for most common tasks and integrations. For example, there is more than one Kafka client library that we can use out of the box. When we first started using Kafka at FlightAware, we had to develop our own client library, wrapping the Kafka C client library with a TCL extension. That also meant we had to maintain that client library, updating it whenever a new version of the Kafka C client was released (in theory at least, I don’t know that we were quite that diligent). This is a core advantage that Python or other popular languages have over TCL: they have a wide community of contributors using the language to solve problems and, therefore, the support libraries already exist for most of the integrations we need. There were other advantages to Python too: the language has linters, static code analysis tools, debuggers and more. These are the things we dreamt of having for TCL, but the investment to add them was more than FlightAware alone could shoulder. The inclusion of Python in FlightAware’s mix of languages was an exciting development, and we all took advantage of it to write new code in Python when possible (there is also the question of how we ensured that engineers were knowledgeable about Python, and we used Alliances to effect that knowledge transfer).

Accelerating the Transformation

Up to this point, I do not believe the language shift at FlightAware was part of a deliberate effort to move away from TCL. The inclusion of new languages (C++, Python) had been driven by specific needs (performance, library support, etc.) and bottom-up efforts. There are a few more instances of this. We utilized Haskell for some applications where parallel processing was a key feature and started experimenting with Rust as a systems-level replacement for C++ with better memory safety. These experiments and bottom-up efforts were happening during the COVID period and were followed by some significant changes to the company itself (acquisition by Collins Aerospace in 2021).

Coming out of that period of change and uncertainty, we made our first high-level decision to shift a significant portion of our codebase away from TCL. We called this first effort WebNXT, as the goal was to reimagine FlightAware’s web stack utilizing modern languages, frameworks and tooling. We focused first on the web stack as that was an area where we were acutely feeling the pains of TCL. There had been significant improvements to web frontend and backend technology over the previous 10 years, so the gap between the capabilities of our existing stack and those of a modern stack was particularly painful. This was especially true for new hires who were accustomed to a better developer experience; for them, the TCL stack was a real frustration.

Chart a Path Forward

So, at the 2021 company Town Hall, our CTO laid down a marker: we would embrace a technical transformation away from our existing web stack and toward something new. But what would this new stack look like? To be fully transparent, we didn’t really know yet. We knew all the pain points with the existing stack, and we knew there were options out there which could improve life in several ways, but there wasn’t just one answer to that question, and we did not have a clear solution in mind yet. We did know that we were going to use React for our front-end applications, but that was only part of the answer. There are lots of additional tools and libraries you can use to flesh out your front-end stack, and that remained an open question. We thought we wanted to switch to using TypeScript since React supports it, so that was fairly certain. We were especially uncertain regarding our data backends. Did we want to use Node.js for our backends? Probably not for everything, but maybe in some places. What would we use for our core flight data backends? The list of questions went on and on.

At this point I was just a spectator to this process, but little did I know it would become a focus of my life in a few short months. With the organizational changes happening following FlightAware’s acquisition, I moved into one of the company’s Director of Engineering positions and took on the web team as part of my responsibilities. Prior to that transition though, the team had decided that it would be best to test out a new web stack outside of the main FlightAware web stack as a starting point. Conveniently enough, we had a project we were working on at that time that would need to have an independent technical stack, so we used that as a testing ground.

Some issues were settled early in that experimental project. We would use TypeScript for our React work. We like the strict typing and the way it helps us catch bugs before they surface at runtime. We would come to find that the lack of strict typing is one gripe we have about Python (yes, I know you can accomplish it in Python, but most of our code is not annotated properly to achieve that end, and it requires a fair amount of overhead to achieve full type coverage). We also settled on AWS as our cloud provider and had our first introduction to running web services in cloud environments.
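To make the Python typing gripe concrete, here is a minimal sketch. The FlightEta record and describe function are hypothetical names invented for illustration, not anything from FlightAware’s codebase. It shows that Python type annotations are hints only: the interpreter does not enforce them, so a mistyped value passes through unless a separate static checker (mypy is one example) is run over fully annotated code.

```python
from dataclasses import dataclass


@dataclass
class FlightEta:
    # Hypothetical record type for illustration only.
    ident: str
    minutes_remaining: int


def describe(eta: FlightEta) -> str:
    """Format an ETA for display."""
    return f"{eta.ident} arrives in {eta.minutes_remaining} min"


# Correctly typed usage.
ok = describe(FlightEta("FA123", 42))

# The annotation says int, but Python does not check it at runtime,
# so this runs without error. A static checker run as a separate step
# would report the mismatched argument type, provided the code is
# annotated in the first place.
wrong = describe(FlightEta("FA123", "forty-two"))

print(ok)
print(wrong)
```

In TypeScript with strict settings, the equivalent mistake would typically be rejected at compile time, which is the difference in developer experience described above.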

Not everything we tried panned out, and most of our web team had been with FlightAware for the past 7, 8 or 9 years, so we did not have huge exposure to the latest developments in front-end technology. This meant we were spending a fair amount of time experimenting and having to reverse course. What we really needed were people who had that experience already.

So, we hired to fill those knowledge gaps. We had existing requisitions for two managers within the group, and we utilized those requisitions to bring in people with experience transitioning from legacy technology stacks and building modern web applications. This meant updating our job postings to include new nice-to-haves like experience with cloud computing (AWS/GCP/Azure), Infrastructure as Code, modern JS frameworks, GraphQL, etc. Luckily, we were able to find people with those backgrounds.

Just because you have hired people with the experience your team needs, though, doesn’t automatically mean they are going to be successful. They will need some time to build relationships and trust within the organization, and they will need your backing to propose this new direction for your engineering stack. We did not have tremendous time to accomplish that, so we put our new managers in positions to demonstrate their capabilities quickly. At FlightAware, we expect our managers to also be individual contributors for some portion of their time, and our new managers were able to demonstrate their coding chops and the benefits of their proposed approaches early by jumping in the fray. This meant they built the credibility needed when we began proposing what the FlightAware web stack would look like in the future.

Now that we had the talent in place and the relationships built to put some real definitions around WebNXT, what did we determine? We (I’m using “we” very loosely here; I reviewed, gave blessings and asked questions during this process) proposed the following high-level vision for WebNXT:

We would build our web applications following cloud native principles where we use containers, service meshes, microservices, immutable infrastructure, declarative APIs, continuous delivery, and automation.
Web applications would be standalone, isolated services with separated frontends and backends.
All new web applications and any major changes to existing applications would be done using the WebNXT methodology. Only small changes or exceptions (where it is universally agreed) would use TCL.

We also defined more specifics around what the architecture would look like, but that last point above was the crux of the vision and the most radical part. We would move on from the existing TCL stack and start fresh. That is a consequential decision given the investment in our existing codebase and the effort required to move functionality into a new paradigm.

We did not come to that decision lightly, and we spent a fair amount of time investigating and building tooling for embedding TCL interpreters so we could re-use code. However, when we decided on a cloud native approach, this really meant building applications that did one thing and did that one thing well. Unfortunately, that was not the approach we had used for much of our TCL development over the years, so we were going to have to peel apart the onion. I think of this as the “rip the band-aid off” approach, and it certainly carries risks, but in our case the risks associated with rewriting functionality were smaller than those of continuing within the previous paradigm.

Next Steps

With this high-level plan in place, we went about building buy-in and executing on our plan. There have been several successes and challenges along the way, and I’ll cover those in a subsequent post. We are now about halfway through the transformation, and on the cusp of delivering FlightAware’s website based on the new stack and methodology. That will be an exciting achievement, and I look forward to sharing more details on that process soon.