As FlightAware moves away from its monolithic Tcl tech stack to a distributed microservice architecture, many core services need to be split out from the monolith to keep the system running. Perhaps the most important is authentication: a software product needs a way of knowing who you are so that it can serve you appropriate, helpful, and actionable information, and so that your information does not end up in the hands of others (a topic about which we will have more to share later). To that end, this year we launched a new authentication solution thoughtfully designed around a modern approach to building web products.
Our previous solution for authentication was a Tcl library within our monolith. However, we are now moving towards serving multiple independent apps in our monorepo. We no longer use Tcl for building new services, so we needed a new approach to authentication. At a high level, this new approach had the following requirements:
- It needed to support passwordless logins so that FlightAware is no longer in the business of managing sensitive password credentials.
- It needed to have first-class support for Next.js, as our new apps are being built with that framework.
- It needed to support multiple apps well; if you are on one product and move to another, you should not have to sign back in.
- It needed to be usable in our Tcl monolith. You could perhaps call this item (3a), but it stands apart because the monolith is completely different from Next.js and, as of January 2025, serves by far the largest portion of our web traffic.
Early on in the project, we decided that the NextAuth.js library was the best tool for the job. It supported all of the login types we wanted without us having to write them from scratch, was designed with Next.js in mind, and offered JWT authentication, which seemed ideal for providing authentication services to multiple Next.js apps through a common API interface.
However, as the project progressed, we increasingly found some of NextAuth’s design decisions at odds with our requirements. It was incredibly difficult to control what NextAuth put in its refresh tokens, and their size was getting unwieldy. Performance concerns were popping up. Finally, the straw that broke the camel’s back was the realization of how much work it would take, using NextAuth alone, to let users manage their other sessions, particularly deleting a session other than the one they were currently using. The decision was made: we were ripping NextAuth out of the project and writing our own backend in Go.
We have already written extensively about Go at FlightAware. It is considered a first-class language in the organization, with plenty of custom libraries already written, such as support for OpenTelemetry, or integration testing with Redis and Postgres. While we were losing first-class support for Next.js (now we had to write a client library for downstream Next.js apps ourselves), we were getting the fine-grained control of our tokens that we wanted, as well as resolving our performance concerns overnight with the move to a pre-compiled, optimized binary. Next.js would act only as a frontend, leaving all authentication and database tasks to the Go backend.
With the Go backend, we were free to determine our own optimal authentication strategy. We already knew that we wanted to continue with the JWT strategy, because it lets apps authenticate you without constantly making database calls. However, we also wanted session lifetimes to be independent of JWT expiration, or even refresh cookie expiration. So, we implemented a JWT expiration of five minutes, after which your JWT is refreshed using a refresh token that is also stored in browser cookies. That call does hit a session database, but since it takes place at most once every five minutes, database load is much less of a concern. This strategy gives us the fine-grained control of session cookies with the optimized performance of JWTs. It’s nearly identical to the approach Clerk outlined in a blog post of their own, and seeing others arrive at a similar design independently increased our confidence in our decisions.
One of the biggest challenges in the project was architecting support for the Tcl monolith. Not only did we need to update the massive codebase to support a completely different authentication scheme, but we also needed to move every FlightAware user over to the new system seamlessly. As a result, we would need to support two completely different authentication systems for as long as it took to migrate everybody over. We set about this by silently integrating the monolith with the new authentication system over the summer of 2024. At that time, anybody who wanted to could log in using the new system, but since it was hidden, only our engineers were using it. Once we were ready to bring more people on and test the migration portion, we forcefully migrated all employee accounts. This was not without some hand-wringing, and we had to do the employee migration three separate times to address issues that came up. Each migration logged employees out, which was inconvenient, but thankfully everybody was a good sport about it.
Another unexpected challenge was moving every login link in the Tcl codebase over to the new system. The Tcl monolith is a massive combination of Apache Rivet pages, React apps (over 30 of them!), and Handlebars templates, each technology implementing its own method of link rendering. Some pages opened a login modal; others linked to a login page on the site. At the time all of those were written, the link was one of two possible constants, but as this was no longer the case, a lot of links had to be rewritten in a way that was surprisingly difficult to make reusable. Going into the launch, we were not 100% confident that we had addressed every single instance (it was also surprisingly difficult to grep), so we added some code to redirect the old login path to the new one as a safety net.
The migration used a custom endpoint that took the user’s old session cookie as well as some information about the user to verify the authenticity of the request, and then wiped the old session, replacing it with the new one, and redirected back to the page the user was on before. We also implemented a gradual rollout of the migration based on user ID, so if unanticipated issues arose during the launch, we had the option to pause it and keep the site running for the users yet to be migrated.
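A percentage gate keyed on user ID is one simple way to get a rollout that can be paused and resumed. The sketch below assumes numeric user IDs and buckets them with `userID % 100`. That bucketing rule and the name `inRollout` are assumptions made for illustration, not a description of the exact scheme we used.

```go
package main

import "fmt"

// inRollout reports whether a user falls inside the current rollout
// percentage. Bucketing by userID % 100 is an assumption of this sketch.
func inRollout(userID int64, percent int) bool {
	if percent <= 0 {
		return false
	}
	if percent >= 100 {
		return true
	}
	bucket := userID % 100
	if bucket < 0 {
		bucket += 100 // keep the bucket in [0, 100) for negative IDs
	}
	return int(bucket) < percent
}

func main() {
	// Count how many of the first 1000 user IDs are eligible at each stage.
	for _, pct := range []int{1, 25, 100} {
		eligible := 0
		for id := int64(1); id <= 1000; id++ {
			if inRollout(id, pct) {
				eligible++
			}
		}
		fmt.Printf("%d%% rollout: %d of 1000 users eligible\n", pct, eligible)
	}
}
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users, and holding it steady pauses the rollout: users who have not been reached yet simply stay on the old system.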
Launch day went about as well as one can hope. Migration of users began in the morning, and after migrating only 1% of users, we found issues that could not possibly have been found in employee testing, caused either by undocumented bespoke account configurations or by significantly increased system load. We were able to fix them on the spot and complete the migration that afternoon.
So, what did we learn from this project?
Perhaps the most important takeaway was to keep Tcl things in Tcl. As a part of our modernization effort, we adopted a rule that we would not write new code in Tcl unless absolutely necessary. However, the monolith still needs to run and deliver for customers while we work on the pieces of its replacement. In the design process for these core services, it can be tempting to follow this rule to the letter and change how we would build our authentication app to support the monolith better as-is. However, this would be at the expense of the developer and user experiences with the new apps. Not being afraid to build new things in Tcl when it is appropriate paid dividends and will continue to do so.
We also saw the value of owning your core services. We built this authentication solution in about one year with about three FTEs. Although everything we build incurs a maintenance debt, we can be confident that the service will not require the same amount of engineering resources going forward. Compared with the alternative of handing authentication off to a third-party vendor, the organization is seeing six-figure cost savings annually from this choice alone.
Since authentication is a core dependency of delivering services to customers, the service working correctly must take precedence over rushing its delivery. I don’t think we could have done anything differently here, but I do think we felt the pain of not getting this into the hands of users as quickly as possible, as evidenced by the bugs we could only find on launch day.
Although I had some experience modifying small parts of other Go projects at FlightAware, this was my first foray into a full Go project written from scratch. Coming from Next.js as my bread and butter, I had to leave the comfort zone of a metaframework (since Next.js is a framework on React, which is itself a JS framework) and learn how to work outside of these constraints. Ultimately, I found this to be freeing, and highly recommend the language.
I’m excited to see how what we’ve learned from this project informs the launch of our new flight tracking experience, built with the same modern principles and technologies, in 2025.