Flying through the Clouds: FlightAware’s Journey into Machine Learning at Scale

As a Senior Site Reliability Engineer (SRE), Diorge Tavares has been working with FlightAware’s Predictive Technologies team, using machine learning to bring flight arrival predictions to production in FlightAware Foresight.

FlightAware has been tracking flights since 2005. As you can imagine, we have a ton of flight tracking data; it’s in the petabyte range compressed! With a little machine learning, we could do some very interesting things with this data and help airlines optimize operations by providing better estimated times of arrivals (ETAs) for flights. This means that travelers can get to their destinations quicker—for example, an airline might have capacity to run two more flights in a day because they weren’t held up due to a poorly-estimated flight arrival; that’s a win for everyone.‍

The Predictive Technologies Team

By mid-2019, the Predict Team was running full force and brought our first client onboard: Frankfurt Airport. This marked a major milestone for something that had its roots in a hackathon idea. Today, major airlines are using our services at hundreds of airports. We started with about 200 models and have scaled our extract, transform, and load (ETL) process and live streaming machine learning and batch workflows to more than 1100 training models.

Cross-team SRE

The goal was to have the Predict Team deliver a product that was reliable, easy to deploy, and easy to support. To make this happen while also accelerating our pace of development, we decided to embed a site reliability engineer (SRE) on the team alongside developers. I volunteered.

I was looking forward to it and thought it would be fun, but I’ll admit it took a bit to wrap my head around this new machine learning world. I stared in confusion at everyone throwing unfamiliar jargon around with ease. I was still a part of the core SRE team, but my focus was on the Predict Team. I knew I was getting the hang of this new world when my core SRE team colleagues would reflect that same stare of confusion when I gave my morning standup updates.

By joining the Predict Team, I was able to help in several ways. For example, it takes time to funnel new work through the workload queue. Skipping this queue and giving the Predict Team instant access to getting operations tasks done allowed us to speed up the entire development process.

Since I would be involved from the beginning of an infrastructure buildout, I could provide reliability and redundancy expertise to make it a stable and reliable setup from the beginning. Our in-house live prediction streaming environment originally spanned two datacenters with six servers at each datacenter. That was later doubled to 12 at each datacenter, and ultimately processed more than 1.6 million evaluations per live streaming cluster per second.

One of the initial steps we took was to setup Jenkins to automate CI/CD pipelines for building and deploying images into the environment. This allowed for rapid and consistent deployments.

From a process improvement perspective, attending meetings for both the Predict Team and main SRE team allowed me to share processes and information between the two to improve both teams’ processes.

Machine Learning

The machine learning aspect started with a batch workflow stack with Hadoop and Spark systems for ETL and Docker to run inferences. Inferences are drawn when we give a model a new data set and allow it to infer flight arrival times or predictions. For example, we train a model with a year of data from January 2018 to January 2019 and then run an inference for February 2019 to obtain predictions for February and allow the model to work on new data. We needed a good bit of scale to perform these tasks en masse.

We started with about 200 airports with one “arrival time on runway” (or ON) model per airport. As we continued our work, we expanded the number of supported airports, and created a second class of models to estimate “arrival time at gate” (or IN). We ultimately supported 554 airports (1,108 models with two models per airport).

The second part came from our live streaming predictions. From our streaming API product Firehose, customers could access real-time predictions for flights in the air. This used the same models that are part of our batch workflow to follow live flight tracking data and create predictions. This service has a high-uptime SLA and cannot afford to take service impact from upgrades. We set up redundancy within and across datacenters and we could bring down a streamer node in each datacenter and still provide our streaming live prediction service.

To bring Foresight ETAs to production we had rigorous requirements. This service would be relied upon by airlines to determine flight schedules, delays (when needed), and could make a big impact on travelers and the airlines’ bottom line. This needed to be at launch a mature, high availability service. We made the decision to run these on-premises, but we needed to be able to perform maintenance and upgrades without service impacts. As a result, we set up redundancy within each of our two datacenters and across the datacenters. We can bring down a streamer node in each datacenter and still provide our streaming live prediction service.

Just tracking the metrics this created was a challenge. At any point in time, we need to view error rates on the predictions and then retain these so we can refer back to them as needed. Our main monitoring platform, Zabbix, couldn’t handle the massive amount of metrics we would send it. The solution was TimescaleDB (Postgres with a time series database extension) with a Grafana front end to generate the graphs for this. It worked well and could handle all of the needed metrics while at the same time not ballooning in size, which is something Zabbix would have ended up doing.

Did Someone Say Cloud?

It was clear that to create training sets and models in a reasonable time to address model drift, bug fixes, and feature requests by customers, we needed to massively scale. If we made a code change, we could not wait weeks to train our models and see how it worked. Also, it would be hard to meet client’s tight deadlines if it took us weeks to retrain our models and meet sales needs for updated prediction graphs for sales meetings. We had a need for speed.

We started with an in-house Spark cluster of about 12 machines. Just for a baseline, someone would spend a week to generate updated datasets to then turn around and run inferences for our batch process. Generating datasets for an entire year (the time frame we use to train our models) would take weeks, if not longer. I haven’t even touched on the topic training the models, which would take a 32+ core machine and keep at 100% CPU usage on all cores for over 24 hours, depending on the airport size for which we were creating a model.

Instead of using more on-premises hardware, we decided to evaluate some cloud providers (IBM, Azure, Google, and Amazon Web Services) to determine which would be the best fit for our needs factoring cost, support, scaling needs and workflow. I started this evaluation with the idea that I would get the best real-world results by running real-world workloads on them. We learned some surprising things.

The Cloud Evaluation

Our cloud stack included Apache Spark and Kubernetes to train our models with a custom docker image. It was our goal to be vendor agnostic; to accomplish this we opted to write our workflows so they could run in the cloud and in-house. Argoproj (a Kubernetes workflow engine that supports DAG and step-based workflows) was a big help in covering our model training, which had multiple stages we could batch with Argo and run both in-house and in the cloud. Even our managed Spark could still run it in our in-house Spark cluster.

Options 1 and 2: IBM and Azure

We decided against IBM and Azure. IBM’s workflow for Spark didn’t scale the way we wanted it to; it required obtaining a password for accessing each individual cluster. Since we used an individual cluster for each day of a training set for the second phase, if we were creating a dataset for a year that would require pulling in 365 (days in a year) passwords and switching between them if we needed to access the clusters to debug. This was handled better at Google or AWS with a service account that had permissions to the needed resources.

Azure required us to sign a million-dollar contract for our quota needs. Although they eventually provided an option to get around this by using a third party vendor, all the hoops they made us jump through pushed us in the direction of Google and AWS. We questioned what difficulties we might have if we needed more resources in the future.

Option 3: Google

Google gave us our first quota approval after a good bit of smaller requests and account team pushing. They got us up to 100,000 vCPU core quota that we could use. They asked for a lot of capacity planning meetings, but we were actually able to use our quota.

As we started using it, we ran into a few problems with Kubernetes on Google. First, the cluster would go into a non-responsive state. We determined this was because of upgrades that Google performs and for our usage we spin up Kubernetes clusters, run our batch training, and then spin it down. Although we configured for no upgrades to be done while our jobs were running, they were still happening across multiple zones. Google offered a few solutions, but they seemed to be temporary Band-Aids versus fixing the root problem.

Second, Google’s beginner-friendly approach wasn’t a good fit for us. We could get up and running faster, but it ended up being more difficult because our custom model training requirements didn’t always fit in the box of their AutoMagic setup. It would have been nice to be able to see under the hood and configure things as we wanted.

Option 4: Amazon Web Services (AWS)

AWS provided the most quota we had thus far, in the realm of 3600 instances (they were using instance counts at that time for quota or service limits) which equals about 345,600 vCPU cores. But when we tried to use them we ran into “out of capacity errors” in east-2 region. After a few calls we were recommended to switch to east-1 which had more capacity. This solved the problem.

AWS had great configurability and their debug log verbosity was unbeatable – it meant developers could debug problems faster and easier. Also, we found our models trained faster with AWS than Google Cloud Platform (GCP) resulting in actual cost savings, which was something we did not originally expect from AWS. We had the perception at first AWS would be more expensive, not less. This discovery was surprising.

Adding to all this, AWS allowed us to use spot instances (up to 90% discounted VM instances that can be taken back by AWS if needed, depending on demand for them) for an entire Spark EMR cluster. This was something that GCP, at the time, was limited in offering. The faster model training cost savings, in addition to using spot instances for the entire spark EMR and for model training, put us well below our projected cloud spending budget.

It was clear who won the great FlightAware cloud evaluation war of 2019: AWS.

ETL Improvements

Armed with the resources of the Cloud, we decided to tune and optimize our ETL code. In hindsight, we probably should have done that first in order to generate an accurate view of what resources we needed with optimized code. Our in-house ETL that used to take weeks took hours on the optimized code. It was shocking how much of an improvement we received from tuning our code.

Three primary changes made a difference:

Batching our ETL days together. This saved on multiple expensive reads to our s3 storage bucket at AWS. Our first iteration of this resulted in two million reads/writers to the s3 bucket, which AWS did not like. I remember seeing in the logs “Slow Down! You’re going too fast!” when our EMR jobs were failing after this improvement. We scaled it back so we weren’t hitting the s3 bucket over 30,000/sec rate they allow before throttling or denying requests.
Switching from strings to appropriate types based on data. There were parts of the Scala code that would store data in a string and then convert it to an integer multiple times. So we switched from strings to types that were a better fit for the data; sometimes “Double” or “ Long” types. Java Strings have about 40 bytes of overhead over the raw string data.
Caching. We went through our code and identified when to call .cache() on a dataframe so that we avoided doing a lot of expensive repeat work.

These ETL code improvements allowed us to take what would run in hours down to minutes (six minutes per dataset day, to be exact) in the cloud with enough resources. We settled on using a balance between cost effective and still efficient, which allowed us to complete a day or a year dataset within an hour.

Predict by the Numbers

These are some fun facts about our current production service for flight predictions. These numbers are after optimizations and what we ended up using of our cloud quotas.


Total Size on disk for all training sets for one generation of models	5.9 TB
Total number of models being served in production	1,108 (2,416 with redundancy)
How long it takes to train all models in the cloud	6 hours
Number of machines in our in-house spark cluster	55
Number of machines providing live prediction streaming	24
How long it takes to build a new training set for all models in-house vs in the cloud	In-house: 10 hours Cloud: 1 hour
Peak number of cores used in the cloud for Spark	17,472
Peak number of cores used in the cloud for training models	~ 8,000
Average number of tree evaluations per second across one live prediction streaming cluster during the day	> 1.6 million
Number of examples in the largest single training set	90,527,720
Number of examples across all training sets:	8,297,144,525 (took us 24 hours of runtime to compute)

The Future

We’ve come a long way and there is more to come; be sure that whatever we do, we will be soaring in the clouds with our machine learning at scale.

Angle of Attack