A few years ago I developed a fascination with a data set published by the Bureau of Transportation Statistics in the US Department of Transportation: "Airline On-Time Performance and Causes of Flight Delays: On_Time Data." This data usually attracts attention because of flight delays, but actually contains lots of broader information about US airports, airlines, routes, traffic patterns and even, to a point, aircraft utilization. As such, it's really a window into the whole topic of commercial passenger air transportation in the United States. Some basic characteristics of the data set are as follows:
|Time span covered||1987 to 2015|
|Number of scheduled flights||162,212,419|
|Number of aircraft tail numbers||14,858|
|Number of airlines||31|
|Number of airports||388|
|Number of airport (origin,destination) pairs |
with at least one flight
My interest in Apache Spark is no surprise to readers of this blog, but recently the two topics collided when I was learning how to run a Spark cluster through Amazon's Elastic MapReduce service, and read a blog post by Jon Fritz on the Amazon Web Services official blog. The post shows how to run some simple Spark SQL queries against a copy of this data set hosted on Amazon's S3 storage service, conveniently converted to Parquet for easy and efficient access from Spark. I had been shopping for a somewhat real-world project through which to study ways to write efficient computations in core Spark using Scala, and so a project was born.
Is this a good data set for Spark?
Admittedly, the on-time performance data is not huge. But, with a modest cluster, fairly simple queries against the full data set take several minutes, and complex queries, or simple queries written badly, take a lot longer. While there's twenty nine years of data, you can also have quite a lot of fun with a contiguous subset, say just one or two years, and simple queries against that run quickly on an affordable, well configured PC.
At first, the data may seem quite simple. Partly that's an artifact of the denormalization that plagues so many public data sets. But also, the structure of this data is subtle, with significantly graph-like structure at multiple levels. The airports and regular routes between them form a pretty interesting graph, with valuable data on both the vertices and the edges. Multiple flights can be linked by flight number or aircraft tail number. Finally, there are lots of interesting correlations (or absence thereof) with external data to be explored. Weather seems like a good place to start, but the demographic and economic data for nearby cities could be interesting too. I haven't tried using GraphX on this data set yet, but I'm really looking forward to it.
Running the code
I started coding on this over two months ago. I'm sure my explorations will provide material for quite a few posts, but for now I'd just like to introduce the project, which is available on GitHub. See the README for information about how to run the examples. To add another experiment you need to extend the CoreExperiment class, just like the existing core Spark examples do, and tell the 'registry' about it in the Flights class. Then you can either run all the registered experiments, or just the comma separated list you provide on the command line. The README explains how the output is organized. You don't actually need to use Elastic MapReduce to run the examples: you can either download the entire data set (a bit large) or use my ParquetSubsetMain utility to create a subset. Then you can either submit it to the cluster of your choice or use the "--local" flag to run it as a stand alone Scala program. During my own development, I use the latter technique: I run FlightsMain as a stand alone program, using a two-year local Parquet extract of the data. I'm only testing the code on Linux. When I run against the full data set I start an EMR cluster, use sbt's "assembly" command to generate a self-contained JAR, upload it to S3, submit FlightsMain to the cluster, and collect my output from S3.
Why so much framework?
Somewhat to my surprise, the code has ended up rather "framework heavy." This happened in response to goals I initially didn't know I had, but discovered along the way:
- A uniform way to run both core Spark and Spark SQL experiments.
- An easy way to get results, including performance measurements and diagnostics, back out to S3 without lots of maintenance.
- A way to run all the registered experiments or just specific ones, in a specific order, possibly with repetition to help obtain consistent performance results.
- Easy switching between local execution, with a development environment and a debugger, on a subset of the data, and execution on a cluster against all the data.
I think I'm really trying to study two things with this work: how to do real work with core Spark, and benefit from the efficiency advantages of doing so, without drowning in complex Scala code. Frankly, I'm not even sure how great the advantages of using core Spark are, or whether drowning in complex Scala code can be avoided, although I'll point out that the first question can be answered through measurement, while the second is rather subjective. I like Scala and core Spark, but they both take a lot of investment to learn to use well, and they're certainly not for everybody.
Over the next few posts I hope to illustrate some of the basic techniques of core Spark using simple examples for this project. I look forward to hearing from you about your impressions, and about your own experiences with using core Spark from Scala.