Thursday, November 6, 2014

An easy way to start learning Spark


Getting started with Spark

Big data computing systems like Hadoop and Spark tend to be large and complex, making it quite hard to start learning how to use them. And, of course, to solve large problems you really do have to tackle this complexity, as you'll need a cluster large enough for your problem, and realistic enough to measure and tune the performance of your solution. But there's a lot to learn before taking on huge problems, and it's useful to have an easy "on ramp" to get started. Then, as you build skills and competence you can graduate either to using a cluster that somebody else has configured, or configuring one yourself.

Furthermore, the Hadoop and Spark projects have been rather Linux-centric, whereas most of us have easier access to machines running Windows and Mac OS. Again, configuring a production cluster will usually involve Linux, but it's helpful to get started on a system you know well.

In this, the first of a series of posts on learning Spark, I'm going to use an approach that doesn't require you to set up a cluster or build complex software from the sources. It should also work equally well on Windows, Mac OS and Linux -- and I am in fact developing the examples on Windows 8.

Installing the software

You need the following software to follow along:
  1. A Java Development Kit -- I'm using 1.7.0 
  2. A set of Scala binaries -- I'm using 2.10 
  3. IntelliJ IDEA development environment -- I'm using 13.1 Community Edition
  4. The IDEA Scala plugin -- instructions available here
Notice there's no need to download Spark itself.

The code for the examples is available at https://github.com/spirom/LearningSpark. (You will need to make minor adjustments for Mac OS or Linux.)

Creating a project

Create a Scala sbt project in IDEA. While the programs we write will be very simple, we will use sbt to manage our dependencies on the various Spark libraries.




After doing this, wait a couple of minutes for IDEA to create the folder structure of a standard project (it tends to behave like it's ready even when it isn't).


Getting Spark

Once the folder structure has been created, you'll see that a build.sbt file exists at the top level.

Edit it to create a very simple project that depends on the Spark core package, version 1.1.0, built for Scala 2.10.

name := "LearningSpark"

version := "1.0"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
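
Depending on the options you chose when creating the project, IDEA may already have written a scalaVersion setting into build.sbt. If it hasn't, it's worth pinning it to a 2.10.x release yourself so that it matches the "_2.10" suffix on the Spark artifact (2.10.4 here is just an example -- use whichever 2.10.x release you installed):

scalaVersion := "2.10.4"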

When you save this, IDEA will prompt you to refresh it, and then spend some time downloading everything you need. That's all there is to it: no building, no cluster setup, and no daemon/service management. You're ready to write and run your first Spark program.


Your first Spark program

Right-click on the src/main/scala node in your project, select New/Scala Class, and call your class Ex1_SimpleRDD. Now paste the following code into it, replacing anything IDEA already inserted.

import org.apache.spark.{SparkContext, SparkConf}

object Ex1_SimpleRDD {
  def main (args: Array[String]) {
    val conf = new SparkConf().setAppName("Ex1_SimpleRDD").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val numbers = 1 to 10
    val numbersRDD = sc.parallelize(numbers, 4)
    println("Print each element of the original RDD")
    numbersRDD.foreach(println)
  }
}

Notice how we've defined an object rather than a class -- so we can define a main method and get running right away. Let's look at this method in some detail.

The first two lines set up a local Spark environment using four threads. This is all we need to get started. The next line defines a sequence of numbers called 'numbers'. Now we come to the essence of Spark: the Resilient Distributed Dataset, or RDD. This is the data structure on which all Spark computing takes place. In this case we ask Spark to take a regular Scala data structure and turn it into an RDD using the parallelize method on the SparkContext. We can let Spark choose the number of partitions by default, or specify it ourselves as the second parameter.
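
If you're curious what Spark chooses when you leave the partition count off, you can ask the RDD itself. A quick check along these lines (an aside, not part of the example program) should report four partitions when the master is local[4]:

val defaultPartitions = sc.parallelize(numbers)   // let Spark pick the partition count
println(defaultPartitions.partitions.length)      // expect 4 when the master is local[4]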

To keep these programs brief, simple and as close to idiomatic Scala as we can, we won't put type declarations on our definitions. But it's helpful to know the types, and IDEA helps with this -- position your cursor in the middle of 'numbersRDD' as shown below and hit Alt-= (Windows) or Ctrl-Shift-P (Mac OS). You'll see that it's an RDD containing Scala Int elements.
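
For reference, if you did want to write the types out, the two definitions would look something like this (just a sketch of the annotated equivalents):

import org.apache.spark.rdd.RDD

val numbers: Range = 1 to 10
val numbersRDD: RDD[Int] = sc.parallelize(numbers, 4)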


Finally, in typical Scala style, we can print each element of the RDD. 

Run it

You're now ready to run your first Spark program. Right-click on the object you just created in the project explorer and choose Run 'Ex1_SimpleRDD'.

You should see something like the following: 



This is an awful lot of logging, and ultimately much more than we'll want to see, but it's interesting to look at once. Notice how much work seems to get done just to print ten numbers; we'll delve into that more deeply later.

Also notice that the numbers appear out of order. It may be tempting to reach all sorts of conclusions about RDDs, the way they're stored, and what order information may be lost when they're created. But those conclusions aren't valid: the RDD is a parallel data structure, and the foreach method causes the println calls to run in parallel on each of the four partitions. Not only do those four print loops start in arbitrary order, but their outputs are interleaved with each other. Run this example a few more times, and you'll see that the order keeps changing. Later we'll see that the actual order of the elements in the RDD has NOT been lost.
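
If you want to see for yourself where each number ended up, here's a small side experiment (not part of the example program) using the RDD's mapPartitionsWithIndex method; collecting the result back to the driver keeps the output in a stable order:

// Tag each element with the index of the partition it lives in.
val tagged = numbersRDD.mapPartitionsWithIndex((index, iter) =>
  iter.map(n => "element " + n + " is in partition " + index))
tagged.collect().foreach(println)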

Tune log verbosity

Now it's time to reduce the logging verbosity to warnings and errors only. Create the file src/main/resources/log4j.properties.



Here is a reasonable starting point for the contents.

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN

If you run the example again you'll see a much less cluttered response (with yet another order of numbers). 
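
If you'd rather not add a properties file at all, one alternative (a sketch, not what this series uses) is to set the levels programmatically at the top of main, since Spark 1.1 logs through log4j:

import org.apache.log4j.{Level, Logger}

// Quiet down Spark's and Akka's own loggers before creating the SparkContext.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)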

Do some computing

Next you're ready to actually compute something. If you add the following few lines, you can compute a transformed RDD, this time containing Scala Doubles (remember to check its type), where every element has been divided by 10. This transformation runs independently on each partition, without requiring any data transfer between the partitions. The next line gathers all the results into a regular Scala array (remember to check its type too), so when we print its contents they come out in order -- the order wasn't lost after all.


val stillAnRDD = numbersRDD.map(n => n.toDouble / 10)
val nowAnArray = stillAnRDD.collect()
println("Now print each element of the transformed array")
nowAnArray.foreach(println)


Here's the output:


Now print each element of the transformed array
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
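
One caution before moving on: collect() brings the entire RDD back to the driver, which is fine for ten numbers but not for a dataset large enough to need a cluster. For anything big you'd usually aggregate on the workers or pull back just a few elements, along these lines (a sketch, not part of the example program):

val total = stillAnRDD.reduce(_ + _)   // aggregated on the workers; only the result comes back
val firstFew = stillAnRDD.take(3)      // brings back just three elements
println("total = " + total + ", first few = " + firstFew.mkString(" "))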

Start understanding RDDs and partitioning

To end your first encounter with Spark, let's use RDD's 'glom' method to see what's in the individual partitions of 'stillAnRDD' and print the results as in the following code. 

val partitions = stillAnRDD.glom()
println("We _should_ have 4 partitions")
println(partitions.count())
partitions.foreach(a => {
  println("Partition contents:" +
    a.foldLeft("")((s, e) => s + " " + e))
})

Hopefully, you've already checked the return type and seen that it's an RDD[Array[Double]], where each element of this new RDD is the entire contents of a partition, collected into an array. Here's the expected output:

We _should_ have 4 partitions
4
Partition contents: 0.6 0.7
Partition contents: 0.8 0.9 1.0
Partition contents: 0.3 0.4 0.5
Partition contents: 0.1 0.2

Notice how the elements of each partition are in order, but (except when you get lucky) the partitions themselves are not. Remember, 'partitions' is still an RDD, so the foreach loop runs in parallel. It's just the foldLeft that runs sequentially on the contents of each of the four arrays.

Also notice the distribution of elements across the partitions: not too bad at all -- this will become interesting when, in later posts, we look at partitioning and its effect on performance.
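
If you just want the sizes rather than the contents, a one-liner like this (an aside, not in the example code) collects the length of each partition's array:

val sizes = stillAnRDD.glom().map(a => a.length).collect()
println("Partition sizes: " + sizes.mkString(" "))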

Finally, notice how using a single println for each partition keeps the outputs from getting interleaved.
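
Incidentally, the foldLeft in the example is just one way to assemble that string; if you prefer, Scala's mkString does the same job more concisely:

partitions.foreach(a => println("Partition contents: " + a.mkString(" ")))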

What you've learned

Congratulations: without building anything from sources, setting up a cluster or installing and configuring anything more complex than a Java/Scala development environment you've written and run your first Spark program. Eventually, to solve real problems and diagnose the performance of your solution you'll need a cluster. But you may find an awful lot of your Spark development (and especially exploration and learning) gets done in this simplified environment from now on.
