/If you're reading this and wondering how to get set up on a cluster, you might want to read this guide to getting Apache Spark set up on your cluster./
While the Spark documentation can definitely be helpful at times, it doesn't always include user-friendly guides for the simpler things. This guide will walk you through how to build Spark for your local machine. You might want to do this to try out Spark, write some basic code - that sort of thing. We'll go through how to build it for Scala 2.10 and 2.11, and the two builds are pretty similar.
/If you're just arriving here and haven't played with Spark before, I recommend downloading one of the pre-built distributions - it'll save you a lot of time and headache!/
You're going to need Scala and Maven installed on your machine to build and use Spark. You can find an installation guide for Maven on the Apache Maven website. You'll likely have to set one or more environment variables along the way. Feel free to leave a comment if you're having trouble getting it installed or set up correctly.
First, we've got to download the Spark project. While you can download prebuilt Spark packages for certain Hadoop distributions, I always like to start with the raw package and build it out to meet my requirements. So head on over to the Spark Downloads Page and get the primary package. One thing that trips me up a fair amount is that the link under step 4 is just a link to a list of mirrors - not the package itself. Click that link, then pick a mirror to download from. Once you've downloaded it, go ahead and verify it with a checksum.
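For the checksum step, here is a minimal sketch of how the comparison could be scripted. The helper names `sha512_of` and `verify_sha512` are made up for this example, and SHA-512 is an assumption: use whichever digest the download page actually publishes for your release.

```shell
# Hypothetical helper: print the SHA-512 digest of a file as lowercase hex.
# Uses sha512sum where available (most Linux systems) and falls back to
# shasum, which ships with OS X.
sha512_of() {
  if command -v sha512sum >/dev/null 2>&1; then
    sha512sum "$1" | awk '{print $1}'
  else
    shasum -a 512 "$1" | awk '{print $1}'
  fi
}

# Hypothetical helper: succeed only if the file's digest matches the expected one.
# Case and whitespace in the expected value are ignored, since published
# digests are often printed in upper case and split into groups.
verify_sha512() {
  expected=$(printf '%s' "$2" | tr -d '[:space:]' | tr 'A-F' 'a-f')
  actual=$(sha512_of "$1")
  [ "$actual" = "$expected" ]
}
```

You would call it along the lines of `verify_sha512 spark-1.4.1.tgz "<digest copied from the download page>" && echo "checksum OK"`. If the published digest is in a format this doesn't handle, comparing the output of `sha512_of` by eye works too.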
Once you've got that downloaded, we're going to have to build it. Some instructions are given here, but it's not always clear what the steps should be in your situation, so let's walk through exactly what I'm doing on OS X. It should be fairly similar on Linux as well.
Decompressing the file
Navigate to the directory containing the downloaded Spark tar file and run the following command:
tar -xvf spark-1.4.1.tgz
You may have to change the version number to match the version of Spark that you downloaded. Once that's complete you should have a directory in that same folder named spark-1.4.1 or something similar. The Spark documentation's build instructions are thorough, but they won't always give you everything you want out of the box. For example, if you build Spark with SBT, you won't have support for PySpark, which can be annoying if that's how you're looking to code. Additionally, you won't get access to the Hive query engine unless you explicitly build with support for it. Not always a huge deal, but especially when I'm working on my local machine, I'd like to have access to it all.
Building with Maven
If you build Spark with Maven, you'll build it to work with YARN and PySpark. YARN likely won't be necessary on your local machine, but you'll probably want PySpark. Because this is a big project, the first thing to do is give Maven more memory by setting some options in your shell. Here's what is recommended.
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Once you've done that, let's get building. What you should check prior to building is the Scala version.
If you've got 2.11, skip ahead to the Building for Scala 2.11 section, and if not just keep going!
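If you're not sure which Scala you have, one way to check is to ask the Scala runner for its version and look at the first two components. The sketch below assumes `scala` is on your PATH and that its version banner contains a version number of the form x.y.z. The helper names are invented for this example.

```shell
# Hypothetical helper: reduce a full Scala version such as 2.11.7
# to its binary version (2.11), which is what decides the build section.
scala_binary_version() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Hypothetical helper: pull the first x.y.z version number out of a line of text.
extract_version() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Ask the installed Scala for its version, if there is one on the PATH.
# Some releases print the banner to stderr, hence the 2>&1.
if command -v scala >/dev/null 2>&1; then
  banner=$(scala -version 2>&1)
  echo "Scala binary version: $(scala_binary_version "$(extract_version "$banner")")"
fi
```

A result of 2.10 means you continue with the next section, and 2.11 means you skip ahead.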
Building for Scala 2.10
One command should work for you right away!
This is going to give you support for basically anything which is probably best for your local environment.
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Now this will take a while so it might be worth it to go grab some coffee.
Building for Scala 2.11
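If you're building against Scala 2.11, you'll need to add one more option to the command: the -Dscala-2.11 property. The Spark 1.4 build documentation also describes running the dev/change-version-to-2.11.sh script from the Spark root before the Maven command, so check the docs for your version if the build complains.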
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests clean package
I've had some trouble with the Thrift Server on 2.11 and you may as well. Removing the -Phive-thriftserver option will let Spark build correctly, but then you won't be able to use the Thrift Server. If that's a requirement, I recommend building with Scala 2.10.
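To keep the two Maven commands and the Thrift Server workaround straight, the choice of flags can be sketched as a small shell function. The function name `spark_mvn_args` and its arguments are hypothetical. The flags themselves are only the ones used in the commands above.

```shell
# Hypothetical helper: print the Maven arguments used in this guide.
#   $1 = Scala binary version (2.10 or 2.11)
#   $2 = "yes" to include the Thrift Server profile (the default),
#        anything else to leave it out
spark_mvn_args() {
  args="-Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0"
  if [ "${1:-}" = "2.11" ]; then
    args="$args -Dscala-2.11"
  fi
  args="$args -Phive"
  if [ "${2:-yes}" = "yes" ]; then
    args="$args -Phive-thriftserver"
  fi
  echo "$args -DskipTests clean package"
}
```

For example, `mvn $(spark_mvn_args 2.11 no)` would run the Scala 2.11 build without the Thrift Server. The substitution is left unquoted on purpose so the shell splits it into separate arguments.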
Building with SBT
Building with SBT (the standard Scala build tool) is supported as well. You can pass in the same parameters that you would to the Maven build, as they're derived from the same base. The "get everything" build can be found below. Remember, however, that this will not include PySpark. You'll need to build with Maven if you want to use that.
build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Spark is packaged with a couple of examples that you can use to test things out. The important thing to keep in mind is that after following this tutorial you're going to be running things on your local machine - not on a cluster. Therefore you should scale back some of the options that you might typically apply on a cluster.
For the following commands, make sure that you're at the root directory of the Spark distribution. Then run the respective commands listed below. The --master local option that we pass into the commands below limits the number of threads that Spark runs with on your computer.
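The master value is where the thread count is controlled. Plain `local` runs Spark with a single worker thread, `local[N]` runs it with N threads, and `local[*]` uses as many threads as your machine has cores. As a small sketch, here is a helper that builds that value. The name `local_master` is made up for this example.

```shell
# Hypothetical helper: build a local master URL for a given thread count.
# With no argument it prints plain "local", which means a single worker thread.
local_master() {
  if [ -z "${1:-}" ]; then
    echo "local"
  else
    echo "local[$1]"
  fi
}
```

For instance, `./bin/spark-shell --master "$(local_master 2)"` would start the shell with two threads. Keep the quotes, since square brackets are special characters in most shells.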
Running Spark Scala Shell
./bin/spark-shell --master local
Running the PySpark (Python Spark) Shell
./bin/pyspark --master local
Well, we just built Spark for use on our local machine. We've built it with support for SparkSQL, PySpark, and the Scala Spark shell. This is a great, simple way to get started using Spark. Now what you should do is either dive into a dataset, if you're already comfortable with the basics, or check out how to get started with Apache Spark RDDs, one of the core abstractions of Spark. That guide will walk you through working on a simple dataset on your local machine and prepare you for some larger-scale analysis!