In this tutorial we're going to set up a complete predictive modeling pipeline in Spark using DataFrames, Pipelines, and MLlib. The first part explains the basic concepts we'll need to build the model, walks you through downloading the data we'll use, and finishes by creating our Spark cluster on AWS and reading from and writing to Amazon S3!
In this tutorial we're going to build a full-stack machine learning project, going all the way from data manipulation to feature creation and, finally, serving predictions.
Thus far I haven't found a good project template for Apache Spark, and getting a project set up correctly has taken me repeated trial and error. In this tutorial, I walk through a simple project template that I've created to help others get started with Apache Spark in Scala.