In this tutorial we're going to set up a complete predictive modeling pipeline in Spark using DataFrames, Pipelines, and MLlib. The first part explains the basic concepts we'll need to build the model, walks you through downloading the data we'll use, and finishes by creating our Spark cluster on AWS and reading from and writing to Amazon S3!
This post dives into some of the details of the Spark shuffle and what it means for you when you use Apache Spark to analyze data on a cluster.
This guide will show you how to read CSV files in Apache Spark. We'll walk through how to use this package in both Python and Scala.