Spark | R Programming for Data Scientists and Analysts

Spark is a highly optimized Data Science environment that runs on Hadoop YARN, with support for Machine Learning (through MLlib and Mahout), SQL, DataFrames, and Streaming. In this course, you’ll dive into the details of practical data science on the Spark platform, including real-world interaction with other systems in modern Data Science environments.

Retail Price: $1,895.00

Course Days: 2


WHAT YOU'LL LEARN

Join an engaging hands-on learning environment, where you’ll learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark Programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
  • How to integrate machine learning into Spark applications
  • How to use Spark Streaming

WHO SHOULD ATTEND?

Data Scientists and Data Analysts

PREREQUISITES

Before attending this course, you should have:

  • Basic R programming experience
  • Basic knowledge of Statistics and Probability
  • Data Science background
  • Completed the Introduction to R | R Programming JumpStart course

COURSE OUTLINE

Getting Started - Overview

  • Our Data and our problem set
  • Accessing the cluster, the data, and the tools
  • The Continuous Workshop approach
  • "Let's build a model together"
  • Focus on analysis, exploration, data munging, algorithms
  • Tooling and fundamentals as necessary to get the job done

Spark Introduction

  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • MLlib Overview
  • Accessing HDFS data
  • Lab Focus
  • Working with HDFS data
  • Distributed vs. Local Run Modes
  • Spark vs. Other tools (when is Spark the right tool for the job?)
  • Spark vs. SAS
  • Spark Languages (Java, R, Python, and Scala)
  • Hello, Spark

Spark Overview

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • Lab
  • MLlib
  • Spark Streaming
  • Spark API

DataFrames

  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • Adding variables to a DataFrame
  • DataFrame Types
  • DataFrame Operations
  • Dependent vs. Independent variables
  • Map/Reduce with DataFrames

Spark SQL

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • Queries

Spark MLlib

  • MLlib overview
  • MLlib Algorithms Overview
  • Classification Algorithms
  • Regression Algorithms
  • Lab Focus
  • Brief Comparison to SAS
  • Splitting your data and how to tune regression
  • Decision Trees and forests
  • Lab Focus
  • Brief Comparison to SAS
  • Stepwise approach to Decision Trees
  • Working with Exit Criteria
  • Recommendation with ALS
  • Clustering Algorithms
  • Lab Focus
  • Key Clustering Algorithms
  • Choosing Clustering Algorithms
  • Working with key algorithms
  • Machine Learning Pipelines
  • Linear Algebra (SVD, PCA)
  • Statistics in MLlib

Spark Streaming

  • Streaming overview

Streaming with Kafka

  • Kafka overview
  • Kafka and Spark Streaming

Data Flow with NiFi

  • Apache NiFi overview
  • NiFi data flows with Spark/R

Cluster Mode

  • Standalone Cluster
  • Masters and Workers

Spark - the Big Picture

  • Spark in Real-Time and near-Real-Time Decision Support Systems
  • Spark in the Enterprise
  • Best Practices

