Spark | R Programming for Data Scientists and Analysts

Name: training4it.com
Address: 9913 Shelbyville Rd #200, Louisville, KY, 40223
Telephone: 502.265.3057

Spark is a highly optimized data science environment that runs on Hadoop YARN, with support for machine learning through MLlib and Mahout, as well as SQL, DataFrames, and streaming. In this course, you’ll dive into the details of practical data science on the Spark platform, including real-world interaction with other systems in modern data science environments.

Retail Price: $1,895.00

Course Days: 2

WHAT YOU'LL LEARN

Join an engaging hands-on learning environment, where you’ll learn:

The essentials of Spark architecture and applications
How to execute Spark programs
How to create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
How to integrate machine learning into Spark applications
How to use Spark Streaming

WHO SHOULD ATTEND?

Data Scientists and Data Analysts

PREREQUISITES

Before attending this course, you should have:

Basic R programming experience
Basic knowledge of Statistics and Probability
Data Science background

Introduction to R | R Programming JumpStart

COURSE OUTLINE

Getting Started - Overview

Our Data and our problem set
Accessing the cluster, the data, and the tools
The Continuous Workshop approach
"Let's build a model together"
Focus on analysis, exploration, data munging, algorithms
Tooling and fundamentals as necessary to get the job done

Spark Introduction

Data Science: The State of the Art
Hadoop, YARN, and Spark
Architectural Overview
MLlib Overview
Accessing HDFS data
Lab Focus
Working with HDFS data
Distributed vs. Local Run Modes
Spark vs. Other tools (when is Spark the right tool for the job?)
Spark vs. SAS
Spark Languages (Java, R, Python, and Scala)
Hello, Spark

Spark Overview

Spark Core
Spark SQL
Spark and Hive
Lab
MLlib
Spark Streaming
Spark API

DataFrames

DataFrames and Resilient Distributed Datasets (RDDs)
Partitions
Adding variables to a DataFrame
DataFrame Types
DataFrame Operations
Dependent vs. Independent variables
Map/Reduce with DataFrames

Spark SQL

Spark SQL Overview
Data stores: HDFS, Cassandra, HBase, Hive, and S3
Table Definitions
Queries

Spark MLlib

MLlib overview
MLlib Algorithms Overview
Classification Algorithms
Regression Algorithms
Lab Focus
Brief Comparison to SAS
Splitting the data and tuning regression
Decision Trees and forests
Lab Focus
Brief Comparison to SAS
Stepwise approach to Decision Trees
Working with Exit Criteria
Recommendation with ALS
Clustering Algorithms
Lab Focus
Key Clustering Algorithms
Choosing Clustering Algorithms
Working with key algorithms
Machine Learning Pipelines
Linear Algebra (SVD, PCA)
Statistics in MLlib

Spark Streaming

Streaming overview

Streaming with Kafka

Kafka overview
Kafka and Spark Streaming

Data Flow with NiFi

Apache NiFi overview
NiFi data flows with Spark/R

Cluster Mode

Standalone Cluster
Masters and Workers

Spark - the Big Picture

Spark in Real-Time and near-Real-Time Decision Support Systems
Spark in the Enterprise
Best Practices

For upcoming dates for this class, contact us at 502.265.3057 or email info@training4it.com