Developing with Spark for Big Data | Enterprise-Grade Spark Programming for the Hadoop & Big Data Ecosystem

Name: training4it.com
Address: 9913 Shelbyville Rd #200, Louisville, KY, 40223
Telephone: 502.265.3057

Apache Spark, a significant component in the Hadoop Ecosystem, is a cluster computing engine used in Big Data. Building on top of the Hadoop YARN and HDFS ecosystem, it offers order-of-magnitude faster processing for many in-memory computing tasks compared to Map/Reduce. It can be programmed in Java, Scala, Python, and R - the favorite languages of Data Scientists - along with SQL-based front ends. With advanced libraries like Mahout and MLlib for Machine Learning, GraphX or Neo4j for rich data graph processing, as well as access to other NoSQL data stores, rule engines and other Enterprise components, Spark is a lynchpin in modern Big Data and Data Science computing.

Retail Price: $2,795.00

Next Date: Request Date

Course Days: 5

Course Objectives

This course provides a practical grounding in the umbrella of technologies at the leading edge of data science development, focused on Spark and related tools. Working in a hands-on learning environment, students will learn:

The essentials of Spark architecture and applications
How to execute Spark Programs
How to create and manipulate both RDDs (Resilient Distributed Datasets) and unified DataFrames (the Spark 2 DataFrame/Dataset API)
How to persist and restore data frames
Essential NoSQL access
How to integrate machine learning into Spark applications
How to use Spark Streaming and Kafka to create streaming applications

Course Prerequisites

This intermediate-level course is geared toward experienced developers seeking proficiency in Spark tools and technologies. Attendees should be comfortable with Java, Scala, or Python programming. Students should also be able to navigate the Linux command line and have basic knowledge of Linux editors (such as vi or nano) for editing code.

Take Before: Students should have attended the course(s) below, or should have basic skills in these areas:

TT2104 Java Programming Fundamentals (for Java supported course flavor)
TTPS4800 Introduction to Python Programming (for Python supported course flavor)
TTSQLB3 Introduction to SQL (Basic familiarity is needed for all editions)

Course Agenda

Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We’ll work with you to tune this course and level of coverage to target the skills you need most.

Spark Overview

Hadoop Ecosystem
Hadoop YARN vs. Mesos
Spark vs. Map/Reduce
Spark with Map/Reduce: Lambda Architecture
Spark in the Enterprise Data Science Architecture

Spark Component Overview

Spark Shell
RDDs: Resilient Distributed Datasets
Data Frames
Spark 2 Unified DataFrames
Spark Sessions
Functional Programming
Spark SQL
MLlib
Structured Streaming
SparkR
Spark and Python

RDDs: Resilient Distributed Datasets

Coding with RDDs
Transformations
Actions
Lazy Evaluation and Optimization
RDDs in Map/Reduce

DataFrames

RDDs vs. DataFrames
Unified DataFrames in Spark 2.0
Partitioning

Spark Applications

Spark Sessions
Running Applications
Logging

DataFrame Persistence

RDD Persistence
DataFrame and Unified DataFrame Persistence
Distributed Persistence

Spark Streaming

Streaming Overview
Streams
Structured Streaming
DStreams and Apache Kafka

Accessing NoSQL Data

Ingesting data
Parquet Files
Relational Databases
Graph Databases (Neo4j, GraphX)
Interacting with Hive
Accessing Cassandra Data
Document Databases (MongoDB, CouchDB)

Enterprise Integration

Map/Reduce and Lambda Integration
Camel Integration
Drools and Spark

Algorithms and Patterns

MLlib and Mahout
Classification
Clustering
Decision Trees
Decompositions
Pipelines
Spark Packages

Spark SQL

Spark SQL
SQL and DataFrames
Spark SQL and Hive
Spark SQL and JDBC

GraphX

Graph APIs
GraphX
ETL in GraphX
Exploratory Analysis
Graph computation
Pregel API Overview
GraphX Algorithms
Neo4j as an alternative

Alternate Languages

Using Web Notebooks (Zeppelin, Jupyter)
R on Spark
Python on Spark
Scala on Spark

Clustering Spark for Developers

Parallelizing Spark Applications
Clustering concerns for Developers

Performance and Tuning

Monitoring Spark Performance
Tuning Memory
Tuning CPU
Tuning Data Locality
Troubleshooting

To request a date for this class, contact us at 502.265.3057 or email info@training4it.com