Introduction to Apache Spark | Hands-on Spark for Big Data & Machine Learning
Course Objectives
This “skills-centric” course is about 50% hands-on lab and 50% lecture. It is designed to train attendees in core big data and Spark development skills, combining current, effective techniques with sound industry practices. Throughout the course, students are led through a series of progressively more advanced topics, each consisting of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.
This course provides a practical introduction to the family of technologies at the leading edge of data science development, with a focus on Spark and related tools. Working in a hands-on learning environment, students will explore:
- Spark Ecosystem
- Spark Shell
- Spark Data structures (RDD, DataFrame, Dataset)
- Spark SQL
- Modern data formats and Spark
- Spark API
- Spark & Hadoop & Hive
- Spark ML overview
- GraphX
- Time-permitting: Spark Streaming
- Time-permitting: Optional Capstone Workshop
Course Prerequisites
This foundation-level course is geared toward experienced, intermediate-level developers and architects (with basic Python experience) who want to become proficient in modern development skills for working with Apache Spark in an enterprise data environment.
Take Before: Students should have attended the course(s) below, or should have basic skills in these areas:
- TTPS4800 Introduction to Python Programming
- TTSQLB3 Introduction to SQL (Basic familiarity is needed, not in-depth SQL skills)
Course Agenda
Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We’ll work with you to tune this course and level of coverage to target the skills you need most.
Spark Introduction
- Big data, Hadoop, Spark
- Spark concepts and architecture
- Spark components overview
- Labs: installing and running Spark
The first look at Spark
- Spark shell
- Spark web UIs
- Analyzing a dataset – part 1
- Labs: Spark shell exploration
Spark Data structures
- Partitions
- Distributed execution
- Operations: transformations and actions
- Labs: Unstructured data analytics using RDDs
Caching
- Caching overview
- Various caching mechanisms available in Spark
- In-memory file systems
- Caching use cases and best practices
- Labs: Benchmark of caching performance
DataFrames and Datasets
- DataFrames Intro
- Loading structured data (JSON, CSV) using DataFrames
- Using schemas
- Specifying a schema for DataFrames
- Labs: DataFrames, Datasets, Schema
Spark SQL
- Spark SQL concepts and overview
- Defining tables and importing datasets
- Querying data using SQL
- Handling various storage formats: JSON, Parquet, ORC
- Labs: querying structured data using SQL; evaluating data formats
Spark and Hadoop
- Hadoop Primer: HDFS, YARN
- Hadoop + Spark architecture
- Running Spark on Hadoop YARN
- Processing HDFS files using Spark
- Spark & Hive
Spark API
- Overview of Spark APIs in Scala / Python
- The lifecycle of a Spark application
- Spark APIs
- Deploying Spark applications on YARN
- Labs: Developing and deploying a Spark application
Spark ML Overview
- Machine Learning primer
- Machine Learning in Spark: MLlib / ML
- Spark ML overview (newer Spark 2 version)
- Algorithms overview: Clustering, Classification, Recommendation
- Labs: Writing ML applications in Spark
GraphX
- GraphX library overview
- GraphX APIs
- Creating a graph and navigating it
- Shortest distance
- Pregel API
- Labs: Processing graph data using Spark
Time Permitting Topics
Spark Streaming
- Streaming concepts
- Evaluating Streaming platforms
- Spark Streaming library overview
- Streaming operations
- Sliding window operations
- Structured Streaming
- Continuous streaming
- Spark & Kafka streaming
- Labs: Writing Spark Streaming applications
Workshop
- Attendees will work on solving real-world data analysis problems using Spark