Apache Spark for Data Scientists
Apache Spark is a powerful open-source engine for processing data in Hadoop clusters, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex iterative algorithms, enabling some in-memory applications to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated applications that support faster decisions and real-time actions across a wide variety of use cases, architectures, and industries.
WHAT YOU'LL LEARN
Join an engaging hands-on learning environment, where you’ll learn:
- The essentials of Spark architecture and applications
- How to execute Spark programs
- How to create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
- How to integrate machine learning into Spark applications
- How to use Spark Streaming
WHO SHOULD ATTEND?
Data Scientists, System Administrators, Testers, and other technical business professionals who seek to use Spark for data processing and analysis.
PREREQUISITES
Before attending this course, you should have:
- Introduction to Java Programming (at least exposure to basic Java syntax)
- Introduction to SQL (familiarity with SQL basics)
- Basic knowledge of Statistics and Probability
- Data Science background
- Java 8 Programming and Object Oriented Essentials for Developers New to OO
- Introduction to Writing SQL Queries (TTSQL003)
COURSE OUTLINE
Spark
- Data Science: The State of the Art
- Hadoop, YARN, and Spark
- Architectural Overview
- Spark and Storm
- MLlib and Mahout
- Distributed vs. Local Run Modes
- Hello, Spark
Spark Overview
- Spark Core
- Spark SQL
- Spark and Hive
- MLlib
- Mahout
- Spark Streaming
- Spark API
DataFrames
- DataFrames and Resilient Distributed Datasets (RDDs)
- Partitions
- DataFrame Types
- DataFrame Operations
- Map/Reduce with DataFrames
Spark SQL
- Spark SQL Overview
- Data stores: HDFS, Cassandra, HBase, Hive, and S3
- Table Definitions
- ETL in Spark
- Queries
Spark MLlib
- MLlib Overview
- MLlib Algorithms Overview
Spark Streaming
- Streaming overview
- Real-time data ingestion
- State
- Window Operations
Spark GraphX
- GraphX overview
- ETL with GraphX
- Graph computation
Performance and Tuning
- Broadcast variables
- Accumulators
- Memory Management
Cluster Mode
- Standalone Cluster
- Masters and Workers
- Configurations
- Working with large data sets