Apache Spark Programming (DB 105)
This 3-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.
Target Audience
Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.
Prerequisites
- Some familiarity with Apache Spark is helpful but not required.
- Some familiarity with machine learning and data science concepts is highly recommended but not required.
- Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.
Module 1: Spark Overview
Lecture
- Databricks Overview
- Spark Capabilities
- Spark Ecosystem
- Basic Spark Components
Hands-On
- Databricks Lab Environment
- Working with Notebooks
- Spark Clusters and Files
Module 2: Spark SQL and DataFrames
Lecture
- Use of Spark SQL
- Use of DataFrames / Datasets
- Reading from CSV, JSON, JDBC, Parquet files, and more
- Writing Data
- DataFrame, Dataset, and SQL APIs
- Aggregations
- SQL Joins with DataFrames
- Broadcasting
- Catalyst Query Optimization
- Tungsten
- ETL
Hands-On
- Creating DataFrames
- Querying with DataFrames and SQL
- ETL with DataFrames
- Caching
- Visualization
Module 3: Spark Internals
Lecture
- Jobs, Stages and Tasks
- Partitions and Shuffling
- Job Performance
Hands-On
- Visualizing SQL Queries
- Observing Task Execution
- Understanding Performance
- Measuring Memory Use
Module 4: Structured Streaming
Lecture
- Streaming Sources and Sinks
- Structured Streaming APIs
- Windowing and Aggregation
- Checkpointing
- Watermarking
- Reliability and Fault Tolerance
Hands-On
- Reading from TCP
- Reading from Kafka
- Continuous Visualization
Module 5: Machine Learning
Lecture
- Spark ML Pipeline API
- Built-in Featurizing and Algorithms
Hands-On
- Featurization
- Building a Machine Learning Pipeline
Module 6: Graph Processing with GraphFrames
Lecture
- Basic Graph Analysis
- GraphFrames API
Hands-On
- GraphFrames ETL
- PageRank and Label Propagation with GraphFrames