Apache Spark Programming (DB 105)

This 3-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.

Retail Price: $3,000.00

Next Date: 12/10/2019

Course Days: 3

Target Audience 

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.

Prerequisites

  • Some familiarity with Apache Spark is helpful but not required.
  • Some familiarity with machine learning and data science concepts is highly recommended but not required.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.

Module 1: Spark Overview

  • Databricks Overview
  • Spark Capabilities
  • Spark Ecosystem
  • Basic Spark Components


  • Databricks Lab Environment
  • Working with Notebooks
  • Spark Clusters and Files

Module 2: Spark SQL and DataFrames

  • Use of Spark SQL
  • Use of DataFrames / Datasets
  • Reading from CSV, JSON, JDBC, Parquet files, and more
  • Writing Data
  • DataFrame, Dataset, and SQL APIs
  • Aggregations
  • SQL Joins with DataFrames
  • Broadcasting
  • Catalyst Query Optimization
  • Tungsten
  • ETL


  • Creating DataFrames
  • Querying with DataFrames and SQL
  • ETL with DataFrames
  • Caching
  • Visualization

Module 3: Spark Internals

  • Jobs, Stages and Tasks
  • Partitions and Shuffling
  • Job Performance


  • Visualizing SQL Queries
  • Observing Task Execution
  • Understanding Performance
  • Measuring Memory Use

Module 4: Structured Streaming

  • Streaming Sources and Sinks
  • Structured Streaming APIs
  • Windowing and Aggregation
  • Checkpointing
  • Watermarking
  • Reliability and Fault Tolerance


  • Reading from TCP
  • Reading from Kafka
  • Continuous Visualization

Module 5: Machine Learning

  • Spark ML Pipeline API
  • Built-in Featurizing and Algorithms


  • Featurization
  • Building a Machine Learning Pipeline

Module 6: Graph Processing with GraphFrames

  • Basic Graph Analysis
  • GraphFrames API


  • GraphFrames ETL
  • PageRank and Label Propagation with GraphFrames

Course Dates             Course Times (EST)   Delivery Mode   GTR
12/10/2019 - 12/12/2019  10:00 AM - 6:00 PM   Virtual         Guaranteed to run