Apache Spark for Machine Learning and Data Science (DB 301)

This 3-day course is primarily for data scientists but is directly applicable to analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark and its applications to Machine Learning.

Retail Price: $3,000.00

Next Date: 10/29/2019

Course Days: 3


Target Audience

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.

 

Prerequisites

 

  • Some familiarity with Apache Spark is helpful but not required.
  • Some familiarity with Machine Learning and Data Science concepts is highly recommended but not required.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.

Module 1: Spark Overview
Lecture

  • Databricks Overview
  • Spark Capabilities
  • Spark Ecosystem
  • Basic Spark Components

Hands-On

  • Databricks Lab Environment
  • Working with Notebooks
  • Spark Clusters and Files

Module 2: Spark SQL and DataFrames
Lecture

  • Use of Spark SQL
  • Use of DataFrames / Datasets
  • Reading & Writing Data
  • DataFrame, Dataset and SQL APIs
  • Catalyst Query Optimization
  • Tungsten
  • ETL

Hands-On

  • Creating DataFrames
  • Querying with DataFrames
  • Querying with SQL
  • ETL with DataFrames
  • Caching
  • Visualization

Module 3: Spark Internals
Lecture

  • Jobs, Stages, and Tasks
  • Partitions and Shuffling
  • Job Performance

Hands-On

  • Visualizing SQL Queries
  • Observing Task Execution
  • Understanding Performance
  • Measuring Memory Use

Module 4: Machine Learning
Lecture

  • Spark MLlib Pipeline API
  • Built-in Featurizing and Algorithms
  • Cross-Validation and Grid Search for Hyperparameter Tuning
  • Evaluation Metrics
  • Data Partitioning Strategies
  • Spark integration with scikit-learn

Hands-On

  • NLP/Text Classification with Logistic Regression
  • Decision Tree vs. Random Forest
  • Data imputation with Alternating Least Squares
  • Clustering with K-Means
  • Neural Networks
  • Spark-sklearn

Module 5: Structured Streaming
Lecture

  • Streaming Sources and Sinks
  • Structured Streaming APIs
  • Windowing & Aggregation
  • Checkpointing
  • Watermarking
  • Reliability and Fault Tolerance

Hands-On

  • Reading from TCP
  • Continuous Visualization

Module 6: Graph Processing with GraphFrames
Lecture

  • Basic Graph Analysis
  • GraphFrames API

Hands-On

  • GraphFrames ETL
  • PageRank and Label Propagation with GraphFrames

Course Dates             Course Times (EST)   Delivery Mode   GTR
10/29/2019 - 10/31/2019  10:00 AM - 6:00 PM   Virtual         Guaranteed to run
12/11/2019 - 12/13/2019  10:00 AM - 6:00 PM   Virtual         Guaranteed to run