Developing with Spark for Big Data | Enterprise-Grade Spark Programming for the Hadoop & Big Data Ecosystem

Apache Spark, a significant component of the Hadoop ecosystem, is a cluster computing engine for Big Data processing. Building on the Hadoop YARN and HDFS ecosystem, it offers order-of-magnitude faster processing than Map/Reduce for many in-memory computing tasks. It can be programmed in Java, Scala, Python, and R - the favorite languages of Data Scientists - along with SQL-based front ends. With advanced libraries like Mahout and MLlib for Machine Learning, GraphX or Neo4j for rich data graph processing, as well as access to NoSQL data stores, rule engines, and other Enterprise components, Spark is a linchpin in modern Big Data and Data Science computing.

Retail Price: $2,795.00

Course Days: 5




Course Objectives

This course provides a practical grounding in the leading-edge data science technologies centered on Spark and related tools. Working in a hands-on learning environment, students will learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark Programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and Spark 2 Unified DataFrames
  • How to persist and restore data frames
  • Essential NoSQL data access
  • How to integrate machine learning into Spark applications
  • How to use Spark Streaming and Kafka to create streaming applications

 

Course Prerequisites

This intermediate-level course is geared toward experienced developers seeking proficiency in Spark tools and technologies. Attendees should be comfortable with Java, Scala, or Python programming. Students should also be able to navigate the Linux command line and have basic knowledge of Linux editors (such as vi or nano) for editing code.

Take Before: Students should have attended the course(s) below, or should have basic skills in these areas:

  • TT2104: Java Programming Fundamentals (for the Java-supported course flavor)
  • TTPS4800: Introduction to Python Programming (for the Python-supported course flavor)
  • TTSQLB3: Introduction to SQL (basic familiarity is needed for all editions)

Course Agenda

 

Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We’ll work with you to tune this course and level of coverage to target the skills you need most.

Spark Overview

  • Hadoop Ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. Map/Reduce
  • Spark with Map/Reduce: Lambda Architecture
  • Spark in the Enterprise Data Science Architecture

Spark Component Overview

  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • DataFrames
  • Spark 2 Unified DataFrames
  • Spark Sessions
  • Functional Programming
  • Spark SQL
  • MLlib
  • Structured Streaming
  • SparkR
  • Spark and Python

RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in Map/Reduce

DataFrames

  • RDDs vs. DataFrames
  • Unified DataFrames in Spark 2.0
  • Partitioning

Spark Applications

  • Spark Sessions
  • Running Applications
  • Logging

DataFrame Persistence

  • RDD Persistence
  • DataFrame and Unified DataFrame Persistence

Distributed Persistence

Spark Streaming

  • Streaming Overview
  • Streams
  • Structured Streaming
  • DStreams and Apache Kafka

Accessing NoSQL Data

  • Ingesting data
  • Parquet Files
  • Relational Databases
  • Graph Databases (Neo4j, GraphX)
  • Interacting with Hive
  • Accessing Cassandra Data
  • Document Databases (MongoDB, CouchDB)

Enterprise Integration

  • Map/Reduce and Lambda Integration
  • Camel Integration
  • Drools and Spark

Algorithms and Patterns

  • MLlib and Mahout
  • Classification
  • Clustering
  • Decision Trees
  • Decompositions
  • Pipelines
  • Spark Packages

Spark SQL

  • Spark SQL
  • SQL and DataFrames
  • Spark SQL and Hive
  • Spark SQL and JDBC

GraphX

  • Graph APIs
  • GraphX
  • ETL in GraphX
  • Exploratory Analysis
  • Graph computation
  • Pregel API Overview
  • GraphX Algorithms
  • Neo4j as an alternative

Alternate Languages

  • Using Web Notebooks (Zeppelin, Jupyter)
  • R on Spark
  • Python on Spark
  • Scala on Spark

Clustering Spark for Developers

  • Parallelizing Spark Applications
  • Clustering concerns for Developers

Performance and Tuning

  • Monitoring Spark Performance
  • Tuning Memory
  • Tuning CPU
  • Tuning Data Locality
  • Troubleshooting

