JumpStart to Developing in Apache Spark

Apache Spark is a cluster computing engine for Big Data and an important component of the Hadoop ecosystem. Building on Hadoop YARN and HDFS, Spark keeps working data in memory, which makes many computing tasks faster than the equivalent MapReduce jobs. It can be programmed in Java, Scala, Python, and R, along with SQL-based front ends. With libraries such as Mahout and MLlib for machine learning and GraphX for graph processing, integration with graph databases such as Neo4j, and access to other NoSQL data stores, rule engines, and related components, Spark is a linchpin of modern Big Data and data science computing. This course introduces you to enterprise-grade Spark programming and the components needed to craft complete data science solutions. It is a fast-paced course that presents topical overviews and “big-picture” interactions while giving you hands-on experience. The course is taught in Java and, with some alterations, can be delivered in Python, Scala, or R.

Retail Price: $2,195.00

Next Date: Request Date

Course Days: 3




WHAT YOU'LL LEARN

Join an engaging hands-on learning environment, where you’ll learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark Programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and Spark 2.x unified DataFrames
  • How to persist and restore DataFrames
  • Essential NoSQL access
  • How to integrate machine learning into Spark applications
  • How to use Spark Streaming and Kafka to create streaming applications

WHO SHOULD ATTEND?

Experienced Developers and Architects who seek proficiency in working with Apache Spark in an enterprise data environment.

PREREQUISITES

Before attending this course, you should have:

  • Java programming experience
  • Python programming experience
  • Basic understanding of SQL
  • Comfort with navigating the Linux command line
  • Basic knowledge of a Linux text editor (such as vi or nano) for editing code

These courses cover the prerequisite skills:

  • Java 8 Programming and Object Oriented Essentials for Developers New to OO
  • Introduction to Writing SQL Queries (TTSQL003)
  • Python Programming Essentials

COURSE OUTLINE

Overview of Spark

  • Hadoop Ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. MapReduce
  • Spark: Lambda Architecture
  • Spark in the Enterprise Data Science Architecture

Spark Component Overview

  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • DataFrames
  • Spark 2 Unified DataFrames
  • Spark Sessions
  • Functional Programming
  • Spark SQL
  • MLlib
  • Structured Streaming
  • SparkR
  • Spark and Python

RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in MapReduce

DataFrames

  • RDDs vs. DataFrames
  • Unified DataFrames in Spark 2.x
  • Partitioning

DataFrame Persistence

  • RDD Persistence
  • DataFrame and Unified DataFrame Persistence
  • Distributed Persistence

Accessing NoSQL Data

  • Ingesting data
  • Relational Databases and Sqoop
  • Interacting with Hive
  • Graph Data
  • Accessing Cassandra Data

Spark SQL

  • Spark SQL
  • SQL and DataFrames
  • Spark SQL and Hive
  • Spark SQL and JDBC

Machine Learning

  • MLlib
  • Mahout

Spark Streaming

  • Streaming Overview
  • Streams
  • Structured Streaming
  • Lambda Streaming
  • Spark and Kafka

