Developing with Spark for Big Data | Enterprise-Grade Spark Programming for the Hadoop & Big Data Ecosystem
Course Objectives
This course provides a practical introduction to the leading-edge data science technologies built around Spark and related tools. Working in a hands-on learning environment, students will learn:
- The essentials of Spark architecture and applications
- How to execute Spark Programs
- How to create and manipulate both RDDs (Resilient Distributed Datasets) and unified DataFrames
- How to persist and restore DataFrames
- Essential NoSQL access
- How to integrate machine learning into Spark applications
- How to use Spark Streaming and Kafka to create streaming applications
Course Prerequisites
This is an intermediate-level course geared for experienced developers seeking to become proficient in Spark tools and technologies. Attendees should be comfortable programming in Java, Scala or Python. Students should also be able to navigate the Linux command line and have basic knowledge of Linux editors (such as vi or nano) for editing code.
Take Before: Students should have attended the course(s) below, or should have basic skills in these areas:
- TT2104 Java Programming Fundamentals (for Java supported course flavor)
- TTPS4800 Introduction to Python Programming (for Python supported course flavor)
- TTSQLB3 Introduction to SQL (Basic familiarity is needed for all editions)
Course Agenda
Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We’ll work with you to tune this course and level of coverage to target the skills you need most.
Spark Overview
- Hadoop Ecosystem
- Hadoop YARN vs. Mesos
- Spark vs. Map/Reduce
- Spark with Map/Reduce: Lambda Architecture
- Spark in the Enterprise Data Science Architecture
Spark Component Overview
- Spark Shell
- RDDs: Resilient Distributed Datasets
- DataFrames
- Spark 2 Unified DataFrames
- Spark Sessions
- Functional Programming
- Spark SQL
- MLlib
- Structured Streaming
- SparkR
- Spark and Python
RDDs: Resilient Distributed Datasets
- Coding with RDDs
- Transformations
- Actions
- Lazy Evaluation and Optimization
- RDDs in Map/Reduce
DataFrames
- RDDs vs. DataFrames
- Unified DataFrames in Spark 2.0
- Partitioning
Spark Applications
- Spark Sessions
- Running Applications
- Logging
DataFrame Persistence
- RDD Persistence
- DataFrame and Unified DataFrame Persistence
- Distributed Persistence
Spark Streaming
- Streaming Overview
- Streams
- Structured Streaming
- DStreams and Apache Kafka
Accessing NoSQL Data
- Ingesting data
- Parquet Files
- Relational Databases
- Graph Databases (Neo4j, GraphX)
- Interacting with Hive
- Accessing Cassandra Data
- Document Databases (MongoDB, CouchDB)
Enterprise Integration
- Map/Reduce and Lambda Integration
- Camel Integration
- Drools and Spark
Algorithms and Patterns
- MLlib and Mahout
- Classification
- Clustering
- Decision Trees
- Decompositions
- Pipelines
- Spark Packages
Spark SQL
- Spark SQL
- SQL and DataFrames
- Spark SQL and Hive
- Spark SQL and JDBC
GraphX
- Graph APIs
- GraphX
- ETL in GraphX
- Exploratory Analysis
- Graph computation
- Pregel API Overview
- GraphX Algorithms
- Neo4j as an alternative
Alternate Languages
- Using Web Notebooks (Zeppelin, Jupyter)
- R on Spark
- Python on Spark
- Scala on Spark
Clustering Spark for Developers
- Parallelizing Spark Applications
- Clustering concerns for Developers
Performance and Tuning
- Monitoring Spark Performance
- Tuning Memory
- Tuning CPU
- Tuning Data Locality
- Troubleshooting