Hadoop Developer Foundation | Explore Hadoop, HDFS, Hive, YARN, Spark, and More
Course Objectives
This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core big data and Spark development skills, coupling the most current, effective techniques with sound industry practices. Throughout the course, students are led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.
Working in a hands-on learning environment led by our expert Hadoop team, students will explore:
- Introduction to Hadoop
- HDFS
- YARN
- Data Ingestion
- HBase
- Oozie
- Working with Hive
- Hive (Advanced)
- Hive in Cloudera
- Working with Spark
- Spark Basics
- Spark Shell
- RDDs (Condensed coverage)
- Spark Dataframes & Datasets
- Spark SQL
- Spark API programming
- Spark and Hadoop
- Machine Learning (ML / MLlib)
- GraphX
- Spark Streaming
Course Prerequisites
This is an intermediate-level course geared for experienced developers seeking proficiency in Hadoop, Spark, and related technologies. Attendees should be experienced developers who are comfortable with at least one programming language, can navigate the Linux command line, and have basic knowledge of Linux editors (such as vi or nano) for editing code.
In order to gain the most from this course, attending students should be:
- Familiar with a programming language
- Comfortable in a Linux environment (able to navigate the Linux command line and edit files using vi or nano)
Course Outline
Day One
Introduction to Hadoop
- Hadoop history, concepts
- Ecosystem
- Distributions
- High-level architecture
- Hadoop myths
- Hadoop challenges
- Hardware and software
HDFS
- Design and architecture
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Daemons: Namenode, Secondary Namenode, Datanode
- Communications and heartbeats
- Data integrity
- Read and write path
- Namenode High Availability (HA), Federation
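For reference, a minimal sketch of the basic HDFS operations practiced in this module, driven from Python via the standard `hdfs dfs` shell commands (assumes a configured Hadoop client on the PATH; the paths and file names are illustrative):

```python
# Minimal sketch: basic HDFS operations from Python via the standard
# `hdfs dfs` shell commands. Assumes a configured Hadoop client on the
# PATH; the /user/student paths are hypothetical examples.
import subprocess

def hdfs(*args: str) -> None:
    """Run an `hdfs dfs` subcommand and fail loudly on error."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/student/data")             # create a directory
hdfs("-put", "local_file.txt", "/user/student/data/")  # upload a local file
hdfs("-ls", "/user/student/data")                      # list directory contents
hdfs("-cat", "/user/student/data/local_file.txt")      # print file contents
```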
Day Two
YARN
- YARN Concepts and architecture
- Evolution from MapReduce to YARN
Data Ingestion
- Flume for logs and other data ingestion into HDFS
- Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
- Copying data between clusters (distcp)
- Using S3 as a complement to HDFS
- Data ingestion best practices and architectures
- Oozie for scheduling events on Hadoop
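As a taste of the Sqoop material, a minimal sketch of a typical import invocation, wrapped in Python; the JDBC URL, credentials, table, and directories are hypothetical placeholders (assumes the Sqoop client is installed and configured):

```python
# Minimal sketch: a typical Sqoop import from a relational database into
# HDFS, invoked from Python. The JDBC URL, credentials, and table name
# below are hypothetical examples.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",  # keep credentials out of argv
    "--table", "customers",                    # hypothetical source table
    "--target-dir", "/user/etl/customers",     # HDFS destination
    "--num-mappers", "4",                      # parallel import tasks
], check=True)
```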
HBase (covered in brief)
- Concepts and architecture
- HBase vs RDBMS vs Cassandra
- HBase Java API
- Time series data on HBase
- Schema design
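The module covers the HBase Java API; for a Python-side illustration of the same concepts, here is a minimal sketch using the happybase client (Thrift-based, so it assumes the HBase Thrift server is running; the table, column family, and row keys are hypothetical):

```python
# Minimal sketch of HBase reads/writes from Python using the happybase
# client (requires the HBase Thrift server). Table, column family, and
# row keys below are hypothetical examples.
import happybase

conn = happybase.Connection("hbase-thrift-host")  # hypothetical host
table = conn.table("sensor_readings")

# Row key design matters: prefixing with the sensor id keeps a sensor's
# readings adjacent, a common time-series schema pattern.
table.put(b"sensor42#20240101T0000", {b"m:temp": b"21.5", b"m:hum": b"40"})

row = table.row(b"sensor42#20240101T0000")
print(row[b"m:temp"])

# Scan a single sensor's readings by row-key prefix.
for key, data in table.scan(row_prefix=b"sensor42#"):
    print(key, data)

conn.close()
```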
Oozie
- Introduction to Oozie
- Features of Oozie
- Oozie Workflow
- Creating a MapReduce Workflow
- Start, End, and Error Nodes
- Parallel Fork and Join Nodes
- Workflow Jobs Lifecycle
- Workflow Notifications
- Workflow Manager
- Creating and Running a Workflow
- Oozie Coordinator Sub-groups
- Oozie Coordinator Components, Variables, and Parameters
Day Three
Working with Hive
- Architecture and design
- Data types
- SQL support in Hive
- Creating Hive tables and querying
- Partitions
- Joins
- Text processing
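A minimal sketch of the kind of table-creation and query work done in this module, run over HiveServer2 with the PyHive client (host, port, and table names are hypothetical):

```python
# Minimal sketch: creating and querying a partitioned Hive table over
# HiveServer2 via PyHive. Hostname and table names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="student")
cur = conn.cursor()

# Partition columns become directories in HDFS, so queries filtering on
# `dt` only read the matching partitions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS web_logs (
        ip STRING, url STRING, status INT
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")

cur.execute(
    "SELECT url, COUNT(*) AS hits FROM web_logs "
    "WHERE dt = '2024-01-01' GROUP BY url"
)
for url, hits in cur.fetchall():
    print(url, hits)
```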
Hive (Advanced)
- Transformation, Aggregation
- Working with Dates, Timestamps, and Arrays
- Converting Strings to Date, Time, and Numbers
- Create new Attributes, Mathematical Calculations, Windowing Functions
- Use Character and String Functions
- Binning and Smoothing
- Processing JSON Data
- Execution Engines (Tez, MR, Spark)
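For the windowing and date-function material, a minimal sketch over a hypothetical `daily_url_hits` table, again via PyHive:

```python
# Minimal sketch: a Hive windowing function plus date handling, run over
# HiveServer2 via PyHive. Host and the daily_url_hits table are
# hypothetical examples.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="student")
cur = conn.cursor()

# Rank each URL's hit count within its day, and derive the weekday name;
# windowing functions compute over a partition without collapsing rows.
cur.execute("""
    SELECT url, dt, hits,
           RANK() OVER (PARTITION BY dt ORDER BY hits DESC) AS rank_in_day,
           date_format(to_date(dt), 'EEEE')                 AS day_name
    FROM daily_url_hits
""")
for row in cur.fetchall():
    print(row)
```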
Day Four
Hive in Cloudera (or tool of choice)
Working with Spark
Spark Basics
- Big Data, Hadoop, Spark
- What’s new in Spark v2
- Spark concepts and architecture
- Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
Spark Shell
- Spark web UIs
- Analyzing a dataset – part 1
RDDs (Condensed coverage)
- RDDs concepts
- RDD Operations / transformations
- Labs: Unstructured data analytics using RDDs
- Data model concepts
- Partitions
- Distributed processing
- Failure handling
- Caching and persistence
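A minimal sketch in the spirit of this module's lab: word counting over unstructured text with RDD transformations, caching, and actions (the HDFS input path is hypothetical):

```python
# Minimal sketch: unstructured-data analytics with RDDs. The HDFS input
# path is a hypothetical example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///user/student/data/logs.txt")  # one partition per HDFS block
words = lines.flatMap(lambda line: line.split())           # transformation: lazy
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()                        # persist in memory for reuse across actions
print(counts.count())                 # action: triggers the computation
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # top 10 words

spark.stop()
```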
Spark Dataframes & Datasets
- Intro to Dataframe / Dataset
- Programming in Dataframe / Dataset API
- Loading structured data using Dataframes
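A minimal sketch of loading and manipulating structured data with the Dataframe API (the JSON path and column names are hypothetical):

```python
# Minimal sketch: loading structured data with the Dataframe API. The
# JSON path and column names are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframes").getOrCreate()

df = spark.read.json("hdfs:///user/student/data/orders.json")  # schema inferred
df.printSchema()

# Typical Dataframe operations: filter, aggregate, sort.
(df.filter(F.col("amount") > 100)
   .groupBy("customer_id")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show(10))

spark.stop()
```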
Spark SQL
- Spark SQL concepts and overview
- Defining tables and importing datasets
- Querying data using SQL
- Handling various storage formats: JSON / Parquet / ORC
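A minimal Spark SQL sketch: registering a Dataframe as a view, querying it with SQL, and writing it back out in columnar formats (paths and names are hypothetical):

```python
# Minimal sketch: exposing a Dataframe to SQL and converting between
# storage formats. Paths and names are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

orders = spark.read.json("hdfs:///user/student/data/orders.json")
orders.createOrReplaceTempView("orders")  # make the Dataframe queryable via SQL

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.show()

# Same data, different storage formats: columnar Parquet/ORC are the
# usual choices for analytics workloads.
orders.write.mode("overwrite").parquet("hdfs:///user/student/data/orders_parquet")
orders.write.mode("overwrite").orc("hdfs:///user/student/data/orders_orc")

spark.stop()
```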
Spark API programming (Scala and Python)
- Introduction to Spark API
- Submitting the first program to Spark
- Debugging / logging
- Configuration properties
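A minimal sketch of a self-contained application suitable for `spark-submit` (the app name and configuration value are illustrative):

```python
# first_app.py -- minimal sketch of a self-contained Spark application
# suitable for spark-submit. App name and config value are examples.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession.builder
             .appName("first-app")
             .config("spark.sql.shuffle.partitions", "64")  # example config property
             .getOrCreate())
    spark.sparkContext.setLogLevel("WARN")  # quieter logs while debugging

    df = spark.range(1_000_000)                 # simple built-in dataset
    print(df.selectExpr("sum(id)").first()[0])  # trigger a job

    spark.stop()
```

Run locally with `spark-submit first_app.py`, or on a cluster with `spark-submit --master yarn first_app.py`.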
Spark and Hadoop
- Hadoop Primer: HDFS / YARN
- Hadoop + Spark architecture
- Running Spark on YARN
- Processing HDFS files using Spark
- Spark & Hive
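A minimal sketch of processing HDFS files and Hive tables from Spark; note that the cluster manager (e.g. `--master yarn`) is selected at submit time rather than in code (paths and table names are hypothetical, and Hive access assumes a configured metastore):

```python
# Minimal sketch: processing an HDFS file and a Hive table with Spark.
# Paths and table names are hypothetical; Hive access assumes a
# configured metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-hadoop")
         .enableHiveSupport()      # lets Spark read/write Hive tables
         .getOrCreate())

logs = spark.read.text("hdfs:///user/student/data/logs.txt")
errors = logs.filter(logs.value.contains("ERROR"))
print(errors.count())

# Reading a Hive table directly through Spark SQL:
spark.sql("SELECT COUNT(*) FROM web_logs").show()

spark.stop()
```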
Capstone project (Optional)
- Team design workshop
- The class will be broken into teams
- The teams will get a name and a task
- They will architect a complete solution to a specific useful problem, present it, and defend the architecture based on the best practices they have learned in class
Optional Additional Topics – Please Inquire for Details
Machine Learning (ML / MLlib)
- Machine Learning primer
- Machine Learning in Spark: MLlib / ML
- Spark ML overview (newer Spark2 version)
- Algorithms: Clustering, Classifications, Recommendations
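A minimal clustering sketch with the Dataframe-based `spark.ml` API (inline toy data; column names are illustrative):

```python
# Minimal sketch: k-means clustering with the Dataframe-based spark.ml
# API. The inline data and column names are toy examples.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)],
    ["x", "y"],
)

# spark.ml estimators expect a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()

spark.stop()
```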
GraphX
- GraphX library overview
- GraphX APIs
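GraphX itself exposes Scala/JVM APIs; to stay in Python here, this sketch uses the separate GraphFrames package, the usual Python-side counterpart for the same graph concepts (requires launching Spark with the GraphFrames package available; the graph data is inline toy data):

```python
# Minimal sketch of graph processing from Python using GraphFrames
# (a separate package; launch Spark with it on the classpath, e.g. via
# --packages). Vertices and edges below are inline toy data.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "rel"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                        # degree of each vertex
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()

spark.stop()
```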
Spark Streaming
- Streaming concepts
- Evaluating Streaming platforms
- Spark streaming library overview
- Streaming operations
- Sliding window operations
- Structured Streaming
- Continuous streaming
- Spark & Kafka streaming
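A minimal Structured Streaming sketch: a streaming word count over a local socket source (feed it with `nc -lk 9999`; swapping in Kafka means `format("kafka")` plus the Kafka connector package on the classpath):

```python
# Minimal sketch: a Structured Streaming word count reading from a local
# socket. Host and port are local-test examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

counts = (lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
               .groupBy("word")
               .count())

query = (counts.writeStream
         .outputMode("complete")   # re-emit the full counts each trigger
         .format("console")
         .start())
query.awaitTermination()
```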