Details
This four-day Spark Developer course is for data engineers, analysts, architects, software engineers, IT operations staff, and technical managers interested in a thorough, hands-on overview of Apache Spark.
The course covers the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs.
Objectives
After taking this class you will be able to:
- Describe Spark’s fundamental mechanics
- Use the core Spark APIs to operate on data
- Articulate and implement typical use cases for Spark
- Build data pipelines with Spark SQL and DataFrames
- Analyze Spark jobs using the UIs and logs
- Create Streaming and Machine Learning jobs
Prerequisites
• Required
◦ Basic to intermediate Linux knowledge, including the ability to use a text editor such as vi, and familiarity with basic commands such as mv, cp, ssh, grep, cd, useradd
◦ Knowledge of application development principles
• Recommended
◦ Knowledge of functional programming
◦ Knowledge of Scala or Python
◦ Beginner fluency with SQL
Course Overview
Lesson 1 – Introduction to Apache Spark (Day 1)
- Describe the features of Apache Spark
- Advantages of Spark
- How Spark fits in with the Big Data application stack
- How Spark fits in with Hadoop
- Define Apache Spark components
Lesson 2 – Load and Inspect Data in Apache Spark
- Describe different ways of getting data into Spark
- Create and use Resilient Distributed Datasets (RDDs)
- Apply transformations to RDDs
- Use actions on RDDs
- Load and inspect data in RDDs
- Cache intermediate RDDs
- Use Spark DataFrames for simple queries
- Load and inspect data in DataFrames
Lesson 3 – Build a Simple Apache Spark Application (Day 2)
- Define the lifecycle of a Spark program
- Define the function of SparkContext
- Create the application
- Define different ways to run a Spark application
- Run your Spark application
- Launch the application
Lesson 4 – Work with Pair RDDs (Day 2)
- Review loading and exploring data in RDDs
- Load and explore data in RDDs
- Describe and create pair RDDs
- Create and explore pair RDDs
- Control partitioning across nodes
Lesson 5 – Work with DataFrames (Day 3)
- Create DataFrames
◦ From existing RDDs
◦ From data sources
- Work with data in DataFrames
◦ Use DataFrame operations
◦ Use SQL
◦ Explore data in DataFrames
- Create user-defined functions (UDFs)
◦ UDFs used with Scala DSL
◦ UDFs used with SQL
◦ Create and use user-defined functions
- Repartition DataFrames
- Supplemental Lab: Build a standalone application
Lesson 6 – Monitor Apache Spark Applications (Day 3)
- Describe components of the Spark execution model
- Use the Spark Web UI to monitor Spark applications
- Debug and tune Spark applications
- Use the Spark Web UI
Lesson 7 – Introduction to Apache Spark Data Pipelines (Day 3)
- Identify components of the Apache Spark Unified Stack
- List benefits of Apache Spark over the Hadoop ecosystem
- Describe data pipeline use cases
Lesson 8 – Create an Apache Spark Streaming Application (Day 4)
- Describe Spark Streaming architecture
- Create DStreams and a Spark Streaming application
- Build and run a Streaming application which writes to HBase
- Apply operations on DStreams
- Define window operations
◦ Build and run a Streaming application with SQL
◦ Build and run a Streaming application with windows and SQL
- Describe how Streaming applications are fault-tolerant
Lesson 9 – Use Apache Spark GraphX (Day 4)
- Describe GraphX
- Define regular, directed, and property graphs
- Create a property graph
- Perform operations on graphs
Lesson 10 – Use Apache Spark MLlib (Day 4)
- Describe Spark MLlib
- Describe the Machine Learning techniques
◦ Classification
◦ Clustering
◦ Collaborative filtering
- Use collaborative filtering to predict user choice
- Load and inspect data using the Spark shell