Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief overview of Python and Scala

Foundational Concepts (Theory):

  • System Architecture
  • Resilient Distributed Datasets (RDD)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Practical Workshop: Mastering Basics in the Databricks Environment:

  • Hands-on exercises with the RDD API
  • Essential action and transformation functions
  • PairRDDs
  • Join operations
  • Strategies for data caching
  • Hands-on exercises with the DataFrame API
  • Spark SQL
  • DataFrame operations: select, filter, group, and sort
  • User Defined Functions (UDF)
  • Exploration of the Dataset API
  • Stream processing

Practical Workshop: Deployment in the AWS Environment:

  • Core concepts of AWS Glue
  • Differentiating between Amazon EMR and AWS Glue
  • Example jobs run on both platforms
  • Analysis of advantages and disadvantages

Additional Topics:

  • Introduction to Apache Airflow for orchestration

Requirements

  • Programming skills (preferably in Python or Scala)
  • Foundational knowledge of SQL

Duration

21 Hours
