Course Outline
Introduction
- Hadoop history and core concepts
- The Hadoop ecosystem
- Available distributions
- High-level architecture overview
- Common Hadoop myths
- Hadoop challenges (hardware and software)
- Labs: Discussion of your Big Data projects and associated problems
Planning and Installation
- Choosing software and Hadoop distributions
- Cluster sizing and growth planning
- Hardware and network selection
- Rack topology design
- Installation procedures
- Multi-tenancy implementation
- Directory structures and log management
- Benchmarking performance
- Labs: Cluster installation and running performance benchmarks
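As a rough illustration of the cluster sizing topic above, the following Python sketch estimates how many DataNodes a given data volume might need. The function name, the 25% headroom kept free for temporary and intermediate data, and the example figures are assumptions made for illustration, not recommendations taken from the course; the replication factor of 3 is the HDFS default.

```python
import math


def estimate_datanodes(data_tb, disk_per_node_tb, replication=3, headroom=0.25):
    """Estimate the DataNode count for a given logical data volume.

    data_tb          -- logical (unreplicated) data size in TB
    disk_per_node_tb -- raw disk capacity of each DataNode in TB
    replication      -- HDFS replication factor (3 is the HDFS default)
    headroom         -- assumed fraction of each disk kept free for
                        temporary and intermediate data (hypothetical value)
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    raw_tb = data_tb * replication              # space consumed by all replicas
    usable_per_node = disk_per_node_tb * (1 - headroom)
    return math.ceil(raw_tb / usable_per_node)


if __name__ == "__main__":
    # Hypothetical example: 100 TB of data on nodes with 24 TB of disk each.
    print(estimate_datanodes(100, 24))
```

A real sizing exercise would also account for growth rate, compression, and compute requirements, which this sketch deliberately leaves out.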
HDFS Operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring techniques
- Command-line and browser-based administration
- Adding storage and replacing defective drives
- Labs: Getting familiar with the HDFS command line
Data Ingestion
- Using Flume to ingest logs and other data into HDFS
- Using Sqoop to import data from SQL databases to HDFS and export back to SQL
- Hadoop data warehousing with Hive
- Copying data between clusters using distcp
- Utilizing S3 as a complement to HDFS
- Best practices and architectures for data ingestion
- Labs: Setting up and utilizing Flume and Sqoop
MapReduce Operations and Administration
- Parallel computing before MapReduce: comparing HPC with Hadoop administration
- Managing MapReduce cluster loads
- Nodes and daemons (JobTracker, TaskTracker)
- Walkthrough of the MapReduce User Interface
- MapReduce configuration
- Job configuration
- Optimizing MapReduce performance
- Fool-proofing MapReduce: Guidance for programmers
- Labs: Running MapReduce examples
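For readers new to the programming model covered in this module, here is a minimal sketch of the map, shuffle, and reduce steps written in plain Python. It does not use the Hadoop API and runs in a single process; the function names and the sample input are hypothetical and only illustrate the data flow of a word count.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers run as separate tasks on different nodes, and the shuffle moves data across the network between them.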
YARN: New Architecture and Capabilities
- YARN design goals and implementation architecture
- New actors: ResourceManager, NodeManager, Application Master
- Installing YARN
- Job scheduling under YARN
- Labs: Investigating job scheduling
Advanced Topics
- Hardware monitoring
- Cluster monitoring
- Adding and removing servers, and upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop High Availability (HA)
- Hadoop Federation
- Securing your cluster with Kerberos
- Labs: Setting up monitoring systems
Optional Tracks
- Cloudera Manager: installation and usage for cluster administration, monitoring, and routine tasks. All exercises and labs in this track are conducted within the Cloudera distribution environment (CDH5)
- Ambari: installation and usage for cluster administration, monitoring, and routine tasks. All exercises and labs in this track are conducted within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)
Requirements
- Proficiency in basic Linux system administration
- Fundamental scripting capabilities
Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained during the course.
Lab Environment
Zero Install Requirement: Students are not required to install Hadoop software on their personal machines. A fully functional Hadoop cluster will be provided for use during the course.
Participants will need the following:
- An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows, PuTTY is recommended)
- A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed
21 Hours
Testimonials (1)
Hands-on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already.