Big Data Training Outline

Curriculum made for the real world

Module 1 

Introduction: Linux Environment, Setting Up Linux Environment, Linux Basic commands, Skills Evaluation
Development Principles: Setting Up Scala Environment, Scala Development Principles, Scala object-oriented programming and functional programming concepts, Scala programming features


Setting Up Python Environment: Python Development Principles, Python programming concepts (classes, collections, exception handling), Case Studies for Big Data
Introduction to Big Data: Hadoop Overview and History, Overview of the Hadoop Ecosystem, Hadoop Architecture 

Module 2 

Introduction to and installation of Hadoop: HDFS (what it is and how it works), loading a dataset into HDFS, HDFS architecture and read/write anatomy, HDFS commands

YARN explained: Classic version of MapReduce, MapReduce (what it is and how it works), MapReduce architecture with a word count program, MapReduce Combiner & Partitioner concepts and practical demo, MapReduce Distributed Cache, map-side join and reduce-side join
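
As a rough illustration of the word count, Combiner and Partitioner topics listed above, here is a minimal sketch in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `combine`, `partition`, `reduce_phase`, `word_count`) and the CRC32-based partitioning are assumptions made for illustration only.

```python
import zlib
from collections import Counter, defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum counts on the mapper side to shrink shuffle traffic."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partition(word, num_reducers):
    """Partitioner: route a key to a reducer index using a stable hash."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def reduce_phase(word, counts):
    """Reducer: sum all partial counts received for one word."""
    return word, sum(counts)


def word_count(splits, num_reducers=2):
    """Simulate the job: each split is the list of lines seen by one mapper."""
    # One bucket per reducer, mapping each word to its list of partial counts.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = [pair for line in split for pair in map_phase(line)]
        for word, count in combine(mapped):
            buckets[partition(word, num_reducers)][word].append(count)

    result = {}
    for bucket in buckets:
        for word, counts in bucket.items():
            key, total = reduce_phase(word, counts)
            result[key] = total
    return result


if __name__ == "__main__":
    print(word_count([["big data big"], ["data pipeline"]]))
```

Because the partitioner always sends the same word to the same reducer, each word is totalled in exactly one place; the combiner only reduces how many pairs cross the simulated shuffle.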

Integrating DBMS with Hadoop: Setting up MySQL Server and creating & loading datasets into MySQL, Sqoop Architecture, Writing the Sqoop Import Commands to transfer data from RDBMS to HDFS/Hive, Writing the Sqoop Export Commands to transfer data from HDFS/Hive to RDBMS

Flume explained: Set up Flume and publish logs with it, Set up Flume to monitor a directory and store its data in HDFS, Use case implementation in Flume

Hive: Use Hive to find the most popular game, How Hive works, Hive Architecture, Hive Metastore, Hive Partitioning and Bucketing, Hive file formats, Hive JSON, CSV, XML, ORC & Regex SerDe, Use Hive to find the game with the highest average rating, Skills Evaluation

Module 3 

ZooKeeper explained: Kafka explained, Kafka Architecture, Setting up Kafka and publishing data, Publishing web logs with Kafka, Kafka Streams, Kafka Connect API, Kafka performance tuning

Basic SQL queries: SQL joins, SQL window functions, SQL indexes and views, SQL performance tuning
Spark and MapReduce difference: The Resilient Distributed Dataset (RDD), DataFrame & Dataset difference, Spark core transformations and actions, Spark performance tuning 

Spark DataFrame API: Spark Dataset API, Spark SQL (reading data from different data sources: MySQL, Hive, MongoDB), Spark SQL joins, shared variables, Spark cluster types, Spark file system ingestion for CSV, XML, JSON

Module 4 

Spark Streaming concepts: DStream, Spark Structured Streaming using Kafka, Spark socket program, Spark file format structured streaming, Spark stateful streaming

Data Pipeline Building: with NiFi/Kafka/Spark, Skills Evaluation

REST API Implementation in Python: CSV, JSON, XML parsers in Python, POC for Spark Core, Spark SQL, Spark Streaming

What is HBase: HBase shell commands, HBase Hive integration, HBase Sqoop integration

Module 5 

Airflow Installation, Airflow Webserver, Airflow Scheduler, Architecture, BashOperator, Hive operator and Spark operator, Airflow executors, Airflow XCom, Airflow hooks

Snowflake SQL data warehouse tool and integration with Spark, Cloudera CDH cluster and Databricks cluster, Cassandra concepts and Cassandra shell commands

Oozie explained: Set up a simple Oozie workflow, Zeppelin overview 

MongoDB concepts and MongoDB shell commands: Mongo aggregation framework, indexing, Mongo Python integration, Mongo installation

Introduction to Apache NiFi: Installing and Configuring NiFi
Important Concepts: FlowFile, Processor and Connection, NiFi example use case

Module 6 

AWS Cloud Introduction: AWS account creation, comparison between Azure, GCP and AWS, AWS EC2 instance types, AWS EBS volumes, AWS Security Groups, AWS S3, VPC, AWS Auto Scaling, RDS, DynamoDB, Lambda, EMR cluster setup, Spark job deployment on EMR, Redshift Architecture, Redshift COPY command to copy from S3 to Redshift tables, Azure Data Factory and Azure Databricks cluster

Google Cloud Platform: GCS, Dataproc, Pub/Sub, Cloud SQL, Cloud Composer, Database Migration Service, API Gateway

Git, Git revert and Git reset, Git branch, Git merge

Apache Impala explained

Module 7 

Capstone Project Selection 

Jenkins installation, Jenkins CI/CD pipeline and Jenkins webhook, Jenkins scheduler

Docker installation, Docker commands

Docker Jenkins pipeline, Docker Compose, Docker volumes, Docker Swarm

Kubernetes architecture, Kubernetes Pods, ReplicaSet, Kubernetes deployment types, Kubernetes ReplicationController

Agile Methodology, interview questions on general topics such as cluster size and data volume

Unit testing in Spark (Scala, Python)

The course outline above is a general overview of topics covered and skills learned. It is subject to change. The actual course may differ slightly from the outlined topics and assignments.

Ready for the next step?