Big Data Training Outline
Curriculum made for the real world
Module 1
Introduction: Linux Environment, Setting Up Linux Environment, Basic Linux Commands, Skills Evaluation
Development Principles: Setting Up Scala Environment, Scala Development Principles, Scala object-oriented programming and functional programming concepts, Scala programming features
Setting Up Python Environment: Python Development Principles, Python programming concepts (classes, collections, exception handling), Case Studies for Big Data
Introduction to Big Data: Hadoop Overview and History, Overview of the Hadoop Ecosystem, Hadoop Architecture
Module 2
Introduction and Hadoop installation: HDFS (what it is and how it works), Loading a Dataset into HDFS, HDFS Architecture and Read/Write Anatomy, HDFS commands
YARN explained: Classical version of MapReduce, MapReduce (what it is and how it works), MapReduce Architecture: word count program, MapReduce Combiner & Partitioner concepts and practical demo, MapReduce Distributed Cache, Map-Side Join and Reduce-Side Join
Integrating DBMS with Hadoop: Setting up MySQL Server and creating & loading datasets into MySQL, Sqoop Architecture, Writing the Sqoop Import Commands to transfer data from RDBMS to HDFS/Hive, Writing the Sqoop Export Commands to transfer data from HDFS/Hive to RDBMS
Flume explained: Set up Flume and publish logs with it, Set up Flume to monitor a directory and store its data in HDFS, Use case implementation in Flume
Hive: Use Hive to find the most popular game, How Hive works, Hive Architecture, Hive Metastore, Hive Partitioning and Bucketing, Hive file formats, Hive JSON, CSV, XML, ORC & Regex SerDe, Use Hive to find the game with the highest average rating, Skills Evaluation
Module 3
ZooKeeper explained: Kafka explained, Kafka Architecture, Setting up Kafka and publishing data, Publishing web logs with Kafka, Kafka Streams, Kafka Connect API, Kafka performance tuning
Basic SQL queries: SQL joins, SQL window functions, SQL indexes and views, SQL performance tuning
Spark and MapReduce difference: The Resilient Distributed Dataset (RDD), DataFrame & Dataset difference, Spark core transformations and actions, Spark performance tuning
Spark DataFrame API: Spark Dataset API, Spark SQL (reading data from different data sources: MySQL, Hive, MongoDB), Spark SQL joins, Shared variables, Spark cluster types, Spark file system ingestion for CSV, XML, JSON
Module 4
Spark Streaming concepts: DStream, Spark Structured Streaming using Kafka, Spark socket program, Spark file format structured streaming, Spark stateful streaming
Data Pipeline Building: with NiFi/Kafka/Spark, Skills Evaluation
REST API Implementation in Python: CSV, JSON, XML parsers in Python
POC for Spark Core, Spark SQL, Spark Streaming
What is HBase: HBase shell commands, HBase Hive integration, HBase Sqoop integration
Module 5
Airflow Installation, Airflow Webserver, Airflow Scheduler, Architecture, BashOperator, Hive operator and Spark operator, Airflow executors, Airflow XCom, Airflow hooks
Snowflake SQL data warehouse tool and integration with Spark, Cloudera CDH cluster and Databricks cluster, Cassandra concepts and Cassandra shell commands
Oozie explained: Set up a simple Oozie workflow, Zeppelin overview
MongoDB concepts and MongoDB shell commands: MongoDB aggregation framework, indexing, MongoDB Python integration, MongoDB installation
Introduction to Apache NiFi: Installing and Configuring NiFi
Important Concepts: FlowFile, Processor and Connector, NiFi example use case
Module 6
AWS Cloud Introduction: AWS account creation, comparison between Azure, GCP and AWS, AWS EC2 instance types, AWS EBS volumes, AWS Security Groups, AWS S3, VPC, AWS Auto Scaling, RDS, DynamoDB, Lambda, EMR cluster setup, Spark job deployment in EMR, Redshift Architecture, Redshift COPY command to copy from S3 to Redshift tables, Azure Data Factory and Azure Databricks cluster
Google Cloud Platform: GCS, Dataproc, Pub/Sub, Cloud SQL, Composer, Database Migration Service, API Gateway
Git, Git revert and Git reset, Git branch, Git merge
Apache Impala Explained
Module 7
Capstone Project Selection
Jenkins installation, Jenkins CI/CD pipeline and Jenkins webhook, Jenkins scheduler
Docker installation, Docker commands, Docker Jenkins pipeline, Docker Compose, Docker volumes, Docker Swarm
Kubernetes Architecture, Kubernetes Pods, ReplicaSet, Kubernetes deployment types, Kubernetes replication controller
Agile Methodology, Interview questions on general topics such as cluster size and data volume
Unit testing in Spark (Scala, Python)
The course outline above is a general overview of topics covered and skills learned. It is subject to change. The actual course may differ slightly from the outlined topics and assignments.