Online | On-site | Hybrid

Enterprise Data Engineering Training

Build a strong foundation in modern Data Engineering using enterprise tools and proven architecture patterns, from ingestion to transformation and serving. Learn how to design reliable pipelines with governance, quality, orchestration, and performance best practices to power analytics and business reporting at scale.

Duration: 5 days
Rating: 4.8/5.0
Level: Intermediate
1500+ users onboarded

Who will Benefit from this Training?

  • Software Engineers transitioning into Data Engineering
  • Data Analysts moving toward pipeline development
  • ETL Developers and BI Engineers modernizing data stacks
  • Junior to Intermediate Data Engineers
  • Cloud Engineers and DevOps Engineers supporting data platforms
  • Tech Leads and Data Architects who need scalable platform patterns

Training Objectives

  • Write production-grade SQL and Python for data extraction, transformation, and validation.
  • Design scalable data models for analytics and business reporting.
  • Build reliable ETL/ELT pipelines using orchestration and modular transformations.
  • Process high-volume datasets using Apache Spark (PySpark) distributed processing patterns.
  • Work effectively with cloud data warehouses and understand performance and cost tradeoffs.
  • Build real-time streaming pipelines using Kafka fundamentals.
  • Apply enterprise best practices for data quality, observability, and governance readiness.
  • Deliver an end-to-end mini-capstone data platform solution.

Build a high-performing, job-ready tech team.

Personalise your team’s upskilling roadmap and design a tailored, hands-on training program with Uptut.

Key training modules

Comprehensive, hands-on modules designed to take you from the basics to advanced concepts. Illustrative code sketches for selected modules follow the module list.
  • Module 1: Spark for Data Engineering (Big Picture)
    1. Why Spark is used for ETL (distributed processing, scale-out workloads, fault tolerance)
    2. ETL vs ELT in the real world
    3. Spark execution basics (driver and executors, lazy evaluation, transformations vs actions)
    4. Spark UI overview (what engineers monitor)
    5. Hands-on: Identify which workloads require Spark vs SQL-only tools
  • Module 2: Databricks Environment Overview
    1. Databricks workspace components
    2. Notebook workflow
    3. Interactive clusters vs job clusters
    4. Cluster sizing basics
    5. Hands-on: Lab 1: Set up workspace and create first notebook
    6. Hands-on: Lab 2: Create cluster and validate Spark session
  • Module 3: Reading Data into Spark DataFrames
    1. Read CSV with schema inference and explicit schema
    2. Read JSON nested data
    3. Read Parquet
    4. Schema management (inferSchema pitfalls, specifying schema for stability)
    5. Hands-on: Lab 3: Read raw CSV orders dataset and inspect schema
    6. Hands-on: Lab 4: Read nested JSON events and flatten selected fields
  • Module 4: Core Transformations (Real ETL Patterns)
    1. Selecting and renaming columns
    2. Filtering rows
    3. Type casting
    4. Handling null values
    5. Adding derived columns
    6. Hands-on: Lab 5: Clean raw orders table (standardize dates, handle missing fields, fix numeric types)
  • Module 5: Joins and Aggregations in Spark
    1. Join types and pitfalls (row explosion)
    2. Aggregations using groupBy
    3. Rollups (intro)
    4. Performance hint concepts (intro)
    5. Hands-on: Lab 6: Join customers + orders
    6. Hands-on: Lab 7: Create daily revenue aggregates
  • Module 6: Why Delta Lake (Lakehouse Foundations)
    1. Problems with raw Parquet only (inconsistent writes, partial failures, no table history)
    2. Delta Lake benefits (ACID transactions, schema enforcement, time travel)
    3. Hands-on: Lab 8: Write cleansed dataset as Delta table
    4. Hands-on: Lab 9: Query Delta table and view versions
  • Module 7: Incremental ETL Patterns (Production Requirement)
    1. Full refresh vs incremental
    2. Partition-based incremental loads
    3. Watermark strategy concepts
    4. Handling late arriving data
    5. Hands-on: Lab 10: Implement incremental append into Delta (partitioned by date)
    6. Hands-on: Lab 11: Validate incremental load row counts across runs
  • Module 8: Merge/Upsert Workflows (CDC Style)
    1. The need for MERGE (updates, dedup, CDC-like merges)
    2. Using Delta MERGE INTO
    3. Building dimension tables using upserts
    4. Hands-on: Lab 12: MERGE customers into curated customer dimension table
    5. Hands-on: Lab 13: Update records and verify current state changes correctly
  • Module 9: Data Validation and Quality Checks in Spark Pipelines
    1. Schema checks
    2. Null checks
    3. Duplicate checks
    4. Range checks
    5. Stop pipeline on quality failure vs warn only
    6. Hands-on: Lab 14: Add validation stage before publishing curated output
    7. Hands-on: Lab 15: Generate a data quality report output
  • Module 10: Debugging Broken Data Pipelines
    1. Schema mismatch
    2. Bad data types
    3. Corrupt records
    4. Missing columns
    5. Join explosion issues
    6. Hands-on: Lab 16: Fix the broken pipeline exercise (wrong schema, invalid join key, corrupted file handling, missing partition folder)
  • Module 11: Spark Performance Fundamentals (Must-Know)
    1. Why Spark pipelines become slow (shuffles, skewed joins, too many small files)
    2. Partitioning concepts (repartition vs coalesce)
    3. Caching and persistence strategy
    4. Broadcast joins (intro)
    5. Hands-on: Lab 17: Optimize job by reducing shuffle
    6. Hands-on: Lab 18: Fix slow join performance using broadcast join
  • Module 12: File Optimization and Lakehouse Maintenance
    1. Small files problem
    2. Compaction strategies concept
    3. Z-Ordering concept (Databricks)
    4. VACUUM and retention concept
    5. Optimize table operations overview
    6. Hands-on: Lab 19: Run OPTIMIZE on Delta table and compare query performance
    7. Hands-on: Lab 20: Run VACUUM safely and explain retention impact
  • Module 13: Building Reusable ETL Pipelines
    1. Notebook to modular pipeline approach
    2. Using parameters (processing date, source path, target path)
    3. Configuration-driven pipelines
    4. Pipeline logging patterns
    5. Hands-on: Lab 21: Add parameters to ETL notebook (run_date, env)
    6. Hands-on: Lab 22: Generate run metrics (rows processed, runtime, failed records)
  • Module 14: Orchestration Options (Databricks Jobs + Airflow Overview)
    1. Databricks Jobs (scheduling, retries, notifications)
    2. Airflow integration overview (triggering Databricks job run, dependency chaining)
    3. Hands-on: Lab 23: Convert notebook into a Databricks job and schedule it
    4. Hands-on: Lab 24: Simulate failure and validate retries
  • Module 15: Security and Access Control (Intro)
    1. Workspace permissions (concept)
    2. Secure handling of credentials
    3. Secrets management concept (Databricks secrets)
    4. Hands-on: Define access policy for data engineers, analysts, external contractors
  • Module 16: End-to-End Pipeline Design (Batch ELT Architecture)
    1. Architecture: raw → cleansed → curated → marts
    2. Choosing Delta vs Parquet
    3. Partition keys
    4. Incremental vs full
    5. SLA and validation strategy
    6. Hands-on: Workshop: Design the capstone pipeline architecture
  • Module 17: Capstone Implementation (Hands-on Build)
    1. Capstone goal: Build an end-to-end ELT pipeline using Databricks + Delta
    2. Ingest raw datasets (customers, orders, payments)
    3. Build cleansed Delta tables
    4. Build curated tables (customer dimension MERGE, daily revenue fact table)
    5. Implement incremental processing (run-by-date parameter)
    6. Add data quality checks (unique order_id, non-null customer_id, valid amounts)
    7. Produce reporting tables (top customers, revenue by region)
    8. Hands-on: Deliverables: notebooks/scripts, Delta tables, validation report, KPI queries
  • Module 18: Fix the Broken ETL Challenge
    1. Broken scenario: wrong schema
    2. Broken scenario: missing partition folder
    3. Broken scenario: join explosion
    4. Broken scenario: duplicates causing incorrect revenue
    5. Hands-on: Lab 25: Troubleshoot and fix the broken ETL pipeline
  • Module 19: Production Readiness Checklist
    1. Idempotency checklist
    2. Backfill strategy
    3. Late arriving data plan
    4. Observability plan (logs, run metrics, alerts)
    5. Cost optimization checklist
    6. Hands-on: Workshop: Create a production readiness checklist for your team
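
Illustrative Code Sketches (Selected Modules)

The sketches below are minimal illustrations of patterns covered in the modules above, not the official lab solutions. All paths, table names, and column names (orders, customers, /mnt/... locations) are hypothetical.

Module 3: Reading data with an explicit schema. A minimal PySpark sketch of reading a raw CSV with a declared schema rather than inferSchema, which keeps column types stable across runs; the orders columns and path are assumed for illustration.

    # Read a raw orders CSV with an explicit schema (hypothetical path and columns).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

    spark = SparkSession.builder.appName("read-orders").getOrCreate()

    orders_schema = StructType([
        StructField("order_id", StringType(), nullable=False),
        StructField("customer_id", StringType(), nullable=True),
        StructField("order_date", DateType(), nullable=True),
        StructField("amount", DoubleType(), nullable=True),
    ])

    # An explicit schema avoids inferSchema guessing different types between runs.
    orders_df = (
        spark.read
        .option("header", "true")
        .schema(orders_schema)
        .csv("/mnt/raw/orders/")
    )
    orders_df.printSchema()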
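
Module 6: Writing a Delta table and using time travel. A minimal sketch, assuming a Delta-enabled Spark environment (such as Databricks) and a hypothetical cleansed_orders DataFrame and storage path.

    # Write the cleansed dataset as a Delta table (hypothetical path).
    (
        cleansed_orders
        .write
        .format("delta")
        .mode("overwrite")
        .save("/mnt/cleansed/orders")
    )

    # Time travel: read an earlier version of the table for auditing or debugging.
    previous = (
        spark.read
        .format("delta")
        .option("versionAsOf", 0)
        .load("/mnt/cleansed/orders")
    )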
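
Module 7: Incremental load into a date-partitioned Delta table. A sketch of appending a single processing date; run_date and the paths are hypothetical parameters. For idempotent reruns of the same date, Delta's replaceWhere overwrite option is one common alternative to a plain append.

    from pyspark.sql import functions as F

    run_date = "2024-01-15"  # hypothetical; normally supplied as a job parameter

    # Select only the rows for the current processing date.
    incremental_df = (
        spark.read.format("delta").load("/mnt/cleansed/orders")
        .filter(F.col("order_date") == F.lit(run_date))
    )

    # Append the daily slice into a Delta table partitioned by date.
    (
        incremental_df
        .write
        .format("delta")
        .mode("append")
        .partitionBy("order_date")
        .save("/mnt/curated/orders")
    )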
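
Module 8: Upserting into a dimension table with Delta MERGE. A sketch using the Delta Lake Python API; the dimension path, the updates_df source DataFrame, and the join key are hypothetical.

    from delta.tables import DeltaTable

    dim_customers = DeltaTable.forPath(spark, "/mnt/curated/dim_customers")

    # Update existing customers and insert new ones in a single transaction.
    (
        dim_customers.alias("target")
        .merge(
            updates_df.alias("source"),
            "target.customer_id = source.customer_id",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )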
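
Module 9: A validation stage that stops the pipeline on failure. A sketch of simple count-based checks before publishing curated output; the checks and column names are illustrative.

    from pyspark.sql import functions as F

    total = curated_orders.count()
    null_customers = curated_orders.filter(F.col("customer_id").isNull()).count()
    duplicate_orders = total - curated_orders.select("order_id").distinct().count()
    negative_amounts = curated_orders.filter(F.col("amount") < 0).count()

    checks = {
        "null_customer_id": null_customers,
        "duplicate_order_id": duplicate_orders,
        "negative_amount": negative_amounts,
    }
    failed = {name: count for name, count in checks.items() if count > 0}

    # "Stop pipeline on quality failure": raising here fails the notebook or job run.
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")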
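
Module 11: Fixing a slow join with a broadcast hint. A sketch of broadcasting a small dimension so the large fact table is joined without shuffling its data; the DataFrame names are hypothetical.

    from pyspark.sql import functions as F

    # Broadcasting ships the small customers table to every executor, so the large
    # orders table can be joined locally instead of being shuffled across the cluster.
    joined = orders_df.join(
        F.broadcast(customers_df),
        on="customer_id",
        how="left",
    )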
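
Module 12: Compacting small files and cleaning old versions. A sketch of Delta table maintenance on Databricks; the table name and Z-Order column are hypothetical, and the same statements can also be run as SQL cells.

    # Compact small files and co-locate data for a frequently filtered column.
    spark.sql("OPTIMIZE curated.orders ZORDER BY (customer_id)")

    # VACUUM removes files no longer referenced by the table, subject to the retention
    # window (default 7 days); shortening retention reduces time-travel history.
    spark.sql("VACUUM curated.orders RETAIN 168 HOURS")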
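
Module 13: A parameter-driven Databricks notebook. A sketch using notebook widgets for run_date and env, as in Lab 21; dbutils is only available inside Databricks notebooks, and the default values and paths are hypothetical.

    # Declare notebook parameters with default values.
    dbutils.widgets.text("run_date", "2024-01-15")
    dbutils.widgets.text("env", "dev")

    run_date = dbutils.widgets.get("run_date")
    env = dbutils.widgets.get("env")

    # Configuration-driven source and target locations.
    source_path = f"/mnt/{env}/raw/orders/"
    target_path = f"/mnt/{env}/curated/orders/"

    print(f"Processing {run_date} from {source_path} into {target_path}")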

Hands-on Experience with Tools

  • Apache Spark (PySpark)
  • Databricks (notebooks, clusters, Jobs)
  • Delta Lake
  • SQL and Python
  • Apache Kafka (streaming fundamentals)
  • Apache Airflow (orchestration overview)
  • Cloud data warehouses

Training Delivery Format

Flexible, comprehensive training designed to fit your schedule and learning preferences
  • Opt-in Certifications: AWS, Scrum.org, DASA & more
  • 100% Live: on-site/online training
  • Hands-on: labs and capstone projects
  • Lifetime Access: to training material and sessions

How Does Personalised Training Work?

  1. Skill-Gap Assessment: Analysing the skill gap and assessing business requirements to craft a unique program
  2. Personalisation: Customising the curriculum and projects to prepare your team for challenges within your industry
  3. Implementation: Supplementing training with consulting support to ensure implementation in real projects

Why Data Engineering for your business?

  • Faster decision-making: Enable trusted dashboards and analytics by standardizing pipelines and data models.
  • Operational efficiency: Reduce manual reporting effort with automated batch and streaming pipelines.
  • Scalable foundations: Build a platform that supports growth, high data volumes, and advanced analytics.
  • AI readiness: Prepare clean, validated datasets required for successful AI/ML initiatives.
  • Governance and compliance: Improve audit readiness with repeatable pipelines, validation checks, and lineage practices.

Lead the Digital Landscape with Cutting-Edge Tech and In-House "Techsperts"

Discover the power of digital transformation with train-to-deliver programs from Uptut's experts. Backed by 50,000+ professionals across the world's leading tech innovators.

Frequently Asked Questions

1. What are the pre-requisites for this training?

The training does not require you to have prior skills or experience. The curriculum covers basics and progresses towards advanced topics.

2. Will my team get any practical experience with this training?

With our focus on experiential learning, we have made the training as hands-on as possible, with assignments, quizzes, capstone projects, and live labs where trainees learn by doing.

3. What is your mode of delivery - online or on-site?

We conduct both online and on-site training sessions. You can choose whichever suits your team best.

4. Will trainees get certified?

Yes, all trainees will get certificates issued by Uptut under the guidance of industry experts.

5. What do we do if we need further support after the training?

Our mentors are available for consultations if your team needs further assistance. They can guide your team, resolve queries, and help apply the training to real projects. Just book a consultation to get support.
