Online | On-site | Hybrid

Enterprise Data Engineering Training

Build a strong foundation in modern Data Engineering using enterprise tools and proven architecture patterns, from ingestion to transformation and serving. Learn how to design reliable pipelines with governance, quality, orchestration, and performance best practices to power analytics and business reporting at scale.

Duration: 5 days
Rating: 4.8/5.0
Level: Intermediate
1500+ users onboarded

Who will Benefit from this Training?

  • Software Engineers transitioning into Data Engineering
  • Data Analysts moving toward pipeline development
  • ETL Developers and BI Engineers modernizing data stacks
  • Junior to Intermediate Data Engineers
  • Cloud Engineers and DevOps Engineers supporting data platforms
  • Tech Leads and Data Architects who need scalable platform patterns

Training Objectives

  • Write production-grade SQL and Python for data extraction, transformation, and validation.
  • Design scalable data models for analytics and business reporting.
  • Build reliable ETL/ELT pipelines using orchestration and modular transformations.
  • Process high-volume datasets using Apache Spark (PySpark) distributed processing patterns.
  • Work effectively with cloud data warehouses and understand performance and cost tradeoffs.
  • Build real-time streaming pipelines using Kafka fundamentals.
  • Apply enterprise best practices for data quality, observability, and governance readiness.
  • Deliver an end-to-end mini-capstone data platform solution.

Build a high-performing, job-ready tech team.

Personalise your team’s upskilling roadmap and design a tailored, hands-on training program with Uptut.

Key training modules

Comprehensive, hands-on modules designed to take you from the basics to advanced concepts. Illustrative code sketches for selected modules follow the module list.
  • Module 1: Spark for Data Engineering (Big Picture)
    1. Why Spark is used for ETL (distributed processing, scale-out workloads, fault tolerance)
    2. ETL vs ELT in the real world
    3. Spark execution basics (driver and executors, lazy evaluation, transformations vs actions)
    4. Spark UI overview (what engineers monitor)
    5. Hands-on: Identify which workloads require Spark vs SQL-only tools
  • Module 2: Databricks Environment Overview
    1. Databricks workspace components
    2. Notebook workflow
    3. Interactive clusters vs job clusters
    4. Cluster sizing basics
    5. Hands-on: Lab 1: Set up workspace and create first notebook
    6. Hands-on: Lab 2: Create cluster and validate Spark session
  • Module 3: Reading Data into Spark DataFrames
    1. Read CSV with schema inference and explicit schema
    2. Read JSON nested data
    3. Read Parquet
    4. Schema management (inferSchema pitfalls, specifying schema for stability)
    5. Hands-on: Lab 3: Read raw CSV orders dataset and inspect schema
    6. Hands-on: Lab 4: Read nested JSON events and flatten selected fields
  • Module 4: Core Transformations (Real ETL Patterns)
    1. Selecting and renaming columns
    2. Filtering rows
    3. Type casting
    4. Handling null values
    5. Adding derived columns
    6. Hands-on: Lab 5: Clean raw orders table (standardize dates, handle missing fields, fix numeric types)
  • Module 5: Joins and Aggregations in Spark
    1. Join types and pitfalls (row explosion)
    2. Aggregations using groupBy
    3. Rollups (intro)
    4. Performance hint concepts (intro)
    5. Hands-on: Lab 6: Join customers + orders
    6. Hands-on: Lab 7: Create daily revenue aggregates
  • Module 6: Why Delta Lake (Lakehouse Foundations)
    1. Problems with raw Parquet only (inconsistent writes, partial failures, no table history)
    2. Delta Lake benefits (ACID transactions, schema enforcement, time travel)
    3. Hands-on: Lab 8: Write cleansed dataset as Delta table
    4. Hands-on: Lab 9: Query Delta table and view versions
  • Module 7: Incremental ETL Patterns (Production Requirement)
    1. Full refresh vs incremental
    2. Partition-based incremental loads
    3. Watermark strategy concepts
    4. Handling late arriving data
    5. Hands-on: Lab 10: Implement incremental append into Delta (partitioned by date)
    6. Hands-on: Lab 11: Validate incremental load row counts across runs
  • Module 8: Merge/Upsert Workflows (CDC Style)
    1. The need for MERGE (updates, dedup, CDC-like merges)
    2. Using Delta MERGE INTO
    3. Building dimension tables using upserts
    4. Hands-on: Lab 12: MERGE customers into curated customer dimension table
    5. Hands-on: Lab 13: Update records and verify current state changes correctly
  • Module 9: Data Validation and Quality Checks in Spark Pipelines
    1. Schema checks
    2. Null checks
    3. Duplicate checks
    4. Range checks
    5. Stop pipeline on quality failure vs warn only
    6. Hands-on: Lab 14: Add validation stage before publishing curated output
    7. Hands-on: Lab 15: Generate a data quality report output
  • Module 10: Debugging Broken Data Pipelines
    1. Schema mismatch
    2. Bad data types
    3. Corrupt records
    4. Missing columns
    5. Join explosion issues
    6. Hands-on: Lab 16: Fix the broken pipeline exercise (wrong schema, invalid join key, corrupted file handling, missing partition folder)
  • Module 11: Spark Performance Fundamentals (Must-Know)
    1. Why Spark pipelines become slow (shuffles, skewed joins, too many small files)
    2. Partitioning concepts (repartition vs coalesce)
    3. Caching and persistence strategy
    4. Broadcast joins (intro)
    5. Hands-on: Lab 17: Optimize job by reducing shuffle
    6. Hands-on: Lab 18: Fix slow join performance using broadcast join
  • Module 12: File Optimization and Lakehouse Maintenance
    1. Small files problem
    2. Compaction strategies concept
    3. Z-Ordering concept (Databricks)
    4. VACUUM and retention concept
    5. Optimize table operations overview
    6. Hands-on: Lab 19: Run OPTIMIZE on Delta table and compare query performance
    7. Hands-on: Lab 20: Run VACUUM safely and explain retention impact
  • Module 13: Building Reusable ETL Pipelines
    1. Notebook to modular pipeline approach
    2. Using parameters (processing date, source path, target path)
    3. Configuration-driven pipelines
    4. Pipeline logging patterns
    5. Hands-on: Lab 21: Add parameters to ETL notebook (run_date, env)
    6. Hands-on: Lab 22: Generate run metrics (rows processed, runtime, failed records)
  • Module 14: Orchestration Options (Databricks Jobs + Airflow Overview)
    1. Databricks Jobs (scheduling, retries, notifications)
    2. Airflow integration overview (triggering Databricks job run, dependency chaining)
    3. Hands-on: Lab 23: Convert notebook into a Databricks job and schedule it
    4. Hands-on: Lab 24: Simulate failure and validate retries
  • Module 15: Security and Access Control (Intro)
    1. Workspace permissions (concept)
    2. Secure handling of credentials
    3. Secrets management concept (Databricks secrets)
    4. Hands-on: Define access policy for data engineers, analysts, external contractors
  • Module 16: End-to-End Pipeline Design (Batch ELT Architecture)
    1. Architecture: raw → cleansed → curated → marts
    2. Choosing Delta vs Parquet
    3. Partition keys
    4. Incremental vs full
    5. SLA and validation strategy
    6. Hands-on: Workshop: Design the capstone pipeline architecture
  • Module 17: Capstone Implementation (Hands-on Build)
    1. Capstone goal: Build an end-to-end ELT pipeline using Databricks + Delta
    2. Ingest raw datasets (customers, orders, payments)
    3. Build cleansed Delta tables
    4. Build curated tables (customer dimension MERGE, daily revenue fact table)
    5. Implement incremental processing (run-by-date parameter)
    6. Add data quality checks (unique order_id, non-null customer_id, valid amounts)
    7. Produce reporting tables (top customers, revenue by region)
    8. Hands-on: Deliverables: notebooks/scripts, Delta tables, validation report, KPI queries
  • Module 18: Fix the Broken ETL Challenge
    1. Broken scenario: wrong schema
    2. Broken scenario: missing partition folder
    3. Broken scenario: join explosion
    4. Broken scenario: duplicates causing incorrect revenue
    5. Hands-on: Lab 25: Troubleshoot and fix the broken ETL pipeline
  • Module 19: Production Readiness Checklist
    1. Idempotency checklist
    2. Backfill strategy
    3. Late arriving data plan
    4. Observability plan (logs, run metrics, alerts)
    5. Cost optimization checklist
    6. Hands-on: Workshop: Create a production readiness checklist for your team
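
Illustrative Code Sketches (Selected Modules)

The sketches below are minimal illustrations of patterns covered in the modules above, not the official lab solutions. All paths, table names, and column names (orders, customers, /mnt/... locations) are hypothetical.

Module 3: Reading data with an explicit schema. A minimal PySpark sketch of reading a raw CSV with a declared schema rather than inferSchema, which keeps column types stable across runs; the orders columns and path are assumed for illustration.

    # Read a raw orders CSV with an explicit schema (hypothetical path and columns).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

    spark = SparkSession.builder.appName("read-orders").getOrCreate()

    orders_schema = StructType([
        StructField("order_id", StringType(), nullable=False),
        StructField("customer_id", StringType(), nullable=True),
        StructField("order_date", DateType(), nullable=True),
        StructField("amount", DoubleType(), nullable=True),
    ])

    # An explicit schema avoids inferSchema guessing different types between runs.
    orders_df = (
        spark.read
        .option("header", "true")
        .schema(orders_schema)
        .csv("/mnt/raw/orders/")
    )
    orders_df.printSchema()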
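
Module 6: Writing a Delta table and using time travel. A minimal sketch, assuming a Delta-enabled Spark environment (such as Databricks) and a hypothetical cleansed_orders DataFrame and storage path.

    # Write the cleansed dataset as a Delta table (hypothetical path).
    (
        cleansed_orders
        .write
        .format("delta")
        .mode("overwrite")
        .save("/mnt/cleansed/orders")
    )

    # Time travel: read an earlier version of the table for auditing or debugging.
    previous = (
        spark.read
        .format("delta")
        .option("versionAsOf", 0)
        .load("/mnt/cleansed/orders")
    )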
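
Module 7: Incremental load into a date-partitioned Delta table. A sketch of appending a single processing date; run_date and the paths are hypothetical parameters. For idempotent reruns of the same date, Delta's replaceWhere overwrite option is one common alternative to a plain append.

    from pyspark.sql import functions as F

    run_date = "2024-01-15"  # hypothetical; normally supplied as a job parameter

    # Select only the rows for the current processing date.
    incremental_df = (
        spark.read.format("delta").load("/mnt/cleansed/orders")
        .filter(F.col("order_date") == F.lit(run_date))
    )

    # Append the daily slice into a Delta table partitioned by date.
    (
        incremental_df
        .write
        .format("delta")
        .mode("append")
        .partitionBy("order_date")
        .save("/mnt/curated/orders")
    )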
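
Module 8: Upserting into a dimension table with Delta MERGE. A sketch using the Delta Lake Python API; the dimension path, the updates_df source DataFrame, and the join key are hypothetical.

    from delta.tables import DeltaTable

    dim_customers = DeltaTable.forPath(spark, "/mnt/curated/dim_customers")

    # Update existing customers and insert new ones in a single transaction.
    (
        dim_customers.alias("target")
        .merge(
            updates_df.alias("source"),
            "target.customer_id = source.customer_id",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )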
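
Module 9: A validation stage that stops the pipeline on failure. A sketch of simple count-based checks before publishing curated output; the checks and column names are illustrative.

    from pyspark.sql import functions as F

    total = curated_orders.count()
    null_customers = curated_orders.filter(F.col("customer_id").isNull()).count()
    duplicate_orders = total - curated_orders.select("order_id").distinct().count()
    negative_amounts = curated_orders.filter(F.col("amount") < 0).count()

    checks = {
        "null_customer_id": null_customers,
        "duplicate_order_id": duplicate_orders,
        "negative_amount": negative_amounts,
    }
    failed = {name: count for name, count in checks.items() if count > 0}

    # "Stop pipeline on quality failure": raising here fails the notebook or job run.
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")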
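
Module 11: Fixing a slow join with a broadcast hint. A sketch of broadcasting a small dimension so the large fact table is joined without shuffling its data; the DataFrame names are hypothetical.

    from pyspark.sql import functions as F

    # Broadcasting ships the small customers table to every executor, so the large
    # orders table can be joined locally instead of being shuffled across the cluster.
    joined = orders_df.join(
        F.broadcast(customers_df),
        on="customer_id",
        how="left",
    )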
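
Module 12: Compacting small files and cleaning old versions. A sketch of Delta table maintenance on Databricks; the table name and Z-Order column are hypothetical, and the same statements can also be run as SQL cells.

    # Compact small files and co-locate data for a frequently filtered column.
    spark.sql("OPTIMIZE curated.orders ZORDER BY (customer_id)")

    # VACUUM removes files no longer referenced by the table, subject to the retention
    # window (default 7 days); shortening retention reduces time-travel history.
    spark.sql("VACUUM curated.orders RETAIN 168 HOURS")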
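
Module 13: A parameter-driven Databricks notebook. A sketch using notebook widgets for run_date and env, as in Lab 21; dbutils is only available inside Databricks notebooks, and the default values and paths are hypothetical.

    # Declare notebook parameters with default values.
    dbutils.widgets.text("run_date", "2024-01-15")
    dbutils.widgets.text("env", "dev")

    run_date = dbutils.widgets.get("run_date")
    env = dbutils.widgets.get("env")

    # Configuration-driven source and target locations.
    source_path = f"/mnt/{env}/raw/orders/"
    target_path = f"/mnt/{env}/curated/orders/"

    print(f"Processing {run_date} from {source_path} into {target_path}")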

Hands-on Experience with Tools

  • Apache Spark (PySpark)
  • Databricks (notebooks, clusters, Jobs)
  • Delta Lake
  • SQL and Python
  • Apache Kafka (streaming fundamentals)
  • Apache Airflow (orchestration overview)
  • Cloud data warehouses

Training Delivery Format

Flexible, comprehensive training designed to fit your schedule and learning preferences
  • Opt-in Certifications: AWS, Scrum.org, DASA & more
  • 100% Live: on-site/online training
  • Hands-on: labs and capstone projects
  • Lifetime Access: to training material and sessions

How Does Personalised Training Work?

  1. Skill-Gap Assessment: Analysing the skill gap and assessing business requirements to craft a unique program
  2. Personalisation: Customising the curriculum and projects to prepare your team for challenges within your industry
  3. Implementation: Supplementing training with consulting support to ensure implementation in real projects

Why Data Engineering for your business?

  • Faster decision-making: Enable trusted dashboards and analytics by standardizing pipelines and data models.
  • Operational efficiency: Reduce manual reporting effort with automated batch and streaming pipelines.
  • Scalable foundations: Build a platform that supports growth, high data volumes, and advanced analytics.
  • AI readiness: Prepare clean, validated datasets required for successful AI/ML initiatives.
  • Governance and compliance: Improve audit readiness with repeatable pipelines, validation checks, and lineage practices.

Lead the Digital Landscape with Cutting-Edge Tech and In-House "Techsperts"

Discover the power of digital transformation with train-to-deliver programs from Uptut's experts. Backed by 50,000+ professionals across the world's leading tech innovators.

Frequently Asked Questions

1. What are the pre-requisites for this training?

The training does not require you to have prior skills or experience. The curriculum covers basics and progresses towards advanced topics.

2. Will my team get any practical experience with this training?

With our focus on experiential learning, we have made the training as hands-on as possible, with assignments, quizzes, capstone projects, and live labs where trainees learn by doing.

3. What is your mode of delivery - online or on-site?

We conduct both online and on-site training sessions. You can choose whichever suits your team best.

4. Will trainees get certified?

Yes, all trainees will get certificates issued by Uptut under the guidance of industry experts.

5. What do we do if we need further support after the training?

Our mentors are available for consultations if your team needs further assistance. They can guide your team, resolve queries, and help apply the training to real projects. Just book a consultation to get support.
