Online · On-site · Hybrid

ETL/ELT with Apache Spark + Databricks Bootcamp

Build a strong foundation in designing and building scalable ETL/ELT pipelines with Apache Spark and Databricks, from core transformations to Delta Lake lakehouse patterns. Learn how to implement incremental processing, optimize performance, enforce data quality, operationalize pipelines, and troubleshoot real production failures with confidence.

Duration: 4 days
Rating: 4.8/5.0
Level: Intermediate
1,500+ users onboarded

Who Will Benefit from this Training?

  • Data Engineers
  • Cloud Data Engineers
  • Analytics Engineers
  • Data Platform Engineers
  • Backend engineers moving into big data engineering
  • BI engineers working with transformation pipelines at scale

Training Objectives

  • Understand Spark architecture and why it is a standard for large-scale ETL.
  • Build ETL/ELT pipelines using Databricks notebooks and Spark DataFrames.
  • Read and write data in CSV, JSON, Parquet, and Delta Lake formats.
  • Apply production transformation patterns including cleansing, deduplication, joins, aggregations, and incremental logic.
  • Implement scalable patterns using partitioning, file sizing, and performance tuning fundamentals.
  • Build reliable pipelines using structured logging, validation checks, and error handling strategies.
  • Understand Delta Lake and lakehouse patterns including ACID tables, time travel, and merge/upsert.
  • Orchestrate pipelines using Databricks Jobs and understand Airflow integration approaches.
  • Deliver an end-to-end ETL/ELT pipeline capstone project with production readiness practices.

Build a high-performing, job-ready tech team.

Personalise your team’s upskilling roadmap and design a tailored, hands-on training program with Uptut.

Key training modules

Comprehensive, hands-on modules designed to take you from basics to advanced concepts. Short, illustrative PySpark sketches for several of these modules follow the curriculum list below.
  • Module 1: Spark Architecture and Execution Model (Why Spark for Large-Scale ETL)
    1. Why Spark is used (distributed processing, scale-out workloads, fault tolerance)
    2. Spark components (driver, executors, cluster manager concepts)
    3. Lazy evaluation and the DAG execution model
    4. Transformations vs actions and what triggers compute
    5. Hands-on Lab: Inspect a Spark job plan and identify stages/shuffles for a sample workload
  • Module 2: Databricks Fundamentals (Workspace, Notebooks, Clusters)
    1. Databricks workspace components and notebook workflow
    2. Cluster types (interactive vs job clusters) and sizing basics
    3. Databricks filesystem concepts and data access patterns
    4. Development workflow (notebook, repos, job runs concept)
    5. Hands-on Lab: Create a notebook, start a cluster, and validate Spark session and basic DataFrame operations
  • Module 3: Reading and Writing Data (CSV, JSON, Parquet, Delta)
    1. Reading CSV (schema inference vs explicit schema) and common pitfalls
    2. Reading nested JSON and flattening selected fields
    3. Parquet fundamentals (columnar format, schema, performance benefits)
    4. Delta Lake basics (tables, transactions, metadata)
    5. Hands-on Lab: Read CSV + JSON, transform to Parquet, and write curated output as a Delta table
  • Module 4: Production Transformations with Spark DataFrames
    1. Cleansing patterns (null handling, type casting, standardizing dates)
    2. Deduplication patterns (keys, window functions concepts)
    3. Joins and join pitfalls (row explosion, skew awareness)
    4. Aggregations for analytics datasets (daily revenue, KPIs)
    5. Hands-on Lab: Build a cleansing + dedup + join + aggregate pipeline for an orders dataset
  • Module 5: Incremental Processing Logic (ELT at Scale)
    1. Full refresh vs incremental processing trade-offs
    2. Partition-based incremental loads (by date/region/source)
    3. Watermark strategy concepts (updated_at, ingestion timestamp)
    4. Handling late-arriving data concept and reprocessing windows
    5. Hands-on Lab: Implement an incremental pipeline and validate correct row counts across multiple runs
  • Module 6: Delta Lake and Lakehouse Patterns (ACID, Time Travel, Merge)
    1. Why Delta (ACID transactions, consistent writes, schema enforcement)
    2. Time travel and table history for auditability
    3. Merge/upsert patterns for CDC-like workflows
    4. Building curated dimensions and facts using MERGE
    5. Hands-on Lab: Write Delta tables, query versions, and implement MERGE into a dimension table
  • Module 7: Scalability and Performance Fundamentals (Partitioning, File Sizing, Tuning)
    1. Shuffles and why they slow down Spark
    2. Partitioning controls (repartition vs coalesce) and when to use each
    3. Small files problem and file sizing strategy
    4. Intro tuning patterns (caching, broadcast joins concept)
    5. Hands-on Lab: Optimize a slow job by reducing shuffle and improving join strategy
  • Module 8: Reliability and Quality in Data Pipelines (Logs, Validation, Errors)
    1. Structured logging patterns (run id, dataset, rows processed)
    2. Validation checks (schema, nulls, duplicates, ranges)
    3. Fail-fast vs warn-only strategies and quarantine patterns
    4. Debugging broken pipelines (schema mismatch, corrupt records, join explosion)
    5. Hands-on Lab: Add validation stages, generate a quality report, and fix a broken pipeline scenario
  • Module 9: Orchestration with Databricks Jobs (Airflow Integration Overview)
    1. Databricks Jobs fundamentals (scheduling, parameters, retries, notifications)
    2. Job cluster vs interactive cluster for production
    3. Parameterizing pipelines (run_date, env, paths)
    4. Airflow integration approaches (trigger job runs, dependency chaining concepts)
    5. Hands-on Lab: Convert a notebook pipeline into a scheduled Databricks Job with retries and parameters
  • Module 10: Capstone Project (End-to-End ETL/ELT Pipeline with Production Readiness)
    1. Capstone goal: Deliver an end-to-end lakehouse pipeline (raw → cleansed → curated)
    2. Ingest and transform using Spark DataFrames (CSV/JSON → Parquet/Delta)
    3. Implement incremental processing and Delta MERGE for curated tables
    4. Add validation checks, structured logging, and error handling
    5. Operationalize using Databricks Jobs and provide runbook notes
    6. Capstone Lab: Deliver working notebooks/jobs, Delta tables with history, and KPI query outputs
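
The short PySpark sketches below illustrate several of the patterns named in the modules above. They assume a Databricks (Spark 3.x with Delta Lake) environment; all paths, table names, column names, and DataFrame variables are hypothetical placeholders, not the actual lab material. First, a minimal look at lazy evaluation and the execution plan from Module 1: transformations only build the plan, and explain() shows the physical plan (including Exchange nodes, i.e. shuffles) before any action runs.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()   # provided automatically in Databricks notebooks

    orders = spark.read.table("samples.tpch.orders")   # assumed sample table

    # Transformations are lazy: this line only describes the computation
    agg = orders.groupBy("o_custkey").agg(F.sum("o_totalprice").alias("total_spend"))

    agg.explain()   # prints the physical plan; the Exchange node is the shuffle caused by groupBy
    agg.count()     # an action finally triggers the distributed job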
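
For Module 3, a minimal sketch of reading CSV with an explicit schema and writing the curated output as a Delta table. The /mnt/... paths and the order/event column names are assumptions for illustration.

    from pyspark.sql import SparkSession, types as T

    spark = SparkSession.builder.getOrCreate()

    # An explicit schema avoids a second pass over the files and silent type surprises
    orders_schema = T.StructType([
        T.StructField("order_id", T.StringType(), False),
        T.StructField("customer_id", T.StringType(), True),
        T.StructField("amount", T.DoubleType(), True),
        T.StructField("order_ts", T.TimestampType(), True),
    ])

    orders = (
        spark.read
        .schema(orders_schema)
        .option("header", "true")
        .csv("/mnt/raw/orders/")                  # assumed path
    )

    events = spark.read.json("/mnt/raw/events/")  # nested JSON; assumed path

    # Write the curated output as a Delta table (Delta is the default table format on Databricks)
    orders.write.format("delta").mode("overwrite").save("/mnt/curated/orders")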
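
For Module 4, a sketch of the cleanse, deduplicate, join, and aggregate flow. It assumes the `orders` DataFrame from the previous sketch plus a `customers` DataFrame with `customer_id` and `region`; the column names are illustrative.

    from pyspark.sql import functions as F, Window

    # Cleanse: cast types, standardize the date, drop rows missing the business key
    clean = (
        orders
        .withColumn("amount", F.col("amount").cast("double"))
        .withColumn("order_date", F.to_date("order_ts"))
        .dropna(subset=["order_id"])
    )

    # Deduplicate: keep the latest record per order_id using a window function
    w = Window.partitionBy("order_id").orderBy(F.col("order_ts").desc())
    deduped = clean.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

    # Join to a customer dimension and aggregate a daily revenue KPI
    daily_revenue = (
        deduped
        .join(customers, "customer_id", "left")
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("revenue"),
             F.countDistinct("order_id").alias("order_count"))
    )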
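
For Module 5, one common incremental pattern is a high-water mark on an `updated_at` column. This sketch assumes Delta source and target paths and falls back to a full load on the first run; the paths and column name are placeholders.

    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    source_path = "/mnt/raw/orders"        # assumed paths
    target_path = "/mnt/curated/orders"

    # High-water mark: read the latest updated_at already in the target (None on the first run)
    if DeltaTable.isDeltaTable(spark, target_path):
        last_loaded = (spark.read.format("delta").load(target_path)
                       .agg(F.max("updated_at")).collect()[0][0])
    else:
        last_loaded = None

    source = spark.read.format("delta").load(source_path)
    increment = source if last_loaded is None else source.filter(F.col("updated_at") > F.lit(last_loaded))

    # Append only the new/changed rows; re-running picks up where the last run stopped
    increment.write.format("delta").mode("append").save(target_path)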
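
For Module 6, a sketch of a Delta MERGE upsert into a dimension table using the delta-spark API; `updates` is an assumed DataFrame of change records keyed by `customer_id`, and the table path is illustrative.

    from delta.tables import DeltaTable

    dim = DeltaTable.forPath(spark, "/mnt/curated/dim_customer")   # assumed path

    # Upsert change records: update rows whose key already exists, insert the rest
    (dim.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())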
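
For Module 7, two of the simplest levers: broadcasting a small dimension so the large side of the join is not shuffled, and coalescing before the write so the output is a few large files instead of thousands of tiny ones. `facts` and `small_dim` are assumed DataFrames and the partition count is illustrative.

    from pyspark.sql import functions as F

    # Broadcasting the small dimension avoids shuffling the large fact table for the join
    joined = facts.join(F.broadcast(small_dim), "customer_id", "left")

    # coalesce() lowers the partition count without a shuffle, mitigating the small-files problem on write
    joined.coalesce(32).write.format("delta").mode("overwrite").save("/mnt/curated/fact_orders")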
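
For Module 8, a fail-fast validation helper in the spirit of the module: count rows, null keys, and duplicates, emit one structured log line per run, and raise if a check fails. The checked columns, thresholds, and log format are assumptions, not a prescribed standard.

    from pyspark.sql import functions as F

    def validate(df, run_id):
        """Fail fast on basic quality checks; columns and thresholds here are illustrative."""
        total = df.count()
        null_keys = df.filter(F.col("order_id").isNull()).count()
        dupes = total - df.dropDuplicates(["order_id"]).count()

        # Structured log line: one record per run with the key counters
        print(f"run_id={run_id} rows={total} null_keys={null_keys} duplicates={dupes}")

        if null_keys > 0 or dupes > 0:
            raise ValueError(f"Validation failed: {null_keys} null keys, {dupes} duplicate keys")
        return df

    validated = validate(deduped, run_id="2024-05-01-orders")   # deduped from the Module 4 sketch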
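
For Module 9, pipelines are typically parameterized so the same notebook runs interactively and as a scheduled Job; in Databricks notebooks, Job parameters arrive as widgets. The parameter names and path layout here are illustrative assumptions.

    # dbutils is available in the Databricks notebook context; defaults keep interactive runs working
    dbutils.widgets.text("run_date", "")
    dbutils.widgets.text("env", "dev")

    run_date = dbutils.widgets.get("run_date")
    env = dbutils.widgets.get("env")

    # Assumed path layout parameterized by environment and run date
    input_path = f"/mnt/{env}/raw/orders/ingest_date={run_date}"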


Training Delivery Format

Flexible, comprehensive training designed to fit your schedule and learning preferences.

  • Opt-in Certifications: AWS, Scrum.org, DASA & more
  • 100% Live: on-site/online training
  • Hands-on: labs and capstone projects
  • Lifetime Access: to training material and sessions

How Does Personalised Training Work?

1. Skill-Gap Assessment: Analysing skill gaps and assessing business requirements to craft a unique program

2. Personalisation: Customising curriculum and projects to prepare your team for challenges within your industry

3. Implementation: Supplementing training with consulting support to ensure implementation in real projects

Why ETL/ELT with Spark + Databricks for your business?

  • Faster big data processing: Spark distributes transformations efficiently across massive datasets.
  • Unified lakehouse approach: Combine data engineering, analytics, and ML on one platform with Databricks.
  • Improved pipeline reliability: Delta Lake patterns support ACID transactions and safer incremental loads.
  • Better performance optimization: Use partitioning, caching, and tuning to reduce runtime and cost.
  • Future-proof data platform: Enable advanced analytics and AI workloads with scalable architectures.

Lead the Digital Landscape with Cutting-Edge Tech and In-House "Techsperts"

Discover the power of digital transformation with train-to-deliver programs from Uptut's experts. Backed by 50,000+ professionals across the world's leading tech innovators.

Frequently Asked Questions

1. What are the prerequisites for this training?

The training does not require you to have prior skills or experience. The curriculum covers basics and progresses towards advanced topics.

2. Will my team get any practical experience with this training?

With our focus on experiential learning, we have made the training as hands-on as possible, with assignments, quizzes, capstone projects, and labs where trainees learn by doing tasks live.

3. What is your mode of delivery - online or on-site?

We conduct both online and on-site training sessions. You can choose any according to the convenience of your team.

4. Will trainees get certified?

Yes, all trainees will get certificates issued by Uptut under the guidance of industry experts.

5. What do we do if we need further support after the training?

Our experienced mentors are available for consultations if your team needs further assistance; they can guide your team and resolve queries so the training is applied in the best possible way. Just book a consultation to get support.
