Online
On-site
Hybrid

AWS Glue Deep Dive

Build a strong foundation in AWS Glue for production-grade ETL, from catalogs and crawlers to optimized Spark transformations. Learn how to build incremental pipelines, handle schema evolution, tune performance and cost, validate output quality, and troubleshoot real Glue job failures at scale.

Duration: 2 days
Rating: 4.8/5.0
Level: Intermediate
1500+ users onboarded

Who will Benefit from this Training?

  • Data Engineers
  • Cloud Data Engineers (AWS)
  • Analytics Engineers working with AWS data lake pipelines
  • Data Platform Engineers
  • DevOps/Cloud Engineers supporting Glue environments

Training Objectives

  • Understand AWS Glue architecture and core components in detail.
  • Build production-ready Glue ETL pipelines for batch data processing.
  • Work confidently with Glue Data Catalog including databases, tables, crawlers, and schema updates.
  • Use AWS Glue jobs effectively including Spark-based ETL, Glue Studio workflows (intro), and Python Shell jobs (overview).
  • Transform data using Glue DynamicFrames and Spark DataFrames.
  • Implement incremental processing patterns using partition-based loads and watermark strategy concepts.
  • Handle schema drift and evolving datasets safely.
  • Optimize Glue jobs for performance and cost using worker sizing, DPUs, partitioning, and pushdown predicates.
  • Implement data quality and validation workflows inside Glue pipelines.
  • Troubleshoot and debug Glue jobs using CloudWatch logs and practical techniques.

Build a high-performing, job-ready tech team.

Personalise your team’s upskilling roadmap and design a hands-on training program that fits your needs with Uptut.

Key training modules

Comprehensive, hands-on modules designed to take you from basics to advanced concepts
  • Module 1: AWS Glue Architecture and Core Components
    1. What AWS Glue is and where it fits in AWS data platforms
    2. Core components (Glue Data Catalog, Crawlers, Jobs, Triggers, Workflows)
    3. Execution model overview (Spark on AWS-managed infrastructure)
    4. IAM roles and permissions required for Glue pipelines
    5. Hands-on Lab: Explore the Glue console, identify core components, and map a Glue ETL flow end-to-end
  • Module 2: Glue Data Catalog Deep Dive (Databases, Tables, Crawlers)
    1. Catalog structure (databases, tables, partitions)
    2. Crawlers configuration (data stores, classifiers, schedules)
    3. Schema inference behavior and common pitfalls
    4. Schema updates, versioning behavior, and safe update patterns
    5. Hands-on Lab: Create a database and crawler, generate tables, and validate partition discovery
  • Module 3: Building Production-Ready Glue ETL Pipelines (Batch Processing)
    1. ETL pipeline structure (raw → cleansed → curated)
    2. Job design patterns (modular transforms, reusable functions)
    3. Input and output formats (CSV, JSON, Parquet) and S3 layout strategy
    4. Job parameterization (paths, dates, environments)
    5. Hands-on Lab: Build a batch Glue job to read raw data from S3, transform it, and write curated Parquet
  • Module 4: Glue Jobs Overview (Spark ETL, Glue Studio Intro, Python Shell Overview)
    1. Spark-based Glue jobs (scripts, job bookmark concepts, job parameters)
    2. Glue Studio visual jobs overview (when to use, limitations)
    3. Triggers and scheduling patterns
    4. Python Shell jobs overview (lightweight orchestration and utilities)
    5. Hands-on Lab: Create a Glue Spark job and run it with parameters; explore the Glue Studio workflow UI
  • Module 5: Transformations with DynamicFrames and Spark DataFrames
    1. DynamicFrames vs DataFrames (when to use each)
    2. Core DynamicFrame transforms (ApplyMapping, ResolveChoice, DropNullFields)
    3. Switching between DynamicFrame and DataFrame safely
    4. Common transformation patterns (join, dedup, aggregate)
    5. Hands-on Lab: Implement cleansing transformations using DynamicFrames, then convert to a DataFrame for advanced logic
  • Module 6: Incremental Processing Patterns (Partitions and Watermarks)
    1. Full load vs incremental processing trade-offs
    2. Partition-based incremental loads (by date, by region, by source)
    3. Watermark strategy concepts (updated_at, ingestion timestamp)
    4. Handling late-arriving data (concepts)
    5. Hands-on Lab: Implement partition-based incremental loads and validate row counts across multiple runs
  • Module 7: Handling Schema Drift and Evolving Datasets
    1. Why schema drift happens (new columns, type changes, nested structures)
    2. Crawler schema change behaviors and safe update settings
    3. ResolveChoice strategies and compatible type evolution patterns
    4. Backward/forward compatibility and data contract concepts
    5. Hands-on Lab: Simulate schema changes and update Glue job logic to handle drift safely
  • Module 8: Performance and Cost Optimization for Glue Jobs
    1. Understanding DPUs, workers, and job sizing
    2. Partitioning and file sizing to reduce shuffle and runtime
    3. Pushdown predicates and reading only necessary data
    4. Job tuning checklist (caching, broadcast join concepts, avoiding small files)
    5. Hands-on Lab: Tune a slow job by adjusting workers, partitions, and pushdown predicates, then compare runtimes
  • Module 9: Data Quality and Validation in Glue Pipelines
    1. Data validation patterns (schema checks, null checks, duplicates, range checks)
    2. Fail-fast vs warn-only strategy
    3. Generating validation outputs (bad records, quality report)
    4. Publishing curated output only after validation passes
    5. Hands-on Lab: Add a validation stage to a Glue job and generate a data quality report
  • Module 10: Troubleshooting and Debugging Glue Jobs (CloudWatch + Practical Techniques)
    1. Reading CloudWatch logs and common error categories
    2. Debug techniques (sample runs, printing schemas, limiting data)
    3. Common failures (permissions, missing partitions, schema mismatch, memory issues)
    4. Operational best practices (retries, alarm concepts, runbooks)
    5. Hands-on Lab: Fix a broken Glue pipeline scenario using CloudWatch logs and step-by-step debugging

Training Delivery Format

Flexible, comprehensive training designed to fit your schedule and learning preferences
  • Opt-in Certifications: AWS, Scrum.org, DASA & more
  • 100% Live: on-site/online training
  • Hands-on: Labs and capstone projects
  • Lifetime Access: to training material and sessions

How Does Personalised Training Work?

1. Skill-Gap Assessment

Analysing skill gaps and assessing business requirements to craft a unique program

2. Personalisation

Customising curriculum and projects to prepare your team for challenges within your industry

3. Implementation

Supplementing training with consulting support to ensure implementation in real projects

Why AWS Glue Deep Dive for your business?

  • Faster ETL delivery: Build and run serverless data pipelines with minimal infrastructure work.
  • Reduced operational burden: Managed Spark reduces cluster setup, patching, and scaling concerns.
  • Improved metadata management: Glue Data Catalog enables consistent discovery and governance.
  • Cost-efficient processing: Pay for job execution time instead of always-on clusters.
  • Better integration: Seamlessly connect S3, Redshift, RDS, Athena, and modern lakehouse workflows.

Lead the Digital Landscape with Cutting-Edge Tech and In-House "Techsperts"

Discover the power of digital transformation with train-to-deliver programs from Uptut's experts. Backed by 50,000+ professionals across the world's leading tech innovators.

Frequently Asked Questions

1. What are the pre-requisites for this training?

The training does not require you to have prior skills or experience. The curriculum covers basics and progresses towards advanced topics.

2. Will my team get any practical experience with this training?

Yes. The training focuses on experiential learning and is as hands-on as possible, with assignments, quizzes, capstone projects, and a lab where trainees learn by doing tasks live.

3. What is your mode of delivery - online or on-site?

We conduct both online and on-site training sessions. You can choose whichever is more convenient for your team.

4. Will trainees get certified?

Yes, all trainees will get certificates issued by Uptut under the guidance of industry experts.

5. What do we do if we need further support after the training?

Our experienced mentors are available for consultations if your team needs further assistance. They can guide your team and resolve queries so that the training is put to the best possible use. Just book a consultation to get support.
