Observability and Monitoring with Prometheus and Grafana

Online

On-site

Hybrid

Build a strong foundation in monitoring and alerting using Prometheus and Grafana, from metrics fundamentals to production troubleshooting. Learn how to build Kubernetes monitoring dashboards, design meaningful alerts, and respond to real incident patterns with operational confidence.

Duration: 3 days

Level: Intermediate

1500+ users onboarded

Who will Benefit from this Training?

DevOps Engineers
Site Reliability Engineers (SREs)
Cloud Engineers
Platform Engineers
Kubernetes Administrators
Backend Engineers owning production services
Tech Leads responsible for uptime and production stability

Training Objectives

Explain observability vs monitoring and the pillars of observability (metrics, logs, traces).
Understand Prometheus architecture including server, exporters, service discovery, and TSDB basics.
Collect metrics from Linux servers, applications, containers, and Kubernetes clusters.
Use PromQL to query, aggregate, and troubleshoot common infrastructure and application issues.
Build actionable Grafana dashboards for infrastructure, applications, and Kubernetes workloads.
Configure alerting using Prometheus alert rules and Alertmanager routing, grouping, and silencing.
Apply monitoring strategy patterns including Golden Signals, RED metrics, and SLI/SLO basics.
Deploy Prometheus and Grafana using best practices including Helm and kube-prometheus-stack.
Use ServiceMonitor/PodMonitor to onboard new targets in a Kubernetes-native way.
Create recording rules to standardize metrics and improve query performance.
Implement blackbox synthetic checks and alert on availability/latency failures.
Tune alerts to reduce noise and add runbook-ready context for faster incident response.
Perform real-world troubleshooting using dashboards, metrics, and events through incident simulations.

Build a high-performing, job-ready tech team.

Personalise your team’s upskilling roadmap and design a tailored, hands-on training program with Uptut.

Key training modules

Comprehensive, hands-on modules designed to take you from basics to advanced concepts

Module 1: Observability Fundamentals (Monitoring vs Observability)
1. Monitoring vs observability (what’s the difference in real operations)
2. Three pillars of observability (metrics, logs, traces)
3. Why “unknown unknowns” require observability, not only monitoring
4. Where Prometheus and Grafana fit in the observability stack
5. Hands-on: Activity: Map metrics/logs/traces to real incidents (CPU spike, latency, DB slowdown)
Module 2: Prometheus Architecture Deep Dive
1. Prometheus server components (scraper, storage, query engine)
2. Exporters overview (node exporter, app exporters, custom metrics)
3. Service discovery concepts (static vs dynamic targets)
4. TSDB basics (time series model, labels, retention)
5. Hands-on: Lab: Install Prometheus and explore targets, metrics endpoint, and TSDB health
Module 3: Collecting Metrics from Linux, Apps, Containers, and Kubernetes
1. Linux server metrics using node_exporter (CPU, memory, disk, network)
2. Application metrics patterns (HTTP requests, latency, errors)
3. Container metrics overview (cAdvisor concept, container resource visibility)
4. Kubernetes cluster metrics overview (kube-state-metrics, API server metrics concept)
5. Hands-on: Lab: Onboard Linux + app targets and validate scrape health and label consistency
Module 4: PromQL for Querying and Troubleshooting
1. PromQL basics (instant vectors, range vectors, label filters)
2. Aggregation and grouping (sum, avg, max, by, without)
3. Rates and counters (rate(), irate(), increase())
4. Troubleshooting patterns (high CPU, memory pressure, saturation, error spikes)
5. Hands-on: Lab: Write PromQL queries to detect CPU saturation, memory leak symptoms, and high error rates
Module 5: Building Actionable Grafana Dashboards
1. Dashboard design principles (actionable, minimal, outcome-driven)
2. Infrastructure dashboards (CPU, memory, disk, network)
3. Application dashboards (latency, throughput, errors)
4. Kubernetes dashboards (pods, nodes, deployments, restarts)
5. Hands-on: Lab: Build 3 dashboards (infra + app + k8s) with templating and drilldowns
Module 6: Alerting with Prometheus Rules and Alertmanager
1. Alert rules structure (expr, for, labels, annotations)
2. Alertmanager routing (receivers, routes, matchers)
3. Grouping and inhibition to reduce noise
4. Silencing and maintenance windows
5. Hands-on: Lab: Create alert rules and route them using Alertmanager with grouping + silences
Module 7: Monitoring Strategy Patterns (Golden Signals, RED, SLI/SLO)
1. Golden Signals (latency, traffic, errors, saturation)
2. RED metrics for services (Rate, Errors, Duration)
3. SLI/SLO basics and why they reduce alert fatigue
4. Choosing what to alert on (symptoms vs causes)
5. Hands-on: Activity: Convert noisy alerts into SLO-aligned symptom alerts
Module 8: Deploying Prometheus and Grafana with Helm (kube-prometheus-stack)
1. Why kube-prometheus-stack is the standard on Kubernetes
2. Helm installation workflow and values management
3. Prometheus Operator basics (CRDs and controllers concept)
4. Best practices (retention, resources, persistence, HA concepts)
5. Hands-on: Lab: Deploy kube-prometheus-stack via Helm and validate Prometheus + Grafana access
Module 9: Kubernetes-Native Target Onboarding with ServiceMonitor and PodMonitor
1. ServiceMonitor vs PodMonitor (when to use each)
2. Label matching and selector strategy
3. Scrape config best practices (intervals, timeouts)
4. Common onboarding issues (missing labels, wrong ports, auth problems)
5. Hands-on: Lab: Create ServiceMonitor/PodMonitor for a sample app and validate discovery in Prometheus
Module 10: Recording Rules for Performance and Standardization
1. Why recording rules matter (faster dashboards, reusable metrics)
2. Standardizing metric names and labels
3. Precomputing rates and aggregations
4. Rule evaluation intervals and performance trade-offs
5. Hands-on: Lab: Create recording rules for RED metrics and use them in Grafana panels
Module 11: Blackbox Synthetic Monitoring (Availability and Latency)
1. Blackbox exporter basics (HTTP, TCP, ICMP probes)
2. Synthetic checks for external endpoints and APIs
3. Alerting on availability and latency SLO breaches
4. Designing reliable checks (timeouts, retries, regions concept)
5. Hands-on: Lab: Configure blackbox probes and alert on downtime/latency failures
Module 12: Alert Tuning and Runbook-Ready Incident Response
1. Reducing noise (deduplication, grouping, severity levels)
2. Alert annotations that help responders (summary, impact, next actions)
3. Runbook links and troubleshooting steps in alerts
4. Escalation paths and ownership for alerts
5. Hands-on: Lab: Improve 5 noisy alerts by adding context, runbook steps, and correct thresholds
Module 13: Incident Simulations (Troubleshooting with Metrics + Dashboards + Events)
1. Incident workflow (detect, triage, isolate, mitigate, confirm)
2. Using Grafana drilldowns to locate bottlenecks
3. Using PromQL to confirm hypotheses (saturation, latency, errors)
4. Kubernetes events and restarts correlation with metrics
5. Hands-on: Capstone Lab: Run incident simulations (CPU spike, memory leak, pod crash, latency regression) and resolve using dashboards and metrics
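
As an illustration of the "Rates and counters" topic in Module 4, the sketch below shows the idea behind turning raw counter samples into a per-second rate while tolerating counter resets. It is a simplified teaching model, not Prometheus's actual `rate()` implementation (which also extrapolates to the edges of the query window). The names `simple_increase` and `simple_rate` and the sample data are hypothetical.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value); an assumed, simplified sample shape
Sample = Tuple[float, float]


def simple_increase(samples: List[Sample]) -> float:
    """Total increase of a counter across samples, tolerating resets.

    A drop in value is treated as a counter reset (for example, a process
    restart), so the value seen after the reset counts as new increase.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return simple_increase(samples) / elapsed


if __name__ == "__main__":
    # Hypothetical request counter scraped every 15 s; it resets before t=45.
    scraped = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(simple_increase(scraped))
    print(simple_rate(scraped))
```

The reset handling is the reason a counter that restarts from zero does not produce a negative rate; in the labs, the equivalent behaviour comes from PromQL's `rate()` and `increase()` rather than hand-written code.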

Training Delivery Format

Flexible, comprehensive training designed to fit your schedule and learning preferences

Opt-in Certifications: AWS, Scrum.org, DASA & more
100% Live: on-site/online training
Hands-on: Labs and capstone projects
Lifetime Access: to training material and sessions

How Does Personalised Training Work?

1. Skill-Gap Assessment: Analysing skill gaps and assessing business requirements to craft a unique program
2. Personalisation: Customising curriculum and projects to prepare your team for challenges within your industry
3. Implementation: Supplementing training with consulting support to ensure implementation in real projects

Why Observability and Monitoring for your business?

Reduce downtime and incident duration: Faster detection and diagnosis lowers outage impact.
Prevent performance degradation: Proactive monitoring catches issues before users complain.
Improve release confidence: Monitoring validates stability after deployments and changes.
Avoid infrastructure overspending: Identify underutilized resources and tune scaling decisions.
Increase reliability and trust: Stable systems improve retention and reduce escalations.

Lead the Digital Landscape with Cutting-Edge Tech and In-House "Techsperts"

Discover the power of digital transformation with train-to-deliver programs from Uptut's experts. Backed by 50,000+ professionals across the world's leading tech innovators.

Frequently Asked Questions

1. What are the prerequisites for this training?

The training does not require you to have prior skills or experience. The curriculum covers basics and progresses towards advanced topics.

2. Will my team get any practical experience with this training?

Yes. The training focuses on experiential learning and is as hands-on as possible, with assignments, quizzes, capstone projects, and live labs where trainees learn by doing.

3. What is your mode of delivery - online or on-site?

We conduct both online and on-site training sessions. Choose whichever is more convenient for your team.

4. Will trainees get certified?

Yes, all trainees will get certificates issued by Uptut under the guidance of industry experts.

5. What do we do if we need further support after the training?

Our experienced mentors are available for consultations if your team needs further assistance. They can guide your team and resolve queries so that the training is put to the best possible use. Just book a consultation to get support.
