Job Summary
The Site Reliability Engineer (SRE) ensures the availability, performance, and reliability of cloud‑native SaaS platforms running on Microsoft Azure. The role focuses on observability, incident management, and automation, with Dynatrace as the primary monitoring and AIOps platform.
Key Responsibilities
Operate and support production workloads on Azure (AKS, Azure PaaS services)
Implement and manage end‑to‑end observability using Dynatrace (metrics, logs, traces, baselines, alerts)
Define and track SLIs, SLOs, and error budgets to drive reliability decisions
Lead or participate in incident response, root cause analysis, and blameless postmortems
Automate operational tasks using Infrastructure as Code (Terraform, ARM/Bicep, Helm)
Collaborate with DevOps and engineering teams to improve deployment safety, resilience, and performance
Support on‑call rotations and ensure continuous service improvement
Skill Requirements
Strong experience with Microsoft Azure, especially AKS and cloud networking
Hands‑on expertise with Dynatrace (services, PurePaths, dashboards, alerting)
Experience operating Kubernetes‑based production systems
Solid understanding of SRE principles (reliability, toil reduction, automation)
Experience with incident management and troubleshooting distributed systems