Job Summary
We are seeking a highly skilled Senior Observability Engineer to lead the evolution of our enterprise observability platform. In this role, you will transition our systems to an open, standards-based observability architecture built on OpenTelemetry (OTel). You will design and scale OTel solutions across polyglot microservices (Java, Python, and others) and Google Cloud Platform (GCP) resources, routing telemetry to Dynatrace, Prometheus, and Jaeger. A major focus of the role is improving the developer experience through onboarding tools, documentation, migration paths, and automated Quality of Service (QoS) reporting.
The role also acts as Subject Matter Expert (Support & Ops), ensuring the timely resolution of escalations and incidents in line with quality norms and service level agreements (SLAs). The position is pivotal in enhancing customer satisfaction through effective analysis, communication, and process improvement.
Key Responsibilities
1. Ensure timely resolution and quality compliance of escalated incidents by using Dynatrace for performance monitoring and PowerShell for automation, in line with agreed SLAs.
2. Conduct value-added activities, including mentoring team members and preparing standard operating procedures using Python for process automation, while maintaining effective documentation and promoting knowledge sharing.
3. Validate change order implementation plans and ensure human-error compliance by leveraging Dynatrace insights and PowerShell scripts, while actively participating in capacity planning initiatives.
4. Facilitate positive customer feedback and satisfaction by participating in customer meetings and using effective communication skills to understand and address any issues faced.
5. Validate analyses such as root cause analysis and trend analysis using Python, and present performance reports to key business stakeholders to support data-driven decision-making.
Expand Observability Coverage: Design and implement standardized OpenTelemetry (OTel) SDK and API configurations for polyglot environments, focusing on Python and Java. Architect and deploy OTel-based collection mechanisms for GCP-native resources, including serverless components (Cloud Functions) and asynchronous messaging (Pub/Sub). Ensure all OTel-ingested metrics, traces, and logs are seamlessly mapped, enriched, and visualized within Dynatrace using native OTLP ingestion.
Enhance Interoperability: Design, deploy, and maintain high-availability OpenTelemetry Collector pipelines to receive, process, batch, and export telemetry data. Configure OTel pipelines to dynamically route telemetry to multiple backends (e.g., exporting traces to Dynatrace and Jaeger, and metrics to Prometheus). Establish unified semantic conventions, tagging schemas, and context propagation standards across all services.
Improve User Experience & Developer Enablement: Build a frictionless onboarding experience for software engineering teams by creating reusable OTel templates and bootstrap libraries. Create automated migration tools and scripts to help teams transition from legacy/proprietary agents to OpenTelemetry. Design and build automated pipelines to generate customizable Quality of Service (QoS) and Service Level Objective (SLO) reports using Dynatrace and OTel APIs.
Skill Requirements
1. In-depth knowledge of Dynatrace for performance monitoring and incident management.
2. Proficiency in PowerShell scripting for process automation and incident resolution.
3. Strong analytical skills, with the ability to conduct root cause analysis and trend analysis using Python.
4. Excellent communication skills for effective stakeholder engagement and presentation.
OpenTelemetry Ecosystem: Deep expertise in the OTel Collector architecture, OTel SDKs/APIs, OTLP protocol, auto-instrumentation agents, and semantic conventions.
Observability Backends: Advanced administration of Dynatrace (including Smartscape, Davis AI, and OTLP ingestion), as well as open-source tools like Prometheus and Jaeger.
Programming Languages: High proficiency in writing, debugging, and instrumenting code in Python and Java.
Cloud Infrastructure: Hands-on experience monitoring Google Cloud Platform (GCP) resources, specifically GKE, Cloud Functions, and Pub/Sub.
Infrastructure as Code (IaC) & CI/CD: Experience writing Terraform to deploy observability infrastructure and integrating instrumentation into CI/CD pipelines (e.g., GitHub Actions, GitLab CI, or Jenkins).
Other Requirements
1. Optional but valuable: ITIL Foundation certification.
2. Optional but valuable: Dynatrace certification.
Professional Experience: 5+ years of experience in Site Reliability Engineering (SRE), DevOps, or Platform Engineering with a heavy focus on observability.
Education: Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related technical field (or equivalent practical experience).
Certifications (Highly Preferred): Dynatrace Certified Professional or Associate; GCP Professional Cloud DevOps Engineer or Cloud Architect; Certified Kubernetes Administrator (CKA).
Industry Contributions (Plus): Active contributions to the OpenTelemetry open-source project or related CNCF communities.